Search Aggregations

Aggregations let you summarize and group the items returned by a search, so that instead of (or in addition to) a list of results you get counts, statistics, and breakdowns over the matching data set.

They are the mechanism behind most Squirro dashboard widgets: the facet list on the side of a search page, the bar chart of top countries, the timeline of publications per week, the “relevant communities” panel, and so on.

This page describes how to request aggregations from the Squirro search API and how to read the response.

When to Use Them

Aggregations answer questions like:

  • How many items match this query, broken down by source?

  • How did the volume of items evolve over the last 12 months?

  • Which countries are most represented, and within each country, which news providers?

  • What is the average sentiment score across the result set?

  • Which terms are unusually frequent in this subset compared to the whole project? (via significant_terms)

If you only need the list of items themselves, you do not need aggregations. If you need numbers about the matching items, you do.

Request Syntax

Aggregations are passed as the aggregations parameter of the search API. Using the Python SDK (SquirroClient), that is the aggregations keyword argument of query():

client.query(
    project_id,
    query="market risk",
    count=0,  # only aggregations, not items
    aggregations={
        "by_country": {
            "fields": "country",
            "method": "terms",
            "size": 10,
        },
    },
)

The value is an object where each key is a label you choose and each value describes one aggregation:

{
    "<label>": {
        "fields": "<field>",  # or ["<field>", ...]
        "method": "<method>",
        "size": <number>,
        "interval": "<interval>",
        "aggregation": {...},  # optional nested aggregation
    },
}

The label appears in the response and lets you run several independent aggregations in one call. If you omit fields, the label itself is used as the field name. For example, {"language": {}} is shorthand for aggregating on the language field.

Field Names

fields uses Squirro field names, not raw Elasticsearch field names:

  • language, source, provider, starred, read are common built-in fields.

  • $item_created_at, $item_id, and similar special fields are prefixed with $.

  • Any other value is mapped to the matching keyword facet (for example, country, author, or a custom keyword you ingest with your items).

Methods

The method parameter selects the kind of aggregation. If you omit it, terms is used.
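
The two defaults described so far (the label standing in for fields, and terms standing in for method) can be made explicit on the client side. The following is a minimal sketch, not the server's implementation; normalize_aggregations is a hypothetical helper name introduced only for illustration.

```python
def normalize_aggregations(aggregations):
    """Return a copy of the request with the documented defaults filled in.

    Hypothetical client-side helper: it only mirrors the two rules stated
    in the text and does not reproduce any server-side logic.
    """
    normalized = {}
    for label, spec in aggregations.items():
        spec = dict(spec)  # shallow copy, the caller's dict stays untouched
        spec.setdefault("fields", label)    # label doubles as the field name
        spec.setdefault("method", "terms")  # terms is used when method is omitted
        normalized[label] = spec
    return normalized


explicit = normalize_aggregations({
    "language": {},
    "top_countries": {"fields": "country", "size": 20},
})
print(explicit["language"])  # {'fields': 'language', 'method': 'terms'}
```

Spelling the defaults out this way can make logged or stored requests easier to read, at the cost of slightly larger payloads.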

Bucket Methods

These group items into buckets and count how many items fall into each.

terms

Group by distinct values of a field. This is the most common case, used for facets and bar charts. Accepts size (default 10).

significant_terms

Like terms, but returns values that are statistically over-represented in the result set compared to a background query. Useful for discovering characteristic terms. Accepts background_query, min_doc_count (default 3), and size.

histogram

Bucket numeric or date fields into equal-width intervals. For date fields this produces a time series (the server automatically switches to date_histogram). Requires interval for numeric fields. For dates, the allowed units are second, minute, hour, day, week, month, quarter, and year (for example, "1w" or "month"). On a date field interval is optional: if it is omitted, Squirro derives an interval from the created_after and created_before query parameters.
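
To make "equal-width intervals" concrete, the sketch below buckets a list of numbers locally by rounding each value down to a multiple of the interval. It illustrates the general technique only; histogram_buckets is a hypothetical name, and details such as whether the server also returns empty buckets are not covered here.

```python
import math
from collections import Counter


def histogram_buckets(values, interval):
    """Count values per equal-width bucket, keyed by the bucket's lower bound."""
    if interval <= 0:
        raise ValueError("interval must be positive")
    # Each value lands in the bucket starting at floor(value / interval) * interval.
    counts = Counter(math.floor(v / interval) * interval for v in values)
    return sorted(counts.items())


print(histogram_buckets([1, 4, 5, 9, 10, 14, 27], interval=5))
# [(0, 2), (5, 2), (10, 2), (25, 1)]
```

Buckets without any values (15 and 20 in this run) simply do not appear in the sketch's output.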

top_hits

Instead of counts, return the top items of each bucket, up to size items per bucket. Most useful as a sub-aggregation (“show me the top 3 items per source”).
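
The “top 3 items per source” case could be requested roughly as follows. This is a sketch built only from the keys described on this page; the label top_items_per_source is an arbitrary choice, and whether top_hits also expects a fields entry is not stated here, so the sketch leaves it out.

```python
# Outer terms aggregation groups items by source; the nested top_hits
# aggregation asks for up to 3 items inside each of those buckets.
aggregations = {
    "top_items_per_source": {
        "fields": "source",
        "method": "terms",
        "size": 10,
        "aggregation": {
            "method": "top_hits",  # assumption: no fields entry is needed here
            "size": 3,
        },
    },
}
```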

Metric Methods

These compute a single number over the matching items (or per bucket, when nested inside a bucket aggregation).

avg, sum, min, max

The average, sum, minimum, or maximum of a numeric field.

stats

Returns count, min, max, avg, and sum in one call.

extended_stats

Like stats, plus variance and standard deviation.
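
As a rough local equivalent of what these two methods summarize, the sketch below computes the same quantities over a plain list of numbers. The result key names (variance, std_deviation) and the use of population rather than sample variance are assumptions made for illustration; check the actual response for the names and definitions the server uses.

```python
import statistics


def stats(values):
    """Count, min, max, avg, and sum of a non-empty sequence of numbers."""
    values = list(values)
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "avg": sum(values) / len(values),
        "sum": sum(values),
    }


def extended_stats(values):
    """stats() plus variance and standard deviation (population form, by assumption)."""
    values = list(values)
    result = stats(values)
    result["variance"] = statistics.pvariance(values)
    result["std_deviation"] = statistics.pstdev(values)
    return result


summary = extended_stats([2, 4, 4, 4, 5, 5, 7, 9])
print(summary["avg"], summary["variance"], summary["std_deviation"])
```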

percentiles

Percentile distribution of the field values.

cardinality

Approximate count of distinct values, for example “how many distinct authors are in this result set?”.

value_count

Total number of values observed for the field.

Squirro-Specific Methods

relevant_communities

Returns the communities that best match the current query, based on the community types configured for the project. Accepts community_types and exclude_communities.

top_k_aggregation

Paginated terms aggregation used internally by the communities widget. Accepts communities and pagination.

Nesting

Any bucket aggregation can contain a sub-aggregation under the aggregation key. This is how you build two-dimensional breakdowns such as “per country, per source” or “per week, number of distinct countries”:

aggregations = {
    "timeline": {
        "fields": "$item_created_at",
        "method": "histogram",
        "interval": "1w",
        "aggregation": {
            "fields": "country",
            "method": "cardinality",
        },
    },
}

Sub-aggregations can themselves contain further sub-aggregations, but each level adds cost, so keep nesting shallow.
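
When requests are assembled programmatically, a small helper can chain the levels and report the resulting depth, which makes it easy to enforce a shallow limit. This is a sketch; nest and depth are hypothetical helper names, and the three-level request below is only an illustration of the aggregation key, not a recommended query.

```python
def nest(*levels):
    """Chain specs so that each one becomes the sub-aggregation of the previous one."""
    if not levels:
        raise ValueError("at least one aggregation level is required")
    nested = None
    for spec in reversed(levels):  # build from the innermost level outwards
        spec = dict(spec)          # copy, so the caller's dicts stay untouched
        if nested is not None:
            spec["aggregation"] = nested
        nested = spec
    return nested


def depth(spec):
    """Number of levels in a spec, following the aggregation key."""
    levels = 1
    while "aggregation" in spec:
        spec = spec["aggregation"]
        levels += 1
    return levels


timeline = nest(
    {"fields": "$item_created_at", "method": "histogram", "interval": "1w"},
    {"fields": "country", "method": "terms", "size": 5},
    {"fields": "source", "method": "cardinality"},
)
aggregations = {"timeline": timeline}
print(depth(timeline))  # 3
```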

Response Format

The response contains an aggregations object mirroring your request:

{
    "aggregations": {
        "by_country": {
            "country": {
                "values": [
                    {"key": "United States of America", "value": 390},
                    {"key": "Australia", "value": 144},
                    {"key": "Canada", "value": 89}
                ],
                "display_name": "Country",
                "sampled_docs": 1357
            }
        }
    }
}

For each labeled aggregation there is an inner object keyed by field name, containing a values list. Each entry has:

key

The bucket value (for example, a country name or an ISO timestamp for date histograms).

value

The count for bucket methods, or the numeric result for metric methods.

values

Present when the aggregation has a sub-aggregation. Contains the nested buckets for this parent bucket.

total_ratio

For terms, the share of matching items in this bucket.

Date histograms additionally return interval_seconds so clients know the bucket width without parsing the interval string.

If you only need the aggregation output and not the items themselves, pass count=0 in the query. The response still contains total (the number of matching items) and the requested aggregations.
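
Reading the response usually comes down to walking label, then field name, then values. The following sketch does that for a non-nested aggregation, using the sample response from this section as input; buckets is a hypothetical helper name, and nested values entries are deliberately not handled.

```python
def buckets(response, label):
    """Return {field: [(key, value), ...]} for one labeled aggregation."""
    per_field = response.get("aggregations", {}).get(label, {})
    return {
        field: [(entry["key"], entry["value"]) for entry in body.get("values", [])]
        for field, body in per_field.items()
    }


# Sample response copied from the Response Format section.
response = {
    "aggregations": {
        "by_country": {
            "country": {
                "values": [
                    {"key": "United States of America", "value": 390},
                    {"key": "Australia", "value": 144},
                    {"key": "Canada", "value": 89},
                ],
                "display_name": "Country",
                "sampled_docs": 1357,
            }
        }
    }
}

for key, value in buckets(response, "by_country")["country"]:
    print(f"{key}: {value}")
```

An unknown label yields an empty dict rather than an exception, which tends to be convenient for dashboards that render whatever came back.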

Examples

Facet List (Top 10 Sources)

client.query(project_id, count=0, aggregations={
    "sources": {"fields": "source", "size": 10},
})

Multiple Independent Facets in One Call

client.query(project_id, count=0, aggregations={
    "language": {},
    "provider": {},
    "top_countries": {"fields": "country", "size": 20},
})

Timeline of Items per Week

client.query(
    project_id,
    query="climate",
    count=0,
    created_after="2024-01-01",
    created_before="2024-12-31",
    aggregations={
        "timeline": {
            "fields": "$item_created_at",
            "method": "histogram",
            "interval": "1w",
        },
    },
)

Nested Breakdown: Country × Source

aggregations = {
    "country_by_source": {
        "fields": "country",
        "size": 10,
        "aggregation": {
            "fields": "source",
            "method": "terms",
            "size": 5,
        },
    },
}

Metric over the Result Set

aggregations = {
    "avg_sentiment": {
        "fields": "sentiment_score",
        "method": "avg",
    },
}

Significant Terms

aggregations = {
    "characteristic_terms": {
        "fields": "keywords",
        "method": "significant_terms",
        "size": 10,
    },
}

Performance Notes

  • Aggregations run on the search shards, and cost grows with cardinality and nesting depth. Prefer a realistic size limit on terms aggregations rather than requesting very large top-N lists.

  • For dashboards that only display aggregated numbers, set count=0 so the search does not also materialize the item list.

  • Squirro applies a random-sampler optimization on aggregations for very large result sets. The sample size used is reported as sampled_docs in the response.
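
A client that wants to show how many documents each aggregation was computed over can collect the sampled_docs values from the response. The sketch below assumes sampled_docs sits next to values inside each field object, as in the sample response on this page; whether the key is present when no sampling took place is not stated here, so entries without it are skipped. sampled_docs_report is a hypothetical helper name.

```python
def sampled_docs_report(response):
    """Map (label, field) to the reported sampled_docs value, where present."""
    report = {}
    for label, per_field in response.get("aggregations", {}).items():
        for field, body in per_field.items():
            if "sampled_docs" in body:
                report[(label, field)] = body["sampled_docs"]
    return report


# Trimmed-down response in the shape shown under Response Format.
response = {
    "aggregations": {
        "by_country": {
            "country": {
                "values": [{"key": "Australia", "value": 144}],
                "sampled_docs": 1357,
            }
        },
        "by_language": {
            "language": {"values": [{"key": "en", "value": 500}]},
        },
    }
}

print(sampled_docs_report(response))  # {('by_country', 'country'): 1357}
```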