Search Aggregations#
Aggregations let you summarize and group the items returned by a search, so that instead of (or in addition to) a list of results you get counts, statistics, and breakdowns over the matching data set.
They are the mechanism behind most Squirro dashboard widgets: the facet list on the side of a search page, the bar chart of top countries, the timeline of publications per week, the “relevant communities” panel, and so on.
This page describes how to request aggregations from the Squirro search API and how to read the response.
When to Use Them#
Aggregations answer questions like:
How many items match this query, broken down by source?
How did the volume of items evolve over the last 12 months?
Which countries are most represented, and within each country, which news providers?
What is the average sentiment score across the result set?
Which terms are unusually frequent in this subset compared to the whole project? (via significant_terms)
If you only need the list of items themselves, you do not need aggregations. If you need numbers about the matching items, you do.
Request Syntax#
Aggregations are passed as the aggregations parameter of the search API.
With the Python SDK (SquirroClient), this is the aggregations keyword argument of query():
client.query(
    project_id,
    query="market risk",
    count=0,  # only aggregations, not items
    aggregations={
        "by_country": {
            "fields": "country",
            "method": "terms",
            "size": 10,
        },
    },
)
The value is an object where each key is a label you choose and each value describes one aggregation:
{
    "<label>": {
        "fields": "<field>",        # or ["<field>", ...]
        "method": "<method>",
        "size": <number>,
        "interval": "<interval>",
        "aggregation": {...},       # optional nested aggregation
    },
}
The label appears in the response and lets you run several independent
aggregations in one call. If you omit fields, the label itself is used as
the field name. For example, {"language": {}} is shorthand for aggregating
on the language field.
Field Names#
fields uses Squirro field names, not raw Elasticsearch field names:
language, source, provider, starred, and read are common built-in fields.
$item_created_at, $item_id, and similar special fields are prefixed with $.
Any other value is mapped to the matching keyword facet (for example, country, author, or a custom keyword you ingest with your items).
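The three naming rules above can be sketched as a small client-side check. This is only an illustration: classify_field and BUILT_IN_FIELDS are hypothetical names, and the built-in set holds only the fields named on this page, so the real list may be longer.

```python
# Built-in field names mentioned on this page (assumption: not exhaustive).
BUILT_IN_FIELDS = {"language", "source", "provider", "starred", "read"}


def classify_field(name: str) -> str:
    """Return which kind of Squirro field name `name` looks like."""
    if name.startswith("$"):
        return "special"
    if name in BUILT_IN_FIELDS:
        return "built-in"
    # Everything else is treated as a keyword facet.
    return "keyword facet"


kinds = {f: classify_field(f) for f in ["$item_created_at", "source", "country"]}
print(kinds)
```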
Methods#
The method parameter selects the kind of aggregation. If you omit it,
terms is used.
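To make the two defaults concrete (label used as the field name, terms used as the method), here is a minimal sketch that expands a shorthand request into its explicit form. The server applies these defaults itself; normalize_aggregations is a hypothetical helper, useful mainly for logging or debugging what a request means.

```python
def normalize_aggregations(aggregations: dict) -> dict:
    """Fill in the documented defaults for each labeled aggregation."""
    normalized = {}
    for label, spec in aggregations.items():
        full = dict(spec)  # copy so the caller's dict is left untouched
        full.setdefault("fields", label)    # omitted fields: the label is the field
        full.setdefault("method", "terms")  # omitted method: terms
        normalized[label] = full
    return normalized


expanded = normalize_aggregations({
    "language": {},
    "top_countries": {"fields": "country", "size": 20},
})
print(expanded)
```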
Bucket Methods#
These group items into buckets and count how many items fall into each.
terms: Group by distinct values of a field. This is the most common case, used for facets and bar charts. Accepts size (default 10).
significant_terms: Like terms, but returns values that are statistically over-represented in the result set compared to a background query. Useful for discovering characteristic terms. Accepts background_query, min_doc_count (default 3), and size.
histogram: Bucket numeric or date fields into equal-width intervals. For date fields this produces a time series (the server automatically switches to date_histogram). Requires interval, except on date fields, where an omitted interval is derived from the created_after and created_before query parameters. For dates, the allowed units are second, minute, hour, day, week, month, quarter, and year (for example, "1w" or "month").
top_hits: Instead of counts, return the top size items per bucket. Most useful as a sub-aggregation (“show me the top 3 items per source”).
Metric Methods#
These compute a single number over the matching items (or per bucket, when nested inside a bucket aggregation).
avg, sum, min, max: Standard numeric aggregates over fields.
stats: Returns count, min, max, avg, and sum in one call.
extended_stats: Like stats, plus variance and standard deviation.
percentiles: Percentile distribution of the field values.
cardinality: Approximate count of distinct values, for example “how many distinct authors are in this result set?”.
value_count: Total number of values observed for the field.
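When several metrics are needed for the same field, each one gets its own label in the request. The sketch below builds such a request; metric_aggregations is a hypothetical helper, the "<field>_<method>" label scheme is just one possible convention, and sentiment_score is assumed to be a numeric field in your project.

```python
# Metric methods listed in this section.
METRIC_METHODS = {
    "avg", "sum", "min", "max", "stats", "extended_stats",
    "percentiles", "cardinality", "value_count",
}


def metric_aggregations(field: str, methods: list) -> dict:
    """Build one labeled aggregation per requested metric method."""
    aggs = {}
    for method in methods:
        if method not in METRIC_METHODS:
            raise ValueError(f"not a metric method: {method!r}")
        aggs[f"{field}_{method}"] = {"fields": field, "method": method}
    return aggs


aggregations = metric_aggregations("sentiment_score", ["avg", "stats"])
print(aggregations)
```

The resulting dictionary can be passed as the aggregations argument of query().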
Squirro-Specific Methods#
relevant_communities: Returns the communities that best match the current query, based on the community types configured for the project. Accepts community_types and exclude_communities.
top_k_aggregation: Paginated terms aggregation used internally by the communities widget. Accepts communities and pagination.
Nesting#
Any bucket aggregation can contain a sub-aggregation under the aggregation
key. This is how you build two-dimensional breakdowns such as
“per country, per source” or “per week, number of distinct authors”:
aggregations = {
    "timeline": {
        "fields": "$item_created_at",
        "method": "histogram",
        "interval": "1w",
        # Sub-aggregation: number of distinct countries in each weekly bucket
        "aggregation": {
            "fields": "country",
            "method": "cardinality",
        },
    },
}
Sub-aggregations can themselves contain further sub-aggregations, but each level adds cost, so keep nesting shallow.
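One way to keep nesting shallow is to measure it before sending the request. The following is a minimal sketch with a hypothetical helper, nesting_depth, which counts levels by following the aggregation key; any depth limit you enforce with it is your own choice and not something the API prescribes.

```python
def nesting_depth(spec: dict) -> int:
    """Count aggregation levels: 1 for a flat spec, +1 per nested level."""
    depth = 1
    while isinstance(spec.get("aggregation"), dict):
        spec = spec["aggregation"]
        depth += 1
    return depth


timeline = {
    "fields": "$item_created_at",
    "method": "histogram",
    "interval": "1w",
    "aggregation": {"fields": "country", "method": "cardinality"},
}
depth = nesting_depth(timeline)
print(depth)
```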
Response Format#
The response contains an aggregations object mirroring your request:
{
    "aggregations": {
        "by_country": {
            "country": {
                "values": [
                    {"key": "United States of America", "value": 390},
                    {"key": "Australia", "value": 144},
                    {"key": "Canada", "value": 89}
                ],
                "display_name": "Country",
                "sampled_docs": 1357
            }
        }
    }
}
For each labeled aggregation there is an inner object keyed by field name,
containing a values list. Each entry has:
key: The bucket value (for example, a country name or an ISO timestamp for date histograms).
value: The count for bucket methods, or the numeric result for metric methods.
values: Present when the aggregation has a sub-aggregation. Contains the nested buckets for this parent bucket.
total_ratio: For terms, the share of matching items in this bucket.
Date histograms additionally return interval_seconds so clients know the
bucket width without parsing the interval string.
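As an illustration of how a client might use interval_seconds, the sketch below turns a bucket key into a start and end time. It rests on two assumptions: the key is an ISO timestamp that datetime.fromisoformat can parse, and interval_seconds is passed in explicitly because its exact position in the response is not shown here. bucket_range is a hypothetical helper.

```python
from datetime import datetime, timedelta


def bucket_range(key: str, interval_seconds: int):
    """Return the (start, end) datetimes covered by one histogram bucket."""
    start = datetime.fromisoformat(key)  # assumes an ISO 8601 bucket key
    return start, start + timedelta(seconds=interval_seconds)


# One week is 7 * 24 * 3600 = 604800 seconds.
start, end = bucket_range("2024-01-01T00:00:00", 604800)
print(start.isoformat(), end.isoformat())
```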
If you only need the aggregation output and not the items themselves, pass
count=0 in the query. The response still contains total (the number
of matching items) and the requested aggregations.
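Reading the response shape described above can be sketched as follows. read_buckets is a hypothetical helper, not part of the SDK; it assumes a flat aggregation (no sub-aggregation) and uses the sample response from this page as input.

```python
def read_buckets(response: dict, label: str) -> dict:
    """Map bucket key -> value for one labeled, non-nested aggregation."""
    by_field = response["aggregations"][label]  # inner object keyed by field name
    buckets = {}
    for field_result in by_field.values():
        for entry in field_result["values"]:
            buckets[entry["key"]] = entry["value"]
    return buckets


response = {
    "aggregations": {
        "by_country": {
            "country": {
                "values": [
                    {"key": "United States of America", "value": 390},
                    {"key": "Australia", "value": 144},
                    {"key": "Canada", "value": 89},
                ],
                "display_name": "Country",
                "sampled_docs": 1357,
            }
        }
    }
}
counts = read_buckets(response, "by_country")
print(counts)
```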
Examples#
Facet List (Top 10 Sources)#
client.query(project_id, count=0, aggregations={
    "sources": {"fields": "source", "size": 10},
})
Multiple Independent Facets in One Call#
client.query(project_id, count=0, aggregations={
    "language": {},
    "provider": {},
    "top_countries": {"fields": "country", "size": 20},
})
Timeline of Items per Week#
client.query(
    project_id,
    query="climate",
    count=0,
    created_after="2024-01-01",
    created_before="2024-12-31",
    aggregations={
        "timeline": {
            "fields": "$item_created_at",
            "method": "histogram",
            "interval": "1w",
        },
    },
)
Nested Breakdown: Country × Source#
aggregations = {
    "country_by_source": {
        "fields": "country",
        "size": 10,
        "aggregation": {
            "fields": "source",
            "method": "terms",
            "size": 5,
        },
    },
}
Metric over the Result Set#
aggregations = {
    "avg_sentiment": {
        "fields": "sentiment_score",
        "method": "avg",
    },
}
Significant Terms#
aggregations = {
    "characteristic_terms": {
        "fields": "keywords",
        "method": "significant_terms",
        "size": 10,
    },
}
Performance Notes#
Aggregations run on the search shards, and cost grows with cardinality and nesting depth. Prefer a realistic size limit on terms aggregations rather than requesting very large top-N lists.
For dashboards that only display aggregated numbers, set count=0 so the search does not also materialize the item list.
Squirro applies a random-sampler optimization on aggregations for very large result sets. The sample size used is reported as sampled_docs in the response.