All about aggregations

@talevy
Tal Levy, Software Engineer

- http://localhost:9200
{ }
{ "tagline": "You Know, for Search" }

More than search
• Originally built on Lucene for text-based searching
• Lucene and Elasticsearch work together to provide new storage formats and data types specific to numeric and keyword metrics
• Aggregations run alongside searching

Searching & Aggregating
price   color   make     sold
10000   red     honda    10/28/2016
20000   red     honda    11/05/2016
30000   green   ford     05/08/2016
15000   blue    toyota   07/02/2016
12000   green   toyota   08/19/2016
20000   red     honda    11/05/2016
80000   red     bmw      01/01/2016
25000   blue    ford     02/12/2016

Data Structures For Field Values on Shards
color
red
red
green
blue
green
red
red
blue
• Two considerations for our data:
  • Fast querying by values
  • Fast aggregating by values

Inverted Index: terms-to-documents
color doc1 doc2 doc3
red ◉ ◉ ◉
blue ◉ ◉ ◉
green ◉ ◉ ◉
purple ◉ ◉ ◉
orange ◉ ◉ ◉
white ◉ ◉ ◉
black ◉ ◉ ◉
brown ◉ ◉ ◉
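
As a rough illustration of the terms-to-documents idea, the sketch below builds a toy inverted index over the `color` column from the example table. The zero-based document IDs and the `build_inverted_index` helper are assumptions made for this example; Lucene's real on-disk structures are far more compact.

```python
from collections import defaultdict

# Values of the `color` field, one per document, taken from the example table.
colors = ["red", "red", "green", "blue", "green", "red", "red", "blue"]

def build_inverted_index(values):
    """Map each term to the ascending list of document IDs that contain it."""
    index = defaultdict(list)
    for doc_id, term in enumerate(values):
        index[term].append(doc_id)
    return dict(index)

inverted = build_inverted_index(colors)

# Querying by value becomes a single lookup keyed by the term.
red_docs = inverted["red"]
print(red_docs)
```

The lookup answers "which documents contain red?" quickly, but going the other way (which term does document 3 hold?) would require scanning every term, which is the gap doc values fill.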

Doc Values: documents-to-terms
1 value per document, 1 column per field
price   color   make     sold
10000   red     honda    10/28/2016
20000   red     honda    11/05/2016
30000   green   ford     05/08/2016
15000   blue    toyota   07/02/2016
12000   green   toyota   08/19/2016
20000   red     honda    11/05/2016
80000   red     bmw      01/01/2016
25000   blue    ford     02/12/2016
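
A minimal sketch of the documents-to-terms layout, assuming plain Python lists stand in for the per-field columns. The `doc_values` dictionary and the `average` helper are hypothetical names; the data is the example table, and the matching document IDs are what a query for `color: red` would return on it.

```python
# Column-oriented "doc values": one list per field, indexed by document ID.
doc_values = {
    "price": [10000, 20000, 30000, 15000, 12000, 20000, 80000, 25000],
    "color": ["red", "red", "green", "blue", "green", "red", "red", "blue"],
    "make": ["honda", "honda", "ford", "toyota", "toyota", "honda", "bmw", "ford"],
}

def average(field, matching_doc_ids):
    """Average one column over the documents that matched the query."""
    column = doc_values[field]
    values = [column[doc_id] for doc_id in matching_doc_ids]
    return sum(values) / len(values) if values else None

# Document IDs (zero-based) whose color is red in the example table.
red_docs = [0, 1, 5, 6]
avg_red_price = average("price", red_docs)
print(avg_red_price)
```

Only the `price` column is read, and only at the positions of matching documents, which is why aggregating by value is cheap with this layout.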

How Do Distributed Aggregations Work?
(Diagram: data nodes and a coordinating node)
• Run inline with the search query
• Executed in isolation on each shard
• 4 phases:
  • Parse
  • Collect
  • Combine
  • Reduce

Phase 1: Parse
• Coordinating node splits the request into shard requests
• Shards parse aggregations and initialize data structures

Phases 2 and 3: Collect, Combine
• Shards process all matching documents
• Once done, they combine the aggregated data into an aggregation

Phase 4: Reduce
• Shards send their aggregations to the coordinating node
• The coordinating node reduces them into a single aggregation
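
The collect, combine, and reduce phases can be sketched in a few lines. This is a simplified model, not Elasticsearch's implementation: the two-shard split of the `color` column and the function names `collect_and_combine` and `reduce_partials` are assumptions for illustration, and the parse phase is omitted.

```python
from collections import Counter

# Hypothetical split of the example `color` column across two shards.
shards = [
    ["red", "red", "green", "blue"],
    ["green", "red", "red", "blue"],
]

def collect_and_combine(shard_docs):
    """Phases 2 and 3: one in-memory pass over the shard's matching documents."""
    counts = Counter()
    for color in shard_docs:   # collect: visit each matching document once
        counts[color] += 1
    return dict(counts)        # combine: a single partial aggregation per shard

def reduce_partials(partials):
    """Phase 4: the coordinating node merges the per-shard aggregations."""
    total = Counter()
    for partial in partials:
        total.update(partial)  # adds the counts term by term
    return dict(total)

partials = [collect_and_combine(shard) for shard in shards]
result = reduce_partials(partials)
print(result)
```

Each shard works without knowing about the others, so the only cross-node traffic is the small partial result sent to the coordinating node.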

Designed for speed
Single network round-trip
Single pass through data on shards
Aggregates are computed in memory
Trades accuracy for speed
Only pay for documents that match query
Can be composed (for example, average response time broken down by day)
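
To make the composition point concrete, here is a sketch of a request body that nests an `avg` metric inside a `date_histogram` bucket. The field names `timestamp` and `response_time`, the aggregation names, and the `avg_per_day_body` helper are all hypothetical. The `interval` parameter matches Elasticsearch releases of this deck's era; newer releases name it `calendar_interval`.

```python
import json

def avg_per_day_body(date_field, value_field):
    """Build a search body: one bucket per day, one average per bucket."""
    return {
        "size": 0,  # no hits needed, only the aggregation result
        "aggs": {
            "per_day": {
                # Bucket aggregation: groups documents by calendar day.
                "date_histogram": {"field": date_field, "interval": "day"},
                # Metric sub-aggregation: computed separately inside each bucket.
                "aggs": {
                    "avg_response_time": {"avg": {"field": value_field}}
                },
            }
        },
    }

# Hypothetical field names; substitute the ones from your own mapping.
body = avg_per_day_body("timestamp", "response_time")
print(json.dumps(body, indent=2))
```

The printed JSON can be pasted after a `GET <index>/_search` line in the same style as the example queries in this deck.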

Types of Aggregations
• Bucket
  • Terms
  • (Date) Histograms
  • Filter
  • Range
  • …
• Metric
  • Stats
  • Percentiles
  • Cardinality (unique counts)
  • Top Hits
  • Scripted
  • …

Example Terms Aggregation Query
GET products/_search
{
  "size": 0,
  "query": { "match_all": {} },
  "aggs": {
    "my_product_ids": {
      "terms": {
        "field": "pid",
        "size": 3
      }
    }
  }
}

Example Terms Aggregation Response
{
  "hits": {…},
  "aggregations": {
    "my_product_ids": {
      "doc_count_error_upper_bound": 3302,
      "sum_other_doc_count": 8879020,
      "buckets": [
        { "key": "030758836X", "doc_count": 7440 },
        { "key": "0439023483", "doc_count": 6717 },
        { "key": "0375831002", "doc_count": 4864 }
      ]
    }
  }
}

Things To Consider
{
  "hits": {…},
  "aggregations": {
    "my_product_ids": {
      "doc_count_error_upper_bound": 3302,
      "sum_other_doc_count": 8879020,
      "buckets": [
        { "key": "030758836X", "doc_count": 7440 },
        { "key": "0439023483", "doc_count": 6717 },
        { "key": "0375831002", "doc_count": 4864 }
      ]
    }
  }
}
• "doc_count_error_upper_bound": upper bound on the error in the counts for each term
• "sum_other_doc_count": number of docs not included in the returned buckets
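
A quick way to get a feel for these two fields is to do the arithmetic on the example response. The sketch below copies the numbers from the response above (with `hits` left out); the variable names and the two ratios are illustrative choices, not values Elasticsearch reports.

```python
# The aggregation part of the example response, copied as a Python dict.
response = {
    "aggregations": {
        "my_product_ids": {
            "doc_count_error_upper_bound": 3302,
            "sum_other_doc_count": 8879020,
            "buckets": [
                {"key": "030758836X", "doc_count": 7440},
                {"key": "0439023483", "doc_count": 6717},
                {"key": "0375831002", "doc_count": 4864},
            ],
        }
    }
}

agg = response["aggregations"]["my_product_ids"]
returned = sum(bucket["doc_count"] for bucket in agg["buckets"])
other = agg["sum_other_doc_count"]
smallest = min(bucket["doc_count"] for bucket in agg["buckets"])

# Share of all counted documents that landed in the three returned buckets.
coverage = returned / (returned + other)
# Size of the error bound relative to the smallest returned bucket.
relative_error = agg["doc_count_error_upper_bound"] / smallest

print(returned, returned + other)
print(round(coverage, 4), round(relative_error, 2))
```

Here the three returned buckets cover only a small slice of the counted documents, and the error bound is large compared with the smallest bucket, which suggests treating the reported counts as approximate.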

Locality Bias: Top N(1)
        Node A's Counts   Node B's Counts   Global Counts
RED     5                 2                 7
GREEN   4                 4                 8
BLUE    2                 1                 3
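
The bias in this table can be reproduced directly. The sketch below uses the slide's counts and assumes each node returns only its top 1 term; `top_n` and `merge` are hypothetical helper names, and the merge is a simplified stand-in for the reduce phase.

```python
from collections import Counter

# Per-node term counts from the table above.
node_a = {"RED": 5, "GREEN": 4, "BLUE": 2}
node_b = {"RED": 2, "GREEN": 4, "BLUE": 1}

def top_n(counts, n):
    """Keep only the n largest buckets, as a node does before replying."""
    return dict(Counter(counts).most_common(n))

def merge(partials):
    """Sum the per-node buckets on the coordinating node."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total

# Each node reports only its local top 1: RED from A, GREEN from B.
approximate = merge([top_n(node_a, 1), top_n(node_b, 1)]).most_common(1)
# Ground truth, computed with every count from both nodes.
exact = merge([node_a, node_b]).most_common(1)

print(approximate)  # top term as seen by the coordinating node
print(exact)        # true global top term
```

The coordinating node reports RED with a count of 5, while the true global winner is GREEN with 8: GREEN was never the local leader on node A, so half of its count was dropped before the merge.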

Shard Size: Top 3
• How many buckets to return per shard?
• "shard_size"
(Diagram: each of the four shards on the data nodes returns 15 buckets; the coordinating node returns the top 3)
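
To see why asking each shard for more buckets than the final `size` helps, the sketch below parameterizes a toy terms aggregation by `shard_size`. It uses the RED/GREEN/BLUE counts from the locality-bias slide as two hypothetical shards; `terms_agg` is an illustrative function, not an Elasticsearch API.

```python
from collections import Counter

# Two hypothetical shards, using the counts from the locality-bias slide.
shard_counts = [
    {"RED": 5, "GREEN": 4, "BLUE": 2},
    {"RED": 2, "GREEN": 4, "BLUE": 1},
]

def terms_agg(shards, size, shard_size):
    """Toy terms aggregation: top `shard_size` per shard, top `size` overall."""
    merged = Counter()
    for counts in shards:
        # Each shard only ships its `shard_size` largest buckets.
        merged.update(dict(Counter(counts).most_common(shard_size)))
    return merged.most_common(size)

# Same final size, increasing per-shard budget.
results = {n: terms_agg(shard_counts, size=1, shard_size=n) for n in (1, 2, 3)}
for n, top in results.items():
    print(n, top)
```

With `shard_size` equal to 1 the answer is wrong (RED, 5); from 2 upward the shards ship enough buckets for the merge to find GREEN with its full count of 8. The cost is more buckets held in memory and sent over the network per shard.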

Example Terms Aggregation Query
GET products/_search
{
  "size": 0,
  "query": { "match_all": {} },
  "aggs": {
    "my_product_ids": {
      "terms": {
        "field": "pid",
        "size": 3,
        "shard_size": 999999
      }
    }
  }
}

Summary
Aggregations are powerful & fast
Need to trade accuracy for speed/memory in some cases
Use `shard_size` to help manage accuracy with terms aggregation
Leverage Kibana to help write aggregations!
Profile your aggregations using the Query Profiler

What We Missed
Pipeline Aggregations: Aggregations of Aggregations
Using `requests.cache` to cache complex static aggregations
Matrix Aggregations: covariance and correlation
New aggregation types introduced all the time

What to expect?
Efficient sparse doc-value reading and writing
index-time sorting
Removal of types
Cross-cluster search
Upgrading to 6.0 with rolling restarts!
and so much more!

Resources
• Elastic Discussion Forums: https://discuss.elastic.co/
• Aggregation Documentation: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations.html
• Terms Aggregation Approximation: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-terms-aggregation.html#search-aggregations-bucket-terms-aggregation-approximate-counts
• Similar deck from my colleagues Adrien and Colin: https://www.elastic.co/elasticon/2015/sf/all-about-aggregations
