3. More than search
• Originally built on Lucene for text-based searching
• Lucene and Elasticsearch work together to provide new storage formats and data types specific to numeric and keyword metrics
• Aggregations alongside searching
7. Searching & Aggregating
price  color  make    sold
10000  red    honda   10/28/2016
20000  red    honda   11/05/2016
30000  green  ford    05/08/2016
15000  blue   toyota  07/02/2016
12000  green  toyota  08/19/2016
20000  red    honda   11/05/2016
80000  red    bmw     01/01/2016
25000  blue   ford    02/12/2016
8. Data Structures For Field Values on Shards
color
red
red
green
blue
green
red
red
blue
• Two considerations for our data
  • Fast querying by values
  • Fast aggregating by values
10. Doc Values: documents-to-terms
1 value per document, 1 column per field

price  color  make    sold
10000  red    honda   10/28/2016
20000  red    honda   11/05/2016
30000  green  ford    05/08/2016
15000  blue   toyota  07/02/2016
12000  green  toyota  08/19/2016
20000  red    honda   11/05/2016
80000  red    bmw     01/01/2016
25000  blue   ford    02/12/2016
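A minimal sketch of the two access patterns, using the color column from this deck. It is plain Python, not Lucene's on-disk format: the inverted index (terms-to-documents) stands in for fast querying by value, and the doc-values column (documents-to-terms) stands in for fast aggregating by value. The function names are hypothetical.

```python
from collections import defaultdict

# Values of the "color" field, one per document, in document order.
colors = ["red", "red", "green", "blue", "green", "red", "red", "blue"]

# Doc values: documents-to-terms. A column indexed by document ID.
doc_values = colors

# Inverted index: terms-to-documents. Each term lists the documents holding it.
inverted_index = defaultdict(list)
for doc_id, term in enumerate(colors):
    inverted_index[term].append(doc_id)


def search(term):
    """Querying by value: a single lookup in the inverted index."""
    return inverted_index.get(term, [])


def count_by_value(doc_ids):
    """Aggregating by value: read the column entry of each matching document."""
    counts = defaultdict(int)
    for doc_id in doc_ids:
        counts[doc_values[doc_id]] += 1
    return dict(counts)


green_docs = search("green")
all_counts = count_by_value(range(len(doc_values)))
print(green_docs)
print(all_counts)
```

Searching never scans the column, and aggregating never walks the term lists, which is why both structures are kept side by side.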
11. How Do Distributed Aggregations Work?
[Diagram: data nodes and coordinating node]
• Inline with the search query
• Executed in isolation on each shard
• 4 phases
  • Parse
  • Collect
  • Combine
  • Reduce
12. Phase 1: Parse
[Diagram: data nodes and coordinating node]
• Coordinating node splits the request into shard requests
• Shards parse aggregations and initialize data structures
13. Phase 2,3: Collect, Combine
[Diagram: data nodes and coordinating node]
• Shards process all matching documents
• Once done, they combine aggregated data into an aggregation
14. Phase 4: Reduce
[Diagram: data nodes and coordinating node]
• Shards send their aggregations to the coordinating node
• The coordinating node reduces them into a single aggregation
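The four phases (parse, collect, combine, reduce) can be sketched as a toy simulation. This is an illustration only, not Elasticsearch code: the two-shard split of the car rows, the request shape, and all function names are assumptions made for the example.

```python
from collections import Counter

# Hypothetical layout: the deck's car rows split across two shards.
SHARDS = [
    [
        {"price": 10000, "color": "red", "make": "honda"},
        {"price": 20000, "color": "red", "make": "honda"},
        {"price": 30000, "color": "green", "make": "ford"},
        {"price": 15000, "color": "blue", "make": "toyota"},
    ],
    [
        {"price": 12000, "color": "green", "make": "toyota"},
        {"price": 20000, "color": "red", "make": "honda"},
        {"price": 80000, "color": "red", "make": "bmw"},
        {"price": 25000, "color": "blue", "make": "ford"},
    ],
]


def parse(request, num_shards):
    """Phase 1: the coordinating node creates one request per shard."""
    return [dict(request) for _ in range(num_shards)]


def collect(shard_docs, shard_request):
    """Phase 2: a single pass over the shard's matching documents."""
    counts = Counter()  # per-shard data structure, initialized empty
    for doc in shard_docs:
        if shard_request["matches"](doc):
            counts[doc[shard_request["field"]]] += 1
    return counts


def combine(counts):
    """Phase 3: package the collected data as one shard-level aggregation."""
    return {"buckets": dict(counts)}


def reduce_shards(shard_aggs):
    """Phase 4: the coordinating node merges all shard-level aggregations."""
    total = Counter()
    for agg in shard_aggs:
        total.update(agg["buckets"])
    return {"buckets": dict(total)}


# Count cars by color, restricted to documents priced at 15000 or more.
request = {"field": "color", "matches": lambda doc: doc["price"] >= 15000}
shard_requests = parse(request, len(SHARDS))
shard_aggs = [combine(collect(docs, req)) for docs, req in zip(SHARDS, shard_requests)]
result = reduce_shards(shard_aggs)
print(result)
```

Each shard works only on its own documents, and only the small shard-level results travel to the coordinating node.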
15. Designed for speed
Single network round-trip
Single pass through data on shards
Aggregates are computed in memory
Trades accuracy for speed
Only pay for documents that match query
Can be composed (e.g. average response time, broken down by day)
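As a sketch of composition, the request body below nests an `avg` aggregation inside a `date_histogram`, giving one average per day. The field names `@timestamp` and `response_time`, the aggregation names, and the helper function are hypothetical. The interval parameter is named `calendar_interval` in recent Elasticsearch versions and `interval` in older ones, so it is left configurable here.

```python
import json


def avg_by_day(timestamp_field, value_field, interval_key="calendar_interval"):
    """Build a search body: a per-day histogram with an average per bucket."""
    return {
        "size": 0,  # return aggregation results only, no hits
        "aggs": {
            "per_day": {
                "date_histogram": {"field": timestamp_field, interval_key: "day"},
                # Sub-aggregation: computed separately inside each day bucket.
                "aggs": {"avg_value": {"avg": {"field": value_field}}},
            }
        },
    }


body = avg_by_day("@timestamp", "response_time")
print(json.dumps(body, indent=2))
```

The printed JSON is what would be sent as the body of a search request.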
19. Things To Consider
{
  "hits": {…},
  "aggregations": {
    "my_product_ids": {
      "doc_count_error_upper_bound": 3302,
      "sum_other_doc_count": 8879020,
      "buckets": [
        { "key": "030758836X", "doc_count": 7440 },
        { "key": "0439023483", "doc_count": 6717 },
        { "key": "0375831002", "doc_count": 4864 }
      ]
    }
  }
}
• `doc_count_error_upper_bound`: upper bound on the error in the count for each term
• `sum_other_doc_count`: number of docs not included in the buckets
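A short sketch of how the two extra fields might be read, using the numbers from the response above (the elided `hits` section is left out). It assumes the default ordering by descending count, in which a shard can only omit a term, so a reported count can be too low but never too high.

```python
response = {
    "aggregations": {
        "my_product_ids": {
            "doc_count_error_upper_bound": 3302,
            "sum_other_doc_count": 8879020,
            "buckets": [
                {"key": "030758836X", "doc_count": 7440},
                {"key": "0439023483", "doc_count": 6717},
                {"key": "0375831002", "doc_count": 4864},
            ],
        }
    }
}

agg = response["aggregations"]["my_product_ids"]
error = agg["doc_count_error_upper_bound"]

# Counts in the returned buckets plus the counts left out of them.
returned = sum(bucket["doc_count"] for bucket in agg["buckets"])
counted = returned + agg["sum_other_doc_count"]

# Assumed interpretation: the true count lies in [doc_count, doc_count + error].
ranges = {
    bucket["key"]: (bucket["doc_count"], bucket["doc_count"] + error)
    for bucket in agg["buckets"]
}

print(returned, counted)
for key, (low, high) in ranges.items():
    print(key, low, high)
```

Under that assumption the third bucket could be as high as 8166, above the second bucket's reported 6717, so the order of the returned terms is not guaranteed either.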
20. Locality Bias: Top N(1)
       Node A’s Counts  Node B’s Counts  Global Counts
RED    5                2                7
GREEN  4                4                8
BLUE   2                1                3
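The bias can be reproduced with the counts from this slide. The sketch below is a simplified model, not Elasticsearch's implementation: each node returns only its own top terms, and the coordinating node merges what it receives. The function name is hypothetical; `shard_size` mirrors the setting of the same name.

```python
from collections import Counter

node_a = Counter({"RED": 5, "GREEN": 4, "BLUE": 2})
node_b = Counter({"RED": 2, "GREEN": 4, "BLUE": 1})


def top_terms(shards, size, shard_size):
    """Merge each shard's top `shard_size` terms, then keep the top `size`."""
    merged = Counter()
    for shard in shards:
        merged.update(dict(shard.most_common(shard_size)))
    return merged.most_common(size)


# Top 1 with one term per shard: A sends RED (5), B sends GREEN (4).
biased = top_terms([node_a, node_b], size=1, shard_size=1)

# Asking each shard for two terms lets GREEN's counts from both nodes add up.
better = top_terms([node_a, node_b], size=1, shard_size=2)

# Ground truth from the global counts.
exact = (node_a + node_b).most_common(1)

print(biased, better, exact)
```

With one term per shard the merged winner is RED with 5, although GREEN is the true leader with 8.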
21. Shard Size: Top 3
[Diagram: each of four shards on the data nodes returns 15 buckets; the coordinating node returns the top 3]
• How many buckets to return per shard?
• `shard_size`
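A sketch of a terms aggregation request that asks each shard for 15 buckets while returning the top 3, matching the numbers on this slide. The field name `product_id`, the aggregation name, and the helper are hypothetical; `size` and `shard_size` are the terms aggregation parameters.

```python
import json


def terms_request(field, size, shard_size):
    """Build a search body for a terms aggregation with an explicit shard_size."""
    if shard_size < size:
        # Choice made for this sketch: reject a shard_size below size up front.
        raise ValueError("shard_size should be at least size")
    return {
        "size": 0,  # aggregation results only, no hits
        "aggs": {
            "top_terms": {
                "terms": {"field": field, "size": size, "shard_size": shard_size}
            }
        },
    }


body = terms_request("product_id", size=3, shard_size=15)
print(json.dumps(body, indent=2))
```

A larger `shard_size` costs more memory and network traffic per shard in exchange for more accurate counts.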
23. Summary
Aggregations are powerful & fast
Need to trade accuracy for speed/memory in some cases
Use `shard_size` to help manage accuracy with terms aggregation
Leverage Kibana to help write aggregations!
Profile your aggregations using the Query Profiler
24. What We Missed
Pipeline Aggregations: Aggregations of Aggregations
Using the shard request cache (`requests.cache`) to cache complex static aggregations
Matrix Aggregations: covariance and correlation
New aggregation types introduced all the time
26. What to expect?
Efficient sparse doc-value reading and writing
Index-time sorting
Removal of types
Cross-cluster search
Upgrading to 6.0 with rolling restarts!
And so much more!
27. Resources
• Elastic Discussion Forums: https://discuss.elastic.co/
• Aggregation Documentation: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations.html
• Terms Aggregation Approximation: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-terms-aggregation.html#search-aggregations-bucket-terms-aggregation-approximate-counts
• Similar deck from my colleagues Adrien and Colin: https://www.elastic.co/elasticon/2015/sf/all-about-aggregations