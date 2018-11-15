
Optimizing Elastic for Search Jared McQueen October 25, 2018
"You Know, for Search"
Learn best practices for squeezing every last drop of performance out of Elasticsearch queries and aggregations -- all based off of real-world production clusters.

  1. Optimizing Elastic for Search. Jared McQueen, October 25, 2018
  2. “You Know, for Search”
  7. Elasticsearch is already optimized
  8. Optimizing Elastic for SEARCH. Goals: at a sufficiently low level, understand how distributed search works; see how architecture considerations affect search; gain exposure to non-default settings (cluster settings, index settings, queries). Make Search Fast Again!
  9. Optimizing Elastic for SEARCH
  10. Distributed Search 101: a 5-minute crash course
  11. Search starts at a client: clients send requests. Clients include Kibana, cURL, an app or web app, and custom code in B4J, C++, Clojure, ColdFusion (CFML), Erlang, Go, Groovy, Haskell, Java, JavaScript, Kotlin, Lua, .NET, OCaml, Perl, PHP, Python, R, Ruby, Rust, Scala, Smalltalk, or Vert.x.
  12. REST request URI: GET my_index/_search?q=user:jmcqueen. Here my_index is the index (a collection of similar documents), _search is the API endpoint, and ?q=user:jmcqueen is the query. Reference: Elasticsearch Reference » Search APIs » URI Search
  13. REST request body: GET my_index/_search { "query": { "match": { "user": "jmcqueen" } } }. Reference: Elasticsearch Reference » Search APIs » Request Body Search
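
The two request styles on slides 12 and 13 can be sketched side by side. This is a minimal illustration using only the Python standard library; the helper names `uri_search` and `body_search` are hypothetical, and nothing is sent to a cluster.

```python
import json
from urllib.parse import urlencode

def uri_search(index, field, value):
    """Build the path and query string for a URI search (slide 12)."""
    return f"/{index}/_search?" + urlencode({"q": f"{field}:{value}"})

def body_search(index, field, value):
    """Build the path and JSON body for a request body search (slide 13)."""
    body = {"query": {"match": {field: value}}}
    return f"/{index}/_search", json.dumps(body)

uri = uri_search("my_index", "user", "jmcqueen")
path, body = body_search("my_index", "user", "jmcqueen")
print(uri)         # the colon in user:jmcqueen is percent-encoded
print(path, body)
```

The request body form is the one the rest of the deck builds on, since settings such as bool.filter only exist in the Query DSL.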
  14. Cluster: a cluster is a collection of one or more nodes (servers) that together hold your entire data and provide federated search. Diagram: a client and a cluster of Node A, Node B, Node C, and Node D.
  15. Diagram: the client sends a search request to the cluster (Node A, Node B, Node C, Node D).
  16. Diagram: the node that receives the search request acts as the coordinating node.
  17. Scatter phase: the coordinating node forwards the request to the data nodes that hold the data. Each data node executes the request locally and returns its results to the coordinating node. Let’s zoom into Node A.
  18. Index: an index is a collection of documents that have somewhat similar characteristics. Diagram (Node A): an index contains shards, and each shard contains segments.
  19. Shard: shards contain mini-indices in Lucene where data is actually stored.
  20. Segments: where data in Lucene is flushed, merged, and stored on disk.
  21. A sufficient amount of data is collected to satisfy the query and sent back to the coordinating node.
  22. Gather phase: the coordinating node reduces each data node’s results into a single global result set.
  23. The coordinating node sends the search response back to the client.
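
The scatter and gather phases can be sketched as a toy simulation. This is only an illustration under stated assumptions: shards are plain dictionaries, the score is a raw term count rather than Lucene's scoring, and the function names `search_shard` and `coordinate` are hypothetical.

```python
import heapq

def search_shard(docs, term, size):
    """Scatter phase, on one shard: score matching docs and keep the local top `size`."""
    hits = [(text.split().count(term), doc_id) for doc_id, text in docs.items()]
    hits = [hit for hit in hits if hit[0] > 0]
    return heapq.nlargest(size, hits)

def coordinate(shards, term, size):
    """Gather phase: reduce every shard's local top hits into one global top `size`."""
    partial = [hit for docs in shards for hit in search_shard(docs, term, size)]
    return heapq.nlargest(size, partial)

shards = [
    {"a1": "search search fast", "a2": "slow"},
    {"b1": "search", "b2": "search search search"},
]
top = coordinate(shards, "search", 2)
print(top)
```

Each shard only returns its own best `size` hits, which is why the coordinating node needs memory and CPU proportional to the number of shards it gathers from.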
  24. Crash Course Complete!
  25. Knobs and Dials: how to void your warranty
  26. GET _cluster/settings: { "persistent": {}, "transient": {} }
  27. PUT my_index, then GET my_index/_settings: { "my_index": { "settings": { "index": { "creation_date": "1539784825143", "number_of_shards": "5", "number_of_replicas": "1", "uuid": "VlhRNHu7S_CxvyiQCec20w", "version": { "created": "6040299" }, "provided_name": "my_index" } } } }
  28. Include Defaults: GET my_index/_settings?include_defaults returns ~200 lines of JSON; GET _cluster/settings?include_defaults returns ~1000 lines of JSON. Reference: Elasticsearch Reference » Indices APIs » Get Field Mapping
  30. Cluster Optimizations
  31. GET my_index/_search. From the docs: By default, Elasticsearch selects from the available shard copies in an unspecified order, taking the allocation awareness and adaptive replica selection configuration into account. Allocation awareness? Adaptive replica selection?
  32. Diagram: a client sends a search request to an ES cluster of Node A, Node B, Node C, and Node D. This isn’t your typical deployment.
  33. Diagram: the ES cluster is split into East and West, and the client sends its search request to it. Example attributes: Hot/Warm, SSD/HDD, Region / AZ, Datacenter. How do we add node attributes?
  34. Node Attributes. From the docs: If Elasticsearch is aware of the physical configuration of your hardware, it can ensure that the primary shard and its replica shards are spread across different physical servers, racks, or zones, to minimize the risk of losing all shard copies at the same time. Enabling node attributes, in elasticsearch.yml: node.attr.region: east. Viewing node attributes, in dev tools: GET _cat/nodeattrs. What do we do with node attributes? Reference: Elasticsearch Reference » Modules » Cluster » Shard Allocation Awareness
  35. Shard Allocation Awareness. From the docs: When executing search or GET requests, with shard awareness enabled, Elasticsearch will prefer using local shards (shards in the same awareness group) to execute the request. This is usually faster than crossing between racks or across zone boundaries. In elasticsearch.yml: cluster.routing.allocation.awareness.attributes: region. Preferring local shards = FAST! Reference: Elasticsearch Reference » Modules » Cluster » Shard Allocation Awareness
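
The "prefer local shards" behaviour can be sketched as a small selection function. This is a simplified model, not Elasticsearch's routing code: the `pick_copy` helper, the dictionary layout, and the node names are all hypothetical.

```python
def pick_copy(copies, local_attrs, awareness_attr):
    """Prefer a shard copy whose node shares the coordinating node's awareness value."""
    local_value = local_attrs.get(awareness_attr)
    local = [c for c in copies if c["attrs"].get(awareness_attr) == local_value]
    candidates = local or copies  # fall back to any copy if none is local
    return candidates[0]

copies = [
    {"node": "node-b", "attrs": {"region": "west"}},
    {"node": "node-c", "attrs": {"region": "east"}},
]
# The coordinating node carries node.attr.region: east, as in the slide's example.
chosen = pick_copy(copies, {"region": "east"}, "region")
print(chosen["node"])
```

The fallback matters: awareness is a preference, so a search still succeeds when no copy lives in the same region as the coordinating node.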
  36. By default, searches are sent round robin, with no concern for the health and welfare of the nodes. Think back to the scatter/gather phases of a search request.
  37. What happens if a node gets distressed? Bad! Remember, searches are only as fast as the slowest node.
  38. Adaptive Replica Selection. From the docs: Instead of sending requests in a round-robin fashion to each copy of the shard, Elasticsearch selects the "best" copy and routes the request there. PUT /_cluster/settings { "transient": { "cluster.routing.use_adaptive_replica_selection": true } }. Available since 6.1; will default to true in 7.x. Great blog post: https://www.elastic.co/blog/improving-response-latency-in-elasticsearch-with-adaptive-replica-selection Reference: Elasticsearch Reference » Search APIs
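
The idea of picking the "best" copy instead of rotating through copies can be sketched as follows. This is a deliberately simplified ranking (smoothed response time multiplied by in-flight requests) and is an assumption made for illustration, not the formula Elasticsearch uses; `CopyStats`, `ewma`, and `best_copy` are hypothetical names.

```python
def ewma(previous, sample, alpha=0.3):
    """Exponentially weighted moving average of observed response times."""
    return alpha * sample + (1 - alpha) * previous

class CopyStats:
    """Per-copy statistics as seen from the coordinating node."""
    def __init__(self, node):
        self.node = node
        self.response_ms = 0.0
        self.outstanding = 0  # requests sent but not yet answered

    def record(self, response_ms):
        self.response_ms = ewma(self.response_ms, response_ms)

def best_copy(stats):
    """Pick the copy with the lowest smoothed latency, penalized by queued work."""
    return min(stats, key=lambda s: s.response_ms * (1 + s.outstanding))

healthy, distressed = CopyStats("node-b"), CopyStats("node-c")
for _ in range(5):
    healthy.record(10)
    distressed.record(200)
print(best_copy([distressed, healthy]).node)
```

Round robin would send every second request to the distressed node; a ranking like this steers traffic away from it until its observed latency recovers.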
  39. Expensive Searches. What if you have no other option than to run an expensive search? Is there a way to minimize the impact on my data nodes? Should I follow the advice of my sales rep?
  40. Dedicated Coordinating Node (diagram: a client and a cluster of Node A, Node B, Node C, and Node D).
  41. Scale Up / Coordinating Nodes. From the docs: Every node is implicitly a coordinating node. This means that a node that has all three node.master, node.data and node.ingest set to false will only act as a coordinating node, which cannot be disabled. As a result, such a node needs to have enough memory and CPU in order to deal with the gather phase. In elasticsearch.yml: node.master: false, node.data: false, node.ingest: false. Reference: Elasticsearch Reference » Modules » Node
  42. Index Optimizations
  43. Lucene Segment Merges. Michael McCandless, http://blog.mikemccandless.com
  44. Index Sorting. From the docs: When creating a new index in Elasticsearch it is possible to configure how the Segments inside each Shard will be sorted. By default Lucene does not apply any sort. The index.sort.* settings define which fields should be used to sort the documents inside each Segment. Notes: speeds up aggregations, especially “top N” type queries; can be defined only once, at index creation (you cannot add or update a sort on an existing index); decreases indexing throughput, since documents must be sorted at flush and merge time. Reference: Elasticsearch Reference » Index Modules » Index Sorting
  45. Index Sorting: PUT my_index { "settings": { "index": { "sort.field": ["username", "date"], "sort.order": ["asc", "desc"] } }, "mappings": {..} }. Reference: Elasticsearch Reference » Index Modules » Index Sorting
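
Why a sorted segment helps "top N" queries can be sketched with plain lists. This is a toy model assuming a single sort field (`date`, descending) rather than the two-field sort in the slide; the function names are hypothetical and no Lucene behaviour is reproduced beyond the early-termination idea.

```python
from itertools import islice

def top_n_unsorted(segment, n):
    """Without index sorting, every document is visited and then sorted."""
    visited = len(segment)
    hits = sorted(segment, key=lambda d: d["date"], reverse=True)[:n]
    return hits, visited

def top_n_sorted(segment, n):
    """If the segment is stored in the requested order, stop after n documents."""
    hits = list(islice(segment, n))
    return hits, len(hits)

# 1000 documents in arbitrary order (37 is coprime with 1000, so dates are unique).
segment = [{"date": (i * 37) % 1000} for i in range(1000)]
sorted_segment = sorted(segment, key=lambda d: d["date"], reverse=True)

slow_hits, slow_visited = top_n_unsorted(segment, 3)
fast_hits, fast_visited = top_n_sorted(sorted_segment, 3)
print(slow_visited, fast_visited)
```

Both calls return the same three documents; the difference is how many documents had to be examined, which is the cost that moves to flush and merge time when the index is sorted.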
  46. Eager Global Ordinals. From the docs: global ordinals are a "data-structure" on top of doc values that maintains an incremental numbering for each unique term in lexicographic order. They are primarily used for `keyword` fields, by features that use segment ordinals, such as the `terms` aggregation. By default, global ordinals are loaded at search time, which is great when optimizing for indexing speed. Loading them eagerly will shift the cost from search time to refresh time. Reference: Elasticsearch Reference » Mapping » Mapping parameters » eager_global_ordinals
  47. Eager Global Ordinals: PUT my_index/_mapping/_doc { "properties": { "tags": { "type": "keyword", "eager_global_ordinals": true } } }. Reference: Elasticsearch Reference » Mapping » Mapping parameters » eager_global_ordinals
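
The relationship between per-segment ordinals and global ordinals can be sketched with dictionaries. This is a simplified model of the data structure described above, with hypothetical function names; it shows the work that has to happen somewhere, either at search time (the default) or at refresh time (eager).

```python
def segment_ordinals(terms):
    """Per-segment ordinals: number the segment's unique terms in sorted order."""
    return {term: i for i, term in enumerate(sorted(set(terms)))}

def global_ordinals(segments):
    """Global ordinals: one numbering across the unique terms of all segments."""
    all_terms = sorted(set().union(*segments))
    global_ords = {term: i for i, term in enumerate(all_terms)}
    # For each segment, map segment ordinal -> global ordinal.
    mappings = []
    for terms in segments:
        seg = segment_ordinals(terms)
        mappings.append({seg_ord: global_ords[t] for t, seg_ord in seg.items()})
    return global_ords, mappings

segments = [["search", "elastic"], ["lucene", "search"]]
global_ords, mappings = global_ordinals(segments)
print(global_ords)
print(mappings)
```

Note that "search" has segment ordinal 1 in both segments but its meaning is only comparable across segments through the global numbering, which is what a `terms` aggregation relies on.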
  48. Total Shards Per Node. From the docs: The maximum number of shards (replicas and primaries) that will be allocated to a single node. PUT my_index/_settings { "routing": { "allocation": { "total_shards_per_node": 2 } } }. Reference: Elasticsearch Reference » Index Modules » Index Shard Allocation » Total Shards Per Node
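
A quick back-of-the-envelope check shows what this limit implies for cluster size. The sketch below is plain arithmetic under stated assumptions: it only counts shard copies against the per-node limit and ignores every other allocation rule; the function names are hypothetical. The inputs reuse the 5 primaries and 1 replica reported for my_index earlier in the deck.

```python
import math

def min_nodes_needed(primaries, replicas, total_shards_per_node):
    """Smallest node count that can hold every shard copy under the limit."""
    copies = primaries * (1 + replicas)
    return math.ceil(copies / total_shards_per_node)

def unassigned_copies(primaries, replicas, total_shards_per_node, nodes):
    """Shard copies left without a home when the cluster is too small."""
    copies = primaries * (1 + replicas)
    return max(0, copies - nodes * total_shards_per_node)

print(min_nodes_needed(5, 1, 2))      # nodes required for 10 shard copies
print(unassigned_copies(5, 1, 2, 4))  # copies left over on a 4-node cluster
```

With a limit of 2, the four-node cluster from the diagrams could not place all ten copies, so the limit should be chosen with the node count in mind.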
  49. Rollup. When all else fails, cheat (use your resources). Pick all the fields you want rolled up, and a new index is created with just the rolled-up data. This new rollup index then lives side by side with the index that it’s being rolled up from. The Rollup API also has the ability to search both live and rollup data at the same time, returning data from both indices in a single response. https://www.elastic.co/blog/data-rollups-in-elasticsearch-you-know-for-saving-space Reference: Elasticsearch Reference [6.4] » X-Pack APIs » Rollup APIs » Rollup Search
  50. Search Optimizations: what to avoid
  51. Use bool.filter to disable document scoring. GET _search { "query": { "bool": { "filter": { "term": { "status": "active" } } } } }. bool query clauses: should, must, must_not, filter. The filter clause disables scoring and enables the query cache: a search boost. Reference: Elasticsearch Reference » Query DSL » Compound queries » Bool Query
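
The difference between a scored clause and a filter clause is only where the clause sits in the bool query. A minimal sketch that builds both request bodies as Python dictionaries; the helper names are hypothetical and the bodies are not sent anywhere.

```python
import json

def scored_query(field, value):
    """Term clause under bool.must: matching documents are scored."""
    return {"query": {"bool": {"must": {"term": {field: value}}}}}

def filtered_query(field, value):
    """Same clause under bool.filter: a yes/no match with no scoring."""
    return {"query": {"bool": {"filter": {"term": {field: value}}}}}

body = filtered_query("status", "active")
print(json.dumps(body))
```

When the question is "does this document match?" rather than "how well does it match?", the filter form is the one the slide recommends.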
  52. Things to Avoid. Avoid using “now”: use rounded dates, which have the benefit of making better use of the query cache. Avoid scripts: in general, scripts should be avoided; if they are absolutely needed, you should prefer the painless and expressions engines. Avoid scoring: use bool.filter, which also allows the query to be cached. Avoid using parent/child relationships: joins should be avoided; nested can make queries several times slower, and parent-child relations can make queries hundreds of times slower. Reference: Elasticsearch Reference » Query DSL » Compound queries » Bool Query
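
Why rounded dates cache better can be sketched by comparing the request bodies two clients would send a few seconds apart. This sketch rounds on the client side purely for illustration; in Elasticsearch the same effect is usually reached with date math rounding such as now-1h/m. The function names and the `timestamp` field are hypothetical.

```python
from datetime import datetime, timezone

def exact_range(field, now):
    """Range filter with a millisecond-precise bound: differs on nearly every request."""
    return {"range": {field: {"gte": now.isoformat(timespec="milliseconds")}}}

def rounded_range(field, now):
    """Same filter with the bound rounded down to the minute."""
    rounded = now.replace(second=0, microsecond=0)
    return {"range": {field: {"gte": rounded.isoformat(timespec="minutes")}}}

first = datetime(2018, 10, 25, 9, 30, 5, 120000, tzinfo=timezone.utc)
second = datetime(2018, 10, 25, 9, 30, 25, 480000, tzinfo=timezone.utc)

print(exact_range("timestamp", first) == exact_range("timestamp", second))
print(rounded_range("timestamp", first) == rounded_range("timestamp", second))
```

Identical filters can be answered from a cached result; filters that differ by a few milliseconds cannot, which is the cost of an unrounded “now”.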
  53. RECAP
  54. Cluster: set up node attributes; enable shard allocation awareness; enable adaptive replica selection; limit total shards per node; use coordinating nodes for expensive searches; finally, scale up. Index: index sorting; enable eager global ordinals; force merge; rollup. Query: avoid scoring; avoid scripts; avoid “now” in search; avoid parent/child. For more information, read the docs: Elasticsearch Reference » How To » Tune for search speed
  55. Questions? Come to the AMA

