Elasticsearch in Production (London version)

Elasticsearch in production, or an overview of things you want to know about before happening upon them in production.

Published in: Software, Technology, Business

  1. 1. Elasticsearch in Production • Alex Brasetvik • alex@found.no • @alexbrasetvik
  2. 2. Elasticsearch in Production ! Alex Brasetvik alex@found.no @alexbrasetvik
  3. 3. Who? • Co-founder of Found AS • 8+ years search, 3+ Elasticsearch • Herding hundreds of Elasticsearch clusters
  4. 4. Agenda
  5. 5. Agenda • Anti-patterns • Memory / Resource Usage • Distributed problems • Security • Client concerns • Changing a cluster
  6. 6. found.no/foundation • Elasticsearch in Production • Elasticsearch as a NoSQL Database • Intro to Function Scoring • All About Analyzers • Securing your Elasticsearch Cluster
  7. 7. • Snapshot / Restore • Circuit breakers • Document values • Aggregations • Distributed percolation • Suggesters • …
  8. 8. Anti-Patterns
  9. 9. Arbitrary Keys • “Schema Free” • One field per value • Ever-growing cluster state • Example: acls: {1234: READ, 42: WRITE}
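
One way around the one-field-per-value pattern on this slide is to move the arbitrary key into a value, so the mapping stays at a fixed number of fields. A minimal Python sketch; the helper name and the `id` / `permission` field names are assumptions for illustration, not from the talk.

```python
def acls_to_entries(acls):
    """Turn {"1234": "READ", "42": "WRITE"} into fixed-shape entries.

    Every entry has the same two fields, so new ACL keys do not add
    new fields to the mapping (and thus to the cluster state).
    """
    return [
        {"id": str(key), "permission": value}
        for key, value in sorted(acls.items(), key=lambda kv: str(kv[0]))
    ]


document = {"acls": acls_to_entries({"1234": "READ", "42": "WRITE"})}
```

To query `id` and `permission` as a pair, the `acls` field would most likely have to be mapped as `nested`.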
  10. 10. Heavy Updating • Update = Delete + Reindex • Be careful with counters
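
Since every update is a delete plus a reindex, one possible way to be careful with counters is to accumulate increments on the client and send a single update per document. A rough sketch under that assumption; `CounterBuffer` is a hypothetical helper, and sending the batch to Elasticsearch is left out.

```python
from collections import Counter


class CounterBuffer:
    """Accumulate increments in memory and emit one update per document."""

    def __init__(self):
        self._pending = Counter()

    def increment(self, doc_id, amount=1):
        self._pending[doc_id] += amount

    def flush(self):
        """Return sorted (doc_id, total_increment) pairs and reset the buffer."""
        batch = sorted(self._pending.items())
        self._pending.clear()
        return batch


buffer = CounterBuffer()
for doc_id in ["a", "b", "a", "a"]:
    buffer.increment(doc_id)
updates = buffer.flush()  # four increments collapse into two updates
```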
  11. 11. Slow queries • WHERE foo ILIKE '%bar%' • {"query_string": {"query": "foo:*bar*"}} • Don’t ask for 3300 results :)
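
The last bullet can be enforced in application code by capping the page size before a request is built. A small sketch; the limit of 100 and the function name are assumptions, not recommendations from the talk.

```python
MAX_PAGE_SIZE = 100  # hypothetical limit, tune for your own application


def clamp_page(size, offset=0, max_size=MAX_PAGE_SIZE):
    """Return `from`/`size` values that never exceed the configured page size."""
    return {"from": max(0, offset), "size": max(0, min(size, max_size))}


page = clamp_page(3300)
```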
  12. 12. Arbitrary searches • query: {filtered: {filter: {term: {user_id: 42}}, query: [user’s query here]}}
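
The structure on this slide, written out as a request body builder. This is a sketch that assumes the 1.x-era `filtered` query the slide uses; the function name and the `title` field are made up for the example.

```python
def restrict_to_user(user_id, user_query):
    """Wrap a user's query in a filter on `user_id`, as on the slide."""
    return {
        "query": {
            "filtered": {
                "filter": {"term": {"user_id": user_id}},
                "query": user_query,
            }
        }
    }


body = restrict_to_user(42, {"match": {"title": "elasticsearch"}})
```

Wrapping only constrains which documents can match. It does not make an arbitrary user-supplied query cheap or safe to run.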
  13. 13. Memory
  14. 14. Memory • Field caches • Filter caches • Page caches • Aggregations • Index building
  15. 15. Page Cache • Keeping index pages in memory • Can’t have too much • Outgrow: Gradual slowdown
  16. 16. Heap Space • Memory used by Elasticsearch process • Field / Filter caches • Aggregations
  17. 17. Time Bomb
  18. 18. Time Bomb
  19. 19. OutOfMemoryError • Woah there, I ate all the memories • Your cluster may or may not work any more
  20. 20. OutOfMemory • Growing too big • Selecting too big timespan in Kibana • Document ingestion peak
  21. 21. Preventing OOMs • Have enough memory :-) • Understand your search’s memory profile • Bulk / Circuit breaker settings • Monitoring • Document values
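
On the client side, one way to soften a document ingestion peak is to bound the size of each bulk request. A minimal sketch; the batch size of 500 is an arbitrary assumption, and actually sending each batch is left out.

```python
def batches(documents, max_docs=500):
    """Yield lists of at most `max_docs` documents from any iterable."""
    batch = []
    for doc in documents:
        batch.append(doc)
        if len(batch) >= max_docs:
            yield batch
            batch = []
    if batch:
        # Emit the final, possibly smaller, batch.
        yield batch


sizes = [len(b) for b in batches(range(1200), max_docs=500)]
```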
  22. 22. Marvel (/_stats)
  23. 23. "my_field": { "type": "string", "fielddata": { "format": "doc_values" } }
  24. 24. Document Values • Rely on page cache • Only caches doc values actually used
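
The mapping from the previous slide can be generated for several fields at once. A sketch using the 1.x-era `fielddata` format shown there; the helper name and field names are hypothetical, and adding `"index": "not_analyzed"` is an assumption on my part (doc values on string fields required it in that era, as far as I know).

```python
def doc_values_properties(field_names):
    """Build a `properties` mapping that stores the given string fields as doc values."""
    return {
        name: {
            "type": "string",
            "index": "not_analyzed",  # assumption, not on the slide
            "fielddata": {"format": "doc_values"},
        }
        for name in field_names
    }


properties = doc_values_properties(["my_field", "other_field"])
```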
  25. 25. Sizing
  26. 26. Sizing • Test, don’t guess • Start big, scale down • Index, search, monitor
  27. 27. Glitch Meltdown
  28. 28. Glitch Meltdown
  29. 29. • Tie-breaker can be a cheap master-node • Applies to data centers / availability zones too
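
Why a tie-breaker helps can be seen from the usual majority rule for master-eligible nodes, which is what `discovery.zen.minimum_master_nodes` is normally set to. A small sketch; the function name is only illustrative.

```python
def minimum_master_nodes(master_eligible):
    """Smallest majority of the master-eligible nodes."""
    if master_eligible < 1:
        raise ValueError("need at least one master-eligible node")
    return master_eligible // 2 + 1


quorums = {n: minimum_master_nodes(n) for n in (2, 3)}
```

With two master-eligible nodes the majority is two, so losing either one leaves no quorum. A third, cheap master-eligible node keeps the majority at two while tolerating the loss of one node.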
  30. 30. • Data-only nodes • Master-only nodes
  31. 31. Jepsen
  32. 32. Jepsen • Kyle Kingsbury’s series on distributed systems • Distributed systems are hard • aphyr.com
  33. 33. Security
  34. 34. Security • “Not my job!” – Elasticsearch • That’s fine!
  35. 35. Dynamic Scripts • Scoring • Aggregations • Updating
  36. 36. Dynamic Scripts Runtime.getRuntime().exec(…)
  37. 37. Dynamic Scripts • Runtime.getRuntime().exec(…) • <script src="http://127.0.0.1:9200/_search?callback=capture&…
  38. 38. Security • Disable dynamic scripts (On by default in ≤1.1) • Mind index patterns • Even then, don’t accept arbitrary requests
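
"Mind index patterns" can be sketched as a check in a proxy or application layer that only lets plain, single index names with an expected prefix through. This is an assumed policy for illustration, not a complete access control scheme; the prefix convention and function name are hypothetical.

```python
import re

# Plain index names only: no wildcards, commas, or leading underscore.
_PLAIN_INDEX = re.compile(r"[a-z0-9][a-z0-9_-]*")


def is_allowed_index(index, allowed_prefix):
    """Accept a single concrete index name that starts with `allowed_prefix`."""
    if not _PLAIN_INDEX.fullmatch(index):
        return False
    return index.startswith(allowed_prefix)


checks = {
    name: is_allowed_index(name, "customer42-")
    for name in ["customer42-logs", "customer42-*", "_all", "customer42-a,other", "other-logs"]
}
```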
  39. 39. Client Concerns
  40. 40. Client Concerns • Connection pools • Idempotent requests • Have sane syncing/indexing strategies
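
One common way to make indexing requests idempotent is to derive the document ID from the source record, so a retried request overwrites the same document instead of creating a duplicate. A sketch under that assumption; the `source:natural_key` scheme and the function name are made up here.

```python
import hashlib


def document_id(source, natural_key):
    """Derive a stable document ID from where a record came from and its key."""
    digest = hashlib.sha1(f"{source}:{natural_key}".encode("utf-8"))
    return digest.hexdigest()


first_try = document_id("orders", 1001)
retry = document_id("orders", 1001)  # same input gives the same ID
```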
  41. 41. # BOOM !
  42. 42. Cluster changes
  43. 43. Cluster changes • Make new nodes join existing cluster • No rolling restarts • Easy rollback if things go bad
  44. 44. v1.0.0 v1.0.1
  45. 45. Cluster changes • Test first • Mind recover_*-settings
  46. 46. Multi-Cluster Workflows • Snapshot/Restore • Operations across clusters • Swap clusters! • Works well with good syncing strategy
  47. 47. • Rolling restarts: Risky, fast • Grow and shrink: Less risky, copies lots of data • Multiple clusters: Least risky, copies lots of data
  48. 48. Misc • Same JVM • ulimits • Unicast • Kernel-settings like IO-scheduler
  49. 49. ? @foundsays
