From Lucene to Elasticsearch, a short explanation of horizontal scalability

What makes Elasticsearch "horizontally" scalable while Lucene is not? How does the technology of one affect the other? How does Elasticsearch scale over Lucene, and what are the limiting factors?

Published in: Technology

  1. Scaling Lucene: The event of Elasticsearch. Stéphane Gamard
  2. Scalability. Scalability is defined along 3 main axes:
     • Index size - the number of entries upon which we act
     • QPS - the number of requests serviced per second
     • Time to operation - the time taken to be operational
  3. Lucene. In a nutshell, Lucene does not scale. Why?
     • IR library - purely focused on TF-IDF
     • Bounded by native resources - vertical scaling
     • NRT (near-real-time) inverse lookup - segments
  4. Lucene Segments: the Lucene storage, just a “bunch of files”
  5. Lucene Indexing: in a “document” perspective
     • Documents: {#hello, #world} {#there, #is, #a, #brown, #fox} {#the, … , #kitchen} …
     • Dictionary: #a → T1, #is → T2, … , #fox → T45, …
     • Inverse lookup: T1 → {#1, #33}, T2 → {#2, … , #87}, … , T45 → {#2, …}, …
     [diagram labels: Dictionary, Inverse Lookup, Segment]
  6. Lucene Indexing: factors of growth
     • Dictionary size - NLP*
     • New inverse entries
     [diagram: the same segment as on slide 5, with its Dictionary (#a → T1, #is → T2, … , #fox → T45, …) and Inverse Lookup (T1 → {#1, #33}, T2 → {#2, … , #87}, … , T45 → {#2, …}, …)]
  7. Lucene Indexing: in a storage perspective [diagram: Segment]
  8. Lucene Indexing: in a storage perspective [diagram: Segment]
  9. Lucene Indexing: in a storage perspective [diagram: Segment]
  10. Lucene Indexing: in a storage perspective [diagram: Segment, IndexReader(s), IndexWriter]
  11. Lucene Indexing: in a storage perspective [diagram: IndexReader(s), IndexWriter, Lucene Index]
  12. Lucene Segments: the Lucene storage, just a “bunch of files”
  13. Lucene Indexing: the wonderful world of merging segments. http://blog.mikemccandless.com/2011/02/visualizing-lucenes-segment-merges.html
  14. Lucene Wrap-up. A Lucene index is:
      • A collection of segments
      • One or multiple IndexReaders
      • A single IndexWriter
  15. Lucene Wrap-up. A single Lucene index scales to:
      • Index size - available HDD/RAM for segments
      • QPS - number of IndexReader threads
      • Time to operation - speed at which the IndexWriter can ingest (IOPS)
      It can only scale vertically!
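
The segment model summarised in slides 4 to 15 can be sketched in a few lines of Python. This is a toy model under stated assumptions, not Lucene's API or file format: `build_segment`, `search` and `merge` are hypothetical helpers, a segment is just an immutable map from term to document ids, and deletes, scoring and the term ids (T1, T2, …) are left out.

```python
from collections import defaultdict


def build_segment(docs):
    """Build one immutable toy segment: term -> sorted list of doc ids."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


def search(segments, term):
    """A reader consults every segment and unions the matching doc ids."""
    hits = set()
    for segment in segments:
        hits.update(segment.get(term.lower(), []))
    return sorted(hits)


def merge(segments):
    """Combine several segments into one (a stand-in for a segment merge)."""
    merged = defaultdict(set)
    for segment in segments:
        for term, ids in segment.items():
            merged[term].update(ids)
    return {term: sorted(ids) for term, ids in merged.items()}


# Each flush of the single writer adds a new segment to the index.
segments = [
    build_segment({1: "hello world", 2: "there is a brown fox"}),
    build_segment({3: "the fox is in the kitchen"}),
]
print(search(segments, "fox"))           # [2, 3]
print(search([merge(segments)], "fox"))  # [2, 3]
```

Every new batch of documents adds a segment, so readers have more segments to consult until a merge collapses them, and all of that work competes for the disk, RAM and threads of one machine.
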
  16. Elasticsearch, also known as the commodity scaling of Lucene ;) There is no magic… It’s about partitioning, using an index of indexes as its index.
  17. Elasticsearch: a shard is the magic sauce of web scale [diagram: one Elasticsearch index made of five Lucene indexes]
  18. Elasticsearch: document indexing [diagram: five Lucene shards]
      • Distributed
      • Routing
  19. Elasticsearch: request {search: {…}} [diagram: five Lucene shards]
      • Parallel
      • Aggregated
  20. Elasticsearch: in a nutshell
      • Distributed - indexing is distributed, with an IndexWriter per shard
      • Parallel - requests are parallelised, with an IndexReader per shard
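
Slides 18 to 20 reduce Elasticsearch to two moves: route each document to one shard at indexing time, and fan each request out to every shard at search time. The sketch below illustrates both, assuming five in-memory shards. `shard_for`, `index_document` and `search` are hypothetical names, and `zlib.crc32` is only a stand-in for whatever hash Elasticsearch applies to the routing key.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

NUM_SHARDS = 5  # the slides draw five Lucene indexes per Elasticsearch index


def shard_for(doc_id, num_shards=NUM_SHARDS):
    """Route a document id to a shard (crc32 is an assumed stand-in hash)."""
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards


def index_document(shards, doc_id, text):
    """Indexing touches exactly one shard, so writes spread over the writers."""
    shards[shard_for(doc_id, len(shards))][doc_id] = text


def search_shard(shard, term):
    """Each shard answers from its own documents only."""
    return [doc_id for doc_id, text in shard.items() if term in text.split()]


def search(shards, term):
    """Scatter the request to every shard in parallel, then aggregate."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partials = list(pool.map(search_shard, shards, [term] * len(shards)))
    return sorted(hit for partial in partials for hit in partial)


shards = [{} for _ in range(NUM_SHARDS)]
for i, text in enumerate(["hello world", "a brown fox", "the fox sleeps"]):
    index_document(shards, f"doc-{i}", text)
print(search(shards, "fox"))  # ['doc-1', 'doc-2']
```

The aggregation step at the end of `search` is where the result-merging latency comes from: the caller waits for the slowest shard before it can combine the partial results.
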
  21. Clustering: how to leverage ES to scale Lucene. Sample sizing of one Lucene index for xM indexed documents:
      • 2 threads - 1 searcher, 1 writer
      • 2G RAM - Lucene cache
      • 30G disk - index size
  22. Clustering: single machine scope. A machine with 8 cores, 16G RAM and a 500G HDD, hosting an Elasticsearch index of four Lucene shards (2T/2G/30G each), can sustain 4 times xM documents.
  23. Clustering: 1 machine -> 4 * xM documents [chart axes: # Documents, QPS]
  24. Clustering: 2 machines -> 2 * 4 * xM documents [chart axes: # Documents, QPS]
      • 4 threads - 3 searchers, 1 writer
      • 4G RAM - Lucene cache
      • 60G disk - index size
  25. Clustering: 4 machines -> 2 * 4 * xM documents, with twice the QPS [chart axes: # Documents, QPS]
  26. Clustering: is there a limit to this scalability? [chart axes: # Documents, QPS]
  27. Clustering: 4 machines -> 4 * 4 * xM documents [chart axes: # Documents, QPS]
      • 8 threads - 7 searchers, 1 writer
      • 8G RAM - Lucene cache
      • 120G disk - index size
  28. Clustering: rules of thumb
      • Threads - the core of the scalability factors
      • IOPS - generally the limiting factor for horizontal scaling
      • RAM - generally the limiting factor for vertical scaling
      ES is generally excellent with its parameters.
  29. Clustering: health
      • Redundancy - auto-balance shards for the best possible HA
      • Timing - warmup and commit points
      • Latency - result merging (especially on remote aggregations)
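
The sizing arithmetic of slides 21 to 24 can be checked with a short calculation. It is a rough sketch that only counts threads, RAM and disk as given on the slides; `Resources`, `shards_per_machine` and `capacity_in_x` are hypothetical names, and IOPS, replicas and merge overhead are ignored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resources:
    threads: int
    ram_gb: int
    disk_gb: int


# Figures taken from the slides; x (millions of documents per shard) stays symbolic.
SHARD = Resources(threads=2, ram_gb=2, disk_gb=30)
MACHINE = Resources(threads=8, ram_gb=16, disk_gb=500)


def shards_per_machine(machine, shard):
    """The scarcest resource decides how many shards fit on one machine."""
    return min(
        machine.threads // shard.threads,
        machine.ram_gb // shard.ram_gb,
        machine.disk_gb // shard.disk_gb,
    )


def capacity_in_x(machines, machine=MACHINE, shard=SHARD):
    """Document capacity as a multiple of xM, with every machine full of shards."""
    return machines * shards_per_machine(machine, shard)


print(shards_per_machine(MACHINE, SHARD))  # 4
print(capacity_in_x(2))                    # 8
```

With these figures the thread count is the scarcest resource (4 shards by threads, 8 by RAM, 16 by disk), which matches the "4 times xM documents" per machine stated on slide 22.
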