Apache Solr/Lucene Internals by Anatoliy Sokolenko

Anatoliy Sokolenko, Software Engineer at Grid Dynamics

  1. 1. Apache Lucene/Solr Internals
  2. 2. About me • Java and all around • Principal Software Engineer at Grid Dynamics, Kharkiv • asokolenko@griddynamics.com
  3. 3. Apache Lucene/Solr Internals
  4. 4. 4 nodes ✕ 12 GB disk space • June 2013 database • 14,630,209 records • Indexing took 5 hours in 100 threads with a batch size of 1000 • Lucene.net • VM: 16 CPU cores, 16 GB memory
  5. 5. lightweight performant search library
  6. 6. Data Model • document oriented • flat • store • index
  7. 7. Data Model • document oriented • flat • store • index • Example: Document (boost = 1.1, docID = 23) with fields score:1, tag:java, type:answer
  8. 8. Showcase
  9. 9. Basic Flow • Components: Index Writer, Analyzer, Lucene Index, Index Reader, Index Searcher
  10. 10. Basic Flow • addDocument: Document (boost = 1.1; score:1, tag:java, type:answer) → Index Writer (with Analyzer) → Lucene Index
  11. 11. Basic Flow • addDocument: Document (boost = 1.1; score:1, tag:java, type:answer) → Index Writer (with Analyzer) → Lucene Index • query: tag:java → Index Searcher (with Index Reader)
  12. 12. Basic Flow • addDocument: Document (boost = 1.1; score:1, tag:java, type:answer) → Index Writer (with Analyzer) → Lucene Index • query: tag:java → Index Searcher (with Index Reader) • search: returns Document (boost = 1.1; score:1, tag:java, type:answer)
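
The flow on the slides above (add a document through a writer, then look it up by a term such as tag:java) can be illustrated with a toy in-memory index. This is a minimal sketch, not Lucene's API: the class `TinyIndex` and its methods `add_document` and `search` are hypothetical names, and analysis, scoring, and boosts are left out.

```python
from collections import defaultdict


class TinyIndex:
    """Toy stand-in for the writer/searcher pair (hypothetical, not Lucene)."""

    def __init__(self):
        self.stored = []                    # docID -> stored fields
        self.postings = defaultdict(list)   # "field:value" -> list of docIDs

    def add_document(self, fields):
        """Store the document and index every field:value pair as a term."""
        doc_id = len(self.stored)
        self.stored.append(dict(fields))
        for name, value in fields.items():
            self.postings[f"{name}:{value}"].append(doc_id)
        return doc_id

    def search(self, term):
        """Return the stored documents that contain the given term."""
        return [self.stored[doc_id] for doc_id in self.postings.get(term, [])]


index = TinyIndex()
index.add_document({"score": 1, "tag": "java", "type": "answer"})
index.add_document({"score": 5, "tag": "css", "type": "question"})
hits = index.search("tag:java")
```

Here `hits` holds only the first document, because it is the only one indexed under the term tag:java.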
  13. 13. Lucene Index Structure
  14. 14. Index
  15. 15. Index Segment A
  16. 16. Index Segment A Segment B Segment C Segment D
  17. 17. Terms: score:0, score:1, score:5, ..., tag:java, tag:mysql, tag:css, ..., type:answer, type:question • Term Infos: 3 4 2 2 3 4 3 2
  18. 18. Terms: score:0, score:1, score:5, ..., tag:java, tag:mysql, tag:css, ..., type:answer, type:question • Term Infos: 4 2 2 3 4 3 2 • Term Frequencies: document IDs stored as deltas (3 +1 +2 10 +3 +1 4 +11 5 +2 6 +52 +1 1 +30 +27 3 +7 +1 5 +2 3 +7 +2) and per-document frequencies (1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 5 2 1)
  19. 19. Term Info Index: score:0, ..., tag:mysql, ..., type:question (3 3 2) • Terms, Term Infos, and Term Frequencies as on slide 18
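
The "+N" notation on the index slides suggests that document IDs in a posting list are stored as gaps from the previous ID. A minimal sketch of that delta encoding follows; the helper names `encode_deltas` and `decode_deltas` are hypothetical, and the on-disk byte format Lucene uses is not modelled.

```python
from itertools import accumulate


def encode_deltas(doc_ids):
    """Turn sorted docIDs into a first ID followed by gaps, e.g. [5, 7] -> [5, 2]."""
    if not doc_ids:
        return []
    gaps = [later - earlier for earlier, later in zip(doc_ids, doc_ids[1:])]
    return [doc_ids[0]] + gaps


def decode_deltas(deltas):
    """Rebuild the docIDs by taking a running sum of the gaps."""
    return list(accumulate(deltas))


java_docs = decode_deltas([5, 2])        # "5 +2" on the slide
mysql_docs = decode_deltas([6, 52, 1])   # "6 +52 +1" on the slide
```

Small gaps need fewer bytes than full document IDs, which is the usual motivation for storing posting lists this way.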
  20. 20. Showcase
  21. 21. scalable enterprise search server
  22. 22. Request Handlers Data Import Handler SolrCloud solrconfig.xml schema.xml
  23. 23. SolrCloud
  24. 24. Join Cluster
  25. 25. Join Cluster • Shard 1 • Shard 2 • Shard 3
  26. 26. Indexing Shard 1 Shard 2 Shard 3
  27. 27. Indexing Shard 1 Shard 2 Shard 3
  28. 28. Indexing Shard 1 Shard 2 Shard 3
  29. 29. Indexing Shard 1 Shard 2 Shard 3
  30. 30. Indexing Shard 1 Shard 2 Shard 3
  31. 31. Indexing Shard 1 Shard 2 Shard 3
  32. 32. Query Shard 1 Shard 2 Shard 3
  33. 33. Query Shard 1 Shard 2 Shard 3 query tag:java
  34. 34. Query Shard 1 Shard 2 Shard 3 query tag:java
  35. 35. Query Shard 1 Shard 2 Shard 3 query tag:java
  36. 36. Query Shard 1 Shard 2 Shard 3 query tag:java
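
The indexing and query slides show only diagrams, so the following is a sketch of the general idea rather than SolrCloud's actual code: a document is routed to one shard by hashing its ID, and a query is sent to every shard, with the per-shard results merged into one ranked list. The function names `shard_for` and `merge_top_k`, the CRC32-modulo routing, and the sample scores are all assumptions made for illustration.

```python
import heapq
import zlib

NUM_SHARDS = 3  # matches the three shards drawn on the slides


def shard_for(doc_id, num_shards=NUM_SHARDS):
    """Pick a shard from a stable hash of the document ID (illustrative routing)."""
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards


def merge_top_k(per_shard_hits, k):
    """Merge (score, doc_id) hits from all shards and keep the k best by score."""
    all_hits = (hit for hits in per_shard_hits for hit in hits)
    return heapq.nlargest(k, all_hits)


# Hypothetical per-shard answers to the query tag:java.
shard_hits = [
    [(0.9, "a"), (0.2, "b")],   # shard 1
    [(0.7, "c")],               # shard 2
    [(0.8, "d")],               # shard 3
]
top_two = merge_top_k(shard_hits, 2)
```

Because the hash is deterministic, the same document ID always lands on the same shard, so an update replaces the earlier copy instead of creating a duplicate on another shard.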
  37. 37. Failure Shard 1 Shard 2 Shard 3
  38. 38. Failure Shard 1 Shard 2 Shard 3
  39. 39. Failure Shard 1 Shard 2 Shard 3
  40. 40. CAP Model C A P SolrCloud Solr
  41. 41. Showcase
  42. 42. Faceted Navigation
  43. 43. Showcase
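  44. 44. Algorithm • Index (delta-encoded): tag:java 5 +2; tag:mysql 6 +52 +1; tag:css 1 +30 +27 +2 • Query Result: 7 31 58 59
  45. 45. Algorithm • Index (delta-encoded): tag:java 5 +2; tag:mysql 6 +52 +1; tag:css 1 +30 +27 +2 • Index (decoded): tag:java 5 7; tag:mysql 6 58 59; tag:css 1 31 58 60 • Query Result: 7 31 58 59
  46. 46. Algorithm • Index (delta-encoded): tag:java 5 +2; tag:mysql 6 +52 +1; tag:css 1 +30 +27 +2 • Index (decoded): tag:java 5 7; tag:mysql 6 58 59; tag:css 1 31 58 60 • Query Result: 7 31 58 59 • Facet: tag:java 1; tag:mysql 2; tag:css 2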
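
The facet counts on the algorithm slides can be reproduced by decoding each term's posting list and counting how many of its documents appear in the query result. The sketch below uses the numbers from the slides; the function name `facet_counts` is hypothetical, and Solr's real faceting code has several strategies that are not shown here.

```python
from itertools import accumulate

# Posting lists as shown on the slide: first docID, then gaps.
index = {
    "tag:java": [5, 2],
    "tag:mysql": [6, 52, 1],
    "tag:css": [1, 30, 27, 2],
}
query_result = {7, 31, 58, 59}


def facet_counts(index, result):
    """Count, for every term, the documents that are also in the query result."""
    counts = {}
    for term, deltas in index.items():
        doc_ids = accumulate(deltas)   # decode the gaps into docIDs
        counts[term] = sum(1 for doc_id in doc_ids if doc_id in result)
    return counts


counts = facet_counts(index, query_result)
```

The result is 1 for tag:java, 2 for tag:mysql, and 2 for tag:css, matching the "Facet 1 2 2" row on the slide.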
  47. 47. Showcase
  48. 48. Text Analysis
  49. 49. Analyzer • Char filter → Tokenizer → Filter
  50. 50. Analyzer • Char filter → Tokenizer → Filter • Index time
  51. 51. Analyzer • Index time • input: <strong>There are no pointers in Java!</strong>
  52. 52. Analyzer • Index time • input: <strong>There are no pointers in Java!</strong> • after Char filter: There are no pointers in Java!
  53. 53. Analyzer • Index time • input: <strong>There are no pointers in Java!</strong> • after Char filter: There are no pointers in Java! • after Tokenizer: There | are | no | pointers | in | Java
  54. 54. Analyzer • Index time • input: <strong>There are no pointers in Java!</strong> • after Char filter: There are no pointers in Java! • after Tokenizer: There | are | no | pointers | in | Java • after Filter: ? ? ? pointer ? java
  55. 55. Analyzer • Index time: as on slide 54 • Query time
  56. 56. Analyzer • Index time: as on slide 54 • Query time • input: pointers in Java
  57. 57. Analyzer • Index time: as on slide 54 • Query time • input: pointers in Java • after Char filter: pointers in Java
  58. 58. Analyzer • Index time: as on slide 54 • Query time • input: pointers in Java • after Char filter: pointers in Java • after Tokenizer: pointers | in | Java
  59. 59. Analyzer • Index time: as on slide 54 • Query time • input: pointers in Java • after Char filter: pointers in Java • after Tokenizer: pointers | in | Java • after Filter: pointer ? java
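
The three analysis stages on the slides can be imitated with a few lines of Python. This is a rough sketch under stated assumptions, not Lucene's analyzer classes: the stop-word set, the regular expressions, and the "strip a trailing s" stemming rule are stand-ins chosen only to reproduce the slide's example, and `None` plays the role of the "?" position gaps.

```python
import re

# Assumed stop words, just enough for the slide's example sentence.
STOP_WORDS = {"there", "are", "no", "in"}


def char_filter(text):
    """Strip markup such as <strong>...</strong> before tokenizing."""
    return re.sub(r"<[^>]+>", "", text)


def tokenize(text):
    """Split the text into word tokens, dropping punctuation."""
    return re.findall(r"\w+", text)


def token_filter(tokens):
    """Lowercase, blank out stop words (None = gap), and crudely stem plurals."""
    result = []
    for token in tokens:
        token = token.lower()
        if token in STOP_WORDS:
            result.append(None)
        elif len(token) > 3 and token.endswith("s"):
            result.append(token[:-1])
        else:
            result.append(token)
    return result


def analyze(text):
    """Run the full chain: char filter, then tokenizer, then token filter."""
    return token_filter(tokenize(char_filter(text)))


indexed = analyze("<strong>There are no pointers in Java!</strong>")
queried = analyze("pointers in Java")
```

Running the same chain at index time and at query time is what lets the query "pointers in Java" match the indexed terms pointer and java.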
  60. 60. Showcase
  61. 61. Spell Suggestions
  62. 62. Levenshtein Distance • html vs. htmm: Levenshtein distance = 1 • hlmz vs. html: Levenshtein distance = 2 • Terms: tag:php, tag:jquery, tag:json, tag:java, tag:c#, tag:apache, tag:osx, tag:html
  63. 63. Levenshtein Automaton • html, Levenshtein distance = 1 • [figure: automaton states with transitions over the characters h, t, m, l]
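
The distances quoted on the Levenshtein slide can be checked with the classic dynamic-programming algorithm. Note that this is deliberately not the automaton approach named on the slide: it compares the misspelled word against every term one by one, which is simpler to read but slower on a large term dictionary. The function names `levenshtein` and `suggest` are hypothetical.

```python
def levenshtein(a, b):
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # delete char_a
                current[j - 1] + 1,                    # insert char_b
                previous[j - 1] + (char_a != char_b),  # substitute or match
            ))
        previous = current
    return previous[-1]


def suggest(word, terms, max_distance):
    """Return the terms within max_distance edits of the given word."""
    return [term for term in terms if levenshtein(word, term) <= max_distance]


terms = ["php", "jquery", "json", "java", "c#", "apache", "osx", "html"]
suggestions = suggest("htmm", terms, 1)
```

With the tag list from the slide, the only term within one edit of htmm is html.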
  64. 64. Showcase
  65. 65. Solr is... • an enterprise-level search engine • vertically scalable • horizontally scalable, but... • tunable • poorly documented • backed by an active community
  66. 66. References • http://blog.mikemccandless.com • http://lucene.apache.org/core/4_3_1/index.html • Introduction to Information Retrieval: http://nlp.stanford.edu/IR-book/ • http://wiki.apache.org/solr/ • https://cwiki.apache.org/confluence/display/solr/Apache+Solr+Reference+Guide
  67. 67. Q&A http://twitter.com/AnatolSokolenko http://flip.it/whFqy asokolenko@griddynamics.com
  68. 68. The End
