Cool bonsai cool - an introduction to ElasticSearch

An introduction to the ElasticSearch search engine and how to use it from Perl

  1. 1. “Cool, Bonsai, Cool”: An introduction to ElasticSearch. Clinton Gormley, YAPC::EU 2011
  2. 2. Why do I need a search engine?
  3. 6. Search is how we find stuff
  4. 9. How does a search engine work?
  5. 11. Acme::Magic8Ball Acme::Magic::Pony Config::Magic File::Magic File::MimeInfo::Magic File::MMagic::XS MagicTemplate Meta::File::MMagic MRO::Magic Template::Magic Template::Magic::Pager Test::Magic XS::MagicExt XS::Object::Magic
  6. 12. Magic == inverted index + relevance scoring
  7. 13. Take some text: Acme::Magic8Ball Acme::Magic::Pony Config::Magic File::Magic File::MimeInfo::Magic File::MMagic::XS MagicTemplate Meta::File::MMagic MRO::Magic Template::Magic Template::Magic::Pager Test::Magic XS::MagicExt XS::Object::Magic
  9. 15. Tokenise it: acme magic 8 ball acme magic pony config magic file magic file mime info magic file m magic xs magic template meta file m magic mro magic template magic template magic pager test magic xs magic ext xs object magic
  11. 17. Find unique tokens/terms: 8 acme ball config ext file info m magic meta mime mro object pager pony template test xs
  12. 18. Map terms to documents. Terms: acme file magic mime template xs. Documents: Acme::Magic8Ball Acme::Magic::Pony File::Magic File::MimeInfo::Magic MagicTemplate Template::Magic Template::Magic::Pager XS::Object::Magic XS::MagicExt File::MMagic::XS
  13. 19. Search for: “file xs” (against the same term-to-document map)
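The tokenise, find-unique-terms, map-to-documents and search steps on the slides above can be sketched in a few lines of plain Perl. This is only an illustration of the inverted-index idea, not how Lucene or elasticsearch implement it: the `tokenise` and `search` helpers and the regex-based tokeniser are assumptions made for this example, and the ranking simply counts matching terms.

```perl
use strict;
use warnings;

# The module names shown on the slides
my @docs = qw(
    Acme::Magic8Ball Acme::Magic::Pony Config::Magic File::Magic
    File::MimeInfo::Magic File::MMagic::XS MagicTemplate Meta::File::MMagic
    MRO::Magic Template::Magic Template::Magic::Pager Test::Magic
    XS::MagicExt XS::Object::Magic
);

# Split text into lowercase tokens, e.g. "File::MMagic::XS" -> file m magic xs
# (hypothetical tokeniser: acronyms, capitalised words, lowercase runs, digits)
sub tokenise {
    my ($text) = @_;
    return map { lc $_ } $text =~ /[A-Z]+(?![a-z])|[A-Z][a-z]+|[a-z]+|[0-9]+/g;
}

# Inverted index: term => { document => 1 }
my %index;
for my $doc (@docs) {
    $index{$_}{$doc} = 1 for tokenise($doc);
}

# Look up each query term and count how many terms each document matched
sub search {
    my ($query) = @_;
    my %matches;
    for my $term ( tokenise($query) ) {
        $matches{$_}++ for keys %{ $index{$term} || {} };
    }
    return map { [ $_, $matches{$_} ] }
        sort { $matches{$b} <=> $matches{$a} or $a cmp $b } keys %matches;
}

my @results = search('file xs');
printf "%-25s %d\n", @$_ for @results;
```

Documents that contain both `file` and `xs` sort ahead of documents that contain only one of them, which is the simplest possible form of relevance.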
  15. 21. But, not just about finding
  16. 23. Sort by RELEVANCE
  17. 24. Relevance: How many matching terms does this document contain?
  18. 25. Relevance: How often does each term appear in this document, as a % of its length?
  19. 26. Relevance: How frequently does each term appear in all your documents?
  20. 27. Relevance: Can be customised, by document or field, at index or search time
  23. 30. Simple as: Can be customised, By document or field, At index or search time
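The three relevance questions above (how many terms match, how often a term appears relative to document length, and how common the term is across all documents) can be combined into a toy score. The sketch below is a deliberately simplified TF-IDF-style formula chosen for illustration; it is not Lucene's actual scoring formula, and the mini-corpus and the `score` helper are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical mini-corpus: document name => text
my %docs = (
    a => 'magic magic pony',
    b => 'file magic',
    c => 'file xs magic',
);

# Term counts per document, document length, and documents per term
my ( %tf, %df, %len );
for my $name ( keys %docs ) {
    my @terms = split ' ', $docs{$name};
    $len{$name} = @terms;
    $tf{$name}{$_}++ for @terms;
    $df{$_}++ for keys %{ $tf{$name} };
}

# Sum, over the matching query terms, of
# (share of the document made up by the term) x (rarity of the term)
sub score {
    my ( $name, @query ) = @_;
    my $total_docs = keys %docs;
    my $score      = 0;
    for my $term (@query) {
        my $count = $tf{$name}{$term} or next;    # term not in this document
        my $frequency = $count / $len{$name};
        my $rarity    = log( 1 + $total_docs / $df{$term} );
        $score += $frequency * $rarity;
    }
    return $score;
}

my @query  = qw(file xs);
my %scores = map { ( $_ => score( $_, @query ) ) } keys %docs;
my @ranked = sort { $scores{$b} <=> $scores{$a} }
    grep { $scores{$_} > 0 } keys %scores;

printf "%s %.3f\n", $_, $scores{$_} for @ranked;
```

Document `c` matches both terms, one of which is rare, so it ranks above `b`; document `a` matches nothing and is left out.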
  24. 31. FAST!
  25. 32. POWERFUL!
  26. 33. MAGIC!
  27. 37. www.elasticsearch.org
  28. 38. elasticsearch is: an Open Source (Apache 2), distributed, RESTful search engine built on top of Lucene
  44. 54. Installing elasticsearch: Latest version at: http://www.elasticsearch.org/download/
        wget https://github.com/.../elasticsearch-0.17.6.tar.gz
        tar -xzf elasticsearch-0.17.6.tar.gz
        cd elasticsearch-0.17.6/
        ./bin/elasticsearch
  45. 55. Installing ElasticSearch.pm: Latest version at: https://metacpan.org/module/ElasticSearch
        cpanm ElasticSearch
        perl -de 0
        > use ElasticSearch;
        > $e = ElasticSearch->new( trace_calls => 1 );
        > $e->cluster_health;
  46. 56. Some terminology (Relational DB ⇒ elasticsearch): database ⇒ index; table ⇒ type; row ⇒ document; column ⇒ field; schema ⇒ mapping; index ⇒ everything is indexed; SQL ⇒ query DSL
  54. 64. Clustering: auto-discovery
  56. 66. Clustering: single master, auto-elected
  57. 67. Clustering: immediate failover, master re-election
  58. 68. Clustering: index == 1 or more primary shards + 0 or more replica shards
  61. 71. Clustering: more primary shards ⇒ faster indexing, more scale; more replicas ⇒ faster searching, more failover
  65. 75. Clustering: Big subject... http://www.elasticsearch.org/videos/2011/08/09/road-to-a-distributed-searchengine-berlinbuzzwords.html http://berlinbuzzwords.de/sites/berlinbuzzwords.de/files/elasticsearch-bbuzz2011.pdf
  66. 76. Document oriented:
  67. 77. Document oriented: No ORM required
  68. 78. Document oriented: JSON in ⇔ JSON out
  69. 79. Schema free: dynamic (or strict) mapping
  71. 81. Unknown field? elasticsearch guesses the type and indexes it
  74. 84. Put data in:
        $e->index(
            index => 'twitter',
            type  => 'tweet',
            id    => 1,    # optional; ES always returns the ID
            data  => {
                tweet => "ElasticSearch is cool",
                sent  => "2011-08-16 15:15:00",
                user  => { name => "Clinton", user_id => 123 },
                tags  => [ "search", "perl" ],
            }
        );
  85. 95. Realtime GET
  86. 96. Retrieve your doc immediately
  87. 97. Persistent
  88. 98. No commit required
  89. 99. Get data out:
        $e->get( index => 'twitter', type => 'tweet', id => 1 );
        # returns:
        {
            _index   => 'twitter',
            _type    => 'tweet',
            _id      => 1,
            _version => 1,
            _source  => {
                tweet => "ElasticSearch is cool",
                sent  => "2011-08-16 15:15:00",
                user  => { name => "Clinton", user_id => 123 },
                tags  => [ 'search', 'perl' ],
            }
        }
  93. 103. bulk-indexing and multi-get: avoid HTTP latency, 10x as fast!
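Bulk indexing saves round trips because many documents travel in one HTTP request. As a rough sketch, the body accepted by the REST `_bulk` endpoint is newline-delimited JSON: one action line followed by one source line per document. The `bulk_body` helper below is hypothetical and only builds that body with the core `JSON::PP` module; it does not send anything, and the ElasticSearch.pm client is expected to assemble the request itself.

```perl
use strict;
use warnings;
use JSON::PP;

# canonical() sorts keys so the output is stable between runs
my $json = JSON::PP->new->canonical;

# Hypothetical helper: turn a list of documents into a bulk request body
sub bulk_body {
    my (@docs) = @_;
    my $body = '';
    for my $doc (@docs) {
        my %meta = (
            _index => $doc->{index},
            _type  => $doc->{type},
            _id    => $doc->{id},
        );
        $body .= $json->encode( { index => \%meta } ) . "\n";    # action line
        $body .= $json->encode( $doc->{data} ) . "\n";           # source line
    }
    return $body;
}

my $body = bulk_body(
    { index => 'twitter', type => 'tweet', id => 1,
      data  => { tweet => 'ElasticSearch is cool' } },
    { index => 'twitter', type => 'tweet', id => 2,
      data  => { tweet => 'Bulk indexing is fast' } },
);

print $body;
```

Two documents produce four lines, and the whole body would go to the server in a single request instead of two.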
  97. 107. Versioning
  98. 108. Versioning: “Optimistic concurrency control”
  99. 109. Versioning: “Put if absent”
  100. 110. Versioning Optional
  101. 111. Versioning Can use external version numbers
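The idea behind optimistic concurrency control can be shown with a toy in-memory store: a write that names the version it expects is rejected if the stored version has moved on. This is a minimal sketch of the concept only, not elasticsearch's implementation; the `%store` hash, the `put` helper, and the convention that expected version 0 means "put if absent" are all assumptions made for this example.

```perl
use strict;
use warnings;

my %store;    # id => { version => N, source => {...} }

# Hypothetical helper: write a document, optionally checking the version first
sub put {
    my (%args) = @_;
    my $current         = $store{ $args{id} };
    my $current_version = $current ? $current->{version} : 0;

    # Version check is optional: only enforced when the caller passes one
    if ( defined $args{version} && $args{version} != $current_version ) {
        die "version conflict: expected $args{version}, found $current_version\n";
    }

    $store{ $args{id} } = {
        version => $current_version + 1,
        source  => $args{data},
    };
    return $current_version + 1;
}

my $v1 = put( id => 1, version => 0, data => { tweet => 'first' } );     # put if absent
my $v2 = put( id => 1, version => $v1, data => { tweet => 'second' } );  # update from v1

# A writer still holding version 1 is now out of date and gets rejected
my $ok = eval { put( id => 1, version => $v1, data => { tweet => 'stale' } ); 1 };
print $ok ? "written\n" : "rejected: $@";
```

No locks are taken: conflicting writers find out at write time and can re-read and retry.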
  102. 112. So far, all we have is a NoSQL document store which is fast, reliable, scalable & easy to use
  104. 115. Simple search: $e->search( index => 'twitter', type => 'tweet' );
  105. 116. Simple search: $e->search( index => ['twitter','facebook'], type => ['tweet','post'] );
  106. 117. Simple search: $e->search( ); # all indices, all types
  108. 119. Simple search: $e->search( index => 'twitter', type => 'tweet', query => { text => { _all => 'clinton' } } );
  110. 121. Simple search: $e->search( index => 'twitter', type => 'tweet', queryb => 'clinton' ); # queryb uses ElasticSearch::SearchBuilder, which is like SQL::Abstract
  111. 122. Search results:
        {
            took => 1,    # milliseconds
            hits => {
                total     => 1,    # total results
                max_score => 1,
                hits      => [{
                    _score  => 1,
                    _index  => 'twitter',
                    _type   => 'tweet',
                    _id     => 1,
                    _source => {
                        tweet => "ElasticSearch is cool",
                        sent  => "2011-08-16 15:15:00",
                        user  => { name => "Clinton", user_id => 123 },
                        tags  => [ 'search', 'perl' ],
                    }
                }],
            },
            ... other information ...
        }
  116. 127. JSON doc included in results
  117. 128. No need to fetch from DB
  118. 129. Docs visible to search in near-real time (< 1 second)
  119. 130. refresh_index() to force
  120. 131. What can you do with search?
  121. 132. standard text search
  122. 133. ...with highlighting
  123. 134. stemming: arabic, armenian, basque, brazilian, bulgarian, catalan, chinese, cjk, czech, danish, dutch, english, finnish, french, galician, german, german2, greek, hindi, hungarian, indonesian, italian, kp, light_finish, light_french, light_german, light_hungarian, light_italian, light_portuguese, light_russian, light_spanish, light_swedish, lovins, minimal_english, minimal_french, minimal_german, minimal_portuguese, norwegian, persian, porter, porter2, portuguese, possessive_english, romanian, russian, spanish, swedish, thai, turkish
  125. 136. ngrams & edge-ngrams
  126. 137. auto-complete
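Edge n-grams are what make prefix-style auto-complete cheap: every leading substring of a word is indexed as its own term, so completing what the user has typed becomes a plain term lookup. The sketch below shows the idea in pure Perl; `edge_ngrams`, `complete`, the word list and the 1 to 5 character range are assumptions for illustration, not elasticsearch defaults.

```perl
use strict;
use warnings;

# All prefixes of $word between $min and $max characters long
sub edge_ngrams {
    my ( $word, $min, $max ) = @_;
    $max = length($word) if $max > length($word);
    return map { substr( $word, 0, $_ ) } $min .. $max;
}

# Index every prefix of every word: prefix => [ words ]
my %index;
for my $word (qw(elasticsearch elastic perl pearl)) {
    push @{ $index{$_} }, $word for edge_ngrams( lc $word, 1, 5 );
}

# Auto-complete is now a single hash lookup
sub complete {
    my ($typed) = @_;
    return @{ $index{ lc $typed } || [] };
}

print join( ', ', complete('pe') ),  "\n";
print join( ', ', complete('ela') ), "\n";
```

The trade-off is a larger index in exchange for fast lookups, and anything typed beyond the maximum n-gram length (5 here) finds nothing unless it is trimmed first.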
  127. 138. camelCase
  130. 141. term facets, date histograms
  131. 142. ranges
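Term facets and date histograms are both counts computed over the documents that matched a search. A rough sketch in plain Perl, using hypothetical hits shaped like the tweet document shown earlier: the term facet counts each tag, and the date histogram buckets the `sent` timestamp by month. elasticsearch computes these on the server; this only illustrates what the numbers mean.

```perl
use strict;
use warnings;

# Hypothetical search hits (only the fields needed for the facets)
my @hits = (
    { tags => [ 'search', 'perl' ], sent => '2011-08-16 15:15:00' },
    { tags => ['perl'],             sent => '2011-08-17 09:30:00' },
    { tags => ['search'],           sent => '2011-09-01 12:00:00' },
);

my ( %tag_counts, %per_month );
for my $hit (@hits) {
    # Term facet: how many hits carry each tag
    $tag_counts{$_}++ for @{ $hit->{tags} };

    # Date histogram with a one-month interval: bucket on "YYYY-MM"
    $per_month{ substr( $hit->{sent}, 0, 7 ) }++;
}

print "$_: $tag_counts{$_}\n" for sort keys %tag_counts;
print "$_: $per_month{$_}\n"  for sort keys %per_month;
```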
  132. 143. geo bounding box
  133. 144. geo distance
  134. 145. geo distance ranges
  135. 146. geo polygons
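A geo distance filter keeps only the documents whose location lies within a given radius of a point. The sketch below uses the haversine great-circle formula on a spherical Earth, which is one common way to compute such distances and not necessarily what elasticsearch uses internally. The `distance_km` and `to_radians` helpers are hypothetical, and the coordinates are approximate values chosen for illustration.

```perl
use strict;
use warnings;

my $PI              = 4 * atan2( 1, 1 );
my $EARTH_RADIUS_KM = 6371;                # mean radius, spherical approximation

sub to_radians { return $_[0] * $PI / 180 }

# Haversine great-circle distance between two (lat, lon) points, in km
sub distance_km {
    my ( $lat1, $lon1, $lat2, $lon2 ) = map { to_radians($_) } @_;
    my $dlat = $lat2 - $lat1;
    my $dlon = $lon2 - $lon1;
    my $a    = sin( $dlat / 2 )**2
             + cos($lat1) * cos($lat2) * sin( $dlon / 2 )**2;
    return 2 * $EARTH_RADIUS_KM * atan2( sqrt($a), sqrt( 1 - $a ) );
}

# Approximate coordinates, for illustration only
my %places = (
    Riga    => [ 56.95, 24.11 ],
    Jurmala => [ 56.97, 23.77 ],
    London  => [ 51.51, -0.13 ],
);

# "geo distance" filter: everything within 50 km of Riga
my @centre = @{ $places{Riga} };
my @nearby = sort grep { distance_km( @centre, @{ $places{$_} } ) <= 50 }
    keys %places;

print join( ', ', @nearby ), "\n";
```

A bounding box filter is cheaper still, since it only compares latitude and longitude against the box edges, and distance ranges are the same calculation bucketed into bands.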
  136. 149. “Terms of endearment”: The ElasticSearch query language explained. Thurs. 14:35 - Auditorija 301
