“Cool, Bonsai, Cool”: An introduction to ElasticSearch. Clinton Gormley, YAPC::EU 2011
Cool bonsai cool - an introduction to ElasticSearch

An introduction to the ElasticSearch search engine and how to use it from Perl


  1. 1. “Cool, Bonsai, Cool”: An introduction to ElasticSearch. Clinton Gormley, YAPC::EU 2011
  2. 2. Why do I need a search engine?
  3. 6. Search is how we find stuff
  4. 9. How does a search engine work?
  5. 11. Acme::Magic8Ball Acme::Magic::Pony Config::Magic File::Magic File::MimeInfo::Magic File::MMagic::XS MagicTemplate Meta::File::MMagic MRO::Magic Template::Magic Template::Magic::Pager Test::Magic XS::MagicExt XS::Object::Magic
  6. 12. Magic == inverted index + relevance scoring
  7. 13. Acme::Magic8Ball Acme::Magic::Pony Config::Magic File::Magic File::MimeInfo::Magic File::MMagic::XS MagicTemplate Meta::File::MMagic MRO::Magic Template::Magic Template::Magic::Pager Test::Magic XS::MagicExt XS::Object::Magic Take some text
  8. 14. Acme::Magic8Ball Acme::Magic::Pony Config::Magic File::Magic File::MimeInfo::Magic File::MMagic::XS MagicTemplate Meta::File::MMagic MRO::Magic Template::Magic Template::Magic::Pager Test::Magic XS::MagicExt XS::Object::Magic Tokenise it
  9. 15. acme magic 8 ball acme magic pony config magic file magic file mime info magic file m magic xs magic template meta file m magic mro magic template magic template magic pager test magic xs magic ext xs object magic Tokenise it
  10. 16. acme magic 8 ball acme magic pony config magic file magic file mime info magic file m magic xs magic template meta file m magic mro magic template magic template magic pager test magic xs magic ext xs object magic Find unique tokens/terms
  11. 17. Find unique tokens/terms: 8 acme ball config ext file info m magic meta mime mro object pager pony template test xs
  12. 18. Map terms to documents. Terms: acme, file, magic, mime, template, xs. Documents: Acme::Magic8Ball, Acme::Magic::Pony, File::Magic, File::MimeInfo::Magic, MagicTemplate, Template::Magic, Template::Magic::Pager, XS::Object::Magic, XS::MagicExt, File::MMagic::XS
  13. 19. Search for: “file xs”. Terms: acme, file, magic, mime, template, xs. Documents: Acme::Magic8Ball, Acme::Magic::Pony, File::Magic, File::MimeInfo::Magic, MagicTemplate, Template::Magic, Template::Magic::Pager, XS::Object::Magic, XS::MagicExt, File::MMagic::XS
  14. 20. Search for: “file xs”. Terms: acme, file, magic, mime, template, xs. Documents: Acme::Magic8Ball, Acme::Magic::Pony, File::Magic, File::MimeInfo::Magic, MagicTemplate, Template::Magic, Template::Magic::Pager, XS::Object::Magic, XS::MagicExt, File::MMagic::XS
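
The tokenise, find-unique-terms and map-terms-to-documents steps on these slides can be sketched in a few lines of Perl. This is only an illustration under stated assumptions: `tokenise`, `build_index` and `search` are hypothetical names, the splitting rules are a guess that happens to reproduce the tokens shown on the slides, and the search requires every query term to match (an AND), which the slides do not specify.

```perl
use strict;
use warnings;

# Module names from the slides stand in for "documents".
my @docs = qw(
    Acme::Magic8Ball Acme::Magic::Pony Config::Magic File::Magic
    File::MimeInfo::Magic File::MMagic::XS MagicTemplate Meta::File::MMagic
    MRO::Magic Template::Magic Template::Magic::Pager Test::Magic
    XS::MagicExt XS::Object::Magic
);

# Assumed tokeniser: split on "::", camelCase boundaries and digit runs,
# then lowercase each token.
sub tokenise {
    my ($text) = @_;
    return map { lc } $text =~ /[A-Z]?[a-z]+|[A-Z]+(?![a-z])|\d+/g;
}

# Inverted index: term => { document => 1, ... }
sub build_index {
    my (@documents) = @_;
    my %index;
    for my $doc (@documents) {
        $index{$_}{$doc} = 1 for tokenise($doc);
    }
    return \%index;
}

# AND search: keep only documents that contain every query term.
sub search {
    my ( $index, $query ) = @_;
    my @terms = tokenise($query);
    return () unless @terms;
    my %count;
    for my $term (@terms) {
        $count{$_}++ for keys %{ $index->{$term} || {} };
    }
    my @hits = grep { $count{$_} == @terms } keys %count;
    return sort @hits;
}

my $index = build_index(@docs);
print join( ' ', sort keys %$index ), "\n";    # the unique terms
print "$_\n" for search( $index, 'file xs' );
```

With the module names above, the only document containing both `file` and `xs` is `File::MMagic::XS`.
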
  15. 21. But, not just about finding
  16. 23. Sort by RELEVANCE
  17. 24. Relevance: How many matching terms does this document contain?
  18. 25. Relevance: How often does each term appear in this document, as a % of its length?
  19. 26. Relevance: How frequently does each term appear in all your documents ?
  20. 27. Relevance: Can be customised
  21. 28. Relevance: Can be customised By document or field
  22. 29. Relevance: Can be customised By document or field At index or search time
  23. 30. Simple as: Can be customised, By document or field, At index or search time
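
The three relevance questions on the slides (how many query terms match, how often each term appears relative to the document's length, and how common each term is across all documents) can be combined into a toy score. This is a simplified sketch, not the formula Lucene or elasticsearch actually uses; the `score` sub, the weighting and the four-document corpus are all assumptions made for illustration.

```perl
use strict;
use warnings;

# Hypothetical toy corpus: document id => list of already-tokenised terms.
my %docs = (
    doc1 => [qw(file magic)],
    doc2 => [qw(file mime info magic)],
    doc3 => [qw(file m magic xs)],
    doc4 => [qw(xs object magic)],
);

sub score {
    my ( $docs, $doc_id, @query ) = @_;
    return 0 unless @query;
    my @tokens = @{ $docs->{$doc_id} };
    my $n_docs = keys %$docs;
    my ( $score, $matched ) = ( 0, 0 );
    for my $term (@query) {
        # Term frequency: occurrences in this document, relative to its length.
        my $count = grep { $_ eq $term } @tokens;
        next unless $count;
        $matched++;
        # Document frequency: how many documents contain the term at all.
        my $df = grep { my $id = $_; grep { $_ eq $term } @{ $docs->{$id} } } keys %$docs;
        # Rare terms (low document frequency) contribute more.
        $score += ( $count / @tokens ) * log( 1 + $n_docs / $df );
    }
    # Reward documents that match more of the query terms.
    return $score * $matched / @query;
}

my @query = qw(file xs);
my %score_for;
$score_for{$_} = score( \%docs, $_, @query ) for keys %docs;
my @ranked = sort { $score_for{$b} <=> $score_for{$a} } keys %score_for;
printf "%-5s %.3f\n", $_, $score_for{$_} for @ranked;
```

In this toy corpus `doc3` ranks first for “file xs” because it is the only document that matches both terms.
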
  24. 31. FAST!
  25. 32. POWERFUL!
  26. 33. MAGIC!
  27. 37. www.elasticsearch.org
  28. 38. elasticsearch is: an Open Source (Apache 2), distributed, RESTful search engine built on top of Lucene
  44. 54. Installing elasticsearch: Latest version at: http://www.elasticsearch.org/download/

          wget https://github.com/.../elasticsearch-0.17.6.tar.gz
          tar -xzf elasticsearch-0.17.6.tar.gz
          cd elasticsearch-0.17.6/
          ./bin/elasticsearch

  45. 55. Installing ElasticSearch.pm: Latest version at: https://metacpan.org/module/ElasticSearch

          cpanm ElasticSearch
          perl -de 0
          > use ElasticSearch;
          > $e = ElasticSearch->new( trace_calls => 1 )
          > $e->cluster_health

  53. 63. Some terminology:

          Relational DB   ⇒   elasticsearch
          database        ⇒   index
          table           ⇒   type
          row             ⇒   document
          column          ⇒   field
          schema          ⇒   mapping
          index           ⇒   everything is indexed
          SQL             ⇒   query DSL

  54. 64. Clustering
  55. 65. Clustering auto-discovery
  56. 66. Clustering single master auto-elected
  57. 67. Clustering immediate failover master re-election
  58. 68. Clustering index ==
  59. 69. Clustering index == 1 or more primary shards
  60. 70. Clustering index == 1 or more primary shards + 0 or more replica shards
  61. 71. Clustering more primary shards
  62. 72. Clustering more primary shards ⇒ faster indexing, more scale
  63. 73. Clustering more primary shards ⇒ faster indexing, more scale; more replicas
  64. 74. Clustering more primary shards ⇒ faster indexing, more scale; more replicas ⇒ faster searching, more failover
  65. 75. Clustering Big subject... http://www.elasticsearch.org/videos/2011/08/09/road-to-a-distributed-searchengine-berlinbuzzwords.html http://berlinbuzzwords.de/sites/berlinbuzzwords.de/files/elasticsearch-bbuzz2011.pdf
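
One way to picture why more primary shards spread out the indexing load: each document id is hashed to pick the primary shard that stores it. The sketch below is an illustration only. `toy_hash` and `shard_for` are hypothetical names, and the hash is a stand-in, not the function elasticsearch uses.

```perl
use strict;
use warnings;

# Hypothetical stand-in hash; elasticsearch uses its own hash function.
sub toy_hash {
    my ($key) = @_;
    my $h = 0;
    $h = ( $h * 31 + ord $_ ) % 2**31 for split //, $key;
    return $h;
}

# Pick a primary shard for a document id.
sub shard_for {
    my ( $doc_id, $primary_shards ) = @_;
    return toy_hash($doc_id) % $primary_shards;
}

# Spread 1000 document ids over 3 primary shards and count each shard's share.
my %per_shard;
$per_shard{ shard_for( $_, 3 ) }++ for 1 .. 1000;
printf "shard %d holds %d docs\n", $_, $per_shard{$_} for sort keys %per_shard;
```

Because the shard is derived from the id alone, any node can work out where a document lives without a lookup table.
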
  66. 76. Document oriented:
  67. 77. Document oriented: No ORM required
  68. 78. Document oriented: JSON in ⇔ JSON out
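
“JSON in ⇔ JSON out” means a plain Perl data structure can be serialised, sent, and read back without a mapping layer. A minimal sketch of that round trip using the core `JSON::PP` module follows; it talks to no server, and the document is the tweet used in the later slides.

```perl
use strict;
use warnings;
use JSON::PP;

# canonical(1) sorts keys so the output is stable between runs.
my $json = JSON::PP->new->canonical(1);

my $doc = {
    tweet => 'ElasticSearch is cool',
    sent  => '2011-08-16 15:15:00',
    user  => { name => 'Clinton', user_id => 123 },
    tags  => [ 'search', 'perl' ],
};

my $wire = $json->encode($doc);     # what would go over HTTP
my $back = $json->decode($wire);    # what comes back out

print "$wire\n";
print "round trip ok\n"
    if $back->{user}{name} eq 'Clinton' && $back->{tags}[1] eq 'perl';
```
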
  69. 79. Schema free Dynamic mapping
  70. 80. Schema free Dynamic (or strict) mapping
  71. 81. Unknown field?
  72. 82. elasticsearch guesses the type
  73. 83. elasticsearch guesses the type and indexes it
  84. 94. Put data in:

          $e->index(
              index => 'twitter',
              type  => 'tweet',
              id    => 1,    # optional; ES always returns the ID
              data  => {
                  tweet => "ElasticSearch is cool",
                  sent  => "2011-08-16 15:15:00",
                  user  => { name => "Clinton", user_id => 123 },
                  tags  => [ "search", "perl" ],
              },
          );

  85. 95. Realtime GET
  86. 96. Retrieve your doc immediately
  87. 97. Persistent
  88. 98. No commit required
  92. 102. Get data out:

          $e->get( index => 'twitter', type => 'tweet', id => 1 );

          # returns:
          {
              _index   => 'twitter',
              _type    => 'tweet',
              _id      => 1,
              _version => 1,
              _source  => {
                  tweet => "ElasticSearch is cool",
                  sent  => "2011-08-16 15:15:00",
                  user  => { name => "Clinton", user_id => 123 },
                  tags  => [ 'search', 'perl' ],
              },
          }

  93. 103. bulk-indexing
  94. 104. bulk-indexing multi-get
  95. 105. bulk-indexing multi-get avoids http latency
  96. 106. bulk-indexing multi-get avoids http latency 10x as fast!
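
Bulk indexing avoids HTTP latency by packing many operations into one request. The sketch below builds the newline-delimited body that elasticsearch's `_bulk` endpoint accepts: one action line followed by one source line per document. It only constructs the string and sends nothing; `bulk_body` is a hypothetical helper, not part of ElasticSearch.pm, and the second tweet is made up for the example.

```perl
use strict;
use warnings;
use JSON::PP;

# Build a bulk request body: an action line, then a source line, per document.
sub bulk_body {
    my ( $index, $type, @docs ) = @_;
    my $json = JSON::PP->new->canonical(1);
    my $body = '';
    for my $doc (@docs) {
        my %source = %$doc;                 # copy, so the caller's hash is untouched
        my $id     = delete $source{id};    # the id belongs in the action line
        my %meta   = ( _index => $index, _type => $type );
        $meta{_id} = $id if defined $id;
        $body .= $json->encode( { index => \%meta } ) . "\n";
        $body .= $json->encode( \%source ) . "\n";
    }
    return $body;
}

print bulk_body(
    'twitter', 'tweet',
    { id => 1, tweet => 'ElasticSearch is cool' },
    { id => 2, tweet => 'Bulk requests save round trips' },
);
```

Two documents cost one HTTP round trip here instead of two, which is where the speed-up on the slide comes from.
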
  97. 107. Versioning
  98. 108. Versioning “Optimistic concurrency control”
  99. 109. Versioning “Put if absent”
  100. 110. Versioning Optional
  101. 111. Versioning Can use external version numbers
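
Optimistic concurrency control and “put if absent” both come down to comparing an expected version with the stored one before writing. The in-memory sketch below shows the idea. It is not the elasticsearch API: `put_doc`, the `%store` hash and the convention that an expected version of 0 means “must not exist yet” are assumptions made for this example. In elasticsearch the check happens on the server.

```perl
use strict;
use warnings;

my %store;    # id => { version => N, source => {...} }

# Write a document. If $expected is given, the write succeeds only when it
# matches the stored version; an expected version of 0 means "put if absent".
sub put_doc {
    my ( $id, $source, $expected ) = @_;
    my $current = $store{$id} ? $store{$id}{version} : 0;
    die "version conflict: expected $expected, found $current\n"
        if defined $expected && $expected != $current;
    $store{$id} = { version => $current + 1, source => $source };
    return $current + 1;
}

my $v1 = put_doc( 1, { tweet => 'first' },  0 );      # create only if absent
my $v2 = put_doc( 1, { tweet => 'second' }, $v1 );    # update against version 1
my $ok = eval { put_doc( 1, { tweet => 'stale' }, $v1 ); 1 };    # stale version
print "stale write rejected: $@" unless $ok;
```

Leaving `$expected` undefined skips the check entirely, which mirrors the slide's point that versioning is optional.
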
  102. 112. So far, all we have is a NoSQL document store which is fast, reliable, scalable & easy to use
  104. 115. Simple search: $e->search( index => 'twitter', type => 'tweet' );
  105. 116. Simple search: $e->search( index => ['twitter','facebook'], type => ['tweet','post'] );
  106. 117. Simple search: $e->search();  # all indices, all types
  107. 118. Simple search: $e->search( index => 'twitter', type => 'tweet', query => { } );
  108. 119. Simple search: $e->search( index => 'twitter', type => 'tweet', query => { text => { _all => 'clinton' } } );
  109. 120. Simple search: $e->search( index => 'twitter', type => 'tweet', queryb => 'clinton' );
  110. 121. Simple search: $e->search( index => 'twitter', type => 'tweet', queryb => 'clinton' );  # ElasticSearch::SearchBuilder, like SQL::Abstract
  111. 122. Search results:

           {
               took => 1,    # milliseconds
               hits => {
                   total     => 1,    # total results
                   max_score => 1,
                   hits      => [
                       {
                           _score  => 1,
                           _index  => 'twitter',
                           _type   => 'tweet',
                           _id     => 1,
                           _source => {
                               tweet => "ElasticSearch is cool",
                               sent  => "2011-08-16 15:15:00",
                               user  => { name => "Clinton", user_id => 123 },
                               tags  => [ 'search', 'perl' ],
                           },
                       },
                   ],
               },
               # ... other information ...
           }

  116. 127. JSON doc included in results
  117. 128. No need to fetch from DB
  118. 129. Docs visible to search in near-real time (< 1 second)
  119. 130. refresh_index() to force
  120. 131. What can you do with search?
  121. 132. standard text search
  122. 133. ...with highlighting
  123. 134. stemming
  124. 135. stemming: arabic, armenian, basque, brazilian, bulgarian, catalan, chinese, cjk, czech, danish, dutch, english, finnish, french, galician, german, german2, greek, hindi, hungarian, indonesian, italian, kp, light_finish, light_french, light_german, light_hungarian, light_italian, light_portuguese, light_russian, light_spanish, light_swedish, lovins, minimal_english, minimal_french, minimal_german, minimal_portuguese, norwegian, persian, porter, porter2, portuguese, possessive_english, romanian, russian, spanish, swedish, thai, turkish
  125. 136. ngrams & edge-ngrams
  126. 137. auto-complete
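
N-grams and edge n-grams are what make partial-word matching and auto-complete possible: every term is indexed under its fragments, so a typed prefix is just another term lookup. The sketch below is a toy illustration; `ngrams`, `edge_ngrams`, the word list and the length limits are hypothetical choices, not elasticsearch settings.

```perl
use strict;
use warnings;

# All substrings of length $n.
sub ngrams {
    my ( $term, $n ) = @_;
    return map { substr( $term, $_, $n ) } 0 .. length($term) - $n;
}

# Prefixes of length $min .. $max (edge n-grams).
sub edge_ngrams {
    my ( $term, $min, $max ) = @_;
    my $len = length $term;
    $max = $len if $max > $len;
    return map { substr( $term, 0, $_ ) } $min .. $max;
}

# Auto-complete: index every word under each of its prefixes.
my %suggest;
for my $word (qw(elastic elephant search)) {
    push @{ $suggest{$_} }, $word for edge_ngrams( $word, 1, 5 );
}

print join( ', ', ngrams( 'magic', 3 ) ), "\n";          # prints "mag, agi, gic"
print join( ', ', @{ $suggest{el} || [] } ), "\n";       # prints "elastic, elephant"
```
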
  127. 138. camelCase
  130. 141. term facets, date histograms
  131. 142. ranges
  132. 143. geo bounding box
  133. 144. geo distance
  134. 145. geo distance ranges
  135. 146. geo polygons
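
The geo filters listed above reduce to simple geometry on latitude/longitude pairs. The sketch below shows a bounding-box test and a great-circle distance using the haversine formula. It is an illustration only: the sub names, the sample points and the 200 km radius are hypothetical, and elasticsearch's own distance calculation may differ.

```perl
use strict;
use warnings;

my $PI              = 4 * atan2( 1, 1 );
my $EARTH_RADIUS_KM = 6371;    # mean Earth radius; an approximation

sub deg2rad { return $_[0] * $PI / 180 }

# Great-circle distance between two lat/lon points (haversine formula).
sub distance_km {
    my ( $lat1, $lon1, $lat2, $lon2 ) = @_;
    my $dlat = deg2rad( $lat2 - $lat1 );
    my $dlon = deg2rad( $lon2 - $lon1 );
    my $h = sin( $dlat / 2 )**2
        + cos( deg2rad($lat1) ) * cos( deg2rad($lat2) ) * sin( $dlon / 2 )**2;
    $h = 1 if $h > 1;    # guard against floating-point overshoot
    return 2 * $EARTH_RADIUS_KM * atan2( sqrt($h), sqrt( 1 - $h ) );
}

# True if the point lies inside the box given by its top-left and bottom-right corners.
sub in_bounding_box {
    my ( $lat, $lon, $top, $left, $bottom, $right ) = @_;
    return $lat <= $top && $lat >= $bottom && $lon >= $left && $lon <= $right;
}

# Hypothetical points: name => [ lat, lon ]
my %places = ( a => [ 0, 0 ], b => [ 0, 1 ], c => [ 10, 10 ] );

# "geo distance": everything within 200 km of (0, 0)
my @near = grep { distance_km( 0, 0, @{ $places{$_} } ) <= 200 } keys %places;
@near = sort @near;
print "@near\n";    # prints "a b"
```

A distance-range facet would bucket the same distances (for example 0 to 100 km, 100 to 500 km) instead of filtering on a single radius.
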
  136. 149. “Terms of endearment”: The ElasticSearch query language explained. Thurs. 14:35, Auditorija 301
