Cool bonsai cool - an introduction to ElasticSearch

An introduction to the ElasticSearch search engine and how to use it from Perl

Transcript

  • 1. “Cool, Bonsai, Cool”: An introduction to ElasticSearch. Clinton Gormley, YAPC::EU 2011
  • 2. Why do I need a search engine?
  • 6. Search is how we find stuff
  • 9. How does a search engine work?
  • 11. Acme::Magic8Ball Acme::Magic::Pony Config::Magic File::Magic File::MimeInfo::Magic File::MMagic::XS MagicTemplate Meta::File::MMagic MRO::Magic Template::Magic Template::Magic::Pager Test::Magic XS::MagicExt XS::Object::Magic
  • 12. Magic == inverted index + relevance scoring
  • 13. Take some text: Acme::Magic8Ball Acme::Magic::Pony Config::Magic File::Magic File::MimeInfo::Magic File::MMagic::XS MagicTemplate Meta::File::MMagic MRO::Magic Template::Magic Template::Magic::Pager Test::Magic XS::MagicExt XS::Object::Magic
  • 14-15. Tokenise it: acme magic 8 ball acme magic pony config magic file magic file mime info magic file m magic xs magic template meta file m magic mro magic template magic template magic pager test magic xs magic ext xs object magic
  • 16-17. Find unique tokens/terms: 8 acme ball config ext file info m magic meta mime mro object pager pony template test xs
  • 18. Map terms to documents. Terms: acme, file, magic, mime, template, xs. Documents: Acme::Magic8Ball, Acme::Magic::Pony, File::Magic, File::MimeInfo::Magic, MagicTemplate, Template::Magic, Template::Magic::Pager, XS::Object::Magic, XS::MagicExt, File::MMagic::XS
  • 19-20. Search for: “file xs” (against the same terms and documents)
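
The steps on slides 13-20 can be sketched in plain Perl. This is only an illustration: the `tokenise` and `search` helpers are hypothetical, the splitting rules are a guess chosen to reproduce the tokens shown on the slides, and requiring all query terms to match is an assumption. It is not how ElasticSearch or Lucene is implemented.

```perl
use strict;
use warnings;

# Module names taken from the slides above.
my @docs = qw(
    Acme::Magic8Ball Acme::Magic::Pony Config::Magic File::Magic
    File::MimeInfo::Magic File::MMagic::XS MagicTemplate Meta::File::MMagic
    MRO::Magic Template::Magic Template::Magic::Pager Test::Magic
    XS::MagicExt XS::Object::Magic
);

# Hypothetical tokeniser: split on '::', on case changes and between
# letters and digits, then lowercase everything.
sub tokenise {
    my ($text) = @_;
    $text =~ s/::/ /g;
    $text =~ s/([a-z])([A-Z])/$1 $2/g;         # MimeInfo -> Mime Info
    $text =~ s/([A-Z])([A-Z][a-z])/$1 $2/g;    # MMagic   -> M Magic
    $text =~ s/([A-Za-z])([0-9])/$1 $2/g;      # Magic8   -> Magic 8
    $text =~ s/([0-9])([A-Za-z])/$1 $2/g;      # 8Ball    -> 8 Ball
    return map { lc } split ' ', $text;
}

# The inverted index: term => { document => 1, ... }
my %index;
for my $doc (@docs) {
    $index{$_}{$doc} = 1 for tokenise($doc);
}

# Return the documents that contain every term of the query
# (AND semantics, an assumption made for this sketch).
sub search {
    my ($query) = @_;
    my @terms = tokenise($query);
    my %hits;
    for my $term (@terms) {
        $hits{$_}++ for keys %{ $index{$term} || {} };
    }
    return grep { $hits{$_} == @terms } sort keys %hits;
}

print join( "\n", search('file xs') ), "\n";
```

With the module list above, the only document containing both `file` and `xs` is File::MMagic::XS, so that is what the sketch prints.
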
  • 21. But, not just about finding
  • 23. Sort by RELEVANCE
  • 24. Relevance: How many matching terms does this document contain?
  • 25. Relevance: How often does each term appear in this document, as a % of its length?
  • 26. Relevance: How frequently does each term appear in all your documents?
  • 27. Relevance: Can be customised
  • 28. Relevance: Can be customised By document or field
  • 29. Relevance: Can be customised By document or field At index or search time
  • 30. Simple as: Can be customised, By document or field, At index or search time
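
The three questions on slides 24-26 can be combined into a single number. The sketch below is a simplified TF-IDF-style score over a made-up three-document corpus. The formula, the `score` helper and the data are assumptions for illustration only, and the real scoring used by ElasticSearch is more involved.

```perl
use strict;
use warnings;

# Toy corpus (hypothetical): document id => list of tokens.
my %docs = (
    1 => [qw(file mime info magic)],
    2 => [qw(file m magic xs)],
    3 => [qw(xs object magic)],
);

# Document frequency: in how many documents does each term appear?
my %df;
for my $tokens ( values %docs ) {
    my %seen;
    $df{$_}++ for grep { !$seen{$_}++ } @$tokens;
}

sub score {
    my ( $doc_id, @query ) = @_;
    my $tokens = $docs{$doc_id};
    my $n_docs = keys %docs;
    my $score  = 0;
    for my $term (@query) {
        my $count = grep { $_ eq $term } @$tokens;
        next unless $count;
        my $tf  = $count / @$tokens;                  # share of the document's length
        my $idf = 1 + log( $n_docs / $df{$term} );    # rarer terms weigh more
        $score += $tf * $idf;                         # every matching term adds to the score
    }
    return $score;
}

my @query = qw(file xs);
my %score;
$score{$_} = score( $_, @query ) for keys %docs;

printf "doc %s: %.3f\n", $_, $score{$_}
    for sort { $score{$b} <=> $score{$a} } keys %score;
```

Document 2 matches both query terms and so ranks first; document 3 beats document 1 because its single match makes up a larger share of a shorter document.
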
  • 31. FAST!
  • 32. POWERFUL!
  • 33. MAGIC!
  • 37. www.elasticsearch.org
  • 38-44. elasticsearch is:
    • an Open Source (Apache 2)
    • distributed
    • RESTful
  • 54. Installing elasticsearch (latest version at http://www.elasticsearch.org/download/):
        wget https://github.com/.../elasticsearch-0.17.6.tar.gz
        tar -xzf elasticsearch-0.17.6.tar.gz
        cd elasticsearch-0.17.6/
        ./bin/elasticsearch
  • 55. Installing ElasticSearch.pm (latest version at https://metacpan.org/module/ElasticSearch):
        cpanm ElasticSearch
        perl -de 0
        > use ElasticSearch;
        > $e = ElasticSearch->new( trace_calls => 1 )
        > $e->cluster_health
  • 56-63. Some terminology (Relational DB ⇒ elasticsearch):
    • database ⇒ index
    • table ⇒ type
    • row ⇒ document
    • column ⇒ field
    • schema ⇒ mapping
    • index ⇒ everything is indexed
    • SQL ⇒ query DSL
  • 64-74. Clustering:
    • auto-discovery
    • single master, auto-elected
    • immediate failover, master re-election
    • index == 1 or more primary shards + 0 or more replica shards
    • more primary shards ⇒ faster indexing, more scale
    • more replicas ⇒ faster searching, more failover
  • 75. Clustering: Big subject... http://www.elasticsearch.org/videos/2011/08/09/road-to-a-distributed-searchengine-berlinbuzzwords.html http://berlinbuzzwords.de/sites/berlinbuzzwords.de/files/elasticsearch-bbuzz2011.pdf
  • 76. Document oriented:
  • 77. Document oriented: No ORM required
  • 78. Document oriented: JSON in ⇔ JSON out
  • 79. Schema free Dynamic mapping
  • 80. Schema free Dynamic (or strict) mapping
  • 81. Unknown field?
  • 82. elasticsearch guesses the type
  • 83. elasticsearch guesses the type and indexes it
  • 84-94. Put data in:
        $e->index(
            index => 'twitter',
            type  => 'tweet',
            id    => 1,    # optional; ES always returns the ID
            data  => {
                tweet => "ElasticSearch is cool",
                sent  => "2011-08-16 15:15:00",
                user  => { name => "Clinton", user_id => 123 },
                tags  => [ "search", "perl" ],
            }
        );
  • 95. Realtime GET
  • 96. Retrieve your doc immediately
  • 97. Persistent
  • 98. No commit required
  • 99-102. Get data out:
        $e->get( index => 'twitter', type => 'tweet', id => 1 );
        # returns:
        {
            _index   => 'twitter',
            _type    => 'tweet',
            _id      => 1,
            _version => 1,
            _source  => {
                tweet => "ElasticSearch is cool",
                sent  => "2011-08-16 15:15:00",
                user  => { name => "Clinton", user_id => 123 },
                tags  => [ 'search', 'perl' ],
            }
        }
  • 103. bulk-indexing
  • 104. bulk-indexing multi-get
  • 105. bulk-indexing multi-get avoids http latency
  • 106. bulk-indexing multi-get avoids http latency 10x as fast!
  • 107. Versioning
  • 108. Versioning “Optimistic concurrency control”
  • 109. Versioning “Put if absent”
  • 110. Versioning Optional
  • 111. Versioning Can use external version numbers
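
The idea behind optimistic concurrency control can be sketched with an in-memory hash standing in for the index. Everything here is hypothetical: `put_doc` is not part of ElasticSearch.pm, and using version 0 to mean "no document yet" is a convention of this toy only. It shows the principle, not the actual API.

```perl
use strict;
use warnings;

my %store;    # id => { version => N, source => {...} }

# Hypothetical helper: write a document. If a version is passed, the write
# only succeeds when it matches the stored version (0 = "not stored yet").
# Without a version the check is skipped, so versioning stays optional.
sub put_doc {
    my (%args) = @_;
    my $current = $store{ $args{id} } ? $store{ $args{id} }{version} : 0;
    if ( defined $args{version} && $args{version} != $current ) {
        die "version conflict: expected $args{version}, found $current\n";
    }
    $store{ $args{id} } = { version => $current + 1, source => $args{data} };
    return $current + 1;
}

# "Put if absent": only succeeds because nothing is stored under id 1 yet.
my $v = put_doc( id => 1, version => 0, data => { tweet => 'ElasticSearch is cool' } );

# Two writers both read version 1; only the first update is applied.
put_doc( id => 1, version => $v, data => { tweet => 'edit A' } );
my $ok = eval { put_doc( id => 1, version => $v, data => { tweet => 'edit B' } ); 1 };

print $ok ? "second write applied\n" : "second write rejected: $@";
```

The second writer gets an error instead of silently overwriting the first writer's change, and can then re-read the document and retry.
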
  • 112. So far, all we have is a NoSQL document store which is fast, reliable, scalable & easy to use
  • 115. Simple search:
        $e->search( index => 'twitter', type => 'tweet' );
  • 116. Simple search:
        $e->search( index => [ 'twitter', 'facebook' ], type => [ 'tweet', 'post' ] );
  • 117. Simple search:
        $e->search(
            # all indices
            # all types
        );
  • 118-119. Simple search:
        $e->search(
            index => 'twitter',
            type  => 'tweet',
            query => { text => { _all => 'clinton' } }
        );
  • 120-121. Simple search:
        $e->search(
            index  => 'twitter',
            type   => 'tweet',
            queryb => 'clinton'    # ElasticSearch::SearchBuilder, like SQL::Abstract
        );
  • 122-126. Search results:
        {
            took => 1,    # milliseconds
            hits => {
                total     => 1,    # total results
                max_score => 1,
                hits      => [{
                    _score  => 1,
                    _index  => 'twitter',
                    _type   => 'tweet',
                    _id     => 1,
                    _source => {
                        tweet => "ElasticSearch is cool",
                        sent  => "2011-08-16 15:15:00",
                        user  => { name => "Clinton", user_id => 123 },
                        tags  => [ 'search', 'perl' ],
                    }
                }],
            },
            ... other information ...
        }
  • 127. JSON doc included in results
  • 128. No need to fetch from DB
  • 129. Docs visible to search in near-real time (< 1 second)
  • 130. refresh_index() to force
  • 131. What can you do with search?
  • 132. standard text search
  • 133. ...with highlighting
  • 134. stemming
  • 135. stemming: arabic, armenian, basque, brazilian, bulgarian, catalan, chinese, cjk, czech, danish, dutch, english, finnish, french, galician, german, german2, greek, hindi, hungarian, indonesian, italian, kp, light_finish, light_french, light_german, light_hungarian, light_italian, light_portuguese, light_russian, light_spanish, light_swedish, lovins, minimal_english, minimal_french, minimal_german, minimal_portuguese, norwegian, persian, porter, porter2, portuguese, possessive_english, romanian, russian, spanish, swedish, thai, turkish
  • 136. ngrams & edge-ngrams
  • 137. auto-complete
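
To make the link between ngrams, edge-ngrams and auto-complete concrete, here is a minimal sketch in plain Perl. The helper names, the word list and the length limits are assumptions for illustration. In ElasticSearch this work is done by analyzers configured in the mapping, not by code like this.

```perl
use strict;
use warnings;

# All substrings of the word with a length between $min and $max.
sub ngrams {
    my ( $word, $min, $max ) = @_;
    my @grams;
    for my $len ( $min .. $max ) {
        for my $start ( 0 .. length($word) - $len ) {
            push @grams, substr( $word, $start, $len );
        }
    }
    return @grams;
}

# Only the substrings anchored at the start of the word.
sub edge_ngrams {
    my ( $word, $min, $max ) = @_;
    $max = length($word) if $max > length($word);
    return map { substr( $word, 0, $_ ) } $min .. $max;
}

# Auto-complete: index every prefix, then look up what the user has typed.
my %complete;    # prefix => { word => 1, ... }
for my $word (qw(elastic elasticsearch election perl)) {
    $complete{$_}{$word} = 1 for edge_ngrams( $word, 1, 10 );
}

print join( ', ', sort keys %{ $complete{el} || {} } ), "\n";
```

Typing `el` finds elastic, elasticsearch and election with a single hash lookup, because the prefixes were computed once at index time rather than at search time.
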
  • 138. camelCase
  • 141. term facets, date histograms
  • 142. ranges
  • 143. geo bounding box
  • 144. geo distance
  • 145. geo distance ranges
  • 146. geo polygons
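
As a rough illustration of what a geo distance filter computes, the sketch below uses the haversine formula to keep only the places within a given radius of a point. The helper names, the approximate coordinates and the 500 km radius are assumptions, and ElasticSearch's own distance calculation may differ.

```perl
use strict;
use warnings;

use constant EARTH_RADIUS_KM => 6371;    # mean radius, an approximation

sub deg2rad { return $_[0] * atan2( 1, 1 ) * 4 / 180 }

# Great-circle distance between two lat/lon points (haversine formula).
sub distance_km {
    my ( $lat1, $lon1, $lat2, $lon2 ) = map { deg2rad($_) } @_;
    my $h = sin( ( $lat2 - $lat1 ) / 2 )**2
          + cos($lat1) * cos($lat2) * sin( ( $lon2 - $lon1 ) / 2 )**2;
    $h = 1 if $h > 1;    # guard against rounding errors
    return 2 * EARTH_RADIUS_KM * atan2( sqrt($h), sqrt( 1 - $h ) );
}

# Approximate coordinates, for illustration only.
my %places = (
    Riga    => [ 56.95, 24.11 ],
    Tallinn => [ 59.44, 24.75 ],
    London  => [ 51.51, -0.13 ],
);

# "geo distance": everything within 500 km of Riga.
my @origin = @{ $places{Riga} };
my @nearby = grep { distance_km( @origin, @{ $places{$_} } ) <= 500 }
             sort keys %places;

print "@nearby\n";
```

A geo distance range works the same way with both a lower and an upper bound on the computed distance.
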
  • 149. “Terms of endearment”: The ElasticSearch query language explained. Thurs. 14:35 - Auditorija 301