Elasticsearch Distributed search & analytics on BigData made easy

Elasticsearch is a cloud-ready, highly scalable search engine that has been gaining a lot of popularity lately. It is mostly known for being extremely easy to set up and integrate with any technology stack. In this talk we will introduce Elasticsearch, starting with some of its basic capabilities. We will demonstrate how it can be used for document search and even log analytics for DevOps and distributed debugging, and peek into more advanced uses like real-time aggregations and percolation. We will also demonstrate how Elasticsearch can be scaled out easily to work on a distributed architecture and handle pretty much any load.

Published in: Data & Analytics

  1. 1. Itamar Syn-Hershko http://code972.com @synhershko Elasticsearch Distributed search & analytics on BigData made easy
  2. 2. Me? • Itamar Syn-Hershko / @synhershko • Lucene.NET PMC and lead committer • Freelance consultant and developer • Elasticsearch consulting partner • Microsoft MVP • RavenDB – Ex-core developer – “RavenDB in Action” author
  3. 3. An index
  4. 4. Elasticsearch • Powered by Apache Lucene • Open-source • Rapid growth • High profile users world-wide
  5. 5. REST API • Indexes • Types • IDs $ curl -XPUT 'http://localhost:9200/twitter/tweet/1' -d '{ "user" : "synhershko", "post_date" : "2013-05-30T14:12:12", "message" : "trying out Elastic Search", "followers": 3, "registered": true }'
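The index / type / ID layout of that URL can be sketched in Python as follows. This is a minimal illustration that only builds the request and does not send it; the helper name `build_index_request` and the `Content-Type` header are assumptions added here (the slide's curl call sends no header, while newer Elasticsearch versions expect one).

```python
import json
import urllib.request


def build_index_request(index, doc_type, doc_id, document,
                        base_url="http://localhost:9200"):
    """Build (but do not send) a PUT request for /<index>/<type>/<id>."""
    url = f"{base_url}/{index}/{doc_type}/{doc_id}"
    body = json.dumps(document).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="PUT",
        # Assumption: not present in the slide's curl command.
        headers={"Content-Type": "application/json"},
    )


TWEET = {
    "user": "synhershko",
    "post_date": "2013-05-30T14:12:12",
    "message": "trying out Elastic Search",
    "followers": 3,
    "registered": True,
}

request = build_index_request("twitter", "tweet", 1, TWEET)
print(request.get_method(), request.full_url)
```

Passing the request to `urllib.request.urlopen` would actually index the document, assuming a node is listening on localhost:9200.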
  6. 6. Full-Text Search
  7. 7. Full-text Search 101: The inverted index
     6 documents to index:
       1  The old night keeper keeps the keep in the town
       2  In the big old house in the big old gown.
       3  The house in the town had the big old keep
       4  Where the old night keeper never did sleep.
       5  The night keeper keeps the keep in the night
       6  And keeps in the dark and sleeps in the light.
     The index: Dictionary and posting lists
       Term     Documents
       and      <6>
       big      <2> <3>
       dark     <6>
       did      <4>
       gown     <2>
       had      <3>
       house    <2> <3>
       in       <1> <2> <3> <5> <6>
       keep     <1> <3> <5>
       keeper   <1> <4> <5>
       keeps    <1> <5> <6>
       light    <6>
       never    <4>
       night    <1> <4> <5>
       old      <1> <2> <3> <4>
       sleep    <4>
       sleeps   <6>
       the      <1> <2> <3> <4> <5> <6>
       town     <1> <3>
       where    <4>
     Example from: Justin Zobel, Alistair Moffat, Inverted files for text search engines, ACM Computing Surveys (CSUR) v.38 n.2, p.6-es, 2006
  8. 8. Full-text Search 101: The inverted index (same 6 documents, dictionary and posting lists as slide 7) • User queries for “keeper” • Posting list: keeper → <1> <4> <5>
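A minimal Python sketch of the inverted index from these two slides, built over the same six documents. The tokenizer (lowercase, letters only) and the function names are assumptions made for illustration; this is not how Lucene stores its dictionary or posting lists.

```python
import re
from collections import defaultdict

DOCS = {
    1: "The old night keeper keeps the keep in the town",
    2: "In the big old house in the big old gown.",
    3: "The house in the town had the big old keep",
    4: "Where the old night keeper never did sleep.",
    5: "The night keeper keeps the keep in the night",
    6: "And keeps in the dark and sleeps in the light.",
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def build_index(docs):
    """Map every term to the sorted list of document IDs containing it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in sorted(postings.items())}


INDEX = build_index(DOCS)


def search(term):
    """Look up a single term; unknown terms yield an empty posting list."""
    return INDEX.get(term.lower(), [])


print(search("keeper"))  # [1, 4, 5]
```

A query is a dictionary lookup followed by reading a posting list, so the documents themselves are never scanned at search time.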
  9. 9. Term Normalization (shown against the same dictionary and posting lists as slide 7) • Lowercasing • Stop words (grey) – Not best practice anymore • Stemming – Porter stemmer – s-stemmer • Relevance++ • SizeOnDisk--
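The normalization steps listed on this slide can be sketched as a small pipeline. The stop-word list and the plural-stripping rules below are simplified assumptions for illustration; they are not the exact rules of the Porter stemmer or of any Lucene analyzer.

```python
import re

# Assumption: a tiny illustrative stop-word list, not a standard one.
STOP_WORDS = {"and", "in", "the"}


def s_stem(token):
    """Simplified s-stemmer: strip common English plural endings."""
    if token.endswith("ies") and not token.endswith(("eies", "aies")):
        return token[:-3] + "y"
    if token.endswith("es") and not token.endswith(("aes", "ees", "oes")):
        return token[:-1]
    if token.endswith("s") and not token.endswith(("us", "ss")):
        return token[:-1]
    return token


def normalize(text):
    """Lowercase, drop stop words, then stem the remaining tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [s_stem(t) for t in tokens if t not in STOP_WORDS]


print(normalize("The keeper keeps the keep in the town"))
# ['keeper', 'keep', 'keep', 'town']
```

After stemming, "keeps" and "keep" collapse into one dictionary entry, which is where the smaller index and the better recall come from.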
  10. 10. Full-Text Search Your data store
  11. 11. How hard is it to get search right, anyway?
  12. 12. Relevance • Precision The fraction of the retrieved documents that are relevant • Recall The fraction of the relevant documents that are retrieved • Order of results
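The two definitions above translate directly into code. The sketch below treats results as sets of document IDs and ignores ordering; the function name and the sample IDs are made up for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for two collections of document IDs."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Four documents came back, but only two of them were relevant.
print(precision_recall(retrieved=[1, 4, 5, 6], relevant=[1, 4]))  # (0.5, 1.0)
```

Returning more documents tends to raise recall and lower precision, which is why the order of results matters as a third measure.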
  13. 13. Challenges with search • Relevance • Getting the tokens right – Tokenization – Stemming • Multi-lingual content – Or other cross-cutting search concerns • Tolerance
  14. 14. Real-time Analytics
  15. 15. Real-time Analytics Queue (Redis) “Shippers” “Indexer”
  16. 16. Scaling out
  17. 17. Moar use cases!
  18. 18. #1: Real-Time Alerting System
  19. 19. Percolation
  20. 20. #2: Smarter query parsing
  21. 21. Matching inexact queries • Phrase slop – “Bridge of London” -> “London Bridge” • Word-level edit distance with fuzzy queries – ditsance -> distance – color -> colour
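Word-level edit distance can be sketched with a small dynamic-programming table. The version below is the "optimal string alignment" variant, which counts a swap of two adjacent characters as a single edit; that choice is an assumption made here so that "ditsance" is one edit away from "distance", and the function name is hypothetical.

```python
def osa_distance(a, b):
    """Edit distance with insert, delete, substitute and adjacent swap."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Adjacent transposition, e.g. "ts" <-> "st".
            if (i > 1 and j > 1
                    and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


print(osa_distance("ditsance", "distance"))  # 1
print(osa_distance("color", "colour"))       # 1
```

A fuzzy query then accepts any indexed term whose distance from the query term is within a small budget, typically one or two edits.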
  22. 22. #3: Offline Classification
  23. 23. Structuring the unstructured • Record linkage – Bag of words model – “More Like This” functionality • NLP • Entity extraction
  24. 24. #4: Everything is searchable
  25. 25. Geo-spatial search • Distance • Shape interactions • Multiple algorithms
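As one example of a distance calculation, here is a sketch of the haversine great-circle formula. It is only one of several possible algorithms and assumes a spherical Earth with a mean radius of 6371 km; it is not presented as the calculation Elasticsearch itself performs.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumption: mean radius, spherical model


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    # Clamp to guard against rounding pushing the value just above 1.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


# One degree of longitude along the equator.
print(haversine_km(0.0, 0.0, 0.0, 1.0))
```

A "within N km" filter keeps the documents whose stored point is closer than N to the query point under such a measure.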
  26. 26. Geo-spatial search
  27. 27. Image search http://colors.qbox.io/
  28. 28. http://cs.stanford.edu/people/karpathy/deepimagesent Deep Visual-Semantic Alignments for Generating Image Descriptions
  29. 29. #5: Anomaly detection
  30. 30. The Significant Terms Aggregation
  31. 31. Uncommonly common Mark Harwood’s talk at http://www.infoq.com/presentations/elasticsearch-revealing-uncommonly-common
  32. 32. #6: Debugging a distributed system Queue (Redis)
  33. 33. #6: Debugging a distributed system
      127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326 "http://www.example.com/start.html" "Mozilla/4.08 [en] (Win98; I ;Nav)"
      System.NullReferenceException: Object reference not set to an instance of an object.
         at System.Collections.Generic.Dictionary`2.Insert(TKey key, TValue value, Boolean add)
         at AjaxControlToolkit.ToolkitScriptManager.GetScriptCombineAttributes(Assembly assembly)
         at AjaxControlToolkit.ToolkitScriptManager.IsScriptCombinable(ScriptEntry scriptEntry)
         at AjaxControlToolkit.ToolkitScriptManager.OnResolveScriptReference(ScriptReferenceEventArgs e)
         at System.Web.UI.ScriptManager.RegisterScripts()
         at System.Web.UI.ScriptManager.OnPagePreRenderComplete(Object sender, EventArgs e)
         at System.Web.UI.Page.OnPreRenderComplete(EventArgs e)
         at System.Web.UI.Page.ProcessRequestMain(Boolean includeStagesBeforeAsyncPoint, Boolean includeStagesAfterAsyncPoint)
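Before a log line like the access-log entry above becomes searchable, something has to turn it into a structured document. The sketch below does that with a regular expression for the Apache combined log format; the field names and the function name are assumptions chosen for illustration, and in the pipeline shown on the slides this step would be done by the indexer rather than by hand-written code.

```python
import re

# Assumed field names for the Apache "combined" access-log format.
COMBINED = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

LINE = (
    '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] '
    '"GET /apache_pb.gif HTTP/1.0" 200 2326 '
    '"http://www.example.com/start.html" "Mozilla/4.08 [en] (Win98; I ;Nav)"'
)


def parse_access_line(line):
    """Return a dict ready for indexing, or None if the line does not match."""
    match = COMBINED.match(line)
    if match is None:
        return None
    doc = match.groupdict()
    doc["status"] = int(doc["status"])
    # A size of "-" means no body was sent.
    doc["size"] = 0 if doc["size"] == "-" else int(doc["size"])
    return doc


doc = parse_access_line(LINE)
print(doc["user"], doc["status"], doc["size"])  # frank 200 2326
```

Once status codes and sizes are numeric fields rather than text, they can be filtered and aggregated across every machine that ships its logs.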
  34. 34. #7: Distributed git storage • PoC in C# using libgit2sharp • https://github.com/synhershko/libgit2sharp.Elasticsearch • Kudos @nulltoken
  35. 35. Thank you. Questions? Itamar Syn-Hershko http://code972.com @synhershko
