Why Search? (starring Elasticsearch)

Why do we need a dedicated search engine to search our unstructured text data? Why can't we just rely on the features built into most databases?

Published in: Technology

  1. Why Search? (starring Elasticsearch)
     Doug Turnbull, OpenSource Connections
  2. Hello
     • Me: @softwaredoug, dturnbull@o19s.com
     • Us: http://o19s.com
         o World-class search consultants
         o Right here in C’ville!
         o Hiring passionate interns!
  3. Why Search?
     • What does a dedicated search engine do that a database doesn’t?
     • Why not [MySQL | MongoDB | Cassandra | etc.]?
     • Why a dedicated search engine?
  4. Why not MySQL?
     • We’ve got rows of stuff in tables. E.g., for SciFi StackExchange, we’ve stored ~20K posts:
         PostID  UserId  CreationDate             ViewCount  Body
         0       1       2011-01-11T20:52:46.753  124        <p>What exactly did Obiwan know about Anakin and Darth Vader before a New Hope started?</p>
         1       2       2013-02-01T12:44:46.525  525        <p>Been meaning to read the Foundation Series, what should I read first?</p>
  5. Why not MySQL?
     • Our mission: find every “Darth Vader” in the SciFi StackExchange posts!
         P  U  C  V  Body
         0  1  2  1  <p>What exactly did Obiwan know about Anakin and Darth Vader before a New Hope started?</p>  (Found!)
         1  2  2  5  <p>Been meaning to read the Foundation Series, what should I read first?</p>  (Missing!)
  6. Why not MySQL: SQL LIKE?
     • The SQL “LIKE” operator scans all rows for a specific wildcard match:
         SELECT * FROM posts WHERE body LIKE "%darth vader%"
     • Performs a table scan: Match? Match? Match? Match?
     • Approx. 300 ms to search a measly 20K docs! (What if we had 20 million?)
  7. SQL LIKE: other problems
     • Can’t search for words out of order:
         SELECT * FROM posts WHERE body LIKE "%vader, darth%"
         (0 results)
     • Can’t search for alternate forms of a word:
         SELECT * FROM posts WHERE body LIKE "%kittie pictures%"
         SELECT * FROM posts WHERE body LIKE "%kitteh pictures%"
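The LIKE limitations on slides 6 and 7 can be reproduced with a minimal Python sketch. It uses SQLite from the standard library as a stand-in for MySQL (an assumption made so the example runs anywhere), and the `posts` table here is a hypothetical two-column version holding only the two sample posts from slide 4.

```python
import sqlite3

# In-memory SQLite stands in for MySQL here (assumption, for portability).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (post_id INTEGER PRIMARY KEY, body TEXT)")
conn.executemany(
    "INSERT INTO posts (post_id, body) VALUES (?, ?)",
    [
        (0, "<p>What exactly did Obiwan know about Anakin and Darth Vader "
            "before a New Hope started?</p>"),
        (1, "<p>Been meaning to read the Foundation Series, "
            "what should I read first?</p>"),
    ],
)


def like_search(pattern):
    """Return ids of posts whose body matches the LIKE pattern."""
    rows = conn.execute(
        "SELECT post_id FROM posts WHERE body LIKE ? ORDER BY post_id",
        (pattern,),
    )
    return [post_id for (post_id,) in rows]


exact_hits = like_search("%darth vader%")       # literal substring, same order
reordered_hits = like_search("%vader, darth%")  # same words, different order
print(exact_hits, reordered_hits)
```

The first pattern finds post 0, while the reordered pattern finds nothing even though both words occur in that post. A leading `%` wildcard also means every row has to be examined, which is the table scan the slide complains about.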
  8. SQL LIKE: other problems
     • No ranking of results. Given these two docs, which should come first?
         o I seem to remember a novel, I think it was Dark Lord: The Rise of Darth Vader, that addressed this. It made the assertion that while Darth Vader had lost both hands, he was still as formidable, in the force sense,
           (Directly about Darth Vader)
         o One might ask how none of the Jedi at Qui-Gon's funeral noticed that there was a Dark Lord of the Sith standing right behind them. Darth Vader and Obi-Wan only noticed each other when on the same station … It's apparently hard to pick up another force-user without knowing he or she is there…
           (Darth Vader is a side topic here)
  9. SQL LIKE | CTRL+F | grep is
     1. Extremely slow
     2. Not fuzzy: needs exact literal matches, no fuzziness!
     3. Unranked: simply says yes/no whether there is a match
  10. Search needs to be
     1. FAST! A data structure that can efficiently take search terms and return a set of documents
     2. FUZZY! A way to record positional and fuzzy modifications to text to assist matching
     3. FRUITFUL! Relevant documents bubble to the top.
  11. Let's play with an implementation
     • Your database’s full-text search features
         o MySQL, for example, has a FULLTEXT index
         o Works for trivial cases, not the path of wisdom
     • Lucene -> Elasticsearch
         o Lucene, 1999, by Doug Cutting: Java library for search
         o Solr, 2006, Yonik Seeley: first to put Lucene behind an HTTP interface; still going strong
         o Elasticsearch, 2010, Shay Banon: alternative implementation; extremely REST-y
  12. Elasticsearch
     • Create an index:
         curl -XPUT http://localhost:9200/stackexchange
     • Index some docs!
         curl -XPUT http://localhost:9200/stackexchange/post/1 -d '{ "Body": "<p>Darth Vader dined with Luke</p>", "Title": "..." }'
  13. What is being built? The answer can be found in your textbook…
     • Book index:
         o Topics -> page number
         o Very efficient tool: compare to scanning the whole book!
     • Lucene uses an index:
         o Tokens => document ids:
             laser => [2, 4]
             light => [2, 5]
             lightsaber => [0, 1, 5, 7]
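The token-to-document-ids mapping on this slide can be sketched in a few lines of Python. This is only an illustration of the idea, not Lucene's actual data structure; the mini-corpus and the `build_inverted_index` helper are hypothetical, and tokens are produced by plain lowercasing and whitespace splitting.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each token to the sorted ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return {token: sorted(doc_ids) for token, doc_ids in index.items()}


# Hypothetical mini-corpus, invented purely for illustration.
docs = {
    0: "lightsaber duel",
    1: "red lightsaber",
    2: "laser light show",
    4: "laser cannon",
    5: "light from a lightsaber",
}

index = build_inverted_index(docs)
print(index["laser"])
print(index["light"])
print(index["lightsaber"])
```

Looking up a term is now a single dictionary access instead of a scan over every document, which is the same trade a book index makes against reading the whole book.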
  14. Computers == Dumb
     • Humans are smart
         o I see “cat” or “cats” in the back of a book, no duh, jump to page 9
     • Computers are dumb
         o “CAT” != “cat”: no match returned
         o “cat” != “cats”: no match returned
     • Hence, when indexing, normalize text to a more searchable form:
         cats -> cat
         fitted -> fit
         alumnus -> alumnu
  15. Normalization aka Text Analysis
     • Raw input:
         <p>Darth Vader dined with Luke</p>
     • Filtered (char filter):
         Darth Vader dined with Luke
     • Tokenized:
         [Darth] [Vader] [dined] [with] [Luke]
     • Token filters (lowercased, synonyms applied, pointless words removed):
         [darth] [vader] [dine] [luke]
     • Most importantly: this is highly configurable
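The three analysis stages above (char filter, tokenizer, token filters) can be approximated with a minimal Python sketch. Everything here is an assumption made for illustration: the stopword list is invented, and `toy_stem` is a crude suffix stripper, not the Snowball stemmer that Elasticsearch would apply.

```python
import re

# Hypothetical stopword list; real analyzers ship their own.
STOPWORDS = {"a", "an", "and", "of", "the", "with"}


def char_filter(text):
    """Char filter: replace HTML tags with spaces."""
    return re.sub(r"<[^>]+>", " ", text)


def tokenize(text):
    """Tokenizer: split the filtered text into word tokens."""
    return re.findall(r"\w+", text)


def toy_stem(token):
    """Crude suffix stripping, standing in for a real stemmer."""
    if len(token) > 3 and token.endswith("ed"):
        return token[:-1]  # dined -> dine
    if len(token) > 3 and token.endswith("s"):
        return token[:-1]  # cats -> cat
    return token


def analyze(text):
    """Run char filter, tokenizer, then token filters in order."""
    tokens = [token.lower() for token in tokenize(char_filter(text))]
    tokens = [token for token in tokens if token not in STOPWORDS]
    return [toy_stem(token) for token in tokens]


tokens = analyze("<p>Darth Vader dined with Luke</p>")
print(tokens)
```

The key design point is that the same `analyze` function has to run at index time and at query time, so that "CATS" in a query and "cat" in a document end up as the same term.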
  16. Normalization aka Text Analysis
         curl -XGET 'http://localhost:9200/_analyze?analyzer=snowball' -d 'Darth Vader dined with Luke'
         {
           "tokens": [
             { "end_offset": 5, "position": 1, "start_offset": 0, "token": "darth", "type": "<ALPHANUM>" },
             { "end_offset": 11, "position": 2, "start_offset": 6, "token": "vader", "type": "<ALPHANUM>" },
             { "end_offset": 17, "position": 3, "start_offset": 12, "token": "dine", "type": "<ALPHANUM>" },
             { "end_offset": 27, "position": 5, "start_offset": 23, "token": "luke", "type": "<ALPHANUM>" }
           ]
         }
  17. What is being built?
         curl -XPUT http://localhost:9200/stackexchange/post/1 -d '{ "Body": "<p>Darth Vader dined with Luke</p>", "Title": "..." }'
         curl -XPUT http://localhost:9200/stackexchange/post/2 -d '{ "Body": "<p>We love Darth</p>", "Title": "..." }'
     • Resulting index:
         field Body
           term darth
             doc 1 <metadata>
             doc 2 <metadata>
           term vader
             doc 1 <metadata>
           term dine
             doc 1 <metadata>
  18. Ranking
         curl -XPUT http://localhost:9200/stackexchange/post/1 -d '{ "Body": "<p>Darth Vader dined with Luke</p>", "Title": "..." }'
         curl -XPUT http://localhost:9200/stackexchange/post/2 -d '{ "Body": "<p>We love Darth</p>", "Title": "..." }'
     • Resulting index:
         field Body
           term darth
             doc 1 <metadata>
             doc 2 <metadata>
           term vader
             doc 1 <metadata>
           term dine
             doc 1 <metadata>
     • Can we store anything in <metadata> to help decide how relevant this term is for this doc? Yes!
         o Term frequency: how much “darth” is in this doc?
         o Position within document: helps when we search for the phrase “darth vader”
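The per-document metadata described on this slide can be sketched as a posting list that records token positions, from which term frequency falls out as the number of positions. This is a minimal Python illustration, not Lucene's storage format; the helper names are hypothetical and the token lists are hand-analyzed assumptions for the two slide documents.

```python
from collections import defaultdict


def build_postings(docs):
    """Map term -> {doc_id: [positions]}; term frequency is len(positions)."""
    postings = defaultdict(dict)
    for doc_id, tokens in docs.items():
        for position, term in enumerate(tokens):
            postings[term].setdefault(doc_id, []).append(position)
    return postings


def phrase_match(postings, first, second):
    """Return ids of docs where `second` directly follows `first`."""
    hits = []
    for doc_id, first_positions in postings.get(first, {}).items():
        second_positions = set(postings.get(second, {}).get(doc_id, []))
        if any(pos + 1 in second_positions for pos in first_positions):
            hits.append(doc_id)
    return sorted(hits)


# Hand-analyzed tokens for the two slide documents (assumed, not ES output).
docs = {
    1: ["darth", "vader", "dine", "luke"],
    2: ["we", "love", "darth"],
}

postings = build_postings(docs)
darth_tf = {doc_id: len(pos) for doc_id, pos in postings["darth"].items()}
phrase_hits = phrase_match(postings, "darth", "vader")
print(darth_tf, phrase_hits)
```

Both documents contain "darth" once, so term frequency alone cannot separate them here; the stored positions are what let the phrase "darth vader" match only document 1.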
  19. Query Documents
     • When did Darth Vader and Luke have dinner?
         curl -X POST "http://localhost:9200/stackexchange/_search?pretty=true" -d '
         {
           "query": {
             "match": {
               "Body": "luke darth dinner"
             }
           }
         }'
     • The user query is the string "luke darth dinner".
  20. What happens when we query?
     • Client sends the query: luke darth dinner
     • Analysis: [luke] [darth] [dine]
     • How to consult the index for matches?
         o [darth]: score for [darth] docs (1 and 2)
         o [dine]: score for [dine] docs (1)
     • Return sorted docs to the client
     • Index consulted:
         field Body
           term darth
             doc 1 <metadata>
             doc 2 <metadata>
           term vader
             doc 1 <metadata>
           term dine
             doc 1 <metadata>
           ...
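The query flow on this slide (analyze, look up each term, score, sort) can be sketched in Python as follows. The scoring is deliberately naive: it just sums raw term frequencies, which is an assumption for illustration and much simpler than Lucene's real relevance formula. The `INDEX` contents are hand-built for the two slide documents, and the query terms are taken as already analyzed.

```python
from collections import Counter

# Toy index for the two slide documents: term -> {doc_id: term frequency}.
# Hand-built assumption, not actual Elasticsearch output.
INDEX = {
    "darth": {1: 1, 2: 1},
    "vader": {1: 1},
    "dine": {1: 1},
    "luke": {1: 1},
    "we": {2: 1},
    "love": {2: 1},
}


def search(query_terms):
    """Sum term frequencies per doc over the query terms, best score first."""
    scores = Counter()
    for term in query_terms:
        for doc_id, term_frequency in INDEX.get(term, {}).items():
            scores[doc_id] += term_frequency
    # Sort by descending score, breaking ties by ascending doc id.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# "luke darth dinner" after analysis, per the slide: [luke] [darth] [dine]
results = search(["luke", "darth", "dine"])
print(results)
```

Document 1 matches all three terms and document 2 only matches "darth", so document 1 sorts first. Only the posting lists for the query terms are touched, never the full document set.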
  21. So Elasticsearch!
     • FAST!
         o Inverted index data structure is blazing fast
         o Lucene is probably the most tuned implementation
     • FUZZY!
         o We use analysis to normalize text to canonical forms
         o We can use positional information when querying (not shown here)
     • FRUITFUL!
         o Relevant documents are scored based on relative term frequency
  22. BUT WAIT THERE’S MORE
     • Many non-traditional applications of “search”
         o Rank file directory by proximity to current directory
         o Geographic-aided search: rank based on distance and search relevancy
         o Q&A systems: Watson has a ton of Lucene
         o Log aggregation, e.g. Kibana, because in Lucene everything is indexed!
     • And many features!
         o Spellchecking
         o Facets
         o More-like-this document
  23. QUESTIONS?
