TechTalk #13 Grokking: Marrying Elasticsearch with NLP to solve real-world search problems

Marrying Elasticsearch with NLP to solve real-world search problems - Phu Le (Knorex)

Published in: Technology

  1. Marrying Elasticsearch with NLP to solve real-world search problems
     Phu Le, Knorex @ Grokking TechTalk, 25 June 2016
     Web: http://knorex.com
     Email: info@knorex.com
     © 2016 Knorex
  2. Knorex Lumina Web Services™
  3. Knorex Lumina Web Services™
  4. Knorex Lumina Web Services™
  5. Knorex Lumina Web Services™
  6. Outline
     1. Architecture
     2. Ingredients
        • Data gathering
        • Content extraction
        • Preprocessing
        • Modelling: terms -> phrases, entities -> documents
     3. Elasticsearch
        • Basic analysis, faceting and filtering
        • Do you mean
        • Percolator
        • Recommendation
        • Deduplication
     4. Summary
  7. Architecture
  8. Ingredients
     1. Data gathering
        • Deep crawler
        • Lazy crawler
        • Visual scraper
        • Social media adapters
     2. Content extraction
        • Take a news article as an example:
          • Title
          • Content
          • Published date
          • Author
          • Image
          • …
  9. Content extraction
  10. Content extraction
  11. Ingredients
      3. Preprocessing
         • Sentence splitting, tokenization
         • Stemming vs lemmatizing
           • Stemming: cries, crying, cried => cri
           • Lemmatizing: dogs => dog; is, are => be
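The difference between the two can be sketched in a few lines of Python. This is a toy illustration only: the suffix list and the lemma table are assumptions made up for this example, whereas a real pipeline would use a full stemmer (such as Porter's) and a dictionary-backed lemmatizer.

```python
# Toy sketch: the suffix list and lemma table below are illustrative assumptions,
# not the rules of any real stemmer or lemmatizer.
SUFFIXES = ("ing", "ed", "es", "s")
VOWELS = "aeiou"

def toy_stem(word: str) -> str:
    """Chop a known suffix, then normalize a trailing consonant + 'y' to 'i'."""
    word = word.lower()
    for suffix in SUFFIXES:
        # Keep at least two characters of stem.
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            word = word[: -len(suffix)]
            break
    if len(word) > 2 and word.endswith("y") and word[-2] not in VOWELS:
        word = word[:-1] + "i"
    return word

# A lemmatizer maps a word to its dictionary form, so it needs a lookup table.
LEMMAS = {"dogs": "dog", "is": "be", "are": "be"}

def toy_lemmatize(word: str) -> str:
    """Return the dictionary form if known, otherwise the lowercased word."""
    return LEMMAS.get(word.lower(), word.lower())

print([toy_stem(w) for w in ("cries", "crying", "cried")])
print([toy_lemmatize(w) for w in ("dogs", "is", "are")])
```

The stemmer output ("cri") need not be a real word, which is fine for matching; the lemmatizer output always is, at the cost of needing a dictionary.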
  12. Ingredients
      4. Modelling
         • Goal: synthesize words and tokens into larger units and attach meaning to them
         • Key phrase extraction
         • Named entity recognition
           • Basic building block of knowledge
           • Basis for computing relatedness and extracting relations
         • Sentiment analysis
           • Social media snippet
           • General article or towards concepts / named entities
           • Emotion
         • Document classification
           • Group search results into faceted categories
           • Recommend related articles by category
  13. Terms
  14. Phrases
  15. Entities
  16. Document classification
  17. Elasticsearch
      • First released Feb 2010, among the fastest-growing open-source projects, total funding $104M (3 rounds)
      • Based on Apache Lucene (same as Solr)
      • Written in Java, supports an HTTP interface and schema-free JSON documents (yay no XML!)
      • Designed to be scalable, distributed in nature
  18. Analysis
      "analyzer": "standard"
      "analyzer": "whitespace"
      "analyzer": "keyword"
  19. Analysis
      Successful!
      ["https", "www.facebook.com", "events", "194454270949757"]
      No hits! WTH… it is not working!!!!
      Default analyzer as-is
      • url => not_analyzed / keyword analyzer
      • Use match query instead of term filter / term query: field analyzer awareness
      • Custom analyzer: e.g. keyword tokenizer + lowercase filter
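Why the exact lookup returns no hits can be reproduced without a cluster. The sketch below emulates the analyzers in plain Python; the example URL is an assumption reconstructed from the tokens on the slide, and `simplified_standard_analyzer` is a rough stand-in, not Lucene's actual tokenization rules.

```python
import re

# Assumed example URL, reconstructed from the tokens listed on the slide.
URL = "https://www.facebook.com/events/194454270949757"

def keyword_analyzer(text):
    """Emit the whole input as one token (the effect of not_analyzed)."""
    return [text]

def whitespace_analyzer(text):
    """Split on whitespace only."""
    return text.split()

def simplified_standard_analyzer(text):
    """Rough stand-in: lowercase, split on anything but letters, digits, dots."""
    return [tok for tok in re.split(r"[^0-9a-z.]+", text.lower()) if tok]

def lowercase_keyword_analyzer(text):
    """Custom analyzer from the slide: keyword tokenizer + lowercase filter."""
    return [text.lower()]

def term_query(indexed_tokens, term):
    """A term query is not analyzed: it must equal an indexed token exactly."""
    return term in indexed_tokens

def match_query(indexed_tokens, text, analyzer):
    """A match query analyzes its input with the field's analyzer first."""
    return any(tok in indexed_tokens for tok in analyzer(text))

analyzed = simplified_standard_analyzer(URL)
print(analyzed)
print(term_query(analyzed, URL))                # the full URL is not a token
print(term_query(keyword_analyzer(URL), URL))   # indexed as-is, so it matches
print(match_query(analyzed, URL, simplified_standard_analyzer))
```

The term query fails against the analyzed field because the index holds the pieces of the URL, never the URL itself; either index the field unanalyzed or let a match query run the same analyzer on the input.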
  20. Analysis
      [Diagram labels: Search, Search analyzer, Elasticsearch index, Index analyzer, Index]
      • Design carefully which fields searches will run on most frequently
      • Determine which analyzer to use for each field (experiment, based on application needs)
      • The search analyzer and index analyzer may differ for the same field
      • Use match query instead of term filter / term query: field analyzer awareness
      • Exploit multi-fields
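A mapping that follows these points might look like the sketch below, written as a Python dict that could be sent as the body of an index-creation request. The field and analyzer names (`title`, `url`, `url_analyzer`, `prefix_index`) are hypothetical, and the syntax assumes Elasticsearch 7 or later; on the 2.x series current at the time of the talk, the field type would be `string` (with `"index": "not_analyzed"` for exact values) rather than `text` / `keyword`.

```python
import json

# Hypothetical index definition; names are illustrative, syntax assumes ES 7+.
index_body = {
    "settings": {
        "analysis": {
            "filter": {
                "prefix_filter": {"type": "edge_ngram", "min_gram": 2, "max_gram": 10}
            },
            "analyzer": {
                # keyword tokenizer + lowercase filter, as suggested for URLs
                "url_analyzer": {
                    "type": "custom",
                    "tokenizer": "keyword",
                    "filter": ["lowercase"],
                },
                # index-time analyzer that also emits word prefixes
                "prefix_index": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "prefix_filter"],
                },
            },
        }
    },
    "mappings": {
        "properties": {
            "url": {"type": "text", "analyzer": "url_analyzer"},
            "title": {
                "type": "text",
                "analyzer": "standard",
                # multi-field: the same source value indexed in other ways
                "fields": {
                    "raw": {"type": "keyword"},
                    "prefix": {
                        "type": "text",
                        "analyzer": "prefix_index",
                        # search input is not n-grammed, only lowercased/tokenized
                        "search_analyzer": "standard",
                    },
                },
            },
        }
    },
}

print(json.dumps(index_body, indent=2))
```

Here `title.raw` serves exact matching and faceting, `title.prefix` serves type-ahead, and `title` itself serves full-text search, all from one source value.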
  21. Faceting and filtering
  22. Do you mean
      • "grok" -> "grokking", "sear" -> "search"
      • Natural approach:
        • Compute terms aggregation (facet) across all text fields
          • title
          • description
          • content
        • Use regex to filter matched terms, sort DESC by frequency, take most popular terms to suggest
      DON'T!!!
  23. (slide without text)
  24. Do you mean
      • Limitations
        • Single terms only; cannot suggest phrases
        • Terms that occur frequently might not be useful
      • Improvements
        • Build another field, "phrases", in the document
          • Add the entire title
          • Use key phrase extraction and named entity recognition to populate meaningful phrases
        • Custom tokenizers: keyword, edgeNGram
          • edgeNGram example: "grokking" => "gro", "grok", "grokk"
          • Query: "burs mal" => matched: "bursa malaysia"
          • memory explosion!!!
        • Custom scoring (importance, popularity score) instead of term frequency
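The edge n-gram idea is small enough to sketch in plain Python. The function names and the gram-size defaults are assumptions chosen to reproduce the slide's example; Elasticsearch performs this expansion inside the analyzer at index time.

```python
def edge_ngrams(token, min_gram=3, max_gram=5):
    """Return the prefixes of `token` whose length is in [min_gram, max_gram]."""
    upper = min(max_gram, len(token))
    return [token[:n] for n in range(min_gram, upper + 1)]

def index_phrase(phrase, min_gram=1, max_gram=20):
    """Index-time analysis: split on whitespace, expand each word into prefixes."""
    grams = set()
    for token in phrase.lower().split():
        grams.update(edge_ngrams(token, min_gram, max_gram))
    return grams

def matches(query, phrase):
    """Search-time: every query word must equal one of the indexed prefixes."""
    grams = index_phrase(phrase)
    return all(token in grams for token in query.lower().split())

print(edge_ngrams("grokking"))
print(matches("burs mal", "bursa malaysia"))
print(len(index_phrase("bursa malaysia")), "indexed grams for a two-word phrase")
```

The last line hints at the memory cost: each word contributes one indexed term per prefix length, so the term count grows with word length rather than word count.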
  25. Do you mean
      • Elasticsearch built-in suggester
        • FST example. Source: https://www.elastic.co/blog/you-complete-me
      • Features:
        • Speed & scale: FST per segment, built in real time, scales horizontally
        • Analysis: synonym, fuzzy
        • Supports custom ordering and scoring
      • Limitations: can't find a word anywhere within a phrase
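The prefix-only behaviour, and the limitation that follows from it, can be illustrated with a sorted list and binary search. This is a stand-in for the idea, not the FST the built-in suggester actually uses; the class name, phrases, and weights are made up for the example.

```python
import bisect

class PrefixSuggester:
    """Toy prefix lookup over a sorted phrase list (a stand-in for an FST)."""

    def __init__(self, weighted_phrases):
        # weighted_phrases: dict mapping phrase -> weight (higher ranks first)
        self._weights = {p.lower(): w for p, w in weighted_phrases.items()}
        self._sorted = sorted(self._weights)

    def suggest(self, prefix, size=5):
        prefix = prefix.lower()
        # All phrases sharing a prefix are contiguous in sorted order.
        start = bisect.bisect_left(self._sorted, prefix)
        hits = []
        for phrase in self._sorted[start:]:
            if not phrase.startswith(prefix):
                break
            hits.append(phrase)
        # Custom ordering: by weight, then alphabetically for ties.
        hits.sort(key=lambda p: (-self._weights[p], p))
        return hits[:size]

suggester = PrefixSuggester(
    {"bursa malaysia": 10, "bursa": 3, "malaysia airlines": 7}
)
print(suggester.suggest("bur"))
print(suggester.suggest("mal"))  # "bursa malaysia" is not found from "mal"
```

Because lookup walks from the start of the stored phrase, "mal" only reaches phrases that begin with it; supporting mid-phrase matches means storing extra inputs per phrase.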
  26. Do you mean
      • Speed test: 1 million articles, 2.7 GB index size, on a single laptop with SSD
        • Regex terms facet: 296.5 ms
        • Terms suggester: 13 ms
      • Cautions
        • Don't add all terms/phrases to suggestion (only meaningful ones!)
        • Don't start suggesting immediately. How many words start with "c"?
        • Don't suggest terms that yield no search results
        • Apply the same filter conditions as the current query to the term suggestion query
  27. Percolator
      • percolate: match documents against queries
  28. Percolator
      • Sample use case: segmenting articles using keywords
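Percolation inverts the usual flow: the queries are stored, and each incoming document is checked against them. A minimal sketch of the keyword-segmentation use case follows; the `ToyPercolator` class, the segment names, and the any-keyword matching rule are assumptions for illustration, whereas in Elasticsearch the stored items would be full queries.

```python
import re

class ToyPercolator:
    """Store keyword queries, then report which ones a document matches."""

    def __init__(self):
        self._queries = {}

    def register(self, query_id, keywords):
        """Store a query: a set of keywords, any of which triggers a match."""
        self._queries[query_id] = {k.lower() for k in keywords}

    def percolate(self, text):
        """Return the ids of all stored queries matched by this document."""
        tokens = set(re.findall(r"[a-z0-9]+", text.lower()))
        return sorted(
            qid for qid, keywords in self._queries.items() if keywords & tokens
        )

percolator = ToyPercolator()
percolator.register("finance", ["stock", "bursa", "dividend"])
percolator.register("tech", ["elasticsearch", "nlp"])

print(percolator.percolate("Bursa Malaysia adopts Elasticsearch for search"))
print(percolator.percolate("Weather forecast for the weekend"))
```

Each article is segmented as it arrives, with no need to re-run every segment's query over the whole index.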
  29. Recommendation
      • Natural approach
        • More-like-this or fuzzy-like-this on title, content
          • Not accurate, bag-of-words approach
          • Tricky to determine the threshold. A "good value" varies across different document types and domains
          • Slow. The more terms allowed in the queries, the slower it is. If cut off based on max terms, then accuracy drops
      • Proposed approaches
        • Utilize NLP results (modelling step):
          • Category: recommend articles from the same categories
          • Key phrases: match and rank documents w.r.t. target documents by key phrases
          • Named entities: model with parent/child relationship
        • Combine with the function score feature to rescore results
          • Example: applying a Gauss decay function to favor more recent results
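The Gauss decay used by function score can be written out directly. The formula below follows the one documented for Elasticsearch decay functions; the article ages, relevance scores, and the choice to multiply the two are assumptions for illustration.

```python
import math

def gauss_decay(value, origin, scale, offset=0.0, decay=0.5):
    """Gaussian decay: 1.0 within `offset` of `origin`, and exactly `decay`
    at a distance of `offset + scale` from it."""
    if not 0.0 < decay < 1.0:
        raise ValueError("decay must be strictly between 0 and 1")
    distance = max(0.0, abs(value - origin) - offset)
    sigma_sq = -(scale ** 2) / (2.0 * math.log(decay))
    return math.exp(-(distance ** 2) / (2.0 * sigma_sq))

# Hypothetical candidates: (title, relevance score, age in days).
candidates = [
    ("old but very relevant", 9.0, 60),
    ("recent and relevant", 6.0, 2),
    ("fresh but weak", 2.0, 0),
]

# Favor recent results: halve the weight for every 14 days past a 1-day grace.
rescored = sorted(
    (
        (title, score * gauss_decay(age, origin=0, scale=14, offset=1))
        for title, score, age in candidates
    ),
    key=lambda pair: pair[1],
    reverse=True,
)
for title, score in rescored:
    print(f"{score:.3f}  {title}")
```

Multiplying relevance by the decay lets recency demote stale articles without fully overriding how well they match.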
  30. Recommendation
      • Sophisticated scoring and ranking can be done outside of Elasticsearch
      • Elasticsearch can still be used for its faceting and filtering capabilities
  31. Deduplication
      • Natural approach
        • Term matching on URL, title
          • Fails if these are slightly different (very common!)
        • More-like-this or fuzzy-like-this on content, with a high matching threshold: e.g. 70%, 80%
          • Not accurate, bag-of-words approach
          • Tricky to determine the threshold. A "good value" varies across different document types and domains
          • Slow. The more terms allowed in the queries, the slower it is. If cut off based on max terms, then accuracy drops
      • Proposed approach
        • Semantic hashing: minhash, simhash
          • for a document, compute a hash value
          • convert the hash value to binary string form
          • robust and efficient, can cater to near-duplicates
        • Implement Hamming distance search using Elasticsearch fuzzy_like_this
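A minimal simhash sketch shows the three steps listed for the proposed approach. The tokenization, the unweighted tokens, and the use of BLAKE2b as the per-token hash are assumptions for this example; the talk does not say which variant Knorex uses.

```python
import hashlib
import re

BITS = 64

def _hash64(token):
    """64-bit hash of one token (BLAKE2b here; any stable hash would do)."""
    digest = hashlib.blake2b(token.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")

def simhash(text):
    """Each token votes +1/-1 on every bit position; the sign of each
    column total becomes one bit of the fingerprint."""
    votes = [0] * BITS
    for token in re.findall(r"\w+", text.lower()):
        hashed = _hash64(token)
        for i in range(BITS):
            votes[i] += 1 if (hashed >> i) & 1 else -1
    fingerprint = 0
    for i, total in enumerate(votes):
        if total > 0:
            fingerprint |= 1 << i
    return fingerprint

def to_bits(fingerprint):
    """Binary string form, zero-padded to the full width."""
    return format(fingerprint, f"0{BITS}b")

def hamming_distance(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")

original = "Elasticsearch is a distributed search engine built on Apache Lucene"
edited = "Elasticsearch is a distributed search engine based on Apache Lucene"

print(to_bits(simhash(original)))
print(to_bits(simhash(edited)))
print(hamming_distance(simhash(original), simhash(edited)))
```

Similar texts share most of their votes, so their fingerprints differ in few bits; a duplicate check then becomes "Hamming distance below a small threshold", with the threshold tuned per corpus.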
  32. Deduplication
      • Do not index duplicates at all, or
      • Collapse similar items in search results and display only the one with the highest score
        • Assign the same id (call it groupid) to articles that are duplicates
        • Use the Elasticsearch top hits aggregation to collapse results by groupid
      ⇒ Example:
        64-bit hash:      1000010001000111101001011011110010111101000011100101101001011101
        Modified version: 1010010001000111101011011011110010111101000011100101101000011101
        Hamming distance: 3
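Collapsing by groupid can be sketched both as plain Python and as a request body. The hit structure and the names `by_group` and `best_article` are hypothetical; the request combines a `terms` aggregation with a `top_hits` sub-aggregation, which keeps the highest-scoring hit per bucket by default.

```python
def collapse_by_group(hits):
    """Keep only the highest-scoring hit per groupid, best groups first."""
    best = {}
    for hit in hits:
        group = hit["groupid"]
        if group not in best or hit["score"] > best[group]["score"]:
            best[group] = hit
    return sorted(best.values(), key=lambda h: h["score"], reverse=True)

# Hypothetical search results; articles 1 and 2 are near-duplicates.
hits = [
    {"id": 1, "groupid": "g1", "score": 3.2},
    {"id": 2, "groupid": "g1", "score": 4.1},
    {"id": 3, "groupid": "g2", "score": 3.9},
]
print([hit["id"] for hit in collapse_by_group(hits)])

# The same idea as an aggregation body (a sketch; names are illustrative).
collapse_request = {
    "size": 0,
    "aggs": {
        "by_group": {
            "terms": {"field": "groupid"},
            "aggs": {"best_article": {"top_hits": {"size": 1}}},
        }
    },
}
```

Doing the collapse inside Elasticsearch avoids shipping every duplicate to the client just to discard it.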
  33. Further reading
      • Dismax vs bool queries
      • Term vs text queries
      • Filter vs filtered
      • Facets (old) vs aggregations (facets reborn + statistics)
      • Geo
  34. Summary
      • ES is very flexible, with numerous features and knobs
      • Critical to understand basic analysis and the different types of queries
      • Indexing time and search time tradeoff
      • Precision and recall tradeoff
      • Complexity and memory estimation
      • Use NLP techniques as a modelling step to improve search quality
      • Pay great attention to the data input and data gathering steps
  35. About Knorex
      • Founded in 2010 as a spin-off from the Data Mining Dept. of A*STAR, Singapore
      • Mission: enabling our customers to make smarter discoveries and turn them into actionable insight
  36. https://www.knorex.com
      https://itviec.com/companies/knorex
  37. Thank you