© 2016 Knorex
Marrying Elasticsearch with NLP to solve real-world search problems
Phu Le, Knorex
@ Grokking TechTalk
25 June 2016
Web : http://knorex.com
Email : info@knorex.com

Knorex Lumina Web Services™

Outline
1. Architecture
2. Ingredients
  • Data gathering
  • Content extraction
  • Preprocessing
  • Modelling: terms -> phrases, entities -> documents
3. Elasticsearch
  • Basic analysis, faceting and filtering
  • Do you mean
  • Percolator
  • Recommendation
  • Deduplication
4. Summary

Architecture

Ingredients
1. Data gathering
  • Deep crawler
  • Lazy crawler
  • Visual scraper
  • Social media adapters
2. Content extraction
  • Take a news article as an example
    • Title
    • Content
    • Published date
    • Author
    • Image
    • …

Content extraction

Ingredients
3. Preprocessing
  • Sentence splitting, tokenization
  • Stemming vs lemmatizing
    • Stemming: cries, crying, cried => cri (the exact stem depends on the stemmer)
    • Lemmatizing: dogs => dog; is, are => be
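
The preprocessing steps above can be sketched in a few lines of Python. This is only a toy illustration: the regular expressions are naive, and the `LEMMAS` table is a hypothetical stand-in for the dictionary a real lemmatizer would use.

```python
import re

# Hypothetical lookup table; a real lemmatizer relies on a full dictionary
# and part-of-speech information.
LEMMAS = {"dogs": "dog", "is": "be", "are": "be"}


def split_sentences(text):
    """Naively split after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def tokenize(sentence):
    """Lowercase the sentence and keep runs of letters, digits and apostrophes."""
    return re.findall(r"[a-z0-9']+", sentence.lower())


def lemmatize(tokens):
    """Map each token to its dictionary form when the table knows it."""
    return [LEMMAS.get(token, token) for token in tokens]
```

For example, `lemmatize(tokenize("Dogs are loyal."))` maps "dogs" to "dog" and "are" to "be", mirroring the lemmatizing bullet.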

Ingredients
4. Modelling
  • Goal: synthesize words and tokens into larger units and attach meaning to them
  • Key phrase extraction
  • Named entity recognition
    • Basic building block of knowledge
    • Basis for computing relatedness and extracting relations
  • Sentiment analysis
    • Social media snippet
    • General article, or towards concepts / named entities
    • Emotion
  • Document classification
    • Group search results into faceted categories
    • Recommend related articles by category

Terms
Phrases
Entities
Document classification

Elasticsearch
• First released in Feb 2010; among the fastest-growing open-source projects; total funding $104M (3 rounds)
• Based on Apache Lucene (same as Solr)
• Written in Java; supports an HTTP interface and schema-free JSON documents (yay, no XML!)
• Designed to be scalable; distributed in nature

Analysis
• "analyzer": "standard"
• "analyzer": "whitespace"
• "analyzer": "keyword"

Analysis
• Example: looking up a document by its full URL with a term query
  • Default analyzer: the URL is indexed as the tokens ["https", "www.facebook.com", "events", "194454270949757"] => No hits! WTH… it is not working!!!!
  • Indexed as-is => Successful!
• url => not_analyzed / keyword analyzer
• Use a match query instead of a term filter / term query: the match query is aware of the field's analyzer
• Custom analyzer: e.g. keyword tokenizer + lowercase filter
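
The URL pitfall can be reproduced without a cluster. The sketch below is a rough emulation in plain Python, not Elasticsearch code: `standard_like` only approximates the standard analyzer, and all function names are made up for illustration.

```python
import re

URL = "https://www.facebook.com/events/194454270949757"


def standard_like(text):
    """Rough stand-in for the standard analyzer: lowercase, then keep runs of
    letters and digits, allowing single dots between them."""
    return re.findall(r"[a-z0-9]+(?:\.[a-z0-9]+)*", text.lower())


def keyword(text):
    """The keyword analyzer keeps the whole input as one token."""
    return [text]


def term_query(indexed_tokens, value):
    """A term query looks the value up as-is, without analyzing it."""
    return value in indexed_tokens


def match_query(indexed_tokens, value, analyzer):
    """A match query analyzes the input first; here any shared token is a hit."""
    return any(token in indexed_tokens for token in analyzer(value))


tokens_default = standard_like(URL)  # what the default analyzer would index
tokens_as_is = keyword(URL)          # what a not_analyzed field would index

no_hits = term_query(tokens_default, URL)               # False
success = term_query(tokens_as_is, URL)                 # True
aware = match_query(tokens_default, URL, standard_like)  # True
```

The term query fails on the analyzed field because the full URL is not one of the indexed tokens, while the match query succeeds because it runs the same analysis on the input.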

Analysis
[Diagram labels: Search, Search analyzer, Elasticsearch index, Index analyzer, Index]
• Decide carefully which fields searches will be executed on frequently
• Determine which analyzer to use for each field (experimentally, based on application needs)
• The search analyzer and the index analyzer might be different for the same field
• Use a match query instead of a term filter / term query: the match query is aware of the field's analyzer
• Exploit multi-fields
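
As an illustration of these points, an index definition could look roughly like the request body below, built as a Python dictionary. It assumes Elasticsearch 2.x-era syntax (`string` type, `not_analyzed`); the type name `article`, the field names, and the analyzer and filter names are all hypothetical.

```python
import json

index_body = {
    "settings": {
        "analysis": {
            "filter": {
                # Edge n-gram token filter used only at index time
                "prefixes": {"type": "edgeNGram", "min_gram": 3, "max_gram": 10}
            },
            "analyzer": {
                # Custom analyzer: keyword tokenizer + lowercase filter
                "lowercase_keyword": {
                    "type": "custom",
                    "tokenizer": "keyword",
                    "filter": ["lowercase"],
                },
                "prefix_index": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "prefixes"],
                },
            },
        }
    },
    "mappings": {
        "article": {
            "properties": {
                "url": {"type": "string", "analyzer": "lowercase_keyword"},
                "title": {
                    "type": "string",
                    "analyzer": "standard",
                    # Multi-field: the same text indexed in several ways
                    "fields": {
                        "raw": {"type": "string", "index": "not_analyzed"},
                        "prefix": {
                            "type": "string",
                            "analyzer": "prefix_index",
                            "search_analyzer": "standard",
                        },
                    },
                },
            }
        }
    },
}

request_json = json.dumps(index_body, indent=2)
```

Here `title.prefix` uses a different analyzer at index time and at search time, and `title.raw` keeps the untouched value for exact matching and faceting.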

Faceting and filtering

Do you mean
• “grok” -> “grokking”, “sear” -> “search”
• Natural approach:
  • Compute a terms aggregation (facet) across all text fields
    • title
    • description
    • content
  • Use a regex to filter matching terms, sort by frequency (descending), and suggest the most popular terms
• DON’T!!!

Do you mean
• Limitations
  • Single terms only; cannot suggest phrases
  • Frequently occurring terms might not be useful
• Improvements
  • Build another field, “phrases”, in the document
    • Add the entire title
    • Use key phrase extraction and named entity recognition to populate meaningful phrases
  • Custom tokenizers: keyword, edgeNGram
    • edgeNGram example: “grokking” => “gro”, “grok”, “grokk”
    • Query: “burs mal” => matched: “bursa malaysia”
    • Memory explosion!!!
  • Custom scoring (importance, popularity score) instead of term frequency
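
A minimal Python sketch of the edge n-gram idea follows. It is an in-memory illustration, not the Elasticsearch tokenizer itself; the `min_gram=3` and `max_gram=5` defaults are assumptions chosen to reproduce the “grokking” example.

```python
def edge_ngrams(token, min_gram=3, max_gram=5):
    """Return the prefixes of `token` with lengths min_gram..max_gram."""
    upper = min(max_gram, len(token))
    return [token[:n] for n in range(min_gram, upper + 1)]


def build_index(phrases, min_gram=3, max_gram=5):
    """Map every edge n-gram of every word to the phrases containing it."""
    index = {}
    for phrase in phrases:
        for word in phrase.lower().split():
            for gram in edge_ngrams(word, min_gram, max_gram):
                index.setdefault(gram, set()).add(phrase)
    return index


def suggest(index, query):
    """Return phrases in which every query word matches an indexed n-gram."""
    results = None
    for word in query.lower().split():
        hits = index.get(word, set())
        results = hits if results is None else results & hits
    return sorted(results or [])


index = build_index(["bursa malaysia", "grokking techtalk"])
```

With this index, `suggest(index, "burs mal")` finds “bursa malaysia”. The index stores one entry per prefix of every word, which is where the memory growth comes from.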

Do you mean
• Elasticsearch built-in suggester
  • FST (finite state transducer) example. Source: https://www.elastic.co/blog/you-complete-me
• Features:
  • Speed & scale: FST built per segment, in real time; scales horizontally
  • Analysis: synonyms, fuzzy matching
  • Supports custom ordering and scoring
• Limitations: matches from the start of the input only; can’t find a word anywhere within a phrase
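
For orientation, the request bodies for the built-in completion suggester might look roughly like the dictionaries below. This sketch assumes the completion suggester as documented for Elasticsearch 1.x/2.x (newer versions use a different request shape); the field name `suggest`, the suggestion name `article-suggest`, and the sample document are hypothetical.

```python
# Mapping fragment: a dedicated field of type "completion"
mapping = {"properties": {"suggest": {"type": "completion"}}}

# Document to index. Listing a second input with the words swapped is one way
# to work around the "start of the input only" limitation.
doc = {
    "title": "Bursa Malaysia",
    "suggest": {"input": ["bursa malaysia", "malaysia bursa"], "weight": 10},
}


def suggest_body(prefix, fuzzy=False):
    """Build a suggest request body for the given prefix."""
    completion = {"field": "suggest"}
    if fuzzy:
        # Tolerate one typo in the prefix
        completion["fuzzy"] = {"fuzziness": 1}
    return {"article-suggest": {"text": prefix, "completion": completion}}
```

The `weight` value is where a custom importance or popularity score would go, in place of raw term frequency.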

Do you mean
• Speed test: 1 million articles, 2.7 GB index size, on a single laptop with SSD

  Regex terms facet    Terms suggester
  296.5 ms             13 ms

• Cautions
  • Don’t add all terms/phrases to the suggestions (only meaningful ones!)
  • Don’t start suggesting immediately. How many words start with “c”?
  • Don’t suggest terms that yield no search results
    • Apply the same filter conditions as the current query to the term suggestion query

Percolator
• Percolate: match documents against queries (the reverse of a normal search: queries are registered first, then each incoming document is matched against them)
• Sample use case: segmenting articles using keywords
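
The idea can be illustrated with a small in-memory emulation. This is not the Elasticsearch percolator API; the segment names and keyword sets are invented, and each registered "query" is simplified to a set of keywords.

```python
import re

# Hypothetical segments: each registered query is a set of keywords.
registered_queries = {
    "finance": {"bursa", "stock", "shares"},
    "technology": {"elasticsearch", "lucene", "search"},
}


def percolate(document_text, queries):
    """Return the names of all registered queries that the document matches."""
    tokens = set(re.findall(r"[a-z0-9]+", document_text.lower()))
    return sorted(name for name, keywords in queries.items() if tokens & keywords)
```

An incoming article such as "Elasticsearch is built on Lucene" would be assigned to the `technology` segment, without running one search per segment.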

Recommendation
• Natural approach
  • More-like-this or fuzzy-like-this on title, content
    • Not accurate: bag-of-words approach
    • Tricky to determine the threshold: a “good value” varies across document types and domains
    • Slow: the more terms allowed in the query, the slower it is; if terms are cut off at a maximum, accuracy drops
• Proposed approaches
  • Utilize NLP results (modelling step):
    • Category: recommend articles from the same categories
    • Key phrases: match and rank documents w.r.t. the target document by key phrases
    • Named entities: model with a parent/child relationship
  • Combine with the function score feature to rescore results
    • Example: applying a Gauss decay function to favor more recent results
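
To show what a Gauss decay does to scores, here is a small Python sketch following the formula given in the Elasticsearch function score documentation, where the score falls to `decay` at a distance of `scale` (plus `offset`) from the origin. The `rescore` helper and its tuple layout are assumptions for illustration only.

```python
import math


def gauss_decay(value, origin, scale, offset=0.0, decay=0.5):
    """Gaussian decay: 1.0 at the origin, `decay` at distance scale + offset."""
    distance = max(0.0, abs(value - origin) - offset)
    sigma_sq = -(scale ** 2) / (2.0 * math.log(decay))
    return math.exp(-(distance ** 2) / (2.0 * sigma_sq))


def rescore(hits, now_days, scale_days=7.0):
    """Multiply each relevance score by a recency factor and re-sort.

    `hits` is a list of (doc_id, score, published_day) tuples.
    """
    rescored = [
        (doc_id, score * gauss_decay(published, now_days, scale_days))
        for doc_id, score, published in hits
    ]
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)
```

With a scale of 7 days, an article published a week ago keeps half of its relevance score, so a slightly less relevant but fresh article can outrank an old one.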

Recommendation
• Sophisticated scoring and ranking can be done outside of Elasticsearch
• Still, Elasticsearch can be tapped for its faceting and filtering capabilities

Deduplication
• Natural approach
  • Term matching on URL, title
    • Fails if these are slightly different (very common!)
  • More-like-this or fuzzy-like-this on content, with a high matching threshold: e.g. 70%, 80%
    • Not accurate: bag-of-words approach
    • Tricky to determine the threshold: a “good value” varies across document types and domains
    • Slow: the more terms allowed in the query, the slower it is; if terms are cut off at a maximum, accuracy drops
• Proposed approach
  • Semantic hashing: minhash, simhash
    • For a document, compute a hash value
    • Convert the hash value to binary string form
    • Robust and efficient; can cater to near-duplicates
  • Implement Hamming distance search using Elasticsearch fuzzy_like_this
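
The hashing steps can be sketched in Python as follows. This is a simplified, unweighted simhash that uses MD5 as the per-token hash, which is an assumption made for the sake of a self-contained example; the talk does not say which variant or token weighting was used.

```python
import hashlib
import re


def simhash(text, bits=64):
    """Compute a simhash fingerprint: tokens vote on every bit position."""
    counts = [0] * bits
    for token in re.findall(r"[a-z0-9]+", text.lower()):
        digest = hashlib.md5(token.encode("utf-8")).digest()
        token_hash = int.from_bytes(digest[: bits // 8], "big")
        for i in range(bits):
            counts[i] += 1 if (token_hash >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if counts[i] > 0)


def to_binary(value, bits=64):
    """Render the fingerprint as a fixed-width binary string."""
    return format(value, "0{}b".format(bits))


def hamming(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")
```

Two documents are treated as near-duplicates when the Hamming distance between their fingerprints is below a small threshold; that threshold still has to be tuned per data set.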

Deduplication
• Do not index duplicates at all
or
• Collapse similar items in search results, display only the one with the highest score
  • Assign the same id to articles that are duplicates (call it groupid)
  • Use the Elasticsearch Top Hits aggregation to collapse results by groupid
• Example
  ⇒ 64-bit hash: 1000010001000111101001011011110010111101000011100101101001011101
  Modified version: 1010010001000111101011011011110010111101000011100101101000011101
  Hamming distance: 3
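
Collapsing by groupid might be expressed with a request body along these lines. It is a sketch built around the `terms` and `top_hits` aggregations; the helper name, the aggregation names, and the bucket limit are hypothetical, and buckets come back in the default order (by document count) rather than by best score.

```python
def collapse_by_group(query, group_field="groupid", max_groups=10):
    """Build a search body that returns one best hit per group."""
    return {
        "size": 0,  # skip the plain hit list, only aggregations are needed
        "query": query,
        "aggs": {
            "groups": {
                "terms": {"field": group_field, "size": max_groups},
                "aggs": {
                    # top_hits sorts by score by default; keep one per group
                    "best_hit": {"top_hits": {"size": 1}}
                },
            }
        },
    }


body = collapse_by_group({"match": {"content": "bursa malaysia"}})
```

Each bucket of the response then carries a single representative article for its group of duplicates.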

Further reading
• Dismax vs bool queries
• Term vs text queries
• Filter vs filtered
• Facets (old) vs aggregations (facets reborn + statistics)
• Geo

Summary
• ES is very flexible, with numerous features and knobs
• It is critical to understand basic analysis and the different types of queries
• Indexing time vs search time tradeoff
• Precision vs recall tradeoff
• Complexity and memory estimation
• Use NLP techniques as a modelling step to improve search quality
• Pay great attention to data input and the data gathering step

About Knorex
• Founded in 2010 as a spin-off from the Data Mining Dept. of A*STAR, Singapore
• Mission: enabling our customers to make smarter discoveries and turn them into actionable insight

https://www.knorex.com
https://itviec.com/companies/knorex
Thank you

TechTalk #13 Grokking: Marrying Elasticsearch with NLP to solve real-world search problems


Editor's Notes

  • #7 This round, our team will give you more updates on the Deep Learning effort and KGen, as we promised. In between, we will also share the integration status of Lumina Web Services. RTB will be left to another session. For each part, I will share some key challenges we face and what’s next. KGen will be covered in more detail by Yiping.
  • #8 Lazy crawler
  • #9 A definitive guide to Elasticsearch has to cover a lot of aspects and features. This presentation focuses on some common use cases we experienced when building our search solutions. I’ll first present the basic ingredients needed before we even start building a search solution. Crawlers: different types of crawlers are required.
  • #10, #11 Never underestimate the complexity of data gathering. Search is completely data driven: garbage in, garbage out. Automatically extracting information from websites is tricky. If the content comes from an image or a scanned PDF file, it is even harder (OCR and layout analysis required).
  • #21 Any time we find something that doesn’t match, examine its index / search analyzer configuration.
  • #31 This term matching and ranking is done in MongoDB. We take the ids of the matched documents, compose another query to ES using those ids, and enjoy faceting. Concern: this becomes a problem if the list of ids is long.