© 2016 Knorex
Marrying Elasticsearch with NLP to solve real-world search problems
Phu Le, Knorex
@ Grokking TechTalk
25 June 2016
Web : http://knorex.com
Email : info@knorex.com
Knorex Lumina Web Services™
Outline
1. Architecture
2. Ingredients
• Data gathering
• Content extraction
• Preprocessing
• Modelling: terms -> phrases, entities -> documents
3. Elasticsearch
• Basic analysis, faceting and filtering
• Do you mean
• Percolator
• Recommendation
• Deduplication
4. Summary
Architecture
Ingredients
1. Data gathering
• Deep crawler
• Lazy crawler
• Visual scraper
• Social media adapters
2. Content extraction
• Take a news article as an example
  • Title
  • Content
  • Published date
  • Author
  • Image
  • …
Content extraction
Ingredients
3. Preprocessing
• Sentence splitting, tokenization
• Stemming vs lemmatizing
  • Stemming: cries, crying, cried => cri
  • Lemmatizing: dogs => dog; is, are => be
Ingredients
4. Modelling
• Goal: synthesize words and tokens into larger units and attach meaning to them
• Key phrase extraction
• Named entity recognition
  • Basic building block of knowledge
  • Basis for computing relatedness and extracting relations
• Sentiment analysis
  • Social media snippet
  • General article, or towards concepts / named entities
  • Emotion
• Document classification
  • Group search results into faceted categories
  • Recommend related articles by category
Terms
Phrases
Entities
Document classification
Elasticsearch
• First released in February 2010; among the fastest-growing open-source projects; total funding $104M (3 rounds)
• Based on Apache Lucene (same as Solr)
• Written in Java; exposes an HTTP interface and stores schema-free JSON documents (yay, no XML!)
• Designed to be scalable and distributed in nature
Analysis
"analyzer": "standard" | "analyzer": "whitespace" | "analyzer": "keyword"
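To make the difference between the three analyzers named on this slide concrete, here is a rough sketch that imitates them in plain Python. These functions are simplified stand-ins, not Elasticsearch code: the real standard analyzer follows Unicode text segmentation rules, which the regular expression below only approximates for simple ASCII text. The sample sentence is made up.

```python
import re

def standard_like(text):
    # Approximation of the "standard" analyzer: split on non-word
    # characters and lowercase every token.
    return [tok.lower() for tok in re.findall(r"\w+", text)]

def whitespace(text):
    # "whitespace" analyzer: split on whitespace only, keep case and punctuation.
    return text.split()

def keyword(text):
    # "keyword" analyzer: the whole input becomes one single token.
    return [text]

sample = "Grokking TechTalk, June-2016!"
for name, analyze in [("standard", standard_like),
                      ("whitespace", whitespace),
                      ("keyword", keyword)]:
    print(name, analyze(sample))
```

The same text therefore yields four lowercase terms, three case-preserving terms, or one single term, which is why the choice of analyzer decides what a later query can match.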
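Analysis
• URL indexed with the default analyzer is split into tokens: ["https", "www.facebook.com", "events", "194454270949757"]
• URL indexed as-is is kept as a single term
• A query for the full URL is either "Successful!" or "No hits! WTH… it is not working!!!!", depending on the query type and how the field was analyzed
• url => not_analyzed / keyword analyzer
• Use a match query instead of a term filter / term query: match is aware of the field's analyzer
• Custom analyzer: e.g. keyword tokenizer + lowercase filter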
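The "no hits" surprise with URLs can be reproduced without a cluster. The sketch below is a simplified simulation, not Elasticsearch itself: `default_like` only approximates the default analyzer for this one URL, and `term_query` / `match_query` are hypothetical helpers that mimic the key difference (a term query skips analysis, a match query analyzes the query text first). The analyzer name `url_exact` in the settings fragment is also made up.

```python
import re

URL = "https://www.facebook.com/events/194454270949757"

def default_like(text):
    # Stand-in for the default analyzer on a URL: keeps dotted host
    # names together, splits on other punctuation, lowercases.
    pattern = r"[A-Za-z0-9]+(?:\.[A-Za-z0-9]+)*"
    return [t.lower() for t in re.findall(pattern, text)]

def keyword_lowercase(text):
    # Custom analyzer from the slide: keyword tokenizer + lowercase filter.
    return [text.lower()]

def term_query(indexed_terms, query_text):
    # A term query looks up the query text as-is, without analysis.
    return query_text in indexed_terms

def match_query(indexed_terms, query_text, analyzer):
    # A match query runs the query text through the field's analyzer first
    # (default OR semantics: any analyzed token may match).
    return any(tok in indexed_terms for tok in analyzer(query_text))

indexed_default = set(default_like(URL))
indexed_keyword = set(keyword_lowercase(URL))

print(sorted(indexed_default))
print("term on default-analyzed field:", term_query(indexed_default, URL))
print("match on default-analyzed field:", match_query(indexed_default, URL, default_like))
print("term on keyword+lowercase field:", term_query(indexed_keyword, URL))

# Index settings fragment defining the custom analyzer (name is hypothetical).
custom_analyzer_settings = {
    "settings": {"analysis": {"analyzer": {"url_exact": {
        "type": "custom", "tokenizer": "keyword", "filter": ["lowercase"]}}}}
}
```

The term query fails on the default-analyzed field because the full URL never exists there as one term; it only exists as four separate tokens.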
Analysis
[Diagram: search requests pass through the search analyzer and indexed documents pass through the index analyzer on their way to the Elasticsearch index]
• Design carefully which fields searches will run on frequently
• Determine which analyzer to use for each field (experiment based on application needs)
• The search analyzer and the index analyzer may differ for the same field
• Use a match query instead of a term filter / term query: match is aware of the field's analyzer
• Exploit multi-fields
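As an illustration of separate index and search analyzers plus a multi-field, here is a sketch of an index definition written as a Python dictionary. It assumes the typeless mapping syntax of Elasticsearch 7.x or later; the versions current at the time of the talk used a different syntax (for example `string` fields with `not_analyzed`). The field name `title`, the sub-field `raw`, and the names `prefix_filter` and `prefix_index` are hypothetical.

```python
import json

index_body = {
    "settings": {
        "analysis": {
            "filter": {
                # Emits prefixes of each token at index time.
                "prefix_filter": {"type": "edge_ngram", "min_gram": 2, "max_gram": 10}
            },
            "analyzer": {
                "prefix_index": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "prefix_filter"],
                }
            },
        }
    },
    "mappings": {
        "properties": {
            "title": {
                "type": "text",
                # Index time: expand tokens into prefixes.
                "analyzer": "prefix_index",
                # Search time: keep the user's tokens as typed.
                "search_analyzer": "standard",
                # Multi-field: the same value, also stored unanalyzed.
                "fields": {"raw": {"type": "keyword"}},
            }
        }
    },
}

print(json.dumps(index_body, indent=2))
```

With a definition like this, `title` would serve prefix-style matching while `title.raw` would serve exact matching, sorting, and aggregations.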
Faceting and filtering
Do you mean
• "grok" -> "grokking", "sear" -> "search"
• Natural approach:
  • Compute a terms aggregation (facet) across all text fields
    • title
    • description
    • content
  • Use a regex to filter matching terms, sort by frequency in descending order, and suggest the most popular terms
DON'T!!!
Do you mean
• Limitations
  • Single terms only; cannot suggest phrases
  • Frequently occurring terms might not be useful
• Improvements
  • Build another field, "phrases", in the document
    • Add the entire title
    • Use key phrase extraction and named entity recognition to populate meaningful phrases
  • Custom tokenizers: keyword, edgeNGram
    • edgeNGram example: "grokking" => "gro", "grok", "grokk"
    • Query: "burs mal" => matched: "bursa malaysia"
    • Memory explosion!!!
  • Custom scoring (importance, popularity score) instead of term frequency
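The edgeNGram idea and its memory cost can be sketched in a few lines of Python. This is an illustration only, not the Elasticsearch tokenizer; the `min_gram` / `max_gram` defaults and the helper names are assumptions chosen to reproduce the slide's examples.

```python
def edge_ngrams(word, min_gram=3, max_gram=5):
    # Prefixes of the word, from min_gram up to max_gram characters.
    longest = min(max_gram, len(word))
    return [word[:n] for n in range(min_gram, longest + 1)]

def phrase_terms(phrase, min_gram=2, max_gram=10):
    # All edge n-grams of every word in the phrase: this is what gets indexed.
    terms = set()
    for word in phrase.lower().split():
        terms.update(edge_ngrams(word, min_gram, max_gram))
    return terms

def matches(query, phrase):
    # Every query token must equal one of the indexed prefixes.
    terms = phrase_terms(phrase)
    return all(tok in terms for tok in query.lower().split())

print(edge_ngrams("grokking"))                 # ['gro', 'grok', 'grokk']
print(matches("burs mal", "bursa malaysia"))   # True
print(len(phrase_terms("bursa malaysia")))     # 11 indexed terms for 2 words
```

Two words already expand to eleven indexed terms here, which hints at why the slide warns about memory explosion on a large corpus.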
Do you mean
• Elasticsearch built-in suggester
• FST (finite state transducer) example. Source: https://www.elastic.co/blog/you-complete-me
• Features:
  • Speed & scale: FST built per segment, in real time; scales horizontally
  • Analysis: synonyms, fuzzy matching
  • Supports custom ordering and scoring
• Limitations: cannot find a word anywhere within a phrase
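The behaviour of a prefix-based suggester, including the limitation noted above, can be imitated with a sorted list and binary search. This is a stand-in for illustration: the real suggester uses an FST, and the class name, the `min_chars` parameter, and the sample phrases and weights are all made up.

```python
import bisect

class PrefixSuggester:
    def __init__(self, weighted_phrases):
        # weighted_phrases: dict of phrase -> weight (e.g. a popularity score).
        self._weights = {p.lower(): w for p, w in weighted_phrases.items()}
        self._sorted = sorted(self._weights)

    def suggest(self, prefix, size=5, min_chars=2):
        prefix = prefix.lower()
        if len(prefix) < min_chars:
            return []  # too short to give meaningful suggestions
        # Binary search for the first phrase >= prefix, then scan forward
        # while phrases still start with the prefix.
        start = bisect.bisect_left(self._sorted, prefix)
        hits = []
        for phrase in self._sorted[start:]:
            if not phrase.startswith(prefix):
                break
            hits.append(phrase)
        # Custom ordering: highest weight first instead of term frequency.
        hits.sort(key=lambda p: -self._weights[p])
        return hits[:size]

suggester = PrefixSuggester({
    "bursa malaysia": 8,
    "bursa saham": 3,
    "search engine": 5,
    "grokking": 9,
})
print(suggester.suggest("burs"))      # ['bursa malaysia', 'bursa saham']
print(suggester.suggest("malaysia"))  # []: not a prefix of any phrase
print(suggester.suggest("g"))         # []: below min_chars
```

The second call shows the limitation in practice: "malaysia" appears inside an indexed phrase but is never suggested, because lookup works from the start of the phrase only.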
Do you mean
• Speed test: 1 million articles, 2.7 GB index size, on a single laptop with SSD
  Regex terms facet | Terms suggester
  296.5 ms          | 13 ms
• Cautions
  • Don't add all terms/phrases to the suggestions (only meaningful ones!)
  • Don't start suggesting immediately. How many words start with "c"?
  • Don't suggest terms that yield no search results
    • Apply the same filter conditions as the current query to the term suggestion query
Percolator
• percolate: match documents against queries
Percolator
• Sample use case: segmenting articles using keywords
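The reversed direction of percolation (queries are stored first, documents are matched against them later) can be sketched in plain Python. This is a toy model under stated assumptions, not the Elasticsearch percolator: each stored query is just a set of keywords, and the class name, segment names, and sample article are hypothetical.

```python
import re

def tokens(text):
    # Lowercased word tokens of a document.
    return set(re.findall(r"\w+", text.lower()))

class KeywordPercolator:
    def __init__(self):
        self._queries = {}

    def register(self, query_id, keywords):
        # Store the query up front; here a query is a set of keywords,
        # any one of which is enough to match.
        self._queries[query_id] = {k.lower() for k in keywords}

    def percolate(self, document_text):
        # Return the ids of all stored queries that match this document.
        doc_tokens = tokens(document_text)
        return sorted(qid for qid, kws in self._queries.items()
                      if kws & doc_tokens)

percolator = KeywordPercolator()
percolator.register("finance", ["stock", "bursa", "shares"])
percolator.register("tech", ["elasticsearch", "nlp"])

print(percolator.percolate("Bursa Malaysia shares rise"))  # ['finance']
```

Each incoming article is thus assigned to every segment whose stored keyword query it satisfies, without running one search per segment.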
Recommendation
• Natural approach
  • more_like_this or fuzzy_like_this queries on title, content
    • Not accurate: bag-of-words approach
    • Tricky to determine the threshold: a "good value" varies across document types and domains
    • Slow: the more terms allowed in the query, the slower it is; cutting off at a maximum number of terms lowers accuracy
• Proposed approaches
  • Utilize NLP results (modelling step):
    • Category: recommend articles from the same categories
    • Key phrases: match and rank documents against the target document by key phrases
    • Named entities: model with parent/child relationships
  • Combine with the function score feature to rescore results
    • Example: apply a Gauss decay function to favor more recent results
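A minimal sketch of ranking by key phrases and then rescoring with a Gauss decay on article age. The decay formula follows the one documented for Elasticsearch's function score query; the Jaccard overlap measure, the multiplication of the two scores, the 30-day scale, and the sample candidates are assumptions made for illustration, not the speaker's implementation.

```python
import math

def gauss_decay(value, origin, scale, offset=0.0, decay=0.5):
    # Gaussian decay: the score equals `decay` when the value lies
    # `scale` (plus `offset`) away from `origin`.
    sigma_sq = -(scale ** 2) / (2.0 * math.log(decay))
    distance = max(0.0, abs(value - origin) - offset)
    return math.exp(-(distance ** 2) / (2.0 * sigma_sq))

def phrase_overlap(target_phrases, candidate_phrases):
    # Jaccard overlap between two sets of key phrases.
    target, candidate = set(target_phrases), set(candidate_phrases)
    union = target | candidate
    return len(target & candidate) / len(union) if union else 0.0

def recommend(target_phrases, candidates, scale_days=30.0):
    # candidates: (doc_id, key_phrases, age_in_days) tuples.
    scored = []
    for doc_id, phrases, age_days in candidates:
        relevance = phrase_overlap(target_phrases, phrases)
        recency = gauss_decay(age_days, 0.0, scale_days)
        scored.append((doc_id, relevance * recency))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

candidates = [
    ("a", {"elasticsearch", "nlp"}, 60),
    ("b", {"elasticsearch", "search"}, 5),
    ("c", {"cooking"}, 1),
]
for doc_id, score in recommend({"elasticsearch", "nlp", "search"}, candidates):
    print(doc_id, round(score, 3))
```

Documents "a" and "b" share the same phrase overlap with the target, but the five-day-old "b" ranks first because the decay penalizes the sixty-day-old "a".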
Recommendation
• Sophisticated scoring and ranking can be done outside of Elasticsearch
• Still, you can tap Elasticsearch for its faceting and filtering capabilities
Deduplication
• Natural approach
  • Term matching on URL, title
    • Fails if these are slightly different (very common!)
  • more_like_this or fuzzy_like_this on content, with a high matching threshold, e.g. 70%, 80%
    • Not accurate: bag-of-words approach
    • Tricky to determine the threshold: a "good value" varies across document types and domains
    • Slow: the more terms allowed in the query, the slower it is; cutting off at a maximum number of terms lowers accuracy
• Proposed approach
  • Semantic hashing: minhash, simhash
    • For a document, compute a hash value
    • Convert the hash value to binary string form
    • Robust and efficient; can cater to near-duplicates
  • Implement Hamming distance search using Elasticsearch fuzzy_like_this
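The hashing steps above can be sketched as a small simhash in Python. This is one common formulation with unweighted word tokens and MD5 as the per-token hash, both chosen here for illustration; the talk does not say which variant or token weighting was used, and the Hamming distance is computed in Python rather than inside Elasticsearch.

```python
import hashlib
import re

def simhash(text, bits=64):
    # Each token votes +1 / -1 on every bit position; the sign of the
    # total decides the corresponding bit of the fingerprint.
    counts = [0] * bits
    for token in re.findall(r"\w+", text.lower()):
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        for i in range(bits):
            counts[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if counts[i] > 0)

def to_binary(fingerprint, bits=64):
    # Fixed-width binary string form of the fingerprint.
    return format(fingerprint, "0{}b".format(bits))

def hamming(a, b):
    # Number of bit positions in which two fingerprints differ.
    return bin(a ^ b).count("1")

original = "Elasticsearch makes search on news articles fast and scalable"
near_dup = "Elasticsearch makes search on news articles fast and very scalable"

fp_a, fp_b = simhash(original), simhash(near_dup)
print(to_binary(fp_a))
print(to_binary(fp_b))
print("Hamming distance:", hamming(fp_a, fp_b))
```

Near-duplicate texts share most tokens, so most bit votes come out the same and the fingerprints tend to differ in only a few positions; a small distance threshold then flags them as duplicates.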
Deduplication
• Do not index duplicates at all
or
• Collapse similar items in search results, display only the one with the highest score
  • Assign the same id to articles that are duplicates (call it groupid)
  • Use the Elasticsearch top_hits aggregation to collapse results by groupid
⇒ 64-bit hash: 1000010001000111101001011011110010111101000011100101101001011101
Modified version: 1010010001000111101011011011110010111101000011100101101000011101
Hamming distance: 3
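The slide's Hamming distance can be checked directly, and collapsing by groupid can be sketched both client-side and as an aggregation body. The helper names and sample hits are hypothetical, and the aggregation is an assumed shape (a terms aggregation on `groupid` with a nested `top_hits` of size 1), not a request taken from the talk.

```python
# The two 64-bit hashes shown on the slide.
HASH_A = ("1000010001000111101001011011110010111101000011100"
          "101101001011101")
HASH_B = ("1010010001000111101011011011110010111101000011100"
          "101101000011101")

def hamming_bits(a, b):
    # Count differing positions between two equal-length bit strings.
    if len(a) != len(b):
        raise ValueError("bit strings must have the same length")
    return sum(x != y for x, y in zip(a, b))

def collapse(hits):
    # Keep only the highest-scoring hit for each groupid.
    best = {}
    for hit in hits:
        current = best.get(hit["groupid"])
        if current is None or hit["score"] > current["score"]:
            best[hit["groupid"]] = hit
    return sorted(best.values(), key=lambda h: h["score"], reverse=True)

# Assumed aggregation body: one bucket per groupid, best hit per bucket.
collapse_aggregation = {
    "size": 0,
    "aggs": {
        "by_group": {
            "terms": {"field": "groupid"},
            "aggs": {"best": {"top_hits": {"size": 1}}},
        }
    },
}

hits = [
    {"id": "n1", "groupid": "g1", "score": 2.4},
    {"id": "n2", "groupid": "g1", "score": 3.1},
    {"id": "n3", "groupid": "g2", "score": 1.7},
]
print(hamming_bits(HASH_A, HASH_B))       # 3
print([h["id"] for h in collapse(hits)])  # ['n2', 'n3']
```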
Further reading
• Dismax vs bool queries
• Term vs text queries
• Filter vs filtered
• Facets (old) vs aggregations (facets reborn + statistics)
• Geo
Summary
• ES is very flexible with numerous features and knobs
• Critical to understand basic analysis, different types of queries
• Indexing time and search time tradeoff
• Precision and recall tradeoff
• Complexity and memory estimation
• Use NLP techniques as modelling step to improve search quality
• Pay great attention to data input and data gathering step
About Knorex
• Founded in 2010 as a spin-off from the Data Mining Dept. of A*STAR, Singapore
• Mission: enabling our customers to make smarter discoveries and turn them into actionable insight
https://www.knorex.com
https://itviec.com/companies/knorex
Thank you

More Related Content

What's hot

MongoDB Schema Design: Four Real-World Examples
MongoDB Schema Design: Four Real-World ExamplesMongoDB Schema Design: Four Real-World Examples
MongoDB Schema Design: Four Real-World ExamplesMike Friedman
 
Introduction to MongoDB
Introduction to MongoDBIntroduction to MongoDB
Introduction to MongoDBMike Dirolf
 
Grokking Techtalk #37: Data intensive problem
 Grokking Techtalk #37: Data intensive problem Grokking Techtalk #37: Data intensive problem
Grokking Techtalk #37: Data intensive problemGrokking VN
 
Maintaining Consistency Across Data Centers (Randy Fradin, BlackRock) | Cassa...
Maintaining Consistency Across Data Centers (Randy Fradin, BlackRock) | Cassa...Maintaining Consistency Across Data Centers (Randy Fradin, BlackRock) | Cassa...
Maintaining Consistency Across Data Centers (Randy Fradin, BlackRock) | Cassa...DataStax
 
검색엔진이 데이터를 다루는 법 김종민
검색엔진이 데이터를 다루는 법 김종민검색엔진이 데이터를 다루는 법 김종민
검색엔진이 데이터를 다루는 법 김종민종민 김
 
Kubernetes Forum Seoul 2019: Re-architecting Data Platform with Kubernetes
Kubernetes Forum Seoul 2019: Re-architecting Data Platform with KubernetesKubernetes Forum Seoul 2019: Re-architecting Data Platform with Kubernetes
Kubernetes Forum Seoul 2019: Re-architecting Data Platform with KubernetesSeungYong Oh
 
MongoDB Aggregation Framework
MongoDB Aggregation FrameworkMongoDB Aggregation Framework
MongoDB Aggregation FrameworkCaserta
 
[pgday.Seoul 2022] 서비스개편시 PostgreSQL 도입기 - 진소린 & 김태정
[pgday.Seoul 2022] 서비스개편시 PostgreSQL 도입기 - 진소린 & 김태정[pgday.Seoul 2022] 서비스개편시 PostgreSQL 도입기 - 진소린 & 김태정
[pgday.Seoul 2022] 서비스개편시 PostgreSQL 도입기 - 진소린 & 김태정PgDay.Seoul
 
Grokking TechTalk #33: High Concurrency Architecture at TIKI
Grokking TechTalk #33: High Concurrency Architecture at TIKIGrokking TechTalk #33: High Concurrency Architecture at TIKI
Grokking TechTalk #33: High Concurrency Architecture at TIKIGrokking VN
 
Amazon Aurora Deep Dive (김기완) - AWS DB Day
Amazon Aurora Deep Dive (김기완) - AWS DB DayAmazon Aurora Deep Dive (김기완) - AWS DB Day
Amazon Aurora Deep Dive (김기완) - AWS DB DayAmazon Web Services Korea
 
Materialized Views and Secondary Indexes in Scylla: They Are finally here!
Materialized Views and Secondary Indexes in Scylla: They Are finally here!Materialized Views and Secondary Indexes in Scylla: They Are finally here!
Materialized Views and Secondary Indexes in Scylla: They Are finally here!ScyllaDB
 
Introduction to MongoDB
Introduction to MongoDBIntroduction to MongoDB
Introduction to MongoDBRavi Teja
 
InfluxDB IOx Tech Talks: Query Engine Design and the Rust-Based DataFusion in...
InfluxDB IOx Tech Talks: Query Engine Design and the Rust-Based DataFusion in...InfluxDB IOx Tech Talks: Query Engine Design and the Rust-Based DataFusion in...
InfluxDB IOx Tech Talks: Query Engine Design and the Rust-Based DataFusion in...InfluxData
 
When to Use MongoDB...and When You Should Not...
When to Use MongoDB...and When You Should Not...When to Use MongoDB...and When You Should Not...
When to Use MongoDB...and When You Should Not...MongoDB
 
Elasticsearch for beginners
Elasticsearch for beginnersElasticsearch for beginners
Elasticsearch for beginnersNeil Baker
 

What's hot (20)

MongoDB Schema Design: Four Real-World Examples
MongoDB Schema Design: Four Real-World ExamplesMongoDB Schema Design: Four Real-World Examples
MongoDB Schema Design: Four Real-World Examples
 
Introduction to MongoDB
Introduction to MongoDBIntroduction to MongoDB
Introduction to MongoDB
 
Druid
DruidDruid
Druid
 
Grokking Techtalk #37: Data intensive problem
 Grokking Techtalk #37: Data intensive problem Grokking Techtalk #37: Data intensive problem
Grokking Techtalk #37: Data intensive problem
 
Deep Dive on Amazon Redshift
Deep Dive on Amazon RedshiftDeep Dive on Amazon Redshift
Deep Dive on Amazon Redshift
 
Maintaining Consistency Across Data Centers (Randy Fradin, BlackRock) | Cassa...
Maintaining Consistency Across Data Centers (Randy Fradin, BlackRock) | Cassa...Maintaining Consistency Across Data Centers (Randy Fradin, BlackRock) | Cassa...
Maintaining Consistency Across Data Centers (Randy Fradin, BlackRock) | Cassa...
 
검색엔진이 데이터를 다루는 법 김종민
검색엔진이 데이터를 다루는 법 김종민검색엔진이 데이터를 다루는 법 김종민
검색엔진이 데이터를 다루는 법 김종민
 
MongodB Internals
MongodB InternalsMongodB Internals
MongodB Internals
 
Kubernetes Forum Seoul 2019: Re-architecting Data Platform with Kubernetes
Kubernetes Forum Seoul 2019: Re-architecting Data Platform with KubernetesKubernetes Forum Seoul 2019: Re-architecting Data Platform with Kubernetes
Kubernetes Forum Seoul 2019: Re-architecting Data Platform with Kubernetes
 
MongoDB Aggregation Framework
MongoDB Aggregation FrameworkMongoDB Aggregation Framework
MongoDB Aggregation Framework
 
Mongo DB
Mongo DB Mongo DB
Mongo DB
 
[pgday.Seoul 2022] 서비스개편시 PostgreSQL 도입기 - 진소린 & 김태정
[pgday.Seoul 2022] 서비스개편시 PostgreSQL 도입기 - 진소린 & 김태정[pgday.Seoul 2022] 서비스개편시 PostgreSQL 도입기 - 진소린 & 김태정
[pgday.Seoul 2022] 서비스개편시 PostgreSQL 도입기 - 진소린 & 김태정
 
Grokking TechTalk #33: High Concurrency Architecture at TIKI
Grokking TechTalk #33: High Concurrency Architecture at TIKIGrokking TechTalk #33: High Concurrency Architecture at TIKI
Grokking TechTalk #33: High Concurrency Architecture at TIKI
 
Amazon Aurora Deep Dive (김기완) - AWS DB Day
Amazon Aurora Deep Dive (김기완) - AWS DB DayAmazon Aurora Deep Dive (김기완) - AWS DB Day
Amazon Aurora Deep Dive (김기완) - AWS DB Day
 
Materialized Views and Secondary Indexes in Scylla: They Are finally here!
Materialized Views and Secondary Indexes in Scylla: They Are finally here!Materialized Views and Secondary Indexes in Scylla: They Are finally here!
Materialized Views and Secondary Indexes in Scylla: They Are finally here!
 
Introduction to MongoDB
Introduction to MongoDBIntroduction to MongoDB
Introduction to MongoDB
 
InfluxDB IOx Tech Talks: Query Engine Design and the Rust-Based DataFusion in...
InfluxDB IOx Tech Talks: Query Engine Design and the Rust-Based DataFusion in...InfluxDB IOx Tech Talks: Query Engine Design and the Rust-Based DataFusion in...
InfluxDB IOx Tech Talks: Query Engine Design and the Rust-Based DataFusion in...
 
When to Use MongoDB...and When You Should Not...
When to Use MongoDB...and When You Should Not...When to Use MongoDB...and When You Should Not...
When to Use MongoDB...and When You Should Not...
 
Elasticsearch for beginners
Elasticsearch for beginnersElasticsearch for beginners
Elasticsearch for beginners
 
Event-sourced architectures with Akka
Event-sourced architectures with AkkaEvent-sourced architectures with Akka
Event-sourced architectures with Akka
 

Viewers also liked

Querying your database in natural language by Daniel Moisset PyData SV 2014
Querying your database in natural language by Daniel Moisset PyData SV 2014Querying your database in natural language by Daniel Moisset PyData SV 2014
Querying your database in natural language by Daniel Moisset PyData SV 2014PyData
 
Natural language search using Neo4j
Natural language search using Neo4jNatural language search using Neo4j
Natural language search using Neo4jKenny Bastani
 
TechTalk #15 Grokking: The data processing journey at AhaMove
TechTalk #15 Grokking:  The data processing journey at AhaMoveTechTalk #15 Grokking:  The data processing journey at AhaMove
TechTalk #15 Grokking: The data processing journey at AhaMoveGrokking VN
 
NLIDB(Natural Language Interface to DataBases)
NLIDB(Natural Language Interface to DataBases)NLIDB(Natural Language Interface to DataBases)
NLIDB(Natural Language Interface to DataBases)Swetha Pallati
 
Natural Language Processing with Graph Databases and Neo4j
Natural Language Processing with Graph Databases and Neo4jNatural Language Processing with Graph Databases and Neo4j
Natural Language Processing with Graph Databases and Neo4jWilliam Lyon
 
Running Natural Language Queries on MongoDB
Running Natural Language Queries on MongoDBRunning Natural Language Queries on MongoDB
Running Natural Language Queries on MongoDBMongoDB
 
A Graph-based Clustering Scheme for Identifying Related Tags in Folksonomies
A Graph-based Clustering Scheme for Identifying Related Tags in FolksonomiesA Graph-based Clustering Scheme for Identifying Related Tags in Folksonomies
A Graph-based Clustering Scheme for Identifying Related Tags in FolksonomiesSymeon Papadopoulos
 
Enhance discovery Solr and Mahout
Enhance discovery Solr and MahoutEnhance discovery Solr and Mahout
Enhance discovery Solr and Mahoutlucenerevolution
 
Kernel Recipes 2014 - x86 instruction encoding and the nasty hacks we do in t...
Kernel Recipes 2014 - x86 instruction encoding and the nasty hacks we do in t...Kernel Recipes 2014 - x86 instruction encoding and the nasty hacks we do in t...
Kernel Recipes 2014 - x86 instruction encoding and the nasty hacks we do in t...Anne Nicolas
 
NLP approach for medical translation task
NLP approach for medical translation taskNLP approach for medical translation task
NLP approach for medical translation taskAnastasiia Kornilova
 
Are all CAC/NLP Vendors the Same?
Are all CAC/NLP Vendors the Same?Are all CAC/NLP Vendors the Same?
Are all CAC/NLP Vendors the Same?Kerry Fagan
 
Nlp based retrieval of medical information for diagnosis of human diseases
Nlp based retrieval of medical information for diagnosis of human diseasesNlp based retrieval of medical information for diagnosis of human diseases
Nlp based retrieval of medical information for diagnosis of human diseaseseSAT Journals
 
SQL Server Cross Platform Portable con Docker
SQL Server Cross Platform Portable con DockerSQL Server Cross Platform Portable con Docker
SQL Server Cross Platform Portable con DockerChristian Melendez
 
HealthCare Data Mining and Natural Language Processing
HealthCare Data Mining and Natural Language ProcessingHealthCare Data Mining and Natural Language Processing
HealthCare Data Mining and Natural Language ProcessingNehal (Neil) Shah
 
Natural Language Processing for the Semantic Web
Natural Language Processing for the Semantic WebNatural Language Processing for the Semantic Web
Natural Language Processing for the Semantic WebIsabelle Augenstein
 
Natural Language Processing and Graph Databases in Lumify
Natural Language Processing and Graph Databases in LumifyNatural Language Processing and Graph Databases in Lumify
Natural Language Processing and Graph Databases in LumifyCharlie Greenbacker
 

Viewers also liked (20)

Querying your database in natural language by Daniel Moisset PyData SV 2014
Querying your database in natural language by Daniel Moisset PyData SV 2014Querying your database in natural language by Daniel Moisset PyData SV 2014
Querying your database in natural language by Daniel Moisset PyData SV 2014
 
Natural language search using Neo4j
Natural language search using Neo4jNatural language search using Neo4j
Natural language search using Neo4j
 
TechTalk #15 Grokking: The data processing journey at AhaMove
TechTalk #15 Grokking:  The data processing journey at AhaMoveTechTalk #15 Grokking:  The data processing journey at AhaMove
TechTalk #15 Grokking: The data processing journey at AhaMove
 
NLIDB(Natural Language Interface to DataBases)
NLIDB(Natural Language Interface to DataBases)NLIDB(Natural Language Interface to DataBases)
NLIDB(Natural Language Interface to DataBases)
 
Natural Language Processing with Graph Databases and Neo4j
Natural Language Processing with Graph Databases and Neo4jNatural Language Processing with Graph Databases and Neo4j
Natural Language Processing with Graph Databases and Neo4j
 
Running Natural Language Queries on MongoDB
Running Natural Language Queries on MongoDBRunning Natural Language Queries on MongoDB
Running Natural Language Queries on MongoDB
 
Similarity at Scale
Similarity at ScaleSimilarity at Scale
Similarity at Scale
 
InformationRetrieval
InformationRetrievalInformationRetrieval
InformationRetrieval
 
A Graph-based Clustering Scheme for Identifying Related Tags in Folksonomies
A Graph-based Clustering Scheme for Identifying Related Tags in FolksonomiesA Graph-based Clustering Scheme for Identifying Related Tags in Folksonomies
A Graph-based Clustering Scheme for Identifying Related Tags in Folksonomies
 
Enhance discovery Solr and Mahout
Enhance discovery Solr and MahoutEnhance discovery Solr and Mahout
Enhance discovery Solr and Mahout
 
Kernel Recipes 2014 - x86 instruction encoding and the nasty hacks we do in t...
Kernel Recipes 2014 - x86 instruction encoding and the nasty hacks we do in t...Kernel Recipes 2014 - x86 instruction encoding and the nasty hacks we do in t...
Kernel Recipes 2014 - x86 instruction encoding and the nasty hacks we do in t...
 
NLP approach for medical translation task
NLP approach for medical translation taskNLP approach for medical translation task
NLP approach for medical translation task
 
Are all CAC/NLP Vendors the Same?
Are all CAC/NLP Vendors the Same?Are all CAC/NLP Vendors the Same?
Are all CAC/NLP Vendors the Same?
 
Nlp based retrieval of medical information for diagnosis of human diseases
Nlp based retrieval of medical information for diagnosis of human diseasesNlp based retrieval of medical information for diagnosis of human diseases
Nlp based retrieval of medical information for diagnosis of human diseases
 
SQL Server Cross Platform Portable con Docker
SQL Server Cross Platform Portable con DockerSQL Server Cross Platform Portable con Docker
SQL Server Cross Platform Portable con Docker
 
HealthCare Data Mining and Natural Language Processing
HealthCare Data Mining and Natural Language ProcessingHealthCare Data Mining and Natural Language Processing
HealthCare Data Mining and Natural Language Processing
 
Natural Language Processing for the Semantic Web
Natural Language Processing for the Semantic WebNatural Language Processing for the Semantic Web
Natural Language Processing for the Semantic Web
 
Natural Language Processing and Graph Databases in Lumify
Natural Language Processing and Graph Databases in LumifyNatural Language Processing and Graph Databases in Lumify
Natural Language Processing and Graph Databases in Lumify
 
Quepy
QuepyQuepy
Quepy
 
Named Entities
Named EntitiesNamed Entities
Named Entities
 

Similar to TechTalk #13 Grokking: Marrying Elasticsearch with NLP to solve real-world search problems

Exploring MongoDB & Elasticsearch: Better Together
Exploring MongoDB & Elasticsearch: Better TogetherExploring MongoDB & Elasticsearch: Better Together
Exploring MongoDB & Elasticsearch: Better TogetherObjectRocket
 
Advanced Schema Design Patterns
Advanced Schema Design PatternsAdvanced Schema Design Patterns
Advanced Schema Design PatternsMongoDB
 
Sharing a Startup’s Big Data Lessons
Sharing a Startup’s Big Data LessonsSharing a Startup’s Big Data Lessons
Sharing a Startup’s Big Data LessonsGeorge Stathis
 
Enhancing Enterprise Search with Machine Learning - Simon Hughes, Dice.com
Enhancing Enterprise Search with Machine Learning - Simon Hughes, Dice.comEnhancing Enterprise Search with Machine Learning - Simon Hughes, Dice.com
Enhancing Enterprise Search with Machine Learning - Simon Hughes, Dice.comSimon Hughes
 
Configuring elasticsearch for performance and scale
Configuring elasticsearch for performance and scaleConfiguring elasticsearch for performance and scale
Configuring elasticsearch for performance and scaleBharvi Dixit
 
Webinar: Fusion for Data Science
Webinar: Fusion for Data ScienceWebinar: Fusion for Data Science
Webinar: Fusion for Data ScienceLucidworks
 
Semantics and Search by Upasna Gautam at PubCon Austin 2018
Semantics and Search by Upasna Gautam at PubCon Austin 2018Semantics and Search by Upasna Gautam at PubCon Austin 2018
Semantics and Search by Upasna Gautam at PubCon Austin 2018Upasna Gautam
 
Semantics and Search by Upasna Gautam at PubCon Austin 2018
Semantics and Search by Upasna Gautam at PubCon Austin 2018Semantics and Search by Upasna Gautam at PubCon Austin 2018
Semantics and Search by Upasna Gautam at PubCon Austin 2018Upasna Gautam
 
Computer Science Masters Library Training - June 2017
Computer Science Masters Library Training - June 2017Computer Science Masters Library Training - June 2017
Computer Science Masters Library Training - June 2017pvhead123
 
Make Text Search "Work" for Your Apps - JavaOne 2013
Make Text Search "Work" for Your Apps - JavaOne 2013Make Text Search "Work" for Your Apps - JavaOne 2013
Make Text Search "Work" for Your Apps - JavaOne 2013javagroup2006
 
Dice.com Bay Area Search - Beyond Learning to Rank Talk
Dice.com Bay Area Search - Beyond Learning to Rank TalkDice.com Bay Area Search - Beyond Learning to Rank Talk
Dice.com Bay Area Search - Beyond Learning to Rank TalkSimon Hughes
 
Search summit-2018-content-engineering-slides
Search summit-2018-content-engineering-slidesSearch summit-2018-content-engineering-slides
Search summit-2018-content-engineering-slidesSujit Pal
 
Query-time Nonparametric Regression with Temporally Bounded Models - Patrick ...
Query-time Nonparametric Regression with Temporally Bounded Models - Patrick ...Query-time Nonparametric Regression with Temporally Bounded Models - Patrick ...
Query-time Nonparametric Regression with Temporally Bounded Models - Patrick ...Lucidworks
 
Search Intelligently - Liferay Symposium North America 2016, Chicago, USA
Search Intelligently - Liferay Symposium North America 2016, Chicago, USASearch Intelligently - Liferay Symposium North America 2016, Chicago, USA
Search Intelligently - Liferay Symposium North America 2016, Chicago, USAAndré Ricardo Barreto de Oliveira
 
"PageRank" - "The Anatomy of a Large-Scale Hypertextual Web Search Engine” pr...
"PageRank" - "The Anatomy of a Large-Scale Hypertextual Web Search Engine” pr..."PageRank" - "The Anatomy of a Large-Scale Hypertextual Web Search Engine” pr...
"PageRank" - "The Anatomy of a Large-Scale Hypertextual Web Search Engine” pr...Stefan Adam
 
Implementing Conceptual Search in Solr using LSA and Word2Vec: Presented by S...
Implementing Conceptual Search in Solr using LSA and Word2Vec: Presented by S...Implementing Conceptual Search in Solr using LSA and Word2Vec: Presented by S...
Implementing Conceptual Search in Solr using LSA and Word2Vec: Presented by S...Lucidworks
 

Similar to TechTalk #13 Grokking: Marrying Elasticsearch with NLP to solve real-world search problems (20)

Exploring MongoDB & Elasticsearch: Better Together
Exploring MongoDB & Elasticsearch: Better TogetherExploring MongoDB & Elasticsearch: Better Together
Exploring MongoDB & Elasticsearch: Better Together
 
Search Basics
Search BasicsSearch Basics
Search Basics
 
Advanced Schema Design Patterns
Advanced Schema Design PatternsAdvanced Schema Design Patterns
Advanced Schema Design Patterns
 
File000162
File000162File000162
File000162
 
DC presentation 1
DC presentation 1DC presentation 1
DC presentation 1
 
Sharing a Startup’s Big Data Lessons
Sharing a Startup’s Big Data LessonsSharing a Startup’s Big Data Lessons
Sharing a Startup’s Big Data Lessons
 
Enhancing Enterprise Search with Machine Learning - Simon Hughes, Dice.com
Enhancing Enterprise Search with Machine Learning - Simon Hughes, Dice.comEnhancing Enterprise Search with Machine Learning - Simon Hughes, Dice.com
Enhancing Enterprise Search with Machine Learning - Simon Hughes, Dice.com
 
Configuring elasticsearch for performance and scale
Configuring elasticsearch for performance and scaleConfiguring elasticsearch for performance and scale
Configuring elasticsearch for performance and scale
 
Elasticsearch Introduction at BigData meetup
Elasticsearch Introduction at BigData meetupElasticsearch Introduction at BigData meetup
Elasticsearch Introduction at BigData meetup
 
Webinar: Fusion for Data Science
Webinar: Fusion for Data ScienceWebinar: Fusion for Data Science
Webinar: Fusion for Data Science
 
Semantics and Search by Upasna Gautam at PubCon Austin 2018
Semantics and Search by Upasna Gautam at PubCon Austin 2018Semantics and Search by Upasna Gautam at PubCon Austin 2018
Semantics and Search by Upasna Gautam at PubCon Austin 2018
 
Semantics and Search by Upasna Gautam at PubCon Austin 2018
Semantics and Search by Upasna Gautam at PubCon Austin 2018Semantics and Search by Upasna Gautam at PubCon Austin 2018
Semantics and Search by Upasna Gautam at PubCon Austin 2018
 
Computer Science Masters Library Training - June 2017
Computer Science Masters Library Training - June 2017Computer Science Masters Library Training - June 2017
Computer Science Masters Library Training - June 2017
 
Make Text Search "Work" for Your Apps - JavaOne 2013
Make Text Search "Work" for Your Apps - JavaOne 2013Make Text Search "Work" for Your Apps - JavaOne 2013
Make Text Search "Work" for Your Apps - JavaOne 2013
 
Dice.com Bay Area Search - Beyond Learning to Rank Talk
Dice.com Bay Area Search - Beyond Learning to Rank TalkDice.com Bay Area Search - Beyond Learning to Rank Talk
Dice.com Bay Area Search - Beyond Learning to Rank Talk
 
Search summit-2018-content-engineering-slides
Search summit-2018-content-engineering-slidesSearch summit-2018-content-engineering-slides
Search summit-2018-content-engineering-slides
 
Query-time Nonparametric Regression with Temporally Bounded Models - Patrick ...
Query-time Nonparametric Regression with Temporally Bounded Models - Patrick ...Query-time Nonparametric Regression with Temporally Bounded Models - Patrick ...
Query-time Nonparametric Regression with Temporally Bounded Models - Patrick ...
 
Search Intelligently - Liferay Symposium North America 2016, Chicago, USA
Search Intelligently - Liferay Symposium North America 2016, Chicago, USASearch Intelligently - Liferay Symposium North America 2016, Chicago, USA
Search Intelligently - Liferay Symposium North America 2016, Chicago, USA
 
"PageRank" - "The Anatomy of a Large-Scale Hypertextual Web Search Engine” pr...
"PageRank" - "The Anatomy of a Large-Scale Hypertextual Web Search Engine” pr..."PageRank" - "The Anatomy of a Large-Scale Hypertextual Web Search Engine” pr...
"PageRank" - "The Anatomy of a Large-Scale Hypertextual Web Search Engine” pr...
 
Implementing Conceptual Search in Solr using LSA and Word2Vec: Presented by S...
Implementing Conceptual Search in Solr using LSA and Word2Vec: Presented by S...Implementing Conceptual Search in Solr using LSA and Word2Vec: Presented by S...
Implementing Conceptual Search in Solr using LSA and Word2Vec: Presented by S...
 

TechTalk #13 Grokking: Marrying Elasticsearch with NLP to solve real-world search problems

  • 1. © 2016 Knorex Marrying Elasticsearch with NLP to solve real-world search problems Phu Le, Knorex @ Grokking TechTalk 25 June 2016 Web : http://knorex.com Email : info@knorex.com
  • 2. Knorex Lumina Web Services™
  • 3. Knorex Lumina Web Services™
  • 4. Knorex Lumina Web Services™
  • 5. Knorex Lumina Web Services™
  • 6. Outline
      1. Architecture
      2. Ingredients
          • Data gathering
          • Content extraction
          • Preprocessing
          • Modelling: terms -> phrases, entities -> documents
      3. Elasticsearch
          • Basic analysis, faceting and filtering
          • Do you mean
          • Percolator
          • Recommendation
          • Deduplication
      4. Summary
  • 8. Ingredients
      1. Data gathering
          • Deep crawler
          • Lazy crawler
          • Visual scraper
          • Social media adapters
      2. Content extraction
          • Take a news article as an example
              • Title
              • Content
              • Published date
              • Author
              • Image
              • …
  • 9. Content extraction
  • 10. Content extraction
  • 11. Ingredients
      3. Preprocessing
          • Sentence splitting, tokenization
          • Stemming vs lemmatizing
              • Stemming: cries, crying, cried => cri
              • Lemmatizing: dogs => dog; is, are => be
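To make the stemming vs lemmatizing contrast on slide 11 concrete, here is a toy sketch in Python. The suffix rules and the lemma table are assumptions chosen only to reproduce the slide's examples; they are not the Porter algorithm, and a real pipeline would rely on a proper stemmer and a dictionary-backed lemmatizer.

```python
# Toy illustration only: these rules are assumptions picked to reproduce the
# slide's examples, not the Porter algorithm or any library's behaviour.
SUFFIX_RULES = (("ies", "i"), ("ied", "i"), ("ing", ""), ("ed", ""), ("s", ""))

# Hypothetical lemma table; real lemmatizers use a full lexicon plus POS tags.
LEMMAS = {"dogs": "dog", "is": "be", "are": "be"}


def toy_stem(word: str) -> str:
    """Chop one known suffix, then normalise a trailing 'y' to 'i'."""
    w = word.lower()
    for suffix, replacement in SUFFIX_RULES:
        # Require at least two characters to remain after stripping.
        if w.endswith(suffix) and len(w) - len(suffix) >= 2:
            w = w[: -len(suffix)] + replacement
            break
    if w.endswith("y") and len(w) > 2:
        w = w[:-1] + "i"
    return w


def toy_lemmatize(word: str) -> str:
    """Look the word up in the lemma table; fall back to the lowercased word."""
    return LEMMAS.get(word.lower(), word.lower())


if __name__ == "__main__":
    print([toy_stem(w) for w in ("cries", "crying", "cried")])
    print([toy_lemmatize(w) for w in ("dogs", "is", "are")])
```

The stem "cri" is not a dictionary word, whereas the lemma "be" is; that is the practical difference the slide points at.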
  • 12. Ingredients
      4. Modelling
          • Goal: synthesize words and tokens into larger units and attach meaning to them
          • Key phrase extraction
          • Named entity recognition
              • Basic building block of knowledge
              • Basis for computing relatedness and extracting relations
          • Sentiment analysis
              • Social media snippet
              • General article or towards concepts / named entities
              • Emotion
          • Document classification
              • Group search results into faceted categories
              • Recommend related articles by category
  • 16. Document classification
  • 17.
      • First released Feb 2010, among the fastest-growing open-source projects, total funding $104M (3 rounds)
      • Based on Apache Lucene (same as Solr)
      • Written in Java, supports an HTTP interface and schema-free JSON documents (yay, no XML!)
      • Designed to be scalable, distributed in nature
  • 18. Analysis
      • "analyzer": "standard"
      • "analyzer": "whitespace"
      • "analyzer": "keyword"
  • 19. Analysis
      [Screenshot captions: "Successful!"; tokens ["https", "www.facebook.com", "events", "194454270949757"]; "No hits! WTH… it is not working!!!!"; "Default analyzer as-is"]
      • url => not_analyzed / keyword analyzer
      • Use match query instead of term filter / term query: field analyzer awareness
      • Custom analyzer: e.g. keyword tokenizer + lowercase filter
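The failure on slide 19 can be sketched in plain Python. Presumably the default analyzer split the URL into the four tokens shown, so an exact term lookup for the whole URL found nothing. The URL below is an assumption reconstructed from those tokens, the regex only imitates the slide's token output (it is not Lucene's tokenizer), and the analyzer name `url_exact` is made up for the example.

```python
import re

# Assumed example URL, reconstructed from the tokens shown on the slide.
URL = "https://www.facebook.com/events/194454270949757"


def standard_like_tokens(text: str) -> list[str]:
    """Rough stand-in for the default analyzer: lowercase, then split on
    anything that is not a letter, digit or dot. This only mimics the
    slide's output; the real tokenizer follows Unicode segmentation rules."""
    return re.findall(r"[a-z0-9.]+", text.lower())


def keyword_lowercase_tokens(text: str) -> list[str]:
    """Keyword tokenizer + lowercase filter: the whole input is one token."""
    return [text.lower()]


def term_lookup(indexed_tokens: list[str], term: str) -> bool:
    """A term query is not analyzed: it must equal an indexed token exactly."""
    return term in indexed_tokens


# Index settings for the custom analyzer named on the slide.
# "url_exact" is a hypothetical analyzer name.
CUSTOM_ANALYZER_SETTINGS = {
    "settings": {"analysis": {"analyzer": {"url_exact": {
        "type": "custom", "tokenizer": "keyword", "filter": ["lowercase"]}}}}
}

if __name__ == "__main__":
    print(standard_like_tokens(URL))
    print(term_lookup(standard_like_tokens(URL), URL))      # no hits
    print(term_lookup(keyword_lowercase_tokens(URL), URL))  # hit
```

A match query avoids the problem from the other side: it runs the query text through the field's analyzer, so both sides are tokenized the same way.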
  • 20. Analysis
      [Diagram labels: Search analyzer, Index analyzer, Elasticsearch index, Search, Index]
      • Design carefully which fields searches will be executed on most frequently
      • Determine which analyzer to use for each field (experimentally, based on application needs)
      • The search analyzer and index analyzer may differ for the same field
      • Use match query instead of term filter / term query: field analyzer awareness
      • Exploit multi-field
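As a hedged illustration of slide 20, the mapping below declares separate index and search analyzers for one field and adds a multi-field for exact matching. The field names (`title`, `raw`) and the query text are hypothetical. The type names `text` and `keyword` follow Elasticsearch 5.x and later; clusters from the time of the talk would have used `string` with `"index": "not_analyzed"` instead.

```python
import json

# Hypothetical field names; type names assume Elasticsearch 5.x or later.
MAPPING = {
    "properties": {
        "title": {
            "type": "text",
            "analyzer": "standard",         # applied when documents are indexed
            "search_analyzer": "standard",  # applied to the query text; may differ
            "fields": {
                # Multi-field: the same value indexed a second time, unanalyzed,
                # for exact matching and aggregations.
                "raw": {"type": "keyword"},
            },
        }
    }
}

# A match query is analyzed with the field's analyzer; a term query is not,
# so it is pointed at the unanalyzed sub-field.
MATCH_QUERY = {"query": {"match": {"title": "Bursa Malaysia"}}}
TERM_QUERY = {"query": {"term": {"title.raw": "Bursa Malaysia"}}}

if __name__ == "__main__":
    print(json.dumps(MAPPING, indent=2))
```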
  • 21. Faceting and filtering
  • 22. Do you mean
      • "grok" -> "grokking", "sear" -> "search"
      • Natural approach:
          • Compute terms aggregation (facet) across all text fields
              • title
              • description
              • content
          • Use regex to filter matched terms, sort DESC by frequency, take the most popular terms to suggest
      • DON'T!!!
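The "natural approach" on slide 22 can be imitated in a few lines of Python, which also shows why it scales poorly: every request touches the whole vocabulary. The two-document corpus and the function name `naive_suggest` are made up for illustration.

```python
import re
from collections import Counter

# Tiny made-up corpus standing in for the title / description / content fields.
DOCS = [
    {"title": "Grokking search", "content": "Searching with Elasticsearch"},
    {"title": "Search basics", "content": "Grokking the search index"},
]


def naive_suggest(prefix: str, docs: list[dict], size: int = 3) -> list[str]:
    """Count every term in every text field, keep the terms matching the
    prefix regex, and return the most frequent ones first."""
    counts = Counter(
        term
        for doc in docs
        for text in doc.values()
        for term in re.findall(r"\w+", text.lower())
    )
    pattern = re.compile(re.escape(prefix.lower()) + r".*")
    matches = [(t, c) for t, c in counts.items() if pattern.fullmatch(t)]
    # Sort by descending frequency, then alphabetically for a stable order.
    matches.sort(key=lambda tc: (-tc[1], tc[0]))
    return [t for t, _ in matches[:size]]


if __name__ == "__main__":
    print(naive_suggest("sear", DOCS))
    print(naive_suggest("grok", DOCS))
```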
  • 23. [Slide without extractable text]
  • 24. Do you mean
      • Limitations
          • Single terms only. Cannot suggest phrases
          • Terms occurring frequently might not be useful
      • Improvements
          • Building another field "phrases" in the document
              • adding the entire title
              • using key phrase extraction and named entity recognition to populate meaningful phrases
          • Custom tokenizers: keyword, edgeNGram
              • edgeNGram example: "grokking" => "gro", "grok", "grokk"
              • Query: "burs mal" => matched: "bursa malaysia"
              • memory explosion!!!
          • Custom scoring (importance, popularity score) instead of term frequency
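The edgeNGram example on slide 24 can be reproduced with a short sketch. The `min_gram=3` and `max_gram=5` values are assumptions inferred from the slide's "gro", "grok", "grokk" output, and the helper names are hypothetical. Note that the phrase is n-grammed at index time only, while the query is merely lowercased and split, which is one case where index and search analysis differ.

```python
def edge_ngrams(token: str, min_gram: int = 3, max_gram: int = 5) -> list[str]:
    """Prefixes of the token with lengths from min_gram to max_gram."""
    upper = min(max_gram, len(token))
    return [token[:n] for n in range(min_gram, upper + 1)]


def index_phrase(phrase: str) -> set[str]:
    """Index-time analysis: lowercase, split on whitespace, n-gram each word."""
    grams: set[str] = set()
    for word in phrase.lower().split():
        grams.update(edge_ngrams(word))
    return grams


def matches(query: str, phrase: str) -> bool:
    """Search-time analysis only lowercases and splits; every query word
    must equal one of the indexed prefixes."""
    indexed = index_phrase(phrase)
    return all(word in indexed for word in query.lower().split())


if __name__ == "__main__":
    print(edge_ngrams("grokking"))
    print(matches("burs mal", "bursa malaysia"))
    print(len(index_phrase("bursa malaysia")))  # index terms for one 2-word phrase
```

Each word contributes several index terms instead of one, which is the "memory explosion" the slide warns about. With these settings a query word longer than `max_gram` never matches, so the limits need tuning per application.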
  • 25. Do you mean
      • Elasticsearch built-in suggester
          • FST example. Source: https://www.elastic.co/blog/you-complete-me
          • Features:
              • Speed & scale: FST per segment, built in real time, scales horizontally
              • Analysis: synonym, fuzzy
              • Supports custom ordering and scoring
          • Limitations: can't find a word anywhere within a phrase
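The limitation noted on slide 25 follows from prefix-only lookup. The sketch below uses a sorted list with binary search as a stand-in for the FST (an assumption made for brevity, not how Lucene stores suggestions); the suggestion strings are made up.

```python
from bisect import bisect_left

# Made-up suggestion inputs; a sorted list stands in for the FST here.
SUGGESTIONS = sorted(["bursa malaysia", "grokking", "malaysia airlines", "search"])


def complete(prefix: str, limit: int = 5) -> list[str]:
    """Return up to `limit` stored inputs that start with `prefix`."""
    start = bisect_left(SUGGESTIONS, prefix)
    results = []
    for candidate in SUGGESTIONS[start:]:
        if not candidate.startswith(prefix):
            break  # sorted order: no later entry can share the prefix
        results.append(candidate)
        if len(results) == limit:
            break
    return results


if __name__ == "__main__":
    print(complete("bur"))
    print(complete("mal"))  # "bursa malaysia" is not found via its second word
```

A usual workaround, stated here as an assumption about typical usage, is to register several inputs for the same suggestion, for example both "bursa malaysia" and "malaysia".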
  • 26. Do you mean
      • Speed test: 1 million articles, 2.7 GB index size, on a single laptop with SSD
          Regex terms facet: 296.5 ms
          Terms suggester: 13 ms
      • Cautions
          • Don't add all terms/phrases to suggestions (only meaningful ones!)
          • Don't start suggesting immediately. How many words start with "c"?
          • Don't suggest terms that yield no search results
              • Apply the same filter conditions of the current query to the term suggestion query
  • 27. Percolator
      • percolate: match documents against queries
  • 28. Percolator
      • Sample use case: segmenting articles using keywords
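Percolation inverts the usual flow: queries are stored first, and each incoming document is checked against them. A minimal in-memory sketch of the keyword segmentation use case from slide 28 follows; the segment names and keywords are hypothetical, and real percolator queries can be arbitrary Elasticsearch queries rather than keyword sets.

```python
# Hypothetical segments: each registered "query" is a set of keywords, and a
# document matches when it contains at least one of them.
REGISTERED_QUERIES = {
    "finance": {"stock", "bursa", "dividend"},
    "tech": {"elasticsearch", "lucene", "kotlin"},
}


def percolate(text: str) -> list[str]:
    """Return the names of all registered queries that match the document."""
    tokens = set(text.lower().split())
    return sorted(
        name for name, keywords in REGISTERED_QUERIES.items() if tokens & keywords
    )


if __name__ == "__main__":
    print(percolate("Bursa Malaysia adopts Elasticsearch"))
    print(percolate("Weather today"))
```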
  • 29. Recommendation
      • Natural approach
          • More-like-this or fuzzy-like-this on title, content
              • Not accurate, bag-of-words approach
              • Tricky to determine the threshold. A "good value" varies across document types and domains
              • Slow. The more terms allowed in the queries, the slower it is. If cut off based on max terms, accuracy drops
      • Proposed approaches
          • Utilize NLP results (modelling step):
              • Category: recommend articles from the same categories
              • Key phrases: match and rank documents w.r.t. target documents by key phrases
              • Named entities: model with parent/child relationship
          • Combine with the function score feature to rescore results
              • Example: applying a Gauss decay function to favor more recent results
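The Gauss decay mentioned on slide 29 can be sketched outside Elasticsearch. The function below follows the Gaussian decay formula as documented for Elasticsearch's function score query, to the best of my understanding; the `score` and `age_days` keys and the 7-day scale are assumptions for the example.

```python
import math


def gauss_decay(distance: float, scale: float, decay: float = 0.5,
                offset: float = 0.0) -> float:
    """Multiplier in (0, 1]: 1.0 within `offset` of the origin and exactly
    `decay` at a distance of `offset + scale`."""
    sigma_sq = -(scale ** 2) / (2.0 * math.log(decay))
    effective = max(0.0, abs(distance) - offset)
    return math.exp(-(effective ** 2) / (2.0 * sigma_sq))


def rescore(hits: list[dict], scale_days: float = 7.0) -> list[dict]:
    """Multiply each relevance score by the decay of the article's age,
    then sort by the new score, best first."""
    rescored = [
        {**hit, "score": hit["score"] * gauss_decay(hit["age_days"], scale_days)}
        for hit in hits
    ]
    return sorted(rescored, key=lambda h: h["score"], reverse=True)


if __name__ == "__main__":
    hits = [
        {"id": "old", "score": 2.0, "age_days": 30},
        {"id": "new", "score": 1.5, "age_days": 1},
    ]
    print([h["id"] for h in rescore(hits)])
```

With a 7-day scale, a month-old article is pushed far down even if its text relevance is higher, which is the "favor more recent results" effect the slide describes.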
  • 30. Recommendation
      • Sophisticated scoring and ranking can be done outside of Elasticsearch
      • Still, Elasticsearch can be tapped for its faceting and filtering capability
  • 31. Deduplication
      • Natural approach
          • Term matching on URL, title
              • Fails if these are slightly different (very common!)
          • More-like-this or fuzzy-like-this on content, with a high matching threshold: e.g. 70%, 80%
              • Not accurate, bag-of-words approach
              • Tricky to determine the threshold. A "good value" varies across document types and domains
              • Slow. The more terms allowed in the queries, the slower it is. If cut off based on max terms, accuracy drops
      • Proposed approach
          • Semantic hashing: minhash, simhash
              • for a document, compute a hash value
              • convert the hash value to binary string form
              • robust and efficient, can cater to near-duplicates
          • Implement Hamming distance search using Elasticsearch fuzzy_like_this
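The hashing steps listed on slide 31 can be sketched as a basic simhash. This is a minimal version under stated assumptions: whitespace tokens, equal token weights, and the first 64 bits of MD5 as the per-token hash. The talk does not say which variant or hash function was used.

```python
import hashlib

BITS = 64


def simhash(text: str) -> int:
    """Simhash over whitespace tokens: each token votes +1 or -1 per bit
    position, and the sign of the total decides the fingerprint bit."""
    votes = [0] * BITS
    for token in text.lower().split():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")  # first 64 bits of the digest
        for i in range(BITS):
            votes[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(BITS) if votes[i] > 0)


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def to_bits(value: int) -> str:
    """Fixed-width binary string form, as stored for distance search."""
    return format(value, f"0{BITS}b")


if __name__ == "__main__":
    a = simhash("knorex marries elasticsearch with nlp for search")
    b = simhash("knorex marries elasticsearch with nlp for better search")
    print(to_bits(a))
    print(to_bits(b))
    print(hamming(a, b))
```

Because each bit is decided by a majority vote over all tokens, changing a few tokens tends to flip only a few bits, so near-duplicates usually end up at a small Hamming distance.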
  • 32. Deduplication
      • Do not index duplicates at all, or
      • Collapse similar items in search results, display only the one with the highest score
          • Assign the same id to articles that are duplicates (call it groupid)
          • Use the Elasticsearch top hits aggregation to collapse results by groupid
      ⇒ 64-bit hash: 1000010001000111101001011011110010111101000011100101101001011101
         Modified version: 1010010001000111101011011011110010111101000011100101101000011101
         Hamming distance: 3
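The two fingerprints on slide 32 can be checked directly, and the collapse step can be sketched in memory. The `groupid` field name comes from the slide; the `score` key and the sample hits are assumptions. In Elasticsearch itself the collapsing would happen inside the query, not in client code.

```python
# The two 64-bit fingerprints printed on the slide, joined across the
# slide's line wrap.
ORIGINAL = ("1000010001000111101001011011110010111101000011100"
            "101101001011101")
MODIFIED = ("1010010001000111101011011011110010111101000011100"
            "101101000011101")


def hamming_bits(a: str, b: str) -> int:
    """Count differing positions between two equal-length bit strings."""
    if len(a) != len(b):
        raise ValueError("fingerprints must have the same length")
    return sum(x != y for x, y in zip(a, b))


def collapse(hits: list[dict]) -> list[dict]:
    """Keep only the highest-scoring hit per groupid, best groups first."""
    best: dict[str, dict] = {}
    for hit in hits:
        current = best.get(hit["groupid"])
        if current is None or hit["score"] > current["score"]:
            best[hit["groupid"]] = hit
    return sorted(best.values(), key=lambda h: h["score"], reverse=True)


if __name__ == "__main__":
    print(hamming_bits(ORIGINAL, MODIFIED))
    hits = [
        {"id": "a1", "groupid": "g1", "score": 1.2},
        {"id": "a2", "groupid": "g1", "score": 2.4},
        {"id": "b1", "groupid": "g2", "score": 1.9},
    ]
    print([h["id"] for h in collapse(hits)])
```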
  • 33. Further reading
      • Dismax vs bool queries
      • Term vs text queries
      • Filter vs filtered
      • Facets (old) vs aggregations (facets reborn + statistics)
      • Geo
  • 34. Summary
      • ES is very flexible, with numerous features and knobs
      • Critical to understand basic analysis and the different types of queries
          • Indexing time and search time tradeoff
          • Precision and recall tradeoff
          • Complexity and memory estimation
      • Use NLP techniques as a modelling step to improve search quality
      • Pay great attention to data input and the data gathering step
  • 35. About Knorex
      • Founded in 2010 as a spin-off from the Data Mining Dept. of A*STAR, Singapore
      • Mission: enabling our customers to make smarter discovery and turn it into actionable insight

Editor's Notes

  1. This round, our team will give you more updates on the Deep Learning effort and KGen, as we promised. In between, we will also share the integration status of Lumina Web Services. RTB will be left to another session. For each part, I will share some key challenges we face and what's next. KGen will be covered in more detail by Yiping.
  2. Lazy crawler
  3. A definitive guide to Elasticsearch has to cover a lot of aspects and features. This presentation focuses on some common use cases we experienced when building our search solutions. I'll first present the basic ingredients needed before we even start building a search solution. Crawlers: different types of crawlers are required.
  4. Never underestimate the complexity of data gathering. Search is completely data driven. Garbage in, garbage out. Automatically extracting information from websites is tricky. If the content comes from an image or a scanned PDF file, it is even harder (OCR and layout analysis required).
  5. Never underestimate the complexity of data gathering. Search is completely data driven. Garbage in, garbage out. Automatically extracting information from websites is tricky. If the content comes from an image or a scanned PDF file, it is even harder (OCR and layout analysis required).
  6. Any time we find something doesn’t match => examine its index / search analyzer configuration
  7. This term matching and ranking is done in MongoDB. We take the ids of the matched documents, compose another query to ES using those ids, and enjoy faceting. Concern: this becomes a problem if the list of ids is long.