This document discusses combining natural language processing (NLP) techniques with Elasticsearch to solve real-world search problems. It outlines the key ingredients: gathering and extracting content from data sources, preprocessing text, and modeling terms, phrases, and entities. It then describes how Elasticsearch can be used for basic analysis, filtering, recommendations, and deduplication. Specific NLP techniques such as key phrase extraction, named entity recognition, and semantic hashing are proposed to improve search quality beyond bag-of-words approaches. The document concludes with a summary covering analysis, queries, index-time versus search-time tradeoffs, and the importance of the data input step.
This round, our team will give you more updates on the Deep Learning effort and KGen, as promised
In between, we will also share the integration status of Lumina Web Services
RTB will be left to another session
For each part, I will share some key challenges we face, and what's next
KGen will be covered in more detail by Yiping
Lazy crawler
A definitive guide to Elasticsearch would have to cover many aspects and features
This presentation focuses on some common use cases we experienced when building our search solutions
I’ll first present basic ingredients needed before we even start building a search solution
Crawlers:
- different types of crawlers are required
Never underestimate the complexity of data gathering. Search is completely data driven. Garbage in, garbage out.
Automatically extracting information from websites is tricky. If the content comes from images or scanned PDF files, it is even harder (OCR & layout analysis required)
Any time a query doesn't match what we expect => examine the field's index / search analyzer configuration
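One way to run that check is Elasticsearch's `_analyze` API, which shows exactly how a text is tokenized. A minimal sketch of building such a request (the index name `articles` and field `title` are illustrative assumptions, not from the source):

```python
import json

def analyze_request(index, text, analyzer=None, field=None):
    """Build the body for Elasticsearch's POST /<index>/_analyze endpoint,
    which returns the tokens a given text is split into."""
    body = {"text": text}
    if analyzer is not None:
        body["analyzer"] = analyzer  # test a named analyzer directly
    if field is not None:
        body["field"] = field        # use the analyzer configured on that field
    return "POST /{}/_analyze".format(index), json.dumps(body)

# Compare how the same text is tokenized by the field's index-time analyzer
# versus an explicit search analyzer; any difference explains a failed match.
endpoint, body = analyze_request("articles", "Wi-Fi routers", field="title")
```

Running the same text through both the index analyzer and the search analyzer, then diffing the token lists, usually pinpoints the mismatch.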
This term matching and ranking is done in MongoDB. We take the ids of the matched documents, compose another query to ES using those ids, and enjoy faceting.
Concern: this will be a problem if the list of ids is long
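The pattern above can be sketched with Elasticsearch's `ids` query plus a `terms` aggregation for the facets. This is a minimal sketch, not the team's actual implementation; the facet field name and the id cap are assumptions for illustration:

```python
def ids_query(doc_ids, facet_field="category", max_ids=1024):
    """Build an ES request that fetches the MongoDB-matched documents by id
    and adds a terms aggregation so facet counts come back in one round trip."""
    # Guard against the concern above: a very long id list bloats the request,
    # so cap (or batch) it before sending it to Elasticsearch.
    capped = list(doc_ids)[:max_ids]
    return {
        "query": {"ids": {"values": capped}},
        "aggs": {facet_field: {"terms": {"field": facet_field}}},
    }

# ids taken from the MongoDB term-matching step (illustrative values):
query = ids_query(["doc-17", "doc-42"])
```

For very large id sets, batching the ids across several requests (or reconsidering the MongoDB/ES split) avoids hitting request-size limits.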