DIY Percolator
Jaideep Dhok
@jdhok
ElasticSearch
● Schemaless (sort of)
● Single index can hold docs of multiple types
● Distributed index
● Query and document routing
● Faceting
● Scripting
● Percolator!
Percolator in ElasticSearch
● You add queries to the percolator
● ES will save these in an index
● Later, you can 'percolate' a document and get the matching queries in response
● Optionally percolate documents while indexing
esClient.preparePercolate(indexName, typeName)
    .setSource(doc) // JSON document
    .execute() // Gives a listenable Future
    .addListener(new ActionListener<PercolateResponse>() {
        @Override
        public void onResponse(PercolateResponse response) {
            // Get IDs of matching queries
            List<String> matchingQueries = response.matches();
            // Have fun
        }
        @Override
        public void onFailure(Throwable e) {
            // ActionListener also requires a failure callback; handle or log the error here
        }
    });
Uses
● Standing queries
● Update alerts
● Streaming
● Log debugging
Log debugging
● Logstash pushes logs to a web server
● Clients register queries with the server
● Server routes incoming log messages to matching queries
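The routing step above can be sketched in plain Java. This is only an illustration under stated assumptions: `LogRouter`, `register` and `route` are hypothetical names, and a simple "message contains this term" predicate stands in for a real Lucene query.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical stand-in for the log-routing server; not the author's code.
class LogRouter {
    // Client id -> registered query. A Predicate stands in for a Lucene Query.
    private final Map<String, Predicate<String>> queries = new LinkedHashMap<>();

    // A client registers a standing query: "message contains this term".
    void register(String clientId, String term) {
        String needle = term.toLowerCase(Locale.ROOT);
        queries.put(clientId, msg -> msg.toLowerCase(Locale.ROOT).contains(needle));
    }

    // Return the ids of all clients whose query matches the incoming log line,
    // in registration order.
    List<String> route(String logLine) {
        List<String> matched = new ArrayList<>();
        for (Map.Entry<String, Predicate<String>> entry : queries.entrySet()) {
            if (entry.getValue().test(logLine)) {
                matched.add(entry.getKey());
            }
        }
        return matched;
    }
}
```

A real server would percolate each message against stored queries instead of testing substrings, then push the message to every matched client.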
How does it work?
● MemoryIndex
● Hold a single document in the index
● For each incoming document
○ Clear the index
○ Add the doc to the index
○ Search all queries one by one
■ if score > 0: add query Id to matched list
○ return matched list
MemoryIndex
● Not RAMDirectory
● addField()
● IndexSearcher createSearcher()
● float search(Query)
● reset()
Workflow differences
● Directory: IndexWriter -> Document -> Query
● MemoryIndex: Index -> Fields -> Query
Let's write our own
● void addQuery(String query)
● List<Query> getMatchingQueries(String doc)
public Percolator() {
    queries = new ArrayList<Query>();
    index = new MemoryIndex();
}

public void addQuery(String query) throws ParseException {
    Analyzer analyzer = new SimpleAnalyzer(VERSION);
    QueryParser parser = new QueryParser(VERSION, F_CONTENT, analyzer);
    queries.add(parser.parse(query));
}

public List<Query> getMatchingQueries(String doc) {
    List<Query> matching = new ArrayList<Query>();
    // MemoryIndex is not thread safe: hold the lock while indexing and while searching
    synchronized (index) {
        index.reset();
        index.addField(F_CONTENT, doc, new SimpleAnalyzer(VERSION));
        for (Query qry : queries) {
            // A score above zero means the query matched the document
            if (index.search(qry) > 0.0f) {
                matching.add(qry);
            }
        }
    }
    return matching;
}
Miscellaneous
● Adding documents is not thread safe
● "Typically, it is about 10-100 times faster than RAMDirectory"
● "Memory consumption is probably larger than for RAMDirectory"
● Indexing a field is O(N) best case, O(N log N) worst case, where N = number of tokens
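One possible way around the thread-safety caveat is to give each thread its own index rather than sharing one behind a lock. The sketch below only illustrates that idea: `TinyIndex` is a hypothetical token-set stand-in for MemoryIndex, and `PerThreadPercolator` is an assumed name, not part of Lucene or the talk's code.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical single-document index, standing in for Lucene's MemoryIndex.
class TinyIndex {
    private final Set<String> tokens = new HashSet<>();

    void reset() {
        tokens.clear();
    }

    // Lowercase the text and split it on non-word characters.
    void addField(String text) {
        for (String token : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
    }

    boolean contains(String term) {
        return tokens.contains(term.toLowerCase(Locale.ROOT));
    }
}

class PerThreadPercolator {
    // Each thread lazily gets its own index, so reset/add/search never race.
    private static final ThreadLocal<TinyIndex> INDEX =
            ThreadLocal.withInitial(TinyIndex::new);

    static boolean matches(String doc, String term) {
        TinyIndex index = INDEX.get();
        index.reset();
        index.addField(doc);
        return index.contains(term);
    }
}
```

The trade-off is memory: every worker thread keeps its own index instance, which may matter given the memory note above.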
Resources
● ElasticSearch feature - http://www.elasticsearch.org/blog/percolator/
● MemoryIndex - http://lucene.apache.org/core/4_4_0/memory/index.html
● Code - github: jdhok/diypercolate
Thank You
