Building Data Pipelines for Solr with Apache NiFi (Bryan Bende)
Apache NiFi is an easy to use, powerful, and reliable system to process and distribute data. It supports highly configurable directed graphs of data routing, transformation, and system mediation logic. Some of NiFi's key features include a web-based user interface for monitoring and controlling data flows, guaranteed delivery, data provenance, and easy extensibility through custom processor development.
These features make NiFi a perfect candidate for building production quality data pipelines that interact with Apache Solr. This talk will demonstrate how to use a NiFi processor that delivers data to a Solr update handler, as well as a processor for extracting data from Solr on regular intervals for delivery to down-stream systems. In addition we will show how these processors can be combined with other built-in NiFi processors to solve a variety of use cases, including log aggregation, and indexing messages received from Kafka.
As Apache Solr becomes more powerful and easier to use, the accessibility of high quality data becomes key to unlocking the full potential of Solr’s search and analytic capabilities. Traditional approaches to acquiring data frequently involve a combination of homegrown tools and scripts, often requiring significant development efforts and becoming hard to change, hard to monitor, and hard to maintain. This talk will discuss how Apache NiFi addresses the above challenges and can be used to build production-grade data pipelines for Solr. We will start by giving an introduction to the core features of NiFi, such as visual command & control, dynamic prioritization, back-pressure, and provenance. We will then look at NiFi’s processors for integrating with Solr, covering topics such as ingesting and extracting data, interacting with secure Solr instances, and performance tuning. We will conclude by building a live dataflow from scratch, demonstrating how to prepare data and ingest to Solr.
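Back-pressure, one of the NiFi features mentioned above, can be illustrated with a bounded queue between two flow steps: once the configured object threshold is reached, a fast producer blocks instead of the queue growing without limit. This is a minimal stdlib sketch of the idea, not NiFi code; all names are illustrative.

```python
# Back-pressure sketch: a bounded queue between two flow steps makes a
# fast producer block once the object threshold is reached, the same
# idea NiFi applies per connection. Illustrative only.
import queue
import threading

q = queue.Queue(maxsize=3)   # back-pressure object threshold
consumed = []

def producer():
    for i in range(10):
        q.put(i)             # blocks while the queue already holds 3 items
    q.put(None)              # sentinel: end of flow

def consumer():
    while True:
        item = q.get()
        if item is None:
            break
        consumed.append(item)

t1 = threading.Thread(target=producer)
t2 = threading.Thread(target=consumer)
t1.start(); t2.start()
t1.join(); t2.join()
```

The producer never gets more than three items ahead of the consumer, which is exactly the behavior back-pressure thresholds give you on a NiFi connection.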
We discuss the current state of LLAP (Live Long and Process), the concurrent, sub-second execution engine for analytical queries in Hive 2.0. LLAP is a hybrid execution model that enables performance improvements within and across queries, such as caching of columnar data with cache coherence and intelligent eviction for disaggregated storage models (like S3, Isilon, Azure), JIT-friendly operator pipelines, asynchronous I/O, data pre-fetching, and multi-threaded processing. LLAP features robust tolerance of machine and service failures, achieved by building on time-tested fault-tolerant subsystems, as well as a concurrency-directed design that achieves high utilization with low latency: resources are shared, per-query overheads are reduced, and the system can preempt lower-priority tasks without failing any query in flight. The talk also covers the novel deployment model required for hybrid execution. The elasticity demands of the system are served by a long-lived YARN service interacting with on-demand elastic containers, serving as a tightly integrated DAG-based framework for query execution. We discuss the current state of the project, performance numbers, deployment and usage strategy, and future work, including how LLAP fits into a unified, secure DataFrame access layer.
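The "caching with intelligent eviction" idea above can be sketched with a least-recently-used cache of columnar chunks. LLAP's actual policy and data structures are more sophisticated; this stdlib sketch only illustrates the eviction concept, and the chunk identifiers are invented for the example.

```python
# Sketch of a columnar-chunk cache with LRU eviction: when capacity is
# reached, the least-recently-used chunk is dropped. (LLAP's real
# eviction policy is smarter; this is only illustrative.)
from collections import OrderedDict

class ChunkCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.chunks = OrderedDict()  # chunk_id -> column data

    def get(self, chunk_id):
        if chunk_id not in self.chunks:
            return None
        self.chunks.move_to_end(chunk_id)  # mark as recently used
        return self.chunks[chunk_id]

    def put(self, chunk_id, data):
        if chunk_id in self.chunks:
            self.chunks.move_to_end(chunk_id)
        self.chunks[chunk_id] = data
        if len(self.chunks) > self.capacity:
            self.chunks.popitem(last=False)  # evict least recently used

cache = ChunkCache(capacity=2)
cache.put("orc:part-0:col-a", [1, 2, 3])
cache.put("orc:part-0:col-b", [4, 5, 6])
cache.get("orc:part-0:col-a")          # touch col-a so it survives
cache.put("orc:part-1:col-a", [7, 8])  # evicts col-b, the LRU entry
```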
Hadoop & Cloud Storage: Object Store Integration in Production (Chris Nauroth)
Today's typical Apache Hadoop deployments use HDFS for persistent, fault-tolerant storage of big data files. However, emerging architectural patterns increasingly rely on cloud object storage such as S3, Azure Blob Store, and GCS, which are designed for cost-efficiency, scalability, and geographic distribution. Hadoop supports pluggable file system implementations to enable integration with these systems for use cases such as off-site backup or even complex multi-step ETL, but applications may encounter unique challenges related to eventual consistency, performance, and differences in semantics compared to HDFS. This session explores those challenges and presents recent work to address them in a comprehensive effort spanning multiple Hadoop ecosystem components, including the object store FileSystem connector, Hive, Tez, and ORC. Our goal is to improve correctness, performance, security, and operations for users that choose to integrate Hadoop with cloud storage. We use S3 and the S3A connector as a case study.
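The eventual-consistency challenge described above, where a listing may not yet reflect a recent write, is commonly handled with bounded retries. This is an illustrative stdlib sketch against a mock store, not the S3A implementation; the class and key names are invented for the example.

```python
# Sketch of coping with eventually consistent object-store listings:
# retry a listing until a recently written key appears, up to a bound.
# The mock store "becomes consistent" after two list() calls; real
# connectors such as S3A layer similar logic over the object store.
import time

class MockEventuallyConsistentStore:
    def __init__(self):
        self.objects = set()
        self._pending = {}  # key -> listings remaining until visible

    def put(self, key):
        self._pending[key] = 2  # visible only after 2 list() calls

    def list(self):
        for key in list(self._pending):
            self._pending[key] -= 1
            if self._pending[key] <= 0:
                self.objects.add(key)
                del self._pending[key]
        return set(self.objects)

def list_until_visible(store, key, attempts=5, delay=0.0):
    """Retry listing until `key` shows up, or give up after `attempts`."""
    for _ in range(attempts):
        listing = store.list()
        if key in listing:
            return listing
        time.sleep(delay)
    raise RuntimeError("key never became visible: " + key)

store = MockEventuallyConsistentStore()
store.put("data/part-00000.orc")
listing = list_until_visible(store, "data/part-00000.orc")
```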
We will talk about two challenging real-world SQL-on-Hadoop use cases: #1 Highly Parallel Workload Over Massive Data, and #2 Sub-second SQL for Online Reporting. The challenge is to meet very strict performance requirements over hundreds of billions of records. We will introduce how we solved these challenges using Hive on Tez, Hive LLAP, and Phoenix, with real-life performance numbers!
The Power of Intelligent Flows: Real-Time IoT Botnet Classification with Apac... (DataWorks Summit)
The last 5 years have been marked by an explosion of Internet-connected devices. From cars to solar power, from TVs to juice makers, modern life is filled with interconnected smart devices.
But while those ubiquitous devices enhance the interaction with the technology that surrounds us, the lifecycle management of IoT firmware and poor security design choices still present a significant threat to our daily lives.
Despite the ascent of threats like the Mirai botnet, the amount of published research on how to programmatically detect new IoT devices in the wild has been somewhat limited.
In this presentation we introduce Data Engineering in the context of cyber security, discuss why it is important to move away from the view that security log pipelines are enrichment and indicator matching tools, and push the boundaries of “Simple Event Processing” to demonstrate how Apache NiFi and Apache MiNiFi’s feature rich dataflows can be used to dynamically identify new IoT botnet activities in the wild.
Speakers
Andre Fucs De Miranda, Independent Consultant, Fluenda
Andy LoPresto, Sr. Member of Technical Staff, Hortonworks
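As a toy illustration of the kind of flow-level heuristic such a dataflow might apply (this is not the presenters' detection logic, and the thresholds and addresses are invented), Mirai-style behavior, in which one source scans many hosts on telnet ports, can be flagged directly from flow records:

```python
# Toy flow-record heuristic in the spirit of the talk: flag a source IP
# that contacts many distinct hosts on telnet ports (23/2323), a pattern
# associated with Mirai-style scanning. Thresholds are illustrative.
from collections import defaultdict

TELNET_PORTS = {23, 2323}
SCAN_THRESHOLD = 3  # distinct destinations before we flag a source

def classify_flows(flows):
    """flows: iterable of (src_ip, dst_ip, dst_port) tuples."""
    telnet_targets = defaultdict(set)
    for src, dst, port in flows:
        if port in TELNET_PORTS:
            telnet_targets[src].add(dst)
    return sorted(src for src, targets in telnet_targets.items()
                  if len(targets) >= SCAN_THRESHOLD)

flows = [
    ("10.0.0.5", "192.0.2.1", 23),
    ("10.0.0.5", "192.0.2.2", 23),
    ("10.0.0.5", "192.0.2.3", 2323),
    ("10.0.0.9", "192.0.2.1", 443),
]
suspects = classify_flows(flows)
```

In a NiFi or MiNiFi flow, logic of this shape would sit behind record-processing steps that parse and route the incoming flow data.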
Best Practices and Lessons Learnt from Running Apache NiFi at Renault (DataWorks Summit)
No real-time insight without real-time data ingestion. No real-time data ingestion without NiFi! Apache NiFi is an integrated platform for data flow management at the enterprise level, enabling companies to securely acquire, process, and analyze disparate sources of information (sensors, logs, files, etc.) in real time. NiFi helps data engineers accelerate the development of data flows thanks to its UI and a large number of powerful off-the-shelf processors. However, with great power comes great responsibility. Behind NiFi's simplicity, best practices must be respected in order to scale data flows in production and avoid unpleasant surprises. In this joint presentation, Hortonworks and Renault, a French car manufacturer, will present lessons learnt from real-world projects using Apache NiFi. We will present NiFi design patterns for achieving high performance and reliability at scale, as well as the processes to put in place around the technology for data flow governance. We will also show how these best practices can be applied in practical use cases and scenarios.
Speakers
Kamelia Benchekroun, Data Lake Squad Lead, Renault Group
Abdelkrim Hadjidj, Solution Engineer, Hortonworks
A TPC Benchmark of Hive LLAP and a Comparison with Presto (Yu Liu)
This is a TPC-H/DS benchmark of both Hive LLAP (Low Latency Analytical Processing) and Presto, comparing the two popular big data query engines.
The results show significant advantages for Hive LLAP in performance and durability.
Speed Up Your Queries with Hive LLAP Engine on Hadoop or in the Cloud (Gluent)
Hive was the first popular SQL layer built on Hadoop and has long been known as a heavyweight SQL engine suitable mainly for long-running batch jobs. This has greatly changed since Hive was announced to the world over 8 years ago. Hortonworks and the open source community have evolved Apache Hive into a fast, dynamic SQL on Hadoop engine capable of running highly concurrent query workloads over large datasets with sub-second response time.
The latest Hortonworks and Azure HDInsight platform versions fully support Hive with LLAP execution engine for production use. In this webinar, we will go through the architecture of Hive + LLAP engine and explain how it differs from previous Hive versions. We will then dive deeper and show how features like query vectorization and LLAP columnar caching bring further automatic performance improvements.
Finally, we will show how Gluent brings these new performance benefits to traditional enterprise database platforms via transparent data virtualization, allowing even your largest databases to benefit from all this without changing any application code. Join this webinar to learn about significant improvements in modern Hive architecture and how Gluent and Hive LLAP on Hortonworks or Azure HDInsight platforms can accelerate cloud migrations and greatly improve hybrid query performance!
With the rise of the Internet of Things (IoT) and low-latency analytics, streaming data becomes ever more important. Surprisingly, one of the most promising approaches for processing streaming data is SQL. In this presentation, Julian Hyde shows how to build streaming SQL analytics that deliver results with low latency, adapt to network changes, and play nicely with BI tools and stored data. He also describes how Apache Calcite optimizes streaming queries, and the ongoing collaborations between Calcite and the Storm, Flink and Samza projects.
This talk was given by Julian Hyde at the Apache Big Data conference, Vancouver, on 2016/05/09.
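Much of the streaming SQL described above reduces to windowed aggregation: a query like `SELECT STREAM ... GROUP BY TUMBLE(rowtime, INTERVAL '10' SECOND)` assigns each event to exactly one fixed-size window. This stdlib sketch shows a tumbling-window count; it illustrates the concept only and is not Calcite's implementation.

```python
# Tumbling-window count, the core of a streaming GROUP BY such as
# GROUP BY TUMBLE(rowtime, INTERVAL '10' SECOND). Each event carries a
# timestamp and falls into exactly one fixed window.
from collections import defaultdict

def tumbling_counts(events, window_size):
    """events: iterable of (timestamp, payload); returns {window_start: count}."""
    counts = defaultdict(int)
    for ts, _payload in events:
        window_start = (ts // window_size) * window_size
        counts[window_start] += 1
    return dict(counts)

events = [(1, "a"), (4, "b"), (12, "c"), (19, "d"), (23, "e")]
counts = tumbling_counts(events, window_size=10)
```

Because windows are disjoint and close in timestamp order, a streaming engine can emit each window's result as soon as the watermark passes its end, which is how such queries deliver low-latency results.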
This talk will give an overview of two exciting releases: Apache HBase 2.0 and Phoenix 5.0. HBase provides a NoSQL column store on Hadoop for random, real-time read/write workloads. Phoenix provides SQL on top of HBase. HBase 2.0 contains a large number of features that were a long time in development, including rewritten region assignment, performance improvements (RPC, a rewritten write pipeline, etc.), async clients and WAL, a C++ client, off-heaping of the memstore and other buffers, shading of dependencies, and many other fixes and stability improvements. We will go into detail on some of the most important improvements in the release, as well as the implications for users in terms of APIs and upgrade paths. Phoenix 5.0 is the next big Phoenix release because of its integration with HBase 2.0 and its many performance improvements in support of secondary indexes. It has many important new features, such as encoded columns and Kafka and Hive integration, along with many other performance improvements. This session will also describe the use cases that HBase and Phoenix are a good architectural fit for.
Speaker: Alan Gates, Co-Founder, Hortonworks
LLAP (Live Long and Process) is the newest query acceleration engine for Hive 2.0, which entered GA in 2017. LLAP brings to light a new set of trade-offs and optimizations that allow for efficient and secure multi-user BI systems in the cloud. In this talk, we discuss the specifics of building a modern BI engine within those boundaries, designed to be fast and cost-effective on the public cloud. The LLAP cache focuses on speeding up common BI query patterns on the cloud while avoiding most of the operational overhead of administering a caching layer: the cache is automatically coherent, features intelligent eviction, and supports file formats from text to ORC. We also explore the possibilities of combining the cache with a transactional storage layer that supports online UPDATEs and DELETEs without full data reloads. LLAP by itself, as a relational data layer, extends the same caching and security advantages to any other data processing framework. We overview the structure of such a hybrid system, where both Hive and Spark use LLAP to provide SQL query acceleration on the cloud with new, improved concurrent query support and production-ready tools and UI.
Speaker
Sergey Shelukin, Member of Technical Staff, Hortonworks
Apache Hive is a data warehousing system for large volumes of data stored in Hadoop. However, the data is useless unless you can use it to add value to your company. Hive provides a SQL-based query language that dramatically simplifies the process of querying your large data sets. That is especially important while your data scientists are developing and refining their queries to improve their understanding of the data. In many companies, such as Facebook, Hive accounts for a large percentage of the total MapReduce queries that are run on the system. Although Hive makes writing large data queries easier for the user, there are many performance traps for the unwary. Many of them are artifacts of the way Hive has evolved over the years and the requirement that the default behavior must be safe for all users. This talk will present examples of how Hive users have made mistakes that made their queries run much, much longer than necessary. It will also present guidelines for how to get better performance for your queries and how to read the query plan to understand what Hive is doing.
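One classic trap of the kind this talk alludes to is scanning every partition when a predicate on the partition key could prune most of them. This stdlib sketch illustrates the pruning concept only; the partition names and data are invented, and real Hive performs pruning in the planner when the filter is written directly against the partition column.

```python
# Sketch of partition pruning: when a predicate references the partition
# column, only matching partitions are read. A common Hive trap is
# wrapping the partition key in a function, which defeats pruning and
# forces a full scan.
partitions = {
    "ds=2016-01-01": [("u1", 10), ("u2", 20)],
    "ds=2016-01-02": [("u3", 30)],
    "ds=2016-01-03": [("u4", 40), ("u5", 50)],
}

def scan(partitions, ds_filter=None):
    """Return (partitions_read, rows), reading only partitions that pass."""
    read, rows = [], []
    for part, data in partitions.items():
        ds = part.split("=", 1)[1]
        if ds_filter is None or ds_filter(ds):
            read.append(part)
            rows.extend(data)
    return read, rows

# Pruned scan: only one of the three partitions is touched.
read, rows = scan(partitions, ds_filter=lambda ds: ds == "2016-01-02")
```

Reading the query plan (e.g. via EXPLAIN) shows whether the engine actually pruned or is scanning the whole table.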
Keep Your Hadoop Cluster at Its Best! (Chris Nauroth)
Hadoop has become a backbone of many enterprises. While it can do wonders for businesses, it can sometimes be overwhelming for its operators and users. Amateurs as well as seasoned operators of Hadoop are caught unawares by common pitfalls in deploying, tuning, and operating a Hadoop cluster. Having spent 5+ years working with hundreds of Hadoop users, running clusters with thousands of nodes, managing tens of petabytes of data, and running hundreds of thousands of tasks per day, we have seen how unintentional acts, suboptimal configurations, and common mistakes result in downtime, SLA violations, many hours of recovery operations, and in some cases even data loss! Most of these traumas could have been easily avoided by applying easy-to-follow best practices that protect data and optimize performance. In this talk we present real-life stories, common pitfalls, and, most importantly, strategies on how to correctly deploy and manage Hadoop clusters. The talk will empower users and help make their Hadoop journey more fulfilling and rewarding. We will also discuss SmartSense, which can identify latent problems in a cluster and provide recommendations so that an operator can fix them before they manifest as service degradation or an outage.
Real-time Inverted Search in the Cloud Using Lucene and Storm (lucenerevolution)
Building real-time notification systems is often limited to basic filtering and pattern matching against incoming records. Allowing users to query incoming documents using Solr's full range of capabilities is much more powerful. In our environment we needed a way to allow for tens of thousands of such query subscriptions, meaning we needed to find a way to distribute the query processing in the cloud. By creating in-memory Lucene indices from our Solr configuration, we were able to parallelize our queries across our cluster. To achieve this distribution, we wrapped the processing in a Storm topology to provide a flexible way to scale and manage our infrastructure. This presentation will describe our experiences creating this distributed, real-time inverted search notification framework.
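The "inverted search" pattern described above flips the usual index: stored query subscriptions are evaluated against each incoming document. This stdlib sketch shows the idea with simple AND-of-terms queries; it is not the authors' Lucene/Storm code, and the subscription names are invented. Lucene's MemoryIndex does the real version by building a one-document in-memory index and running each stored query against it.

```python
# Minimal "inverted search" sketch: saved subscriptions are matched
# against each incoming document, instead of documents being indexed
# for ad-hoc queries. Queries here are sets of required terms.

def tokenize(text):
    return set(text.lower().split())

class SubscriptionMatcher:
    def __init__(self):
        self.subscriptions = {}  # name -> set of required terms

    def register(self, name, query_terms):
        """Register a saved query as required terms (AND semantics)."""
        self.subscriptions[name] = set(t.lower() for t in query_terms)

    def match(self, document_text):
        """Return the names of all subscriptions the document satisfies."""
        doc_terms = tokenize(document_text)
        return [name for name, terms in self.subscriptions.items()
                if terms <= doc_terms]

matcher = SubscriptionMatcher()
matcher.register("storm-alerts", ["storm", "topology"])
matcher.register("solr-alerts", ["solr"])
hits = matcher.match("Deploying a Storm topology for Solr notifications")
```

Distributing this across a cluster, as the talk describes, amounts to partitioning the subscription set over workers and fanning each incoming document out to all partitions.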
Building a Near Real-Time Search Engine & Analytics for Logs Using Solr (lucenerevolution)
Presented by Rahul Jain, System Analyst (Software Engineer), IVY Comptech Pvt Ltd
Consolidating and indexing logs so they can be searched in real time poses an array of challenges when you have hundreds of servers producing terabytes of logs every day. Log events are mostly small, around 200 bytes to a few KB, which makes them harder to handle: the smaller each log event, the greater the number of documents to index. In this session, we will discuss the challenges we faced and the solutions we developed to overcome them. The talk will cover:
Methods to collect logs in real time.
How Lucene was tuned to achieve an indexing rate of 1 GB in 46 seconds
Tips and techniques incorporated/used to manage distributed index generation and search on multiple shards
How choosing a layer based partition strategy helped us to bring down the search response times.
Log analysis and generation of analytics using Solr.
Design and architecture used to build the search platform.
Organizations continue to adopt Solr because of its ability to scale to meet even the most demanding workflows. Recently, LucidWorks has been leading the effort to identify, measure, and expand the limits of Solr. As part of this effort, we've learned a few things along the way that should prove useful for any organization wanting to scale Solr. Attendees will come away with a better understanding of how sharding and replication impact performance. Also, no benchmark is useful without being repeatable; Tim will also cover how to perform similar tests using the Solr-Scale-Toolkit in Amazon EC2.
Hadoop Ecosystem and Low Latency Streaming ArchitectureInSemble
"Hadoop Ecosystem and Low Latency Streaming Architecture" was presented by Vijay Mandava and Lan Jiang to Detroit Java User Group on 3/23/2015. It covers the basic introduction of Hadoop Ecosystem and then focus on the low latency streaming architecture, including frameworks such as Flume, Kafka and Storm.
Mike Bartley - Innovations for Testing Parallel Software - EuroSTAR 2012TEST Huddle
EuroSTAR Software Testing Conference 2012 presentation on Innovations for Testing Parallel Software by Mike Bartley.
See more at: http://conference.eurostarsoftwaretesting.com/past-presentations/
OpenNebulaConf2015 1.07 Cloud for Scientific Computing @ STFC - Alexander DibboOpenNebula Project
The Science and Technology Facilities Council is a UK Research Council which funds research and provides large facilities to the UK Scientific Community. This includes running a Tier 1 site for the LHC computing project, the JASMIN Super Data Cluster and a number of other HPC and HTC facilities. The Scientific Computing Department at the Rutherford Appleton Laboratory has been developing a cloud for use across both sites of the Department and in the wider scientific community. This is an OpenNebula backed by Ceph block storage. I will give a brief background of the project, describe our set up, some use cases and the work we have done around OpenNebula (including a simplified web front-end and a number of hooks to provide us with traceability). I will also discuss how we are creating an elastic boundary between our HTC batch farm and cloud.
Author Biography
I am a Systems Administrator in the Scientific Computing Department of the UK’s Science and Technology Facilities Council. I work as part of the cloud team and I also work on a number of Grid services including our HTC batch farm for the LHC computing project.
Prior to my position here I worked in IT at a SMB focusing on Storage and Virtualisation, in particular Hyper-V and VMWare.
Silicon Valley Code Camp 2015 - Advanced MongoDB - The SequelDaniel Coupal
MongoDB presentation from Silicon Valley Code Camp 2015.
Walkthrough developing, deploying and operating a MongoDB application, avoiding the most common pitfalls.
Some of the most common questions we hear from users relate to capacity planning and hardware choices. How many replicas do I need? Should I consider sharding right away? How much RAM will I need for my working set? SSD or HDD? No one likes spending a lot of cash on hardware and cloud bills can just be as painful. MongoDB is different from traditional RDBMSs in its resource management, so you need to be mindful when deciding on the cluster layout and hardware. In this talk we will review the factors that drive the capacity requirements: volume of queries, access patterns, indexing, working set size, among others. Attendees will gain additional insight as we go through a few real-world scenarios, as experienced with MongoDB Inc customers, and come up with their ideal cluster layout and hardware.
Devnexus 2018 - Let Your Data Flow with Apache NiFiBryan Bende
Introduction to Apache NiFi features such as interactive command and control, version control of process groups, record processing, provenance, and prioritzation, and building customer extensions.
Experience our free, in-depth three-part Tendenci Platform Corporate Membership Management workshop series! In Session 1 on May 14th, 2024, we began with an Introduction and Setup, mastering the configuration of your Corporate Membership Module settings to establish membership types, applications, and more. Then, on May 16th, 2024, in Session 2, we focused on binding individual members to a Corporate Membership and Corporate Reps, teaching you how to add individual members and assign Corporate Representatives to manage dues, renewals, and associated members. Finally, on May 28th, 2024, in Session 3, we covered questions and concerns, addressing any queries or issues you may have.
For more Tendenci AMS events, check out www.tendenci.com/events
Listen to the keynote address and hear about the latest developments from Rachana Ananthakrishnan and Ian Foster who review the updates to the Globus Platform and Service, and the relevance of Globus to the scientific community as an automation platform to accelerate scientific discovery.
TROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERRORTier1 app
Even though at surface level ‘java.lang.OutOfMemoryError’ appears as one single error; underlyingly there are 9 types of OutOfMemoryError. Each type of OutOfMemoryError has different causes, diagnosis approaches and solutions. This session equips you with the knowledge, tools, and techniques needed to troubleshoot and conquer OutOfMemoryError in all its forms, ensuring smoother, more efficient Java applications.
In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ...Juraj Vysvader
In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I didn't get rich from it but it did have 63K downloads (powered possible tens of thousands of websites).
Top Features to Include in Your Winzo Clone App for Business Growth (4).pptxrickgrimesss22
Discover the essential features to incorporate in your Winzo clone app to boost business growth, enhance user engagement, and drive revenue. Learn how to create a compelling gaming experience that stands out in the competitive market.
Prosigns: Transforming Business with Tailored Technology SolutionsProsigns
Unlocking Business Potential: Tailored Technology Solutions by Prosigns
Discover how Prosigns, a leading technology solutions provider, partners with businesses to drive innovation and success. Our presentation showcases our comprehensive range of services, including custom software development, web and mobile app development, AI & ML solutions, blockchain integration, DevOps services, and Microsoft Dynamics 365 support.
Custom Software Development: Prosigns specializes in creating bespoke software solutions that cater to your unique business needs. Our team of experts works closely with you to understand your requirements and deliver tailor-made software that enhances efficiency and drives growth.
Web and Mobile App Development: From responsive websites to intuitive mobile applications, Prosigns develops cutting-edge solutions that engage users and deliver seamless experiences across devices.
AI & ML Solutions: Harnessing the power of Artificial Intelligence and Machine Learning, Prosigns provides smart solutions that automate processes, provide valuable insights, and drive informed decision-making.
Blockchain Integration: Prosigns offers comprehensive blockchain solutions, including development, integration, and consulting services, enabling businesses to leverage blockchain technology for enhanced security, transparency, and efficiency.
DevOps Services: Prosigns' DevOps services streamline development and operations processes, ensuring faster and more reliable software delivery through automation and continuous integration.
Microsoft Dynamics 365 Support: Prosigns provides comprehensive support and maintenance services for Microsoft Dynamics 365, ensuring your system is always up-to-date, secure, and running smoothly.
Learn how our collaborative approach and dedication to excellence help businesses achieve their goals and stay ahead in today's digital landscape. From concept to deployment, Prosigns is your trusted partner for transforming ideas into reality and unlocking the full potential of your business.
Join us on a journey of innovation and growth. Let's partner for success with Prosigns.
Globus Connect Server Deep Dive - GlobusWorld 2024Globus
We explore the Globus Connect Server (GCS) architecture and experiment with advanced configuration options and use cases. This content is targeted at system administrators who are familiar with GCS and currently operate—or are planning to operate—broader deployments at their institution.
Globus Compute wth IRI Workflows - GlobusWorld 2024Globus
As part of the DOE Integrated Research Infrastructure (IRI) program, NERSC at Lawrence Berkeley National Lab and ALCF at Argonne National Lab are working closely with General Atomics on accelerating the computing requirements of the DIII-D experiment. As part of the work the team is investigating ways to speedup the time to solution for many different parts of the DIII-D workflow including how they run jobs on HPC systems. One of these routes is looking at Globus Compute as a way to replace the current method for managing tasks and we describe a brief proof of concept showing how Globus Compute could help to schedule jobs and be a tool to connect compute at different facilities.
How to Position Your Globus Data Portal for Success Ten Good PracticesGlobus
Science gateways allow science and engineering communities to access shared data, software, computing services, and instruments. Science gateways have gained a lot of traction in the last twenty years, as evidenced by projects such as the Science Gateways Community Institute (SGCI) and the Center of Excellence on Science Gateways (SGX3) in the US, The Australian Research Data Commons (ARDC) and its platforms in Australia, and the projects around Virtual Research Environments in Europe. A few mature frameworks have evolved with their different strengths and foci and have been taken up by a larger community such as the Globus Data Portal, Hubzero, Tapis, and Galaxy. However, even when gateways are built on successful frameworks, they continue to face the challenges of ongoing maintenance costs and how to meet the ever-expanding needs of the community they serve with enhanced features. It is not uncommon that gateways with compelling use cases are nonetheless unable to get past the prototype phase and become a full production service, or if they do, they don't survive more than a couple of years. While there is no guaranteed pathway to success, it seems likely that for any gateway there is a need for a strong community and/or solid funding streams to create and sustain its success. With over twenty years of examples to draw from, this presentation goes into detail for ten factors common to successful and enduring gateways that effectively serve as best practices for any new or developing gateway.
Enhancing Project Management Efficiency_ Leveraging AI Tools like ChatGPT.pdfJay Das
With the advent of artificial intelligence or AI tools, project management processes are undergoing a transformative shift. By using tools like ChatGPT, and Bard organizations can empower their leaders and managers to plan, execute, and monitor projects more effectively.
Unleash Unlimited Potential with One-Time Purchase
BoxLang is more than just a language; it's a community. By choosing a Visionary License, you're not just investing in your success, you're actively contributing to the ongoing development and support of BoxLang.
How Recreation Management Software Can Streamline Your Operations.pptxwottaspaceseo
Recreation management software streamlines operations by automating key tasks such as scheduling, registration, and payment processing, reducing manual workload and errors. It provides centralized management of facilities, classes, and events, ensuring efficient resource allocation and facility usage. The software offers user-friendly online portals for easy access to bookings and program information, enhancing customer experience. Real-time reporting and data analytics deliver insights into attendance and preferences, aiding in strategic decision-making. Additionally, effective communication tools keep participants and staff informed with timely updates. Overall, recreation management software enhances efficiency, improves service delivery, and boosts customer satisfaction.
Field Employee Tracking System| MiTrack App| Best Employee Tracking Solution|...informapgpstrackings
Keep tabs on your field staff effortlessly with Informap Technology Centre LLC. Real-time tracking, task assignment, and smart features for efficient management. Request a live demo today!
For more details, visit us : https://informapuae.com/field-staff-tracking/
First Steps with Globus Compute Multi-User EndpointsGlobus
In this presentation we will share our experiences around getting started with the Globus Compute multi-user endpoint. Working with the Pharmacology group at the University of Auckland, we have previously written an application using Globus Compute that can offload computationally expensive steps in the researcher's workflows, which they wish to manage from their familiar Windows environments, onto the NeSI (New Zealand eScience Infrastructure) cluster. Some of the challenges we have encountered were that each researcher had to set up and manage their own single-user globus compute endpoint and that the workloads had varying resource requirements (CPUs, memory and wall time) between different runs. We hope that the multi-user endpoint will help to address these challenges and share an update on our progress here.
Enterprise Resource Planning System includes various modules that reduce any business's workload. Additionally, it organizes the workflows, which drives towards enhancing productivity. Here are a detailed explanation of the ERP modules. Going through the points will help you understand how the software is changing the work dynamics.
To know more details here: https://blogs.nyggs.com/nyggs/enterprise-resource-planning-erp-system-modules/
We describe the deployment and use of Globus Compute for remote computation. This content is aimed at researchers who wish to compute on remote resources using a unified programming interface, as well as system administrators who will deploy and operate Globus Compute services on their research computing infrastructure.
2. REAL-TIME INVERTED SEARCH IN THE
CLOUD USING LUCENE AND STORM
Joshua Conlin, Bryan Bende, James Owen
conlin_joshua@bah.com
bende_bryan@bah.com
owen_james@bah.com
3. Table of Contents
Problem Statement
Storm
Methodology
Results
4. Who are we?
Booz Allen Hamilton
– Large consulting firm supporting many industries
• Healthcare, Finance, Energy, Defense
– Strategic Innovation Group
• Focus on innovative solutions that can be applied across industries
• Major focus on data science, big data, & information retrieval
• Multiple clients utilizing Solr for implementing search capabilities
• Explore Data Science
• Self-paced data science training, launching TODAY!
• https://exploredatascience.com
5. Client Applications & Architecture
[Diagram: Ingest → SolrCloud → Web App]
Typical client applications allow users to:
• Query document index using Lucene syntax
• Filter and facet results
• Save queries for future use
6. Problem Statement
How do we instantly notify users of new documents that match their
saved queries?
Constraints:
• Process documents in real-time, notify as soon as possible
• Scale with the number of saved queries (starting with tens of thousands)
• Result set of notifications must match saved queries
• Must not impact performance of the web application
• Data arrives at varying speeds and varying sizes
7. Possible Solutions
1. Fork ingest to a second Solr instance, run stored queries periodically
– Pros: Easy to set up, works for a small amount of data & small # of queries
– Cons: Bound by time to execute all queries
2. Same secondary Solr instance, but distribute queries to multiple servers
– Pros: Reduces query processing time by dividing across several servers
– Cons: Now writing custom code to distribute queries, possible synchronization issues
ensuring each server executes queries against the same data
3. Give each server its own Solr instance and subset of queries
– Pros: Very scalable, only bound by number of servers
– Cons: Difficult to maintain, still writing custom code to distribute data and queries
8. Possible Solutions
Is there a way we can set up this system so that it’s:
• easy to maintain,
• easy to scale, and
• easy to synchronize?
9. Candidate Solution
• Integrate Solr and/or Lucene with a stream processing framework
• Process data in real-time, leverage proven framework for distributed stream
processing
[Diagram: Ingest feeds both SolrCloud and Storm; SolrCloud serves the Web App; Storm produces Notifications]
10. Storm - Overview
• Storm is an open source stream processing framework.
• It’s a scalable platform that lets you distribute processes across a cluster quickly
and easily.
• You can add more resources to your cluster and easily utilize those resources in
your processing.
11. Storm - Components
• Nimbus – the control node for the cluster, distributes topology through the cluster
• Supervisor – one on each machine in the cluster, controls the allocation of worker
assignments on its machine
• Worker – JVM process for running topology components
[Diagram: Nimbus distributing work to three Supervisors, each managing four Workers]
12. Storm – Core Concepts
• Topology – defines a running process, which includes all of the processes to be
run, the connections between those processes, and their configuration
• Stream – the flow of data through a topology; it is an unbounded collection of
tuples that is passed from process to process
• Storm has 2 types of processing units:
– Spout – the start of a stream; it can be thought of as the source of the data;
that data can be read in however the spout wants—from a database, from a
message queue, etc.
– Bolt – the primary processing unit for a topology; it accepts any number of
streams, does whatever processing you’ve set it to do, and outputs any
number of streams based on how you configure it
13. Storm – Core Concepts (continued)
• Stream Groupings – defines how topology processing units (spouts and bolts) are
connected to each other; some common groupings are:
– All Grouping – stream is sent to all bolts
– Shuffle Grouping – stream is evenly distributed across bolts
– Fields Grouping – sends tuples that match on the designated “field” to the
same bolt
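The groupings above boil down to routing functions from a tuple to one or more bolt instances. A stdlib-only sketch of shuffle vs. fields grouping (illustrative only; the class and method names are ours, and this is not Storm's internal implementation):

```java
import java.util.concurrent.ThreadLocalRandom;

public class Groupings {
    // Shuffle grouping: pick a bolt instance uniformly at random,
    // so load spreads evenly across the parallel copies of a bolt.
    static int shuffle(int numBolts) {
        return ThreadLocalRandom.current().nextInt(numBolts);
    }

    // Fields grouping: hash the designated field so tuples with the
    // same field value always land on the same bolt instance.
    static int fields(String fieldValue, int numBolts) {
        return Math.floorMod(fieldValue.hashCode(), numBolts);
    }

    public static void main(String[] args) {
        int numBolts = 8;
        // Same field value always routes to the same bolt...
        System.out.println(fields("user-42", numBolts) == fields("user-42", numBolts)); // true
        // ...while shuffle only guarantees a target in range.
        int target = shuffle(numBolts);
        System.out.println(target >= 0 && target < numBolts); // true
    }
}
```

An All Grouping, by contrast, would return every bolt index rather than picking one.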
15. How to Utilize Storm
How can we use this framework to solve our problem?
Let Storm distribute the data and queries between processing nodes
…but we would still need to manage a Solr instance on each VM, and we would
also need to ensure synchronization between query processing bolts running on
the same VM.
16. How to Utilize Storm
What if instead of having a Solr installation on each machine we ran
Solr in memory inside each of the processing bolts?
• Use Storm spout to distribute new documents
• Use Storm bolt to execute queries against EmbeddedSolrServer with
RAMDirectory
– Incoming documents added to index
– Queries executed
– Documents removed from index
• Use Storm bolt to process query results
[Diagram: a Storm Bolt containing an EmbeddedSolrServer backed by a RAMDirectory]
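The add-documents / run-queries / clear cycle above can be sketched without Solr or Lucene at all. This toy stand-in (stdlib only, our own names, AND-only queries) shows the essential pattern: documents live in a transient in-memory inverted index just long enough for the saved queries to run against them.

```java
import java.util.*;

public class TransientIndex {
    private final Map<String, Set<String>> postings = new HashMap<>(); // term -> doc ids

    // Index a document: record which doc ids contain each term.
    void add(String docId, String text) {
        for (String term : text.toLowerCase().split("\\s+")) {
            postings.computeIfAbsent(term, t -> new HashSet<>()).add(docId);
        }
    }

    // A "query" here is just a set of required terms (AND semantics):
    // intersect the posting lists of every term.
    Set<String> matches(List<String> terms) {
        Set<String> result = null;
        for (String term : terms) {
            Set<String> docs = postings.getOrDefault(term.toLowerCase(), Set.of());
            if (result == null) result = new HashSet<>(docs);
            else result.retainAll(docs);
        }
        return result == null ? Set.of() : result;
    }

    // Documents are volatile: after the queries run, the index is emptied.
    void clear() { postings.clear(); }

    public static void main(String[] args) {
        TransientIndex idx = new TransientIndex();
        idx.add("doc1", "storm topology scales lucene queries");
        idx.add("doc2", "solr cloud web app");
        System.out.println(idx.matches(List.of("lucene", "storm"))); // [doc1]
        idx.clear();
        System.out.println(idx.matches(List.of("lucene"))); // []
    }
}
```

In the real bolt, EmbeddedSolrServer over a RAMDirectory plays the role of this map, giving the full Lucene query syntax instead of term intersection.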
17. Advantages
This has several advantages:
• It removes the need to maintain a Solr instance on each VM.
• It’s easier to scale and more flexible; it doesn’t matter which Supervisor the bolts
get sent to, all the processing is self-contained.
• It removes the need to synchronize processing between bolts.
• Documents are volatile: existing queries are run over new data as it arrives
18. Execution Topology
[Diagram: three Data Spouts and a Query Spout feeding five Executor Bolts, which feed a Notification Bolt]
Data Spout – receives incoming data files and sends them to every
Executor Bolt (All Grouping)
Query Spout – coordinates updates to queries
Executor Bolt – loads and executes queries
Notification Bolt – generates all notifications based on results
(Shuffle Grouping)
19. Executor Bolt
1. Queries are loaded into memory
2. Incoming documents are added to the
Lucene index
3. Documents are processed when one
of the following conditions are met:
a) The number of documents have
exceeded the max batch size
b) The time since the last execution
is longer than the max interval
time
4. Matching queries and document UIDs
are emitted
5. Remove all documents from index
[Diagram: the Executor Bolt holds a query list and a batch of documents; numbered callouts match the steps above]
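The flush decision in step 3 (process when the batch is full OR the interval has elapsed) can be sketched in plain Java. This is our own minimal version of the logic, not the talk's actual code; the names maxBatchSize and maxIntervalMs are assumptions.

```java
public class BatchTrigger {
    private final int maxBatchSize;
    private final long maxIntervalMs;
    private int pending = 0;
    private long lastFlushMs;

    BatchTrigger(int maxBatchSize, long maxIntervalMs, long nowMs) {
        this.maxBatchSize = maxBatchSize;
        this.maxIntervalMs = maxIntervalMs;
        this.lastFlushMs = nowMs;
    }

    // Called for every incoming document; returns true when the bolt
    // should execute all queries and then clear the in-memory index.
    boolean onDocument(long nowMs) {
        pending++;
        if (pending >= maxBatchSize || nowMs - lastFlushMs >= maxIntervalMs) {
            pending = 0;
            lastFlushMs = nowMs;
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        BatchTrigger t = new BatchTrigger(3, 60_000, 0);
        System.out.println(t.onDocument(10));     // false (1 doc, 10 ms elapsed)
        System.out.println(t.onDocument(20));     // false
        System.out.println(t.onDocument(30));     // true  (batch size of 3 reached)
        System.out.println(t.onDocument(70_000)); // true  (60s interval elapsed)
    }
}
```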
20. Solr In-Memory Processing Bolt Issues
• Attempted to run Solr with in-memory index inside Storm bolt
• Solr 4.5 requires:
– http-client 4.2.3
– http-core 4.2.2
• Storm 0.8.2 & 0.9.0 require:
– http-client 4.1.1
– http-core 4.1
• Could exclude libraries from the super jar and rely on storm/lib, but Solr
expects SystemDefaultHttpClient from 4.2.3
• Could build Storm with newer version of libraries, but not
guaranteed to work
21. Lucene In-Memory Processing Bolt
Advantages:
• Fast, Lightweight
• No Dependency Conflicts
• RAMDirectory backed
• Easy Solr to Lucene Document Conversion
• Solr Schema based
Bolt
Lucene Index
RAMDirectory
1. Initialization
– Parse Common Solr Schema
– Replace Solr Classes
2. Add Documents
– Convert SolrInputDocument to Lucene
Document
– Add to index
22. Lucene In-Memory Processing Bolt
Read/Parse/Update the Solr schema file using StAX
Create an IndexSchema from the new schema data

public void addDocument(SolrInputDocument doc) throws Exception {
    if (doc != null) {
        Document luceneDoc = solrDocumentConverter.convert(doc);
        indexWriter.addDocument(luceneDoc);
        indexWriter.commit();
    }
}

public Document convert(SolrInputDocument solrDocument) throws Exception {
    return DocumentBuilder.toDocument(solrDocument, indexSchema);
}
23. Prototype Solution
• Infrastructure:
– 8 node cluster on Amazon EC2
– Each VM has 2 cores and 8G of memory
• Data:
– 92,000 news article summaries
– Average file size: ~1k
• Queries:
– Generated 1 million sample queries
– Randomly selected terms from document set
– Stored in MariaDB (username, query string)
– Query Executor Bolt configured to use any subset of these queries
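Generating sample queries from randomly selected corpus terms can be sketched as follows. This is purely illustrative of the approach described above; the vocabulary, term counts, and method names are ours, not the actual generator used in the prototype.

```java
import java.util.*;

public class QueryGenerator {
    // Build `count` random conjunctive queries of 1-3 terms drawn from
    // the vocabulary; seeded so runs are repeatable.
    static List<String> generate(List<String> vocabulary, int count, long seed) {
        Random rnd = new Random(seed);
        List<String> queries = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            int terms = 1 + rnd.nextInt(3);
            StringJoiner q = new StringJoiner(" AND ");
            for (int t = 0; t < terms; t++) {
                q.add("text:" + vocabulary.get(rnd.nextInt(vocabulary.size())));
            }
            queries.add(q.toString());
        }
        return queries;
    }

    public static void main(String[] args) {
        List<String> vocab = List.of("storm", "lucene", "solr", "cloud", "index", "search");
        // Prints 5 Lucene-syntax query strings, e.g. "text:storm AND text:index"
        generate(vocab, 5, 42).forEach(System.out::println);
    }
}
```

In the prototype these strings were stored in MariaDB alongside a username, and each Executor Bolt loaded its assigned subset.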
24. Prototype Solution – Monitoring Performance
• Metrics Provided by Storm UI
– Emitted: number of tuples emitted
– Transferred: number of tuples transferred (emitted * # follow-on bolts)
– Acked: number of tuples acknowledged
– Execute Latency: timestamp when execute function ends - timestamp when execute is
passed tuple
– Process Latency: timestamp when ack is called - timestamp when execute is passed tuple
– Capacity: % of the time in the last 10 minutes the bolt spent executing tuples
• Many metrics are sampled, so they don't always indicate problems
• Good measurement is comparing number of tuples transferred from spout, to number
of tuples acknowledged in bolt
– If transferred number is getting increasingly higher than number of acknowledged tuples, then
the topology is not keeping up with the rate of data
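The health check described above (transferred vs. acknowledged tuples) can be sketched as a simple comparison over successive Storm UI samples. A stdlib-only illustration with hypothetical numbers; the function name is ours:

```java
public class BacklogCheck {
    // Given matching samples of tuples transferred from the spout and
    // tuples acked by the bolts, report whether the gap between them
    // grew at every sample - the sign the topology is falling behind.
    static boolean fallingBehind(long[] transferred, long[] acked) {
        long prevGap = -1;
        for (int i = 0; i < transferred.length; i++) {
            long gap = transferred[i] - acked[i];
            if (prevGap >= 0 && gap <= prevGap) return false; // gap shrank or held steady
            prevGap = gap;
        }
        return true; // backlog grew at every sample
    }

    public static void main(String[] args) {
        // Hypothetical Storm UI samples taken over time.
        System.out.println(fallingBehind(
            new long[]{1_000, 2_000, 3_000},
            new long[]{  900, 1_500, 1_800})); // true: gap 100 -> 500 -> 1200
        System.out.println(fallingBehind(
            new long[]{1_000, 2_000, 3_000},
            new long[]{  990, 1_990, 2_990})); // false: gap steady at 10
    }
}
```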
25. Trial Runs – First Attempt
• 8 workers, 1 Spout, 8 Query Executor Bolts, 8 Result Bolts
• Article spout emitting as fast as possible
• Query execution at 1k docs or 60 seconds elapsed time
• Increased number of queries on each trial: 10k, 50k, 100k, 200k, 300k, 400k, 500k
Results:
• Articles emitted too fast for
bolts to keep up
• If data continued to stream
at this rate, topology would
back up and drop tuples
[Cluster diagram: Nodes 1-8, each with 4 worker slots running Query Bolts x4 and Result Bolts x4; Article Spout on Node 1]
26. Trial Runs – Second Attempt
• 8 workers, 1 Spout, 8 Query Executor Bolts, 8 Result Bolts
• Article spout now places articles on queue in background thread every 100ms
• Everything else the same…
Results:
• Topology performing much
better, keeping up with data
flow for query size of 10k,
50k, 100k, 200k
• Slows down around 300k
queries, approx 37.5k
queries/bolt
[Cluster diagram: same layout as the first attempt - Nodes 1-8 with Query Bolts x4 and Result Bolts x4 across the worker slots; Article Spout on Node 1]
27. Trial Runs – Third Attempt
• Each node has 4 worker slots, so let's scale up
• 16 workers, 1 spout, 16 Query Executor Bolts, 8 Result Bolts
• Everything else the same…
Results:
• 300k queries now keeping
up no problem
• 400k doing ok…
• 500k backing up a bit
[Cluster diagram: Nodes 1-8 with 16 Query Executor Bolts (2 per node) and 8 Result Bolts across the worker slots; Article Spout on Node 1]
28. Trial Runs – Fourth Attempt
• Next logical step, 32 workers, 1 spout, 32 Query Executor Bolts
• Didn’t result in anticipated performance gain, 500k still too much
• Hypothesizing that 2-core VMs might not be enough to get full performance from 4
worker slots
[Cluster diagram: Nodes 1-8 with 32 Query Executor Bolts (4 per node) across the worker slots; Article Spout on Node 1]
29. Trial Runs – Conclusions
• Most important factor affecting performance is relationship between data rate and
number of queries
• Ideal Storm configuration is dependent on hardware executing the topology
• Optimal configuration resulted in 250 queries per second per bolt, 4k queries per
second across topology
• High level of performance from relatively small cluster
30. Conclusions
• Low barrier to entry working with Storm
• Easy conversion of Solr indices to Lucene Indices
• Simple integration between Lucene and Storm; Solr more complicated
• Configuration is key, tune topology to your needs
• Overall strategy appears to scale well for our use case, limited only by hardware
31. Future Considerations
• Adjust the batch size on the query executor bolt
• Combine duplicate queries (between users) if your system has many duplicates
• Investigate additional optimizations during Solr to Lucene
• Run topology with more complex queries (fielded, filtered, etc.)
• Investigate handling of bolt failure
• If ratio of incoming data to queries was reversed, consider switching the groupings
between the spouts and executor bolts
33. Updates Since Solr Lucene Revolution 2013
• Storm has moved to top-level Apache project
– https://storm.incubator.apache.org/
– Released 0.9.1, 0.9.2, 0.9.3-rc1
– Newer releases resolve classpath issue with EmbeddedSolrServer
– Improved Netty transport, new topology visualization
Source: http://storm.incubator.apache.org/2014/06/25/storm092-released.html
34. How can we test our topology at various scales with minimal setup?
• Launch Storm clusters on Amazon Web Services
– storm-deploy - https://github.com/nathanmarz/storm-deploy
• Created before Storm moved to Apache, limited activity
• install-0.9.1 branch has updates to pull Storm from Apache repo
• lein deploy-storm --start --name mycluster --branch master --commit v0.9.2-incubating
• Always launches m1.small - https://github.com/nathanmarz/storm-deploy/issues/67
– storm-deploy-alternative- https://github.com/KasperMadsen/storm-deploy-alternative
• Java alternative to storm-deploy
• Latest Apache Storm releases not supported yet, works with 0.8.2 and 0.9.0
– wirbelsturm - https://github.com/miguno/wirbelsturm
• Based on Vagrant and Puppet
• http://www.michael-noll.com/blog/2014/03/17/wirbelsturm-one-click-deploy-storm-kafka-clusters-with-vagrant-puppet/
• Steeper learning curve to get going
35. How can we test our topology at various scales with minimal setup?
• Make the topology independent of Storm cluster
– Previous spout required data to be on server where spout is running
• Better approach - poll an external source for data (Redis, Kafka, etc)
– Previous executor bolt loaded queries from a database
• Better approach - package a file of queries into topology jar
– Previous executor bolt expected a Solr config directory on the server
• Better approach – package config into topology jar, extract from classpath to
disk on start up
[Diagram: Redis Spout → Executor Bolt (queries file and SOLR_HOME packaged in the topology jar) → Result Bolt, all running in the Storm Cluster]
36. Luwak
• Presentation by Flax at Solr Lucene Revolution 2013 in Dublin
– Turning Search Upside Down: Using Lucene for Very Fast Stored Queries
– https://www.youtube.com/watch?v=rmRCsrJp2A8&list=UUKuRrzEQYP8pfCgCN8il4gQ
• Open sourced by Flax shortly after
– https://github.com/flaxsearch/luwak
• True inverted search solution
– Index queries
– Turn an incoming document into a query
– Determine which queries match that document
• Easy to integrate into existing Storm solution
• Clean API and documentation
Monitor monitor = new Monitor(
    new LuceneQueryParser("field"),
    new TermFilteredPresearcher());

MonitorQuery mq = new MonitorQuery("query1", "field:text");
monitor.update(mq);

InputDocument doc = InputDocument.builder("doc1")
    .addField(textfield, document,
        new StandardTokenizer(Version.LUCENE_50))
    .build();

SimpleMatcher matches = monitor.match(doc, SimpleMatcher.FACTORY);
37. Performance Comparison
• How fast can we process all 92k
articles with varying query sizes?
• Performance comparison outside of
Storm, single-thread Java process
• Solr & Lucene solutions batch docs
– Allow 1,000 docs to be added to in-memory
index
– Execute all queries, clear, start over
• Luwak evaluates one document at a
time against indexed queries
38. Wrap-Up
• Conclusion
– Storm = scalable stream processing framework
– Luwak = high performance inverted search solution
– Luwak + Storm = scalable, high performance, inverted search solution!
• Contact Info
– bende_bryan@bah.com / Twitter @bbende
– conlin_joshua@bah.com / Twitter @jmconlin
• Thanks for having us!