Real-time Inverted Search in the Cloud Using Lucene and Storm
Building real-time notification systems is often limited to basic filtering and pattern matching against incoming records. Allowing users to query incoming documents using Solr's full range of capabilities is much more powerful. In our environment we needed a way to allow for tens of thousands of such query subscriptions, meaning we needed to find a way to distribute the query processing in the cloud. By creating in-memory Lucene indices from our Solr configuration, we were able to parallelize our queries across our cluster. To achieve this distribution, we wrapped the processing in a Storm topology to provide a flexible way to scale and manage our infrastructure. This presentation will describe our experiences creating this distributed, real-time inverted search notification framework.

Presentation Transcript

  • REAL-TIME INVERTED SEARCH IN THE CLOUD USING LUCENE AND STORM. Joshua Conlin (conlin_joshua@bah.com), Bryan Bende (bende_bryan@bah.com), James Owen (owen_james@bah.com)
  • Table of Contents: Problem Statement, Storm, Methodology, Results
  • Who are we? Booz Allen Hamilton – large consulting firm supporting many industries: Healthcare, Finance, Energy, Defense. Strategic Innovation Group – focus on innovative solutions that can be applied across industries; major focus on data science, big data, & information retrieval; multiple clients utilizing Solr for implementing search capabilities.
  • Client Applications & Architecture (diagram: Ingest → SolrCloud → Web App). Typical client applications allow users to: • query the document index using Lucene syntax • filter and facet results • save queries for future use
  • Problem Statement How do we instantly notify users of new documents that match their saved queries? Constraints: •  Process documents in real-time, notify as soon as possible •  Scale with the number of saved queries (starting with tens of thousands) •  Result set of notifications must match saved queries •  Must not impact performance of the web application •  Data arrives at varying speeds and varying sizes
  • Possible Solutions
    •  Second Solr instance to handle background execution of saved queries
    •  Fork ingest to primary and secondary Solr instances, execute all the saved queries against the secondary instance
        lotsOfQueries.size() = 1 x 10^9  // milliard?
        for (Query q : lotsOfQueries) {
            q  // *A* OR *B* OR ...
        }
        // ... this will take forever
    Pros:
    •  Easy to set up, simple
    •  Works for a consistent, small data flow
    Cons:
    •  Query bound
  • Possible Solutions
    •  Distribute queries amongst multiple machines
    •  Execute queries against a shared Solr (or SolrCloud) instance
    (diagram: four machines, each running its own share of the saved queries)
        lotsOfQueries.size() = 2.5 x 10^8
        for (Query q : lotsOfQueries) {
            q  // *A* OR *B* OR ...
        }
    Pros:
    •  Scalable, only bound by the processing of the Solr instance
    Cons:
    •  Who is maintaining this code?
    •  Synchronization issues; the index cannot be updated during query execution
  • Possible Solutions
    One way to deal with the synchronization issues is to do away with a shared Solr instance, give each VM its own instance, and then distribute the data or queries evenly across the VMs.
    (diagram: two VMs, each running its own Solr instance and half of the saved queries)
        lotsOfQueries.size() = 5 x 10^8
        for (Query q : lotsOfQueries) {
            q  // *A* OR *B* OR ...
        }
    Pros:
    •  Scalable, processing power only bound by the number of VMs
    •  Can handle variable data flow; query processing would not need to be synchronized
    Cons:
    •  Difficult to maintain
  • Possible Solutions Is there a way we can set up this system so that it’s: •  easy to maintain, •  easy to scale, and •  easy to synchronize?
  • Candidate Solution: • Integrate Solr and/or Lucene with a stream processing framework • Process data in real-time, leverage a proven framework for distributed stream processing (diagram: Ingest, SolrCloud, Storm, Web App, Notifications)
  • Storm - Overview •  Storm is an open source stream processing framework. •  It’s a scalable platform that lets you distribute processes across a cluster quickly and easily. •  You can add more resources to your cluster and easily utilize those resources in your processing.
  • Storm - Components: • Nimbus – the control node for the cluster; distributes jobs through the cluster • Supervisor – one on each machine in the cluster; controls the allocation of worker assignments on its machine • Worker – JVM process for running topology components (diagram: one Nimbus managing three Supervisors, each with four Workers)
  • Storm – Core Concepts •  Topology – defines a running process, which includes all of the processes to be run, the connections between those processes, and their configuration •  Stream – the flow of data through a topology; it is an unbounded collection of tuples that is passed from process to process •  Storm has 2 types of processing units: –  Spout – the start of a stream; it can be thought of as the source of the data; that data can be read in however the spout wants—from a database, from a message queue, etc. –  Bolt – the primary processing unit for a topology; it accepts any number of streams, does whatever processing you’ve set it to do, and outputs any number of streams based on how you configure it
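To make the spout and bolt concepts concrete, here is a minimal, hypothetical sketch against the Storm 0.8/0.9 API referenced later in the talk; the class names, stream fields, and data source are illustrative assumptions, not part of the presentation:

    import backtype.storm.spout.SpoutOutputCollector;
    import backtype.storm.task.OutputCollector;
    import backtype.storm.task.TopologyContext;
    import backtype.storm.topology.OutputFieldsDeclarer;
    import backtype.storm.topology.base.BaseRichBolt;
    import backtype.storm.topology.base.BaseRichSpout;
    import backtype.storm.tuple.Fields;
    import backtype.storm.tuple.Tuple;
    import backtype.storm.tuple.Values;
    import java.util.Map;

    // Hypothetical spout: the source of a stream of documents.
    public class DocumentSpout extends BaseRichSpout {
        private SpoutOutputCollector collector;

        public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
            this.collector = collector;
        }

        public void nextTuple() {
            // A real spout would read from a queue, a database, a file drop, etc.
            collector.emit(new Values("doc-1", "example document text"));
        }

        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("id", "text"));
        }
    }

    // Hypothetical bolt: consumes the stream and emits a derived stream.
    class PassThroughBolt extends BaseRichBolt {
        private OutputCollector collector;

        public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
            this.collector = collector;
        }

        public void execute(Tuple tuple) {
            String id = tuple.getStringByField("id");
            collector.emit(new Values(id));   // pass the document id downstream
            collector.ack(tuple);             // acknowledge so Storm can track completion
        }

        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("id"));
        }
    }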
  • Storm – Core Concepts (continued) •  Stream Groupings – defines how topology processing units (spouts and bolts) are connected to each other; some common groupings are: –  All Grouping – stream is sent to all bolts –  Shuffle Grouping – stream is evenly distributed across bolts –  Fields grouping – sends tuples that match on the designated “field” to the same bolt
  • How to Utilize Storm: How can we use this framework to solve our problem? Let Storm distribute the data and queries between processing nodes... but we would still need to manage a Solr instance on each VM, and we would also need to ensure synchronization between query processing bolts running on the same VM.
  • How to Utilize Storm: What if, instead of having a Solr installation on each machine, we ran Solr in memory inside each of the processing bolts? • Use a Storm spout to distribute new documents • Use a Storm bolt to execute queries against an EmbeddedSolrServer backed by a RAMDirectory – incoming documents are added to the index, queries are executed, documents are removed from the index • Use a Storm bolt to process the query results (diagram: Bolt containing an EmbeddedSolrServer backed by a RAMDirectory; a sketch of this approach appears below)
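A rough sketch of the EmbeddedSolrServer approach described above, using the Solr 4.x SolrJ API; the solr home path, core name, and field names are assumptions, and the RAMDirectory backing is assumed to come from configuring solr.RAMDirectoryFactory in the core's solrconfig.xml. As later slides explain, this approach was ultimately dropped in favor of plain Lucene because of dependency conflicts.

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.embedded.EmbeddedSolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;
    import org.apache.solr.common.SolrInputDocument;
    import org.apache.solr.core.CoreContainer;

    public class EmbeddedSolrSketch {
        public static void main(String[] args) throws Exception {
            // Assumes a solr home whose solrconfig.xml declares
            // <directoryFactory name="DirectoryFactory" class="solr.RAMDirectoryFactory"/>
            CoreContainer container = new CoreContainer("/path/to/solr-home");
            container.load();
            EmbeddedSolrServer solr = new EmbeddedSolrServer(container, "collection1");

            // Incoming document added to the in-memory index
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "doc-1");
            doc.addField("text", "example article summary");
            solr.add(doc);
            solr.commit();

            // Saved query executed against the current batch
            QueryResponse response = solr.query(new SolrQuery("text:example"));
            System.out.println("matches: " + response.getResults().getNumFound());

            // Documents removed from the index once processed
            solr.deleteByQuery("*:*");
            solr.commit();
            solr.shutdown();
        }
    }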
  • Advantages: This has several advantages: • It removes the need to maintain a Solr instance on each VM. • It's easier to scale and more flexible; it doesn't matter which Supervisor the bolts get sent to, all the processing is self-contained. • It removes the need to synchronize processing between bolts. • Documents are volatile: existing queries are run over new data as it arrives.
  • Execution Topology
    •  Data Spout – receives incoming data files and sends them to every Executor Bolt (All Grouping)
    •  Query Spout – coordinates updates to queries
    •  Executor Bolt – loads and executes queries
    •  Notification Bolt – generates notifications based on results (Shuffle Grouping from the Executor Bolts)
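A minimal sketch of how this topology might be wired with Storm's TopologyBuilder. The spout and bolt classes, the parallelism hints, and the grouping used for the query spout are assumptions; the slide only specifies the All Grouping from the data spout and the Shuffle Grouping into the notification bolt.

    import backtype.storm.Config;
    import backtype.storm.StormSubmitter;
    import backtype.storm.topology.TopologyBuilder;

    public class InvertedSearchTopology {
        public static void main(String[] args) throws Exception {
            TopologyBuilder builder = new TopologyBuilder();

            builder.setSpout("data-spout", new DataSpout(), 1);
            builder.setSpout("query-spout", new QuerySpout(), 1);

            // All Grouping: every Executor Bolt sees every incoming document.
            // The grouping for query updates is an assumption; here the updates
            // are simply spread across the executors.
            builder.setBolt("executor-bolt", new ExecutorBolt(), 8)
                   .allGrouping("data-spout")
                   .shuffleGrouping("query-spout");

            // Shuffle Grouping: matches are spread evenly across Notification Bolts.
            builder.setBolt("notification-bolt", new NotificationBolt(), 8)
                   .shuffleGrouping("executor-bolt");

            Config conf = new Config();
            conf.setNumWorkers(8);
            StormSubmitter.submitTopology("inverted-search", conf, builder.createTopology());
        }
    }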
  • Executor Bolt
    1.  Queries are loaded into memory
    2.  Incoming documents are added to the Lucene index
    3.  Documents are processed when one of the following conditions is met:
        a)  the number of documents exceeds the max batch size
        b)  the time since the last execution is longer than the max interval
    4.  Matching queries and document UIDs are emitted
    5.  All documents are removed from the index
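A minimal sketch of the batching logic above for the Executor Bolt; the helper methods, output field names, and threshold values are illustrative assumptions rather than the presenters' implementation.

    import backtype.storm.task.OutputCollector;
    import backtype.storm.task.TopologyContext;
    import backtype.storm.topology.OutputFieldsDeclarer;
    import backtype.storm.topology.base.BaseRichBolt;
    import backtype.storm.tuple.Fields;
    import backtype.storm.tuple.Tuple;
    import backtype.storm.tuple.Values;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;

    public class ExecutorBolt extends BaseRichBolt {
        private static final int MAX_BATCH_SIZE = 1000;       // illustrative thresholds
        private static final long MAX_INTERVAL_MS = 60000L;

        private OutputCollector collector;
        private long lastExecution;
        private int pendingDocs;

        public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
            this.collector = collector;
            this.lastExecution = System.currentTimeMillis();
            loadQueries();                    // 1. load this bolt's saved queries into memory
        }

        public void execute(Tuple tuple) {
            addToIndex(tuple);                // 2. add the incoming document to the in-memory index
            pendingDocs++;

            long elapsed = System.currentTimeMillis() - lastExecution;
            if (pendingDocs >= MAX_BATCH_SIZE || elapsed >= MAX_INTERVAL_MS) {
                for (String[] match : runQueries()) {
                    // 4. emit the matching query id and document UID
                    collector.emit(new Values(match[0], match[1]));
                }
                clearIndex();                 // 5. remove all documents from the index
                pendingDocs = 0;
                lastExecution = System.currentTimeMillis();
            }
            collector.ack(tuple);
        }

        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("queryId", "documentUid"));
        }

        // Hypothetical helpers around the in-memory Lucene index (see the Lucene sketch below).
        private void loadQueries() { }
        private void addToIndex(Tuple tuple) { }
        private List<String[]> runQueries() { return new ArrayList<String[]>(); }  // 3. run saved queries over the batch
        private void clearIndex() { }
    }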
  • Solr In-Memory Processing Bolt Issues: • Attempted to run Solr with an in-memory index inside a Storm bolt • Solr 4.5 requires http-client 4.2.3 and http-core 4.2.2 • Storm 0.8.2 & 0.9.0 require http-client 4.1.1 and http-core 4.1 • Could exclude the libraries from the super jar and rely on storm/lib, but Solr expects SystemDefaultHttpClient from 4.2.3 • Could build Storm with newer versions of the libraries, but not guaranteed to work
  • Lucene In-Memory Processing Bolt: 1. Initialization – parse the common Solr schema, replace Solr classes 2. Add Documents – convert each SolrInputDocument to a Lucene Document, add it to the index. Advantages: • Fast, lightweight • No dependency conflicts • RAMDirectory backed • Easy Solr to Lucene document conversion • Solr schema based (diagram: Bolt containing a Lucene index backed by a RAMDirectory)
  • Lucene In-Memory Processing Bolt: Read/parse/update the Solr schema file using StAX, then create an IndexSchema from the new Solr schema data.
        public void addDocument(SolrInputDocument doc) throws Exception {
            if (doc != null) {
                // Convert the Solr document to a plain Lucene document and add it to the index
                Document luceneDoc = solrDocumentConverter.convert(doc);
                indexWriter.addDocument(luceneDoc);
                indexWriter.commit();
            }
        }
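For reference, a minimal sketch of how a RAMDirectory-backed Lucene index like the one described above could be created, queried, and cleared with the Lucene 4.x API; the analyzer choice, default field name, and class structure are assumptions, not the presenters' code.

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.queryparser.classic.QueryParser;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.store.RAMDirectory;
    import org.apache.lucene.util.Version;

    public class InMemoryIndex {
        private final RAMDirectory directory = new RAMDirectory();
        private final StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_45);
        private final IndexWriter indexWriter;

        public InMemoryIndex() throws Exception {
            // The index lives entirely in memory inside the bolt's JVM
            indexWriter = new IndexWriter(directory, new IndexWriterConfig(Version.LUCENE_45, analyzer));
        }

        public void add(Document luceneDoc) throws Exception {
            indexWriter.addDocument(luceneDoc);
            indexWriter.commit();
        }

        // Run one saved query against the current batch of documents
        public ScoreDoc[] search(String savedQuery, int maxHits) throws Exception {
            Query query = new QueryParser(Version.LUCENE_45, "text", analyzer).parse(savedQuery);
            IndexSearcher searcher = new IndexSearcher(DirectoryReader.open(directory));
            return searcher.search(query, maxHits).scoreDocs;
        }

        // Drop all documents once the batch has been processed
        public void clear() throws Exception {
            indexWriter.deleteAll();
            indexWriter.commit();
        }
    }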
  • Prototype Solution
    •  Infrastructure: 8-node cluster on Amazon EC2; each VM has 2 cores and 8 GB of memory
    •  Data: 92,000 news article summaries; average file size ~1 KB
    •  Queries: generated 1 million sample queries from randomly selected terms in the document set; stored in MariaDB (username, query string); the Query Executor Bolt can be configured to use any subset of these queries
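As an illustration of loading the saved queries from MariaDB, a plain JDBC sketch follows; the connection URL, table, and column names are assumptions, since the presentation only states that queries were stored as (username, query string) pairs.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;
    import java.util.ArrayList;
    import java.util.List;

    public class QueryLoader {
        // Hypothetical connection details and table/column names
        private static final String URL = "jdbc:mariadb://localhost:3306/notifications";

        public static List<String> loadQueries(String user, String password) throws Exception {
            List<String> queries = new ArrayList<String>();
            Connection conn = DriverManager.getConnection(URL, user, password);
            try {
                Statement stmt = conn.createStatement();
                ResultSet rs = stmt.executeQuery("SELECT username, query_string FROM saved_queries");
                while (rs.next()) {
                    queries.add(rs.getString("query_string"));
                }
            } finally {
                conn.close();
            }
            return queries;
        }
    }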
  • Prototype Solution – Monitoring Performance
    •  Metrics provided by the Storm UI:
        –  Emitted: number of tuples emitted
        –  Transferred: number of tuples transferred (emitted × number of follow-on bolts)
        –  Acked: number of tuples acknowledged
        –  Execute Latency: timestamp when the execute function ends minus timestamp when execute is passed the tuple
        –  Process Latency: timestamp when ack is called minus timestamp when execute is passed the tuple
        –  Capacity: % of the time in the last 10 minutes the bolt spent executing tuples
    •  Many metrics are sampled and don't always indicate problems
    •  A good measurement is comparing the number of tuples transferred from the spout to the number of tuples acknowledged in the bolts: if the transferred count grows increasingly higher than the acknowledged count, the topology is not keeping up with the rate of data
  • Trial Runs – First Attempt
    •  8 workers, 1 spout, 8 Query Executor Bolts, 8 Result Bolts
    •  Article spout emitting as fast as possible
    •  Query execution at 1k docs or 60 seconds elapsed time
    •  Increased the number of queries on each trial: 10k, 50k, 100k, 200k, 300k, 400k, 500k
    (diagram: 8 nodes with 4 worker slots each; Node 1 also runs the Article Spout; each node runs one Query Executor Bolt and one Result Bolt)
    Results:
    •  Articles were emitted too fast for the bolts to keep up
    •  If data continued to stream at this rate, the topology would back up and drop tuples
  • Trial Runs – Second Attempt
    •  8 workers, 1 spout, 8 Query Executor Bolts, 8 Result Bolts
    •  Article spout now places articles on a queue in a background thread every 100 ms
    •  Everything else the same
    (diagram: same 8-node layout as the first attempt)
    Results:
    •  Topology performing much better, keeping up with the data flow for query sizes of 10k, 50k, 100k, and 200k
    •  Slows down around 300k queries, approximately 37.5k queries per bolt
  • Trial Runs – Third Attempt
    •  Each node has 4 worker slots, so let's scale up
    •  16 workers, 1 spout, 16 Query Executor Bolts, 8 Result Bolts
    •  Everything else the same
    (diagram: same 8-node layout, now with 2 Query Executor Bolts per node)
    Results:
    •  300k queries now keeping up, no problem
    •  400k doing OK
    •  500k backing up a bit
  • Trial Runs – Fourth Attempt
    •  Next logical step: 32 workers, 1 spout, 32 Query Executor Bolts
    •  Didn't result in the anticipated performance gain; 500k queries still too much
    •  Hypothesizing that 2-core VMs might not be enough to get full performance from 4 worker slots
    (diagram: same 8-node layout, now with 4 Query Executor Bolts per node)
  • Trial Runs – Conclusions: • The most important factor affecting performance is the relationship between data rate and number of queries • The ideal Storm configuration is dependent on the hardware executing the topology • The optimal configuration resulted in 250 queries per second per bolt, 4k queries per second across the topology • High level of performance from a relatively small cluster
  • Conclusions •  Low barrier to entry working with Storm •  Easy conversion of Solr indices to Lucene Indices •  Simple integration between Lucene and Storm; Solr more complicated •  Configuration is key, tune topology to your needs •  Overall strategy appears to scale well for our use case, limited only by hardware
  • Future Considerations: • Adjust the batch size on the query executor bolt • Combine duplicate queries (between users) if your system has many duplicates • Investigate additional optimizations during Solr to Lucene document conversion • Run the topology with more complex queries (fielded, filtered, etc.) • Investigate handling of bolt failure • If the ratio of incoming data to queries were reversed, consider switching the groupings between the spouts and executor bolts
  • Questions?