REAL-TIME INVERTED SEARCH IN THE 
CLOUD USING LUCENE AND STORM 
Joshua Conlin, Bryan Bende, James Owen 
conlin_joshua@bah.com 
bende_bryan@bah.com 
owen_james@bah.com
Table of Contents 
• Problem Statement 
• Storm 
• Methodology 
• Results
Who are we? 
Booz Allen Hamilton 
– Large consulting firm supporting many industries 
• Healthcare, Finance, Energy, Defense 
– Strategic Innovation Group 
• Focus on innovative solutions that can be applied across industries 
• Major focus on data science, big data, & information retrieval 
• Multiple clients utilizing Solr for implementing search capabilities 
• Explore Data Science 
• Self-paced data science training, launching TODAY! 
• https://exploredatascience.com
Client Applications & Architecture 
[Architecture diagram: Ingest → SolrCloud → Web App] 
Typical client applications allow users to: 
• Query document index using Lucene syntax 
• Filter and facet results 
• Save queries for future use
Problem Statement 
How do we instantly notify users of new documents that match their 
saved queries? 
Constraints: 
• Process documents in real-time, notify as soon as possible 
• Scale with the number of saved queries (starting with tens of thousands) 
• Result set of notifications must match saved queries 
• Must not impact performance of the web application 
• Data arrives at varying speeds and varying sizes
Possible Solutions 
1. Fork ingest to a second Solr instance, run stored queries periodically 
– Pros: Easy to set up, works for a small amount of data & a small # of queries 
– Cons: Bound by time to execute all queries 
2. Same secondary Solr instance, but distribute queries to multiple servers 
– Pros: Reduces query processing time by dividing across several servers 
– Cons: Now writing custom code to distribute queries, possible synchronization issues 
ensuring each server executes queries against the same data 
3. Give each server its own Solr instance and subset of queries 
– Pros: Very scalable, only bound by number of servers 
– Cons: Difficult to maintain, still writing custom code to distribute data and queries
Possible Solutions 
Is there a way we can set up this system so that it’s: 
• easy to maintain, 
• easy to scale, and 
• easy to synchronize?
Candidate Solution 
• Integrate Solr and/or Lucene with a stream processing framework 
• Process data in real-time, leverage proven framework for distributed stream 
processing 
[Architecture diagram: Ingest feeds SolrCloud, which serves the Web App, and also feeds Storm, which produces Notifications] 
Storm - Overview 
• Storm is an open source stream processing framework. 
• It’s a scalable platform that lets you distribute processes across a cluster quickly 
and easily. 
• You can add more resources to your cluster and easily utilize those resources in 
your processing.
Storm - Components 
• Nimbus – the control node for the cluster, distributes topology through the cluster 
• Supervisor – one on each machine in the cluster, controls the allocation of worker 
assignments on its machine 
• Worker – JVM process for running topology components 
[Cluster diagram: a Nimbus node coordinating three Supervisors, each running four Workers]
Storm – Core Concepts 
• Topology – defines a running process, which includes all of the processes to be 
run, the connections between those processes, and their configuration 
• Stream – the flow of data through a topology; it is an unbounded collection of 
tuples that is passed from process to process 
• Storm has 2 types of processing units: 
– Spout – the start of a stream; it can be thought of as the source of the data; 
that data can be read in however the spout wants—from a database, from a 
message queue, etc. 
– Bolt – the primary processing unit for a topology; it accepts any number of 
streams, does whatever processing you’ve set it to do, and outputs any 
number of streams based on how you configure it
Storm – Core Concepts (continued) 
• Stream Groupings – defines how topology processing units (spouts and bolts) are 
connected to each other; some common groupings are: 
– All Grouping – stream is sent to all bolts 
– Shuffle Grouping – stream is evenly distributed across bolts 
– Fields Grouping – sends tuples that match on the designated “field” to the 
same bolt
Storm - Parallelism 
Source: http://storm.incubator.apache.org/documentation/Understanding-the-parallelism-of-a-Storm-topology.html
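The linked page distinguishes three levels of parallelism: worker processes (JVMs), executors (threads), and tasks. Below is a minimal sketch of setting those knobs with the 0.9.x-era API; TestWordSpout ships with Storm, PrintBolt is a trivial stand-in, and the counts are illustrative only, not a recommendation. 

import backtype.storm.Config; 
import backtype.storm.StormSubmitter; 
import backtype.storm.testing.TestWordSpout; 
import backtype.storm.topology.BasicOutputCollector; 
import backtype.storm.topology.OutputFieldsDeclarer; 
import backtype.storm.topology.TopologyBuilder; 
import backtype.storm.topology.base.BaseBasicBolt; 
import backtype.storm.tuple.Tuple; 

public class ParallelismSketch { 

    // Trivial bolt used only to have something to wire up 
    public static class PrintBolt extends BaseBasicBolt { 
        public void execute(Tuple tuple, BasicOutputCollector collector) { 
            System.out.println(tuple); 
        } 
        public void declareOutputFields(OutputFieldsDeclarer declarer) { } 
    } 

    public static void main(String[] args) throws Exception { 
        TopologyBuilder builder = new TopologyBuilder(); 
        builder.setSpout("word-spout", new TestWordSpout(), 1); 

        // Parallelism hint = 16 executors (threads), running 32 tasks total 
        builder.setBolt("print-bolt", new PrintBolt(), 16) 
               .setNumTasks(32) 
               .shuffleGrouping("word-spout"); 

        Config conf = new Config(); 
        conf.setNumWorkers(8); // 8 worker JVMs spread across the cluster 

        StormSubmitter.submitTopology("parallelism-sketch", conf, builder.createTopology()); 
    } 
}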
How to Utilize Storm 
How can we use this framework to solve our problem? 
Let Storm distribute the data and queries among processing nodes 
…but we would still need to manage a Solr instance on each VM, and we would also 
need to ensure synchronization between query-processing bolts running on the same VM.
How to Utilize Storm 
What if instead of having a Solr installation on each machine we ran 
Solr in memory inside each of the processing bolts? 
• Use Storm spout to distribute new documents 
• Use Storm bolt to execute queries against EmbeddedSolrServer with 
RAMDirectory 
– Incoming documents added to index 
– Queries executed 
– Documents removed from index 
• Use Storm bolt to process query results 
[Diagram: a Storm bolt hosting an EmbeddedSolrServer backed by a RAMDirectory; a sketch of this flow follows]
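Below is a hedged sketch of that per-bolt flow using SolrJ 4.x. The solr home path and core name are illustrative assumptions; the in-memory index comes from configuring RAMDirectoryFactory in the core's solrconfig.xml rather than from code. 

import org.apache.solr.client.solrj.SolrQuery; 
import org.apache.solr.client.solrj.embedded.EmbeddedSolrServer; 
import org.apache.solr.common.SolrInputDocument; 
import org.apache.solr.core.CoreContainer; 

public class EmbeddedSolrSketch { 
    public static void main(String[] args) throws Exception { 
        // Solr home path and core name are placeholder assumptions 
        CoreContainer container = new CoreContainer("/path/to/solr-home"); 
        container.load(); 
        EmbeddedSolrServer server = new EmbeddedSolrServer(container, "collection1"); 

        // 1. Incoming document added to the index 
        SolrInputDocument doc = new SolrInputDocument(); 
        doc.addField("id", "doc-1"); 
        doc.addField("text", "storm lucene inverted search"); 
        server.add(doc); 
        server.commit(); 

        // 2. Saved queries executed against the current batch 
        long hits = server.query(new SolrQuery("text:lucene")).getResults().getNumFound(); 
        System.out.println("matches: " + hits); 

        // 3. Documents removed so the index only ever holds the current batch 
        server.deleteByQuery("*:*"); 
        server.commit(); 
        server.shutdown(); 
    } 
}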
Advantages 
This has several advantages: 
• It removes the need to maintain a Solr instance on each VM. 
• It’s easier to scale and more flexible; it doesn’t matter which Supervisor the bolts 
get sent to, all the processing is self-contained. 
• It removes the need to synchronize processing between bolts. 
• Documents are volatile: existing queries are run over new data, and nothing is retained in the index
Execution Topology 
[Topology diagram: three Data Spouts and a Query Spout feed five Executor Bolts, which feed a Notification Bolt] 
Data Spout – Receives incoming data files and sends them to every Executor Bolt 
Query Spout – Coordinates updates to queries 
Executor Bolt – Loads and executes queries 
Notification Bolt – Generates all notifications based on results 
[Edge labels: All Grouping from the spouts into the Executor Bolts; Shuffle Grouping into the Notification Bolt; one plausible wiring is sketched below]
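A sketch of one plausible wiring of this topology with Storm's TopologyBuilder. The component classes are stubs mirroring the diagram, not published code, and partitioning query updates by a queryId field is an assumption about how each Executor Bolt receives its subset; the deck only confirms that documents go to every Executor Bolt and that results reach the Notification Bolt via Shuffle Grouping. 

import java.util.Map; 
import backtype.storm.Config; 
import backtype.storm.StormSubmitter; 
import backtype.storm.spout.SpoutOutputCollector; 
import backtype.storm.task.TopologyContext; 
import backtype.storm.topology.BasicOutputCollector; 
import backtype.storm.topology.OutputFieldsDeclarer; 
import backtype.storm.topology.TopologyBuilder; 
import backtype.storm.topology.base.BaseBasicBolt; 
import backtype.storm.topology.base.BaseRichSpout; 
import backtype.storm.tuple.Fields; 
import backtype.storm.tuple.Tuple; 

public class InvertedSearchTopologySketch { 

    // Stub components standing in for the diagram's spouts and bolts 
    public static class DataSpout extends BaseRichSpout { 
        public void open(Map conf, TopologyContext ctx, SpoutOutputCollector c) { } 
        public void nextTuple() { /* read incoming data files */ } 
        public void declareOutputFields(OutputFieldsDeclarer d) { d.declare(new Fields("document")); } 
    } 
    public static class QuerySpout extends BaseRichSpout { 
        public void open(Map conf, TopologyContext ctx, SpoutOutputCollector c) { } 
        public void nextTuple() { /* coordinate query updates */ } 
        public void declareOutputFields(OutputFieldsDeclarer d) { d.declare(new Fields("queryId", "query")); } 
    } 
    public static class ExecutorBolt extends BaseBasicBolt { 
        public void execute(Tuple t, BasicOutputCollector c) { /* index, match, emit */ } 
        public void declareOutputFields(OutputFieldsDeclarer d) { d.declare(new Fields("queryId", "documentUid")); } 
    } 
    public static class NotificationBolt extends BaseBasicBolt { 
        public void execute(Tuple t, BasicOutputCollector c) { /* notify the saved-query owner */ } 
        public void declareOutputFields(OutputFieldsDeclarer d) { } 
    } 

    public static void main(String[] args) throws Exception { 
        TopologyBuilder builder = new TopologyBuilder(); 
        builder.setSpout("data-spout", new DataSpout(), 3); 
        builder.setSpout("query-spout", new QuerySpout(), 1); 

        builder.setBolt("executor-bolt", new ExecutorBolt(), 5) 
               .allGrouping("data-spout")                             // every bolt sees every document 
               .fieldsGrouping("query-spout", new Fields("queryId")); // each query owned by one bolt (assumption) 
        builder.setBolt("notification-bolt", new NotificationBolt(), 1) 
               .shuffleGrouping("executor-bolt"); 

        Config conf = new Config(); 
        conf.setNumWorkers(8); 
        StormSubmitter.submitTopology("inverted-search", conf, builder.createTopology()); 
    } 
}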
Executor Bolt 
1. Queries are loaded into memory 
2. Incoming documents are added to the Lucene index 
3. Documents are processed when one of the following conditions is met: 
a) The number of documents exceeds the max batch size 
b) The time since the last execution is longer than the max interval time 
4. Matching queries and document UIDs are emitted 
5. All documents are removed from the index 
(A minimal skeleton of this logic is sketched after the diagram below.) 
[Diagram: the Executor Bolt holds the Query List and a batch of Documents; the numbered steps above flow through the bolt and end in emit()]
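A minimal skeleton of the batch-or-timeout logic, with the Lucene index handling elided; MAX_BATCH and MAX_INTERVAL_MS stand in for the configured limits, and a production bolt would also need Storm's tick tuples (or a timer) so the time-based trigger can fire when no documents arrive. 

import java.util.Map; 
import backtype.storm.task.OutputCollector; 
import backtype.storm.task.TopologyContext; 
import backtype.storm.topology.OutputFieldsDeclarer; 
import backtype.storm.topology.base.BaseRichBolt; 
import backtype.storm.tuple.Fields; 
import backtype.storm.tuple.Tuple; 

public class ExecutorBoltSketch extends BaseRichBolt { 
    private static final int MAX_BATCH = 1000; 
    private static final long MAX_INTERVAL_MS = 60000L; 

    private OutputCollector collector; 
    private int batched; 
    private long lastRun; 

    public void prepare(Map conf, TopologyContext context, OutputCollector collector) { 
        this.collector = collector; 
        this.lastRun = System.currentTimeMillis(); 
        // 1. Load this bolt's query subset into memory here 
    } 

    public void execute(Tuple tuple) { 
        // 2. Add the incoming document to the in-memory Lucene index (elided) 
        batched++; 
        long now = System.currentTimeMillis(); 
        if (batched >= MAX_BATCH || now - lastRun >= MAX_INTERVAL_MS) { 
            // 3-4. Run every loaded query; emit (queryId, documentUid) per match, e.g. 
            // collector.emit(new Values(queryId, documentUid)); 
            // 5. Clear the index so only new documents are ever searched 
            batched = 0; 
            lastRun = now; 
        } 
        collector.ack(tuple); 
    } 

    public void declareOutputFields(OutputFieldsDeclarer declarer) { 
        declarer.declare(new Fields("queryId", "documentUid")); 
    } 
}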
Solr In-Memory Processing Bolt Issues 
• Attempted to run Solr with in-memory index inside Storm bolt 
• Solr 4.5 requires: 
– http-client 4.2.3 
– http-core 4.2.2 
• Storm 0.8.2 & 0.9.0 require: 
– http-client 4.1.1 
– http-core 4.1 
• Could exclude libraries from the super jar and rely on storm/lib, but Solr 
expects SystemDefaultHttpClient from 4.2.3 
• Could build Storm with newer version of libraries, but not 
guaranteed to work
Lucene In-Memory Processing Bolt 
Advantages: 
• Fast, Lightweight 
• No Dependency Conflicts 
• RAMDirectory backed 
• Easy Solr to Lucene Document Conversion 
• Solr Schema based 
[Diagram: a Storm bolt hosting a Lucene index backed by a RAMDirectory] 
1. Initialization 
– Parse Common Solr Schema 
– Replace Solr Classes 
2. Add Documents 
– Convert SolrInputDocument to Lucene 
Document 
– Add to index
Lucene In-Memory Processing Bolt 
Read/parse/update the Solr schema file using StAX, then create an IndexSchema from the new schema data: 

public void addDocument(SolrInputDocument doc) throws Exception { 
    if (doc != null) { 
        // Convert the SolrJ input document to a plain Lucene document, then index it 
        Document luceneDoc = solrDocumentConverter.convert(doc); 
        indexWriter.addDocument(luceneDoc); 
        indexWriter.commit(); 
    } 
} 

public Document convert(SolrInputDocument solrDocument) throws Exception { 
    // DocumentBuilder applies the parsed IndexSchema (field types, analyzers) 
    return DocumentBuilder.toDocument(solrDocument, indexSchema); 
}
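For context, a hedged sketch of the in-memory index these helpers write into, assuming Lucene 4.5 to match Solr 4.5; the "text" field name and the saved-query string are illustrative only. 

import java.io.IOException; 
import org.apache.lucene.analysis.standard.StandardAnalyzer; 
import org.apache.lucene.index.DirectoryReader; 
import org.apache.lucene.index.IndexWriter; 
import org.apache.lucene.index.IndexWriterConfig; 
import org.apache.lucene.queryparser.classic.QueryParser; 
import org.apache.lucene.search.IndexSearcher; 
import org.apache.lucene.store.RAMDirectory; 
import org.apache.lucene.util.Version; 

// Open the in-memory index that the converter above writes into 
public IndexWriter openInMemoryIndex(RAMDirectory dir) throws IOException { 
    IndexWriterConfig config = new IndexWriterConfig( 
        Version.LUCENE_45, new StandardAnalyzer(Version.LUCENE_45)); 
    return new IndexWriter(dir, config); 
} 

// Execute one saved query string against the current batch 
public int countMatches(RAMDirectory dir, String savedQuery) throws Exception { 
    IndexSearcher searcher = new IndexSearcher(DirectoryReader.open(dir)); 
    QueryParser parser = new QueryParser( 
        Version.LUCENE_45, "text", new StandardAnalyzer(Version.LUCENE_45)); 
    return searcher.search(parser.parse(savedQuery), 10).totalHits; 
}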
Prototype Solution 
• Infrastructure: 
– 8 node cluster on Amazon EC2 
– Each VM has 2 cores and 8G of memory 
• Data: 
– 92,000 news article summaries 
– Average file size: ~1k 
• Queries: 
– Generated 1 million sample queries 
– Randomly selected terms from document set 
– Stored in MariaDB (username, query string) 
– Query Executor Bolt configured to use any subset of these queries
Prototype Solution – Monitoring Performance 
• Metrics Provided by Storm UI 
– Emitted: number of tuples emitted 
– Transferred: number of tuples transferred (emitted * # follow-on bolts) 
– Acked: number of tuples acknowledged 
– Execute Latency: timestamp when execute function ends - timestamp when execute is 
passed tuple 
– Process Latency: timestamp when ack is called - timestamp when execute is passed tuple 
– Capacity: % of the time in the last 10 minutes the bolt spent executing tuples 
• Many metrics are sampled and don't always indicate problems 
• A good measurement is comparing the number of tuples transferred from the spout to the number of tuples acknowledged in the bolt 
– If the transferred count grows increasingly higher than the acknowledged count, the topology is not keeping up with the rate of data
Trial Runs – First Attempt 
• 8 workers, 1 Spout, 8 Query Executor Bolts, 8 Result Bolts 
• Article spout emitting as fast as possible 
• Query execution at 1k docs or 60 seconds elapsed time 
• Increased number of queries on each trial: 10k, 50k, 100k, 200k, 300k, 400k, 500k 
Results: 
• Articles emitted too fast for 
bolts to keep up 
• If data continued to stream 
at this rate, topology would 
back up and drop tuples 
[Cluster diagram: 8 nodes with 4 worker slots each; the 8 Query Executor Bolts and 8 Result Bolts are spread across the nodes, with the Article Spout on Node 1]
Trial Runs – Second Attempt 
• 8 workers, 1 Spout, 8 Query Executor Bolts, 8 Result Bolts 
• Article spout now places articles on queue in background thread every 100ms 
• Everything else the same… 
Results: 
• Topology performing much 
better, keeping up with data 
flow for query size of 10k, 
50k, 100k, 200k 
• Slows down around 300k 
queries, approx 37.5k 
queries/bolt 
[Cluster diagram: same 8-node cluster and layout as the first attempt, with the Article Spout on Node 1]
Trial Runs – Third Attempt 
• Each node has 4 worker slots, so let's scale up 
• 16 workers, 1 spout, 16 Query Executor Bolts, 8 Result Bolts 
• Everything else the same… 
Results: 
• 300k queries now keeping 
up no problem 
• 400k doing ok… 
• 500k backing up a bit 
[Cluster diagram: same 8-node cluster, now running 16 Query Executor Bolts across the nodes; Article Spout on Node 1]
Trial Runs – Fourth Attempt 
• Next logical step, 32 workers, 1 spout, 32 Query Executor Bolts 
• Didn’t result in anticipated performance gain, 500k still too much 
• Hypothesizing that 2-core VMs might not be enough to get full performance from 4 
worker slots 
[Cluster diagram: same 8-node cluster with 32 Query Executor Bolts filling the worker slots; Article Spout on Node 1]
Trial Runs – Conclusions 
• Most important factor affecting performance is relationship between data rate and 
number of queries 
• Ideal Storm configuration is dependent on hardware executing the topology 
• Optimal configuration resulted in 250 queries per second per bolt, about 4k queries 
per second across the topology (16 executor bolts × 250) 
• High level of performance from relatively small cluster
Conclusions 
• Low barrier to entry working with Storm 
• Easy conversion of Solr indices to Lucene indices 
• Simple integration between Lucene and Storm; Solr more complicated 
• Configuration is key, tune topology to your needs 
• Overall strategy appears to scale well for our use case, limited only by hardware
Future Considerations 
• Adjust the batch size on the query executor bolt 
• Combine duplicate queries (between users) if your system has many duplicates 
• Investigate additional optimizations during Solr-to-Lucene conversion 
• Run topology with more complex queries (fielded, filtered, etc.) 
• Investigate handling of bolt failure 
• If the ratio of incoming data to queries were reversed, consider switching the groupings 
between the spouts and executor bolts
Questions?
Updates Since Solr Lucene Revolution 2013 
• Storm has moved to top-level Apache project 
– https://storm.incubator.apache.org/ 
– Released 0.9.1, 0.9.2, 0.9.3-rc1 
– Newer releases resolve classpath issue with EmbeddedSolrServer 
– Improved Netty transport, new topology visualization 
Source: http://storm.incubator.apache.org/2014/06/25/storm092-released.html
How can we test our topology at various scales with minimal setup? 
• Launch Storm clusters on Amazon Web Services 
– storm-deploy - https://github.com/nathanmarz/storm-deploy 
• Created before Storm moved to Apache, limited activity 
• install-0.9.1 branch has updates to pull Storm from Apache repo 
• lein deploy-storm --start --name mycluster --branch master --commit v0.9.2-incubating 
• Always launches m1.small - https://github.com/nathanmarz/storm-deploy/issues/67 
– storm-deploy-alternative - https://github.com/KasperMadsen/storm-deploy-alternative 
• Java alternative to storm-deploy 
• Latest Apache Storm releases not supported yet, works with 0.8.2 and 0.9.0 
– wirbelsturm - https://github.com/miguno/wirbelsturm 
• Based on Vagrant and Puppet 
• http://www.michael-noll.com/blog/2014/03/17/wirbelsturm-one-click-deploy-storm-kafka-clusters-with-vagrant-puppet/ 
• Steeper learning curve to get going
How can we test our topology at various scales with minimal setup? 
• Make the topology independent of the Storm cluster 
– Previous spout required data to be on the server where the spout is running 
• Better approach - poll an external source for data (Redis, Kafka, etc.) 
– Previous executor bolt loaded queries from a database 
• Better approach - package a file of queries into topology jar 
– Previous executor bolt expected a Solr config directory on the server 
• Better approach – package the config into the topology jar, extract it from the classpath to disk on startup 
[Diagram: Storm cluster in which a Redis Spout feeds an Executor Bolt, with its queries and SOLR_HOME packaged in the jar, which feeds a Result Bolt; a spout sketch follows]
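A minimal sketch of such a Redis-polling spout, assuming the Jedis client and a Redis list named "articles"; the host and list names are illustrative only. 

import java.util.Map; 
import backtype.storm.spout.SpoutOutputCollector; 
import backtype.storm.task.TopologyContext; 
import backtype.storm.topology.OutputFieldsDeclarer; 
import backtype.storm.topology.base.BaseRichSpout; 
import backtype.storm.tuple.Fields; 
import backtype.storm.tuple.Values; 
import redis.clients.jedis.Jedis; 

public class RedisSpoutSketch extends BaseRichSpout { 
    private SpoutOutputCollector collector; 
    private Jedis jedis; 

    public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) { 
        this.collector = collector; 
        this.jedis = new Jedis("redis-host"); // placeholder hostname 
    } 

    public void nextTuple() { 
        String article = jedis.lpop("articles"); // non-blocking pop; null if the list is empty 
        if (article != null) { 
            collector.emit(new Values(article)); 
        } 
    } 

    public void declareOutputFields(OutputFieldsDeclarer declarer) { 
        declarer.declare(new Fields("article")); 
    } 
}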
Luwak 
• Presentation by Flax at Solr Lucene Revolution 2013 in Dublin 
– Turning Search Upside Down: Using Lucene for Very Fast Stored Queries 
– https://www.youtube.com/watch?v=rmRCsrJp2A8&list=UUKuRrzEQYP8pfCgCN8il4gQ 
• Open sourced by Flax shortly after 
– https://github.com/flaxsearch/luwak 
• True inverted search solution 
– Index queries 
– Turn an incoming document into a query 
– Determine which queries match that document 
• Easy to integrate into existing Storm solution 
• Clean API and documentation 
Monitor monitor = new Monitor( 
    new LuceneQueryParser("field"), 
    new TermFilteredPresearcher()); 

MonitorQuery mq = new MonitorQuery("query1", "field:text"); 
monitor.update(mq); 

InputDocument doc = InputDocument.builder("doc1") 
    .addField(textfield, document, 
        new StandardTokenizer(Version.LUCENE_50)) 
    .build(); 

SimpleMatcher matches = monitor.match(doc, SimpleMatcher.FACTORY);
Performance Comparison 
• How fast can we process all 92k 
articles with varying query sizes? 
• Performance comparison outside of Storm, single-threaded Java process 
• Solr & Lucene solutions batch docs 
– Allow 1,000 docs to be added to in-memory 
index 
– Execute all queries, clear, start over 
• Luwak evaluates one document at a 
time against indexed queries
Wrap-Up 
• Conclusion 
– Storm = scalable stream processing framework 
– Luwak = high performance inverted search solution 
– Luwak + Storm = scalable, high performance, inverted search solution! 
• Contact Info 
– bende_bryan@bah.com / Twitter @bbende 
– conlin_joshua@bah.com / Twitter @jmconlin 
• Thanks for having us!

More Related Content

What's hot

Hadoop & cloud storage object store integration in production (final)
Hadoop & cloud storage  object store integration in production (final)Hadoop & cloud storage  object store integration in production (final)
Hadoop & cloud storage object store integration in production (final)
Chris Nauroth
 
Large-Scale Stream Processing in the Hadoop Ecosystem
Large-Scale Stream Processing in the Hadoop Ecosystem Large-Scale Stream Processing in the Hadoop Ecosystem
Large-Scale Stream Processing in the Hadoop Ecosystem
DataWorks Summit/Hadoop Summit
 
Sub-second-sql-on-hadoop-at-scale
Sub-second-sql-on-hadoop-at-scaleSub-second-sql-on-hadoop-at-scale
Sub-second-sql-on-hadoop-at-scale
Yifeng Jiang
 
The Power of Intelligent Flows: Real-Time IoT Botnet Classification with Apac...
The Power of Intelligent Flows: Real-Time IoT Botnet Classification with Apac...The Power of Intelligent Flows: Real-Time IoT Botnet Classification with Apac...
The Power of Intelligent Flows: Real-Time IoT Botnet Classification with Apac...
DataWorks Summit
 
Best practices and lessons learnt from Running Apache NiFi at Renault
Best practices and lessons learnt from Running Apache NiFi at RenaultBest practices and lessons learnt from Running Apache NiFi at Renault
Best practices and lessons learnt from Running Apache NiFi at Renault
DataWorks Summit
 
Evolving HDFS to Generalized Storage Subsystem
Evolving HDFS to Generalized Storage SubsystemEvolving HDFS to Generalized Storage Subsystem
Evolving HDFS to Generalized Storage Subsystem
DataWorks Summit/Hadoop Summit
 
A TPC Benchmark of Hive LLAP and Comparison with Presto
A TPC Benchmark of Hive LLAP and Comparison with PrestoA TPC Benchmark of Hive LLAP and Comparison with Presto
A TPC Benchmark of Hive LLAP and Comparison with Presto
Yu Liu
 
Speed Up Your Queries with Hive LLAP Engine on Hadoop or in the Cloud
Speed Up Your Queries with Hive LLAP Engine on Hadoop or in the CloudSpeed Up Your Queries with Hive LLAP Engine on Hadoop or in the Cloud
Speed Up Your Queries with Hive LLAP Engine on Hadoop or in the Cloud
gluent.
 
Streamline Hadoop DevOps with Apache Ambari
Streamline Hadoop DevOps with Apache AmbariStreamline Hadoop DevOps with Apache Ambari
Streamline Hadoop DevOps with Apache Ambari
DataWorks Summit/Hadoop Summit
 
Apache flume by Swapnil Dubey
Apache flume by Swapnil DubeyApache flume by Swapnil Dubey
Apache flume by Swapnil DubeySwapnil Dubey
 
Ozone- Object store for Apache Hadoop
Ozone- Object store for Apache HadoopOzone- Object store for Apache Hadoop
Ozone- Object store for Apache Hadoop
Hortonworks
 
Streaming SQL with Apache Calcite
Streaming SQL with Apache CalciteStreaming SQL with Apache Calcite
Streaming SQL with Apache Calcite
Julian Hyde
 
Meet HBase 2.0 and Phoenix 5.0
Meet HBase 2.0 and Phoenix 5.0Meet HBase 2.0 and Phoenix 5.0
Meet HBase 2.0 and Phoenix 5.0
DataWorks Summit
 
LLAP: Building Cloud First BI
LLAP: Building Cloud First BILLAP: Building Cloud First BI
LLAP: Building Cloud First BI
DataWorks Summit
 
Optimizing Hive Queries
Optimizing Hive QueriesOptimizing Hive Queries
Optimizing Hive Queries
DataWorks Summit
 
Apache Phoenix and HBase: Past, Present and Future of SQL over HBase
Apache Phoenix and HBase: Past, Present and Future of SQL over HBaseApache Phoenix and HBase: Past, Present and Future of SQL over HBase
Apache Phoenix and HBase: Past, Present and Future of SQL over HBase
DataWorks Summit/Hadoop Summit
 
Storage and-compute-hdfs-map reduce
Storage and-compute-hdfs-map reduceStorage and-compute-hdfs-map reduce
Storage and-compute-hdfs-map reduce
Chris Nauroth
 
Keep your hadoop cluster at its best! v4
Keep your hadoop cluster at its best! v4Keep your hadoop cluster at its best! v4
Keep your hadoop cluster at its best! v4
Chris Nauroth
 
Network for the Large-scale Hadoop cluster at Yahoo! JAPAN
Network for the Large-scale Hadoop cluster at Yahoo! JAPANNetwork for the Large-scale Hadoop cluster at Yahoo! JAPAN
Network for the Large-scale Hadoop cluster at Yahoo! JAPAN
DataWorks Summit/Hadoop Summit
 
Hive present-and-feature-shanghai
Hive present-and-feature-shanghaiHive present-and-feature-shanghai
Hive present-and-feature-shanghai
Yifeng Jiang
 

What's hot (20)

Hadoop & cloud storage object store integration in production (final)
Hadoop & cloud storage  object store integration in production (final)Hadoop & cloud storage  object store integration in production (final)
Hadoop & cloud storage object store integration in production (final)
 
Large-Scale Stream Processing in the Hadoop Ecosystem
Large-Scale Stream Processing in the Hadoop Ecosystem Large-Scale Stream Processing in the Hadoop Ecosystem
Large-Scale Stream Processing in the Hadoop Ecosystem
 
Sub-second-sql-on-hadoop-at-scale
Sub-second-sql-on-hadoop-at-scaleSub-second-sql-on-hadoop-at-scale
Sub-second-sql-on-hadoop-at-scale
 
The Power of Intelligent Flows: Real-Time IoT Botnet Classification with Apac...
The Power of Intelligent Flows: Real-Time IoT Botnet Classification with Apac...The Power of Intelligent Flows: Real-Time IoT Botnet Classification with Apac...
The Power of Intelligent Flows: Real-Time IoT Botnet Classification with Apac...
 
Best practices and lessons learnt from Running Apache NiFi at Renault
Best practices and lessons learnt from Running Apache NiFi at RenaultBest practices and lessons learnt from Running Apache NiFi at Renault
Best practices and lessons learnt from Running Apache NiFi at Renault
 
Evolving HDFS to Generalized Storage Subsystem
Evolving HDFS to Generalized Storage SubsystemEvolving HDFS to Generalized Storage Subsystem
Evolving HDFS to Generalized Storage Subsystem
 
A TPC Benchmark of Hive LLAP and Comparison with Presto
A TPC Benchmark of Hive LLAP and Comparison with PrestoA TPC Benchmark of Hive LLAP and Comparison with Presto
A TPC Benchmark of Hive LLAP and Comparison with Presto
 
Speed Up Your Queries with Hive LLAP Engine on Hadoop or in the Cloud
Speed Up Your Queries with Hive LLAP Engine on Hadoop or in the CloudSpeed Up Your Queries with Hive LLAP Engine on Hadoop or in the Cloud
Speed Up Your Queries with Hive LLAP Engine on Hadoop or in the Cloud
 
Streamline Hadoop DevOps with Apache Ambari
Streamline Hadoop DevOps with Apache AmbariStreamline Hadoop DevOps with Apache Ambari
Streamline Hadoop DevOps with Apache Ambari
 
Apache flume by Swapnil Dubey
Apache flume by Swapnil DubeyApache flume by Swapnil Dubey
Apache flume by Swapnil Dubey
 
Ozone- Object store for Apache Hadoop
Ozone- Object store for Apache HadoopOzone- Object store for Apache Hadoop
Ozone- Object store for Apache Hadoop
 
Streaming SQL with Apache Calcite
Streaming SQL with Apache CalciteStreaming SQL with Apache Calcite
Streaming SQL with Apache Calcite
 
Meet HBase 2.0 and Phoenix 5.0
Meet HBase 2.0 and Phoenix 5.0Meet HBase 2.0 and Phoenix 5.0
Meet HBase 2.0 and Phoenix 5.0
 
LLAP: Building Cloud First BI
LLAP: Building Cloud First BILLAP: Building Cloud First BI
LLAP: Building Cloud First BI
 
Optimizing Hive Queries
Optimizing Hive QueriesOptimizing Hive Queries
Optimizing Hive Queries
 
Apache Phoenix and HBase: Past, Present and Future of SQL over HBase
Apache Phoenix and HBase: Past, Present and Future of SQL over HBaseApache Phoenix and HBase: Past, Present and Future of SQL over HBase
Apache Phoenix and HBase: Past, Present and Future of SQL over HBase
 
Storage and-compute-hdfs-map reduce
Storage and-compute-hdfs-map reduceStorage and-compute-hdfs-map reduce
Storage and-compute-hdfs-map reduce
 
Keep your hadoop cluster at its best! v4
Keep your hadoop cluster at its best! v4Keep your hadoop cluster at its best! v4
Keep your hadoop cluster at its best! v4
 
Network for the Large-scale Hadoop cluster at Yahoo! JAPAN
Network for the Large-scale Hadoop cluster at Yahoo! JAPANNetwork for the Large-scale Hadoop cluster at Yahoo! JAPAN
Network for the Large-scale Hadoop cluster at Yahoo! JAPAN
 
Hive present-and-feature-shanghai
Hive present-and-feature-shanghaiHive present-and-feature-shanghai
Hive present-and-feature-shanghai
 

Similar to Real-Time Inverted Search NYC ASLUG Oct 2014

Real-time Inverted Search in the Cloud Using Lucene and Storm
Real-time Inverted Search in the Cloud Using Lucene and StormReal-time Inverted Search in the Cloud Using Lucene and Storm
Real-time Inverted Search in the Cloud Using Lucene and Storm
lucenerevolution
 
Deploying and managing Solr at scale
Deploying and managing Solr at scaleDeploying and managing Solr at scale
Deploying and managing Solr at scale
Anshum Gupta
 
Top 20 FAQs on the Autonomous Database
Top 20 FAQs on the Autonomous DatabaseTop 20 FAQs on the Autonomous Database
Top 20 FAQs on the Autonomous Database
Sandesh Rao
 
Building a near real time search engine & analytics for logs using solr
Building a near real time search engine & analytics for logs using solrBuilding a near real time search engine & analytics for logs using solr
Building a near real time search engine & analytics for logs using solr
lucenerevolution
 
Hadoop introduction
Hadoop introductionHadoop introduction
Hadoop introduction
musrath mohammad
 
Cloud Architecture & Distributed Systems Trivia
Cloud Architecture & Distributed Systems TriviaCloud Architecture & Distributed Systems Trivia
Cloud Architecture & Distributed Systems Trivia
Dr.-Ing. Michael Menzel
 
Benchmarking Solr Performance at Scale
Benchmarking Solr Performance at ScaleBenchmarking Solr Performance at Scale
Benchmarking Solr Performance at Scale
thelabdude
 
Hadoop Ecosystem and Low Latency Streaming Architecture
Hadoop Ecosystem and Low Latency Streaming ArchitectureHadoop Ecosystem and Low Latency Streaming Architecture
Hadoop Ecosystem and Low Latency Streaming Architecture
InSemble
 
Internals of Presto Service
Internals of Presto ServiceInternals of Presto Service
Internals of Presto Service
Treasure Data, Inc.
 
05. performance-concepts-26-slides
05. performance-concepts-26-slides05. performance-concepts-26-slides
05. performance-concepts-26-slides
Muhammad Ahad
 
Cleveland HUG - Storm
Cleveland HUG - StormCleveland HUG - Storm
Cleveland HUG - Storm
justinjleet
 
Scaling SolrCloud to a Large Number of Collections: Presented by Shalin Shekh...
Scaling SolrCloud to a Large Number of Collections: Presented by Shalin Shekh...Scaling SolrCloud to a Large Number of Collections: Presented by Shalin Shekh...
Scaling SolrCloud to a Large Number of Collections: Presented by Shalin Shekh...
Lucidworks
 
Taking Splunk to the Next Level - Architecture Breakout Session
Taking Splunk to the Next Level - Architecture Breakout SessionTaking Splunk to the Next Level - Architecture Breakout Session
Taking Splunk to the Next Level - Architecture Breakout Session
Splunk
 
Project Deimos
Project DeimosProject Deimos
Project DeimosSimon Suo
 
Mike Bartley - Innovations for Testing Parallel Software - EuroSTAR 2012
Mike Bartley - Innovations for Testing Parallel Software - EuroSTAR 2012Mike Bartley - Innovations for Testing Parallel Software - EuroSTAR 2012
Mike Bartley - Innovations for Testing Parallel Software - EuroSTAR 2012
TEST Huddle
 
OpenNebulaConf2015 1.07 Cloud for Scientific Computing @ STFC - Alexander Dibbo
OpenNebulaConf2015 1.07 Cloud for Scientific Computing @ STFC - Alexander DibboOpenNebulaConf2015 1.07 Cloud for Scientific Computing @ STFC - Alexander Dibbo
OpenNebulaConf2015 1.07 Cloud for Scientific Computing @ STFC - Alexander Dibbo
OpenNebula Project
 
Load Test Drupal Site Using JMeter and Amazon AWS
Load Test Drupal Site Using JMeter and Amazon AWSLoad Test Drupal Site Using JMeter and Amazon AWS
Load Test Drupal Site Using JMeter and Amazon AWS
Vladimir Ilic
 
Silicon Valley Code Camp 2015 - Advanced MongoDB - The Sequel
Silicon Valley Code Camp 2015 - Advanced MongoDB - The SequelSilicon Valley Code Camp 2015 - Advanced MongoDB - The Sequel
Silicon Valley Code Camp 2015 - Advanced MongoDB - The Sequel
Daniel Coupal
 
Hardware Provisioning
Hardware ProvisioningHardware Provisioning
Hardware Provisioning
MongoDB
 
Storm - SpaaS
Storm - SpaaSStorm - SpaaS

Similar to Real-Time Inverted Search NYC ASLUG Oct 2014 (20)

Real-time Inverted Search in the Cloud Using Lucene and Storm
Real-time Inverted Search in the Cloud Using Lucene and StormReal-time Inverted Search in the Cloud Using Lucene and Storm
Real-time Inverted Search in the Cloud Using Lucene and Storm
 
Deploying and managing Solr at scale
Deploying and managing Solr at scaleDeploying and managing Solr at scale
Deploying and managing Solr at scale
 
Top 20 FAQs on the Autonomous Database
Top 20 FAQs on the Autonomous DatabaseTop 20 FAQs on the Autonomous Database
Top 20 FAQs on the Autonomous Database
 
Building a near real time search engine & analytics for logs using solr
Building a near real time search engine & analytics for logs using solrBuilding a near real time search engine & analytics for logs using solr
Building a near real time search engine & analytics for logs using solr
 
Hadoop introduction
Hadoop introductionHadoop introduction
Hadoop introduction
 
Cloud Architecture & Distributed Systems Trivia
Cloud Architecture & Distributed Systems TriviaCloud Architecture & Distributed Systems Trivia
Cloud Architecture & Distributed Systems Trivia
 
Benchmarking Solr Performance at Scale
Benchmarking Solr Performance at ScaleBenchmarking Solr Performance at Scale
Benchmarking Solr Performance at Scale
 
Hadoop Ecosystem and Low Latency Streaming Architecture
Hadoop Ecosystem and Low Latency Streaming ArchitectureHadoop Ecosystem and Low Latency Streaming Architecture
Hadoop Ecosystem and Low Latency Streaming Architecture
 
Internals of Presto Service
Internals of Presto ServiceInternals of Presto Service
Internals of Presto Service
 
05. performance-concepts-26-slides
05. performance-concepts-26-slides05. performance-concepts-26-slides
05. performance-concepts-26-slides
 
Cleveland HUG - Storm
Cleveland HUG - StormCleveland HUG - Storm
Cleveland HUG - Storm
 
Scaling SolrCloud to a Large Number of Collections: Presented by Shalin Shekh...
Scaling SolrCloud to a Large Number of Collections: Presented by Shalin Shekh...Scaling SolrCloud to a Large Number of Collections: Presented by Shalin Shekh...
Scaling SolrCloud to a Large Number of Collections: Presented by Shalin Shekh...
 
Taking Splunk to the Next Level - Architecture Breakout Session
Taking Splunk to the Next Level - Architecture Breakout SessionTaking Splunk to the Next Level - Architecture Breakout Session
Taking Splunk to the Next Level - Architecture Breakout Session
 
Project Deimos
Project DeimosProject Deimos
Project Deimos
 
Mike Bartley - Innovations for Testing Parallel Software - EuroSTAR 2012
Mike Bartley - Innovations for Testing Parallel Software - EuroSTAR 2012Mike Bartley - Innovations for Testing Parallel Software - EuroSTAR 2012
Mike Bartley - Innovations for Testing Parallel Software - EuroSTAR 2012
 
OpenNebulaConf2015 1.07 Cloud for Scientific Computing @ STFC - Alexander Dibbo
OpenNebulaConf2015 1.07 Cloud for Scientific Computing @ STFC - Alexander DibboOpenNebulaConf2015 1.07 Cloud for Scientific Computing @ STFC - Alexander Dibbo
OpenNebulaConf2015 1.07 Cloud for Scientific Computing @ STFC - Alexander Dibbo
 
Load Test Drupal Site Using JMeter and Amazon AWS
Load Test Drupal Site Using JMeter and Amazon AWSLoad Test Drupal Site Using JMeter and Amazon AWS
Load Test Drupal Site Using JMeter and Amazon AWS
 
Silicon Valley Code Camp 2015 - Advanced MongoDB - The Sequel
Silicon Valley Code Camp 2015 - Advanced MongoDB - The SequelSilicon Valley Code Camp 2015 - Advanced MongoDB - The Sequel
Silicon Valley Code Camp 2015 - Advanced MongoDB - The Sequel
 
Hardware Provisioning
Hardware ProvisioningHardware Provisioning
Hardware Provisioning
 
Storm - SpaaS
Storm - SpaaSStorm - SpaaS
Storm - SpaaS
 

More from Bryan Bende

Apache NiFi SDLC Improvements
Apache NiFi SDLC ImprovementsApache NiFi SDLC Improvements
Apache NiFi SDLC Improvements
Bryan Bende
 
Apache NiFi Meetup - Introduction to NiFi Registry
Apache NiFi Meetup - Introduction to NiFi RegistryApache NiFi Meetup - Introduction to NiFi Registry
Apache NiFi Meetup - Introduction to NiFi Registry
Bryan Bende
 
Devnexus 2018 - Let Your Data Flow with Apache NiFi
Devnexus 2018 - Let Your Data Flow with Apache NiFiDevnexus 2018 - Let Your Data Flow with Apache NiFi
Devnexus 2018 - Let Your Data Flow with Apache NiFi
Bryan Bende
 
Apache NiFi Record Processing
Apache NiFi Record ProcessingApache NiFi Record Processing
Apache NiFi Record Processing
Bryan Bende
 
Taking DataFlow Management to the Edge with Apache NiFi/MiNiFi
Taking DataFlow Management to the Edge with Apache NiFi/MiNiFiTaking DataFlow Management to the Edge with Apache NiFi/MiNiFi
Taking DataFlow Management to the Edge with Apache NiFi/MiNiFi
Bryan Bende
 
Apache NiFi in the Hadoop Ecosystem
Apache NiFi in the Hadoop EcosystemApache NiFi in the Hadoop Ecosystem
Apache NiFi in the Hadoop Ecosystem
Bryan Bende
 
Integrating NiFi and Apex
Integrating NiFi and ApexIntegrating NiFi and Apex
Integrating NiFi and Apex
Bryan Bende
 
Integrating NiFi and Flink
Integrating NiFi and FlinkIntegrating NiFi and Flink
Integrating NiFi and Flink
Bryan Bende
 
Document Similarity with Cloud Computing
Document Similarity with Cloud ComputingDocument Similarity with Cloud Computing
Document Similarity with Cloud Computing
Bryan Bende
 

More from Bryan Bende (9)

Apache NiFi SDLC Improvements
Apache NiFi SDLC ImprovementsApache NiFi SDLC Improvements
Apache NiFi SDLC Improvements
 
Apache NiFi Meetup - Introduction to NiFi Registry
Apache NiFi Meetup - Introduction to NiFi RegistryApache NiFi Meetup - Introduction to NiFi Registry
Apache NiFi Meetup - Introduction to NiFi Registry
 
Devnexus 2018 - Let Your Data Flow with Apache NiFi
Devnexus 2018 - Let Your Data Flow with Apache NiFiDevnexus 2018 - Let Your Data Flow with Apache NiFi
Devnexus 2018 - Let Your Data Flow with Apache NiFi
 
Apache NiFi Record Processing
Apache NiFi Record ProcessingApache NiFi Record Processing
Apache NiFi Record Processing
 
Taking DataFlow Management to the Edge with Apache NiFi/MiNiFi
Taking DataFlow Management to the Edge with Apache NiFi/MiNiFiTaking DataFlow Management to the Edge with Apache NiFi/MiNiFi
Taking DataFlow Management to the Edge with Apache NiFi/MiNiFi
 
Apache NiFi in the Hadoop Ecosystem
Apache NiFi in the Hadoop EcosystemApache NiFi in the Hadoop Ecosystem
Apache NiFi in the Hadoop Ecosystem
 
Integrating NiFi and Apex
Integrating NiFi and ApexIntegrating NiFi and Apex
Integrating NiFi and Apex
 
Integrating NiFi and Flink
Integrating NiFi and FlinkIntegrating NiFi and Flink
Integrating NiFi and Flink
 
Document Similarity with Cloud Computing
Document Similarity with Cloud ComputingDocument Similarity with Cloud Computing
Document Similarity with Cloud Computing
 

Recently uploaded

Corporate Management | Session 3 of 3 | Tendenci AMS
Corporate Management | Session 3 of 3 | Tendenci AMSCorporate Management | Session 3 of 3 | Tendenci AMS
Corporate Management | Session 3 of 3 | Tendenci AMS
Tendenci - The Open Source AMS (Association Management Software)
 
GlobusWorld 2024 Opening Keynote session
GlobusWorld 2024 Opening Keynote sessionGlobusWorld 2024 Opening Keynote session
GlobusWorld 2024 Opening Keynote session
Globus
 
Orion Context Broker introduction 20240604
Orion Context Broker introduction 20240604Orion Context Broker introduction 20240604
Orion Context Broker introduction 20240604
Fermin Galan
 
2024 RoOUG Security model for the cloud.pptx
2024 RoOUG Security model for the cloud.pptx2024 RoOUG Security model for the cloud.pptx
2024 RoOUG Security model for the cloud.pptx
Georgi Kodinov
 
TROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERROR
TROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERRORTROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERROR
TROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERROR
Tier1 app
 
Dominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdf
Dominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdfDominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdf
Dominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdf
AMB-Review
 
A Sighting of filterA in Typelevel Rite of Passage
A Sighting of filterA in Typelevel Rite of PassageA Sighting of filterA in Typelevel Rite of Passage
A Sighting of filterA in Typelevel Rite of Passage
Philip Schwarz
 
In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ...
In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ...In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ...
In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ...
Juraj Vysvader
 
Top Features to Include in Your Winzo Clone App for Business Growth (4).pptx
Top Features to Include in Your Winzo Clone App for Business Growth (4).pptxTop Features to Include in Your Winzo Clone App for Business Growth (4).pptx
Top Features to Include in Your Winzo Clone App for Business Growth (4).pptx
rickgrimesss22
 
Prosigns: Transforming Business with Tailored Technology Solutions
Prosigns: Transforming Business with Tailored Technology SolutionsProsigns: Transforming Business with Tailored Technology Solutions
Prosigns: Transforming Business with Tailored Technology Solutions
Prosigns
 
Globus Connect Server Deep Dive - GlobusWorld 2024
Globus Connect Server Deep Dive - GlobusWorld 2024Globus Connect Server Deep Dive - GlobusWorld 2024
Globus Connect Server Deep Dive - GlobusWorld 2024
Globus
 
Globus Compute wth IRI Workflows - GlobusWorld 2024
Globus Compute wth IRI Workflows - GlobusWorld 2024Globus Compute wth IRI Workflows - GlobusWorld 2024
Globus Compute wth IRI Workflows - GlobusWorld 2024
Globus
 
How to Position Your Globus Data Portal for Success Ten Good Practices
How to Position Your Globus Data Portal for Success Ten Good PracticesHow to Position Your Globus Data Portal for Success Ten Good Practices
How to Position Your Globus Data Portal for Success Ten Good Practices
Globus
 
Enhancing Project Management Efficiency_ Leveraging AI Tools like ChatGPT.pdf
Enhancing Project Management Efficiency_ Leveraging AI Tools like ChatGPT.pdfEnhancing Project Management Efficiency_ Leveraging AI Tools like ChatGPT.pdf
Enhancing Project Management Efficiency_ Leveraging AI Tools like ChatGPT.pdf
Jay Das
 
BoxLang: Review our Visionary Licenses of 2024
BoxLang: Review our Visionary Licenses of 2024BoxLang: Review our Visionary Licenses of 2024
BoxLang: Review our Visionary Licenses of 2024
Ortus Solutions, Corp
 
How Recreation Management Software Can Streamline Your Operations.pptx
How Recreation Management Software Can Streamline Your Operations.pptxHow Recreation Management Software Can Streamline Your Operations.pptx
How Recreation Management Software Can Streamline Your Operations.pptx
wottaspaceseo
 
Field Employee Tracking System| MiTrack App| Best Employee Tracking Solution|...
Field Employee Tracking System| MiTrack App| Best Employee Tracking Solution|...Field Employee Tracking System| MiTrack App| Best Employee Tracking Solution|...
Field Employee Tracking System| MiTrack App| Best Employee Tracking Solution|...
informapgpstrackings
 
First Steps with Globus Compute Multi-User Endpoints
First Steps with Globus Compute Multi-User EndpointsFirst Steps with Globus Compute Multi-User Endpoints
First Steps with Globus Compute Multi-User Endpoints
Globus
 
Enterprise Resource Planning System in Telangana
Enterprise Resource Planning System in TelanganaEnterprise Resource Planning System in Telangana
Enterprise Resource Planning System in Telangana
NYGGS Automation Suite
 
Globus Compute Introduction - GlobusWorld 2024
Globus Compute Introduction - GlobusWorld 2024Globus Compute Introduction - GlobusWorld 2024
Globus Compute Introduction - GlobusWorld 2024
Globus
 

Recently uploaded (20)

Corporate Management | Session 3 of 3 | Tendenci AMS
Corporate Management | Session 3 of 3 | Tendenci AMSCorporate Management | Session 3 of 3 | Tendenci AMS
Corporate Management | Session 3 of 3 | Tendenci AMS
 
GlobusWorld 2024 Opening Keynote session
GlobusWorld 2024 Opening Keynote sessionGlobusWorld 2024 Opening Keynote session
GlobusWorld 2024 Opening Keynote session
 
Orion Context Broker introduction 20240604
Orion Context Broker introduction 20240604Orion Context Broker introduction 20240604
Orion Context Broker introduction 20240604
 
2024 RoOUG Security model for the cloud.pptx
2024 RoOUG Security model for the cloud.pptx2024 RoOUG Security model for the cloud.pptx
2024 RoOUG Security model for the cloud.pptx
 
TROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERROR
TROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERRORTROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERROR
TROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERROR
 
Dominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdf
Dominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdfDominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdf
Dominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdf
 
A Sighting of filterA in Typelevel Rite of Passage
A Sighting of filterA in Typelevel Rite of PassageA Sighting of filterA in Typelevel Rite of Passage
A Sighting of filterA in Typelevel Rite of Passage
 
In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ...
In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ...In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ...
In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ...
 
Top Features to Include in Your Winzo Clone App for Business Growth (4).pptx
Top Features to Include in Your Winzo Clone App for Business Growth (4).pptxTop Features to Include in Your Winzo Clone App for Business Growth (4).pptx
Top Features to Include in Your Winzo Clone App for Business Growth (4).pptx
 
Prosigns: Transforming Business with Tailored Technology Solutions
Prosigns: Transforming Business with Tailored Technology SolutionsProsigns: Transforming Business with Tailored Technology Solutions
Prosigns: Transforming Business with Tailored Technology Solutions
 
Globus Connect Server Deep Dive - GlobusWorld 2024
Globus Connect Server Deep Dive - GlobusWorld 2024Globus Connect Server Deep Dive - GlobusWorld 2024
Globus Connect Server Deep Dive - GlobusWorld 2024
 
Globus Compute wth IRI Workflows - GlobusWorld 2024
Globus Compute wth IRI Workflows - GlobusWorld 2024Globus Compute wth IRI Workflows - GlobusWorld 2024
Globus Compute wth IRI Workflows - GlobusWorld 2024
 
How to Position Your Globus Data Portal for Success Ten Good Practices
How to Position Your Globus Data Portal for Success Ten Good PracticesHow to Position Your Globus Data Portal for Success Ten Good Practices
How to Position Your Globus Data Portal for Success Ten Good Practices
 
Enhancing Project Management Efficiency_ Leveraging AI Tools like ChatGPT.pdf
Enhancing Project Management Efficiency_ Leveraging AI Tools like ChatGPT.pdfEnhancing Project Management Efficiency_ Leveraging AI Tools like ChatGPT.pdf
Enhancing Project Management Efficiency_ Leveraging AI Tools like ChatGPT.pdf
 
BoxLang: Review our Visionary Licenses of 2024
BoxLang: Review our Visionary Licenses of 2024BoxLang: Review our Visionary Licenses of 2024
BoxLang: Review our Visionary Licenses of 2024
 
How Recreation Management Software Can Streamline Your Operations.pptx
How Recreation Management Software Can Streamline Your Operations.pptxHow Recreation Management Software Can Streamline Your Operations.pptx
How Recreation Management Software Can Streamline Your Operations.pptx
 
Field Employee Tracking System| MiTrack App| Best Employee Tracking Solution|...
Field Employee Tracking System| MiTrack App| Best Employee Tracking Solution|...Field Employee Tracking System| MiTrack App| Best Employee Tracking Solution|...
Field Employee Tracking System| MiTrack App| Best Employee Tracking Solution|...
 
First Steps with Globus Compute Multi-User Endpoints
First Steps with Globus Compute Multi-User EndpointsFirst Steps with Globus Compute Multi-User Endpoints
First Steps with Globus Compute Multi-User Endpoints
 
Enterprise Resource Planning System in Telangana
Enterprise Resource Planning System in TelanganaEnterprise Resource Planning System in Telangana
Enterprise Resource Planning System in Telangana
 
Globus Compute Introduction - GlobusWorld 2024
Globus Compute Introduction - GlobusWorld 2024Globus Compute Introduction - GlobusWorld 2024
Globus Compute Introduction - GlobusWorld 2024
 

Real-Time Inverted Search NYC ASLUG Oct 2014

  • 1.
  • 2. REAL-TIME INVERTED SEARCH IN THE CLOUD USING LUCENE AND STORM Joshua Conlin, Bryan Bende, James Owen conlin_joshua@bah.com bende_bryan@bah.com owen_james@bah.com
  • 3. Table of Contents  Problem Statement  Storm  Methodology  Results
  • 4. Who are we ? Booz Allen Hamilton – Large consulting firm supporting many industries • Healthcare, Finance, Energy, Defense – Strategic Innovation Group • Focus on innovative solutions that can be applied across industries • Major focus on data science, big data, & information retrieval • Multiple clients utilizing Solr for implementing search capabilities • Explore Date Science • Self-paced data science training, launching TODAY! • https://exploredatascience.com
  • 5. Client Applications & Architecture Ingest SolrCloud Web App Typical client applications allow users to: • Query document index using Lucene syntax • Filter and facet results • Save queries for future use
  • 6. Problem Statement How do we instantly notify users of new documents that match their saved queries? Constraints: • Process documents in real-time, notify as soon as possible • Scale with the number of saved queries (starting with tens of thousands) • Result set of notifications must match saved queries • Must not impact performance of the web application • Data arrives at varying speeds and varying sizes
  • 7. Possible Solutions 1. Fork ingest to a second Solr instance, run stored queries periodically – Pros: Easy to setup, works for small amount of data data & small # of queries – Cons: Bound by time to execute all queries 2. Same secondary Solr instance, but distribute queries to multiple servers – Pros: Reduces query processing time by dividing across several servers – Cons: Now writing custom code to distribute queries, possible synchronization issues ensuring each server executes queries against the same data 3. Give each server its own Solr instance and subset of queries – Pros: Very scalable, only bound by number of servers – Cons: Difficult to maintain, still writing custom code to distribute data and queries
  • 8. Possible Solutions Is there a way we can set up this system so that it’s: • easy to maintain, • easy to scale, and • easy to synchronize?
  • 9. Candidate Solution • Integrate Solr and/or Lucene with a stream processing framework • Process data in real-time, leverage proven framework for distributed stream processing Ingest SolrCloud Web App Storm Notifications
  • 10. Storm - Overview • Storm is an open source stream processing framework. • It’s a scalable platform that lets you distribute processes across a cluster quickly and easily. • You can add more resources to your cluster and easily utilize those resources in your processing.
  • 11. Storm - Components • Nimbus – the control node for the cluster, distributes topology through the cluster • Supervisor – one on each machines in the cluster, controls the allocation of worker assignments on its machine • Worker – JVM process for running topology components Nimbus Supervisor Worker Worker Worker Worker Supervisor Worker Worker Worker Worker Supervisor Worker Worker Worker Worker
  • 12. Storm – Core Concepts • Topology – defines a running process, which includes all of the processes to be run, the connections between those processes, and their configuration • Stream – the flow of data through a topology; it is an unbounded collection of tuples that is passed from process to process • Storm has 2 types of processing units: – Spout – the start of a stream; it can be thought of as the source of the data; that data can be read in however the spout wants—from a database, from a message queue, etc. – Bolt – the primary processing unit for a topology; it accepts any number of streams, does whatever processing you’ve set it to do, and outputs any number of streams based on how you configure it
  • 13. Storm – Core Concepts (continued) • Stream Groupings – defines how topology processing units (spouts and bolts) are connected to each other; some common groupings are: – All Grouping – stream is sent to all bolts – Shuffle Grouping – stream is evenly distributed across bolts – Fields grouping – sends tuples that match on the designated “field” to the same bolt
  • 14. Storm - Parallelism Source: http://storm.incubator.apache.org/documentation/Understanding-the-parallelism-of-a-Storm-topology.html
  • 15. How to Utilize Storm How can we use this framework to solve our problem? Let Storm distribute out the data and queries between processing nodes …but we would still need to manage a Solr instance on each VM, and we would even need to ensure synchronization between query processing bolts running on the same VM.
  • 16. How to Utilize Storm What if instead of having a Solr installation on each machine we ran Solr in memory inside each of the processing bolts? • Use Storm spout to distribute new documents • Use Storm bolt to execute queries against EmbeddedSolrServer with RAMDirectory – Incoming documents added to index – Queries executed – Documents removed from index • Use Storm bolt to process query results Bolt EmbeddedSolrServer RAMDirectory
  • 17. Advantages This has several advantages: • It removes the need to maintain a Solr instance on each VM. • It’s easier to scale and more flexible; it doesn’t matter which Supervisor the bolts get sent to, all the processing is self-contained. • It removes the need to synchronize processing between bolts. • Documents are volatile, existing queries over new data
  • 18. Execution Topology Data Spout Data Spout Data Spout Query Spout Executor Bolt Executor Bolt Executor Bolt Executor Bolt Executor Bolt Notification Bolt Data Spout – Receives incoming data files and sends to every Executor Bolt Query Spout – Coordinates updates to queries Executor Bolt – Loads and executes queries Notification Bolt – Generates All notifications based on results Grouping Shuffle Grouping
  • 19. Executor Bolt 1. Queries are loaded into memory 2. Incoming documents are added to the Lucene index 3. Documents are processed when one of the following conditions are met: a) The number of documents have exceeded the max batch size b) The time since the last execution is longer than the max interval time 4. Matching queries and document UIDs are emitted 5. Remove all documents from index Query List Documents emit() 1 2 3 4
  • 20. Solr In-Memory Processing Bolt Issues • Attempted to run Solr with in-memory index inside Storm bolt • Solr 4.5 requires: – http-client 4.2.3 – http-core 4.2.2 • Storm 0.8.2 & 0.9.0 require: – http-client 4.1.1 – http-core 4.1 • Could exclude libraries from super jar and rely on storm/lib, but Solr expecting SystemDefaultHttpClient from 4.2.3 • Could build Storm with newer version of libraries, but not guaranteed to work
  • 21. Lucene In-Memory Processing Bolt Advantages: • Fast, Lightweight • No Dependency Conflicts • RAMDirectory backed • Easy Solr to Lucene Document Conversion • Solr Schema based Bolt Lucene Index RAMDirectory 1. Initialization – Parse Common Solr Schema – Replace Solr Classes 2. Add Documents – Convert SolrInputDocument to Lucene Document – Add to index
  • 22. Lucene In-Memory Processing Bolt Parse Read/Parse/Update Solr Schema File using Stax Create IndexSchema from new Solr Schema data public void addDocument(SolrInputDocument doc) throws Exception { if (doc != null) { Document luceneDoc = solrDocumentConverter.convert(doc); indexWriter.addDocument(luceneDoc); indexWriter.commit(); } } public Document convert(SolrInputDocument solrDocument) throws Exception { return DocumentBuilder.toDocument(solrDocument, indexSchema); }
  • 23. Prototype Solution • Infrastructure: – 8 node cluster on Amazon EC2 – Each VM has 2 cores and 8G of memory • Data: – 92,000 news article summaries – Average file size: ~1k • Queries: – Generated 1 million sample queries – Randomly selected terms from document set – Stored in MariaDB (username, query string) – Query Executor Bolt configured to as any subset of these queries
  • 24. Prototype Solution – Monitoring Performance • Metrics Provided by Storm UI – Emitted: number of tuples emitted – Transferred: number of tuples transferred (emitted * # follow-on bolts) – Acked: number of tuples acknowledged – Execute Latency: timestamp when execute function ends - timestamp when execute is passed tuple – Process Latency: timestamp when ack is called - timestamp when execute is passed tuple – Capacity: % of the time in the last 10 minutes the bolt spent executing tuples • Many metrics are samples, don’t always indicate problems • Good measurement is comparing number of tuples transferred from spout, to number of tuples acknowledged in bolt – If transferred number is getting increasingly higher than number of acknowledged tuples, then the topology is not keeping up with the rate of data
  • 25. Trial Runs – First Attempt • 8 workers, 1 Spout, 8 Query Executor Bolts, 8 Result Bolts • Article spout emitting as fast as possible • Query execution at 1k docs or 60 seconds elapsed time • Increased number of queries on each trial: 10k, 50k, 100k, 200k, 300k, 400k, 500k Results: • Articles emitted too fast for bolts to keep up • If data continued to stream at this rate, topology would back up and drop tuples Worker QWuoerrkye rB ox x l4t 4 Worker RWeosruklte rB ox x l4t 4 Node 7 Worker QWuoerrkye rB ox x l4t 4 Worker RWeosruklte rB ox x l4t 4 Worker x 4 Worker QWuoerrkye rB ox x l4t 4 Worker RWeosruklte rB ox x l4t 4 Node 6 Worker QWuoerrkye rB ox x l4t 4 Worker RWeosruklte rB ox x l4t 4 Worker x 4 Node 1 Worker QWuoerrkye rB ox x l4t 4 Worker RWeosruklte rB ox x l4t 4 Worker x 4 Node 5 Worker QWuoerrkye rB ox x l4t 4 Worker RWeosruklte rB ox x l4t 4 Worker x 4 Node 2 Worker x 4 Node 3 Worker x 4 Node 4 Worker QWuoerrkye rB ox x l4t 4 Worker RWeosruklte rB ox x l4t 4 Worker x 4 Node 8 Worker QWuoerrkye rB ox x l4t 4 Worker RWeosruklte rB ox x l4t 4 Worker x 4 Article Spout Node 1
  • 26. Trial Runs – Second Attempt • 8 workers, 1 Spout, 8 Query Executor Bolts, 8 Result Bolts • Article spout now places articles on queue in background thread every 100ms • Everything else the same… Results: • Topology performing much better, keeping up with data flow for query size of 10k, 50k, 100k, 200k • Slows down around 300k queries, approx 37.5k queries/bolt Worker QWuoerrkye rB ox x l4t 4 Worker RWeosruklte rB ox x l4t 4 Node 7 Worker QWuoerrkye rB ox x l4t 4 Worker RWeosruklte rB ox x l4t 4 Worker x 4 Worker QWuoerrkye rB ox x l4t 4 Worker RWeosruklte rB ox x l4t 4 Node 6 Worker QWuoerrkye rB ox x l4t 4 Worker RWeosruklte rB ox x l4t 4 Worker x 4 Node 1 Worker QWuoerrkye rB ox x l4t 4 Worker RWeosruklte rB ox x l4t 4 Worker x 4 Node 5 Worker QWuoerrkye rB ox x l4t 4 Worker RWeosruklte rB ox x l4t 4 Worker x 4 Node 2 Worker x 4 Node 3 Worker x 4 Node 4 Worker QWuoerrkye rB ox x l4t 4 Worker RWeosruklte rB ox x l4t 4 Worker x 4 Node 8 Worker QWuoerrkye rB ox x l4t 4 Worker RWeosruklte rB ox x l4t 4 Worker x 4 Article Spout Node 1
• 27. Trial Runs – Third Attempt
• Each node has 4 worker slots, so let's scale up
• 16 workers, 1 Spout, 16 Query Executor Bolts, 8 Result Bolts
• Everything else the same…
Results:
• 300k queries now keeping up, no problem
• 400k doing OK…
• 500k backing up a bit
[Topology diagram: 8-node cluster with 2 Query Bolt workers per node; the Article Spout remains on Node 1]
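The wiring code is not shown on the slides; a sketch of this third-attempt configuration might look like the following. The component names and bolt classes are assumptions. The allGrouping reflects the design described earlier: each executor bolt owns a subset of the queries (roughly 37.5k per bolt at 300k), so every article must be broadcast to all of them.

  import backtype.storm.Config;
  import backtype.storm.StormSubmitter;
  import backtype.storm.topology.TopologyBuilder;

  public class InvertedSearchTopology {
    public static void main(String[] args) throws Exception {
      TopologyBuilder builder = new TopologyBuilder();
      builder.setSpout("article-spout", new ThrottledArticleSpout(), 1);
      // Each executor bolt holds a subset of the stored queries, so every
      // article is broadcast to all 16 of them
      builder.setBolt("query-executor", new QueryExecutorBolt(), 16)
             .allGrouping("article-spout");
      // Any result bolt can record a match, so matches are shuffled
      builder.setBolt("result-bolt", new ResultBolt(), 8)
             .shuffleGrouping("query-executor");

      Config conf = new Config();
      conf.setNumWorkers(16); // one worker per Query Executor Bolt in this trial
      StormSubmitter.submitTopology("inverted-search", conf, builder.createTopology());
    }
  }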
• 28. Trial Runs – Fourth Attempt
• Next logical step: 32 workers, 1 Spout, 32 Query Executor Bolts
• Didn't result in the anticipated performance gain; 500k queries was still too much
• Hypothesis: 2-core VMs may not be enough to get full performance from 4 worker slots
[Topology diagram: 8-node cluster with 4 Query Bolt workers per node]
• 29. Trial Runs – Conclusions
• The most important factor affecting performance is the relationship between the data rate and the number of queries
• The ideal Storm configuration depends on the hardware executing the topology
• The optimal configuration achieved 250 queries per second per bolt, 4k queries per second across the topology
• A high level of performance from a relatively small cluster
• 30. Conclusions
• Low barrier to entry when working with Storm
• Easy conversion of Solr indices to Lucene indices
• Simple integration between Lucene and Storm; integrating Solr is more complicated
• Configuration is key; tune the topology to your needs
• The overall strategy appears to scale well for our use case, limited only by hardware
• 31. Future Considerations
• Adjust the batch size on the Query Executor Bolt
• Combine duplicate queries (across users) if your system has many duplicates
• Investigate additional optimizations during the Solr-to-Lucene conversion
• Run the topology with more complex queries (fielded, filtered, etc.)
• Investigate handling of bolt failure
• If the ratio of incoming data to queries were reversed, consider switching the groupings between the spout and the executor bolts
• 33. Updates Since Lucene/Solr Revolution 2013
• Storm has moved to a top-level Apache project
– https://storm.incubator.apache.org/
– Released 0.9.1, 0.9.2, 0.9.3-rc1
– Newer releases resolve a classpath issue with EmbeddedSolrServer
– Improved Netty transport, new topology visualization
Source: http://storm.incubator.apache.org/2014/06/25/storm092-released.html
• 34. How can we test our topology at various scales with minimal setup?
• Launch Storm clusters on Amazon Web Services
– storm-deploy - https://github.com/nathanmarz/storm-deploy
• Created before Storm moved to Apache; limited activity
• The install-0.9.1 branch has updates to pull Storm from the Apache repo
• lein deploy-storm --start --name mycluster --branch master --commit v0.9.2-incubating
• Always launches m1.small instances - https://github.com/nathanmarz/storm-deploy/issues/67
– storm-deploy-alternative - https://github.com/KasperMadsen/storm-deploy-alternative
• Java alternative to storm-deploy
• Latest Apache Storm releases not supported yet; works with 0.8.2 and 0.9.0
– wirbelsturm - https://github.com/miguno/wirbelsturm
• Based on Vagrant and Puppet
• http://www.michael-noll.com/blog/2014/03/17/wirbelsturm-one-click-deploy-storm-kafka-clusters-with-vagrant-puppet/
• Steeper learning curve to get going
• 35. How can we test our topology at various scales with minimal setup?
• Make the topology independent of the Storm cluster
– Previous spout required data to be on the server where the spout is running
• Better approach: poll an external source for data (Redis, Kafka, etc.)
– Previous executor bolt loaded queries from a database
• Better approach: package a file of queries into the topology jar
– Previous executor bolt expected a Solr config directory on the server
• Better approach: package the config into the topology jar and extract it from the classpath to disk on startup (see the sketch below)
[Diagram: Redis → Spout → Executor Bolt (queries, SOLR_HOME) → Result Bolt, all inside the Storm cluster]
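A minimal sketch of that extraction step, using only the JDK; the resource path and target directory are illustrative assumptions rather than the prototype's actual layout.

  import java.io.InputStream;
  import java.nio.file.Files;
  import java.nio.file.Path;
  import java.nio.file.Paths;
  import java.nio.file.StandardCopyOption;

  public class ConfigExtractor {
    // e.g. resource = "solr/conf/schema.xml", targetDir = "/tmp/solr-home"
    public static Path extract(String resource, String targetDir) throws Exception {
      Path target = Paths.get(targetDir, resource);
      Files.createDirectories(target.getParent());
      // The resource is read from the topology jar's classpath and written to disk
      InputStream in = ConfigExtractor.class.getResourceAsStream("/" + resource);
      try {
        Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
      } finally {
        in.close();
      }
      return target;
    }
  }

In this setup the extraction would run once, in the bolt's prepare() method, before the embedded index is created.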
• 36. Luwak
• Presentation by Flax at Lucene/Solr Revolution 2013 in Dublin
– Turning Search Upside Down: Using Lucene for Very Fast Stored Queries
– https://www.youtube.com/watch?v=rmRCsrJp2A8&list=UUKuRrzEQYP8pfCgCN8il4gQ
• Open sourced by Flax shortly after
– https://github.com/flaxsearch/luwak
• True inverted search solution
– Index queries
– Turn an incoming document into a query
– Determine which queries match that document
• Easy to integrate into the existing Storm solution
• Clean API and documentation

  // Create a monitor that indexes the stored queries
  Monitor monitor = new Monitor(
      new LuceneQueryParser("field"),
      new TermFilteredPresearcher());

  // Register a stored query with the monitor
  MonitorQuery mq = new MonitorQuery("query1", "field:text");
  monitor.update(mq);

  // Build an input document and match it against the indexed queries
  InputDocument doc = InputDocument.builder("doc1")
      .addField(textfield, document, new StandardTokenizer(Version.LUCENE_50))
      .build();
  SimpleMatcher matches = monitor.match(doc, SimpleMatcher.FACTORY);
• 37. Performance Comparison
• How fast can we process all 92k articles with varying query set sizes?
• Performance comparison ran outside of Storm, in a single-threaded Java process
• The Solr & Lucene solutions batch docs (a sketch of the batch loop follows):
– Allow 1,000 docs to be added to the in-memory index
– Execute all queries, clear the index, start over
• Luwak evaluates one document at a time against the indexed queries
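For concreteness, the batch-evaluation loop described above might look like this sketch against the Lucene 4.x API of the era; the class name, the analyzer choice, and the top-10 result cap are assumptions.

  import java.util.List;
  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.document.Document;
  import org.apache.lucene.index.DirectoryReader;
  import org.apache.lucene.index.IndexWriter;
  import org.apache.lucene.index.IndexWriterConfig;
  import org.apache.lucene.search.IndexSearcher;
  import org.apache.lucene.search.Query;
  import org.apache.lucene.store.RAMDirectory;
  import org.apache.lucene.util.Version;

  public class BatchEvaluator {
    public static void evaluate(List<Document> articles, List<Query> queries)
        throws Exception {
      RAMDirectory dir = new RAMDirectory();
      IndexWriterConfig iwc = new IndexWriterConfig(
          Version.LUCENE_4_9, new StandardAnalyzer(Version.LUCENE_4_9));
      IndexWriter writer = new IndexWriter(dir, iwc);

      int batched = 0;
      for (Document article : articles) {
        writer.addDocument(article);
        if (++batched == 1000) {          // batch threshold from the slides
          writer.commit();
          DirectoryReader reader = DirectoryReader.open(dir);
          try {
            IndexSearcher searcher = new IndexSearcher(reader);
            for (Query q : queries) {      // run every stored query on the batch
              searcher.search(q, 10);
            }
          } finally {
            reader.close();
          }
          writer.deleteAll();              // clear the index, start over
          batched = 0;
        }
      }
      writer.close();
    }
  }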
• 38. Wrap-Up
• Conclusion
– Storm = scalable stream processing framework
– Luwak = high performance inverted search solution
– Luwak + Storm = scalable, high performance, inverted search solution!
• Contact Info
– bende_bryan@bah.com / Twitter @bbende
– conlin_joshua@bah.com / Twitter @jmconlin
• Thanks for having us!
