Hadoop-scale Search with Solr
Grant Ingersoll, CTO, Lucidworks
April 15, 2015
Viva La Evolución
Solr is both established & growing:
• 10M+ total downloads
• 250,000+ monthly downloads
• Largest community of developers
• 2,500+ open Solr jobs
Solr is the most widely used search solution on the planet.
Lucidworks
Unmatched Solr expertise:
• 1/3 of the active committers
• 70% of the open source code committed
• Lucene/Solr Revolution: the world’s largest open source user conference dedicated to Lucene/Solr
Solr has tens of thousands of applications in production. You use Solr every day.
Solr in a Nutshell: Key Features
• Full text search (information retrieval)
• Facets/guided navigation galore!
• Lots of data types
• Spelling, auto-complete, highlighting
• Cursors
• More Like This
• De-duplication
• Apache Lucene
• Grouping and joins
• Stats, expressions, transformations, and more
• Language detection
• Extensible
• Massive scale/fault tolerance
Hadoop Basics
What’s old is new again!
• Build/store indexes in HDFS (see the config sketch below)
• https://cwiki.apache.org/confluence/display/solr/Running+Solr+on+HDFS
• The block cache is your friend
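A minimal solrconfig.xml sketch of an HDFS-backed index with the block cache enabled, along the lines of the wiki page above; the namenode URL and cache sizes are illustrative assumptions, not values from the talk:

```xml
<!-- Illustrative fragment; the solr.hdfs.home path is an assumption -->
<directoryFactory name="DirectoryFactory" class="solr.HdfsDirectoryFactory">
  <str name="solr.hdfs.home">hdfs://namenode:8020/solr</str>
  <!-- Block cache: keeps HDFS blocks in off-heap memory for Lucene reads -->
  <bool name="solr.hdfs.blockcache.enabled">true</bool>
  <int name="solr.hdfs.blockcache.slab.count">1</int>
  <bool name="solr.hdfs.blockcache.direct.memory.allocation">true</bool>
  <int name="solr.hdfs.blockcache.blocksperbank">16384</int>
</directoryFactory>

<!-- HDFS-aware index locking, inside <indexConfig> -->
<lockType>hdfs</lockType>
```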
Deployment and security support:
• https://github.com/LucidWorks/yarn-proto
• Slider and Ambari support coming soon
• Authz, authc, and document filtering coming in April/May
Lucidworks + Hadoop
• Ingestion tools for various file formats, etc.
• Hive 2-way Load/Store support
• Pig Load/Store
• http://lucidworks.com/product/integrations/hadoop/
• (More on Spark in a bit)
Case 1: Compliance
• Monitoring and customer-service search for large-volume transactional data
• Initial setup:
• 20 machines, 32 GB RAM, 800 GB SSD, 2 Solr nodes per machine
• Indexing from Kafka to Solr (Lucidworks Fusion); the pattern is sketched below
• 14B+ docs indexed/searchable in the POC (disk limited)
• Growth to 4B+ per day with a 6-month life expectancy
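For illustration, a minimal hand-rolled sketch of the Kafka-to-Solr indexing pattern. Fusion provides this pipeline out of the box; the topic, field names, hosts, and the kafka-clients/SolrJ-5.x-era APIs used here are assumptions:

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class KafkaToSolr {
  public static void main(String[] args) throws Exception {
    Properties props = new Properties();
    props.put("bootstrap.servers", "kafka1:9092"); // illustrative host
    props.put("group.id", "solr-indexer");
    props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
    props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
    try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
         HttpSolrClient solr = new HttpSolrClient("http://localhost:8983/solr/txns")) {
      consumer.subscribe(Collections.singletonList("transactions")); // hypothetical topic
      while (true) {
        // Pull a batch of messages, map each onto a Solr document, and index it
        ConsumerRecords<String, String> batch = consumer.poll(1000L);
        for (ConsumerRecord<String, String> rec : batch) {
          SolrInputDocument doc = new SolrInputDocument();
          doc.addField("id", rec.key());
          doc.addField("payload_s", rec.value());
          solr.add(doc);
        }
        solr.commit(); // in production, rely on autoCommit instead
      }
    }
  }
}
```

At the volumes above you would batch the adds and lean on autoCommit rather than committing per poll; the sketch keeps that explicit for readability.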
Case 2: Web Analytics
• Large-scale ad hoc analytics over weblogs, using Tableau as a front-end BI tool for Solr
• Initial setup:
• 4 machines, 128 GB of RAM, several Solr nodes per machine
• Data originally in Hive
• POC: 10s of billions of events, growing to 150B+ per week
Sample Architecture
• The current_log_writer collection alias rolls over to a new transient collection every two hours; the shards in the transient collection are merged into the 2-hour shard and added to the daily collection
• The connector writes to the collection alias at up to 50K docs/sec; multiple shards are needed to support that write rate
• The latest 2-hour shard gets built by merging the transient collection’s shards at the time-bucket boundary
• Every daily collection has 12 (or 24) shards, each covering a 2-hour block of log messages
[Diagram: Fusion Logstash connector → current_log_writer (collection alias) → logs_feb26_h24 (transient collection, shards 1-4) → daily collections logs_feb01 … logs_feb25, logs_feb26, each holding 2-hour shards h02 … h22, h24]
Replicas can be added to support higher query volume & fault-tolerance.
Sample Query Execution
• The Fusion SiLK dashboard queries through the recent_logs and todays_logs collection aliases
• The todays_logs collection alias rolls over to a new day automatically at the day boundary (see the sketch below)
[Diagram: Fusion SiLK Dashboard → recent_logs and todays_logs (collection aliases) → daily collections logs_feb01 … logs_feb25, logs_feb26]
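A minimal sketch of the rollover mechanics using Solr’s Collections API over HTTP; the host and collection names mirror the diagrams but are illustrative:

```java
import java.io.InputStream;
import java.net.URL;

public class AliasRollover {
  // Issue one Collections API call; HTTP errors surface as IOException
  static void collectionsApi(String params) throws Exception {
    URL url = new URL("http://localhost:8983/solr/admin/collections?" + params);
    try (InputStream in = url.openStream()) {
      while (in.read() != -1) ; // drain the response body
    }
  }

  public static void main(String[] args) throws Exception {
    // Every two hours: create the next transient collection, then repoint
    // the write alias so the connector never pauses.
    collectionsApi("action=CREATE&name=logs_feb27_h02&numShards=4"
        + "&replicationFactor=1&collection.configName=logs");
    collectionsApi("action=CREATEALIAS&name=current_log_writer&collections=logs_feb27_h02");

    // At the day boundary: roll the read aliases over the daily collections.
    collectionsApi("action=CREATEALIAS&name=todays_logs&collections=logs_feb27");
    collectionsApi("action=CREATEALIAS&name=recent_logs"
        + "&collections=logs_feb25,logs_feb26,logs_feb27");
  }
}
```

CREATEALIAS swaps the alias atomically, which is why the connector and dashboards can keep running through every rollover.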
Case 3: Lots of Users, Lots of Data
• Search of consumer cloud storage
• Key challenges: not all users are equal, and users grow and change all the time
• Petabytes of data, millions of users, 1000’s of nodes
• 1000’s of collections while isolating access
• Learn more: https://www.youtube.com/watch?v=_Erkln5WWLw and http://www.slideshare.net/lucidworks/scaling-solrcloud-to-a-large-number-of-collections-shalin-shekhar-mangar
Case 3: Key Solr Improvements
• Improved ZooKeeper interactions and performance to handle thousands of collections
• Deep paging (see the cursorMark sketch below)
• Split shards on arbitrary hash ranges
• Large-scale testing
• Collection migration
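A minimal SolrJ sketch of cursor-based deep paging (cursorMark); the collection name and the id uniqueKey field are illustrative:

```java
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.params.CursorMarkParams;

public class DeepPaging {
  public static void main(String[] args) throws Exception {
    HttpSolrClient solr = new HttpSolrClient("http://localhost:8983/solr/logs_feb26");
    SolrQuery q = new SolrQuery("*:*");
    q.setRows(1000);
    q.setSort(SolrQuery.SortClause.asc("id")); // sort must include the uniqueKey
    String cursor = CursorMarkParams.CURSOR_MARK_START; // "*"
    while (true) {
      q.set(CursorMarkParams.CURSOR_MARK_PARAM, cursor);
      QueryResponse rsp = solr.query(q);
      // process rsp.getResults() ...
      String next = rsp.getNextCursorMark();
      if (cursor.equals(next)) break; // cursor stopped advancing: done
      cursor = next;
    }
    solr.close();
  }
}
```

Unlike start/rows paging, cursorMark keeps deep pages cheap: Solr never has to collect and throw away all the documents before the requested offset.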
Testing: Solr Scale Toolkit
https://github.com/LucidWorks/solr-scale-tk
• Test automation scripts: Python with Fabric, Boto, etc.; easily deploy clusters of nodes using our custom AMIs
• Support services: Kafka (MQ / data integration), Logstash (log aggregation / analysis), CollectD/SiLK (system / JMX monitoring), a test-results DB, and ZooKeeper; test data stored in Amazon S3
• Client nodes: JMeter client nodes drive the indexing & query requests from the tests
• Solr cluster (NxM nodes on a custom AMI): Solr nodes on ports 8983, 898X, … each hosting multiple cores, with SolrCloud traffic between all Solr nodes and the ZK ensemble (ZooKeeper-1, -2, -3)
• Monitoring: logs aggregated from the NxM Solr nodes; JMX notifications and ZK JMX stats feed system monitoring of the N machines
• Key point: each test defines the density of cores per node and the number of Solr nodes per machine, as well as the instance type and number of machines
Case 4: Signals for Search and Discovery
• Power user search and recommendations over news content and engagement signals (shares, views, etc.) using Lucidworks Fusion
• Combines content-based and collaborative filtering approaches to calculate search boosts and “people who did X also did Y” results (a boost-query sketch follows below)
• Data: ~10M events (POC), growing to 3-4B per month
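A minimal SolrJ sketch of a signal-driven boost, assuming the aggregated signals have already been written onto each article as a hypothetical click_count field (Fusion computes such aggregates; the collection, query, and field names are assumptions):

```java
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;

public class SignalBoostQuery {
  public static void main(String[] args) throws Exception {
    HttpSolrClient solr = new HttpSolrClient("http://localhost:8983/solr/news");
    SolrQuery q = new SolrQuery("election results");
    q.set("defType", "edismax");
    q.set("qf", "title^2 body");
    // Multiply text relevance by a damped function of engagement,
    // so heavily clicked articles rise without swamping the text score
    q.set("boost", "log(sum(click_count,1))");
    solr.query(q).getResults()
        .forEach(d -> System.out.println(d.getFieldValue("title")));
    solr.close();
  }
}
```

The log(sum(...,1)) dampening is a common design choice: raw click counts span orders of magnitude, and an undamped multiplicative boost would let popularity drown out relevance.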
Next Level Signals
• Lucidworks Fusion 1.4 (May ’15) will ship Apache Spark as a scheduled service for large-scale aggregations, machine learning, and more
• We’ve already seen a 3x speedup in some tests
• Will ship with the ALS recommendation algorithm and Mahout algorithms
• Solr as a Spark RDD
• https://github.com/LucidWorks/spark-solr
Fusion Architecture
[Diagram: millions of users → optional REST proxy → Fusion services (connectors, pipelines, recommendations, NLP, metrics, scheduler, blobs, admin) and Spark workers with a cluster manager → Solr shards and signals on HDFS (billions of docs), coordinated by a ZooKeeper ensemble (ZK 1 … ZK N) providing shared config management, leader election, and load balancing; security is woven throughout]
Roadmap
• Native, pluggable security in Solr (April/May)
• Numerous performance enhancements for replication in shards
• Cross-core ValueSources
• Many new extensions for facets and analytics:
• Percentiles (t-digest); see the sketch below
• Facet combinations
• Dynamic expressions over result sets
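A minimal SolrJ sketch of the t-digest-backed percentile aggregation via the JSON Facet API that was landing in Solr 5.x; since this is roadmap material, treat the syntax as indicative, and note the collection and field names are illustrative:

```java
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;

public class PercentileFacet {
  public static void main(String[] args) throws Exception {
    HttpSolrClient solr = new HttpSolrClient("http://localhost:8983/solr/logs_feb26");
    SolrQuery q = new SolrQuery("*:*");
    q.setRows(0); // facets only, no documents
    // t-digest percentiles over a hypothetical response_time_ms field,
    // bucketed by a hypothetical status field
    q.set("json.facet",
        "{ by_status: { type: terms, field: status,"
      + "  facet: { latency: 'percentile(response_time_ms, 50, 95, 99)' } } }");
    System.out.println(solr.query(q).getResponse().get("facets"));
    solr.close();
  }
}
```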
Next steps
Download Fusion: http://www.lucidworks.com/products/fusion
Contact Lucidworks: http://lucidworks.com/company/contact/
Contact Me: grant@lucidworks.com @gsingers