Hadoop-scale Search with Solr
Grant Ingersoll, CTO, Lucidworks
April 15, 2015
Viva La Evolución
Solr is both established & growing:
• 10M+ total downloads
• 250,000+ monthly downloads
• Largest community of developers
• 2,500+ open Solr jobs
Solr is the most widely used search solution on the planet.
Lucidworks
Unmatched Solr expertise:
• 1/3 of the active committers
• 70% of the open source code committed
• Lucene/Solr Revolution: the world’s largest open source user conference dedicated to Lucene/Solr
Solr has tens of thousands of applications in production. You use Solr every day.
Solr in a Nutshell: Key Features
• Full text search (information retrieval)
• Facets/guided navigation galore!
• Lots of data types
• Spelling, auto-complete, highlighting
• Cursors
• More Like This
• De-duplication
• Apache Lucene
• Grouping and joins
• Stats, expressions, transformations, and more
• Language detection
• Extensible
• Massive scale/fault tolerance
Hadoop Basics
What’s old is new again!
• Build/store indexes in HDFS (see the config sketch below)
• https://cwiki.apache.org/confluence/display/solr/Running+Solr+on+HDFS
• The block cache is your friend
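A minimal solrconfig.xml sketch of an HDFS-backed index with the block cache enabled, along the lines of the wiki page above; the namenode URL and cache sizes are illustrative assumptions, not values from the talk:

```xml
<!-- Illustrative fragment; the solr.hdfs.home path is an assumption -->
<directoryFactory name="DirectoryFactory" class="solr.HdfsDirectoryFactory">
  <str name="solr.hdfs.home">hdfs://namenode:8020/solr</str>
  <!-- Block cache: keeps HDFS blocks in off-heap memory for Lucene reads -->
  <bool name="solr.hdfs.blockcache.enabled">true</bool>
  <int name="solr.hdfs.blockcache.slab.count">1</int>
  <bool name="solr.hdfs.blockcache.direct.memory.allocation">true</bool>
  <int name="solr.hdfs.blockcache.blocksperbank">16384</int>
</directoryFactory>

<!-- HDFS-aware index locking, inside <indexConfig> -->
<lockType>hdfs</lockType>
```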
Deployment and security support:
• https://github.com/LucidWorks/yarn-proto
• Slider and Ambari support coming soon
• Authz, authc, and document filtering coming in April/May
Lucidworks + Hadoop
• Ingestion tools for various file formats, etc.
• Hive 2-way Load/Store support
• Pig Load/Store
• http://lucidworks.com/product/integrations/hadoop/
• (More on Spark in a bit)
Case 1: Compliance
• Monitoring and customer-service search for large-volume transactional data
• Initial setup:
• 20 machines, 32 GB RAM, 800 GB SSD, 2 Solr nodes per machine
• Indexing from Kafka to Solr (Lucidworks Fusion); the pattern is sketched below
• 14B+ docs indexed/searchable in the POC (disk limited)
• Growth to 4B+ per day with a 6-month life expectancy
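For illustration, a minimal hand-rolled sketch of the Kafka-to-Solr indexing pattern. Fusion provides this pipeline out of the box; the topic, field names, hosts, and the kafka-clients/SolrJ-5.x-era APIs used here are assumptions:

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class KafkaToSolr {
  public static void main(String[] args) throws Exception {
    Properties props = new Properties();
    props.put("bootstrap.servers", "kafka1:9092"); // illustrative host
    props.put("group.id", "solr-indexer");
    props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
    props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
    try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
         HttpSolrClient solr = new HttpSolrClient("http://localhost:8983/solr/txns")) {
      consumer.subscribe(Collections.singletonList("transactions")); // hypothetical topic
      while (true) {
        // Pull a batch of messages, map each onto a Solr document, and index it
        ConsumerRecords<String, String> batch = consumer.poll(1000L);
        for (ConsumerRecord<String, String> rec : batch) {
          SolrInputDocument doc = new SolrInputDocument();
          doc.addField("id", rec.key());
          doc.addField("payload_s", rec.value());
          solr.add(doc);
        }
        solr.commit(); // in production, rely on autoCommit instead
      }
    }
  }
}
```

At the volumes above you would batch the adds and lean on autoCommit rather than committing per poll; the sketch keeps that explicit for readability.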
Case 2: Web Analytics
• Large-scale ad hoc analytics over weblogs, using Tableau as a front-end BI tool for Solr
• Initial setup:
• 4 machines, 128 GB of RAM, several Solr nodes per machine
• Data originally in Hive
• POC: 10s of billions of events, growing to 150B+ per week
Sample Architecture
• The current_log_writer collection alias rolls over to a new transient collection every two hours; the shards in the transient collection are merged into the 2-hour shard and added to the daily collection
• The connector writes to the collection alias at up to 50K docs/sec; multiple shards are needed to support that write rate
• The latest 2-hour shard gets built by merging the transient collection’s shards at the time-bucket boundary
• Every daily collection has 12 (or 24) shards, each covering a 2-hour block of log messages
[Diagram: Fusion Logstash connector → current_log_writer (collection alias) → logs_feb26_h24 (transient collection, shards 1-4) → daily collections logs_feb01 … logs_feb25, logs_feb26, each holding 2-hour shards h02 … h22, h24]
Replicas can be added to support higher query volume & fault-tolerance.
Sample Query Execution
• The Fusion SiLK dashboard queries through the recent_logs and todays_logs collection aliases
• The todays_logs collection alias rolls over to a new day automatically at the day boundary (see the sketch below)
[Diagram: Fusion SiLK Dashboard → recent_logs and todays_logs (collection aliases) → daily collections logs_feb01 … logs_feb25, logs_feb26]
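A minimal sketch of the rollover mechanics using Solr’s Collections API over HTTP; the host and collection names mirror the diagrams but are illustrative:

```java
import java.io.InputStream;
import java.net.URL;

public class AliasRollover {
  // Issue one Collections API call; HTTP errors surface as IOException
  static void collectionsApi(String params) throws Exception {
    URL url = new URL("http://localhost:8983/solr/admin/collections?" + params);
    try (InputStream in = url.openStream()) {
      while (in.read() != -1) ; // drain the response body
    }
  }

  public static void main(String[] args) throws Exception {
    // Every two hours: create the next transient collection, then repoint
    // the write alias so the connector never pauses.
    collectionsApi("action=CREATE&name=logs_feb27_h02&numShards=4"
        + "&replicationFactor=1&collection.configName=logs");
    collectionsApi("action=CREATEALIAS&name=current_log_writer&collections=logs_feb27_h02");

    // At the day boundary: roll the read aliases over the daily collections.
    collectionsApi("action=CREATEALIAS&name=todays_logs&collections=logs_feb27");
    collectionsApi("action=CREATEALIAS&name=recent_logs"
        + "&collections=logs_feb25,logs_feb26,logs_feb27");
  }
}
```

CREATEALIAS swaps the alias atomically, which is why the connector and dashboards can keep running through every rollover.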
Case 3: Lots of Users, Lots of Data
• Search of consumer cloud storage
• Key challenges: not all users are equal, and users grow and change all the time
• Petabytes of data, millions of users, 1000’s of nodes
• 1000’s of collections while isolating access
• Learn more: https://www.youtube.com/watch?v=_Erkln5WWLw and http://www.slideshare.net/lucidworks/scaling-solrcloud-to-a-large-number-of-collections-shalin-shekhar-mangar
Case 3: Key Solr Improvements
• Improved ZooKeeper interactions and performance to handle thousands of collections
• Deep paging (see the cursorMark sketch below)
• Split shards on arbitrary hash ranges
• Large-scale testing
• Collection migration
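A minimal SolrJ sketch of cursor-based deep paging (cursorMark); the collection name and the id uniqueKey field are illustrative:

```java
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.params.CursorMarkParams;

public class DeepPaging {
  public static void main(String[] args) throws Exception {
    HttpSolrClient solr = new HttpSolrClient("http://localhost:8983/solr/logs_feb26");
    SolrQuery q = new SolrQuery("*:*");
    q.setRows(1000);
    q.setSort(SolrQuery.SortClause.asc("id")); // sort must include the uniqueKey
    String cursor = CursorMarkParams.CURSOR_MARK_START; // "*"
    while (true) {
      q.set(CursorMarkParams.CURSOR_MARK_PARAM, cursor);
      QueryResponse rsp = solr.query(q);
      // process rsp.getResults() ...
      String next = rsp.getNextCursorMark();
      if (cursor.equals(next)) break; // cursor stopped advancing: done
      cursor = next;
    }
    solr.close();
  }
}
```

Unlike start/rows paging, cursorMark keeps deep pages cheap: Solr never has to collect and throw away all the documents before the requested offset.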
Testing: Solr Scale Toolkit
https://github.com/LucidWorks/solr-scale-tk
• Test automation scripts: Python with Fabric, Boto, etc.; easily deploy clusters of nodes using our custom AMIs
• Support services: Kafka (MQ / data integration), Logstash (log aggregation / analysis), CollectD/SiLK (system / JMX monitoring), a test-results DB, and ZooKeeper; test data stored in Amazon S3
• Client nodes: JMeter client nodes drive the indexing & query requests from the tests
• Solr cluster (NxM nodes on a custom AMI): Solr nodes on ports 8983, 898X, … each hosting multiple cores, with SolrCloud traffic between all Solr nodes and the ZK ensemble (ZooKeeper-1, -2, -3)
• Monitoring: logs aggregated from the NxM Solr nodes; JMX notifications and ZK JMX stats feed system monitoring of the N machines
• Key point: each test defines the density of cores per node and the number of Solr nodes per machine, as well as the instance type and number of machines
Case 4: Signals for Search and Discovery
• Power user search and recommendations over news content and engagement signals (shares, views, etc.) using Lucidworks Fusion
• Combines content-based and collaborative filtering approaches to calculate search boosts and “people who did X also did Y” results (a boost-query sketch follows below)
• Data: ~10M events (POC), growing to 3-4B per month
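A minimal SolrJ sketch of a signal-driven boost, assuming the aggregated signals have already been written onto each article as a hypothetical click_count field (Fusion computes such aggregates; the collection, query, and field names are assumptions):

```java
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;

public class SignalBoostQuery {
  public static void main(String[] args) throws Exception {
    HttpSolrClient solr = new HttpSolrClient("http://localhost:8983/solr/news");
    SolrQuery q = new SolrQuery("election results");
    q.set("defType", "edismax");
    q.set("qf", "title^2 body");
    // Multiply text relevance by a damped function of engagement,
    // so heavily clicked articles rise without swamping the text score
    q.set("boost", "log(sum(click_count,1))");
    solr.query(q).getResults()
        .forEach(d -> System.out.println(d.getFieldValue("title")));
    solr.close();
  }
}
```

The log(sum(...,1)) dampening is a common design choice: raw click counts span orders of magnitude, and an undamped multiplicative boost would let popularity drown out relevance.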
Next Level Signals
• Lucidworks Fusion 1.4 (May ’15) will ship Apache Spark as a scheduled service for large-scale aggregations, machine learning, and more
• We’ve already seen a 3x speedup in some tests
• Will ship with the ALS recommendation algorithm and Mahout algorithms
• Solr as a Spark RDD
• https://github.com/LucidWorks/spark-solr
Fusion Architecture
[Diagram: millions of users → optional REST proxy → Fusion services (connectors, pipelines, recommendations, NLP, metrics, scheduler, blobs, admin) and Spark workers with a cluster manager → Solr shards and signals on HDFS (billions of docs), coordinated by a ZooKeeper ensemble (ZK 1 … ZK N) providing shared config management, leader election, and load balancing; security is woven throughout]
Roadmap
• Native, pluggable security in Solr (April/May)
• Numerous performance enhancements for replication in shards
• Cross-core ValueSources
• Many new extensions for facets and analytics:
• Percentiles (t-digest); see the sketch below
• Facet combinations
• Dynamic expressions over result sets
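A minimal SolrJ sketch of the t-digest-backed percentile aggregation via the JSON Facet API that was landing in Solr 5.x; since this is roadmap material, treat the syntax as indicative, and note the collection and field names are illustrative:

```java
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;

public class PercentileFacet {
  public static void main(String[] args) throws Exception {
    HttpSolrClient solr = new HttpSolrClient("http://localhost:8983/solr/logs_feb26");
    SolrQuery q = new SolrQuery("*:*");
    q.setRows(0); // facets only, no documents
    // t-digest percentiles over a hypothetical response_time_ms field,
    // bucketed by a hypothetical status field
    q.set("json.facet",
        "{ by_status: { type: terms, field: status,"
      + "  facet: { latency: 'percentile(response_time_ms, 50, 95, 99)' } } }");
    System.out.println(solr.query(q).getResponse().get("facets"));
    solr.close();
  }
}
```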
Next steps
Download Fusion: http://www.lucidworks.com/products/fusion
Contact Lucidworks: http://lucidworks.com/company/contact/
Contact Me: grant@lucidworks.com @gsingers