3. Solr is both established & growing
• 10M+ total downloads; 250,000+ monthly downloads
• Largest community of developers; 2,500+ open Solr jobs
• Solr is the most widely used search solution on the planet
Lucidworks: unmatched Solr expertise
• 1/3 of the active committers
• 70% of the open source code is committed by Lucidworks
• Lucene/Solr Revolution: the world’s largest open source user conference dedicated to Lucene/Solr
Solr has tens of thousands of applications in production. You use Solr every day.
Solr in a Nutshell
4. Solr Key Features
• Full text search (information retrieval)
• Facets/guided navigation galore!
• Lots of data types
• Spelling, auto-complete, highlighting
• Cursors
• More Like This
• De-duplication
• Apache Lucene under the hood
• Grouping and joins
• Stats, expressions, transformations, and more
• Language detection
• Extensible
• Massive scale/fault tolerance
5. Hadoop Basics
What’s old is new again!
• Build/store indexes in HDFS
• https://cwiki.apache.org/confluence/display/solr/Running+Solr+on+HDFS
• The block cache is your friend
Deployment and security support:
• https://github.com/LucidWorks/yarn-proto
• Slider and Ambari support coming soon
• Authorization (authz), authentication (authc), and document filtering coming in April/May
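Enabling HDFS storage is a solrconfig.xml change. A rough sketch, with placeholder host and path values (see the cwiki page above for the full property list):

```xml
<!-- Store the index in HDFS instead of the local filesystem (placeholder values) -->
<directoryFactory name="DirectoryFactory" class="solr.HdfsDirectoryFactory">
  <str name="solr.hdfs.home">hdfs://namenode:8020/solr</str>
  <str name="solr.hdfs.confdir">/etc/hadoop/conf</str>
  <!-- "Block cache is your friend": cache HDFS blocks in off-heap memory -->
  <bool name="solr.hdfs.blockcache.enabled">true</bool>
</directoryFactory>
```

The index lock type also needs to be switched to hdfs (via `<lockType>` in `<indexConfig>`).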
6. Lucidworks + Hadoop
• Ingestion tools for various file formats, etc.
• Two-way Hive load/store support
• Pig load/store
• http://lucidworks.com/product/integrations/hadoop/
• (More on Spark in a bit)
7. Case 1: Compliance
• Monitoring and customer-service search over large-volume transactional data
• Initial setup:
• 20 machines, 32 GB RAM, 800 GB SSD, 2 Solr nodes per machine
• Indexing from Kafka into Solr (Lucidworks Fusion)
• 14B+ docs indexed/searchable in the POC (disk limited)
• Growth to 4B+ docs per day with a 6-month life expectancy per document
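Back-of-the-envelope capacity math for those figures (my own arithmetic, using the 4B/day and 6-month numbers from the slide):

```python
# Rough steady-state sizing for Case 1 (assumed: 4B docs/day,
# ~180-day retention, per the slide).
docs_per_day = 4_000_000_000
retention_days = 180  # "6-month life expectancy"

live_docs = docs_per_day * retention_days
print(f"{live_docs:,} docs live at steady state")  # 720,000,000,000

# Sustained indexing rate needed just to keep up:
docs_per_sec = docs_per_day / 86_400
print(f"~{docs_per_sec:,.0f} docs/sec sustained")  # ~46,296 docs/sec
```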
8. Case 2: Web Analytics
• Large-scale ad-hoc analytics over weblogs, using Tableau as a front-end BI tool for Solr
• Initial setup:
• 4 machines, 128 GB of RAM, several Solr nodes per machine
• Data originally in Hive
• POC: 10s of billions of events, growing to 150B+ per week
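For scale, the 150B/week figure implies a sustained ingest rate on this order (my arithmetic, not from the deck):

```python
# Rough ingest-rate math for Case 2 (assumed: 150B events/week from the slide).
events_per_week = 150_000_000_000
events_per_sec = events_per_week / (7 * 86_400)  # seconds in a week
print(f"~{events_per_sec:,.0f} events/sec sustained")  # ~248,016 events/sec
```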
9. Sample Architecture: Fusion Logstash connector
• The connector writes to the current_log_writer collection alias at up to 50K docs/sec; multiple shards are needed to support 50K writes per second
• The current_log_writer alias rolls over to a new transient collection every two hours; the shards in the transient collection are merged into the 2-hour shard and added to the daily collection
• The latest 2-hour shard is built by merging shards at the time-bucket boundary
• Every daily collection has 12 (or 24) shards, each covering a 2-hour block of log messages
[Diagram: Fusion Logstash connector → current_log_writer (collection alias) → logs_feb26_h24 (transient collection, shards 1–4) → daily collections logs_feb01 … logs_feb25, logs_feb26, each with 2-hour shards h02 … h22, h24]
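The naming scheme in the diagram can be sketched as follows. Collection and shard names like logs_feb26 and h02/h24 follow the diagram; the helper functions themselves are my illustration, not Fusion code:

```python
from datetime import datetime

def daily_collection(ts: datetime) -> str:
    """Daily collection name, e.g. logs_feb26, per the diagram's convention."""
    return ts.strftime("logs_%b%d").lower()

def two_hour_shard(ts: datetime) -> str:
    """2-hour shard within the day: h02 covers 00:00-02:00, ... h24 covers 22:00-24:00."""
    bucket_end = (ts.hour // 2 + 1) * 2
    return f"h{bucket_end:02d}"

ts = datetime(2015, 2, 26, 23, 15)
print(daily_collection(ts), two_hour_shard(ts))  # logs_feb26 h24
```

At each two-hour boundary the writer alias would be repointed at a fresh transient collection, and the just-finished bucket merged into its daily collection as the matching hNN shard.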
10. Sample Query Execution
• Every daily collection has 12 (or 24) shards, each covering a 2-hour block of log messages
• Replicas can be added to support higher query volume and fault tolerance
• The todays_logs collection alias rolls over to a new daily collection automatically at the day boundary
[Diagram: Fusion SiLK Dashboard → recent_logs (collection alias) and todays_logs (collection alias) → daily collections logs_feb01 … logs_feb25, logs_feb26, each with 2-hour shards h02 … h22, h24]
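On the query side, an alias like recent_logs simply fans out to the daily collections covering the requested window. A sketch of how the alias's member list could be computed (illustrative only; in practice the Collections API's CREATEALIAS call maintains the mapping):

```python
from datetime import date, timedelta

def collections_for_window(end: date, days: int) -> list:
    """Daily collections (newest first) an alias like recent_logs would span."""
    return [
        (end - timedelta(days=i)).strftime("logs_%b%d").lower()
        for i in range(days)
    ]

print(collections_for_window(date(2015, 2, 26), 3))
# ['logs_feb26', 'logs_feb25', 'logs_feb24']
```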
11. Case 3: Lots of Users, Lots of Data
• Search of consumer cloud storage
• Key challenges: not all users are equal; users grow and change all the time
• Petabytes of data, millions of users, 1000’s of nodes
• 1000’s of collections while isolating access
• Learn more: https://www.youtube.com/watch?v=_Erkln5WWLw and http://www.slideshare.net/lucidworks/scaling-solrcloud-to-a-large-number-of-collections-shalin-shekhar-mangar
12. Case 3: Key Solr Improvements
• Improved ZooKeeper interactions and performance to handle thousands of collections
• Deep paging
• Split shards on arbitrary hash ranges
• Large-scale testing
• Collection migration
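Deep paging in Solr is done with cursorMark: start with cursorMark=*, sort on the uniqueKey, and loop until the returned nextCursorMark stops changing. A minimal sketch of that loop against a stubbed fetch function (the stub stands in for an HTTP call to /select; real Solr cursors are opaque strings, not offsets):

```python
def fetch_page(cursor, page_size, _docs=list(range(10))):
    """Stub for a Solr /select call with cursorMark.

    Returns (docs, nextCursorMark). Real Solr encodes the cursor
    opaquely; here it is just a stringified offset for illustration.
    """
    start = 0 if cursor == "*" else int(cursor)
    docs = _docs[start:start + page_size]
    return docs, str(start + len(docs))

def all_docs(page_size=4):
    cursor, results = "*", []
    while True:
        docs, next_cursor = fetch_page(cursor, page_size)
        results.extend(docs)
        if next_cursor == cursor or not docs:  # cursor stops advancing => done
            break
        cursor = next_cursor
    return results

print(all_docs())  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```

Unlike start/rows paging, each page costs the same no matter how deep you go, which is what makes exports over billions of docs feasible.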
13. Testing: Solr Scale Toolkit
https://github.com/LucidWorks/solr-scale-tk
• Test automation scripts: Python with Fabric, Boto, etc.
• Support services: Kafka (MQ/data integration), Logstash (log aggregation/analysis), CollectD/SiLK (system/JMX monitoring), and a test-results DB
• JMeter client nodes (Client Node 1, 2, …) drive indexing and query requests from the tests
• Solr cluster (N×M nodes, ports 8983…898X) easily deployed from custom AMIs; SolrCloud traffic flows between all Solr nodes and ZooKeeper
• ZooKeeper ensemble (ZooKeeper-1…3); test data stored in Amazon S3
• Logs aggregated from the N×M Solr nodes; system monitoring of all machines via JMX notifications and ZK JMX stats
• Key point: each test defines the density of cores per node, the number of Solr nodes per machine, the instance type, and the number of machines
14. Case 4: Signals for Search and Discovery
• Power user search and recommendations over news content and engagement signals (shares, views, etc.) using Lucidworks Fusion
• Combines content-based and collaborative filtering approaches to calculate search boosts and “people who did X also did Y” results
• Data: ~10M events (POC), growing to 3–4B per month
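A toy illustration of the “people who did X also did Y” co-occurrence idea (my own sketch with made-up signal data; Fusion’s actual signal aggregation is considerably more involved):

```python
from collections import Counter
from itertools import combinations

# Assumed toy signal data: (user, item) interaction events.
events = [
    ("u1", "doc_a"), ("u1", "doc_b"),
    ("u2", "doc_a"), ("u2", "doc_b"), ("u2", "doc_c"),
    ("u3", "doc_a"), ("u3", "doc_c"),
]

# Group items by user, then count item co-occurrences across users.
by_user = {}
for user, item in events:
    by_user.setdefault(user, set()).add(item)

cooc = Counter()
for items in by_user.values():
    for a, b in combinations(sorted(items), 2):
        cooc[(a, b)] += 1
        cooc[(b, a)] += 1

def also_did(item, k=2):
    """Top-k items co-occurring with `item` -- usable as query-time boosts."""
    pairs = [(other, n) for (x, other), n in cooc.items() if x == item]
    return sorted(pairs, key=lambda p: -p[1])[:k]

print(also_did("doc_a"))  # doc_b and doc_c each co-occur twice with doc_a
```

The co-occurrence counts become boost values on a Solr query, blending collaborative signals with content-based relevance.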
15. Next Level Signals
• Lucidworks Fusion 1.4 (May ’15) will ship Apache Spark as a scheduled service for large-scale aggregations, machine learning, and more
• We’ve already seen a 3x speedup in some tests
• Will ship with an ALS recommendation algorithm and Mahout algorithms
• Solr as a Spark RDD
• https://github.com/LucidWorks/spark-solr
16. Fusion Architecture
[Diagram: millions of users reach Fusion through load balancing and an optional proxy, with a REST API and security woven throughout. Fusion services: connectors, pipelines, recommendations, metrics, NLP, scheduler, blobs, and admin. Beneath them: Spark (workers + cluster manager), Solr shards holding billions of docs plus signals, and HDFS. A ZooKeeper ensemble (ZK 1 … ZK N) provides shared config management and leader election.]
17. Roadmap
• Native, pluggable security in Solr (April/May)
• Numerous performance enhancements for replication in shards
• Cross-core ValueSources
• Many new extensions for facets and analytics:
• Percentiles (t-digest)
• Facet combinations
• Dynamic expressions over result sets