This document summarizes how Solr and Lucidworks Fusion can be used for big data search and analytics. It discusses indexing strategies like using MapReduce, Spark, and Fusion connectors to index structured and unstructured data from HDFS. It also covers topics like Solr on HDFS, auto add replicas, security, cluster sizing, and using the lambda architecture with Spark streaming to enable real-time search over batch-processed historical data. The document promotes Lucidworks Fusion as a search platform that can handle massive scales of data, provide real-time search capabilities, and work with any data source securely.
1. Solr & Fusion for Big Data Search and Analytics
2. Solr & Fusion for Big Data
• Where search fits in the big data landscape
• Solr on HDFS
• Indexing strategies
• End-to-end security
• Lambda architecture
• Spark and how we use it in Fusion
4. Why search for big data?
• Speed at scale
• Basic analytics (facets, pivot facets, facets + stats) + visualizations (see the SolrJ sketch after this list)
• Query structured and unstructured data
• Ad hoc exploration is inherent in big data
• People grok search
• Context for aggregations (drill into the numbers)
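To make the "basic analytics" point concrete, here is a minimal sketch against the SolrJ 6.x API that runs a field facet and a pivot facet over a log collection. The ZooKeeper address, collection name, and field names (level_s, host_s) are assumptions for illustration, not from the deck.

```java
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class LogFacets {
    public static void main(String[] args) throws Exception {
        // ZooKeeper address and collection name are placeholders
        try (CloudSolrClient solr = new CloudSolrClient.Builder()
                .withZkHost("localhost:9983").build()) {
            solr.setDefaultCollection("recent_logs");

            SolrQuery q = new SolrQuery("*:*");
            q.setRows(0);                            // aggregations only, no hits
            q.addFacetField("level_s");              // counts per log level
            q.addFacetPivotField("host_s,level_s");  // levels broken down by host

            QueryResponse rsp = solr.query(q);
            rsp.getFacetField("level_s").getValues().forEach(c ->
                    System.out.println(c.getName() + ": " + c.getCount()));
        }
    }
}
```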
5. Common use case: log analysis
• Time-ordered data
• Raw data stored in HDFS
• How much data? How fast?
• Access patterns?
• Schema design ~ no free lunch at scale
6. Time-based Partitioning Scheme
[Diagram: a Fusion log-analytics dashboard queries "recent_logs", a collection alias spanning daily collections logs_feb01 … logs_feb26. Every daily collection has 24 shards (h00-h23), each covering a 1-hour block of log messages.]
• Add replicas to support higher query volume & fault tolerance
• Use a collection alias to make multiple collections look like a single collection; minimizes exposure to the partitioning strategy in the client layer
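As a concrete illustration of this scheme, the SolrJ 6.x sketch below creates one daily collection with 24 implicitly routed hourly shards and repoints the recent_logs alias. The ZooKeeper address, config set name, and replica count are assumptions for the sketch.

```java
import java.util.stream.Collectors;
import java.util.stream.IntStream;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.client.solrj.request.CollectionAdminRequest;

public class DailyPartitions {
    public static void main(String[] args) throws Exception {
        try (CloudSolrClient solr = new CloudSolrClient.Builder()
                .withZkHost("localhost:9983").build()) {

            // One shard per hour: h00,h01,...,h23. The implicit router lets the
            // indexing job route each log event to the shard for its hour.
            String hourlyShards = IntStream.range(0, 24)
                    .mapToObj(h -> String.format("h%02d", h))
                    .collect(Collectors.joining(","));

            CollectionAdminRequest
                    .createCollectionWithImplicitRouter(
                            "logs_feb26", "logs_config", hourlyShards, 2)
                    .process(solr);

            // Repoint the alias so clients never see the partitioning scheme
            CollectionAdminRequest
                    .createAlias("recent_logs", "logs_feb24,logs_feb25,logs_feb26")
                    .process(solr);
        }
    }
}
```

Rolling the alias forward each day keeps client queries stable while old daily collections age out of the "recent" window.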
7. Solr on HDFS
• Maturing solution, but still some issues
• My tests showed it ~23-25% slower than local SSD
• Better ROI, operational efficiency, security
• Needed for YARN
• Enables autoAddReplicas
• Interesting features coming soon: ZooKeeper lock (SOLR-8169) and replicas sharing an index (SOLR-6237)
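For reference, this is how the Solr Reference Guide documents starting a SolrCloud node with its index and transaction log on HDFS; the NameNode host/port and path are placeholders:

```
bin/solr start -c \
  -Dsolr.directoryFactory=HdfsDirectoryFactory \
  -Dsolr.lock.type=hdfs \
  -Dsolr.hdfs.home=hdfs://namenode:8020/solr
```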
8. Solr on HDFS
[Diagram: two replicas of shard1, each Solr node with its own block cache, reading from and writing to HDFS DataNodes A, B, and C; index files are replicated by HDFS block replication, while Solr replication operates between the replicas.]
12. Fusion Indexing Pipelines in MapReduce
[Diagram: N map tasks (one per HDFS block), or reducers if needed, each run a Fusion pipeline over docs read from HDFS; a CloudSolrClient in each task gets collection metadata (e.g., shard leader URLs) from ZooKeeper and sends updates to the shard leaders in parallel.]
• 30+ index stages: field mapping, JavaScript, Tika parsing, NLP, regex, JDBC lookup
• Many common file formats supported: CSV, SequenceFile, grok, XML, warc
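The following is a minimal Hadoop Mapper sketch of this flow (not Fusion's actual pipeline runner): each map task turns its block of log lines into documents and hands them to a CloudSolrClient, which reads cluster state from ZooKeeper and fans updates out to the shard leaders. The ZooKeeper address, collection, and field names are assumptions.

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class LogIndexingMapper extends Mapper<LongWritable, Text, Text, Text> {
    private CloudSolrClient solr;

    @Override
    protected void setup(Context ctx) {
        // Reads collection metadata (e.g., shard leader URLs) from ZooKeeper
        solr = new CloudSolrClient.Builder().withZkHost("zk1:2181").build();
        solr.setDefaultCollection("logs_feb26");
    }

    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
            throws IOException, InterruptedException {
        SolrInputDocument doc = new SolrInputDocument();
        doc.setField("id", ctx.getTaskAttemptID() + "-" + offset.get());
        doc.setField("message_txt", line.toString());
        try {
            solr.add(doc); // routed to the right shard leader; batching omitted
        } catch (Exception e) {
            throw new IOException(e);
        }
    }

    @Override
    protected void cleanup(Context ctx) throws IOException {
        solr.close();
    }
}
```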
13. Security
• End-to-end security is now a reality for Hadoop
• Kerberos authentication (ZK, Solr, HDFS, jobs)
• Pluggable authorization framework
• Collection and document-level access controls (via Fusion)
• SSL
• Apache Ranger (centralized admin, auditing, monitoring for Hadoop)
14. Cluster Sizing Worksheet
• There is no formula, only guidelines!
• # of documents / avg. doc size / number of fields
• Updates per second / soft-commit frequency
• Storage type (local SSD vs. HDFS)
• Sharding scheme (time-based vs. hash-based)
• Peak QPS / 95th percentile response time / query complexity
• Must test your data on your servers ;-)
15. Lambda Architecture (source: http://lambda-architecture.net/)
• Search engine fits perfectly with lambda
• Use batch layer to build indexes instead of “views”
• Speed layer uses Spark Streaming to build a near-real-time index
• Aggregation collections for historical data
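As a sketch of the speed layer, here is a minimal Spark Streaming job (Java API) that indexes incoming log lines into a near-real-time Solr collection via CloudSolrClient. The socket source, ZooKeeper address, collection, and field names are placeholders for illustration; Fusion's actual Spark integration is more involved than this.

```java
import java.util.UUID;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class SpeedLayer {
    public static void main(String[] args) throws Exception {
        SparkConf conf = new SparkConf().setAppName("nrt-log-indexer");
        JavaStreamingContext ssc =
                new JavaStreamingContext(conf, Durations.seconds(5));

        // Placeholder source: raw log lines arriving on a socket
        JavaDStream<String> lines = ssc.socketTextStream("localhost", 9999);

        lines.foreachRDD(rdd -> rdd.foreachPartition(partition -> {
            // One client per partition; updates go straight to shard leaders
            try (CloudSolrClient solr = new CloudSolrClient.Builder()
                    .withZkHost("zk1:2181").build()) {
                solr.setDefaultCollection("logs_nrt");
                while (partition.hasNext()) {
                    SolrInputDocument doc = new SolrInputDocument();
                    doc.setField("id", UUID.randomUUID().toString());
                    doc.setField("message_txt", partition.next());
                    solr.add(doc);
                }
            }
        }));

        ssc.start();
        ssc.awaitTermination();
    }
}
```

Pairing this with a soft autoCommit on the near-real-time collection keeps newly indexed events searchable within seconds.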
I don’t have to tell you that big data is a popular topic of discussion in IT circles these days.
What I want to talk about today is how Solr and Lucidworks Fusion fit into the big data landscape. Don’t worry, I’ll try to keep the hype and grandiose statements to a minimum. I will get technical in a few places because it’s important to understand the details.
Search is a critical component of any big data strategy
Fusion & Solr are first-class citizens in the Hadoop ecosystem
Big data doesn’t have to be hard – Fusion makes it easy
Search engines contain mission-critical data and are typically on the front-line, directly serving users
Before I was a Solr committer, I was a Solr user, one of the first adopters of SolrCloud actually. I worked on a team that built and supported a big data framework built on Hadoop, Storm, Cassandra, Solr, and Postgres. Effectively, we computed performance metrics for brands by analyzing social media data.
IT organizations are consolidating data infrastructure for improved ROI, efficiency, security, and governance.
Solr is included as part of Cloudera, Hortonworks, and MapR Hadoop distributions
Users get search because they see it every day; BI / dashboards / SQL are powerful, but not necessarily intuitive.
A vast amount of data exhaust is created by users interacting with searchable content.
Oftentimes it’s a small department in a larger organization that uses search to expose medium-sized data to deliver business insights, and then the “search engine” evolves into an insights engine on larger and larger data sets.
If you can plan for all the possible queries you need to serve, then traditional BI / data warehousing techniques will serve you well. Search fills the void where users need fast, ad hoc query capabilities to do exploratory analysis.
Let’s imagine we have time-ordered data, such as logs of user activity. You can insert any scale that fits your needs here. We work with customers that have billions of log events per day, up to tens of billions.
Let’s work through a quick example to illustrate some of the questions that come up and how we tackle them at Lucidworks
We have a bunch of log data in HDFS and want to index it for ad hoc queries and basic visualizations, i.e., the kinds you can power with simple analytical functions like faceting.
The first thing we have to identify is what data we are indexing and where it is coming from.
How much data is there?
How quickly do we need it to be indexed?
But wait … step back a sec … how are people going to search this data?
Three important decisions emerge when designing your search solution:
data partitioning scheme (time-based: hourly, daily, 15-minute, etc.)
field options: fields you need to sort and facet on should have docValues, and range queries need trie fields to be indexed
which fields must be stored / indexed
What type of visualizations make sense for this data? What type of aggregations do you want to perform and at what time granularity?
So here we’re starting to see some of the same considerations as when designing a data warehouse, i.e., there’s no free lunch, especially at scale.
The key takeaway here is that you can use your investment in Hadoop to scale complex document processing with Fusion pipelines by running a pipeline in each map or reduce task.