This document summarizes how Solr and Lucidworks Fusion can be used for big data search and analytics. It discusses indexing strategies like using MapReduce, Spark, and Fusion connectors to index structured and unstructured data from HDFS. It also covers topics like Solr on HDFS, auto add replicas, security, cluster sizing, and using the lambda architecture with Spark streaming to enable real-time search over batch-processed historical data. The document promotes Lucidworks Fusion as a search platform that can handle massive scales of data, provide real-time search capabilities, and work with any data source securely.
1. Solr & Fusion for Big Data Search and Analytics
2. Solr & Fusion for Big Data
• Where search fits in the big data landscape
• Solr on HDFS
• Indexing strategies
• End-to-end security
• Lambda architecture
• Spark and how we use it in Fusion
4. Why search for big data?
• Speed at scale
• Basic analytics (facets, pivot facets, facets + stats) + visualizations (see the SolrJ sketch after this list)
• Query structured and unstructured data
• Ad hoc exploration is inherent in big data
• People grok search
• Context for aggregations (drill into the numbers)
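To make the "basic analytics" point concrete, here is a minimal sketch against the SolrJ 6.x API that runs a field facet and a pivot facet over a log collection. The ZooKeeper address, collection name, and field names (level_s, host_s) are assumptions for illustration, not from the deck.

```java
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class LogFacets {
    public static void main(String[] args) throws Exception {
        // ZooKeeper address and collection name are placeholders
        try (CloudSolrClient solr = new CloudSolrClient.Builder()
                .withZkHost("localhost:9983").build()) {
            solr.setDefaultCollection("recent_logs");

            SolrQuery q = new SolrQuery("*:*");
            q.setRows(0);                            // aggregations only, no hits
            q.addFacetField("level_s");              // counts per log level
            q.addFacetPivotField("host_s,level_s");  // levels broken down by host

            QueryResponse rsp = solr.query(q);
            rsp.getFacetField("level_s").getValues().forEach(c ->
                    System.out.println(c.getName() + ": " + c.getCount()));
        }
    }
}
```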
5. Common use case: log analysis
• Time-ordered data
• Raw data stored in HDFS
• How much data? How fast?
• Access patterns?
• Schema design ~ no free lunch at scale
6. Time-based Partitioning Scheme
[Diagram: a Fusion log-analytics dashboard queries "recent_logs", a collection alias spanning daily collections logs_feb01 … logs_feb26. Every daily collection has 24 shards (h00-h23), each covering a 1-hour block of log messages.]
• Add replicas to support higher query volume & fault tolerance
• Use a collection alias to make multiple collections look like a single collection; minimizes exposure to the partitioning strategy in the client layer
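As a concrete illustration of this scheme, the SolrJ 6.x sketch below creates one daily collection with 24 implicitly routed hourly shards and repoints the recent_logs alias. The ZooKeeper address, config set name, and replica count are assumptions for the sketch.

```java
import java.util.stream.Collectors;
import java.util.stream.IntStream;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.client.solrj.request.CollectionAdminRequest;

public class DailyPartitions {
    public static void main(String[] args) throws Exception {
        try (CloudSolrClient solr = new CloudSolrClient.Builder()
                .withZkHost("localhost:9983").build()) {

            // One shard per hour: h00,h01,...,h23. The implicit router lets the
            // indexing job route each log event to the shard for its hour.
            String hourlyShards = IntStream.range(0, 24)
                    .mapToObj(h -> String.format("h%02d", h))
                    .collect(Collectors.joining(","));

            CollectionAdminRequest
                    .createCollectionWithImplicitRouter(
                            "logs_feb26", "logs_config", hourlyShards, 2)
                    .process(solr);

            // Repoint the alias so clients never see the partitioning scheme
            CollectionAdminRequest
                    .createAlias("recent_logs", "logs_feb24,logs_feb25,logs_feb26")
                    .process(solr);
        }
    }
}
```

Rolling the alias forward each day keeps client queries stable while old daily collections age out of the "recent" window.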
7. Solr on HDFS
• Maturing solution, but still some issues
• My tests showed it ~23-25% slower than local SSD
• Better ROI, operational efficiency, security
• Needed for YARN
• Enables autoAddReplicas
• Interesting features coming soon: ZooKeeper lock (SOLR-8169) and replicas sharing an index (SOLR-6237)
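For reference, this is how the Solr Reference Guide documents starting a SolrCloud node with its index and transaction log on HDFS; the NameNode host/port and path are placeholders:

```
bin/solr start -c \
  -Dsolr.directoryFactory=HdfsDirectoryFactory \
  -Dsolr.lock.type=hdfs \
  -Dsolr.hdfs.home=hdfs://namenode:8020/solr
```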
8. Solr on HDFS
[Diagram: two replicas of shard1, each Solr node with its own block cache, reading from and writing to HDFS DataNodes A, B, and C; index files are replicated by HDFS block replication, while Solr replication operates between the replicas.]
12. Fusion Indexing Pipelines in MapReduce
[Diagram: N map tasks (one per HDFS block), or reducers if needed, each run a Fusion pipeline over docs read from HDFS; a CloudSolrClient in each task gets collection metadata (e.g., shard leader URLs) from ZooKeeper and sends updates to the shard leaders in parallel.]
• 30+ index stages: field mapping, JavaScript, Tika parsing, NLP, regex, JDBC lookup
• Many common file formats supported: CSV, SequenceFile, grok, XML, warc
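The following is a minimal Hadoop Mapper sketch of this flow (not Fusion's actual pipeline runner): each map task turns its block of log lines into documents and hands them to a CloudSolrClient, which reads cluster state from ZooKeeper and fans updates out to the shard leaders. The ZooKeeper address, collection, and field names are assumptions.

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class LogIndexingMapper extends Mapper<LongWritable, Text, Text, Text> {
    private CloudSolrClient solr;

    @Override
    protected void setup(Context ctx) {
        // Reads collection metadata (e.g., shard leader URLs) from ZooKeeper
        solr = new CloudSolrClient.Builder().withZkHost("zk1:2181").build();
        solr.setDefaultCollection("logs_feb26");
    }

    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
            throws IOException, InterruptedException {
        SolrInputDocument doc = new SolrInputDocument();
        doc.setField("id", ctx.getTaskAttemptID() + "-" + offset.get());
        doc.setField("message_txt", line.toString());
        try {
            solr.add(doc); // routed to the right shard leader; batching omitted
        } catch (Exception e) {
            throw new IOException(e);
        }
    }

    @Override
    protected void cleanup(Context ctx) throws IOException {
        solr.close();
    }
}
```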
13. Security
• End-to-end security is now a reality for Hadoop
• Kerberos authentication (ZK, Solr, HDFS, jobs)
• Pluggable authorization framework
• Collection and document-level access controls (via Fusion)
• SSL
• Apache Ranger (centralized admin, auditing, monitoring for Hadoop)
14. Cluster Sizing Worksheet
• There is no formula, only guidelines!
• # of documents / avg. doc size / number of fields
• Updates per second / soft-commit frequency
• Storage type (local SSD vs. HDFS)
• Sharding scheme (time-based vs. hash-based)
• Peak QPS / 95th percentile response time / query complexity
• Must test your data on your servers ;-)
15. Lambda Architecture (source: http://lambda-architecture.net/)
• Search engine fits perfectly with lambda
• Use batch layer to build indexes instead of “views”
• Speed layer uses Spark Streaming to build a near-real-time index
• Aggregation collections for historical data
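As a sketch of the speed layer, here is a minimal Spark Streaming job (Java API) that indexes incoming log lines into a near-real-time Solr collection via CloudSolrClient. The socket source, ZooKeeper address, collection, and field names are placeholders for illustration; Fusion's actual Spark integration is more involved than this.

```java
import java.util.UUID;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class SpeedLayer {
    public static void main(String[] args) throws Exception {
        SparkConf conf = new SparkConf().setAppName("nrt-log-indexer");
        JavaStreamingContext ssc =
                new JavaStreamingContext(conf, Durations.seconds(5));

        // Placeholder source: raw log lines arriving on a socket
        JavaDStream<String> lines = ssc.socketTextStream("localhost", 9999);

        lines.foreachRDD(rdd -> rdd.foreachPartition(partition -> {
            // One client per partition; updates go straight to shard leaders
            try (CloudSolrClient solr = new CloudSolrClient.Builder()
                    .withZkHost("zk1:2181").build()) {
                solr.setDefaultCollection("logs_nrt");
                while (partition.hasNext()) {
                    SolrInputDocument doc = new SolrInputDocument();
                    doc.setField("id", UUID.randomUUID().toString());
                    doc.setField("message_txt", partition.next());
                    solr.add(doc);
                }
            }
        }));

        ssc.start();
        ssc.awaitTermination();
    }
}
```

Pairing this with a soft autoCommit on the near-real-time collection keeps newly indexed events searchable within seconds.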
I don’t have to tell you that big data is a popular topic of discussion in IT circles these days.
What I want to talk about today is how Solr and Lucidworks Fusion fit into the big data landscape. Don’t worry, I’ll try to keep the hype and grandiose statements to a minimum. I will get technical in a few places because it’s important to understand the details.
Search is a critical component of any big data strategy
Fusion & Solr are first-class citizens in the Hadoop ecosystem
Big data doesn’t have to be hard – Fusion makes it easy
Search engines contain mission-critical data and are typically on the front-line, directly serving users
Before I was a Solr committer, I was a Solr user, one of the first adopters of SolrCloud actually. I worked on a team that built and supported a big data framework built on Hadoop, Storm, Cassandra, Solr, and Postgres. Effectively, we computed performance metrics for brands by analyzing social media data.
IT organizations are consolidating data infrastructure for improved ROI, efficiency, security, and governance.
Solr is included as part of Cloudera, Hortonworks, and MapR Hadoop distributions
Users get search because they see it every day; BI / dashboards / SQL are powerful, but not necessarily intuitive.
A vast amount of data exhaust is created by users interacting with searchable content.
Oftentimes it’s a small department in a larger organization that uses search to expose medium-sized data to deliver business insights, and then the “search engine” evolves into an insights engine on larger and larger data sets.
If you can plan for all the possible queries you need to serve, then traditional BI / data warehousing techniques will serve you well. Search fills the void where users need fast, ad hoc query capabilities to do exploratory analysis.
Let’s imagine we have time-ordered data, such as logs of user activity. You can insert any scale that fits your needs here. We work with customers that have billions of log events per day, up to tens of billions.
Let’s work through a quick example to illustrate some of the questions that come up and how we tackle them at Lucidworks
We have a bunch of log data in HDFS and want to index it for ad hoc queries and basic visualizations, i.e., the kinds you can power with simple analytical functions like faceting.
The first thing we have to identify is what data we are indexing and where it is coming from.
How much data is there?
How quickly do we need it to be indexed?
But wait … step back a sec … how are people going to search this data?
Three important decisions emerge when designing your search solution:
data partitioning scheme (time-based: hourly, daily, 15-minute, etc.)
field options: fields you need to sort and facet on should have docValues, and range queries need trie fields to be indexed
which fields must be stored / indexed
What type of visualizations make sense for this data? What type of aggregations do you want to perform and at what time granularity?
So here we’re starting to see some of the same considerations as when designing a data warehouse, i.e., there’s no free lunch, especially at scale.
The key takeaway here is that you can use your investment in Hadoop to scale complex document processing with Fusion pipelines by running a pipeline in each map or reduce task.