Integrating Hadoop & Solr

Who am I?
Yann Yu
Systems Engineer @ Lucidworks

Lucidworks is search.
Technology Retail Financial
Healthcare Services Industrial

Lucidworks is the commercial entity of the Lucene/Solr
project.
8M+total
Solr is both established &
growing
downloads
250,000+ monthly downloads
You use
Solr everyday.
Largest community of developers.
2500+ open Solr jobs.
Solrmost widely used
search
solution on the planet.
Lucidworks Unmatched Solr
expertise.
1/3 of the active
committers
70% of the open source
code is committed
Lucene/Solr Revolution world’s largest open source user
conference dedicated to Lucene/
Solr.
Solr has tens of thousands
of applications in production.

Why would you integrate Hadoop and Solr?
(and how would you do that?)

• Open-source
• Enterprise support
• Cheap, scalable storage
• Distributed computation
• Farm animals and many other
related projects for extensibility
• Open-source, Lucene based
• Enterprise support
• Real-time queries
• Full-text search
• NoSQL capabilities
• Repeatedly proven in production
environments at massive scales
• Uses ZooKeeper for clustering

I have Hadoop, why do I need Solr?
Hadoop excels in storing and working with large amounts of data,
but has difficulty with frequent, random access to it
• NoSQL front-end to Hadoop: Enable fast, ad-hoc, search across
structured and unstructured big data
• Empower users of all technical ability to interact with, and derive
value from, big data — all using a natural language search interface
(no MapReduce, Pig, SQL, etc.)
• Preliminary data exploration and analysis
• Near real-time indexing and querying
• Thousands of simultaneous, parallel requests
• Share machine-learning insights created on Hadoop to a broad
audience through an interactive medium

I have Solr, why do I need Hadoop?
As Solr indexes grow in size, the size and number of the machines hosting Solr
must also grow, increasing index time and complexity
• Least expensive storage solution in market
• Leverage Hadoop processing power (MapReduce) to build
indexes or send document updates to Solr
• Store Solr indexes and transaction logs within HDFS
• Augment Solr data by storing additional information for last-second
retrieval in Hadoop

The enterprise storage situation today
⚒ • Large enterprises often have data
distributed in many different stores, making
it hard to know where to start looking
• Employees have to check with others to
verify versions of documents
• Even with hosting, knowledge is still largely
tribal

Enterprise data deployment
Lucidworks HDFS connector
processes documents and
sends to SolrCloud
Enterprise documents
are stored in HDFS
And retrieve source
files directly from
HDFS as necessary
Users make ad-hoc, full-text
queries across the full content
of all documents in Solr
Standard document storage and search

• Documents can be migrated from other file
storage systems via Flume or other scripts
• MapReduce allows for batch processing of
documents (e.g. OCR, NER, clustering, etc.)
Sink documents into HDFS

Index document contents into Solr
• The Lucidworks Hadoop
connector parses content from
files using many different tools
• Tika, GrokIngest, CSV
mapping, Pig, etc.
• Content and data are added to
fields in a Solr document
• The resulting document is sent
to Solr for indexing

Enable users to search and access content
• Users are empowered with ad-hoc,
full-text search in Solr
• Provides standard search tools
such as autocomplete, more-like-this,
spellchecking, faceting, etc.
• Users only access HDFS as needed

The data warehouse
• Enterprises are storing data without a clear
plan on how to access it
• The “data warehouse” is full of files, but with
no way to pull documents, or to find what
you’re looking for
• In some cases, the data is required for
compliance and isn’t used otherwise

Log record search
Machine generated log records
are sent to Flume.
Flume forwards raw log record
to Hadoop for archiving.
Flume simultaneously parses out
data in record into a Solr document,
forwarding resulting document to Solr
Lucidworks SiLK exposes real-time
statistics and analytics to end-users,
as well as full-text search
High volume indexing of many small records

Flume archives data in HDFS
• Flume performs minimal work on log
files and sends them directly into
HDFS for archival
• Under optimal circumstances, the log
files are sized to the block size of
HDFS

Flume submits records to Solr
• Flume processes records, extracting
strings, ints, dates, times, and other
information into Solr fields
• Once the Solr document is created, it
is submitted to Solr for indexing
• This process happens in real-time,
allowing for near real-time search

Real-time analytics dashboard
• Lucidworks SiLK allows users to create
simple dashboards through a GUI
• The SiLK dashboard will issue queries to
Solr, rendering the received data in
tables, graphs, and other plots
• Users can also perform full-text search
across the data, allowing for extremely
fine granularity

High traffic Solr deployments
• Some users of Solr, especially in the e-commerce
case, are running high query
volume sites with small document sets
• Master-slave works well enough, but
doesn’t allow for NRT and similar features
form SolrCloud

E-commerce search
Lots of queries, not a lot of updates
Solr is pointed at an index on
HDFS, and pulls it up to
begin serving queries
Additional Solr machines can be
spun-up on demand, pulling the
index directly from HDFS
Load balancer (or SolrJ)
distributes query
to active nodes

MapReduce Solr index generation
• Existing product tables or catalogs can stored in
HDFS or HBase, and can continue to be updated as
necessary
• Hadoop can utilize the MapReduceIndexerTool to
parallelize building of indexes
• As many indexes as necessary can be built in this way

Ad-hoc scaling without manual replication
• Independent Solr nodes (not
SolrCloud) can be started up
and use the stored index data
on HDFS
• These can be spun up in an
ad-hoc fashion, allowing for
an elastically scalable cluster
• Updates to indexes are
versatile, can be pushed in via
new collections or as updates
to existing collections

Highly-available search
• New search nodes are simply added to the
load balancer or smart-client
• Distributed queries allow for sharded data-sets
• Results from all nodes are guaranteed to be
consistent with one-another

End
Find me at:
yann.yu@lucidworks.com
@yawnyou
Any questions?

Integrating Hadoop & Solr

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Viewers also liked

Viewers also liked (20)

Similar to Integrating Hadoop & Solr

Similar to Integrating Hadoop & Solr (20)

More from Lucidworks (Archived)

More from Lucidworks (Archived) (20)

Recently uploaded

Recently uploaded (20)

Integrating Hadoop & Solr