The Search Is Over: Integrating Solr and Hadoop in the Same Cluster to Simplify Big Data Analytics

©MapR Technologies - Confidential

Evolution of Search

Documents
•Models
•Feature Selection

Content Relationships
•Page Rank, etc.
•Organization

User Interaction
•Clicks
•Ratings/Reviews
•Learning to Rank
•Social Graph

Queries
•Phrases
•NLP


Search Discovery and Analytics

[Diagram: a loop linking Search, Analytics, and Discovery]


Data is Growing Quickly

Business Analytics Requires a New Approach

 Data Volume Growing 44x
– 2010: 1.2 Zettabytes
– 2020: 35.2 Zettabytes
– IDC Digital Universe Study 2011
 Data is Growing Faster than Moore’s Law
Source: IDC Digital Universe Study, sponsored by EMC, May 2010

MapReduce: A Paradigm Shift
 Distributed computing platform
– Large clusters
– Commodity hardware
 Pioneered at Google
– Bigtable and Google File System
 Commercially available as Hadoop


Hadoop Explosion


How does Map/Reduce work?
1. Map
– Spread data across servers based on key/value pairs
– Each node independently scans local data
2. Servers produce Map results
3. Reduce - combine/merge Map results
4. Process complete or Map a new function
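
The four steps above can be sketched in a few lines of single-process Python. This is only an illustration of the map, shuffle, and reduce phases, not Hadoop's API; the function names (`map_phase`, `shuffle`, `reduce_phase`) and the word-count job are assumptions chosen for the example.

```python
from collections import defaultdict

def map_phase(records, map_fn):
    """Run map_fn over every record and collect the emitted (key, value) pairs."""
    pairs = []
    for record in records:
        pairs.extend(map_fn(record))
    return pairs

def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups, reduce_fn):
    """Combine the values collected for each key into one result per key."""
    return {key: reduce_fn(key, values) for key, values in groups.items()}

# Hypothetical job: count words across input lines.
def count_words(line):
    return [(word.lower(), 1) for word in line.split()]

def sum_counts(key, values):
    return sum(values)

lines = ["solr and hadoop", "hadoop and search"]
word_counts = reduce_phase(shuffle(map_phase(lines, count_words)), sum_counts)
print(word_counts)
```

In a real cluster each map call runs on the node that holds the data, and the shuffle moves pairs across the network so that every key lands on a single reducer.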

Like shuffling multiple decks of playing cards


The Cost of Enterprise Storage

                SAN Storage         NAS Filers         Local Storage

Cost            $2 - $10/Gigabyte   $1 - $5/Gigabyte   $0.05/Gigabyte

$1M gets:
  Capacity      0.5 Petabytes       1 Petabyte         20 Petabytes
  IOPS          200,000             400,000            10,000,000
  Throughput    1 Gbyte/sec         2 Gbytes/sec       800 Gbytes/sec
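
The capacity figures follow from dividing the budget by the cost per gigabyte. A quick check, assuming the low end of each quoted price range and decimal units (1 Petabyte = 1,000,000 Gigabytes); the variable names are illustrative only.

```python
BUDGET_DOLLARS = 1_000_000
GB_PER_PB = 1_000_000  # assumption: decimal units

# Assumption: low end of each price range quoted on the slide.
price_per_gb = {"SAN Storage": 2.00, "NAS Filers": 1.00, "Local Storage": 0.05}

def petabytes_for_budget(budget, dollars_per_gb):
    """Capacity in petabytes that a budget buys at a given cost per gigabyte."""
    return budget / dollars_per_gb / GB_PER_PB

capacity_pb = {
    tier: petabytes_for_budget(BUDGET_DOLLARS, price)
    for tier, price in price_per_gb.items()
}

for tier, petabytes in capacity_pb.items():
    print(f"{tier}: {petabytes:.1f} PB")
```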


Deep Object Store
 Billions and Billions of Files
 For some use cases the limit is not the storage capacity but the number of objects
– Messages
– Attachments
– Images
– Recordings
 Provides a deep storage pool that is analytic ready
– Store it until you need it
– Derive secondary value from analytic processing
 It makes more sense to perform analytics where the data is stored and send only the results over the network


Problems with Integrating Solr with Hadoop

 Simple to integrate with Hadoop as a data source
 Difficult to integrate distributed search and scale
 SolrCloud simplifies Sharding and Replication coordination
 Integration limitations based on capabilities of large scale storage
– High availability
– Data protection
– Ease of Access


Sharded text Indexing

[Diagram: Input documents -> Map -> Reducer -> Clustered index storage -> Local disk -> Search Engine]
– Map: assign documents to shards
– Reducer: index text to local disk and then copy index to distributed file store
– Search Engine: copy to local disk typically required before index can be loaded


Problems with Solr and Hadoop

[Diagram: Input documents -> Map -> Reducer -> Clustered index storage -> Local disk -> Search Engine]
– Failure of a reducer causes garbage to accumulate in the local disk
– Failure of search engine requires another download of the index from clustered storage


Limitations of HDFS

 HDFS is Append Only
 Data Access is through the HDFS API
 High Availability is a challenge
 Single points of failure
 Limited to 50-200 million files
 Performance bottleneck

[Diagram: NAS appliance, NameNodes A and B, and a grid of DataNodes]


Logs: Flume aggregates incoming events to Solr – Requires Multi-Step, Batch Process

[Diagram: three Application Servers feeding a Hadoop Cluster]


What’s Required for SDA?

 Ease of Data Access through Open Standards
 Large Scale, Reliable Storage
 Ease of Integration
– Management (REST)
– Security (LDAP, NIS, Linux PAM…)
– Analytics (NFS, ODBC, HDFS)

[Diagram: a loop linking Search, Analytics, and Discovery]


Ease of Data Access

[Diagram: HDFS API alongside Enterprise NFS Access]


Multiple Architectures Possible

 Export to the world
– NFS gateway runs on selected gateway hosts
 Local server
– NFS gateway runs on local host
– Enables local compression and checksumming
 Export to self
– NFS gateway runs on all data nodes, mounted from localhost


Data Access through Standard Protocols

[Diagram: an NFS Client connected to multiple NFS Servers]


NFS Access through a Local server

[Diagram: an Application with a local NFS Server and Client, connected to the Cluster Nodes]


Universal export to self

[Diagram: a Cluster Node running a Task and its own NFS Server, alongside the other Cluster Nodes]


Nodes are identical

[Diagram: several Cluster Nodes, each running a Task and an NFS Server]


Simplifies Solr Hadoop Integration

[Diagram: Input documents -> Map -> Reducer -> Clustered index storage -> Search Engine]
– Failure of a reducer is cleaned up by map-reduce framework
– Search engine reads mirrored index directly


How Does this Integration Happen?

 Elegantly simple
 Direct integration is a result of leveraging the Solr and Hadoop architectures
 Data in the Hadoop cluster is written to a Volume
 Solr Crawler discovers content being entered into Hadoop
 Accesses the data in the cluster through NFS
 Builds Search Index
 Users query Solr to find data directly in Hadoop
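
Because the cluster is reachable through NFS, a crawler can read it with ordinary file operations. The sketch below is not the Solr crawler; it is a minimal stand-in that walks a directory tree and builds a toy term index. The function names, the file extensions, and the use of a temporary directory in place of a real NFS mount point are all assumptions for illustration.

```python
import os
import tempfile

def discover_documents(mount_point, extensions=(".txt", ".log")):
    """Yield (path, text) for each matching file under a mounted volume."""
    for dirpath, _dirnames, filenames in os.walk(mount_point):
        for name in sorted(filenames):
            if name.endswith(extensions):
                path = os.path.join(dirpath, name)
                with open(path, encoding="utf-8", errors="replace") as handle:
                    yield path, handle.read()

def build_index(documents):
    """Build a toy inverted index: term -> set of file paths containing it."""
    index = {}
    for path, text in documents:
        for term in set(text.lower().split()):
            index.setdefault(term, set()).add(path)
    return index

# Demo: a temporary directory stands in for the NFS-mounted volume.
with tempfile.TemporaryDirectory() as volume:
    with open(os.path.join(volume, "events.log"), "w", encoding="utf-8") as f:
        f.write("disk failure on node7")
    demo_index = build_index(discover_documents(volume))
    hits = sorted(os.path.basename(p) for p in demo_index["failure"])

print(hits)
```

The point of the sketch is that no HDFS client library appears anywhere: the crawler only needs a path.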


Distributed Shard Indexing

[Diagram: Input -> Map -> Combine -> Shuffle and sort -> Reduce -> Output]
– Input: doc1, doc2, doc3, …
– Map: (shard#1,doc1), (shard#2,doc2), (shard#1,doc3), (shard#3,doc4), (shard#3,doc5), …
– Shuffle and sort: shard#1,[doc3,doc1]; shard#2,[doc2]; shard#3,[doc5]; …
– Output: index/s1, index/s2, index/s3, …


How Does this Work at Scale with Distributed Indices?
 MapReduce jobs analyze distributed, disparate data in a cluster
 In distributed indexing, the input is split arbitrarily into chunks and each chunk is handled separately. There can be many more chunks than there are shards to be created.
 Mapper assigns document to shard
– Shard is usually hash of document id
 Reducer indexes all documents for a shard
– Indexes created on local disk
– On success, copy index to DFS
 Zookeeper is used to manage Solr instances
 A large Solr Search is distributed across multiple shards
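
The mapper and reducer roles described above can be sketched as follows. This is a single-process illustration, not Solr or Hadoop code: `shard_for`, `map_documents`, `reduce_shard`, and `search` are hypothetical names, CRC32 is just one possible stable hash of the document id, and the "index" is a plain dictionary rather than a Lucene index.

```python
import zlib
from collections import defaultdict

NUM_SHARDS = 3

def shard_for(doc_id, num_shards=NUM_SHARDS):
    """Assign a document to a shard using a stable hash of its id."""
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards

def map_documents(chunk):
    """Mapper: emit (shard, document) for each document in an input chunk."""
    for doc_id, text in chunk:
        yield shard_for(doc_id), (doc_id, text)

def reduce_shard(shard, docs):
    """Reducer: build a toy inverted index for all documents of one shard."""
    index = defaultdict(set)
    for doc_id, text in docs:
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

# Two input chunks; there can be many more chunks than shards.
chunks = [
    [("doc1", "solr search"), ("doc2", "hadoop cluster")],
    [("doc3", "hadoop search"), ("doc4", "nfs volume"), ("doc5", "solr hadoop")],
]

grouped = defaultdict(list)  # stands in for the shuffle
for chunk in chunks:
    for shard, doc in map_documents(chunk):
        grouped[shard].append(doc)

shard_indexes = {shard: reduce_shard(shard, docs) for shard, docs in grouped.items()}

def search(term):
    """Distributed search: query every shard and merge the hits."""
    hits = []
    for index in shard_indexes.values():
        hits.extend(index.get(term, []))
    return sorted(hits)

print(search("hadoop"))
```

Hashing the document id keeps shard sizes roughly even and lets any mapper decide the shard without coordination.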


What about HA and Data Protection?

 Cluster Capabilities can Extend to Integrated Search and Discovery

Reliable Compute
 Automated re-replication
 Self-healing from HW and SW failures
 Load balancing
 Rolling upgrades
 No lost jobs or data
 99.999% (five 9’s) uptime

Dependable Storage
 Business continuity with snapshots and mirrors
 Recover to a point in time
 End-to-end checksumming
 Strong consistency
 Mirror across sites to meet Recovery Time Objectives


MapReduce failure to write the Index

 Highly Available JobTracker and TaskTracker ensure that jobs hit by a failure are recovered with their state and run to completion
 MapReduce will clean up partially written indexes
 No administrator intervention required


Solr Node Fails

 Other Solr nodes start serving shards that were being served by failed node


Node Containing the Index Fails

 Data is already replicated across the cluster
 Zookeeper assigns Solr instance on the replicated node to the replicated shard
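
The reassignment logic can be pictured with a small sketch. It does not use the ZooKeeper or SolrCloud APIs; `reassign_shards`, the node names, and the least-loaded selection rule are assumptions made for illustration only.

```python
def reassign_shards(serving, replicas, failed_node):
    """Return a new shard -> node map with the failed node's shards moved
    to a surviving node that already holds a replica of the shard."""
    updated = dict(serving)
    for shard, node in serving.items():
        if node != failed_node:
            continue
        candidates = [n for n in replicas[shard] if n != failed_node]
        if not candidates:
            raise RuntimeError(f"no live replica for {shard}")

        # Assumption: prefer the candidate currently serving the fewest shards.
        def load(candidate):
            return sum(1 for owner in updated.values() if owner == candidate)

        updated[shard] = min(candidates, key=load)
    return updated

# Hypothetical layout: which nodes hold a copy of each shard's index.
replicas = {
    "shard1": ["nodeA", "nodeB"],
    "shard2": ["nodeB", "nodeC"],
    "shard3": ["nodeA", "nodeC"],
}
serving = {"shard1": "nodeA", "shard2": "nodeB", "shard3": "nodeC"}

after_failure = reassign_shards(serving, replicas, "nodeA")
print(after_failure)
```

Because the index data is already replicated, the takeover needs no copy step: the new Solr instance opens the replica that is local to it.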


Additional High Availability and Replication

 Snapshots are available
 Administrator sets frequency at the Volume
 Snapshots with automatic de-duplication
 Saves space by sharing blocks
 Redirect on write, fast with no performance or storage penalty
 Zero performance loss on writing to original
 Scheduled, or on-demand
 Easy recovery with drag and drop
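
Redirect on write with shared blocks can be illustrated with a toy model. This is not the cluster's actual implementation; the `Volume` class, its block table, and the file names are hypothetical and exist only to show why a snapshot costs no data copy and why later writes leave it intact.

```python
class Volume:
    """Toy volume: files point at blocks; snapshots copy pointers, not data."""

    def __init__(self):
        self.blocks = {}     # block id -> bytes (shared physical store)
        self.live = {}       # file name -> block id for the current state
        self.snapshots = {}  # snapshot label -> frozen {file name -> block id}
        self._next_id = 0

    def write(self, name, data):
        # Redirect on write: new data goes to a new block and the live
        # pointer moves; the old block stays in place for any snapshot.
        block_id = self._next_id
        self._next_id += 1
        self.blocks[block_id] = data
        self.live[name] = block_id

    def snapshot(self, label):
        # Only the pointer table is copied, so unchanged blocks are shared.
        self.snapshots[label] = dict(self.live)

    def read(self, name, snapshot=None):
        table = self.live if snapshot is None else self.snapshots[snapshot]
        return self.blocks[table[name]]

vol = Volume()
vol.write("index/s1", b"v1")
vol.write("index/s2", b"other")
vol.snapshot("nightly")
vol.write("index/s1", b"v2")  # snapshot still sees v1

print(vol.read("index/s1"), vol.read("index/s1", snapshot="nightly"))
```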


Mirroring Support in Hadoop Cluster
Business Continuity and Efficiency

Efficient design
 Differential deltas are updated
 Compressed and check-summed

Easy to manage
 Scheduled or on-demand
 WAN, Remote Seeding
 Consistent point-in-time

[Diagram: mirroring over WAN between Production in Datacenter 1 and Research in Datacenter 2, and between Production and EC2]


Simplified NFS data flows for Distributed Search

[Diagram: Input documents -> Map -> Reducer -> Mirrors -> Search Engines]
– Mirroring allows exact placement of index data
– Arbitrary levels of replication also possible


Improving Search Relevancy

 Requires a continuous Feedback Loop
– The quality of the search is influenced by the end-user selections
– Fully automated process that improves with use
– Does not require manual tags or classification

[Diagram: a loop linking Search, Analytics, and Discovery]


Recommendations

 Often referred to as collaborative filtering
 Actors interact with items
– observe successful interaction
 We want to suggest additional successful interactions
 Observations inherently very sparse


Examples

 Customers buying books (Linden et al)
 Web visitors rating music (Shardanand and Maes) or movies (Riedl, et al), (Netflix)
 Internet radio listeners not skipping songs (Musicmatch)
 Internet video watchers watching >30 s


Examples

 Query for Friends results in links to Seinfeld
 Search for kittens, get results for baby otters


Dyadic Structure

 Functional
– Interaction: actor -> item*
 Relational
– Interaction ⊆ Actors x Items
 Matrix
– Rows indexed by actor, columns by item
– Value is count of interactions
 Predict missing observations
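
The three views describe the same observations. A small sketch, with made-up actors and items, that builds each representation from one list of (actor, item) events:

```python
from collections import defaultdict

# Hypothetical observations; one event is repeated to show counting.
observations = [
    ("alice", "book1"),
    ("alice", "book2"),
    ("bob", "book2"),
    ("alice", "book2"),
]

# Functional view: each interaction maps an actor to a sequence of items.
functional = defaultdict(list)
for actor, item in observations:
    functional[actor].append(item)

# Relational view: the set of (actor, item) pairs that occurred.
relational = set(observations)

# Matrix view: rows indexed by actor, columns by item, value = count.
actors = sorted({actor for actor, _ in observations})
items = sorted({item for _, item in observations})
matrix = [[0] * len(items) for _ in actors]
for actor, item in observations:
    matrix[actors.index(actor)][items.index(item)] += 1

print(actors, items, matrix)
```

The zero in bob's row for book1 is a missing observation, which is what a recommender tries to predict.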


Fundamental Algorithmics

 Co-occurrence
 A is actors x items, K = A’A is items x items

 Product has general shape of matrix

 K tells us “users who interacted with x also interacted with y”
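
A worked example of co-occurrence, assuming K is computed as the product A’A (the form used for query recommendation later in the deck) and that A holds binary interactions. The matrix values and helper names are made up for illustration, and plain lists are used so nothing beyond the standard library is needed.

```python
def transpose(m):
    """Swap rows and columns of a list-of-lists matrix."""
    return [list(col) for col in zip(*m)]

def matmul(x, y):
    """Multiply two list-of-lists matrices."""
    y_cols = transpose(y)
    return [[sum(a * b for a, b in zip(row, col)) for col in y_cols] for row in x]

# Hypothetical A: 3 actors (rows) x 3 items (columns), 1 = interacted.
A = [
    [1, 1, 0],
    [0, 1, 1],
    [1, 1, 0],
]

# K[x][y] counts the actors who interacted with both item x and item y.
K = matmul(transpose(A), A)

for row in K:
    print(row)
```

Here K[0][1] is 2 because two actors interacted with both item 0 and item 1, while K[0][2] is 0 because no actor touched both item 0 and item 2.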


Why not Expand it?

 Users enter queries (A)
– (actor = user, item=query)
 Users view videos (B)
– (actor = user, item=video)
 A’A gives query recommendation
– “did you mean to ask for”
 B’B gives video recommendation
– “you might like these videos”


The punch-line

 B’A recommends videos in response to a query
– (isn’t that a search engine?)
– (not quite, it doesn’t look at content or meta-data)
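
The cross-occurrence product B’A can be computed directly from logs without building the matrices. In this sketch the logs, user ids, video names, and function names are all hypothetical, and interactions are treated as binary, so each entry counts the users who both issued the query and viewed the video.

```python
from collections import Counter, defaultdict

# Hypothetical logs of (user, query) and (user, video) events.
query_log = [("u1", "flamenco"), ("u2", "flamenco"), ("u3", "guitar")]
view_log = [
    ("u1", "video_paco"),
    ("u2", "video_paco"),
    ("u2", "video_dance"),
    ("u3", "video_vanhalen"),
]

def cross_occurrence(query_log, view_log):
    """Entries of B'A: (video, query) -> number of users who did both."""
    queries_by_user = defaultdict(set)
    for user, query in query_log:
        queries_by_user[user].add(query)
    videos_by_user = defaultdict(set)
    for user, video in view_log:
        videos_by_user[user].add(video)

    counts = Counter()
    for user, videos in videos_by_user.items():
        for video in videos:
            for query in queries_by_user.get(user, ()):
                counts[(video, query)] += 1
    return counts

def recommend_videos(query, counts, top_n=3):
    """Rank videos for a query by cross-occurrence count, ties by name."""
    scored = [(n, video) for (video, q), n in counts.items() if q == query]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [video for _, video in scored[:top_n]]

BtA = cross_occurrence(query_log, view_log)
flamenco_results = recommend_videos("flamenco", BtA)
print(flamenco_results)
```

Nothing in this ranking reads video titles or meta-data; the results come entirely from behavior.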


Real-life example

 Query: “Paco de Lucia”
 Conventional meta-data search results:
– “hombres del paco” times 400
– not much else
 Recommendation based search:
– Flamenco guitar and dancers
– Spanish and classical guitar
– Van Halen doing a classical/flamenco riff



The Search for Relevancy
 Updating Search to Reflect Relevancy
– Big MapReduce jobs can use behavioral traces in logs to improve results and identify importance

[Diagram: a loop linking Search, Analytics, and Discovery]

 The power of this virtuous loop depends on frictionless data access, high availability, and performance

