Got hundreds of millions of documents to search? DataImportHandler blowing up while indexing? Random thread errors thrown by Solr Cell during document extraction? Query performance collapsing? Then you're searching at Big Data scale. This talk will focus on the underlying principles of Big Data and how to apply them to Solr. This talk isn't a deep dive into SolrCloud, though we'll talk about it. It also isn't meant to be a talk on traditional scaling of Solr.
The United States Patent and Trademark Office wanted a simple, lightweight, yet modern and rich discovery interface for Chinese patent data. This is the story of the Global Patent Search Network, the next-generation multilingual search platform for the USPTO. GPSN, http://gpsn.uspto.gov, was the USPTO's first public application deployed in the cloud, and it allowed a very small development team to build a discovery interface across millions of patents.
This case study will cover:
• How we leveraged the Amazon Web Services platform for data ingestion, auto scaling, and deployment at a very low price compared to traditional data centers.
• Some of the innovative methods we used for converting XML-formatted data into usable information.
• Parsing through 5 TB of raw TIFF image data and converting it to a modern, web-friendly format.
• Challenges in building a modern Single Page Application that provides a dynamic, rich user experience.
• How we built “data sharing” features into the application to allow third party systems to build additional functionality on top of GPSN.
War stories from building the Global Patent Search Network, and why Data folks need to think more about UX and Discovery, and UX folks need to think more about Data.
At Basis Technologies' Open Source Search conference I talked about a project I did this past year, and the lessons, both good and bad, that we learned.
Chris Bradford & Matt Overstreet review several Cassandra use cases we've encountered in state and federal government. C* solves many big data problems around storing, enriching, and improving access to data.
The Hadoop Distributed File System is the foundational storage layer in typical Hadoop deployments. Performance and stability of HDFS are crucial to the correct functioning of applications at higher layers in the Hadoop stack. This session is a technical deep dive into recent enhancements committed to HDFS by the entire Apache contributor community. We describe real-world incidents that motivated these changes and how the enhancements prevent those problems from reoccurring. Attendees will leave this session with a deeper understanding of the implementation challenges in a distributed file system and identify helpful new metrics to monitor in their own clusters.
David Kale and Ruben Fizsel from Skymind talk about deep learning for the JVM and the enterprise using deeplearning4j (DL4J). Deep learning (nouveau neural nets) has sparked a renaissance in empirical machine learning, with breakthroughs in computer vision, speech recognition, and natural language processing. However, many popular deep learning frameworks are targeted at researchers and poorly suited to enterprise settings that use Java-centric big data ecosystems. DL4J bridges the gap, bringing high-performance numerical linear algebra libraries and state-of-the-art deep learning functionality to the JVM.
Solr is a great tool to have in the data scientist's toolbox. In this talk, I walk through several demos of applying Solr to data science activities and explore various use cases for Solr and data science.
Amazon Web Services offers a quick and easy way to build a scalable search platform, a flexibility that is especially useful when an initial data load is required but the hardware is no longer needed for day-to-day searching and adding new documents. This presentation will cover one such approach, capable of enlisting hundreds of worker nodes to ingest data, tracking their progress, and relinquishing them back to the cloud when the job is done. The data set discussed is the collection of published patent grants available through Google Patents. A single Solr instance can easily handle searching the roughly 1 million patents issued between 2005 and 2010, but up to 50 worker nodes were necessary to load that data in a reasonable amount of time. The same basic approach was also used to make three sizes of PNG thumbnails of the patent grant TIFF images; in that case 150 worker nodes were used to generate 1.6 TB of data over the course of three days. In this session, attendees will learn how to leverage EC2 as a scalable indexer, along with tricks for using XSLT on very large XML documents.
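On that last point, one way to run XSLT over very large XML documents without building a full DOM is to stream the input through a StAX reader. The sketch below is only illustrative, not from the talk: the stylesheet and file names are hypothetical, and whether memory stays flat also depends on what the stylesheet itself does.

import java.io.FileInputStream;
import java.io.FileOutputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamReader;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stax.StAXSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class StreamingXslt {
    public static void main(String[] args) throws Exception {
        // Hypothetical stylesheet that maps a patent grant file to Solr XML.
        Transformer t = TransformerFactory.newInstance()
                .newTransformer(new StreamSource("patent-to-solr.xsl"));
        // A StAX reader lets the transformer pull events instead of
        // loading a multi-gigabyte document into memory at once.
        XMLStreamReader reader = XMLInputFactory.newInstance()
                .createXMLStreamReader(new FileInputStream("ipg100105.xml"));
        t.transform(new StAXSource(reader),
                new StreamResult(new FileOutputStream("solr-docs.xml")));
    }
}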
LucidWorks SiLK is an open source stack that combines Lucene/Solr with best-in-class open source data ingestion and analytics tools such as Flume, LogStash, and Kibana. This webinar will explore the features of SiLK and provide attendees with valuable information on how they can benefit from the following:
- A powerful UI to analyze time series data stored in Lucene/Solr
- Creating and sharing visualizations, dashboards and reports
- Discovery and analysis of data coming from servers, applications, devices and more
- Exploration of click, geospatial and social data in ways previously unimaginable
Join Apache Solr committer and Lucidworks engineer Tim Potter for a webinar to learn how to unlock and understand your big data - and get the most out of your Hadoop investment.
Slides for a talk.
Talk abstract:
In the dark of the night, if you listen carefully enough, you can hear databases cry. But why? As developers, we rarely consider what happens under the hood of widely used abstractions such as databases. As a consequence, we rarely think about the performance of databases. This is especially true of less widespread, but often very useful, NoSQL databases.
In this talk we will take a close look at NoSQL database performance, peek under the hood of the most frequently used features to see how they affect performance and discuss performance issues and bottlenecks inherent to all databases.
A 1-hour intro to search, Apache Lucene and Solr, and LucidWorks Search. Contains a quick start with LucidWorks Search and a demo using financial data (see GitHub project: http://bit.ly/lws-financial), as well as some basic vocabulary and search explanations.
From a student to an Apache committer: the practice of Apache IoTDB (jixuan1989)
This talk was given by Xiangdong Huang, a PPMC member of the Apache IoTDB (incubating) project, at the Apache Event at Tsinghua University in China.
About the Event:
The open source ecosystem plays an increasingly important role in the world. Open source software is widely used in operating systems, cloud computing, big data, artificial intelligence, and the industrial Internet. Many companies have gradually increased their participation in the open source community, and developers with open source experience are increasingly valued and favored by large enterprises. The Apache Software Foundation is one of the most important open source communities, contributing a large number of valuable open source projects and communities to the world.
The invited guests of this lecture all come from the ASF community, including the chairman of the Apache Software Foundation, three Apache members, top-5 Apache code committers (according to the Apache annual report), the first committer on the Hadoop project in China, several Apache project mentors and VPs, and many Apache committers. They will tell you what the open source culture is, how to join the Apache open source community, and the Apache Way.
Spark, or how to process data at lightning speed (Alexis Seigneurin)
Spark is part of the new generation of data-processing frameworks built on Hadoop. The tool uses memory aggressively to deliver processing times up to 100 times faster than Hadoop. In this session, we will discover the principles of data processing (notably MapReduce) and the options available for setting up a cluster (ZooKeeper, Mesos...). We will review the different modules offered by the framework, in particular Spark Streaming for processing continuous streams of data.
Presentation given at Ippon on December 11, 2014.
Apache Storm vs. Spark Streaming - two stream processing platforms compared (Guido Schmutz)
Storm as well as Spark Streaming are open-source frameworks supporting distributed stream processing. Storm was developed by Twitter and is a free and open source distributed real-time computation system that can be used with any programming language. It is written primarily in Clojure and supports Java by default. Spark is a fast and general engine for large-scale data processing, designed to provide a more efficient alternative to Hadoop MapReduce. Spark Streaming brings Spark's language-integrated API to stream processing, letting you write streaming applications the same way you write batch jobs. It supports both Java and Scala. This presentation shows how you can implement stream processing solutions with the two frameworks, discusses how they compare, and highlights the differences and similarities.
Scaling SolrCloud to a Large Number of Collections - Fifth Elephant 2014 (Shalin Shekhar Mangar)
The traditional and typical search use case is one large search collection distributed among many nodes and shared by all users. However, there is a class of applications which need a large number of small or medium collections that can be used, managed, and scaled separately. This talk will cover our effort in helping a client set up a large-scale SolrCloud deployment with thousands of collections running on hundreds of nodes. I will describe the bottlenecks that we found in SolrCloud when running a large number of collections. I will also take you through the multiple features and optimizations that we contributed to Apache Solr to reduce or remove the choke points in the system. Finally, I will talk about the benchmarking process and the lessons learned from supporting such an installation in production.
Battle of the giants: Apache Solr vs ElasticSearch (Rafał Kuć)
Slides from my talk during ApacheCon EU 2012 - "Battle of the giants: Apache Solr vs ElasticSearch". Video available at http://player.vimeo.com/video/55645629
BigData Faceted Search Comparison between Apache Solr vs. ElasticSearch (NetConstructor, Inc.)
A search faceting presentation and comparison of Big Data textual content indexing/search analytics solutions. The presentation focuses on comparing the open source solutions provided by Apache Solr with those of ElasticSearch.
Got data? Let's make it searchable! This interactive presentation will demonstrate getting documents into Solr quickly, provide some tips on adjusting Solr's schema to better match your needs, and finally discuss how to showcase your data in a flexible search user interface. We'll see how to rapidly leverage faceting, highlighting, spell checking, and debugging. Even after all that, there will be enough time left to outline the next steps in developing your search application and taking it to production.
Solr search engine with multiple table relations (Jay Bharat)
Here you can learn how to use the Solr search engine and implement it in your application, for example with PHP/MySQL.
I introduce how to handle data from multiple tables in Solr.
Building Enterprise Search Engines using Open Source Technologies (Rahul Singh)
Enterprise Search is a challenging problem for most organizations. Public search technologies such as Google can index content and use link popularity to rank content in addition to the basic keyword matches. Enterprise Search is different. Sometimes it requires specially designed indexes as well as several processing steps.
At the U.S. Patent & Trademark Office, part of the Department of Commerce, a team of professionals is building the next generation of search tools using open source technologies. Like any large undertaking, it’s not a simple plug and play project.
Main topics to be covered in this talk:
+ Architectures for Large Scale Enterprise Search
+ Leveraging Apache Cassandra & Spark
+ Customizing / Configuring Apache Solr and Indexing
+ Writing a custom Parser for Solr in Scala
Attendees will learn how eBay Germany has implemented Solr, why Solr was selected, which Solr features are utilized, and how Solr is configured and used in production. Recommended best practices will be profiled along with eBay Kleinanzeigen's plans for future deployment of Solr.
Anyone who has tried integrating search into their application knows how good and powerful Solr is, but has always wished it were simpler to get started and simpler to take to production.
I will talk about the recent features added to Solr that make it easier for users, and some of the changes we plan to add soon to make the experience even better.
Video that accompanies this presentation at: http://www.youtube.com/watch?v=1t3Z2pJyulA
Join us for a guided tour of the Alfresco SOLR integration and new search sub-systems. We’ll discuss how it works, the limitations of eventual consistency, guidance for configuration and set-up. We’ll also cover the steps required to migrate, improved PATH performance, in-query ACL evaluation, cross-language support and monitoring as well as performance.
Similar to ApacheCon Europe 2012 - Big Search 4 Big Data (20)
Smarter search drives value to your business. Delivering search that matches users to the right content is what you care about. But organizations often get stuck getting there. It turns out that you need quite a number of very different ingredients to deliver tremendous search. It can make your head spin! To help you think through where your team is on its road to smarter search, Pugh introduces the maturity model used by OpenSource Connections and walks you through a very concrete method to inventory needed skills and translate that into search roles for your team. He shows how to measure your capabilities in key areas of search to drive better ROI from search.
The right path to making search relevant - Taxonomy Bootcamp London 2019 (OpenSource Connections)
Three aspects of search quality; focusing on relevance; why this is not just a technology problem; measuring search maturity & relevance; open source tools and techniques; Solr and Elasticsearch
Payloads have been a powerful aspect of Lucene for a long time, but have only had limited exposure in Solr. The Tika project has only recently finished integrating the powerful Tesseract OCR library, bringing the prospect of OCR to the masses.
Haystack 2019 - Search-based recommendations at Politico - Ryan Kohl (OpenSource Connections)
Over the past year, the POLITICO team has developed a recommendation system for our users, which recommends not only news content to read but also news topics to subscribe to. This talk will discuss our development path, including dead-ends and performance trade-offs. In the end, the team produced a system based on search technology (in our case, Elasticsearch) and refined by machine learning techniques to achieve a balance between personalization and serendipity.
With the advent of deep learning and algorithms like word2vec and doc2vec, vector-based representations are increasingly being used in search to represent anything from documents to images and products. However, search engines work with documents made of tokens, not vectors, and are typically not designed for fast vector matching out of the box. In this talk, I will give an overview of how vectors can be derived from documents to produce a semantic representation that can be used to implement semantic/conceptual search without hurting performance. I will then describe a few different techniques for efficiently searching vector-based representations in an inverted index, including LSH, vector quantization, and k-means trees, and compare their performance in terms of speed and relevancy. Finally, I will describe how each technique can be implemented efficiently in a Lucene-based search engine such as Solr or Elasticsearch.
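For readers new to the topic, here is a minimal sketch (not from the talk) of the brute-force baseline those techniques speed up: re-ranking a candidate set pulled from the inverted index by cosine similarity between stored document vectors and the query vector.

import java.util.*;

public class CosineRerank {
    // Cosine similarity between two dense vectors of equal length.
    static double cosine(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb) + 1e-12);
    }

    // Re-rank candidate doc ids (e.g. the top N from a token query)
    // by similarity of their vectors to the query vector.
    static List<String> rerank(double[] query, Map<String, double[]> candidates) {
        List<String> ids = new ArrayList<>(candidates.keySet());
        ids.sort(Comparator.comparingDouble(
                (String id) -> cosine(query, candidates.get(id))).reversed());
        return ids;
    }
}

Techniques like LSH, vector quantization, and k-means trees exist precisely to avoid scoring every candidate this way.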
Haystack 2019 - Natural Language Search with Knowledge Graphs - Trey Grainger (OpenSource Connections)
To optimally interpret most natural language queries, it is necessary to understand the phrases, entities, commands, and relationships represented or implied within the search. Knowledge graphs serve as useful instantiations of ontologies which can help represent this kind of knowledge within a domain.
In this talk, we'll walk through techniques to build knowledge graphs automatically from your own domain-specific content, how you can update and edit the nodes and relationships, and how you can seamlessly integrate them into your search solution for enhanced query interpretation and semantic search. We'll have some fun with some of the more search-centric use cases of knowledge graphs, such as entity extraction, query expansion, disambiguation, and pattern identification within our queries: for example, transforming the query "bbq near haystack" into
{ filter:["doc_type":"restaurant"], "query": { "boost": { "b": "recip(geodist(38.034780,-78.486790),1,1000,1000)", "query": "bbq OR barbeque OR barbecue" } } }
We'll also specifically cover use of the Semantic Knowledge Graph, a particularly interesting knowledge graph implementation available within Apache Solr that can be auto-generated from your own domain-specific content and which provides highly-nuanced, contextual interpretation of all of the terms, phrases and entities within your domain. We'll see a live demo with real world data demonstrating how you can build and apply your own knowledge graphs to power much more relevant query understanding within your search engine.
For e-commerce applications, matching users with the items they want is the name of the game. If they can't find what they want then how can they buy anything?! Typically this functionality is provided through search and browse experience. Search allows users to type in text and match against the text of the items in the inventory. Browse allows users to select filters and slice-and-dice the inventory down to the subset they are interested in. But with the shift toward mobile devices, no one wants to type anymore - thus browse is becoming dominant in the e-commerce experience.
But there's a problem! What if your inventory is not categorized? Perhaps your inventory is user generated or generated by external providers who don't tag and categorize the inventory. No categories and no tags means no browse experience and missed sales. You could hire an army of taxonomists and curators to tag items - but training and curation will be expensive. You can demand that your providers tag their items and adhere to your taxonomy - but providers will buck this new requirement unless they see obvious and immediate benefit. Worse, providers might use tags to game the system - artificially placing themselves in the wrong category to drive more sales. Worst of all, creating the right taxonomy is hard. You have to structure a taxonomy to realistically represent how your customers think about the inventory.
Eventbrite is investigating a tantalizing alternative: using a combination of customer interactions and machine learning to automatically tag and categorize our inventory. As customers interact with our platform - as they search for events and click on and purchase events that interest them - we implicitly gather information about how our users think about our inventory. Search text effectively acts like a tag and a click on an event card is a vote for that clicked event is representative of that tag. We are able to use this stream of information as training data for a machine learning classification model; and as we receive new inventory, we can automatically tag it with the text that customers will likely use when searching for it. This makes it possible to better understand our inventory, our supply and demand, and most importantly this allows us to build the browse experience that customers demand.
In this talk I will explain in depth the problem space and Eventbrite's approach in solving the problem. I will describe how we gathered training data from our search and click logs, and how we built and refined the model. I will present the output of the model and discuss both the positive results of our work as well as the work left to be done. Those attending this talk will leave with some new ideas to take back to their own business.
Haystack 2019 - Improving Search Relevance with Numeric Features in Elasticsearch (OpenSource Connections)
Recently Elasticsearch has introduced a number of ways to improve the search relevance of your documents based on numeric features. In this talk I will present the newly introduced field types "rank_feature", "rank_features", "dense_vector", and "sparse_vector", and discuss in what situations and how they can be used to boost the scores of your documents. I will also talk about the inner workings of queries based on these fields, and related performance considerations.
Haystack 2019 - Architectural considerations on search relevancy in the conte... (OpenSource Connections)
With an increasing amount of relevancy factors, relevancy fine-tuning becomes more complex as changing the impact of factors produces increasingly more unintended side effects. In recent years, there has been a lot of discussion about how learning algorithms can replace manual relevancy fine-tuning in order to manage this complexity. However, discussions about the challenge of relevancy should additionally consider architectural aspects. Especially microservice-based architectures provide many ways to encapsulate and to separate complexities of search solutions, which facilitates optimizing the search as well as locating and fixing problems.
Generally, relevancy factors can be assigned to three different groups, each handled at a different stage of search request processing. The first group contains contextual factors that depend on certain characteristics of a query, such as query-related boosts lifting top-sellers for particular queries, or category-related boosts to distinguish products from their accessories. Such contextual factors can be handled as a step in the preprocessing of queries. The respective boosting information can simply be appended to the query before it is actually sent to the search engine. Ideally, the normalization of the query is done beforehand.
The second group contains factors that are considered for all queries in more or less the same way, e.g. a ranking function based on keyword occurrences, product topicality, or total sales. Factors in this group can be handled directly by configuring the search engine.
The third group contains situational factors. For instance, a certain product might be a good match for a certain query in general, but due to situational circumstances it should not appear among the top five products (e.g. because it is out of stock). Such situational factors can be handled by re-sorting result sets after they are returned by the search engine.
The handling of the different factors within successive stages of search request processing will be discussed from an architectural perspective. Implications for applying learning algorithms and the implementation of a personalized search will be considered.
Does your search application include a custom query syntax with various search operators such as Booleans, proximity, term or phrase frequency, capitalization, quoted text or as-is operator, and other advanced operators? Although most search applications offer a natural language-oriented search box, some advanced applications may also offer a custom query syntax for advanced users or automated tasks. The Lucene "classic" query operators that are supported by the Solr edismax query parser (Boolean, phrase with slop, wildcard, etc.) cover a good amount of use cases, but they only get you so far. In this talk, we will explore various strategies to support a custom and advanced query syntax in Solr, covering a spectrum of options from leveraging the out-of-the-box Solr query DSL, to a custom Solr query parser, and hybrid solutions in between. We will identify the options' pros and cons, discuss relevancy considerations, and illustrate the options in Java.
Haystack 2019 - Establishing a relevance focused culture in a large organization (OpenSource Connections)
For a relevance engineer one of the most difficult tasks in the tuning process is to convince others in the organization that this is a joint effort. Even the brightest search guru doesn't get very far when working in isolation, so establishing cross-collaboration through the organization is essential. But how to get there?
On top of that, in a large organization a relevance engineer often works on multiple seemingly unrelated search projects. The challenge is not to get drowned in building custom solutions for each project, but to design generic and re-usable strategies which solve many problems at once.
In this session we'll discuss how to build a widely supported basis for search quality improvements in an organization. It is full of practical tips and examples which could help you in establishing a cross-functional culture that is optimal for relevance tuning. It also zooms in on a holistic approach to solving multiple equivalent search issues at once.
Haystack 2019 - Solving for Satisfaction: Introduction to Click Models - Eliz... (OpenSource Connections)
Relevance metrics like NDCG or ERR require graded judgements to evaluate query relevance performance. But what happens when we don't know what 'good' looks like ahead of time? This talk will look at using click modeling techniques to infer relevance judgements from user interaction logs.
2019 Haystack - How The New York Times Tackles Relevance - Jeremiah Via (OpenSource Connections)
The New York Times has had search for a long time but 2018 was the year in which the company engaged with relevance in a deep way. The aim of this talk is to share what we've learned as we've increased our search sophistication and some of the challenges we still face.
Some of the techniques we've adopted in this past year include offline metrics testing, reflective testing, and user engagement metrics. We now have a process in place to quickly get mapping changes out to production. As a team we also now have a vocabulary for talking about relevance, and we can use it to discuss trade-offs and goals in conjunction with our metrics.
We hope this talk is of use to those who've put off working on search relevance due to fear, uncertainty, or ambivalence. We will talk about how we went from working on everything but search relevance to finally pulling back the curtain on the search system. We hope what we've learned can help others get started.
3. Who am I?
• Principal of OpenSource Connections - Solr/Lucene Search Consultancy
• Member of Apache Software Foundation
• SOLR-284 UpdateRichDocuments (July 07)
• Fascinated by the art of software development
5. Telling some war stories
• Prototyping
• Application Development
• Maintaining Your Big Search Indexes
6. Not an intro to SolrCloud/ElasticSearch!
• Great round table discussion yesterday led by Mark Miller
• SolrCloud 4 Architecture talk in this room NEXT!
• Solr4 vs Elastic Search at 4:45 PM TODAY!
7. Background for Client X’s Project
• Big Data is any data set that is primarily at rest due to the difficulty of working with it.
• 100’s of millions of documents to search
• Aggressive timeline.
• All the data must be searched per query.
• Limited selection of tools available.
• On Solr 3.x line
8. Telling some stories
• Prototyping
• Application Development
• Maintaining Your Big Search Indexes
9. Boy meets Girl Story
[Architecture diagram, repeated with different stages highlighted on slides 9-13: Content Files and Metadata flow into an Ingest Pipeline, which feeds a bank of Solr instances.]
17. Make it easy to change sharding
public void run(Map options, List<SolrInputDocument> docs) throws
        InstantiationException, IllegalAccessException, ClassNotFoundException {
    // Look up the concrete sharding strategy by class name, so swapping
    // strategies is a configuration change, not a code change.
    IndexStrategy indexStrategy = (IndexStrategy) Class.forName(
            "com.o19s.solr.ModShardIndexStrategy").newInstance();
    indexStrategy.configure(options);
    for (SolrInputDocument doc : docs) {
        indexStrategy.addDocument(doc);
    }
}
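The deck shows only the calling code; a sketch of what the strategy abstraction behind it might look like (the interface and implementation below are assumptions, not shown in the slides):

import java.util.List;
import java.util.Map;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.common.SolrInputDocument;

// Hypothetical interface matching the calling code above.
public interface IndexStrategy {
    void configure(Map options);
    void addDocument(SolrInputDocument doc);
}

// One possible implementation: route each document to a shard by
// hashing its id modulo the number of Solr cores.
class ModShardIndexStrategy implements IndexStrategy {
    private List<SolrServer> shards; // one client per shard, built in configure()

    public void configure(Map options) {
        // e.g. read shard URLs out of options and build a SolrServer for each
    }

    public void addDocument(SolrInputDocument doc) {
        int shard = Math.abs(doc.getFieldValue("id").hashCode()) % shards.size();
        try {
            shards.get(shard).add(doc);
        } catch (Exception e) {
            throw new RuntimeException("Failed to index " + doc.getFieldValue("id"), e);
        }
    }
}

Keeping the shard arithmetic behind an interface makes trying a different sharding scheme (by month, week, or hour) a configuration change rather than a rewrite.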
18. Separate JVM from Solr Cores
• Step 1: Fire up empty Solr’s on all the servers (nohup &).
• Step 2: Verify they started cleanly.
• Step 3: Create Cores (curl http://search1.o19s.com:8983/solr/admin?action=create&name=run2)
• Step 4: Create an “aggregator” core, passing in urls of Cores. (&property.shards=)
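Steps 3 and 4 can also be scripted with SolrJ's core admin API instead of curl; a sketch, assuming the era-appropriate SolrJ 3.x client classes (the host and core names are the slide's examples):

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.request.CoreAdminRequest;

public class CreateCores {
    public static void main(String[] args) throws Exception {
        // Talk to the core admin handler of an already-running, empty Solr JVM.
        SolrServer admin = new CommonsHttpSolrServer("http://search1.o19s.com:8983/solr");
        // Create a core named "run2"; the instanceDir is assumed to exist on the server.
        CoreAdminRequest.createCore("run2", "run2", admin);
    }
}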
23. Don’t Move Files
• SCP across machines is slow/error prone
• NFS share, single point of failure.
• Clustered file system like GFS (Global File System) can have “fencing” issues
• HDFS shines here.
• ZooKeeper shines here.
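To illustrate the HDFS point, a minimal sketch of publishing a batch file into a shared namespace instead of scp-ing it between machines (the namenode URL and paths are illustrative assumptions):

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PublishToHdfs {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Illustrative namenode URL; every worker sees the same namespace,
        // so no per-machine file copying is needed.
        FileSystem fs = FileSystem.get(new URI("hdfs://namenode:8020"), conf);
        fs.copyFromLocalFile(new Path("/tmp/batch-0001.avro"),
                             new Path("/ingest/incoming/batch-0001.avro"));
    }
}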
28. Telling some stories
• Prototyping
• Application Development
• Maintaining Your Big Search Indexes
29. Using Solr as key/value store
[Architecture diagram: the ingest pipeline from earlier, with Content Files and Metadata flowing in, now also writes to a dedicated Solr key/value cache alongside the search Solrs.]
30. Using Solr as key/value store
• thousands of queries per second without real time get:
http://localhost:8983/solr/run2_enrichment/select?q=id:DOC45242&fl=entities,html
• how fast with real time get?
http://localhost:8983/solr/run2_enrichment/get?id=DOC45242&fl=entities,html
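The same two lookups from SolrJ, as a sketch: real time get requires Solr 4's update log to be enabled, and routing a SolrQuery to the /get handler is one way to reach it from SolrJ (the core name and document id are the slide's examples):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class KeyValueLookup {
    public static void main(String[] args) throws Exception {
        SolrServer solr = new CommonsHttpSolrServer(
                "http://localhost:8983/solr/run2_enrichment");

        // Standard lookup through the select handler.
        SolrQuery byQuery = new SolrQuery("id:DOC45242");
        byQuery.setFields("entities", "html");
        QueryResponse r1 = solr.query(byQuery);

        // Real time get: route the request to the /get handler instead.
        SolrQuery byGet = new SolrQuery();
        byGet.setRequestHandler("/get");
        byGet.set("id", "DOC45242");
        byGet.setFields("entities", "html");
        QueryResponse r2 = solr.query(byGet);
    }
}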
31. Push schema definition to the application
• Not “schema less”
• Just different owner of schema!
• Schema may have common set of fields like id, type, timestamp, version
• Nothing required.
q=intensity_i:[70 TO 0]&fq=TYPE:streetlamp_monitor
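With the application owning the schema, typing rides along in field-name suffixes via Solr's dynamic fields. A small sketch, assuming the stock example schema's dynamic-field rules (the values are made up; intensity_i and TYPE echo the slide's query):

import org.apache.solr.common.SolrInputDocument;

public class StreetlampDoc {
    // Build a document whose field types are carried in dynamic-field
    // suffixes, e.g. *_i for int and *_dt for date in the example schema.
    static SolrInputDocument reading() {
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "lamp-0042");                      // made-up id
        doc.addField("TYPE", "streetlamp_monitor");           // common "type" field
        doc.addField("intensity_i", 68);                      // *_i -> integer
        doc.addField("observed_dt", "2012-11-07T22:15:00Z");  // *_dt -> date
        return doc;
    }
}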
32. Don’t do expensive things in Solr
• Tika content extraction aka Solr Cell
• UpdateRequestProcessorChain
37. Beware JavaBin
[Diagram: the same ingest pipeline now spans a Solr 3.4 key/value cache and Solr 4 search cores, raising the question: which SolrJ version do I use?]
38. No JavaBin
• Avoid Jarmaggeddon
• Reflection? Ugh.
39. Avro!
• Supports serialization of data readable from multiple languages
• It’s smart XML, w/o the XML!
• Handles forward and reverse versions of an object
• Compact and fast to read.
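A minimal sketch of Avro's generic API serializing one pipeline record (the two-field schema is a made-up stand-in for the real document envelope):

import java.io.ByteArrayOutputStream;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.EncoderFactory;

public class AvroEnvelope {
    public static void main(String[] args) throws Exception {
        // Hypothetical envelope schema; a real pipeline would grow fields
        // over time, relying on Avro's schema-evolution rules.
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Doc\",\"fields\":["
          + "{\"name\":\"id\",\"type\":\"string\"},"
          + "{\"name\":\"html\",\"type\":\"string\"}]}");

        GenericRecord doc = new GenericData.Record(schema);
        doc.put("id", "DOC45242");
        doc.put("html", "<p>...</p>");

        // Compact binary encoding, readable from any language with the schema.
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        new GenericDatumWriter<GenericRecord>(schema).write(doc, encoder);
        encoder.flush();
    }
}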
41. Tika as a pipeline?
• Auto detects content type
• Metadata structure has all the key/value needed for Solr
• Allows us to scale up with the Behemoth project.
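As a sketch of that idea: run Tika's auto-detecting parser outside Solr and copy its key/value Metadata straight onto a SolrInputDocument (the meta_*_s field naming is an assumption):

import java.io.FileInputStream;
import java.io.InputStream;
import org.apache.solr.common.SolrInputDocument;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.sax.BodyContentHandler;

public class TikaExtract {
    static SolrInputDocument extract(String path) throws Exception {
        Metadata metadata = new Metadata();
        BodyContentHandler text = new BodyContentHandler(-1); // no write limit
        try (InputStream in = new FileInputStream(path)) {
            // Content type is auto-detected; no per-format code needed.
            new AutoDetectParser().parse(in, text, metadata);
        }
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("content", text.toString());
        // Tika's Metadata is already key/value: copy it across as-is.
        for (String name : metadata.names()) {
            doc.addField("meta_" + name + "_s", metadata.get(name));
        }
        return doc;
    }
}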
42. Telling some stories
• Prototyping
• Application Development
• Maintaining Your Big Search Indexes
43. Upgrade Lucene Indexes Easily
• Don’t reindex!
• Try out new versions of Lucene based search engines.
David Lyle
java -cp lucene-core.jar org.apache.lucene.index.IndexUpgrader [-delete-prior-commits] [-verbose] indexDir
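The same upgrade can also be driven programmatically; a sketch of the Lucene 4.x shape of the API (constructor signatures vary between Lucene versions, so treat this as an assumption):

import java.io.File;
import org.apache.lucene.index.IndexUpgrader;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class UpgradeIndex {
    public static void main(String[] args) throws Exception {
        Directory dir = FSDirectory.open(new File(args[0]));
        // Rewrites all segments into the current index format, in place.
        new IndexUpgrader(dir, Version.LUCENE_40, System.out, false).upgrade();
    }
}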
55. Building a Patents Index
[Bar chart, machine count vs. indexing time: 1 machine takes 5 days, 5 machines take 3 days, 300 machines take 30 minutes.]
What happens when we want to index 2 million patents in 30 minutes?
56. Amazon AWS is Good but...
• EC2 is costly
• Issues of access to internal data
• Firewall and security
57. Can we Cycle Scavenge?
• Data Center is heavily used 9 to 5 EST.
• Lesser, but significant load 8 to 10 PM EST
• Minimal CPU load at night.
• Amazon Spot Pricing for EC2
• SETI@home
• JavaGenes - Genetics processing
• Condor Platform (http://research.cs.wisc.edu/condor/)
58. Balancing Load
[Chart: Production Load vs. Batch Jobs across a 24-hour day (1 AM to 11 PM). Batch jobs are scheduled into the overnight trough when production load is minimal.]
59. Do I need Failover?
• Can I build quickly?
• Do I have a reliable cluster of servers?
• Am I spread across data centers?
• Is sooo 90’s....
60. Telling some stories
• Prototyping
• Application Development
• Maintaining Your Big Search Indexes
64. Thank you!
Questions?
Nervous about speaking up? Ask me about it later!
• epugh@o19s.com
• @dep4b
• www.opensourceconnections.com
Editor's Notes
Search was the original big data problem. Now search is back, but with a new, cooler name, “Big Data”, and search is the dominant metaphor for exposing big data sets to business users so they can make actual decisions. Big Data is rapidly changing fields such as healthcare, and I maintain that the next revolution in healthcare won't be via a doctor wielding a scalpel, but via a doctor wielding a mouse.
SOLR-284, back in July 07, was a first cut at a content extraction library before Tika came along.
And I love Agile development processes. I think of agile as business -> requirements -> development -> testing -> systems administration.
And I don't mean this as a shot against Hadoop, but with the right hardware you can get a lot done in bash, with a bit of Java or Perl sprinkled in. There is a lot of value in getting started today on building large scaled-out ingestors.
Notice our property style? It made it easy to read in properties in both Bash and Java!
Try sharding at different sizes using mod. Try sharding by month, or week, or hour, depending on your volume of data.
We had huge leftover “enterprise” boxes with ginormous amounts of RAM and CPU. We were IO bound.
The verbose:gc and +PrintGCDetails flags let you grep for the frequency of partial versus full garbage collections. We rolled back from 3.4 to 3.1 based on this data on one project.
Again, horse-racing two slaves can help. You can also pass in the connection information via the jconsole command line, which makes it easier to monitor a set of Solrs.
I love working with CSV and Solr. The CSV writer type is great for moving data between Solrs. (Don't forget to store everything!)
You have many fewer Solrs than you do indexer processors.
Jukka did a great presentation yesterday.
Dollar Tree makes crap; the stores are always empty or missing items. You don't want your indexing to be like that. The space shuttle cost 500 MILLION dollars every time it launched. You don't want your indexing process to be like launching the space shuttle.
Runs every hour. Looks at log files to determine if a Solr cluster is misbehaving.
HAL 9000 misbehaved. Runs every hour; looks at log files to determine if a Solr cluster is misbehaving. Especially important if you are on a cloud platform: they implement their servers on the cheapest commodity hardware.
Kaa the snake from The Jungle Book hypnotizing Mowgli. Danah Boyd, among others, has said that Big Data sometimes throws out thousands of years...