OSSCON: Big Search 4 Big Data

At Basis Technology's Open Source Search Conference I talked about a project I did this past year and about the lessons, both good and bad, that we learned.

    OSSCON: Big Search 4 Big Data Presentation Transcript

    • Big Search w/ Big Data Principles | Basis Technology Open Source Search 2012 | Eric Pugh | epugh@o19s.com | @dep4b | Tuesday, October 2, 2012
    • What is Big Search?
    • Who am I? • Principal of OpenSource Connections - Solr/Lucene Search Consultancy • Member of Apache Software Foundation • SOLR-284 UpdateRichDocuments (July 07) • Fascinated by the art of software development
    • 2nd edition! CO-AUTHOR
    • Telling some war stories • Prototyping • Application Development • Maintaining Your Big Search Indexes
    • Not an intro to SolrCloud! • Great tutorials given by Tomás Fernández Löbbe from LucidWorks yesterday!
    • Background for Client X’s Project • Big Data is any data set that is primarily at rest due to the difficulty of working with it. • 100’s of millions of documents to search • Limited selection of tools available. • Aggressive timeline. • All the data must be searched per query. • On Solr 3.x line
    • Telling some stories • Prototyping • Application Development • Maintaining Your Big Search Indexes
    • Boy meets Girl Story [Diagram labels: Metadata, Content Files, Ingest Pipeline, Solr (x4)]
    • Bash Rocks
    • Bash Rocks • Remote Solr stop/start scripts • Remote Indexer stop/start scripts • Performance Monitoring • Content Extraction scripts (+Java) • Ingestor Scripts (+Java) • Artifact Deployment (CM)
    • Make it easy to change approach
    • Make it easy to change sharding
      public void run(Map options, List<SolrInputDocument> docs)
          throws InstantiationException, IllegalAccessException, ClassNotFoundException {
        // Load the sharding strategy by class name so it can be swapped without touching this loop.
        IndexStrategy indexStrategy = (IndexStrategy) Class.forName(
            "com.o19s.solr.ModShardIndexStrategy").newInstance();
        indexStrategy.configure(options);
        for (SolrInputDocument doc : docs) {
          indexStrategy.addDocument(doc);
        }
      }
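The slide does not show what `ModShardIndexStrategy` does internally. Judging by the name alone, one plausible reading is "hash the document id, take it modulo the number of shards". The sketch below illustrates only that routing step under that assumption; `ShardRouter`, `ModShardRouter`, and `partition` are hypothetical names, not the talk's classes, and plain string ids stand in for `SolrInputDocument`.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for the routing part of an index strategy. */
interface ShardRouter {
    int shardFor(String docId);
}

/** Assumed behavior: route a document by hashing its id modulo the shard count. */
final class ModShardRouter implements ShardRouter {
    private final int shardCount;

    ModShardRouter(int shardCount) {
        if (shardCount <= 0) {
            throw new IllegalArgumentException("shardCount must be positive");
        }
        this.shardCount = shardCount;
    }

    @Override
    public int shardFor(String docId) {
        // floorMod keeps the result in [0, shardCount) even when hashCode() is negative.
        return Math.floorMod(docId.hashCode(), shardCount);
    }

    /** Groups document ids into one batch per shard, preserving input order within a batch. */
    List<List<String>> partition(List<String> docIds) {
        List<List<String>> batches = new ArrayList<>();
        for (int i = 0; i < shardCount; i++) {
            batches.add(new ArrayList<>());
        }
        for (String id : docIds) {
            batches.get(shardFor(id)).add(id);
        }
        return batches;
    }
}
```

Because the router sits behind an interface, trying a different sharding scheme means swapping one class, which is the point the slide makes with `Class.forName`.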
    • Separate JVM from Solr Cores • Step 1: Fire up empty Solrs on all the servers (nohup &). • Step 2: Verify they started cleanly. • Step 3: Create Cores (curl http://search1.o19s.com:8983/solr/admin?action=create&name=run2) • Step 4: Create an “aggregator” core, passing in URLs of Cores. (&property.shards=)
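Step 4 needs a comma-separated list of core URLs for the aggregator core. A minimal sketch of building that list follows, assuming cores are named `<run>_<n>` as in the `run2_1`, `run2_2` URLs that appear later in the deck. `ShardUrls`, `coreUrls`, and `shardsParam` are hypothetical helper names, and the host list and port are inputs rather than anything the talk prescribes.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper that assembles the shard list passed to an aggregator core. */
final class ShardUrls {

    /** Returns URLs such as http://host:8983/solr/run2_1 for cores 1..coresPerHost on each host. */
    static List<String> coreUrls(List<String> hosts, int port, String run, int coresPerHost) {
        List<String> urls = new ArrayList<>();
        for (String host : hosts) {
            for (int i = 1; i <= coresPerHost; i++) {
                urls.add("http://" + host + ":" + port + "/solr/" + run + "_" + i);
            }
        }
        return urls;
    }

    /** Joins the core URLs with commas, the list form used in the deck's shards setting. */
    static String shardsParam(List<String> coreUrls) {
        return String.join(",", coreUrls);
    }
}
```

Generating the list from the host inventory keeps the aggregator definition in step with the cores that were actually created, instead of relying on a hand-edited string.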
    • Go Wide Quickly
    • [Diagram: shards (labels range from shard1 to shard12) spread across search1.o19s.com, search2.o19s.com, and search3.o19s.com, with Solr instances on ports :8983, :8984, and :8985]
    • Simple Pipeline • Simple pipeline • mv is atomic
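The "mv is atomic" point can be illustrated in Java as well as in Bash: a stage writes its output in its own directory and then renames the finished file into the next stage's directory, so the next stage never sees a half-written file. This is a sketch, assuming both directories sit on the same file system; `AtomicHandoff` and `handOff` are hypothetical names.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Hypothetical pipeline step: publish a finished file to the next stage with an atomic rename. */
final class AtomicHandoff {

    /**
     * Moves finishedFile into nextStageDir under the same file name.
     * ATOMIC_MOVE makes the move all-or-nothing; it throws
     * AtomicMoveNotSupportedException if the file system cannot do that
     * (for example, when source and target are on different file systems).
     */
    static Path handOff(Path finishedFile, Path nextStageDir) throws IOException {
        Path target = nextStageDir.resolve(finishedFile.getFileName());
        return Files.move(finishedFile, target, StandardCopyOption.ATOMIC_MOVE);
    }
}
```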
    • Don’t Move Files • SCP across machines is slow/error prone • NFS share, single point of failure. • Clustered file system like GFS (Global File System) can have “fencing” issues • HDFS shines here. • ZooKeeper shines here.
    • Can you test your changes?
    • JVM tuning is a black art
      -verbose:gc -XX:+PrintGCDetails -server -Xmx8G -Xms8G -XX:MaxPermSize=256m -XX:PermSize=256m -XX:+AggressiveHeap -XX:+DisableExplicitGC -XX:ParallelGCThreads=16 -XX:+UseParallelOldGC
    • Run, don’t Walk
    • Telling some stories • Prototyping • Application Development • Maintaining Your Big Search Indexes
    • Using Solr as key/value store [Diagram labels: Solr Key/Value Cache, Metadata, Content Files, Ingest Pipeline, Solr (x4)]
    • Using Solr as key/value store • Thousands of queries per second without real time get. http://localhost:8983/solr/run2_enrichment/select?q=id:DOC45242&fl=entities,html • How fast with real time get? http://localhost:8983/solr/run2_enrichment/get?id=DOC45242&fl=entities,html
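The two lookups on this slide differ only in the handler and in how the key is passed. A small sketch of building both URLs follows; `KeyValueLookup` and its method names are hypothetical. It percent-encodes the parameter values, so `:` and `,` come out as `%3A` and `%2C`, which is equivalent to the raw form shown on the slide.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

/** Hypothetical URL builder for treating a Solr core as a key/value store. */
final class KeyValueLookup {

    /** Ordinary search by id through the /select handler. */
    static String selectUrl(String coreUrl, String id, String fields) {
        return coreUrl + "/select?q=" + encode("id:" + id) + "&fl=" + encode(fields);
    }

    /** Lookup by id through the /get (real time get) handler. */
    static String realTimeGetUrl(String coreUrl, String id, String fields) {
        return coreUrl + "/get?id=" + encode(id) + "&fl=" + encode(fields);
    }

    private static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 support is required of every Java runtime, so this should not happen.
            throw new IllegalStateException(e);
        }
    }
}
```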
    • Push schema definition to the application • Not “schema less” • Just different owner of schema! • Schema may have common set of fields like id, type, timestamp, version • Nothing required. q=intensity_i:[70 TO 0]&fq=TYPE:streetlamp_monitor
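One way an application can own the schema is to derive Solr field names from its own value types, as the `intensity_i` field in the query above suggests. The sketch below assumes the Solr schema declares dynamic fields with the conventional suffixes (`*_i`, `*_l`, `*_d`, `*_b`, `*_s`); `AppOwnedSchema`, the suffix mapping, and the lower-case common field names are illustrative assumptions, not the project's actual code.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

/** Hypothetical mapping from application records to dynamic Solr field names. */
final class AppOwnedSchema {
    // Common fields named on the slide; they are passed through without a suffix.
    private static final Set<String> COMMON =
            new HashSet<>(Arrays.asList("id", "type", "timestamp", "version"));

    /** Picks a field name from the value's Java type, e.g. ("intensity", 70) becomes "intensity_i". */
    static String fieldName(String name, Object value) {
        if (COMMON.contains(name)) return name;
        if (value instanceof Integer) return name + "_i";
        if (value instanceof Long) return name + "_l";
        if (value instanceof Double) return name + "_d";
        if (value instanceof Boolean) return name + "_b";
        return name + "_s"; // everything else is indexed as a string
    }

    /** Renames every key of the record according to fieldName, keeping insertion order. */
    static Map<String, Object> toSolrFields(Map<String, Object> record) {
        Map<String, Object> fields = new LinkedHashMap<>();
        for (Map.Entry<String, Object> entry : record.entrySet()) {
            fields.put(fieldName(entry.getKey(), entry.getValue()), entry.getValue());
        }
        return fields;
    }
}
```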
    • Don’t do expensive things in Solr • Tika content extraction aka Solr Cell • UpdateRequestProcessorChain
    • Beware JavaBin [Diagram labels: Solr Key/Value Cache (Solr 3.4), Metadata, Content Files, Ingest Pipeline, Solr (x4), Solr 4] Which SolrJ version do I use?
    • No JavaBin • Give me /update/avro! • Avoid Jarmageddon • Reflection? Ugh.
    • Avro! • Supports serialization of data readable from multiple languages • It’s smart XML, w/o the XML! • Handles forward and reverse versions of an object • Compact and fast to read.
    • Avro! [Diagram labels: Solr Key/Value Cache, .avro, Metadata, Content Files, Ingest Pipeline, Solr (x4)]
    • Telling some stories • Prototyping • Application Development • Maintaining Your Big Search Indexes
    • Upgrade Lucene Indexes Easily • Don’t reindex! • Try out new versions of Lucene based search engines. David Lyle
      java -cp lucene-core.jar org.apache.lucene.index.IndexUpgrader [-delete-prior-commits] [-verbose] indexDir
    • Indexing is Easy and Quick
    • CHEAP AND CHEERFUL
    • NRT versus BigData
    • The tension between scale and update rate [Chart labels: 10 million, Bad Place, 100’s of millions]
    • Grim Reaper
    • Delayed Replication
      <requestHandler name="/replication" class="solr.ReplicationHandler">
        <lst name="slave">
          <str name="masterUrl">http://localhost:8983/solr/replication</str>
          <str name="pollInterval">36:00:00</str>
        </lst>
      </requestHandler>
    • Enable/Disable • SOLR-3301
    • Enable/Disable
      <requestHandler name="/admin/ping" class="solr.PingRequestHandler">
        <lst name="invariants">
          <str name="q">MY HARD QUERY</str>
          <str name="shards">http://search1.o19s.com:8983/solr/run2_1,http://search1.o19s.com:8983/solr/run2_2,http://search1.o19s.com:8983/solr/run2_2</str>
        </lst>
        <lst name="defaults">
          <str name="echoParams">all</str>
        </lst>
        <str name="healthcheckFile">server-enabled.txt</str>
      </requestHandler>
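With `healthcheckFile` configured, the ping handler reports the server as healthy only while that file exists, so taking a node in or out of rotation comes down to creating or deleting one file. The sketch below shows that toggle from Java. `ServerToggle` is a hypothetical helper, and the directory in which Solr looks for `server-enabled.txt` is an assumption the caller has to supply.

```java
import java.io.IOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical helper that enables or disables a node by toggling its health check file. */
final class ServerToggle {

    /** Creates the health check file; does nothing if it is already there. */
    static void enable(Path healthcheckFile) throws IOException {
        try {
            Files.createFile(healthcheckFile);
        } catch (FileAlreadyExistsException alreadyEnabled) {
            // The node is already enabled, which is the state we want.
        }
    }

    /** Deletes the health check file; does nothing if it is already gone. */
    static void disable(Path healthcheckFile) throws IOException {
        Files.deleteIfExists(healthcheckFile);
    }

    /** Reports whether the health check file currently exists. */
    static boolean isEnabled(Path healthcheckFile) {
        return Files.exists(healthcheckFile);
    }
}
```

A deployment script could call `disable`, wait for the load balancer to notice the failing ping, swap the index, and then call `enable` again.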
    • Provisioning • Chef/Puppet • ZooKeeper • Have you versioned everything to build an index over again?
    • TRADITIONAL ENVIRONMENT
    • POOLED ENVIRONMENT (think Cloud!)
    • Do I need Failover? • Can I build quickly? • Do I have a reliable cluster of servers? • Am I spread across data centers? • Is sooo 90’s....
    • Telling some stories • Prototyping • Application Development • Maintaining Your Big Search Indexes
    • One more thought...
    • Measuring the impact of our algorithm changes is just getting harder with Big Data.
    • Project SolrPanl
    • Thank you! Questions? • epugh@o19s.com • @dep4b • www.opensourceconnections.com