OSSCON: Big Search 4 Big Data

At Basis Technology's Open Source Search Conference I talked about a project I did this past year and about the lessons, both good and bad, that we learned.

Transcript

  • 1. Big Search w/ Big Data Principles. Basis Technology Open Source Search 2012. Eric Pugh | epugh@o19s.com | @dep4b. Tuesday, October 2, 2012
  • 2. What is Big Search?
  • 3. Who am I? • Principal of OpenSource Connections - Solr/Lucene Search Consultancy • Member of Apache Software Foundation • SOLR-284 UpdateRichDocuments (July 07) • Fascinated by the art of software development
  • 4. 2nd edition! CO-AUTHOR
  • 5. Telling some war stories • Prototyping • Application Development • Maintaining Your Big Search Indexes
  • 6. Not an intro to SolrCloud! • Great tutorials given by Tomás Fernández Löbbe from LucidWorks yesterday!
  • 7. Background for Client X’s Project • Big Data is any data set that is primarily at rest due to the difficulty of working with it. • 100’s of millions of documents to search • Limited selection of tools available. • Aggressive timeline. • All the data must be searched per query. • On Solr 3.x line
  • 8. Telling some stories • Prototyping • Application Development • Maintaining Your Big Search Indexes
  • 9. Boy meets Girl Story [Diagram labels: Metadata, Content Files, Ingest Pipeline, Solr (x4)]
  • 10. Bash Rocks
  • 11. Bash Rocks • Remote Solr stop/start scripts • Remote Indexer stop/start scripts • Performance Monitoring • Content Extraction scripts (+Java) • Ingestor Scripts (+Java) • Artifact Deployment (CM)
  • 12. Make it easy to change approach
  • 13. Make it easy to change sharding
      // Needs: java.util.List, java.util.Map, org.apache.solr.common.SolrInputDocument
      public void run(Map options, List<SolrInputDocument> docs)
          throws InstantiationException, IllegalAccessException, ClassNotFoundException {
        // Load the strategy by class name so the sharding approach can be swapped easily.
        IndexStrategy indexStrategy = (IndexStrategy) Class.forName(
            "com.o19s.solr.ModShardIndexStrategy").newInstance();
        indexStrategy.configure(options);
        for (SolrInputDocument doc : docs) {
          indexStrategy.addDocument(doc);
        }
      }
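The slide loads `com.o19s.solr.ModShardIndexStrategy` by name but does not show its body. As a rough sketch of what a mod-based strategy might do, the hypothetical `ModShardRouter` below hashes a document id and takes it modulo the shard count. The class name, the use of `String.hashCode()`, and plain string ids are all assumptions made for illustration, not the talk's implementation.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only: the talk does not show ModShardIndexStrategy's body.
final class ModShardRouter {
    private final int numShards;

    ModShardRouter(int numShards) {
        if (numShards <= 0) {
            throw new IllegalArgumentException("numShards must be positive");
        }
        this.numShards = numShards;
    }

    // floorMod keeps the result in [0, numShards) even when hashCode() is negative.
    int shardFor(String docId) {
        return Math.floorMod(docId.hashCode(), numShards);
    }

    // Groups ids into one bucket per shard, preserving input order within each bucket.
    List<List<String>> partition(List<String> docIds) {
        List<List<String>> buckets = new ArrayList<>();
        for (int i = 0; i < numShards; i++) {
            buckets.add(new ArrayList<>());
        }
        for (String id : docIds) {
            buckets.get(shardFor(id)).add(id);
        }
        return buckets;
    }
}
```

Because the same id always maps to the same shard, re-running an ingest with an unchanged shard count sends each document back to the core that already holds it.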
  • 14. Separate JVM from Solr Cores • Step 1: Fire up empty Solrs on all the servers (nohup &). • Step 2: Verify they started cleanly. • Step 3: Create Cores (curl http://search1.o19s.com:8983/solr/admin?action=create&name=run2) • Step 4: Create an “aggregator” core, passing in URLs of Cores. (&property.shards=)
  • 15. Go Wide Quickly
  • 16. [Diagram labels: hosts search1.o19s.com, search2.o19s.com, search3.o19s.com; ports :8983, :8984, :8985; shards shard1, shard8, shard12]
  • 17. Simple Pipeline • Simple pipeline • mv is atomic
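One way to read "mv is atomic": a pipeline stage writes its output under a temporary name and renames it into the next stage's directory only once it is complete, so a consumer never picks up a half-written file. The sketch below shows that hand-off in Java. It assumes both directories live on the same filesystem (a rename is only atomic there); the class name, directory layout, and `.tmp` suffix are hypothetical.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical hand-off helper; not taken from the talk.
final class AtomicHandoff {
    // Writes content to a temp file in incomingDir, then renames it into readyDir.
    static Path publish(Path incomingDir, Path readyDir, String name, String content)
            throws IOException {
        Files.createDirectories(incomingDir);
        Files.createDirectories(readyDir);
        Path tmp = incomingDir.resolve(name + ".tmp");
        Files.write(tmp, content.getBytes(StandardCharsets.UTF_8));
        // ATOMIC_MOVE fails with AtomicMoveNotSupportedException instead of
        // silently falling back to copy-and-delete (e.g. across filesystems).
        return Files.move(tmp, readyDir.resolve(name), StandardCopyOption.ATOMIC_MOVE);
    }
}
```

The next stage then only ever lists `readyDir`, and every file it sees there is complete.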
  • 18. Don’t Move Files • SCP across machines is slow/error prone • NFS share, single point of failure. • Clustered file system like GFS (Global File System) can have “fencing” issues • HDFS shines here. • ZooKeeper shines here.
  • 19. Can you test your changes?
  • 20. JVM tuning is a black art
      -verbose:gc -XX:+PrintGCDetails -server -Xmx8G -Xms8G
      -XX:MaxPermSize=256m -XX:PermSize=256m -XX:+AggressiveHeap
      -XX:+DisableExplicitGC -XX:ParallelGCThreads=16 -XX:+UseParallelOldGC
  • 21. (no text on this slide)
  • 22. Run, don’t Walk
  • 23. Telling some stories • Prototyping • Application Development • Maintaining Your Big Search Indexes
  • 24. Using Solr as key/value store [Diagram labels: Solr Key/Value Cache, Metadata, Content Files, Ingest Pipeline, Solr (x4)]
  • 25. Using Solr as key/value store • thousands of queries per second without real time get. http://localhost:8983/solr/run2_enrichment/select?q=id:DOC45242&fl=entities,html • how fast with real time get? http://localhost:8983/solr/run2_enrichment/get?id=DOC45242&fl=entities,html
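The two lookups on slide 25 differ only in the handler and in how the id is passed: `/select` takes a query (`q=id:...`), while `/get` takes the id directly. A small sketch of building both URLs from Java follows. The class and method names are hypothetical, and the values are percent-encoded, so the output looks slightly different from the hand-typed URLs on the slide.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical URL builder for the two lookup styles shown on the slide.
final class EnrichmentLookupUrls {
    private final String coreUrl;

    EnrichmentLookupUrls(String coreUrl) {
        this.coreUrl = coreUrl;
    }

    // Regular search handler: the id is expressed as a query on the id field.
    String selectUrl(String docId, String fields) {
        return coreUrl + "/select?q=" + encode("id:" + docId) + "&fl=" + encode(fields);
    }

    // Real-time get handler: the id is passed directly as a parameter.
    String realTimeGetUrl(String docId, String fields) {
        return coreUrl + "/get?id=" + encode(docId) + "&fl=" + encode(fields);
    }

    private static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }
}
```

Ids containing query-syntax characters (spaces, colons, quotes) would also need Lucene query escaping in the `/select` form; the sketch leaves that out.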
  • 26. Push schema definition to the application • Not “schema less” • Just different owner of schema! • Schema may have common set of fields like id, type, timestamp, version • Nothing required. q=intensity_i:[70 TO 0]&fq=TYPE:streetlamp_monitor
  • 27. Don’t do expensive things in Solr • Tika content extraction aka Solr Cell • UpdateRequestProcessorChain
  • 28. Don’t do expensive things in Solr • Tika content extraction aka Solr Cell • UpdateRequestProcessorChain
  • 29. Beware JavaBin [Diagram labels: Solr Key/Value Cache, Metadata, Content Files, Ingest Pipeline, Solr (x4)]
  • 30. Beware JavaBin [Diagram labels: Solr Key/Value Cache (Solr 3.4), Metadata, Content Files, Ingest Pipeline, Solr (x4)]
  • 31. Beware JavaBin [Diagram labels: Solr Key/Value Cache (Solr 3.4), Solr 4, Metadata, Content Files, Ingest Pipeline, Solr (x4)]
  • 32. Beware JavaBin [Diagram labels: Solr Key/Value Cache (Solr 3.4), Solr 4, Metadata, Content Files, Ingest Pipeline, Solr (x4)] Which SolrJ version do I use?
  • 33. No JavaBin • Give me /update/avro! • Avoid Jarmaggeddon • Reflection? Ugh.
  • 34. Avro! • Supports serialization of data readable from multiple languages • It’s smart XML, w/o the XML! • Handles forward and reverse versions of an object • Compact and fast to read.
  • 35. Avro! [Diagram labels: Solr Key/Value Cache, .avro, Metadata, Content Files, Ingest Pipeline, Solr (x4)]
  • 36. Telling some stories • Prototyping • Application Development • Maintaining Your Big Search Indexes
  • 37. Upgrade Lucene Indexes Easily • Don’t reindex! • Try out new versions of Lucene based search engines. (David Lyle)
      java -cp lucene-core.jar org.apache.lucene.index.IndexUpgrader [-delete-prior-commits] [-verbose] indexDir
  • 38. Indexing is Easy and Quick
  • 39. CHEAP AND CHEERFUL
  • 40. NRT versus BigData
  • 41. The tension between scale and update rate [Chart labels: 10 million; Bad Place; 100’s of millions]
  • 42. Grim Reaper
  • 43. Delayed Replication
      <requestHandler name="/replication" class="solr.ReplicationHandler">
        <lst name="slave">
          <str name="masterUrl">http://localhost:8983/solr/replication</str>
          <str name="pollInterval">36:00:00</str>
        </lst>
      </requestHandler>
  • 44. Enable/Disable • SOLR-3301
  • 45. Enable/Disable
      <requestHandler name="/admin/ping" class="solr.PingRequestHandler">
        <lst name="invariants">
          <str name="q">MY HARD QUERY</str>
          <str name="shards">http://search1.o19s.com:8983/solr/run2_1,http://search1.o19s.com:8983/solr/run2_2,http://search1.o19s.com:8983/solr/run2_2</str>
        </lst>
        <lst name="defaults">
          <str name="echoParams">all</str>
        </lst>
        <str name="healthcheckFile">server-enabled.txt</str>
      </requestHandler>
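With `healthcheckFile` configured, the ping handler reports the node as unavailable while that file is missing, as I understand it. A node can therefore be taken out of a load balancer's rotation by deleting the file and put back by recreating it. The sketch below toggles such a file from Java. The class name is hypothetical, and the caller has to supply the real path (Solr resolves a relative name like `server-enabled.txt` against the core's data directory, to the best of my knowledge).

```java
import java.io.IOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper that flips a health-check marker file on and off.
final class HealthCheckToggle {
    private final Path healthcheckFile;

    HealthCheckToggle(Path healthcheckFile) {
        this.healthcheckFile = healthcheckFile;
    }

    // Creates the marker file; a file that already exists counts as enabled.
    void enable() throws IOException {
        try {
            Files.createFile(healthcheckFile);
        } catch (FileAlreadyExistsException alreadyEnabled) {
            // Nothing to do: the node is already marked as enabled.
        }
    }

    // Removes the marker file; a missing file counts as already disabled.
    void disable() throws IOException {
        Files.deleteIfExists(healthcheckFile);
    }

    boolean isEnabled() {
        return Files.exists(healthcheckFile);
    }
}
```

Both operations are idempotent, so a deployment script can call `disable()` before a long reindex and `enable()` afterwards without checking the current state first.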
  • 46. Provisioning • Chef/Puppet • ZooKeeper • Have you versioned everything to build an index over again?
  • 47. TRADITIONAL ENVIRONMENT
  • 48. POOLED ENVIRONMENT (think Cloud!)
  • 49. Do I need Failover? • Can I build quickly? • Do I have a reliable cluster of servers? • Am I spread across data centers? • Is sooo 90’s....
  • 50. Telling some stories • Prototyping • Application Development • Maintaining Your Big Search Indexes
  • 51. One more thought...
  • 52. Measuring the impact of our algorithm changes is just getting harder with Big Data.
  • 53. Project SolrPanl
  • 54. Thank you! Questions? • epugh@o19s.com • @dep4b • www.opensourceconnections.com