OSSCON: Big Search 4 Big Data

At Basis Technology's Open Source Search conference I talked about a project I did this past year and the lessons, both good and bad, that we learned.

Transcript

  • 1. Big Search w/ Big Data Principles. Basis Technology Open Source Search 2012. Eric Pugh | epugh@o19s.com | @dep4b. Tuesday, October 2, 2012
  • 2. What is Big Search?
  • 3. Who am I?
      • Principal of OpenSource Connections - Solr/Lucene Search Consultancy
      • Member of Apache Software Foundation
      • SOLR-284 UpdateRichDocuments (July 07)
      • Fascinated by the art of software development
  • 4. 2nd edition! CO-AUTHOR
  • 5. Telling some war stories
      • Prototyping
      • Application Development
      • Maintaining Your Big Search Indexes
  • 6. Not an intro to SolrCloud!
      • Great tutorials given by Tomás Fernández Löbbe from LucidWorks yesterday!
  • 7. Background for Client X’s Project
      • Big Data is any data set that is primarily at rest due to the difficulty of working with it.
      • 100’s of millions of documents to search
      • Limited selection of tools available.
      • Aggressive timeline.
      • All the data must be searched per query.
      • On Solr 3.x line
  • 8. Telling some stories
      • Prototyping
      • Application Development
      • Maintaining Your Big Search Indexes
  • 9. Boy meets Girl Story [diagram labels: Metadata, Content Files, Ingest Pipeline, Solr ×4]
  • 10. Bash Rocks
  • 11. Bash Rocks
      • Remote Solr stop/start scripts
      • Remote Indexer stop/start scripts
      • Performance Monitoring
      • Content Extraction scripts (+Java)
      • Ingestor Scripts (+Java)
      • Artifact Deployment (CM)
  • 12. Make it easy to change approach
  • 13. Make it easy to change sharding
        public void run(Map options, List<SolrInputDocument> docs)
            throws InstantiationException, IllegalAccessException, ClassNotFoundException {
          // Load the sharding strategy by class name so it can be swapped without touching this code.
          IndexStrategy indexStrategy = (IndexStrategy) Class.forName(
              "com.o19s.solr.ModShardIndexStrategy").newInstance();
          indexStrategy.configure(options);
          for (SolrInputDocument doc : docs) {
            indexStrategy.addDocument(doc);
          }
        }
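The slide names a `ModShardIndexStrategy` but does not show its body. The following is a minimal sketch of what a mod-based router could look like, under stated assumptions: `ModShardRouter`, `shardFor`, and the use of plain string ids in place of `SolrInputDocument` are all hypothetical, chosen so the example runs without SolrJ on the classpath. It is not the implementation used on the project.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: routes document ids to shards by hash modulo shard count. */
final class ModShardRouter {
    // One in-memory bucket per shard; a real strategy would hold one Solr client per shard.
    private final List<List<String>> shards = new ArrayList<>();

    ModShardRouter(int shardCount) {
        if (shardCount <= 0) {
            throw new IllegalArgumentException("shardCount must be positive");
        }
        for (int i = 0; i < shardCount; i++) {
            shards.add(new ArrayList<>());
        }
    }

    /** Math.floorMod keeps the result in [0, shardCount) even when hashCode() is negative. */
    int shardFor(String docId) {
        return Math.floorMod(docId.hashCode(), shards.size());
    }

    /** Places the id in the bucket chosen by shardFor. */
    void addDocument(String docId) {
        shards.get(shardFor(docId)).add(docId);
    }

    /** Number of ids routed to the given shard so far. */
    int size(int shard) {
        return shards.get(shard).size();
    }
}
```

Because the shard is a pure function of the id, the same document always lands on the same shard for a fixed shard count. Changing the shard count changes the mapping, which is one reason to keep the strategy swappable.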
  • 14. Separate JVM from Solr Cores
      • Step 1: Fire up empty Solrs on all the servers (nohup &).
      • Step 2: Verify they started cleanly.
      • Step 3: Create Cores (curl "http://search1.o19s.com:8983/solr/admin/cores?action=create&name=run2")
      • Step 4: Create an “aggregator” core, passing in URLs of Cores (&property.shards=).
  • 15. Go Wide Quickly
  • 16. [diagram labels: hosts search1.o19s.com, search2.o19s.com, search3.o19s.com; ports :8983, :8984, :8985; shards shard1 through shard8 and shard12]
  • 17. Simple Pipeline
      • Simple pipeline
      • mv is atomic
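The "mv is atomic" point can be sketched in Java as a write-then-rename handoff: a file is written under a staging directory and then moved into the directory the next pipeline stage watches, so a consumer never sees a half-written file. The class name `AtomicHandoff`, the `staging`/`inbox` layout, and the sample file name are assumptions for illustration. The atomic rename is only available when both directories are on the same file system.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Hypothetical sketch of a pipeline stage handing a file to the next stage. */
final class AtomicHandoff {

    /** Writes the content in staging, then atomically moves it into inbox. */
    static Path publish(Path staging, Path inbox, String name, String content) {
        try {
            Path partial = staging.resolve(name);
            Files.write(partial, content.getBytes(StandardCharsets.UTF_8));
            // ATOMIC_MOVE fails with an exception instead of falling back to copy+delete.
            return Files.move(partial, inbox.resolve(name), StandardCopyOption.ATOMIC_MOVE);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Usage example: publishes one file under a temp directory and reads it back. */
    static String demo() {
        try {
            Path root = Files.createTempDirectory("handoff");
            Path staging = Files.createDirectory(root.resolve("staging"));
            Path inbox = Files.createDirectory(root.resolve("inbox"));
            Path delivered = publish(staging, inbox, "DOC45242.json", "{\"id\":\"DOC45242\"}");
            return new String(Files.readAllBytes(delivered), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```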
  • 18. Don’t Move Files
      • SCP across machines is slow/error prone
      • NFS share, single point of failure.
      • Clustered file system like GFS (Global File System) can have “fencing” issues
      • HDFS shines here.
      • ZooKeeper shines here.
  • 19. Can you test your changes?
  • 20. JVM tuning is black art
        -verbose:gc -XX:+PrintGCDetails -server -Xmx8G -Xms8G
        -XX:MaxPermSize=256m -XX:PermSize=256m -XX:+AggressiveHeap
        -XX:+DisableExplicitGC -XX:ParallelGCThreads=16 -XX:+UseParallelOldGC
  • 21. (no text on slide)
  • 22. Run, don’t Walk
  • 23. Telling some stories
      • Prototyping
      • Application Development
      • Maintaining Your Big Search Indexes
  • 24. Using Solr as key/value store [diagram labels: Solr Key/Value Cache, Metadata, Content Files, Ingest Pipeline, Solr ×4]
  • 25. Using Solr as key/value store
      • thousands of queries per second without real time get.
        http://localhost:8983/solr/run2_enrichment/select?q=id:DOC45242&fl=entities,html
      • how fast with real time get?
        http://localhost:8983/solr/run2_enrichment/get?id=DOC45242&fl=entities,html
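The two lookup styles on slide 25 differ only in the handler path and parameter name. A small sketch of a helper that builds both URLs with proper percent-encoding follows. `SolrKeyValueUrls` and its method names are hypothetical, and the example only constructs strings; it does not contact a Solr server. It uses `URLEncoder.encode(String, Charset)`, which needs Java 10 or later.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper: builds key/value lookup URLs for a Solr core. */
final class SolrKeyValueUrls {

    /** Lookup through the normal search handler: /select?q=id:<id>&fl=<fields>. */
    static String selectUrl(String coreUrl, String id, String... fields) {
        return coreUrl + "/select?q=" + encode("id:" + id) + "&fl=" + encode(String.join(",", fields));
    }

    /** Lookup through the real-time get handler: /get?id=<id>&fl=<fields>. */
    static String realTimeGetUrl(String coreUrl, String id, String... fields) {
        return coreUrl + "/get?id=" + encode(id) + "&fl=" + encode(String.join(",", fields));
    }

    // Percent-encodes a query-string value; ':' becomes %3A and ',' becomes %2C.
    private static String encode(String value) {
        return URLEncoder.encode(value, StandardCharsets.UTF_8);
    }
}
```

The encoded forms are equivalent to the unencoded URLs on the slide once the server decodes the query string.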
  • 26. Push schema definition to the application
      • Not “schema less”
      • Just different owner of schema!
      • Schema may have common set of fields like id, type, timestamp, version
      • Nothing required.
        q=intensity_i:[70 TO 0]&fq=TYPE:streetlamp_monitor
  • 27. Don’t do expensive things in Solr
      • Tika content extraction aka Solr Cell
      • UpdateRequestProcessorChain
  • 28. Don’t do expensive things in Solr
      • Tika content extraction aka Solr Cell
      • UpdateRequestProcessorChain
  • 29. Beware JavaBin [diagram labels: Solr Key/Value Cache, Metadata, Content Files, Ingest Pipeline, Solr ×4]
  • 30. Beware JavaBin [same diagram, adding the label: Solr 3.4]
  • 31. Beware JavaBin [same diagram, adding the label: Solr 4]
  • 32. Beware JavaBin [same diagram, adding the question: Which SolrJ version do I use?]
  • 33. No JavaBin. Give me /update/avro!
      • Avoid Jarmaggeddon
      • Reflection? Ugh.
  • 34. Avro!
      • Supports serialization of data readable from multiple languages
      • It’s smart XML, w/o the XML!
      • Handles forward and reverse versions of an object
      • Compact and fast to read.
  • 35. Avro! [diagram labels: Solr Key/Value Cache, .avro, Metadata, Content Files, Ingest Pipeline, Solr ×4]
  • 36. Telling some stories
      • Prototyping
      • Application Development
      • Maintaining Your Big Search Indexes
  • 37. Upgrade Lucene Indexes Easily
      • Don’t reindex!
      • Try out new versions of Lucene based search engines.
      • David Lyle
        java -cp lucene-core.jar org.apache.lucene.index.IndexUpgrader [-delete-prior-commits] [-verbose] indexDir
  • 38. Indexing is Easy and Quick
  • 39. CHEAP AND CHEERFUL
  • 40. NRT versus BigData
  • 41. The tension between scale and update rate [chart labels: 10 million, Bad Place, 100’s of millions]
  • 42. Grim Reaper
  • 43. Delayed Replication
        <requestHandler name="/replication" class="solr.ReplicationHandler">
          <lst name="slave">
            <str name="masterUrl">http://localhost:8983/solr/replication</str>
            <str name="pollInterval">36:00:00</str>
          </lst>
        </requestHandler>
  • 44. Enable/Disable
      • SOLR-3301
  • 45. Enable/Disable
        <requestHandler name="/admin/ping" class="solr.PingRequestHandler">
          <lst name="invariants">
            <str name="q">MY HARD QUERY</str>
            <str name="shards">http://search1.o19s.com:8983/solr/run2_1,http://search1.o19s.com:8983/solr/run2_2,http://search1.o19s.com:8983/solr/run2_2</str>
          </lst>
          <lst name="defaults">
            <str name="echoParams">all</str>
          </lst>
          <str name="healthcheckFile">server-enabled.txt</str>
        </requestHandler>
  • 46. Provisioning
      • Chef/Puppet
      • ZooKeeper
      • Have you versioned everything to build an index over again?
  • 47. TRADITIONAL ENVIRONMENT
  • 48. POOLED ENVIRONMENT (think Cloud!)
  • 49. Do I need Failover?
      • Can I build quickly?
      • Do I have a reliable cluster of servers?
      • Am I spread across data centers?
      • Is sooo 90’s....
  • 50. Telling some stories
      • Prototyping
      • Application Development
      • Maintaining Your Big Search Indexes
  • 51. One more thought...
  • 52. Measuring the impact of our algorithm changes is just getting harder with Big Data.
  • 53. Project SolrPanl
  • 54. Thank you! Questions?
      • epugh@o19s.com
      • @dep4b
      • www.opensourceconnections.com