OSSCON: Big Search 4 Big Data

At Basis Technology's Open Source Search Conference, I talked about a project I did this past year and the lessons, both good and bad, that we learned.

  1. Big Search w/ Big Data Principles • Basis Technology Open Source Search 2012 • Eric Pugh | epugh@o19s.com | @dep4b • Tuesday, October 2, 2012
  2. What is Big Search?
  3. Who am I? • Principal of OpenSource Connections - Solr/Lucene Search Consultancy • Member of Apache Software Foundation • SOLR-284 UpdateRichDocuments (July 07) • Fascinated by the art of software development
  4. 2nd edition! CO-AUTHOR
  5. Telling some war stories • Prototyping • Application Development • Maintaining Your Big Search Indexes
  6. Not an intro to SolrCloud! • Great tutorials given by Tomás Fernández Löbbe from LucidWorks yesterday!
  7. Background for Client X’s Project • Big Data is any data set that is primarily at rest due to the difficulty of working with it. • 100’s of millions of documents to search • Limited selection of tools available. • Aggressive timeline. • All the data must be searched per query. • On Solr 3.x line
  8. Telling some stories • Prototyping • Application Development • Maintaining Your Big Search Indexes
  9. Boy meets Girl Story [Diagram labels: Metadata; Content Files; Ingest Pipeline; Solr (×4)]
  10. Bash Rocks
  11. Bash Rocks • Remote Solr stop/start scripts • Remote Indexer stop/start scripts • Performance Monitoring • Content Extraction scripts (+Java) • Ingestor Scripts (+Java) • Artifact Deployment (CM)
  12. Make it easy to change approach
  13. Make it easy to change sharding
      public void run(Map options, List<SolrInputDocument> docs)
          throws InstantiationException, IllegalAccessException, ClassNotFoundException {
        // Instantiate the sharding strategy reflectively from its class name.
        IndexStrategy indexStrategy = (IndexStrategy) Class.forName(
            "com.o19s.solr.ModShardIndexStrategy").newInstance();
        indexStrategy.configure(options);
        // Hand every document to the strategy, which decides where it is indexed.
        for (SolrInputDocument doc : docs) {
          indexStrategy.addDocument(doc);
        }
      }
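The slide names a ModShardIndexStrategy but does not show how it picks a shard. A minimal sketch of one plausible reading, assuming "mod sharding" means hashing the document id and taking it modulo the shard count; the class ModShardRouter and its methods are hypothetical and are not the talk's implementation.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical modulo router; not the ModShardIndexStrategy from the slide. */
final class ModShardRouter {
    private final int shardCount;

    ModShardRouter(int shardCount) {
        if (shardCount <= 0) {
            throw new IllegalArgumentException("shardCount must be positive");
        }
        this.shardCount = shardCount;
    }

    /** Maps a document id to a shard index in the range [0, shardCount). */
    int shardFor(String docId) {
        // floorMod keeps the result non-negative even when hashCode() is negative.
        return Math.floorMod(docId.hashCode(), shardCount);
    }

    /** Groups document ids into one bucket per shard, preserving input order. */
    List<List<String>> partition(List<String> docIds) {
        List<List<String>> buckets = new ArrayList<>();
        for (int i = 0; i < shardCount; i++) {
            buckets.add(new ArrayList<>());
        }
        for (String id : docIds) {
            buckets.get(shardFor(id)).add(id);
        }
        return buckets;
    }
}
```

The same id always lands on the same shard for a given shard count, but changing the shard count remaps most ids, which is one reason to keep the strategy swappable.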
  14. Separate JVM from Solr Cores • Step 1: Fire up empty Solr instances on all the servers (nohup &). • Step 2: Verify they started cleanly. • Step 3: Create Cores (curl http://search1.o19s.com:8983/solr/admin?action=create&name=run2) • Step 4: Create an “aggregator” core, passing in URLs of Cores. (&property.shards=)
  15. Go Wide Quickly
  16. [Diagram labels: hosts search1.o19s.com, search2.o19s.com, search3.o19s.com; ports :8983, :8984, :8985; shards shard1, shard8, shard12]
  17. Simple Pipeline • Simple pipeline • mv is atomic
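The "mv is atomic" point can be sketched in Java: write a file completely in a staging directory, then rename it into the directory the next pipeline stage watches, so a consumer never sees a half-written file. This is a sketch under assumptions: the class AtomicHandoff and the staging/ready directory layout are hypothetical, and the rename is only atomic when both directories are on the same file system.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Hypothetical hand-off helper for a directory-based pipeline stage. */
final class AtomicHandoff {
    private AtomicHandoff() {}

    /**
     * Writes the content in stagingDir, then moves the finished file into readyDir.
     * ATOMIC_MOVE makes the JVM fail with AtomicMoveNotSupportedException
     * instead of silently falling back to copy-and-delete.
     */
    static Path publish(Path stagingDir, Path readyDir, String fileName, String content)
            throws IOException {
        Path staged = stagingDir.resolve(fileName);
        Files.write(staged, content.getBytes(StandardCharsets.UTF_8));
        Path target = readyDir.resolve(fileName);
        return Files.move(staged, target, StandardCopyOption.ATOMIC_MOVE);
    }
}
```

A consumer that only lists readyDir then sees either no file or the whole file.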
  18. Don’t Move Files • SCP across machines is slow/error prone • NFS share, single point of failure. • Clustered file system like GFS (Global File System) can have “fencing” issues • HDFS shines here. • ZooKeeper shines here.
  19. Can you test your changes?
  20. JVM tuning is black art -verbose:gc -XX:+PrintGCDetails -server -Xmx8G -Xms8G -XX:MaxPermSize=256m -XX:PermSize=256m -XX:+AggressiveHeap -XX:+DisableExplicitGC -XX:ParallelGCThreads=16 -XX:+UseParallelOldGC
  21. [No text on slide]
  22. Run, don’t Walk
  23. Telling some stories • Prototyping • Application Development • Maintaining Your Big Search Indexes
  24. Using Solr as key/value store [Diagram labels: Solr Key/Value Cache; Metadata; Content Files; Ingest Pipeline; Solr (×4)]
  25. Using Solr as key/value store • thousands of queries per second without real time get. http://localhost:8983/solr/run2_enrichment/select?q=id:DOC45242&fl=entities,html • how fast with real time get? http://localhost:8983/solr/run2_enrichment/get?id=DOC45242&fl=entities,html
  26. Push schema definition to the application • Not “schema less” • Just different owner of schema! • Schema may have common set of fields like id, type, timestamp, version • Nothing required. q=intensity_i:[70 TO 0]&fq=TYPE:streetlamp_monitor
  27. Don’t do expensive things in Solr • Tika content extraction aka Solr Cell • UpdateRequestProcessorChain
  28. Don’t do expensive things in Solr • Tika content extraction aka Solr Cell • UpdateRequestProcessorChain
  29. Beware JavaBin [Diagram labels: Solr Key/Value Cache; Metadata; Content Files; Ingest Pipeline; Solr (×4)]
  30. Beware JavaBin [Diagram labels: Solr Key/Value Cache; Solr 3.4; Metadata; Content Files; Ingest Pipeline; Solr (×4)]
  31. Beware JavaBin [Diagram labels: Solr Key/Value Cache; Solr 3.4; Solr 4; Metadata; Content Files; Ingest Pipeline; Solr (×4)]
  32. Beware JavaBin [Diagram labels: Solr Key/Value Cache; Solr 3.4; Solr 4; Metadata; Content Files; Ingest Pipeline; Solr (×4)] Which SolrJ version do I use?
  33. No JavaBin • Give me /update/avro! • Avoid Jarmaggeddon • Reflection? Ugh.
  34. Avro! • Supports serialization of data readable from multiple languages • It’s smart XML, w/o the XML! • Handles forward and reverse versions of an object • Compact and fast to read.
  35. Avro! [Diagram labels: Solr Key/Value Cache; .avro; Metadata; Content Files; Ingest Pipeline; Solr (×4)]
  36. Telling some stories • Prototyping • Application Development • Maintaining Your Big Search Indexes
  37. Upgrade Lucene Indexes Easily • Don’t reindex! • Try out new versions of Lucene-based search engines. David Lyle
      java -cp lucene-core.jar org.apache.lucene.index.IndexUpgrader [-delete-prior-commits] [-verbose] indexDir
  38. Indexing is Easy and Quick
  39. CHEAP AND CHEERFUL
  40. NRT versus BigData
  41. The tension between scale and update rate [Chart labels: 10 million; Bad Place; 100’s of millions]
  42. Grim Reaper
  43. Delayed Replication
      <requestHandler name="/replication" class="solr.ReplicationHandler">
        <lst name="slave">
          <str name="masterUrl">http://localhost:8983/solr/replication</str>
          <str name="pollInterval">36:00:00</str>
        </lst>
      </requestHandler>
  44. Enable/Disable • SOLR-3301
  45. Enable/Disable
      <requestHandler name="/admin/ping" class="solr.PingRequestHandler">
        <lst name="invariants">
          <str name="q">MY HARD QUERY</str>
          <str name="shards">http://search1.o19s.com:8983/solr/run2_1,http://search1.o19s.com:8983/solr/run2_2,http://search1.o19s.com:8983/solr/run2_2</str>
        </lst>
        <lst name="defaults">
          <str name="echoParams">all</str>
        </lst>
        <str name="healthcheckFile">server-enabled.txt</str>
      </requestHandler>
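The healthcheckFile setting on slide 45 lends itself to scripting: a node can be taken out of rotation by removing the file and put back by recreating it. A minimal sketch, assuming the ping handler reports the node as healthy only while that file exists; the class HealthcheckToggle is hypothetical, and where Solr resolves server-enabled.txt depends on the Solr version, so the path is left to the caller.

```java
import java.io.IOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical helper that enables or disables a node via its healthcheck file. */
final class HealthcheckToggle {
    private final Path healthcheckFile;

    HealthcheckToggle(Path healthcheckFile) {
        this.healthcheckFile = healthcheckFile;
    }

    /** Creates the marker file; calling this on an enabled node is a no-op. */
    void enable() throws IOException {
        try {
            Files.createFile(healthcheckFile);
        } catch (FileAlreadyExistsException alreadyEnabled) {
            // The node is already enabled, nothing to do.
        }
    }

    /** Removes the marker file; calling this on a disabled node is a no-op. */
    void disable() throws IOException {
        Files.deleteIfExists(healthcheckFile);
    }

    /** Reports whether the marker file is currently present. */
    boolean isEnabled() {
        return Files.exists(healthcheckFile);
    }
}
```

A deployment script could call disable() before a slow replication or restart and enable() once the hard query in the ping handler succeeds again.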
  46. Provisioning • Chef/Puppet • ZooKeeper • Have you versioned everything to build an index over again?
  47. TRADITIONAL ENVIRONMENT
  48. POOLED ENVIRONMENT (think Cloud!)
  49. Do I need Failover? • Can I build quickly? • Do I have a reliable cluster of servers? • Am I spread across data centers? • Is sooo 90’s....
  50. Telling some stories • Prototyping • Application Development • Maintaining Your Big Search Indexes
  51. One more thought...
  52. Measuring the impact of our algorithm changes is just getting harder with Big Data.
  53. Project SolrPanl
  54. Thank you! Questions? • epugh@o19s.com • @dep4b • www.opensourceconnections.com