Big Search 4 Big Data War Stories

Some lessons that we learned in rolling out a search engine across a very big set of data.

Published in: Technology

  1. Big Search 4 Big Data. Enterprise Search Summit Europe 2013, London. Eric Pugh | epugh@o19s.com | @dep4b
  2. Who am I?
     • Principal of OpenSource Connections - Solr/Lucene Search Consultancy
     • Member of Apache Software Foundation
     • SOLR-284 UpdateRichDocuments (July 07)
     • Fascinated by the art of software development
  3. CO-AUTHOR. Working on 4.0!
  4. Telling some war stories
     • Prototyping
     • Application Development
     • Maintaining Your Big Search Indexes
  5. What is Big Search?
  6. Background for ClientX’s Project
     • Big Data is any data set that is primarily at rest due to the difficulty of working with it.
     • 100’s of millions of documents to search
     • Aggressive timeline.
     • All the data must be searched per query.
     • Limited selection of tools available.
     • On Solr 3.x line
  7. Telling some stories
     • Prototyping
     • Application Development
     • Maintaining Your Big Search Indexes
  8. Boy meets Girl Story. [Diagram labels: Metadata, Content Files, Ingest Pipeline, Solr (x4)]
  9. Bash Rocks
  10. Bash Rocks
      • Remote Solr stop/start scripts
      • Remote Indexer stop/start scripts
      • Performance Monitoring
      • Content Extraction scripts (+Java)
      • Ingestor Scripts (+Java)
      • Artifact Deployment (CM)
  11. Lesson: Don’t get captured by your environment
  12. Make it easy to change approach
  13. Make it easy to change sharding
        IndexStrategy indexStrategy = (IndexStrategy) Class.forName(
            "com.o19s.solr.ModShardIndexStrategy").newInstance();
        indexStrategy.configure(options);
        for (SolrInputDocument doc : docs) {
          indexStrategy.addDocument(doc);
        }
      Lesson: Sharpen your axe
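The slide shows only the calling side. As a rough sketch of what could sit behind it, the following assumes that a "mod shard" strategy routes each document to `hash(id) mod shardCount`. That behaviour is inferred from the class name, not confirmed by the deck. The `IndexStrategy` interface here, the `shardCount` option, and the use of a plain `Map` in place of `SolrInputDocument` are all stand-ins so the example runs without SolrJ.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the IndexStrategy type named on the slide.
// A plain Map replaces SolrInputDocument so no SolrJ dependency is needed.
interface IndexStrategy {
    void configure(Map<String, String> options);
    void addDocument(Map<String, Object> doc);
}

// Assumed behaviour: route each document to shard hash(id) mod shardCount.
class ModShardIndexStrategy implements IndexStrategy {
    // In-memory lists stand in for the per-shard Solr servers.
    private final List<List<Map<String, Object>>> shards = new ArrayList<>();

    @Override
    public void configure(Map<String, String> options) {
        // "shardCount" is an illustrative option name, not one from the deck.
        int shardCount = Integer.parseInt(options.getOrDefault("shardCount", "4"));
        shards.clear();
        for (int i = 0; i < shardCount; i++) {
            shards.add(new ArrayList<>());
        }
    }

    static int shardFor(String id, int shardCount) {
        // floorMod keeps the result non-negative even when hashCode() is negative.
        return Math.floorMod(id.hashCode(), shardCount);
    }

    @Override
    public void addDocument(Map<String, Object> doc) {
        // configure() must have been called first, otherwise there are no shards.
        String id = String.valueOf(doc.get("id"));
        shards.get(shardFor(id, shards.size())).add(doc);
    }

    int shardSize(int shard) {
        return shards.get(shard).size();
    }
}

class IndexStrategyDemo {
    public static void main(String[] args) throws Exception {
        // Reflective loading as on the slide; the bare class name assumes the default package.
        IndexStrategy strategy = (IndexStrategy) Class.forName("ModShardIndexStrategy")
                .getDeclaredConstructor().newInstance();
        strategy.configure(Map.of("shardCount", "3"));
        for (int i = 0; i < 9; i++) {
            strategy.addDocument(Map.of("id", "DOC" + i));
        }
        ModShardIndexStrategy mod = (ModShardIndexStrategy) strategy;
        for (int s = 0; s < 3; s++) {
            System.out.println("shard " + s + ": " + mod.shardSize(s) + " docs");
        }
    }
}
```

Because the indexer only sees the interface, trying a different sharding scheme means changing one class name rather than the ingest loop.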
  14. Go Wide Quickly
  15. [Diagram: stacks of shards on ports :8983, :8984 and :8985 of search1.o19s.com, then spread across search1.o19s.com, search2.o19s.com and search3.o19s.com, each on :8983]
      Lesson: Hardware is cheap / devs expensive
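Going from several ports on one box to one port on several boxes mostly changes the shard list that a distributed query carries. A minimal sketch, assuming the standard Solr `shards` request parameter (a comma-separated list of `host:port/path` entries) and a `/solr` path on every node; the helper name is made up for illustration:

```java
import java.util.List;
import java.util.stream.Collectors;

class ShardsParam {
    // Joins host:port pairs into the comma-separated value passed as "shards".
    static String shardsParam(List<String> hostPorts, String corePath) {
        return hostPorts.stream()
                .map(hp -> hp + corePath)
                .collect(Collectors.joining(","));
    }

    public static void main(String[] args) {
        // Everything on one machine, one Solr per port.
        List<String> oneBox = List.of(
                "search1.o19s.com:8983", "search1.o19s.com:8984", "search1.o19s.com:8985");
        // The same shards spread over three machines.
        List<String> wide = List.of(
                "search1.o19s.com:8983", "search2.o19s.com:8983", "search3.o19s.com:8983");
        System.out.println("shards=" + shardsParam(oneBox, "/solr"));
        System.out.println("shards=" + shardsParam(wide, "/solr"));
    }
}
```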
  16. Why so many pipelines?
  17. Simple Pipeline
      • Simple pipeline
      • mv is atomic
      Lesson: Simple Works
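The deck does this with `mv` in Bash. The same hand-off can be sketched in Java: write the file in a staging directory, then rename it into the directory the next stage polls, so that stage never sees a half-written file. Directory and file names below are illustrative only.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

class AtomicHandoff {
    // Write into staging, then rename into the directory the next stage watches.
    static Path handOff(Path stagingDir, Path readyDir, String name, String content)
            throws IOException {
        Path tmp = stagingDir.resolve(name + ".tmp");
        Files.write(tmp, content.getBytes(StandardCharsets.UTF_8));
        // ATOMIC_MOVE throws AtomicMoveNotSupportedException instead of falling
        // back to copy-then-delete, for example across file systems.
        return Files.move(tmp, readyDir.resolve(name), StandardCopyOption.ATOMIC_MOVE);
    }

    public static void main(String[] args) throws IOException {
        Path root = Files.createTempDirectory("pipeline");
        Path staging = Files.createDirectory(root.resolve("staging"));
        Path ready = Files.createDirectory(root.resolve("ready"));
        Path done = handOff(staging, ready, "DOC45242.xml", "<doc/>");
        System.out.println("ready for next stage: " + done);
    }
}
```

As with `mv`, the rename is only atomic when staging and ready directories live on the same file system.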
  18. Don’t Move Files
      • SCP across machines is slow/error prone
      • NFS share, single point of failure.
      • Clustered file system like GFS (Global File System) can have “fencing” issues
      • HDFS shines here.
      • ZooKeeper shines here.
      • Map/Reduce
  19. Can you test your changes?
  20. JVM tuning is black art
        -verbose:gc
        -XX:+PrintGCDetails
        -server
        -Xmx8G
        -Xms8G
        -XX:MaxPermSize=256m
        -XX:PermSize=256m
        -XX:+AggressiveHeap
        -XX:+DisableExplicitGC
        -XX:ParallelGCThreads=16
        -XX:+UseParallelOldGC
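One way to take some of the guesswork out of a flag change is to have the process report what it was actually started with and how much collecting it has done. This is a minimal sketch using the standard `java.lang.management` beans, not something shown in the deck; the class name is made up. Note that the PermSize flags above only apply to JVMs older than Java 8.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.List;

class GcReport {
    // Summarises the startup flags, max heap and per-collector totals of this JVM.
    static String report() {
        StringBuilder sb = new StringBuilder();
        List<String> flags = ManagementFactory.getRuntimeMXBean().getInputArguments();
        sb.append("JVM flags: ").append(flags).append('\n');
        sb.append("Max heap (MB): ")
          .append(Runtime.getRuntime().maxMemory() / (1024 * 1024)).append('\n');
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            // Count and time are cumulative since JVM start; -1 means "not available".
            sb.append(gc.getName())
              .append(": count=").append(gc.getCollectionCount())
              .append(", timeMs=").append(gc.getCollectionTime()).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(report());
    }
}
```

Capturing this report before and after a test run gives a simple number to compare between two flag sets.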
  21. (no text on slide)
  22. Run, don’t Walk
      Lesson: Testing needs to be easy
  23. Telling some stories
      • Prototyping
      • Application Development
      • Maintaining Your Big Search Indexes
  24. Using Solr as key/value store. [Diagram labels: Metadata, Content Files, Ingest Pipeline, Solr (x4), Solr Key/Value Cache]
  25. Using Solr as key/value store
      • thousands of queries per second without real time get.
      • how fast with real time get?
      http://localhost:8983/solr/select?q=id:DOC45242&fl=entities,html
      http://localhost:8983/solr/get?id=DOC45242&fl=entities,html
      Lesson: Use what you have at hand
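The two URLs on the slide differ only in handler and parameter name: a normal query on the `id` field versus a direct lookup through `/get`. A small sketch that builds both for a given key, so a client can switch between them when timing each; the class and method names are hypothetical, and values are percent-encoded, so the output looks slightly different from the hand-typed URLs above.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

class SolrKeyValueUrls {
    static String enc(String s) {
        // The Charset overload of URLEncoder.encode needs Java 10 or later.
        return URLEncoder.encode(s, StandardCharsets.UTF_8);
    }

    // Lookup through the normal search handler: q=id:<key>
    static String selectUrl(String base, String id, String fields) {
        return base + "/select?q=" + enc("id:" + id) + "&fl=" + enc(fields);
    }

    // Lookup through the real time get handler: id=<key>
    static String realTimeGetUrl(String base, String id, String fields) {
        return base + "/get?id=" + enc(id) + "&fl=" + enc(fields);
    }

    public static void main(String[] args) {
        String base = "http://localhost:8983/solr";
        System.out.println(selectUrl(base, "DOC45242", "entities,html"));
        System.out.println(realTimeGetUrl(base, "DOC45242", "entities,html"));
    }
}
```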
  26. Don’t do expensive things in Solr
      • Tika content extraction aka Solr Cell
      • UpdateRequestProcessorChain
      • Don’t duplicate work
  27. Tika as a pipeline?
      • Auto detects content type
      • Metadata structure has all the key/value needed for Solr
      • Allows us to scale up with Behemoth project.
      • Ingest multiple XML formats as well as CSV and EDI
  28. Telling some stories
      • Prototyping
      • Application Development
      • Maintaining Your Big Search Indexes
  29. Indexing is Easy and Quick
  30. CHEAP AND CHEERFUL ><
  31. NRT versus Big Data
  32. The tension between scale and update rate. [Diagram labels: 10 million, 100’s of millions, Bad Place]
  33. Grim Reaper
  34. Grim Reaper “Death of Mice”. Especially if you are on a cloud platform. They implement their servers on the cheapest commodity hardware.
      Lesson: Embrace failure, don’t fear it
  35. Provisioning
      • Chef/Puppet
      • ZooKeeper
      • Have you versioned everything to build an index over again?
      Lesson: Automate Everything!
  36. TRADITIONAL ENVIRONMENT
  37. POOLED ENVIRONMENT
      Lesson: Think Cloud
  38. Building a Patents Index. [Chart of Machine Count (axis 0 to 300) against build time: 1 machine, 5 days; 5 machines, 3 days; 300 machines, 30 Minutes]
      What happens when we want to index 2 million patents in 30 minutes?
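To put the question in per-machine terms, here is a back-of-the-envelope calculation. It assumes the 30 minute figure goes with the 300 machine bar of the chart and that documents are split evenly across machines; neither assumption is stated on the slide.

```java
class ThroughputEstimate {
    // Documents each machine must index per second to hit the deadline,
    // assuming an even split of work across machines.
    static double docsPerSecondPerMachine(long docs, int machines, long seconds) {
        return (double) docs / machines / seconds;
    }

    public static void main(String[] args) {
        // 2 million patents, 300 machines (assumed), 30 minutes.
        double rate = docsPerSecondPerMachine(2_000_000L, 300, 30 * 60);
        System.out.printf("%.1f docs/sec per machine%n", rate);
    }
}
```

Under those assumptions each machine has to sustain somewhat under 4 documents per second, which puts the difficulty in provisioning 300 machines quickly rather than in raw indexing speed.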
  39. Amazon AWS is Good but...
      • EC2 is costly for your “base” load
      • Issues of access to internal data
      • Firewall and security
  40. Do I need Failover?
      • Can I build quickly?
      • Do I have a reliable cluster of servers?
      • Am I spread across data centers?
      • Is sooo 90’s....
      • Think shared nothing cluster!
      Lesson: No!
  41. Telling some stories
      • Prototyping
      • Application Development
      • Maintaining Your Big Search Indexes
  42. One more thought...
  43. Measuring the impact of our algorithm changes is just getting harder with Big Data.
  44. www.solrpa.nl. Project SolrPanl. We need a motivated beta tester!
  45. Thank you! Questions?
      • epugh@o19s.com
      • @dep4b
      • www.opensourceconnections.com
      • slideshare.com/o19s
      Nervous about speaking up? Ask me later!
