Big Search 4 Big Data War Stories

Some lessons that we learned in rolling out a search engine across a very big set of data.

Published in: Technology

  • 1. Big Search 4 Big Data
      Enterprise Search Summit Europe 2013, London
      Eric Pugh | @dep4b
  • 2. Who am I?
      • Principal of OpenSource Connections, a Solr/Lucene search consultancy
      • Member of the Apache Software Foundation
      • SOLR-284 UpdateRichDocuments (July 07)
      • Fascinated by the art of software development
  • 3. CO-AUTHOR
      Working on 4.0!
  • 4. Telling some war stories
      • Prototyping
      • Application Development
      • Maintaining Your Big Search Indexes
  • 5. What is Big Search?
  • 6. Background for ClientX’s Project
      • Big Data is any data set that is primarily at rest due to the difficulty of working with it.
      • 100’s of millions of documents to search
      • Aggressive timeline
      • All the data must be searched per query
      • Limited selection of tools available
      • On the Solr 3.x line
  • 7. Telling some stories
      • Prototyping
      • Application Development
      • Maintaining Your Big Search Indexes
  • 8. Boy meets Girl Story
      [Diagram labels: Metadata, Content Files, Ingest Pipeline, Solr (x4)]
  • 9. Bash Rocks
  • 10. Bash Rocks
      • Remote Solr stop/start scripts
      • Remote Indexer stop/start scripts
      • Performance Monitoring
      • Content Extraction scripts (+Java)
      • Ingestor Scripts (+Java)
      • Artifact Deployment (CM)
  • 11. Lesson: Don’t get captured by your environment
  • 12. Make it easy to change approach
  • 13. Make it easy to change sharding
          IndexStrategy indexStrategy = (IndexStrategy) Class.forName(
              "com.o19s.solr.ModShardIndexStrategy").newInstance();
          indexStrategy.configure(options);
          for (SolrInputDocument doc : docs) {
              indexStrategy.addDocument(doc);
          }
      Lesson: Sharpen your axe
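What a strategy like the one loaded on slide 13 might look like is sketched below. The slide shows only the three calls (`Class.forName`, `configure`, `addDocument`), so everything else here is an assumption: the `IndexStrategy` interface shape, the `shardCount` option name, the `shardFor` and `size` helpers, and the use of plain maps in place of `SolrInputDocument` so the sketch runs without SolrJ on the classpath.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical contract, inferred from the calls shown on the slide. */
interface IndexStrategy {
    void configure(Map<String, String> options);
    void addDocument(Map<String, Object> doc);
}

/** Routes each document to shard number hash(id) mod shardCount. */
class ModShardIndexStrategy implements IndexStrategy {
    // One in-memory bucket per shard; stands in for one Solr server per shard.
    private final List<List<Map<String, Object>>> shards = new ArrayList<>();

    @Override
    public void configure(Map<String, String> options) {
        // "shardCount" is an assumed option name, defaulting to 4 shards.
        int shardCount = Integer.parseInt(options.getOrDefault("shardCount", "4"));
        shards.clear();
        for (int i = 0; i < shardCount; i++) {
            shards.add(new ArrayList<>());
        }
    }

    /** floorMod keeps the shard number non-negative for negative hash codes. */
    int shardFor(String id) {
        return Math.floorMod(id.hashCode(), shards.size());
    }

    @Override
    public void addDocument(Map<String, Object> doc) {
        String id = String.valueOf(doc.get("id"));
        // A real strategy would send the document to that shard's Solr server.
        shards.get(shardFor(id)).add(doc);
    }

    /** Number of documents routed to the given shard so far. */
    int size(int shard) {
        return shards.get(shard).size();
    }
}
```

Because the caller only names the class in a string, swapping in a different routing rule (by date, by source, by customer) means writing another `IndexStrategy` implementation and changing that string, not the ingest loop.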
  • 14. Go Wide Quickly
  • 15. [Diagram: groups of shards (labelled up to shard1, shard8, and shard12) on ports :8983, :8984, and :8985 of search1.o19s.com, then spread across search1.o19s.com, search2.o19s.com, and search3.o19s.com, each on :8983]
      Lesson: Hardware is cheap / devs expensive
  • 16. Why so many pipelines?
  • 17. Simple Pipeline
      • Simple pipeline
      • mv is atomic
      Lesson: Simple Works
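Slide 17's point that `mv` is atomic can be carried over to Java code in the pipeline. The sketch below is one possible hand-off between stages, not the talk's implementation: the `PipelineHandoff` class and `handOff` method names are made up for illustration. It uses `Files.move` with `StandardCopyOption.ATOMIC_MOVE`, which fails with `AtomicMoveNotSupportedException` rather than quietly falling back to copy-and-delete when the move cannot be atomic (for example, across file systems).

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

class PipelineHandoff {
    /**
     * Moves a finished file into the next stage's directory in a single step,
     * so the next stage never sees a half-written file.
     */
    static Path handOff(Path finishedFile, Path nextStageDir) throws IOException {
        Files.createDirectories(nextStageDir);
        Path target = nextStageDir.resolve(finishedFile.getFileName());
        // Both directories must be on the same file system for this to succeed.
        return Files.move(finishedFile, target, StandardCopyOption.ATOMIC_MOVE);
    }
}
```

A stage would write its output into its own working directory and call `handOff` only once the file is complete; the next stage then just watches its inbox directory.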
  • 18. Don’t Move Files
      • SCP across machines is slow and error prone
      • NFS share is a single point of failure
      • Clustered file system like GFS (Global File System) can have “fencing” issues
      • HDFS shines here
      • ZooKeeper shines here
      • Map/Reduce
  • 19. Can you test your changes?
  • 20. JVM tuning is a black art
      -verbose:gc
      -XX:+PrintGCDetails
      -server
      -Xmx8G
      -Xms8G
      -XX:MaxPermSize=256m
      -XX:PermSize=256m
      -XX:+AggressiveHeap
      -XX:+DisableExplicitGC
      -XX:ParallelGCThreads=16
      -XX:+UseParallelOldGC
  • 21. (no text on slide)
  • 22. Run, don’t Walk
      Lesson: Testing needs to be easy
  • 23. Telling some stories
      • Prototyping
      • Application Development
      • Maintaining Your Big Search Indexes
  • 24. Using Solr as key/value store
      [Diagram labels: Metadata, Content Files, Ingest Pipeline, Solr (x4), Solr Key/Value Cache]
  • 25. Using Solr as key/value store
      • Thousands of queries per second without real time get.
      • How fast with real time get?
      http://localhost:8983/solr/select?q=id:DOC45242&fl=entities,html
      http://localhost:8983/solr/get?id=DOC45242&fl=entities,html
      Lesson: Use what you have at hand
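The two lookups on slide 25 differ only in the handler and the parameter that carries the key. A small helper, sketched here with the hypothetical names `SolrKeyValueUrls`, `selectUrl`, and `realTimeGetUrl`, shows the difference side by side. It only builds the URLs from the slide and sends nothing; the field list is assumed to be a plain comma-separated list of field names and is passed through unescaped.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

class SolrKeyValueUrls {
    private final String baseUrl; // for example http://localhost:8983/solr

    SolrKeyValueUrls(String baseUrl) {
        this.baseUrl = baseUrl;
    }

    /** Key lookup as an ordinary query against the id field. */
    String selectUrl(String id, String fields) {
        return baseUrl + "/select?q=id:" + encode(id) + "&fl=" + fields;
    }

    /** The same lookup through the real time get handler. */
    String realTimeGetUrl(String id, String fields) {
        return baseUrl + "/get?id=" + encode(id) + "&fl=" + fields;
    }

    private static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on a Java platform.
            throw new IllegalStateException(e);
        }
    }
}
```

With the base URL `http://localhost:8983/solr`, the id `DOC45242`, and the fields `entities,html`, the two methods reproduce the two URLs on the slide.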
  • 26. Don’t do expensive things in Solr
      • Tika content extraction aka Solr Cell
      • UpdateRequestProcessorChain
      • Don’t duplicate work
  • 27. Tika as a pipeline?
      • Auto detects content type
      • Metadata structure has all the key/value needed for Solr
      • Allows us to scale up with the Behemoth project
      • Ingest multiple XML formats as well as CSV and EDI
  • 28. Telling some stories
      • Prototyping
      • Application Development
      • Maintaining Your Big Search Indexes
  • 29. Indexing is Easy and Quick
  • 31. NRT versus Big Data
  • 32. The tension between scale and update rate
      [Diagram labels: 10 million, 100’s of millions, Bad Place]
  • 33. Grim Reaper
  • 34. Grim Reaper “Death of Mice”
      Especially if you are on a cloud platform. They implement their servers on the cheapest commodity hardware.
      Lesson: Embrace failure, don’t fear it
  • 35. Provisioning
      • Chef/Puppet
      • ZooKeeper
      • Have you versioned everything to build an index over again?
      Lesson: Automate Everything!
  • 37. POOLED ENVIRONMENT
      Lesson: Think Cloud
  • 38. Building a Patents Index
      [Chart: build time by machine count, axis 0 to 300. 1 machine: 5 days; 5 machines: 3 days; 300 machines: 30 minutes]
      What happens when we want to index 2 million patents in 30 minutes?
  • 39. Amazon AWS is Good but...
      • EC2 is costly for your “base” load
      • Issues of access to internal data
      • Firewall and security
  • 40. Do I need Failover?
      • Can I build quickly?
      • Do I have a reliable cluster of servers?
      • Am I spread across data centers?
      • Is sooo 90’s....
      • Think shared nothing cluster!
      Lesson: No!
  • 41. Telling some stories
      • Prototyping
      • Application Development
      • Maintaining Your Big Search Indexes
  • 42. One more thought...
  • 43. Measuring the impact of our algorithm changes is just getting harder with Big Data.
  • 44. www.solrpa.nl
      Project SolrPanl
      We need a motivated beta tester!
  • 45. Thank you! Questions?
      • @dep4b
      • about speaking up? Ask me later!