Big Search 4 Big Data War Stories

Some lessons we learned while rolling out a search engine across a very large data set.


Transcript

  • 1. Big Search 4 Big Data. Enterprise Search Summit Europe 2013, London. Eric Pugh | epugh@o19s.com | @dep4b
  • 2. Who am I? • Principal of OpenSource Connections, a Solr/Lucene search consultancy • Member of the Apache Software Foundation • SOLR-284 UpdateRichDocuments (July '07) • Fascinated by the art of software development
  • 3. CO-AUTHOR. Working on 4.0!
  • 4. Telling some (war) stories • Prototyping • Application Development • Maintaining Your Big Search Indexes
  • 5. What is Big Search?
  • 6. Background for ClientX's Project • Big Data is any data set that is primarily at rest due to the difficulty of working with it. • 100's of millions of documents to search • Aggressive timeline • All the data must be searched per query • Limited selection of tools available • On the Solr 3.x line
  • 7. Telling some stories • Prototyping • Application Development • Maintaining Your Big Search Indexes
  • 8. Boy Meets Girl Story [diagram: metadata and content files flow through an ingest pipeline into multiple Solr instances]
  • 9. Bash Rocks
  • 10. Bash Rocks • Remote Solr stop/start scripts • Remote indexer stop/start scripts • Performance monitoring • Content extraction scripts (+Java) • Ingestor scripts (+Java) • Artifact deployment (CM)
  • 11. Lesson: Don't get captured by your environment
  • 12. Make it easy to change approach
  • 13. Make it easy to change sharding:

        IndexStrategy indexStrategy = (IndexStrategy)
            Class.forName("com.o19s.solr.ModShardIndexStrategy").newInstance();
        indexStrategy.configure(options);
        for (SolrInputDocument doc : docs) {
            indexStrategy.addDocument(doc);
        }

    Lesson: Sharpen your axe
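    The slide only shows the calling code; the IndexStrategy interface and its implementations are not reproduced. As a rough sketch of the idea (the interface shape, the "shards" option name, and the use of SolrJ SolrServer clients are assumptions, not from the talk), a mod-hash strategy like ModShardIndexStrategy could look something like this:

        import java.util.List;
        import java.util.Map;
        import org.apache.solr.client.solrj.SolrServer;
        import org.apache.solr.common.SolrInputDocument;

        // Hypothetical interface implied by the slide's calling code.
        interface IndexStrategy {
            void configure(Map<String, Object> options) throws Exception;
            void addDocument(SolrInputDocument doc) throws Exception;
        }

        // Sketch of a mod-hash sharding strategy: each document is routed to a
        // shard by hashing its unique key, so swapping sharding approaches only
        // means loading a different class name.
        class ModShardIndexStrategy implements IndexStrategy {
            private List<SolrServer> shards;  // one SolrJ client per shard core/port

            @SuppressWarnings("unchecked")
            public void configure(Map<String, Object> options) {
                this.shards = (List<SolrServer>) options.get("shards");
            }

            public void addDocument(SolrInputDocument doc) throws Exception {
                String id = (String) doc.getFieldValue("id");
                int shard = (id.hashCode() & 0x7fffffff) % shards.size();  // keep hash non-negative
                shards.get(shard).add(doc);  // commits handled elsewhere
            }
        }

    Because the strategy is loaded by name, switching from mod-hash to, say, range- or date-based sharding is a configuration change rather than a rewrite of the indexer.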
  • 14. Go Wide Quickly
  • 15. [diagram: shards 1-12 spread across ports :8983, :8984, and :8985 on search1.o19s.com, search2.o19s.com, and search3.o19s.com] Lesson: Hardware is cheap, developers are expensive
  • 16. Why so many pipelines?
  • 17. Simple Pipeline • Simple pipeline • mv is atomic. Lesson: Simple works
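    A minimal sketch of the "mv is atomic" hand-off in Java (the directory layout and class name are assumptions): the pipeline writes each file into a staging directory, then renames it into the directory the indexer watches, so the indexer never picks up a partially written file.

        import java.io.IOException;
        import java.nio.file.Files;
        import java.nio.file.Path;
        import java.nio.file.StandardCopyOption;

        public class AtomicHandoff {
            // Rename a fully written staging file into the watched incoming
            // directory. On the same filesystem the rename is atomic, which is
            // what makes this very simple pipeline safe.
            public static void publish(Path stagedFile, Path incomingDir) throws IOException {
                Path target = incomingDir.resolve(stagedFile.getFileName());
                Files.move(stagedFile, target, StandardCopyOption.ATOMIC_MOVE);
            }
        }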
  • 18. Don't Move Files • SCP across machines is slow/error-prone • An NFS share is a single point of failure • A clustered file system like GFS (Global File System) can have “fencing” issues • HDFS shines here • ZooKeeper shines here • Map/Reduce
  • 19. Can you test your changes?
  • 20. JVM tuning is a black art: -verbose:gc -XX:+PrintGCDetails -server -Xmx8G -Xms8G -XX:MaxPermSize=256m -XX:PermSize=256m -XX:+AggressiveHeap -XX:+DisableExplicitGC -XX:ParallelGCThreads=16 -XX:+UseParallelOldGC
  • 21. [image-only slide]
  • 22. Run, Don't Walk. Lesson: Testing needs to be easy
  • 23. Telling some stories • Prototyping • Application Development • Maintaining Your Big Search Indexes
  • 24. Using Solr as a key/value store [diagram: the same ingest pipeline as before, with an additional Solr instance acting as a key/value cache alongside the search shards]
  • 25. Using Solr as a key/value store • Thousands of queries per second without real-time get. • How fast with real-time get? http://localhost:8983/solr/select?q=id:DOC45242&fl=entities,html versus http://localhost:8983/solr/get?id=DOC45242&fl=entities,html. Lesson: Use what you have at hand
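    Real-time get arrived with Solr 4.0, so for a project on the 3.x line this was a "what if". A hedged SolrJ sketch of the second URL (it assumes the /get handler is enabled in solrconfig.xml and that the core URL and field names match the example):

        import org.apache.solr.client.solrj.SolrQuery;
        import org.apache.solr.client.solrj.SolrServer;
        import org.apache.solr.client.solrj.impl.HttpSolrServer;
        import org.apache.solr.common.SolrDocument;

        public class RealTimeGetExample {
            public static void main(String[] args) throws Exception {
                SolrServer solr = new HttpSolrServer("http://localhost:8983/solr");
                SolrQuery q = new SolrQuery();
                q.setRequestHandler("/get");      // hit the realtime-get handler instead of /select
                q.set("id", "DOC45242");
                q.setFields("entities", "html");
                // For a single id, the /get handler returns the document under the "doc" key.
                SolrDocument doc = (SolrDocument) solr.query(q).getResponse().get("doc");
                System.out.println(doc);
            }
        }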
  • 26. Don't do expensive things in Solr • Tika content extraction, aka Solr Cell • UpdateRequestProcessorChain • Don't duplicate work
  • 27. Tika as a pipeline? • Auto-detects content type • The Metadata structure has all the key/values needed for Solr • Allows us to scale up with the Behemoth project • Ingest multiple XML formats as well as CSV and EDI
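    A rough sketch of what "Tika as a pipeline" looks like in code (the field names and id convention are assumptions): AutoDetectParser sniffs the content type, BodyContentHandler collects the extracted text, and the Metadata object is copied field by field into a SolrInputDocument.

        import java.io.InputStream;
        import java.nio.file.Files;
        import java.nio.file.Path;
        import org.apache.solr.common.SolrInputDocument;
        import org.apache.tika.metadata.Metadata;
        import org.apache.tika.parser.AutoDetectParser;
        import org.apache.tika.parser.ParseContext;
        import org.apache.tika.sax.BodyContentHandler;

        public class TikaToSolr {
            public static SolrInputDocument extract(Path file) throws Exception {
                AutoDetectParser parser = new AutoDetectParser();       // detects the content type
                BodyContentHandler text = new BodyContentHandler(-1);   // -1 = no size limit on extracted text
                Metadata metadata = new Metadata();
                try (InputStream in = Files.newInputStream(file)) {
                    parser.parse(in, text, metadata, new ParseContext());
                }
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", file.getFileName().toString());       // hypothetical id convention
                doc.addField("content", text.toString());
                for (String name : metadata.names()) {
                    doc.addField("meta_" + name, metadata.get(name));    // hypothetical field prefix
                }
                return doc;
            }
        }

    Running extraction in the pipeline rather than inside Solr (via Solr Cell) keeps the expensive work off the search nodes and lets it be parallelized with tools like Behemoth.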
  • 28. Telling some stories • Prototyping • Application Development • Maintaining Your Big Search Indexes
  • 29. Indexing is Easy and Quick
  • 30. CHEAP AND CHEERFUL
  • 31. NRT versus Big Data
  • 32. The tension between scale and update rate [diagram: somewhere between 10 million and 100's of millions of documents lies the "Bad Place"]
  • 33. Grim Reaper
  • 34. Grim Reaper, “Death of Mice”. Especially if you are on a cloud platform: they implement their servers on the cheapest commodity hardware. Lesson: Embrace failure, don't fear it
  • 35. Provisioning • Chef/Puppet • ZooKeeper • Have you versioned everything needed to build the index over again? Lesson: Automate everything!
  • 36. TRADITIONAL ENVIRONMENT
  • 37. POOLED ENVIRONMENT. Lesson: Think Cloud
  • 38. Building a Patents Index [chart: machine count versus indexing time; 1 machine takes 5 days, 5 machines take 3 days, 300 machines take 30 minutes] What happens when we want to index 2 million patents in 30 minutes?
  • 39. Amazon AWS is good, but... • EC2 is costly for your “base” load • Issues of access to internal data • Firewall and security
  • 40. Do I need failover? • Can I rebuild quickly? • Do I have a reliable cluster of servers? • Am I spread across data centers? • Failover is sooo 90's... • Think shared-nothing cluster! Lesson: No!
  • 41. Telling some stories • Prototyping • Application Development • Maintaining Your Big Search Indexes
  • 42. One more thought...
  • 43. Measuring the impact of changes to our algorithms is just getting harder with Big Data.
  • 44. Project SolrPanl: www.solrpa.nl. We need a motivated beta tester!
  • 45. Thank you! Questions? • epugh@o19s.com • @dep4b • www.opensourceconnections.com • slideshare.com/o19s. Nervous about speaking up? Ask me later!