Big Search 4 Big Data War Stories

Some lessons that we learned in rolling out a search engine across a very big set of data.

Published in: Technology
Transcript of "Big Search 4 Big Data War Stories"

  1. Big Search 4 Big Data. Enterprise Search Summit Europe 2013, London. Eric Pugh | epugh@o19s.com | @dep4b
  2. Who am I?
     • Principal of OpenSource Connections - Solr/Lucene Search Consultancy
     • Member of Apache Software Foundation
     • SOLR-284 UpdateRichDocuments (July 07)
     • Fascinated by the art of software development
  3. CO-AUTHOR. Working on 4.0!
  4. Telling some (war ^) stories
     • Prototyping
     • Application Development
     • Maintaining Your Big Search Indexes
  5. What is Big Search?
  6. Background for ClientX's Project
     • Big Data is any data set that is primarily at rest due to the difficulty of working with it.
     • 100's of millions of documents to search
     • Aggressive timeline.
     • All the data must be searched per query.
     • Limited selection of tools available.
     • On Solr 3.x line
  7. Telling some stories
     • Prototyping
     • Application Development
     • Maintaining Your Big Search Indexes
  8. Boy meets Girl Story. [Diagram labels: Metadata, Content Files, Ingest Pipeline, Solr (four instances)]
  9. Bash Rocks
  10. Bash Rocks
      • Remote Solr stop/start scripts
      • Remote Indexer stop/start scripts
      • Performance Monitoring
      • Content Extraction scripts (+Java)
      • Ingestor Scripts (+Java)
      • Artifact Deployment (CM)
  11. Lesson: Don't get captured by your environment
  12. Make it easy to change approach
  13. Make it easy to change sharding
        IndexStrategy indexStrategy = (IndexStrategy) Class.forName(
            "com.o19s.solr.ModShardIndexStrategy").newInstance();
        indexStrategy.configure(options);
        for (SolrInputDocument doc : docs) {
          indexStrategy.addDocument(doc);
        }
      Lesson: Sharpen your axe
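
The slide 13 snippet loads a sharding strategy by class name, so swapping the sharding scheme becomes a configuration change. The slide does not show the strategy itself, so what follows is only a minimal sketch of what a mod-based strategy might look like. The `IndexStrategy` interface shape, the `shardCount` option, the in-memory shard lists, and the use of a plain `Map` in place of `SolrInputDocument` are all assumptions made to keep the example self-contained; it is not the `com.o19s.solr.ModShardIndexStrategy` class from the talk.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ShardingSketch {

    /** Hypothetical contract; the slide only shows configure() and addDocument(). */
    public interface IndexStrategy {
        void configure(Map<String, String> options);
        void addDocument(Map<String, Object> doc);
    }

    /** Routes each document to shard number (hash of id) mod (shard count). */
    public static class ModShardIndexStrategy implements IndexStrategy {
        // Stand-in for one Solr client per shard: here, just in-memory lists.
        public final List<List<Map<String, Object>>> shards = new ArrayList<>();

        @Override
        public void configure(Map<String, String> options) {
            // "shardCount" is an assumed option name.
            int shardCount = Integer.parseInt(options.getOrDefault("shardCount", "4"));
            shards.clear();
            for (int i = 0; i < shardCount; i++) {
                shards.add(new ArrayList<>());
            }
        }

        @Override
        public void addDocument(Map<String, Object> doc) {
            String id = String.valueOf(doc.get("id"));
            shards.get(shardFor(id, shards.size())).add(doc);
        }
    }

    /** floorMod keeps the result in [0, shardCount) even for negative hash codes. */
    public static int shardFor(String id, int shardCount) {
        return Math.floorMod(id.hashCode(), shardCount);
    }

    /** Same reflective loading idea as the slide, using the non-deprecated constructor call. */
    public static IndexStrategy load(String className, Map<String, String> options) {
        try {
            IndexStrategy strategy =
                (IndexStrategy) Class.forName(className).getDeclaredConstructor().newInstance();
            strategy.configure(options);
            return strategy;
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot load strategy " + className, e);
        }
    }

    public static void main(String[] args) {
        Map<String, String> options = new HashMap<>();
        options.put("shardCount", "3");
        IndexStrategy strategy = load(ModShardIndexStrategy.class.getName(), options);
        for (String id : new String[] {"DOC1", "DOC2", "DOC3", "DOC45242"}) {
            Map<String, Object> doc = new HashMap<>();
            doc.put("id", id);
            strategy.addDocument(doc);
            System.out.println(id + " -> shard " + shardFor(id, 3));
        }
    }
}
```

Because the indexer only depends on the interface, trying a different scheme (for example by date or by source) would mean writing one more class and changing the class name passed to `load`.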
  14. Go Wide Quickly
  15. [Diagram: shard stacks labelled shard1 :8983, shard8 :8984, and shard12 :8985 on search1.o19s.com, then spread out as shard1 :8983 on search1.o19s.com, shard8 :8983 on search2.o19s.com, and shard12 :8983 on search3.o19s.com]
      Lesson: Hardware is cheap / devs expensive
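
Slide 15 moves from several Solr instances on one host (different ports) to one instance per host. For Solr's distributed search, the query side of that change is mostly a different `shards` request parameter, which takes a comma-separated list of `host:port/path` entries. The sketch below builds that parameter for both layouts; the host names and ports come from the slide, while the `/solr` path and the helper name `shardsParam` are assumptions.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class ShardsParam {

    /** Joins host:port entries into the value of Solr's "shards" parameter. */
    public static String shardsParam(List<String> hostPorts, String corePath) {
        return hostPorts.stream()
            .map(hostPort -> hostPort + corePath)
            .collect(Collectors.joining(","));
    }

    public static void main(String[] args) {
        // Layout 1: three Solr instances on one machine, one port each.
        List<String> oneHost = Arrays.asList(
            "search1.o19s.com:8983", "search1.o19s.com:8984", "search1.o19s.com:8985");
        // Layout 2: one Solr instance per machine, all on the same port.
        List<String> threeHosts = Arrays.asList(
            "search1.o19s.com:8983", "search2.o19s.com:8983", "search3.o19s.com:8983");

        System.out.println("shards=" + shardsParam(oneHost, "/solr"));
        System.out.println("shards=" + shardsParam(threeHosts, "/solr"));
    }
}
```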
  16. Why so many pipelines?
  17. Simple Pipeline
      • Simple pipeline
      • mv is atomic
      Lesson: Simple Works
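
The "mv is atomic" point on slide 17 is what lets a directory act as a queue between pipeline stages: a consumer watching the target directory never sees a half-written file. The slides do this with shell scripts; a rough Java equivalent, as a sketch only, is `Files.move` with `ATOMIC_MOVE`. The directory names `staging` and `inbox` and the `.tmp` suffix are assumptions, and the atomic guarantee only holds when both directories are on the same file system (otherwise Java throws `AtomicMoveNotSupportedException`).

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class AtomicHandoff {

    /** Writes the file under staging/, then atomically renames it into inbox/. */
    public static Path publish(Path staging, Path inbox, String name, String content) {
        try {
            Files.createDirectories(staging);
            Files.createDirectories(inbox);
            // Write the full content where the next stage is not looking.
            Path tmp = staging.resolve(name + ".tmp");
            Files.write(tmp, content.getBytes(StandardCharsets.UTF_8));
            // The rename is the hand-off: the file appears in inbox/ all at once.
            return Files.move(tmp, inbox.resolve(name), StandardCopyOption.ATOMIC_MOVE);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        Path root = Files.createTempDirectory("pipeline");
        Path done = publish(root.resolve("staging"), root.resolve("inbox"),
            "DOC45242.xml", "<doc/>");
        System.out.println("published " + done);
    }
}
```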
  18. Don't Move Files
      • SCP across machines is slow/error prone
      • NFS share, single point of failure.
      • Clustered file system like GFS (Global File System) can have "fencing" issues
      • HDFS shines here.
      • ZooKeeper shines here.
      • Map/Reduce
  19. Can you test your changes?
  20. JVM tuning is black art
        -verbose:gc
        -XX:+PrintGCDetails
        -server
        -Xmx8G
        -Xms8G
        -XX:MaxPermSize=256m
        -XX:PermSize=256m
        -XX:+AggressiveHeap
        -XX:+DisableExplicitGC
        -XX:ParallelGCThreads=16
        -XX:+UseParallelOldGC
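
One small way to make JVM tuning testable, in the spirit of slide 19, is to have each Solr JVM report which flags it was actually started with, so a deployment script can tell whether a tuning change really reached every node. The sketch below uses the standard `RuntimeMXBean.getInputArguments()` call; the class name, the `missing` helper, and the choice of which flags to expect are assumptions for illustration.

```java
import java.lang.management.ManagementFactory;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class JvmFlagCheck {

    /** Returns the expected flags that are absent from the actual JVM arguments. */
    public static List<String> missing(List<String> expected, List<String> actual) {
        return expected.stream()
            .filter(flag -> !actual.contains(flag))
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // A subset of the flags from slide 20, used here as the expected set.
        List<String> expected = Arrays.asList(
            "-Xmx8G", "-Xms8G", "-XX:+DisableExplicitGC", "-XX:ParallelGCThreads=16");
        List<String> actual = ManagementFactory.getRuntimeMXBean().getInputArguments();

        System.out.println("JVM arguments: " + actual);
        System.out.println("Missing flags: " + missing(expected, actual));
        System.out.println("Max heap (MB): " + Runtime.getRuntime().maxMemory() / (1024 * 1024));
    }
}
```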
  21. [No transcript text for this slide]
  22. Run, don't Walk. Lesson: Testing needs to be easy
  23. Telling some stories
      • Prototyping
      • Application Development
      • Maintaining Your Big Search Indexes
  24. Using Solr as key/value store. [Diagram labels: Metadata, Content Files, Ingest Pipeline, Solr (four instances), Solr Key/Value Cache]
  25. Using Solr as key/value store
      • thousands of queries per second without real time get.
      • how fast with real time get?
        http://localhost:8983/solr/select?q=id:DOC45242&fl=entities,html
        http://localhost:8983/solr/get?id=DOC45242&fl=entities,html
      Lesson: Use what you have at hand
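
The two URLs on slide 25 differ only in the handler: `/select` runs a normal query on the `id` field, while `/get` is Solr's real-time get handler, which looks a document up by its unique key. A minimal sketch of building both lookups follows. The class and method names are hypothetical, the base URL is the one on the slide, and the values are URL-encoded here (so `:` and `,` appear as `%3A` and `%2C`), which the slide's hand-written URLs skip. It assumes Java 10 or later for `URLEncoder.encode(String, Charset)`.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class KeyValueLookup {

    private static String enc(String value) {
        return URLEncoder.encode(value, StandardCharsets.UTF_8);
    }

    /** Key lookup through the normal search handler: q=id:<key>. */
    public static String selectUrl(String solrBase, String id, String fields) {
        return solrBase + "/select?q=" + enc("id:" + id) + "&fl=" + enc(fields);
    }

    /** Key lookup through the real-time get handler: id=<key>. */
    public static String getUrl(String solrBase, String id, String fields) {
        return solrBase + "/get?id=" + enc(id) + "&fl=" + enc(fields);
    }

    public static void main(String[] args) {
        String base = "http://localhost:8983/solr";
        System.out.println(selectUrl(base, "DOC45242", "entities,html"));
        System.out.println(getUrl(base, "DOC45242", "entities,html"));
    }
}
```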
  26. Don't do expensive things in Solr
      • Tika content extraction aka Solr Cell
      • UpdateRequestProcessorChain
      • Don't duplicate work
  27. Tika as a pipeline?
      • Auto detects content type
      • Metadata structure has all the key/value needed for Solr
      • Allows us to scale up with Behemoth project.
      • Ingest multiple XML formats as well as CSV and EDI
  28. Telling some stories
      • Prototyping
      • Application Development
      • Maintaining Your Big Search Indexes
  29. Indexing is Easy and Quick
  30. CHEAP AND CHEERFUL
  31. NRT versus Big Data
  32. The tension between scale and update rate. [Diagram labels: 10 million, 100's of millions, Bad Place]
  33. Grim Reaper
  34. Grim Reaper "Death of Mice". Especially if you are on a cloud platform: they implement their servers on the cheapest commodity hardware.
      Lesson: Embrace failure, don't fear it
  35. Provisioning
      • Chef/Puppet
      • ZooKeeper
      • Have you versioned everything to build an index over again?
      Lesson: Automate Everything!
  36. TRADITIONAL ENVIRONMENT
  37. POOLED ENVIRONMENT. Lesson: Think Cloud
  38. Building a Patents Index. [Chart: Machine Count, axis 0 to 300; bars labelled 5 days, 3 days, 30 Minutes with values 1, 5, 300]
      What happens when we want to index 2 million patents in 30 minutes?
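
Slide 38's chart invites a quick cost comparison. Reading it as three builds (1 machine for 5 days, 5 machines for 3 days, 300 machines for 30 minutes) is an assumption based on the chart labels, as is counting a day as 24 machine-hours. Under that reading, the arithmetic below compares total machine-hours for each option, which is what a pay-per-hour pooled environment would bill.

```java
public class PatentBuildMath {

    /** Total machine-hours consumed by a build. */
    public static double machineHours(int machines, double wallClockHours) {
        return machines * wallClockHours;
    }

    public static void main(String[] args) {
        // {machine count, wall-clock hours}: assumed pairing of the chart's labels.
        double[][] builds = {
            {1, 5 * 24},   // 1 machine, 5 days
            {5, 3 * 24},   // 5 machines, 3 days
            {300, 0.5},    // 300 machines, 30 minutes
        };
        for (double[] build : builds) {
            int machines = (int) build[0];
            double hours = build[1];
            System.out.printf("%d machines x %.1f h = %.1f machine-hours%n",
                machines, hours, machineHours(machines, hours));
        }
    }
}
```

On these numbers the 300-machine build uses 150 machine-hours, close to the 120 of the single-machine build and less than half the 360 of the five-machine build, while finishing in 30 minutes.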
  39. Amazon AWS is Good but...
      • EC2 is costly for your "base" load
      • Issues of access to internal data
      • Firewall and security
  40. Do I need Failover?
      • Can I build quickly?
      • Do I have a reliable cluster of servers?
      • Am I spread across data centers?
      • Is sooo 90's....
      • Think shared nothing cluster!
      Lesson: No!
  41. Telling some stories
      • Prototyping
      • Application Development
      • Maintaining Your Big Search Indexes
  42. One more thought...
  43. Measuring the impact of our algorithm changes is just getting harder with Big Data.
  44. www.solrpa.nl. Project SolrPanl. We need a motivated beta tester!
  45. Thank you! Questions?
      • epugh@o19s.com
      • @dep4b
      • www.opensourceconnections.com
      • slideshare.com/o19s
      Nervous about speaking up? Ask me later!