Batch Indexing & Near Real Time, keeping things fast

  1. Batch Indexing & Near Real Time, keeping things fast. Marc Sturlese, Software engineer @ Trovit. Thursday, 2 May 2013
  2. About me...
     • Marc Sturlese – @sturlese
     • Software engineer @Trovit. R&D focused
     • Responsible for search and scalability
  3. Agenda
     • Who we are
     • Batch architecture. Hadoop & Hive
     • Near real time architecture. Storm & stuff
     • Putting it all together
     • Alternatives and Future directions
     • Questions
  4. Who we are: Trovit, a search engine for classifieds
  5. Who we are
  6. Batch Layer
     • Hadoop based
     • Documents are crunched by a pipeline of MR jobs
     • Hive to save stats of each phase
  7. Batch Layer: Pipeline overview (diagram labels: Incoming data, Deployment, Lucene Indexes, Ad Processor, Diff, Matching, Expiration, Deduplication, Indexing, t – 1, External Data, Hive Stats, Hadoop Cluster)
  8. Batch Layer: The good things!
     • Index always built from scratch. Small number of big segments
     • Multicast deployment allows sending indexes to all slaves at the same time
     • Backups are convenient on HDFS
  9. Batch Layer: That was cool but...
     • Not even close to real time
     • Crunching documents in batch means waiting until everything is processed. This can take a few hours
     • We want to show the user fresher results!
  10. Near real time Layer: Storm and stuff to the rescue
  11. Near real time Layer: Storm properties
     • Distributed real time computation system
     • Fault tolerance
     • Horizontal scalability
     • Low latency
     • Reliability
  12. Near real time Layer: Storm in action (diagram labels: Slave, Slave, Solr prod replicas, Slave, XML feed, XML feed, Kafka partition, Kafka partition, Storm topology, Sources, Kafka spout, Kafka spout, XML spout, Doc Manager bolt, Indexer bolt, shuffle grouping, field grouping)
  13. Near real time Layer: Storm in action
     • Spouts just read and send
     • Doc Manager Bolt processes and classifies
     • Indexer Bolt adds documents to Solr
     • Replicated logic with different implementation
     • Careful not to overload Solr slaves...
  14. Near real time Layer: Storm in action
  15. Near real time Layer: Storm in action. But...
  16. Near real time Layer: Storm in action. But...
     • Now Solr has to handle user queries and Storm inserts
     • Field grouping on Indexer Bolt for politeness
     • Small bulks to reduce insert requests
     • Committing on many cores on the same host at the same time can be painful
  17. Near real time Layer: Storm in action - Committing (diagram labels: Indexer Bolt Cars US; Indexer Bolt Jobs BR; ZooKeeper Locker; Slave 1, Slave 2, ..., Slave N; cores Real estate UK R1, Cars US R1, Cars US R2, Jobs BR R1, Jobs BR R2, Real estate ES R1)
  18. Near real time Layer: Storm in action
     • Adding documents now is fast
     • Keep number of segments small
     • Avoid merges on big segments
     • Just add new docs (no deletes or updates)
  19. Mixed Architecture: Putting it all together (diagram labels: 15, Slave, Slave, Solr prod replicas, Slave, XML feed, XML feed, Kafka partition, Kafka partition, Storm topology, Sources, HBase doc info, Bulk add, Exists?, MR Pipeline, zk)
  20. Mixed Architecture: Swapping indexes
     • NRT docs might not be contained in the new batch index (they can be even fresher than the “being built” batch index)
     • This can lead to inconsistencies...
  21. Mixed Architecture: Swapping indexes. Time jumps!
  22. Mixed Architecture: Swapping indexes (diagram labels: HBase, XML feed t, Slave t+1, Slave t, Pipeline t, Pipeline t+1, XML feed t+1, XML feed t+2, NRT indexer, Batch indexer)
  23. Mixed Architecture: Swapping indexes (diagram labels: HBase, XML feed t, Slave t+1, Slave t, Pipeline t, Pipeline t+1, XML feed t+1, XML feed t+2, NRT indexer, Batch indexer)
  24. Mixed Architecture: Swapping indexes (diagram labels: HBase, XML feed t, Slave t+1, Slave t, Pipeline t, Pipeline t+1, XML feed t+1, XML feed t+2, NRT indexer, Batch indexer, NRT t+1, NRT t+2)
  25. Mixed Architecture: Swapping indexes (diagram labels: HBase, XML feed t, Slave t+1, Slave t, Pipeline t, Pipeline t+1, XML feed t+1, XML feed t+2, NRT indexer, Batch indexer, NRT t+1, NRT t+2)
  26. Mixed Architecture: Swapping indexes
     • NRT indexed docs must be stored in temporary storage
     • Fetch missing docs from the storage and add them before the next deploy
     • This avoids time jumps
  27. Mixed Architecture: Storm and Hadoop
     • Near real time inserts, low latency
     • Hadoop handles deletes and updates. No rush on those
     • No merges on big segments, so optimal query response times
     • Tolerant to human errors
     • Temporary loss of accuracy on the NRT layer
  28. Alternatives: SolrCloud - Why not?
     • Good for the vast majority of use cases
     • Oriented to incremental inserts/updates/deletes. Segment merges are the price paid for real time
     • We need to deploy full indexes fast (faster than rsync or HTTP replication)
     • Full deploys are now easier with aliases
  29. Future lines: Lucene real time feature
     • Allows seeing docs in the index before they are committed
     • Good, but not a must right now for the use case
     • Very easy to integrate into the current architecture
  30. ??
  31. Thanks for your attention! Marc Sturlese, marc@trovit.com. Lucene/Solr Revolution 2013, San Diego, May 1 2013
  32. CONFERENCE PARTY. The Tipsy Crow: 770 5th Ave. Starts after Stump The Chump. Your conference badge gets you in the door. TOMORROW: Breakfast starts at 7:30. Keynotes start at 8:30
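
The index swap described on slides 20 to 26 (fetch the NRT docs that the new batch index lacks and add them before the deploy) could be sketched roughly as follows. This is a minimal illustration, not the Trovit implementation: plain Python dicts stand in for the batch index and for the temporary storage (HBase in the diagrams), and the names `Doc`, `missing_nrt_docs` and `prepare_deploy` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Doc:
    """Hypothetical document record; the real schema is not given in the slides."""
    doc_id: str
    body: str


def missing_nrt_docs(batch_ids, nrt_store):
    """Return the NRT docs whose ids are absent from the new batch index.

    batch_ids: set of ids present in the freshly built batch index.
    nrt_store: dict of id -> Doc, standing in for the temporary storage.
    """
    return [doc for doc_id, doc in nrt_store.items() if doc_id not in batch_ids]


def prepare_deploy(batch_index, nrt_store):
    """Build the index to deploy: the batch index plus any missing NRT docs.

    Docs already in the batch index are left untouched, because in the
    slides the NRT layer only adds new docs and the batch layer owns
    updates and deletes.
    """
    merged = dict(batch_index)
    for doc in missing_nrt_docs(set(batch_index), nrt_store):
        merged[doc.doc_id] = doc
    return merged


if __name__ == "__main__":
    batch = {"a": Doc("a", "batch a"), "b": Doc("b", "batch b")}
    nrt = {"b": Doc("b", "nrt b"), "c": Doc("c", "nrt c")}
    print(sorted(prepare_deploy(batch, nrt)))
```

Adding the missing docs before the swap is what prevents the "time jump": without it, a doc indexed by the NRT layer while the batch index was being built would vanish from results when the new index goes live.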