Batch indexing & near real time, keeping things fast.

Presented by Marc Sturlese, Architect, Backend engineer, Trovit

In this talk I will explain how we combine a mixed architecture, using Hadoop for batch indexing and Storm, HBase and ZooKeeper to keep our indexes updated in near real time. I will talk about why we didn't choose just a default SolrCloud and its real time feature (mainly to avoid hitting merges while serving queries on the slaves) and about the advantages and complexities of having a mixed architecture. Both parts of the infrastructure and how they are coordinated will be explained in detail. Finally, I will mention future lines: how we plan to use the Lucene real time feature.


  1. Batch Indexing & Near Real Time, keeping things fast
     Marc Sturlese, Software engineer @ Trovit
  2. About me...
     • Marc Sturlese – @sturlese
     • Software engineer @Trovit. R&D focused
     • Responsible for search and scalability
  3. Agenda
     • Who we are
     • Batch architecture. Hadoop & Hive
     • Near real time architecture. Storm & stuff
     • Putting it all together
     • Alternatives and future directions
     • Questions
  4. Who we are
     Trovit, a search engine for classifieds
  5. Who we are
  6. Batch Layer
     • Hadoop based
     • Documents are crunched by a pipeline of MR jobs
     • Hive to save stats of each phase
  7. Batch Layer: Pipeline overview
     [Diagram. Labels: Incoming data; External Data; t – 1; Hadoop Cluster with the stages Ad Processor, Diff, Matching, Expiration, Deduplication, Indexing; Hive Stats; Lucene Indexes; Deployment]
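
The batch layer chains MapReduce jobs and saves stats for each phase. A minimal single-process Python sketch of that shape follows. The phase functions, the document fields (`id`, `expired`) and the stats format are hypothetical stand-ins chosen for illustration, not Trovit's actual jobs or Hive schema.

```python
from typing import Callable, Dict, List, Tuple

Doc = Dict[str, object]
Phase = Tuple[str, Callable[[List[Doc]], List[Doc]]]


def drop_expired(docs: List[Doc]) -> List[Doc]:
    """Hypothetical expiration phase: keep docs not flagged as expired."""
    return [d for d in docs if not d.get("expired", False)]


def deduplicate(docs: List[Doc]) -> List[Doc]:
    """Hypothetical deduplication phase: keep the first doc seen per id."""
    seen = set()
    unique = []
    for d in docs:
        if d["id"] not in seen:
            seen.add(d["id"])
            unique.append(d)
    return unique


def run_pipeline(docs: List[Doc], phases: List[Phase]):
    """Run the phases in order and record input/output counts for each one."""
    stats = []
    for name, phase in phases:
        out = phase(docs)
        stats.append({"phase": name, "docs_in": len(docs), "docs_out": len(out)})
        docs = out
    return docs, stats


incoming = [
    {"id": "a1"},
    {"id": "a2", "expired": True},
    {"id": "a1"},
    {"id": "a3"},
]
result, stats = run_pipeline(
    incoming, [("expiration", drop_expired), ("deduplication", deduplicate)]
)
```

Per-phase counts like these are the kind of figures that could be written to a stats table after every run, so a sudden drop in one phase is easy to spot.
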
  8. Batch Layer: The good things!
     • Index always built from scratch. Small number of big segments
     • Multicast deployment sends indexes to all slaves at the same time
     • Backups are convenient on HDFS
  9. Batch Layer: That was cool but...
     • Not even close to real time
     • Crunching documents in batch means waiting until everything is processed. This can take a few hours
     • We want to show the user fresher results!
  10. Near real time Layer
      Storm and stuff to the rescue
  11. Near real time Layer: Storm properties
      • Distributed real time computation system
      • Fault tolerance
      • Horizontal scalability
      • Low latency
      • Reliability
  12. Near real time Layer: Storm in action
      [Diagram. Labels: Sources (XML feed, XML feed, Kafka partition, Kafka partition); Storm topology (XML spout, Kafka spout, Kafka spout, Doc Manager bolt, Indexer bolt; shuffle grouping, field grouping); Solr prod replicas (Slave, Slave, Slave)]
  13. Near real time Layer: Storm in action
      • Spouts just read and send
      • Doc Manager Bolt processes and classifies
      • Indexer Bolt adds documents to Solr
      • Replicated logic with different implementation
      • Careful not to overload Solr slaves...
  14. Near real time Layer: Storm in action
  15. Near real time Layer: Storm in action. But...
  16. Near real time Layer: Storm in action. But...
      • Now Solr has to handle user queries and Storm inserts
      • Field grouping on Indexer Bolt for politeness
      • Small bulks to reduce insert requests
      • Committing on many cores, same host, same time can be painful
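
The field grouping and small bulks mentioned on this slide can be illustrated outside Storm. The sketch below is plain Python, not Storm's API: it assumes the grouping field is the target Solr core (the slides do not say which field is used), routes each document by a stable hash of that field so a core is always fed by the same indexer task, and buffers documents into small bulks. `IndexerTask`, `task_for` and the core names are hypothetical.

```python
import zlib
from collections import defaultdict
from typing import Dict, List, Tuple


def task_for(core: str, num_tasks: int) -> int:
    """Stable hash of the grouping field, so a core always maps to one task."""
    return zlib.crc32(core.encode("utf-8")) % num_tasks


class IndexerTask:
    """Buffers docs per core and flushes them in small bulks."""

    def __init__(self, bulk_size: int) -> None:
        self.bulk_size = bulk_size
        self.buffers: Dict[str, List[dict]] = defaultdict(list)
        # (core, docs) pairs that a real bolt would send as add requests
        self.requests: List[Tuple[str, List[dict]]] = []

    def add(self, core: str, doc: dict) -> None:
        buf = self.buffers[core]
        buf.append(doc)
        if len(buf) >= self.bulk_size:
            self.flush(core)

    def flush(self, core: str) -> None:
        docs = self.buffers.pop(core, [])
        if docs:
            # Placeholder: one bulk add request to the Solr core would go here.
            self.requests.append((core, docs))


tasks = [IndexerTask(bulk_size=3) for _ in range(2)]
stream = [("cars_us", {"id": i}) for i in range(7)] + [("jobs_br", {"id": 100})]
for core, doc in stream:
    tasks[task_for(core, len(tasks))].add(core, doc)
for task in tasks:
    for core in list(task.buffers):  # flush whatever is left in the buffers
        task.flush(core)
```

With seven `cars_us` documents and a bulk size of 3, the core receives three requests instead of seven, all from the same task.
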
  17. Near real time Layer: Storm in action - Committing
      [Diagram. Labels: Indexer Bolt (Cars US); Indexer Bolt (Jobs BR); ZooKeeper Locker; Slave 1, Slave 2, ..., Slave N; cores Real estate UK R1, Cars US R1, Cars US R2, Jobs BR R1, Jobs BR R2, Real estate ES R1]
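
One way to read the "ZooKeeper Locker" in this slide is a lock per slave host, so that two cores on the same host never commit at the same time. The sketch below uses in-process `threading.Lock` objects as a stand-in for a distributed ZooKeeper lock; the host and core names are hypothetical, and the sleep is a placeholder for the real Solr commit call.

```python
import threading
import time
from collections import defaultdict

host_locks = defaultdict(threading.Lock)  # stand-in for one ZooKeeper lock per host
registry_guard = threading.Lock()         # protects the bookkeeping below
active = defaultdict(int)                 # commits currently running per host
max_active = defaultdict(int)             # highest concurrency seen per host
committed = []


def commit(host: str, core: str) -> None:
    with registry_guard:
        lock = host_locks[host]
    with lock:  # only one commit at a time per host
        with registry_guard:
            active[host] += 1
            max_active[host] = max(max_active[host], active[host])
        time.sleep(0.01)  # placeholder for the real commit on this core
        with registry_guard:
            active[host] -= 1
            committed.append((host, core))


cores = [
    ("slave1", "cars_us_r1"),
    ("slave1", "jobs_br_r1"),
    ("slave1", "real_estate_uk_r1"),
    ("slave2", "cars_us_r2"),
]
threads = [threading.Thread(target=commit, args=c) for c in cores]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Commits on different hosts still run in parallel; only commits that share a host are serialized.
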
  18. Near real time Layer: Storm in action
      • Adding documents now is fast
      • Keep number of segments small
      • Avoid merges on big segments
      • Just add new docs (no deletes or updates)
  19. Mixed Architecture: Putting it all together
      [Diagram. Labels: Sources (XML feed, XML feed, Kafka partition, Kafka partition); Storm topology; HBase doc info; Exists?; Bulk add; MR Pipeline; zk; Solr prod replicas (Slave, Slave, Slave)]
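
The "Exists?" check against the HBase doc info, combined with the rule that the near real time layer only adds new docs, can be sketched as a filter in front of the bulk add. Here a plain dict stands in for the HBase table; `select_new_docs` and the `id` field are hypothetical names, and recording the doc in the store at this point is an assumption.

```python
from typing import Dict, List


def select_new_docs(incoming: List[dict], doc_info: Dict[str, dict]) -> List[dict]:
    """Return docs whose id is not yet in doc_info, and record them there."""
    new_docs = []
    for doc in incoming:
        if doc["id"] in doc_info:
            continue  # known doc: updates and deletes are left to the batch layer
        doc_info[doc["id"]] = doc
        new_docs.append(doc)
    return new_docs


doc_info = {"a1": {"id": "a1", "title": "old"}}  # stand-in for the HBase doc info table
incoming = [
    {"id": "a1", "title": "changed"},
    {"id": "a2", "title": "new"},
    {"id": "a2", "title": "new again"},
]
to_index = select_new_docs(incoming, doc_info)
```

Only `a2` reaches the bulk add, and only once; the changed `a1` waits for the next batch run.
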
  20. Mixed Architecture: Swapping indexes
      • NRT docs might not be contained in the new batch index (they can be even fresher than the batch index being built)
      • This can lead to inconsistencies...
  21. Mixed Architecture: Swapping indexes. Time jumps!
  22. Mixed Architecture: Swapping indexes
      [Diagram. Labels: HBase; XML feed t, XML feed t+1, XML feed t+2; Pipeline t, Pipeline t+1; Slave t, Slave t+1; NRT indexer; Batch indexer]
  23. Mixed Architecture: Swapping indexes
      [Same diagram as slide 22]
  24. Mixed Architecture: Swapping indexes
      [Same diagram as slide 22, with NRT t+1 and NRT t+2 added]
  25. Mixed Architecture: Swapping indexes
      [Same diagram as slide 24]
  26. Mixed Architecture: Swapping indexes
      • NRT indexed docs must be stored in a temporary storage
      • Fetch missing docs from the storage and add them before the next deploy
      • This avoids time jumps
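
The step of fetching missing docs from the temporary storage before the next deploy can be sketched as a merge keyed by document id. Dicts stand in for both the freshly built batch index and the temporary NRT storage, and `prepare_deploy` is a hypothetical name. Letting the batch version win when both sides hold an id is an assumption, based on the batch layer being the one that handles updates.

```python
from typing import Dict


def prepare_deploy(batch_index: Dict[str, dict], nrt_store: Dict[str, dict]) -> Dict[str, dict]:
    """Return the batch index plus any NRT docs it does not contain yet."""
    merged = dict(batch_index)
    for doc_id, doc in nrt_store.items():
        # Only fill gaps: a doc already in the batch index keeps its batch version.
        merged.setdefault(doc_id, doc)
    return merged


batch_index = {"a1": {"id": "a1", "src": "batch"}, "a2": {"id": "a2", "src": "batch"}}
nrt_store = {"a2": {"id": "a2", "src": "nrt"}, "a3": {"id": "a3", "src": "nrt"}}
deployable = prepare_deploy(batch_index, nrt_store)
```

Without the merge, `a3` would be visible before the swap and disappear right after it, which is the time jump described above.
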
  27. Mixed Architecture: Storm and Hadoop
      • Near real time inserts, low latency
      • Hadoop handles deletes and updates. No rush on those
      • No merges on big segments, so optimal query response times
      • Tolerant to human errors
      • Temporary loss of accuracy on the NRT layer
  28. Alternatives: SolrCloud - Why not?
      • Good for the vast majority of use cases
      • Oriented to incremental inserts/updates/deletes. You pay for real time with segment merges
      • Need to deploy full indexes fast (faster than rsync or HTTP replication)
      • Now full deploy easier with aliases
  29. Future lines: Lucene real time feature
      • Allows seeing docs in the index before they are committed
      • Good but not a must right now for the use case
      • Very easy to integrate into the current architecture
  30. ??
  31. Thanks for your attention!
      Marc Sturlese, marc@trovit.com
      Lucene/Solr Revolution 2013, San Diego, May 1 2013
