From legacy, to batch, to near real-time


    1. FROM LEGACY, TO BATCH, TO NEAR REAL-TIME
       Marc Sturlese, Dani Solà
    2. WHO ARE WE?
       • Marc Sturlese - @sturlese
         • Backend engineer, focused on R&D
         • Interests: search, scalability
       • Dani Solà - @dani_sola
         • Backend engineer
         • Interests: distributed systems, data mining, search, ...
    3. TROVIT
       Search engine for classifieds: 6 verticals, 38 countries & growing
    4. FROM LEGACY TO BATCH
       • Old architecture
       • Why & when we changed
       • Current architecture
       • Hive, Pig & custom tools
       • Migration process
    5. OLD ARCHITECTURE
       • Based on MySQL and PHP scripts
       • Indexes created with DataImportHandler
       [Diagram labels: Incoming data, PHP scripts, MySQL, DataImportHandler, Lucene indexes]
    6. WHEN & WHY WE MOVED
       • Sharded strategies are hard to maintain
       • We had 10M rows in a single table
       • Many processes working on MySQL databases
       • We wanted a more maintainable codebase
       • The solution was pretty obvious...
    7. CURRENT ARCHITECTURE
       • Based on Hadoop
       • Batch process that reprocesses all the ads...
       • But it needs to be aware of the previous execution!
       • Hive & custom tools to know what is happening
    8. CURRENT ARCHITECTURE
       [Diagram labels: Incoming data, External data, Hadoop cluster (Ad Processor, Diff with t-1, Matching, Expiration, Deduplication, Indexing), Lucene indexes, Deployment, Hive, Stats]
    9. AD PROCESSOR
       • Converts text files to Thrift objects
       • Checks that the ads are complete
       • Searches for poison words
       • Checks the value ranges
       • Parses text (dates, currencies, etc.)
       [Diagram labels: Incoming data, Ad Processor, Thrift objects]
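The checks listed on the Ad Processor slide could look roughly like the sketch below. It is only an illustration: the field names, the poison-word list, the price range, and the date format are all assumptions, and a plain dataclass stands in for the Thrift objects mentioned on the slide.

```python
import re
from dataclasses import dataclass
from datetime import date, datetime
from typing import List, Optional, Tuple

# All of these values are hypothetical; the slides do not list them.
REQUIRED_FIELDS = ("id", "title", "price", "date")
POISON_WORDS = {"casino", "lottery"}
PRICE_RANGE = (0.0, 10_000_000.0)


@dataclass
class Ad:
    """Stand-in for the Thrift object produced by the real processor."""
    id: str
    title: str
    price: float
    published: date


def process_ad(raw: dict) -> Tuple[Optional[Ad], List[str]]:
    """Validate one raw ad (all values are strings); return (ad, errors)."""
    # Completeness check: every required field must be present and non-empty.
    missing = [f for f in REQUIRED_FIELDS if not raw.get(f)]
    if missing:
        return None, ["missing fields: " + ", ".join(missing)]

    errors = []
    # Poison-word check on the tokenized, lower-cased title.
    if set(re.findall(r"\w+", raw["title"].lower())) & POISON_WORDS:
        errors.append("poison word in title")

    # Parse the price, then check its value range.
    price = None
    try:
        price = float(raw["price"])
    except ValueError:
        errors.append("unparseable price")
    if price is not None and not (PRICE_RANGE[0] <= price <= PRICE_RANGE[1]):
        errors.append("price out of range")

    # Parse the date (assumed ISO format).
    published = None
    try:
        published = datetime.strptime(raw["date"], "%Y-%m-%d").date()
    except ValueError:
        errors.append("unparseable date")

    if errors:
        return None, errors
    return Ad(raw["id"], raw["title"].strip(), price, published), []
```

Returning the list of errors alongside the ad, rather than raising on the first problem, makes it easy to count rejection reasons afterwards, which fits the deck's emphasis on knowing what is going on.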
    10. DIFF PHASE
       • Performs the diff between executions
       • Merges the ads of both executions
       [Diagram labels: ads t, ads t-1, Diff, ads t]
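As a rough illustration of the diff phase, the sketch below compares the incoming ads of the current execution with the merged output of the previous one. The record layout (`content`, `first_seen`, `status`) and the idea of carrying `first_seen` over from the previous run are assumptions made for the example; the slides do not say what state is actually merged.

```python
from typing import Dict, Tuple


def diff_and_merge(previous: Dict[str, dict],
                   incoming: Dict[str, dict],
                   run_id: int) -> Tuple[Dict[str, dict], Dict[str, int]]:
    """Diff incoming ads (execution t) against the merged ads of t-1.

    previous maps ad id -> {"content": ..., "first_seen": ..., "status": ...}
    incoming maps ad id -> content of the ad in the current execution.
    """
    merged = {}
    stats = {"new": 0, "changed": 0, "unchanged": 0, "missing": 0}
    for ad_id, content in incoming.items():
        old = previous.get(ad_id)
        if old is None:
            status, first_seen = "new", run_id
        elif old["content"] != content:
            # Same ad, different content: keep its history (hypothetical field).
            status, first_seen = "changed", old["first_seen"]
        else:
            status, first_seen = "unchanged", old["first_seen"]
        stats[status] += 1
        merged[ad_id] = {"content": content,
                         "first_seen": first_seen,
                         "status": status}
    # Ads present in t-1 but absent from t.
    stats["missing"] = sum(1 for ad_id in previous if ad_id not in incoming)
    return merged, stats
```

In a Hadoop job the same logic would typically run as a join on the ad id between the two datasets; the in-memory dictionaries here only keep the example small.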
    11. MATCHING PHASE
       • Extracts semantic information:
         • Geographical information
         • Car makes and models
         • Companies
         • ...
       [Diagram labels: ads, External data, Matching, enriched ads]
    12. EXPIRATION PHASE
       • Works as a filter
       • Deletes:
         • Expired ads
         • Incorrect ads
       [Diagram labels: ads, Expiration, ads to be indexed]
    13. DEDUPLICATION PHASE
       • Duplicates are a big issue for us
       • You cannot compare N ads against each other
       • Solution:
         • Use heuristics to create “possible duplicates” groups
         • Compare all the ads of each group
       [Diagram labels: ads, Deduplication, deduplicated ads]
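The grouping idea on the deduplication slide can be sketched as follows: a cheap heuristic key puts ads into "possible duplicates" groups, and the expensive pairwise comparison only runs inside each group. The key (city plus price), the Jaccard similarity on title words, and the threshold are hypothetical choices for this example, not the heuristics the authors use.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, List, Tuple


def blocking_key(ad: dict) -> tuple:
    """Hypothetical heuristic: ads in the same city at the same price."""
    return (ad["city"].strip().lower(), ad["price"])


def similarity(a: dict, b: dict) -> float:
    """Jaccard similarity between the sets of title words."""
    words_a = set(a["title"].lower().split())
    words_b = set(b["title"].lower().split())
    if not words_a and not words_b:
        return 1.0
    return len(words_a & words_b) / len(words_a | words_b)


def find_duplicates(ads: List[dict], threshold: float = 0.8) -> List[Tuple]:
    """Return pairs of ad ids considered duplicates."""
    # Step 1: build the "possible duplicates" groups.
    groups: Dict[tuple, List[dict]] = defaultdict(list)
    for ad in ads:
        groups[blocking_key(ad)].append(ad)

    # Step 2: compare all the ads within each group only.
    pairs = []
    for group in groups.values():
        for a, b in combinations(group, 2):
            if similarity(a, b) >= threshold:
                pairs.append((a["id"], b["id"]))
    return pairs
```

Comparing every ad with every other one takes N(N-1)/2 comparisons; with grouping, the cost is the sum of the per-group pair counts, which stays small as long as the heuristic keeps groups small. The trade-off is that duplicates landing in different groups are never compared.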
    14. INDEXING PHASE
       • Is actually done in two phases
       • First we create micro indexes
         • We use Embedded Solr Server
       • Then we merge them
         • Plain Lucene
       [Diagram labels: ads, Expiration, Lucene indexes]
    15. HIVE, PIG & CUSTOM TOOLS
       • Critical:
         • To know what is going on (control info)
         • To debug
         • To prototype new processes
         • To understand your data
         • To create reports
       [Slide label: grep, cat]
    16. MIGRATION PROCESS
       • Used Amazon EC2 to test different cluster configurations
       • Kept both systems running for one month
       • Switched to the new system gradually, one country at a time
       • Then we moved the cluster to our own servers
    17. FROM BATCH TO NEAR REAL-TIME
       • Batch is not enough
       • Storm for real-time data processing
       • HBase for data storage
       • ZooKeeper for systems coordination
       • Putting it all together
       • Batch and NRT. Mixed architecture
    18. BATCH IS NOT ENOUGH
       • Data processing with MapReduce scales well but takes time and adds latency
       • Crunching documents in batch means waiting until everything is processed
       • We want to show the user fresher results!
    19. BATCH IS NOT ENOUGH
       • Storm + HBase + ZooKeeper looks like a good Solr feed!
       [Diagram labels: ZK, MR pipeline, HDFS, Id tables, Topology, ZK, Feeds, Spouts, Bolts, Bolts, Slaves]
    20. STORM - PROPERTIES
       • Distributed real-time computation system
       • Fault tolerance
       • Horizontal scalability
       • Low latency
       • Reliability
    21. STORM - COMPONENTS
       • Tuple
       • Stream
       • Spout
       • Bolt
       • Topology
    22. STORM IN ACTION
       [Diagram labels: Queue, Spouts, Bolts, Bolts, Streams of tuples, Topology, DataStore]
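To make the vocabulary of the diagram concrete, here is a toy, single-process model of a topology written with Python generators. It is not the Storm API (Storm topologies are normally defined through its own Java API and run distributed across workers); the function names and the "id|title" message format are invented for the illustration.

```python
from collections import deque
from typing import Dict, Iterable, Iterator, List, Tuple


def feed_spout(queue: deque) -> Iterator[Tuple[str]]:
    """Spout: the source of a stream; emits one tuple per queued message."""
    while queue:
        yield (queue.popleft(),)


def parse_bolt(stream: Iterable[Tuple[str]]) -> Iterator[Tuple[str, str]]:
    """Bolt: consumes tuples and emits new ones (here, parsed fields)."""
    for (raw,) in stream:
        ad_id, title = raw.split("|", 1)
        yield (ad_id, title.strip())


def store_bolt(stream: Iterable[Tuple[str, str]],
               datastore: Dict[str, str]) -> Iterator[Tuple[str]]:
    """Bolt: writes each tuple to the data store and emits the stored id."""
    for ad_id, title in stream:
        datastore[ad_id] = title
        yield (ad_id,)


def run_topology(queue: deque, datastore: Dict[str, str]) -> List[str]:
    """Topology: the wiring queue -> spout -> bolt -> bolt -> data store."""
    stream = store_bolt(parse_bolt(feed_spout(queue)), datastore)
    return [ad_id for (ad_id,) in stream]
```

Each tuple flows through the whole chain as soon as it is read from the queue, instead of waiting for a full batch to finish, which is the property that motivates the move to near real-time processing.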
    23. STORM - DAEMONS
       • Nimbus
       • Supervisors
       • Workers
    24. HBASE - PROPERTIES
       • Distributed, sorted map datastore
       • Automatic failover
       • Rows are sorted
       • Many columns per row
       • Good Hadoop integration
    25. HBASE - COMPONENTS
       • Master
         • Slave coordination and failure detection
         • Admin features
       • Region server (slaves)
    26. ZOOKEEPER
       • Highly available coordination system
       • Used for locking, distributed configuration, leader election, cluster management...
       • Curator makes it easy to implement common algorithms
    27. PUTTING IT ALL TOGETHER
       [Diagram labels: ZK, MR pipeline, HDFS, Id tables, Solr, Topology, ZK, Feeds, Spouts, Bolts processor, Bolt Indexer, Slaves]
    28. MIXED ARCHITECTURE
       • If the number of segments in the index gets too big, it has an impact on search performance
       • Building indexes in batch keeps the number of segments small
       • The mixed architecture gives near real-time updates and is tolerant to human error
    29. THANK YOU! QUESTIONS?