From legacy, to batch, to near real-time

In this talk we present our transition from an old legacy system built on top of MySQL and a bunch of PHP scripts to a system based on Hadoop and Hive, which allows us to process and keep stats of hundreds of thousands of ads several times a day and index them to be served to the end users. We also present our current undertaking to deliver the ads from our sources (property portals, job boards…) to our visitors in near real-time using Storm, HBase and Zookeeper.

    1. FROM LEGACY, TO BATCH, TO NEAR REAL-TIME
       Marc Sturlese, Dani Solà
    2. WHO ARE WE?
       • Marc Sturlese - @sturlese
         • Backend engineer, focused on R&D
         • Interests: search, scalability
       • Dani Solà - @dani_sola
         • Backend engineer
         • Interests: distributed systems, data mining, search, ...
    3. TROVIT
       Search engine for classifieds: 6 verticals, 38 countries & growing
    4. FROM LEGACY TO BATCH
       • Old architecture
       • Why & when we changed
       • Current architecture
       • Hive, Pig & custom tools
       • Migration process
    5. OLD ARCHITECTURE
       • Based on MySQL and PHP scripts
       • Indexes created with DataImportHandler
       (Diagram: Incoming data → PHP scripts → MySQL → DataImportHandler → Lucene Indexes)
    6. WHEN & WHY WE MOVED
       • Sharded strategies are hard to maintain
       • We had 10M rows in a single table
       • Many processes working on MySQL databases
       • We wanted a more maintainable codebase
       • The solution was pretty obvious...
    7. CURRENT ARCHITECTURE
       • Based on Hadoop
       • Batch process that reprocesses all the ads...
       • But needs to be aware of the previous execution!
       • Hive & custom tools to know what happens
    8. CURRENT ARCHITECTURE
       (Diagram: Incoming data and External Data enter the Hadoop cluster; Ad Processor → Diff (against the t-1 execution) → Matching → Expiration → Deduplication → Indexing → Lucene Indexes → Deployment; Hive is used for stats)
    9. AD PROCESSOR
       • Converts text files to Thrift objects
       • Checks that the ads are complete
       • Searches for poison words
       • Checks the value ranges
       • Parses text (dates, currencies, etc)
       (Diagram: Incoming data → Ad Processor → Thrift objects)
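The ad-processor checks above can be sketched in Python. Everything concrete here is an illustrative assumption: the tab-separated input, the field names, the poison-word list and the price range are made up, and a plain dict stands in for the real Thrift object.

```python
# Sketch of the ad-processor validation steps; field names, the
# poison-word list and the price range are hypothetical examples,
# and a plain dict stands in for the real Thrift object.
POISON_WORDS = {"scam", "free money"}          # hypothetical list
REQUIRED_FIELDS = ("id", "title", "price")     # hypothetical schema
PRICE_RANGE = (1, 10_000_000)                  # hypothetical sane range

def process_ad(line):
    """Turn one raw text line into a validated ad dict, or None."""
    parts = line.rstrip("\n").split("\t")
    if len(parts) < len(REQUIRED_FIELDS):
        return None                            # incomplete ad
    ad = dict(zip(REQUIRED_FIELDS, parts))
    text = ad["title"].lower()
    if any(word in text for word in POISON_WORDS):
        return None                            # poison word found
    try:
        price = float(ad["price"])             # parse the value
    except ValueError:
        return None                            # unparsable value
    if not PRICE_RANGE[0] <= price <= PRICE_RANGE[1]:
        return None                            # out of range
    ad["price"] = price
    return ad
```

In the real pipeline each of these checks routes bad ads to control info rather than silently dropping them.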
    10. DIFF PHASE
       • Performs the diff between executions
       • Merges the ads of both executions
       (Diagram: ads t and ads t-1 → Diff → ads t)
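The diff step above amounts to a keyed comparison between two executions. A minimal sketch, assuming ads are keyed by id (the change categories are an illustrative breakdown, not the deck's terminology):

```python
def diff_ads(current, previous):
    """Classify ads by comparing execution t against execution t-1.

    current/previous map ad id -> ad content. Returns id sets for
    (new, changed, unchanged, disappeared) ads; a sketch of the idea.
    """
    new = set(current) - set(previous)          # only in t
    disappeared = set(previous) - set(current)  # only in t-1
    common = set(current) & set(previous)
    changed = {i for i in common if current[i] != previous[i]}
    unchanged = common - changed
    return new, changed, unchanged, disappeared
```

The merged output of the phase is then the current execution enriched with whatever state from t-1 must be carried forward.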
    11. MATCHING PHASE
       • Extracts semantic information:
         • Geographical information
         • Cars' makes and models
         • Companies
         • ...
       (Diagram: ads + External Data → Matching → enriched ads)
    12. EXPIRATION PHASE
       • Works as a filter
       • Deletes:
         • Expired ads
         • Incorrect ads
       (Diagram: ads → Expiration → ads to be indexed)
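Since the phase works as a filter, it can be sketched as a predicate over the ad stream. The 30-day TTL and the "valid"/"last_seen" fields are assumptions for illustration only:

```python
from datetime import date, timedelta

# Sketch of the expiration filter; the 30-day TTL and the
# "valid"/"last_seen" fields are illustrative assumptions.
TTL = timedelta(days=30)

def still_indexable(ad, today):
    """Keep an ad only if it is correct and was seen recently."""
    return ad["valid"] and (today - ad["last_seen"]) <= TTL

def expire(ads, today):
    """Filter out expired and incorrect ads before indexing."""
    return [ad for ad in ads if still_indexable(ad, today)]
```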
    13. DEDUPLICATION PHASE
       • Duplicates are a big issue for us
       • You cannot compare N ads against each other
       • Solution:
         • Use heuristics to create groups of possible duplicates
         • Compare all the ads of each group
       (Diagram: ads → Deduplication → deduplicated ads)
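The grouping trick above avoids O(N²) comparisons: bucket ads by a cheap heuristic key, then compare pairs only within each bucket. In this sketch the key (city plus price rounded to the nearest thousand) and the similarity test are made-up examples, not Trovit's actual heuristics:

```python
from collections import defaultdict
from itertools import combinations

def group_key(ad):
    """Cheap heuristic bucket: hypothetical city + rounded price."""
    return (ad["city"], round(ad["price"], -3))

def find_duplicates(ads, similar):
    """Compare ads only within heuristic groups; return duplicate id pairs."""
    groups = defaultdict(list)
    for ad in ads:
        groups[group_key(ad)].append(ad)
    dups = []
    for bucket in groups.values():
        # full pairwise comparison, but only inside a small bucket
        for a, b in combinations(bucket, 2):
            if similar(a, b):
                dups.append((a["id"], b["id"]))
    return dups
```

The trade-off is recall: two true duplicates that land in different buckets are never compared, so the bucketing heuristic has to be chosen carefully.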
    14. INDEXING PHASE
       • Is actually done in two phases
         • First we create micro indexes
           • We use Embedded Solr Server
         • Then we merge them
           • Plain Lucene
       (Diagram: ads → Indexing → Lucene Indexes)
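The two-phase idea (build many micro indexes in parallel, then merge them) can be illustrated with toy inverted indexes. The real system uses Embedded Solr Server and plain Lucene index merging; this dict-based sketch only shows the shape of the computation:

```python
def build_micro_index(docs):
    """Toy micro index for one batch: term -> set of doc ids."""
    index = {}
    for doc_id, text in docs:
        for term in set(text.split()):
            index.setdefault(term, set()).add(doc_id)
    return index

def merge_indexes(micro_indexes):
    """Merge many micro indexes into one, analogous to a segment merge."""
    merged = {}
    for idx in micro_indexes:
        for term, postings in idx.items():
            merged.setdefault(term, set()).update(postings)
    return {term: sorted(ids) for term, ids in merged.items()}
```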
    15. HIVE, PIG & CUSTOM TOOLS
       • Critical:
         • To know what is going on (control info)
         • To debug
         • To prototype new processes
         • To understand your data
         • To create reports
       (Plus plain Unix tools: grep, cat)
    16. MIGRATION PROCESS
       • Used Amazon EC2 to test different cluster configurations
       • Kept both systems running for one month
       • Switched to the new system gradually, one country at a time
       • Then we moved the cluster to our own servers
    17. FROM BATCH TO NEAR REAL-TIME
       • Batch is not enough
       • Storm for real time data processing
       • HBase for data storage
       • Zookeeper for systems coordination
       • Putting it all together
       • Batch and NRT: mixed architecture
    18. BATCH IS NOT ENOUGH
       • Data processing with map reduce scales well but takes time and has latency
       • Crunching documents in batch means waiting until everything is processed
       • We want to show the user fresher results!
    19. BATCH IS NOT ENOUGH
       (Diagram: the batch side (ZK, MR pipeline, HDFS, id tables, Solr) alongside a Storm topology: feeds → spouts → bolts → bolts → slaves, coordinated with ZK)
       Storm + HBase + Zookeeper looks like a good fit!
    20. STORM - PROPERTIES
       • Distributed real time computation system
       • Fault tolerance
       • Horizontal scalability
       • Low latency
       • Reliability
    21. STORM - COMPONENTS
       • Tuple
       • Stream
       • Spout
       • Bolt
       • Topology
    22. STORM IN ACTION
       (Diagram: Queue → Spouts → Bolts → Bolts → DataStore; streams of tuples flow through the topology)
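The tuple/stream/spout/bolt flow in the diagram can be emulated with plain Python generators. This uses no Storm API at all; it only shows how tuples move from a spout through chained bolts into a terminal sink, with a word count standing in for real ad processing:

```python
def spout(source):
    """Spout: emits a stream of tuples from a source (here, a list)."""
    for item in source:
        yield (item,)

def split_bolt(stream):
    """Bolt: consumes line tuples, emits one tuple per word."""
    for (line,) in stream:
        for word in line.split():
            yield (word,)

def count_bolt(stream):
    """Terminal bolt: aggregates counts, like a sink to a datastore."""
    counts = {}
    for (word,) in stream:
        counts[word] = counts.get(word, 0) + 1
    return counts

def run_topology(source):
    """Wire spout and bolts together, as a topology does."""
    return count_bolt(split_bolt(spout(source)))
```

In real Storm the wiring is declared in a topology and each spout/bolt runs as parallel tasks across workers, which is where the fault tolerance and horizontal scalability of the previous slides come from.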
    23. STORM - DAEMONS
       • Nimbus
       • Supervisors
       • Workers
    24. HBASE - PROPERTIES
       • Distributed, sorted map datastore
       • Automatic failover
       • Rows are sorted
       • Many columns per row
       • Good Hadoop integration
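Because rows are kept sorted by key, a scan between a start and stop key touches a contiguous run of rows. A toy in-memory stand-in (not the HBase API; the row-key layout is a made-up example) using `bisect` over sorted keys:

```python
import bisect

class SortedMap:
    """Toy stand-in for HBase's sorted row space: keys kept in order,
    so a range scan is just a contiguous slice of the key list."""

    def __init__(self):
        self._keys = []
        self._rows = {}

    def put(self, key, columns):
        if key not in self._rows:
            bisect.insort(self._keys, key)      # keep keys sorted
        self._rows.setdefault(key, {}).update(columns)

    def scan(self, start, stop):
        """Return (key, row) pairs with start <= key < stop."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, stop)
        return [(k, self._rows[k]) for k in self._keys[lo:hi]]
```

Designing the row key so that related rows sort next to each other is what makes this property useful in practice.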
    25. HBASE - COMPONENTS
       • Master
         • Slave coordination and failure detection
         • Admin features
       • Region servers (slaves)
    26. ZOOKEEPER
       • Highly available coordination system
       • Used for locking, distributed configuration, leader election, cluster management...
       • Curator makes the common algorithms easy
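One of those recipes, leader election, works by having each participant create an ephemeral sequential znode; the client holding the lowest sequence number is the leader. A plain-Python emulation of that logic (not ZooKeeper or Curator calls; `FakeZk` and the paths are invented for illustration):

```python
import itertools

class FakeZk:
    """Toy stand-in for a ZooKeeper ensemble: hands out monotonically
    increasing sequence numbers, as ephemeral sequential znodes do."""

    def __init__(self):
        self._seq = itertools.count()
        self.nodes = {}  # znode path -> client name

    def create_sequential(self, prefix, client):
        path = f"{prefix}{next(self._seq):010d}"
        self.nodes[path] = client
        return path

def elect_leader(zk):
    """The client holding the lowest-numbered znode is the leader."""
    return zk.nodes[min(zk.nodes)]
```

In real ZooKeeper the nodes are ephemeral, so when the leader's session dies its znode vanishes and the next-lowest client takes over automatically.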
    27. PUTTING IT ALL TOGETHER
       (Diagram: feeds → spouts → processor bolts → indexer bolt → Solr slaves, with the Storm topology and the MR pipeline over HDFS sharing ZK and the id tables)
    28. MIXED ARCHITECTURE
       • If the number of segments in the index gets too big, it has an impact on search performance
       • Building indexes in batch allows us to keep a small number of segments
       • Gives near real-time updates and is tolerant to human error
    29. THANK YOU! QUESTIONS?
