Data Platform at Twitter: Enabling Real-time & Batch Analytics at Scale

The Data Platform at Twitter supports engineers and data scientists running batch jobs on Hadoop clusters of several thousand nodes, and real-time jobs on top of systems such as Storm. In this presentation, I discuss the overall Data Platform stack at Twitter. In particular, I talk about enabling real-time and batch analytics at scale with the help of three tools: Scalding, a Scala DSL for batch jobs using MapReduce; Summingbird, a framework for combined real-time and batch processing; and Tsar, a framework for real-time time-series aggregations.

Published in: Technology

  1. Data Platform at Twitter: Enabling Real-time & Batch Analytics at Scale • Sriram Krishnan | Twitter | @krishnansriram • Hadoop Innovation Summit, Feb 12, 2015
  2. Who am I? • Engineering Manager, Data Platform at Twitter • Formerly: Tech Lead, Big Data Platform at Netflix; Group Lead & Senior Researcher at the San Diego Supercomputer Center
  3. Twitter Scale (Feb 2015) • More than 288M monthly active users • More than 500M unique monthly logged-out users • 500M tweets per day • 80% mobile users • Thousands of advertisers
  4. Twitter Data Platform • Enables use of large-scale resources to perform data analytics at scale • trending tweets • ad impressions • most clicked, most followed, most retweeted • platform health & statistics • experimentation
  5. Twitter Data Platform • Enables use of large-scale resources to perform data analytics at scale • trending tweets • ad impressions • most clicked, most followed, most retweeted • platform health & statistics • experimentation • In real-time!
  6. Use Case: Counting Impressions
  7. Data Streams • Events: (user_id, tweet, timestamp, hashtags, …), (user_id, tweet, device_id, …) • Event queue → Storm topology → Real-time results • Logs → Hadoop job → Batch results
  8. Storm • Streaming compute framework • Jobs represented as topologies • Processes tuples as they arrive • Data comes in from Spouts • Computation happens in Bolts • http://storm.apache.org
  9. Hadoop • Batch MapReduce library • Uses the Hadoop Distributed File System (HDFS) to store and process data • Jobs consist of Map & Reduce phases • Synchronization barriers between stages • http://www.glennklockwood.com/di/hadoop-overview.php
  10. Challenges • Hadoop and Storm present two different programming models • Written in different languages • Each job has to be written twice • Hard to optimize: optimizations are specific to each job & platform • Hard to compute complete, up-to-the-moment information
  11. Goals • Write job once
  12. Goals • Write job once • Portable
  13. Goals • Write job once • Portable • Fault tolerant
  14. Goals • Write job once • Portable • Fault tolerant • Up-to-the-moment
  15. TSAR • The TimeSeries AggregatoR • https://blog.twitter.com/2014/tsar-a-timeseries-aggregator
  16. We are still counting impressions
  17. The TSAR job • aggregate { onKeys ( (TweetId) ) produce ( Count ) sinkTo ( Vertica ) } fromProducer { ClientEventSource("client_events") .filter { e => isImpressionEvent(e) } .map { e => val impr = Impression(e.tweetId); (e.timestamp, impr) } } • Callouts: Dimensions (onKeys), Metrics (produce), Data Sinks (sinkTo), Data Sources (fromProducer)
  18. TSAR will then: • Configure and launch jobs on Storm & Hadoop • Create services for querying results • Create tables, views, and staging jobs • Create alerts and observability graphs
  19. Behind the scenes • TSAR runs on Scalding and Storm • Scalding: batch (Hadoop), better accuracy, worse latency • Storm: realtime, better latency, worse accuracy
  20. Behind the scenes • TSAR runs on Scalding and Storm • Scalding: batch (Hadoop), better accuracy, worse latency • Storm: realtime, better latency, worse accuracy • Vertica: manual data exploration • Manhattan: dashboards & production services
  21. Behind the scenes • TSAR runs on Summingbird, which runs on Scalding and Storm • Scalding: batch (Hadoop), better accuracy, worse latency • Storm: realtime, better latency, worse accuracy • Vertica: manual data exploration • Manhattan: dashboards & production services
  22. Glossary • Summingbird - framework for integrating batch and online MapReduce computations • Scalding - Scala library for running batch MapReduce jobs • Manhattan - real-time multi-tenant distributed database
  23. What is Summingbird? 1) Model for streaming multi-stage map-reduce
  24. What is streaming map-reduce? • Diagram labels: Source, Source, Map, Map, Lookup, Service, Merge, SumByKey, Store • Can push one tuple through at a time to update state • => can work on batch and real-time streams of data
  25. One-at-a-time semantics: run the job in realtime or in batch
  26. What is Summingbird? 2) Implementations to run on Storm, Scalding (Hadoop), Spark, etc.
  27. What is Summingbird? 2) Implementations to run on Storm, Scalding (Hadoop), Spark, etc. • Portable
  28. Optimizers sit at the Summingbird layer; leverage those optimizers across platforms
  29. What is Summingbird? 3) Systematic implementation of the “Lambda Architecture”
  30. Lambda Architecture with Summingbird • Diagram labels: summingbird-scalding, summingbird-storm, storehaus-memcache, storehaus-algebra, storehaus-manhattan, Kafka • http://lambda-architecture.net
  31. What is Summingbird? 3) Systematic implementation of the “Lambda Architecture” • Fault tolerant
  32. What is Summingbird? 3) Systematic implementation of the “Lambda Architecture” • Fault tolerant • Up-to-the-moment
  33. Restricting reduce operators to a very general class (semigroups, monoids) system.
  34. Monoid • 1 + 2 + 3 = 6
  35. Monoid • 1 + 2 + 3 = 6 • grouping the first two terms: (1 + 2) = 3
  36. Monoid • 1 + 2 + 3 = 6 • grouping the last two terms: (2 + 3) = 5
  37. Example Monoids • (a min b) min c = a min (b min c) • (a max b) max c = a max (b max c) • addition: (a + b) + c = a + (b + c) • set union: (a ∪ b) ∪ c = a ∪ (b ∪ c) • set intersection: (a ∩ b) ∩ c = a ∩ (b ∩ c) • approximate unique count (HyperLogLog, HLL), approximate counter (Count-Min Sketch, CMS) • Algebird - https://github.com/twitter/algebird
  38. Batching and associativity yield reliability • Batch0, Batch1, Batch2, Batch3: each batch has a Log, a Hadoop job, and realtime (RT) jobs • Hadoop: slow but fault tolerant; keeps a total sum (reliably) • Realtime: noisy but fast; sums from 0, each batch
  39. Batching and associativity yield reliability • Batch0, Batch1, Batch2, Batch3: each batch has a Log, a Hadoop job, and realtime (RT) jobs • Sum of RT Batch(i) + Hadoop Batch(i-1) has bounded noise, bounded read/write size • Done at query time
  40. All of that sums up to
  41. Summary • Twitter Data Platform enables use of large-scale resources to perform data analytics at scale • Write jobs once, that are portable, reliable, and up-to-the-moment • A systematic implementation of the lambda architecture • Monoids & semigroups for parallelism & performance • Batching & associativity for reliability
  42. Thank you! @krishnansriram • Acknowledgements: Ekaterina Gonina, Ian O’Connell, Gabriel Gonzalez, Oscar Boykin, Twitter Data Platform Team • We are hiring! • @summingbird https://github.com/twitter/summingbird • @scalding https://github.com/twitter/scalding
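
The monoid slides (34 to 37) argue that associative reduce operators can be regrouped freely, which is what allows partial sums to be computed independently and combined later. A minimal sketch of that idea in plain Scala follows. The `Monoid` trait, the instances, and the helper names are hypothetical and written only for illustration; they are not Algebird's actual API.

```scala
// Illustrative sketch only: hypothetical names, NOT Algebird's API.
object MonoidSketch {
  // A monoid is an associative combine operation plus an identity element.
  trait Monoid[A] {
    def zero: A
    def plus(x: A, y: A): A
  }

  // Addition on Int: (a + b) + c = a + (b + c), identity 0.
  val intAddition: Monoid[Int] = new Monoid[Int] {
    val zero = 0
    def plus(x: Int, y: Int): Int = x + y
  }

  // Max on Int: (a max b) max c = a max (b max c), identity Int.MinValue.
  val intMax: Monoid[Int] = new Monoid[Int] {
    val zero = Int.MinValue
    def plus(x: Int, y: Int): Int = x max y
  }

  // Set union: identity is the empty set.
  def setUnion[T]: Monoid[Set[T]] = new Monoid[Set[T]] {
    val zero = Set.empty[T]
    def plus(x: Set[T], y: Set[T]): Set[T] = x union y
  }

  // Sequential reduction: fold everything left to right.
  def sumAll[A](items: Seq[A])(m: Monoid[A]): A =
    items.foldLeft(m.zero)(m.plus)

  // Chunked reduction: reduce each chunk on its own (as separate workers could),
  // then combine the partial results. Associativity guarantees the same answer.
  def sumChunked[A](items: Seq[A], chunkSize: Int)(m: Monoid[A]): A = {
    val partials = items.grouped(chunkSize).map(chunk => sumAll(chunk)(m)).toSeq
    sumAll(partials)(m)
  }
}
```

With the numbers from slides 34 to 36, `sumAll(Seq(1, 2, 3))(intAddition)` groups as (1 + 2) + 3 while `sumChunked(Seq(1, 2, 3), 2)(intAddition)` combines the partial sums 3 and 3; both give 6.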

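Slides 24, 25, 38, and 39 describe pushing one tuple at a time into state, and answering queries by adding the realtime sum for the current batch to the Hadoop total over the completed batches. The sketch below models that query-time merge with plain Scala maps for the impression-counting use case. Every name in it is hypothetical and chosen for illustration; it is not Summingbird's or Storehaus's API, and it assumes events are simply tweet ids.

```scala
// Illustrative sketch only: hypothetical names, NOT Summingbird's API.
object LambdaMergeSketch {
  // Impression counts keyed by tweet id.
  type Counts = Map[String, Long]

  val empty: Counts = Map.empty

  // Associative merge of two count maps (key-wise addition).
  def merge(a: Counts, b: Counts): Counts =
    b.foldLeft(a) { case (acc, (key, n)) =>
      acc.updated(key, acc.getOrElse(key, 0L) + n)
    }

  // One-at-a-time semantics: push a single event into the state.
  def push(state: Counts, tweetId: String): Counts =
    merge(state, Map(tweetId -> 1L))

  // Batch layer: total over all completed batches (Batch 0 .. i-1).
  def batchTotal(completedBatches: Seq[Seq[String]]): Counts =
    completedBatches.map(batch => batch.foldLeft(empty)(push)).foldLeft(empty)(merge)

  // Realtime layer: starts from zero for the current batch (Batch i)
  // and consumes events as they arrive.
  def realtimePartial(eventsSoFar: Seq[String]): Counts =
    eventsSoFar.foldLeft(empty)(push)

  // Query-time merge: Hadoop total through Batch(i-1) plus RT sum for Batch(i).
  def query(tweetId: String, batch: Counts, realtime: Counts): Long =
    merge(batch, realtime).getOrElse(tweetId, 0L)
}
```

The same `push` function serves both layers, which is the point of one-at-a-time semantics: the batch layer folds it over a whole log, the realtime layer applies it per incoming event. Because the realtime state restarts from zero for each batch, any error it accumulates is confined to the current batch and is replaced once Hadoop finishes that batch.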