Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Handson with Twitter Heron

1,561 views

Published on

Slides from Data Engineers Guild Twitter Heron hands-on Meetup

Published in: Technology
  • Be the first to comment

Handson with Twitter Heron

  1. 1. Twitter Heron KARTHIK RAMASAMY @KARTHIKZ #TwitterHeron
  2. 2. BEGIN END HERON OVERVIEW ! I HERON PERFORMANCE (II CONCLUSION K V HERON LOAD SHEDDING ZIV TALK OUTLINE HERON BACKPRESSURE bIII
  3. 3. HERON OVERVIEW b
  4. 4. STORM/HERON TERMINOLOGY TOPOLOGY Directed acyclic graph Vertices=computation, and edges=streams of data tuples SPOUTS Sources of data tuples for the topology Examples - Kafka/Kestrel/MySQL/Postgres BOLTS Process incoming tuples and emit outgoing tuples Examples - filtering/aggregation/join/arbitrary function , %
  5. 5. STORM/HERON TOPOLOGY % % % % % SPOUT 1 SPOUT 2 BOLT 1 BOLT 2 BOLT 3 BOLT 4 BOLT 5
  6. 6. WORD COUNT TOPOLOGY % % TWEET SPOUT PARSE TWEET BOLT WORD COUNT BOLT Live stream of Tweets LOGICAL PLAN
  7. 7. WORD COUNT TOPOLOGY % % TWEET SPOUT TASKS PARSE TWEET BOLT TASKS WORD COUNT BOLT TASKS %%%% %%%% When a parse tweet bolt task emits a tuple which word count bolt task should it send to?
  8. 8. STREAM GROUPINGS Random distribution of tuples Group tuples by a field or multiple fields Replicates tuples to all tasks SHUFFLE GROUPING FIELDS GROUPING ALL GROUPING Sends the entire stream to one task GLOBAL GROUPING / - ,.
  9. 9. WORD COUNT TOPOLOGY % % TWEET SPOUT TASKS PARSE TWEET BOLT TASKS WORD COUNT BOLT TASKS %%%% %%%% SHUFFLE GROUPING FIELDS GROUPING
  10. 10. WHY HERON? PERFORMANCE PREDICTABILITY EASE OF MANAGEABILITY ! d " IMPROVE DEVELOPER PRODUCTIVITY
  11. 11. HERON DESIGN DECISIONS FULLY API COMPATIBLE WITH STORM Directed acyclic graph Topologies, spouts and bolts USE OF MAIN STREAM LANGUAGES C++/JAVA/Python ! d " TASK ISOLATION Ease of debug ability/resource isolation/profiling
  12. 12. HERON ARCHITECTURE Topology 1 TOPOLOGY SUBMISSION Scheduler Topology 2 Topology 3 Topology N
  13. 13. TOPOLOGY ARCHITECTURE Topology Master ZK CLUSTER Stream Manager I1 I2 I3 I4 Stream Manager I1 I2 I3 I4 Logical Plan, Physical Plan and Execution State Sync Physical Plan CONTAINER CONTAINER Metrics Manager Metrics Manager
  14. 14. HERON SAMPLE TOPOLOGIES
  15. 15. Large amount of data produced every day Large cluster Several hundred topologies deployed Several billion messages every day HERON @TWITTER 1 stage 10 stages 3x reduction in cores and memory Heron has been in production for 2 years
  16. 16. HERON USE CASES REALTIME ETL REAL TIME BI SPAM DETECTION REAL TIME TRENDS REALTIME ML REAL TIME MEDIA REAL TIME OPS
  17. 17. COMPONENTIZING HERON x 9
  18. 18. HETEROGENOUS ECOSYSTEM Unmanaged Schedulers Managed Schedulers State Managers Job Uploaders
  19. 19. LOOKING BACK- MICROKERNELS
  20. 20. HERON- MICROSTREAM ENGINE HARDWARE BASIC INTER/INTRA IPC Topology Master Stream Manager Instance Metrics Manager Scribe Graphite SCHEDULERSPISTATEMANAGERSPI
  21. 21. HERON - MICROSTREAM ENGINE PLUG AND PLAY COMPONENTS As environment changes, core does not change MULTI LANGUAGE INSTANCES Support multiple language API with native instances MULTIPLE PROCESSING SEMANTICS Efficient stream managers for each semantics EASE OF DEVELOPMENT Faster development of components with little dependency ! ! ! !
  22. 22. HERON ENVIRONMENT Local Scheduler Local State Manager Local Uploader Aurora/Scheduler Zookeeper State Manager Packer Uploader Mesos/Aurora Scheduler Zookeeper State Manager HDFS Uploader Laptop/Server Cluster/ProductionCluster/Testing
  23. 23. HERON RESOURCE USAGE x 9
  24. 24. HERON RESOURCE USAGE Event Spout Aggregate Bolt 60-100M/min Filter 8-12M/min Flat-Map 40-60M/min Aggregate Cache 1 sec Output 25-42M/min Redis
  25. 25. RESOURCE CONSUMPTION Cores Requested Cores Used Memory Requested (GB) Memory Used Redis 24 2-4 48 N/A Heron 120 30-50 200 180
  26. 26. RESOURCE CONSUMPTION 7% 9% 84% Spout Instances Bolt Instances Heron Overhead
  27. 27. PROFILING SPOUTS 2%7% 16% 6% 6% 63% Deserialize Parse/Filter Mapping Kafka Iterator Kafka Fetch Rest
  28. 28. PROFILING BOLTS 2%4% 5% 2% 19% 68% Write Data Serialize Deserialize Aggregation Data Transport Rest
  29. 29. RESOURCE CONSUMPTION - BREAKDOWN 8% 11% 21% 61% Fetching Data User Logic Heron Usage Writing Data
  30. 30. HERON BACK PRESSURE x 9
  31. 31. STRAGGLERS BAD HOST EXECUTION SKEW INADEQUATE PROVISIONING b Ñ Stragglers are the norm in a multi-tenant distributed systems
  32. 32. APPROACHES TO HANDLE STRAGGLERS SENDERS TO STRAGGLER DROP DATA DETECT STRAGGLERS AND RESCHEDULE THEM ! d " SENDERS SLOW DOWN TO THE SPEED OF STRAGGLER
  33. 33. DROP DATA STRATEGY UNPREDICTABLE AFFECTS ACCURACY POOR VISIBILITY b Ñ
  34. 34. SLOW DOWN SENDER STRATEGY PROVIDES PREDICTABILITY PROCESSES DATA AT MAXIMUM RATE REDUCE RECOVERY TIMES HANDLES TEMPORARY SPIKES /b Ñ back pressure
  35. 35. SLOW DOWN SENDER STRATEGY % % S1 B2 B3 % B4 back pressure
  36. 36. S1 B2 B3 SLOW DOWN SENDERS STRATEGY Stream Manager Stream Manager Stream Manager Stream Manager S1 B2 B3 B4 S1 B2 B3 S1 B2 B3 B4 B4 S1 S1 S1S1
  37. 37. BACK PRESSURE IN PRACTICE Read pointer Write pointer Manually restart the container
  38. 38. BACK PRESSURE IN PRACTICE IN MOST SCENARIOS BACK PRESSURE RECOVERS Without any manual intervention SOMETIMES USER PREFER DROPPING OF DATA Care about only latest data ! d " SUSTAINED BACK PRESSURE Irrecoverable GC cycles Bad or faulty host
  39. 39. DETECTING BAD/FAULTY HOSTS
  40. 40. HERON LOAD SHEDDING x 9
  41. 41. LOAD SHEDDING SAMPLING BASED APPROACHES Down sample the incoming stream and scale up the results Easy to reason if the sampling is uniform Hard to achieve uniformity across distributed spouts ! " DROP BASED APPROACHES Simply drop older data Spouts takes a lag threshold and a lag adjustment value Works well in practice
  42. 42. CURIOUS TO LEARN MORE… Twitter Heron: Stream Processing at Scale Sanjeev Kulkarni, Nikunj Bhagat, Maosong Fu, Vikas Kedigehalli, Christopher Kellogg, Sailesh Mittal, Jignesh M. Patel*,1 , Karthik Ramasamy, Siddarth Taneja @sanjeevrk, @challenger_nik, @Louis_Fumaosong, @vikkyrk, @cckellogg, @saileshmittal, @pateljm, @karthikz, @staneja Twitter, Inc., *University of Wisconsin – Madison ABSTRACT Storm has long served as the main platform for real-time analytics at Twitter. However, as the scale of data being processed in real- time at Twitter has increased, along with an increase in the diversity and the number of use cases, many limitations of Storm have become apparent. We need a system that scales better, has better debug-ability, has better performance, and is easier to manage – all while working in a shared cluster infrastructure. We considered various alternatives to meet these needs, and in the end concluded that we needed to build a new real-time stream data processing system. This paper presents the design and implementation of this new system, called Heron. Heron is now system process, which makes debugging very challenging. Thus, we needed a cleaner mapping from the logical units of computation to each physical process. The importance of such clean mapping for debug-ability is really crucial when responding to pager alerts for a failing topology, especially if it is a topology that is critical to the underlying business model. In addition, Storm needs dedicated cluster resources, which requires special hardware allocation to run Storm topologies. This approach leads to inefficiencies in using precious cluster resources, and also limits the ability to scale on demand. We needed the ability to work in a more flexible way with popular cluster scheduling software that allows sharing the cluster resources across different types of data Storm @Twitter Ankit Toshniwal, Siddarth Taneja, Amit Shukla, Karthik Ramasamy, Jignesh M. Patel*, Sanjeev Kulkarni, Jason Jackson, Krishna Gade, Maosong Fu, Jake Donham, Nikunj Bhagat, Sailesh Mittal, Dmitriy Ryaboy @ankitoshniwal, @staneja, @amits, @karthikz, @pateljm, @sanjeevrk, @jason_j, @krishnagade, @Louis_Fumaosong, @jakedonham, @challenger_nik, @saileshmittal, @squarecog Twitter, Inc., *University of Wisconsin – Madison Streaming@Twitter Maosong Fu, Sailesh Mittal, Vikas Kedigehalli, Karthik Ramasamy, Michael Barry, Andrew Jorgensen, Christopher Kellogg, Neng Lu, Bill Graham, Jingwei Wu Twitter, Inc. Abstract Twitter generates tens of billions of events per hour when users interact with it. Analyzing these events to surface relevant content and to derive insights in real time is a challenge. To address this, we developed Heron, a new real time distributed streaming engine. In this paper, we first describe the design goals of Heron and show how the Heron architecture achieves task isolation and resource reservation to ease debugging, troubleshooting, and seamless use of shared cluster infrastructure with other critical Twitter services. We subsequently explore how a topology self adjusts using back pressure so that the pace of the topology goes as its slowest component. Finally, we outline how Heron implements at most once and at least once semantics and we describe a few operational stories based on running Heron in production. 1 Introduction
  43. 43. INTERESTED IN HERON? CONTRIBUTIONS ARE WELCOME! https://github.com/twitter/heron http://heronstreaming.io HERON IS OPEN SOURCED FOLLOW US @HERONSTREAMING
  44. 44. " #ThankYou FOR LISTENING
  45. 45. QUESTIONS and ANSWERS R # Go ahead. Ask away.
  46. 46. Heron Hands On NENG LU, PANKAJ GUPTA MARK LI, ZUYU ZHANG TAISHI NOJIMA #TwitterHeron
  47. 47. BEGIN END INSTALL HERON ! I LAUNCH TOPOLOGIES (II CONCLUSI ON K V HERON STARTER ZIV AGENDA OUTLINE EXPLORE UI bIII
  48. 48. INSTALL HERON x 9
  49. 49. HERON PREREQS JAVA 7 OR ABOVE PYTHON 2.7 LINUX PLATFORMS: INSTALL LIBUNWIND
  50. 50. INSTALL HERON CLIENT STEP 1: DOWNLOAD CLIENT BINARIES https://github.com/twitter/heron/releases/tag/0.14.1 heron-client-install-0.14.1-<platform>.sh STEP 2: INSTALL CLIENT BINARIES chmod +x heron-client-install-0.14.1-<platform>.sh ./heron-client-install-0.14.1-<platform>.sh —user
  51. 51. INSTALL HERON TOOLS STEP 1: DOWNLOAD TOOLS BINARIES https://github.com/twitter/heron/releases/tag/0.14.1 heron-tools-install-0.14.1-<platform>.sh STEP 2: INSTALL TOOLS BINARIES chmod +x heron-tools-install-0.14.1-<platform>.sh ./heron-tools-install-0.14.1-<platform>.sh —user
  52. 52. LAUNCH HERON TOOLS STEP 1: LAUNCH HERON TRACKER heron-tracker STEP 2: LAUNCH HERON UI heron-ui
  53. 53. LAUNCH EXAMPLE TOPOLOGIES STEP 1: SUBMIT TOPOLOGY heron submit local ~/.heron/examples/heron-examples.jar com.twitter.heron.examples.ExclamationTopology ExclamationTopology STEP 2: SUBMIT YET ANOTHER TOPOLOGY heron submit local ~/.heron/examples/heron-examples.jar com.twitter.heron.examples.AckingTopology AckingTopology STEP 3: SUBMIT YET ANOTHER ANOTHER TOPOLOGY heron submit local ~/.heron/examples/heron-examples.jar com.twitter.heron.examples.MultiStageAckingTopology MultiStageAckingTopology
  54. 54. EXPLORE UI STEP 1: LOGICAL PLAN AND PHYSICAL PLAN STEP 2: EXPLORE DASHBOARD STEP 3: VIEW METRICS
  55. 55. EXPLORE UI STEP 4: VIEW PROCESS LOGS STEP 5: VIEW EXCEPTIONS STEP 6: EXPLORE OTHER FEATURES
  56. 56. HERON STARTER SEVERAL STARTER EXAMPLES https://github.com/kramasamy/heron-starter INTERESTING EXERCISES Trending Twitter Words Trending Twitter Hashtags Find top retweeters for given hash tag
  57. 57. QUESTIONS IN HERON? ANY QUESTIONS/CLARIFICATIONS? https://groups.google.com/forum/#!forum/heron-users http://heronstreaming.io FOR UPDATES FOLLOW @HERONSTREAMING
  58. 58. " #ThankYou FOR TRYING

×