Anomaly detection in real-time data streams using Heron

Twitter has become the de facto medium for consumption of news in real time, and billions of events are generated and analyzed on a daily basis. To analyze these events, Twitter designed its own next-generation streaming system, Heron. Arun Kejariwal and Karthik Ramasamy walk you through how Heron is used to detect anomalies in real-time data streams. Although there have been over 75 years of prior work in anomaly detection, most techniques cannot be used off the shelf because they are not suitable for high-velocity data streams. Arun and Karthik explain how to make trade-offs between accuracy and speed and discuss incremental approaches that marry sampling with robust measures such as the median and the minimum covariance determinant (MCD) for anomaly detection.

Published in: Technology

  1. 1. Anomaly Detection in Real-Time Data Streams Using Heron. Arun Kejariwal, Karthik Ramasamy; MZ Research, Twitter.
  3. 3. DATA @ MZ: An Overview. GOW AND MOBILE STRIKE: peaked at 1M events/sec. MARKETING: serve >1B impressions/day worldwide; integrated with >150 distinct advertising channels. POTPOURRI: ~35B messages/day; writes: 20 TB/day.
  4. 4. EXPLOSION IN DATA VELOCITY AND VOLUME. SENSORS: monitoring; smartwatches, refrigerators, wearables. ACTUATORS: automation; manufacturing, robotics. DRONES: expanding the scope; delivery, real estate, power transmission lines. MOBILE: life’s remote control; personalization, productivity.
  5. 5. ANOMALY DETECTION: WHY BOTHER? Manufacturing; health care; power grid; gas pipelines; security; operations; robotics; # tweets per minute; digital marketing; connected cars.
  6. 6. ANOMALY DETECTION: LIVE EXAMPLE
  7. 7. ANOMALY DETECTION: HISTORY
  8. 8. ANOMALY DETECTION: APPLICATION DOMAINS. RESEARCHED FOR >100 YEARS. Manufacturing; econometrics; networking; image processing; computer vision; (cyber) security; text mining; signal processing; finance; experimental social psychology; web operations; statistics (and time series analysis); data fidelity; astronomy.
  9. 9. ANOMALY DETECTION: RECENT WORKS IN INDUSTRY. Timeline labels: JAN’15, MARCH’15, JULY’15, AUG’15 (twice), NOV’15 (twice), JUNE’16.
  10. 10. WHY NOT USE OFF-THE-SHELF? Anomalies are CONTEXTUAL. False positive rate; false negative rate; scale; data granularity.
  11. 11. MOSTLY UNSUPERVISED. Severity: different actions, page or not. Data characteristics: stationarity, normal distribution. Data fidelity: missing data, data corruption.
  12. 12. DATA VISUALIZATION: Not viable in practice.
  13. 13. COMMONLY USED STATISTICS. MEAN AND STANDARD DEVIATION: mean can be computed incrementally; not robust in the presence of anomalies. TRIMMED MEAN: robust in the presence of anomalies; small samples? How to handle asymmetric distributions? Results in a biased estimator; what should be the trimming boundaries? WINSORIZED MEAN. L-ESTIMATORS: linear combinations of order statistics.
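
The contrast on this slide can be sketched in a few lines of Python. This is a minimal illustration, not code from the talk: `RunningStats` is a hypothetical class that updates the mean and standard deviation one point at a time (Welford's method), while `trimmed_mean` and `winsorized_mean` are hypothetical helpers that need the sorted sample. The 10% trimming fraction and the sample stream are arbitrary assumptions.

```python
import math


class RunningStats:
    """Incrementally track mean and standard deviation (Welford's method)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def std(self):
        # Sample standard deviation; reported as 0.0 until two points are seen.
        return math.sqrt(self._m2 / (self.n - 1)) if self.n > 1 else 0.0


def trimmed_mean(values, trim_fraction=0.1):
    """Mean after discarding trim_fraction of the points from each tail.

    Assumes trim_fraction < 0.5 so that at least one point is kept.
    """
    ordered = sorted(values)
    k = int(len(ordered) * trim_fraction)
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)


def winsorized_mean(values, trim_fraction=0.1):
    """Mean after clamping each tail to the nearest retained value."""
    ordered = sorted(values)
    k = int(len(ordered) * trim_fraction)
    low, high = ordered[k], ordered[len(ordered) - k - 1]
    clamped = [min(max(v, low), high) for v in ordered]
    return sum(clamped) / len(clamped)


# Hypothetical stream: steady values around 10 with one injected anomaly.
stream = [10, 11, 9, 10, 12, 10, 11, 9, 10, 500]

stats = RunningStats()
for x in stream:
    stats.update(x)

print("incremental mean:", stats.mean)
print("incremental std: ", stats.std)
print("trimmed mean:    ", trimmed_mean(stream))
print("winsorized mean: ", winsorized_mean(stream))
```

In this sample the single anomaly pulls the incremental mean far from the typical value of about 10, while the trimmed and winsorized means stay close to it. The price is that both need the ordered sample, which is what makes them harder to maintain on a stream.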
  14. 14. ROBUST STATISTICS. MEDIAN AND MEDIAN ABSOLUTE DEVIATION (MAD): robust in the presence of anomalies; not amenable to incremental computation (use q-digest, t-digest); what if MAD is zero? (a sample with many similar values). BROADENED MEDIAN, M-ESTIMATORS, Sn AND Qn.
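
A minimal batch sketch of median/MAD scoring, including one possible answer to the "what if MAD is zero?" question, might look like the following. It is an illustration under stated assumptions rather than the speakers' method: `robust_scores` is a hypothetical helper, the fallback to the mean absolute deviation when the MAD is zero is one common convention, and the 3.5 cutoff is an arbitrary choice. The constants 1.4826 and 1.2533 rescale each measure to be comparable to a standard deviation under a normal distribution. A streaming version would replace the exact medians with a quantile sketch such as the q-digest or t-digest mentioned on the slide.

```python
import statistics


def robust_scores(values):
    """Score each point by its distance from the median in robust scale units."""
    med = statistics.median(values)
    abs_dev = [abs(v - med) for v in values]
    mad = statistics.median(abs_dev)
    if mad > 0:
        scale = 1.4826 * mad
    else:
        # Assumed fallback: more than half the points equal the median, so the
        # MAD collapses to zero; use the mean absolute deviation instead.
        scale = 1.2533 * (sum(abs_dev) / len(abs_dev))
    if scale == 0:
        # Every value is identical: nothing can be called anomalous.
        return [0.0] * len(values)
    return [d / scale for d in abs_dev]


THRESHOLD = 3.5  # assumed cutoff, chosen for illustration only

# Hypothetical sample with ordinary spread and one anomaly.
stream = [10, 11, 9, 10, 12, 10, 11, 9, 10, 500]
flagged = [i for i, s in enumerate(robust_scores(stream)) if s > THRESHOLD]

# Hypothetical sample with many identical values, so the MAD is zero.
ties = [5, 5, 5, 5, 5, 5, 5, 5, 5, 50]
flagged_ties = [i for i, s in enumerate(robust_scores(ties)) if s > THRESHOLD]

print("flagged indices:", flagged)
print("flagged indices (zero MAD):", flagged_ties)
```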
  15. 15. MULTIPLE TIME SERIES: Methods. ANALYZE INDIVIDUAL TIME SERIES: too many alerts; not actionable; alert fatigue. MINIMUM COVARIANCE DETERMINANT (MCD): proposed by Rousseeuw, 1984; Mahalanobis distance [1]; FastMCD. [1] “On the generalised distance in statistics”, by P. C. Mahalanobis, 1936.
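
To make the MCD idea concrete, the sketch below computes it by brute force for two time series: among all subsets of h points it keeps the one whose sample covariance has the smallest determinant, then measures Mahalanobis distances against that subset's mean and covariance. This is a toy illustration, not FastMCD and not code from the talk; the function names, the data, and the choice h = 7 are assumptions, the usual consistency correction factors are omitted, and exhaustive search is only feasible for tiny samples.

```python
import itertools
import math


def mean_and_cov(points):
    """Sample mean and 2x2 covariance (sxx, sxy, syy) of (x, y) points."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    return (mx, my), (sxx, sxy, syy)


def det(cov):
    sxx, sxy, syy = cov
    return sxx * syy - sxy * sxy


def mahalanobis(point, mean, cov):
    """Mahalanobis distance for a 2x2 covariance with non-zero determinant."""
    sxx, sxy, syy = cov
    dx, dy = point[0] - mean[0], point[1] - mean[1]
    # Quadratic form d^T * inverse(cov) * d, written out for the 2x2 case.
    return math.sqrt((syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det(cov))


def exact_mcd(points, h):
    """Brute-force MCD: mean and covariance of the h-subset with minimum determinant."""
    best = min(itertools.combinations(points, h),
               key=lambda subset: det(mean_and_cov(subset)[1]))
    return mean_and_cov(best)


# Hypothetical pair of series that normally move together; the last
# observation breaks the relationship without being extreme in x.
points = [(1.0, 1.1), (2.0, 1.9), (3.0, 3.2), (4.0, 3.8),
          (5.0, 5.1), (6.0, 6.2), (7.0, 6.8), (4.0, 9.0)]

classical_mean, classical_cov = mean_and_cov(points)
robust_mean, robust_cov = exact_mcd(points, h=7)

classical_d = [mahalanobis(p, classical_mean, classical_cov) for p in points]
robust_d = [mahalanobis(p, robust_mean, robust_cov) for p in points]

print("classical distance of last point:", classical_d[-1])
print("MCD-based distance of last point:", robust_d[-1])
```

With the classical estimate, the anomaly inflates the covariance it is measured against, so its own distance stays small (masking). Measured against the MCD estimate, the same point stands far apart from the rest.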
  16. 16. MULTIPLE TIME SERIES: Other Methods. CORRELATION: direction; magnitude; n×n correlation matrix? Bake in context; exploit topology.
  17. 17. MULTIPLE TIME SERIES: Other Methods. CHALLENGES: susceptible to anomalies; data skew; missing data; speed. TECHNIQUES: robust correlation; cross correlation; intersection analysis; trade-off between speed and accuracy.
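
The "susceptible to anomalies" challenge can be illustrated by comparing the Pearson correlation with a rank-based one. The slides do not say which robust correlation the speakers use; Spearman's rank correlation is swapped in here purely as a familiar example, and the helper names and data are assumptions.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical pair of series: y tracks x exactly except for one corrupted value.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [2, 4, 6, 8, 10, 12, 14, 16, 18, -100]

print("Pearson: ", pearson(x, y))
print("Spearman:", spearman(x, y))
```

In this sample a single corrupted value flips the sign of the Pearson coefficient, while the rank-based coefficient stays positive because the outlier can only move by a bounded number of rank positions.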
  18. 18. THE BIG PICTURE
  19. 19. THE FLOW: RTplatform and Heron. Live data; streaming computation; RTplatform.
  20. 20. RTplatform: Cloud-based platform built for connecting, processing, and reacting to live data. Extreme scale; high performance; unprecedented reliability; natively serverless.
  21. 21. RTplatform: “Real-time” has many definitions that have variable KPIs. Real-time results on data-at-rest, not on live data.
  22. 22. RTplatform: A backbone for live data. Live stream; bots. Free messaging for publishers and subscribers; filter, analyze and transform messages in live stream; notify; anomaly detection. MESSAGING: real-time Pub/Sub with ultra-low latency and high fanout. QUERYING: filter, analyze, and transform messages live, in-stream. BOTS: deploy rule-based bots for real-time anomaly detection/reaction.
  23. 23. RTplatform
  24. 24. HERON
  25. 25. HERON DESIGN GOALS. Task isolation: ease of debug-ability/isolation/profiling. Support for back pressure: topologies should be self-adjusting. Efficiency: reduce resource consumption. Off-the-shelf schedulers: unmanaged (Apache YARN/Mesos); managed (Apache Aurora, Amazon ECS). Use of mainstream languages: C++, Java and Python. Batching of tuples: amortizing the cost of transferring tuples.
  26. 26. HERON ARCHITECTURE. Diagram: Topology Submission; Scheduler; Topology 1, Topology 2, ..., Topology N.
  27. 27. TOPOLOGY ARCHITECTURE. Diagram: Topology Master; ZK Cluster; Logical Plan, Physical Plan and Execution State; Sync Physical Plan; two CONTAINERs, each with a Stream Manager, a Metrics Manager and instances I1 to I4.
  28. 28. STREAM MANAGER: Sample Topology. Diagram: S1, B2, B3, B4.
  29. 29. HERON PHYSICAL EXECUTION. Diagram: four Stream Managers with instances S1, B2, B3 and B4.
  30. 30. BACKPRESSURE: Stragglers are the norm in multi-tenant distributed systems. BAD HOST; EXECUTION SKEW; INADEQUATE PROVISIONING.
  31. 31. BACKPRESSURE: Approaches to Handle Stragglers. SENDERS TO STRAGGLER: DROP DATA; SENDERS SLOW DOWN TO THE SPEED OF STRAGGLER; DETECT STRAGGLERS AND RESCHEDULE THEM.
  32. 32. BACKPRESSURE: Data Drop Strategy. UNPREDICTABLE; AFFECTS ACCURACY; POOR VISIBILITY.
  33. 33. BACKPRESSURE: Slow Down Sender. PROVIDES PREDICTABILITY; PROCESSES DATA AT MAXIMUM RATE; REDUCE RECOVERY TIMES; HANDLES TEMPORARY SPIKES.
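
The difference between the two strategies on the preceding slides can be seen in a toy simulation of a fast sender feeding a slow receiver (the straggler) through a bounded buffer. This is a deliberately simplified model with made-up rates, not Heron's implementation; the `simulate` function and its parameters are hypothetical.

```python
def simulate(strategy, ticks=100, produce_rate=5, consume_rate=3, capacity=10):
    """Simulate a sender and a slower receiver connected by a bounded buffer.

    strategy "drop":      tuples that do not fit in the buffer are discarded.
    strategy "slow_down": the sender holds back what does not fit and offers
                          it again on the next tick.
    """
    if strategy not in ("drop", "slow_down"):
        raise ValueError("strategy must be 'drop' or 'slow_down'")
    buffered = processed = dropped = pending = 0
    for _ in range(ticks):
        offered = produce_rate + pending    # new tuples plus any held back earlier
        accepted = min(offered, capacity - buffered)
        excess = offered - accepted
        if strategy == "drop":
            dropped += excess               # tuples that do not fit are lost
            pending = 0
        else:
            pending = excess                # sender waits and retries next tick
        buffered += accepted
        done = min(consume_rate, buffered)  # the straggler drains at its own pace
        buffered -= done
        processed += done
    return {"processed": processed, "buffered": buffered,
            "dropped": dropped, "pending": pending}


drop_result = simulate("drop")
slow_result = simulate("slow_down")

print("drop:     ", drop_result)
print("slow down:", slow_result)
```

Both runs process tuples at the receiver's maximum rate, but only the drop strategy loses data. With slow-down, the shortfall shows up as a visible backlog at the sender, which matches the slides' point that the approach is predictable and that the topology runs at the speed of its slowest component.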
  34. 34. BACKPRESSURE: Stream Manager. TCP backpressure; spout based backpressure; stagewise backpressure.
  35. 35. BACKPRESSURE - TCP: Stream Manager. Slows upstream and downstream instances. Diagram: four Stream Managers with instances S1, B2, B3 and B4.
  36. 36. BACKPRESSURE - SPOUT: Stream Manager. Diagram: four Stream Managers with instances S1, B2, B3 and B4.
  37. 37. BACKPRESSURE: In Practice. IN MOST SCENARIOS BACK PRESSURE RECOVERS: without any manual intervention. SOMETIMES USER PREFERS DROPPING OF DATA: care about only latest data. SUSTAINED BACK PRESSURE: irrecoverable GC cycles, bad or faulty host.
  38. 38. BACKPRESSURE: Advantages. PREDICTABILITY: tuple failures are more deterministic. SELF ADJUSTS: topology goes as fast as the slowest component.
  39. 39. HERON: EXTENSIBLE STREAMING ENGINE. Diagram: HARDWARE; BASIC INTER/INTRA IPC; Topology Master; Stream Manager; Instance; Metrics Manager; Scribe; Graphite; SCHEDULER; STATE MANAGER.
  40. 40. HERON: EXTENSIBLE STREAMING ENGINE. PLUG AND PLAY COMPONENTS: as environment changes, core does not change. MULTI LANGUAGE INSTANCES: support multiple language API with native instances. MULTIPLE PROCESSING SEMANTICS: efficient stream managers for each semantics. EASE OF DEVELOPMENT: faster development of components with little dependency.
  41. 41. OPTIMIZING HERON. REPEATED SERIALIZATION: Java objects -> byte arrays -> Protocol Buffers. EAGER DESERIALIZATION: stream manager deserializes entire tuple even though full contents are not examined. IMMUTABILITY: stream manager does not reuse any ProtoBuf objects.
  42. 42. HERON: PERFORMANCE, at most once semantics. Two charts against spout parallelism (25, 100, 200), without and with optimizations: throughput (million tuples/min) and throughput per core (million tuples/min).
  43. 43. HERON: PERFORMANCE, at least once semantics. Two charts against spout parallelism (25, 100, 200), without and with optimizations: throughput (million tuples/min) and latency (ms).
  44. 44. HERON: PERFORMANCE, at least once semantics, impact of cache drain frequency. Two charts against cache drain frequency (ms), with series labeled 200, 100 and 25: throughput (million tuples/min) and latency (ms).
  45. 45. WE ARE HIRING! Halbert Nakagawa, Co-Founder & CTO; Francois Orsini, CTO; Josh Lulewicz, Head of Data Platform; Karthik Ramasamy, Manager.
  46. 46. QUESTIONS AND ANSWERS: Go ahead. Don’t hesitate.
  47. 47. READINGS. STORM @ TWITTER, A. Toshniwal et al., SIGMOD 2014. TWITTER HERON: STREAM PROCESSING AT SCALE, S. Kulkarni et al., SIGMOD 2015. STREAMING @ TWITTER, M. Fu, 2016. TWITTER HERON: TOWARDS EXTENSIBLE STREAMING ENGINES, M. Fu, ICDE 2017.
  48. 48. READINGS. LIMIT THEOREMS FOR THE MEDIAN DEVIATION, P. Hall and A. H. Welsh, 1985. ALTERNATIVES TO THE MEDIAN ABSOLUTE DEVIATION, P. J. Rousseeuw and C. Croux, 1993. ASYMPTOTIC INDEPENDENCE OF MEDIAN AND MAD, M. Falk, 1997. BAHADUR REPRESENTATIONS FOR THE MEDIAN ABSOLUTE DEVIATION AND ITS MODIFICATIONS, S. Mazumder and R. Serfling, 2009. THE MINIMUM REGULARIZED COVARIANCE DETERMINANT ESTIMATOR, K. Boudt, P. J. Rousseeuw, S. Vanduffel and T. Verdonck, 2017.
  49. 49. THANK YOU: For your attention!
