
Riding the Stream Processing Wave (Strange Loop 2019)


At LinkedIn, we run several thousand stream processing applications, which, coupled with our scale, has exposed us to some unique challenges. We will talk about the three kinds of applications that have made the most impact on our stream processing platform.



  1. Riding the stream processing wave. Samarth Shetty, Sept 13, 2019
  2. Riding the stream processing wave. Samarth Shetty, Sept 13, 2019
  3. Agenda: 1. Overview 2. Hard Problems 3. Future Work 4. Q&A
  4. Stream Processing: continuous processing of unbounded datasets for low-latency applications. Recommended reading: Tyler Akidau, https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101
  5. Example Application: count the number of page views for each member in a 5-minute window. [Diagram: messaging queue -> stream processing job -> messaging queue.] (A code sketch follows below.)
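
  As an illustration (not from the deck), here is a minimal sketch of this counter in Samza's high-level Java API. The topic names, serdes, and the assumption that each page-view message carries the member ID are ours; the descriptor and Windows APIs are Samza 1.x.

      import java.time.Duration;
      import org.apache.samza.application.StreamApplication;
      import org.apache.samza.application.descriptors.StreamApplicationDescriptor;
      import org.apache.samza.operators.KV;
      import org.apache.samza.operators.windows.Windows;
      import org.apache.samza.serializers.KVSerde;
      import org.apache.samza.serializers.LongSerde;
      import org.apache.samza.serializers.StringSerde;
      import org.apache.samza.system.kafka.descriptors.KafkaInputDescriptor;
      import org.apache.samza.system.kafka.descriptors.KafkaOutputDescriptor;
      import org.apache.samza.system.kafka.descriptors.KafkaSystemDescriptor;

      public class PageViewCounter implements StreamApplication {
        @Override
        public void describe(StreamApplicationDescriptor app) {
          KafkaSystemDescriptor kafka = new KafkaSystemDescriptor("kafka");
          // Illustrative topics: each input message is a member ID for one page view.
          KafkaInputDescriptor<String> pageViews =
              kafka.getInputDescriptor("page-views", new StringSerde());
          KafkaOutputDescriptor<KV<String, Long>> counts =
              kafka.getOutputDescriptor("page-view-counts",
                  KVSerde.of(new StringSerde(), new LongSerde()));

          app.getInputStream(pageViews)
              // 5-minute tumbling window keyed by member, folding a running count.
              .window(Windows.keyedTumblingWindow(
                  memberId -> memberId,        // key: the member ID itself
                  Duration.ofMinutes(5),
                  () -> 0L,                    // initial count for each window pane
                  (view, count) -> count + 1,  // increment per page view
                  new StringSerde(), new LongSerde()), "pageViewCount")
              // Emit one (member, count) pair per member per window.
              .map(pane -> KV.of(pane.getKey().getKey(), pane.getMessage()))
              .sendTo(app.getOutputStream(counts));
        }
      }
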
  6. Apache Samza: a stream processing platform and top-level Apache project (since 2014). Created by LinkedIn; in use at LinkedIn, Slack, Intuit, Redfin, etc.
  7. Apache Samza scale at LinkedIn: ~4k jobs, 20k+ containers, ~2 trillion messages processed per day.
  8. Stream processing at LinkedIn: Security (bot detection, access monitoring); Notifications (email and push notifications); Classification (topic tagging, image classification).
  9. Stream processing at LinkedIn: Site Speed (site speed and health monitoring); Index Updates (updates to the search index); Business Metrics (pre-aggregated real-time counts by dimension).
  10. Agenda: 1. Overview 2. Hard Problems (up next) 3. Future Work 4. Q&A
  11. Hard Problems, the common challenges we face in stream processing: scale; operability; scenarios spanning offline and online environments; data access and data movement.
  12. Hard Problems: in today's session we will talk about scale (stateful applications), operability, scenarios spanning offline and online environments, and data access and data movement.
  13. Real-time targeting platform: 1. Celia creates a LinkedIn post. 2. The targeting platform scores each edge based on features such as connection strength and content affinity. 3. Low-quality edges are pruned based on the scores (FPR). 4. The remaining edges trigger the notification platform, where they are scored again (SPR) and optimized (aggregation, capping, etc.). [Diagram: targeting platform feeding the notification platform.]
  14. Real-time targeting platform requirements: low latency; high QPS; large state (features generated offline are used for scoring in nearline).
  15. Real-time targeting platform approaches: for high QPS, parallelism and async processing (see the sketch below); for large state, optimized state lookup...
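
  One concrete way to get that parallelism is Samza's AsyncStreamTask, which keeps many messages in flight per task instead of blocking the run loop on remote calls. A minimal sketch: the AsyncStreamTask interface and the task.max.concurrency config are Samza's, while ScoringClient is a hypothetical stand-in for a real scoring-service client.

      import java.util.concurrent.CompletableFuture;
      import org.apache.samza.system.IncomingMessageEnvelope;
      import org.apache.samza.task.AsyncStreamTask;
      import org.apache.samza.task.MessageCollector;
      import org.apache.samza.task.TaskCallback;
      import org.apache.samza.task.TaskCoordinator;

      public class EdgeScoringTask implements AsyncStreamTask {
        /** Hypothetical non-blocking client; a placeholder for a real scoring service. */
        static class ScoringClient {
          CompletableFuture<Double> scoreAsync(Object edge) {
            return CompletableFuture.completedFuture(0.0); // stub score
          }
        }

        private final ScoringClient client = new ScoringClient();

        @Override
        public void processAsync(IncomingMessageEnvelope envelope, MessageCollector collector,
            TaskCoordinator coordinator, TaskCallback callback) {
          // Fire the remote scoring call without blocking; the number of in-flight
          // messages per task is bounded by the task.max.concurrency config.
          client.scoreAsync(envelope.getMessage())
              .thenAccept(score -> {
                // Prune low-quality edges here; forward the rest via the collector.
                callback.complete(); // ack so the next message can be processed
              })
              .exceptionally(t -> {
                callback.failure(t); // surface the error for retry handling
                return null;
              });
        }
      }
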
  16. Samza local state: used for lookups, buffering data, and computed results. Local state can be in-memory or on disk, and is computed or ingested from a remote source. [Diagram: input stream(s) -> Samza job with a local store -> output stream(s); the local store is bootstrapped via change capture or an HDFS-to-Kafka push.]
  17. Samza local state: how does it compare to remote state? [Same diagram.]
  18. Samza local state: how does it compare to remote state? 100x faster. [Same diagram.]
  19. Samza local state: how does it compare to remote state? 100x faster, with 30x throughput gains. (Shadi A. Noghabi et al., "Samza: stateful scalable stream processing at LinkedIn," Proc. VLDB Endow. 10, 12 (August 2017), 1634-1645.)
  20. Samza local state: how do we provide durability? The store is backed up in a log-compacted Kafka topic, with incremental checkpointing. [Diagram: the local store's state backup flows to a log-compacted Kafka topic.] (A config sketch follows below.)
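
  Concretely, a RocksDB-backed store with a Kafka changelog is declared in the job config. A sketch with illustrative store and topic names; the property keys are standard Samza store configs.

      # Local store kept in RocksDB on the container's disk
      stores.member-features.factory=org.apache.samza.storage.kv.RocksDbKeyValueStorageEngineFactory
      stores.member-features.key.serde=string
      stores.member-features.msg.serde=string
      serializers.registry.string.class=org.apache.samza.serializers.StringSerdeFactory
      # Durability: every write is mirrored to a log-compacted Kafka changelog topic
      stores.member-features.changelog=kafka.member-features-changelog
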
  21. Local state: how do we handle application failures? [Diagram: a Samza container with a local store, bootstrapped from a change capture stream, with its state backup in a log-compacted Kafka topic.]
  22. Local state: how do we handle application failures? [Same diagram.]
  23. Local state failure handling: the Samza master detects the failure when heartbeats to the container are lost.
  24. Local state failure handling: the Samza master starts a new container.
  25. Local state failure handling: the new container restores its local state from the state backup.
  26. Local state failure handling: the restored container catches up with the bootstrap stream, reads from the last checkpoint, and resumes producing to the output stream.
  27. Restoring local state, challenges: for large state, recovery can take up to an hour, impacted by Kafka quotas, SSD bottlenecks, etc.
  28. Local state sizes: at the 50th percentile, per-container state is < 0.5 GB.
  29. Local state sizes: at the 95th percentile, per-container state is < 36 GB.
  30. Local state sizes: the largest container state is ~150 GB and growing.
  31. Restoring local state: can we reduce the frequency of state restores?
  32. Restoring local state: can we reduce the time a state restore takes?
  33. Restoring local state: can we bound the time a state restore takes?
  34. Host Affinity: reducing downtime during recovery. Restart containers on the same host (using a durable container-ID-to-host mapping), re-use the on-disk state snapshot, and catch up on only the delta from the Kafka changelog, driving downtime toward zero. [Diagram: Samza master heartbeating to containers 1 and 2, each running tasks with on-disk state.]
  35. Host Affinity limitations: host affinity is not guaranteed; host failures are a reality :); bugs and host contention may cause a full state restore. (A config sketch follows below.)
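
  Host affinity itself is a job-level switch; the key below is Samza's standard config.

      # Ask the cluster manager for the same hosts on restart so the on-disk
      # state snapshot can be re-used instead of rebuilt from the changelog
      job.host-affinity.enabled=true
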
  36. Standby Containers: bounded time for state restore. Jobs have active and standby containers; a standby container keeps a copy of the application state; only active containers process messages. [Diagram: the active container reads the input stream and writes the changelog, the standby follows the changelog, and the Samza master heartbeats to both.]
  37. Standby containers: the active container's host fails.
  38. Standby containers: the heartbeats to the host and container are lost.
  39. Standby containers: the Samza master selects a standby for promotion.
  40. Standby containers: the Samza master promotes the standby to active.
  41. Standby containers: the newly activated container processes from the last checkpoint.
  42. Standby containers: the Samza master creates a new standby.
  43. Standby containers: the replication factor is configurable.
  44. Standby containers, results: bounded restore time of ~5 minutes, ~20x faster for large state stores (200 GB+). (A config sketch follows below.)
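
  The replication factor is set per job. As we recall it from the Samza docs (treat the exact key name as an assumption), a factor of 2 gives each active container one standby.

      # 2 = one active container plus one standby holding a warm copy of its state
      job.standbytasks.replication.factor=2
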
  45. Hard Problems, revisited: scale (stateful applications, just covered); operability; scenarios spanning offline and online environments (up next); data access and data movement.
  46. Scenarios spanning offline and online environments: ML model training and feature engineering (generation and access); Lambda architecture; experimentation.
  47. Feature management with Frame, a virtual feature store. Goal: simplify feature discovery and access. Applications get features by "name" in a global namespace; Frame is an abstraction layer for feature access, unified across environments and data sources. (https://www.slideshare.net/DavidStein1/frame-feature-management-for-productive-machine-learning)
  48. Frame: simplifying feature access (datastore: HDFS). [Diagram; see the Frame deck linked above.]
  49. Frame: simplifying feature access (datastores: key-value, REST, etc.). [Diagram; see the Frame deck linked above.]
  50. Frame for nearline applications.
  51. Simplifying Lambda: Unified Metrics. The metrics pipeline was built on Pig and Hive, but we need real-time insights. Solution: convert the Pig and Hive pipelines to Samza pipelines for nearline processing, with Apache Pinot serving the results. (Khai Tran: https://engineering.linkedin.com/blog/2019/01/bridging-offline-and-nearline-computations-with-apache-calcite)
  52. Simplifying Lambda: Unified Metrics, a Lambda architecture with a single codebase. [Diagram: a metrics definition (Raptor code config) feeds both the UMP neartime platform (Samza jobs) and the UMP offline platform (batch jobs); results land in Pinot and HDFS.] (Khai Tran: same post as above.)
  53. Simplifying Lambda: Unified Metrics, using Calcite relational algebra as an IR. Pig is converted to Calcite relational algebra (metric union, dimension decoration, user code), the plan is optimized, and a Beam physical plan is generated, yielding Beam Java API code and streaming config. (Khai Tran: same post as above.)
  54. In-progress explorations toward a convergence API: Apache Beam (Samza supports a Beam runner; exploring a Spark Beam runner) and SQL (Samza SQL and Spark SQL). (A sketch of Beam-on-Samza follows below.)
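
  To make the convergence idea concrete: Samza ships a Beam runner (org.apache.beam.runners.samza), so a Beam pipeline targets Samza just by setting the runner option. A minimal, self-contained sketch; the tiny in-memory input is ours, and swapping in a Spark runner class would be the only change needed to run the same pipeline elsewhere.

      import org.apache.beam.runners.samza.SamzaRunner;
      import org.apache.beam.sdk.Pipeline;
      import org.apache.beam.sdk.options.PipelineOptions;
      import org.apache.beam.sdk.options.PipelineOptionsFactory;
      import org.apache.beam.sdk.transforms.Count;
      import org.apache.beam.sdk.transforms.Create;

      public class BeamOnSamza {
        public static void main(String[] args) {
          PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
          // The only Samza-specific line in the pipeline definition.
          options.setRunner(SamzaRunner.class);

          Pipeline p = Pipeline.create(options);
          p.apply(Create.of("page-view", "click", "page-view"))
           .apply(Count.perElement()); // count occurrences per element
          p.run().waitUntilFinish();
        }
      }
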
  55. Agenda: 1. Overview 2. Hard Problems 3. Future Work 4. Q&A
  56. Future work: auto-sizing of jobs; multi-language support (e.g., Python); Frame for feature generation; state store on Azure Managed Disks.
  57. Thank you
