
Flink Forward 2017: Netflix Keystone SPaaS


The need for gleaning answers from unbounded data streams is moving from a nicety to a necessity. Netflix is a data-driven company that needs to process over 1 trillion events a day, amounting to 3 PB of data, to derive business insights.
To make extracting those insights easier, we are building a self-serve, scalable, fault-tolerant, multi-tenant "Stream Processing as a Service" (SPaaS) platform, so users can focus on data analysis. I'll share our experience using Flink to help build the platform.



  1. SPaaS – Stream Processing with Flink at Netflix. Monal Daxini, Engineering Manager, Stream Processing. Flink Forward 2017. @monaldax
  2. Agenda – we’ll look at our Ingest Pipeline, SPaaS (Stream Processing as a Service), SPaaS Use Cases, and Challenges & Lessons Learnt
  3. Let’s start with the Keystone Ingest Pipeline
  4. To reliably publish and process events, one needs highly available ingest pipelines – the backbone of a real-time data infrastructure [Diagram: Event Producers → Sinks]
  5. The Keystone pipeline is built on 3 core subsystems: Keystone Management, Keystone Messaging, and Keystone Stream Processing (SPaaS). It runs 100% in AWS and does not impact members’ ability to play videos
  6. Keystone Management – Provision a data stream (mini pipeline) 📽
  7. Keystone Management – Update a filter 📽
  8. Keystone Management – Filter DSL & message parser * We would like to move away from XPath & our custom parser
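  The slides show only the UI; as a rough, hypothetical illustration of what evaluating an XPath-style filter expression against an event payload could look like (the class, expression, and XML shape below are invented for this sketch, not Keystone's actual DSL or parser):

      import java.io.StringReader;
      import javax.xml.xpath.XPath;
      import javax.xml.xpath.XPathConstants;
      import javax.xml.xpath.XPathFactory;
      import org.xml.sax.InputSource;

      // Hypothetical sketch: keep an event only if an XPath filter expression
      // evaluates to true against its payload.
      public class XPathEventFilter {
          private final XPath xpath = XPathFactory.newInstance().newXPath();
          private final String filterExpression;

          public XPathEventFilter(String filterExpression) {
              this.filterExpression = filterExpression;
          }

          public boolean matches(String eventXml) throws Exception {
              InputSource payload = new InputSource(new StringReader(eventXml));
              return (Boolean) xpath.evaluate(filterExpression, payload, XPathConstants.BOOLEAN);
          }
      }

      // Usage: new XPathEventFilter("//event[type='PLAY' and country='US']").matches(xml)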
  9. Keystone Management – Configure output (sinks / destinations)
  10. Keystone Management – Elasticsearch sink config
  11. Keystone Management – Projection 📽
  12. Provisioning a data stream also generates dashboard & alert configurations for event processing
  13. Keystone Management – Tooling 📽
  14. Keystone Management – Sample report: inactive streams
  15. Keystone Management – Sample report: under-provisioned streams
  16. Next, let’s explore event flow through the Keystone pipeline, and its capabilities
  17. Keystone pipeline system boundary [Diagram: Event Producer → KCW / KSGateway → Fronting Kafka → Flink SPaaS Router → Consumer Kafka, EMR → Stream Consumers; provisioned via Keystone Management] * Does not impact video playability
  18. Events are published via a proxy (KSGateway) or a Kafka client wrapper (KCW) [same diagram]
  19. Events land in a fronting Kafka cluster [same diagram]
  20. Events are polled by the router, which applies filters and projections [same diagram]
  21. The router sends events to their destinations [same diagram]
  22. Keystone Pipeline Capabilities • At-least-once delivery semantics* • Data stream isolation • Inject event metadata: GUID, timestamp, host, app (sketched below)
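  Metadata injection can be pictured as a thin wrapper around the Kafka producer. A minimal sketch, assuming Kafka 0.11+ record headers (the wrapper class and header names are invented here, not Netflix's actual KCW, which may embed metadata in the payload instead):

      import java.net.InetAddress;
      import java.nio.charset.StandardCharsets;
      import java.util.UUID;
      import org.apache.kafka.clients.producer.KafkaProducer;
      import org.apache.kafka.clients.producer.ProducerRecord;

      // Hypothetical Kafka client wrapper (KCW) that stamps each event with a
      // GUID, timestamp, host, and app name before publishing.
      public class KafkaClientWrapper {
          private final KafkaProducer<String, String> producer;
          private final String appName;

          public KafkaClientWrapper(KafkaProducer<String, String> producer, String appName) {
              this.producer = producer;
              this.appName = appName;
          }

          public void publish(String topic, String event) throws Exception {
              ProducerRecord<String, String> record = new ProducerRecord<>(topic, event);
              record.headers()
                    .add("guid", UUID.randomUUID().toString().getBytes(StandardCharsets.UTF_8))
                    .add("ts", Long.toString(System.currentTimeMillis()).getBytes(StandardCharsets.UTF_8))
                    .add("host", InetAddress.getLocalHost().getHostName().getBytes(StandardCharsets.UTF_8))
                    .add("app", appName.getBytes(StandardCharsets.UTF_8));
              producer.send(record); // async; at-least-once depends on acks/retries config
          }
      }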
  23. Keep data loss < 1% per day per data stream during infrastructure migrations or deployments
  24. Keystone Pipeline Capabilities • Scales based on traffic (externally driven): producer & router • Kafka cluster failover: Kafka Kong
  25. Kafka Kong – the Keystone pipeline is up 24x7 and availability is key, so in the spirit of Chaos Kong we do a Kafka Kong once a week
  26. Automated Kafka cluster failover [Diagram: a fronting Kafka cluster fails (X); a backup Kafka cluster is brought up and the Flink router is repointed to it]
  27. Fronting Kafka Failover
  28. Time is of the essence – failover as fast as 5 minutes, fully automated fronting Kafka failover
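  The slides don't detail the automation itself; purely as a schematic of the sequence they imply, here is a sketch in which every helper is a hypothetical stand-in for Netflix-internal tooling:

      // Schematic only: the ordering of an automated fronting Kafka failover.
      // All helpers below are hypothetical stand-ins for internal automation.
      public class KafkaClusterFailover {

          public void failover(String failedCluster, String stream) {
              String backup = provisionBackupCluster(failedCluster); // 1. bring up backup cluster
              recreateTopics(backup, stream);                        // 2. mirror topic/partition config
              repointProducers(stream, backup);                      // 3. flip producer routing
              repointRouters(stream, backup);                        // 4. repoint Flink routers
          }

          private String provisionBackupCluster(String failed) { return failed + "-backup"; }
          private void recreateTopics(String cluster, String stream) { /* hypothetical */ }
          private void repointProducers(String stream, String cluster) { /* hypothetical */ }
          private void repointRouters(String stream, String cluster) { /* hypothetical */ }
      }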
  29. Now let’s look at Stream Processing as a Service (SPaaS)
  30. SPaaS enables point & click pipelines and custom jobs. Flink is currently used for two broad classes of use cases in SPaaS: point & click pipelines with filtering & projection (in prod) and custom code (staging with prod data), with a job DSL to follow (future)
  31. High-level SPaaS architecture [Diagram: 1. Create a job – point & click (Keystone Management), custom code (upcoming), or a job DSL (future); 2. Submit the stream processing job, custom code going through the continuous delivery platform; 3. Launch the job on the container runtime (Titus)]
  32. The stream processing platform layered cake offers flexible services to our internal customers (engineers) [Layers, bottom to top: AWS EC2 → Container Runtime → SPaaS-Core → Reusable Blocks (early days: Keystone & Kafka sink, complex sessionization) → Stream Processing Applications: ES, Kafka & Hive sink routers (Flink in test)]
  33. Flink job deployment on the container runtime [Diagram: a Titus job runs the Job Managers (master + standby) coordinated via Zookeeper; other Titus jobs run Task Managers across Titus hosts, each container with its own IP from an AWS VPC ENI; 1. offsets checkpointed to fronting Kafka, 2. checkpoints / snapshots written by the job]
  34. SPaaS runs on Titus, Netflix’s in-house container runtime [Diagram: Titus UI and CI/CD call the Titus API (Rhea); the Titus Master (job management & scheduler, with Fenzo) is backed by Cassandra and Zookeeper and drives the Mesos Master and the EC2 Autoscaling API; Titus Agents on hosts run Docker containers (including the SPaaS job) with a Titus executor, metrics agent, logging agent, and zfs, pulling images from the Docker Registry and shipping logs to S3]
  35. Next, let’s look at the 3 SPaaS stream processing use cases
  36. 1. The Keystone pipeline router is a massively parallel use case • Every data stream is independent and isolated • No dependencies between tasks • Chained operators • Only state: Kafka offset checkpointing [Job plan from the JobManager UI] – a minimal job sketch follows
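  A minimal sketch of such a stateless router job, written against the Flink 1.2/1.3-era Kafka connector API of the talk (topic names, filter, and projection below are illustrative assumptions, not the real router):

      import java.util.Properties;
      import org.apache.flink.streaming.api.datastream.DataStream;
      import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
      import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer010;
      import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer010;
      import org.apache.flink.streaming.util.serialization.SimpleStringSchema;

      // Stateless router: consume one topic, filter, project, write to a sink.
      public class RouterJobSketch {
          public static void main(String[] args) throws Exception {
              StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
              env.enableCheckpointing(30_000); // Kafka offsets are the only state

              Properties props = new Properties();
              props.setProperty("bootstrap.servers", "fronting-kafka:9092");
              props.setProperty("group.id", "router-sketch");

              DataStream<String> events = env.addSource(
                      new FlinkKafkaConsumer010<>("source-topic", new SimpleStringSchema(), props));

              events.filter(e -> e.contains("\"type\":\"PLAY\""))          // filter (DSL-driven in reality)
                    .map(e -> e.substring(0, Math.min(e.length(), 1024)))  // stand-in for projection
                    .addSink(new FlinkKafkaProducer010<>(
                            "consumer-kafka:9092", "sink-topic", new SimpleStringSchema()));

              env.execute("keystone-router-sketch");
          }
      }

  Because every operator here is chainable, the whole pipeline collapses into a single task per parallel slot, which is what the "chained operators" bullet on slide 36 refers to.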
  37. Broker router – each Flink job reads from one topic, and each task independently polls events from assigned partitions [Diagram: ES Router, Kafka Router]
  38. Prod scale processed by the ES & Kafka Flink routers and the Hive Samza routers • 1,300,000,000,000+ events processed / day (≈ 15M events/sec on average) • 3+ PB in, 9+ PB out / day • 99%+ availability YTD
  39. Prod – trending daily events (approximate): ≅ 80B to 1.3T [chart]
  40. Prod scale – only the Kafka and ES Flink routers are deployed in prod (Hive-output Flink routers are in test, unaccounted below) • 4000+ Kafka brokers, 50+ clusters • 100s of data streams (Flink jobs) • 3700+ Docker containers running • 1400+ nodes with 22K+ CPU cores
  41. The router’s large scale, in volume and in streams deployed in the cloud, surfaces challenges that would go unnoticed otherwise • S3 checkpointing backend • S3 outage = router downtime, so rely on Kafka offset commits only, like Samza • Pressure on S3 if deployments are not staggered • Disable distributed checkpointing; only the JobManager writes to the checkpointing backend (config sketched below)
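  A minimal sketch of the checkpointing setup these bullets imply: an S3-backed state backend, with the Kafka consumer also committing offsets on checkpoints so the job can fall back to Kafka-committed offsets alone (Samza-style) if S3 misbehaves. Bucket, interval, and topic are illustrative, and the snippet assumes a connector version with setCommitOffsetsOnCheckpoints:

      import java.util.Properties;
      import org.apache.flink.runtime.state.filesystem.FsStateBackend;
      import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
      import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer010;
      import org.apache.flink.streaming.util.serialization.SimpleStringSchema;

      public class CheckpointConfigSketch {
          public static void main(String[] args) throws Exception {
              StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
              // S3-backed checkpoints: an S3 outage stalls checkpointing, not routing.
              env.setStateBackend(new FsStateBackend("s3://example-bucket/flink/checkpoints"));
              env.enableCheckpointing(60_000);

              Properties props = new Properties();
              props.setProperty("bootstrap.servers", "fronting-kafka:9092");
              props.setProperty("group.id", "router-sketch");

              FlinkKafkaConsumer010<String> consumer =
                      new FlinkKafkaConsumer010<>("source-topic", new SimpleStringSchema(), props);
              // Also commit offsets back to Kafka on each checkpoint, so recovery can
              // proceed from Kafka offsets alone if the S3 backend is unavailable.
              consumer.setCommitOffsetsOnCheckpoints(true);

              env.addSource(consumer).print();
              env.execute("checkpoint-config-sketch");
          }
      }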
  42. The router’s large scale, in volume and in streams deployed in the cloud, also surfaces failure-handling challenges • A single failed task causes the whole job to restart (with JVMs still running) • Need fine-grained recovery (Phase 1), FLIP-1 • Failures can at times cause a few more duplicates than Samza
  43. [Same deployment diagram as slide 33: a Task Manager failure (X) causes a Flink job restart]
  44. Measurable cost savings moving the ES and Kafka routers from Samza to Flink • Disclaimer: when comparing Flink and Samza, you may observe different results in your own environment and setup • This is not an exact apples-to-apples comparison • Observed significant savings by migrating the ES and Kafka routers to Flink on the new container runtime vs Samza on the old container runtime
  45. 2. Enriching user video plays with “discovery” attributes using Flink • Talks to other live services • Integration with the IPC ecosystem • Needs high throughput • Small state • O(100M) events / day. Not in production; no Keystone Management support yet
  46. The challenges with the event enrichment use case ● Access data (slow / fast changing) from live or static sources ● Play nice: avoid impact on member streaming ● Reliability and stability ● Dependency isolation (JAR hell) ● Backfill (historical data / dealing with bugs)
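  One way to picture this use case is Flink's async I/O operator (AsyncDataStream, introduced in Flink 1.2; shown here with the later ResultFuture API), which keeps slow RPCs to live services from blocking the stream. The discovery client, event format, and timeout/capacity numbers below are hypothetical stand-ins:

      import java.util.Collections;
      import java.util.concurrent.CompletableFuture;
      import java.util.concurrent.TimeUnit;
      import org.apache.flink.streaming.api.datastream.AsyncDataStream;
      import org.apache.flink.streaming.api.datastream.DataStream;
      import org.apache.flink.streaming.api.functions.async.ResultFuture;
      import org.apache.flink.streaming.api.functions.async.RichAsyncFunction;

      // Enrich each play event with "discovery" attributes fetched from a live
      // service, without blocking the pipeline on each RPC.
      public class EnrichPlayEvents extends RichAsyncFunction<String, String> {

          @Override
          public void asyncInvoke(String playEvent, ResultFuture<String> resultFuture) {
              CompletableFuture<String> attrs = DiscoveryClient.lookupAsync(playEvent);
              attrs.whenComplete((discovered, err) ->
                      resultFuture.complete(Collections.singleton(
                              err == null ? playEvent + "|" + discovered : playEvent)));
          }

          public static DataStream<String> enrich(DataStream<String> playEvents) {
              // Unordered: results may overtake each other, maximizing throughput.
              return AsyncDataStream.unorderedWait(
                      playEvents, new EnrichPlayEvents(), 1, TimeUnit.SECONDS, 100);
          }

          // Hypothetical async IPC client standing in for a real live service.
          static class DiscoveryClient {
              static CompletableFuture<String> lookupAsync(String event) {
                  return CompletableFuture.completedFuture("genre=drama");
              }
          }
      }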
  47. 3. Complex sessionization of user events using Flink • Create sessions with start and end events, determined from the event payload and event-time order (punctuated) • Handle late and out-of-order events • 2 to 24 hour session window durations • O(10B) events / day, testing with a small fraction of this volume (Flink job state 100 GB+). Not in production; no Keystone Management support yet
  48. The challenges with the complex sessionization use case with large state ● Flink’s supported session window with a gap duration is not sufficient (baseline sketched below) ● Developed custom, complex session windows – done ● Large state & large scale ● Quick checkpoints, and fast recovery from job failures ● Incremental checkpointing ● Exploring other storage strategies with the Flink community
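  For contrast, here is the built-in gap-based session window, the baseline the first bullet says was insufficient: sessions close after a fixed idle gap rather than on start/end events found in the payload. The toy event format and gap duration are illustrative:

      import org.apache.flink.streaming.api.TimeCharacteristic;
      import org.apache.flink.streaming.api.datastream.DataStream;
      import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
      import org.apache.flink.streaming.api.functions.timestamps.AscendingTimestampExtractor;
      import org.apache.flink.streaming.api.windowing.assigners.EventTimeSessionWindows;
      import org.apache.flink.streaming.api.windowing.time.Time;

      // Baseline only: fixed-gap session windows, not the payload-delimited
      // (punctuated) custom windows described on slide 48.
      public class GapSessionSketch {
          public static void main(String[] args) throws Exception {
              StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
              env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);

              // Toy events: "userId:action:timestampMillis"
              DataStream<String> events = env.fromElements(
                      "u1:play:1000", "u1:pause:2000", "u2:play:3000");

              events.assignTimestampsAndWatermarks(new AscendingTimestampExtractor<String>() {
                        @Override
                        public long extractAscendingTimestamp(String e) {
                            return Long.parseLong(e.split(":")[2]); // event time from payload
                        }
                    })
                    .keyBy(e -> e.split(":")[0])                               // key by user id
                    .window(EventTimeSessionWindows.withGap(Time.minutes(30))) // close after 30 min idle
                    .reduce((a, b) -> a + "|" + b)                             // stand-in aggregation
                    .print();

              env.execute("gap-session-sketch");
          }
      }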
  49. We have realized that these three use cases represent a large set of the challenges / requirements for the SPaaS platform ● Router – massively parallel, almost no state, very large scale ● Event enrichment – small state, medium scale ● Complex sessionization – large state, large scale
  50. In addition, there are several other challenges that cut across use cases ● Developer tooling & testing ● Insight and operations ● Continuity through upgrades & deployments ● Data parity & canary tooling ● Thinking streaming-first – always on, with operational responsibilities ● Cross-region event aggregation and routing ● Auto scaling & capacity planning
  51. Community Contributions
  52. We are contributing by running Flink at scale in the cloud (pioneer tax), and more ● Metrics, operations, deployment ● Custom, complex session windows ● Fault tolerance, large state management ● Challenges related to massively parallel jobs ● Adaptation of our patch – https://github.com/apache/flink/pull/3312
  53. Conclusion – You got a glimpse of how we are leveraging Flink as part of our stream processing platform to serve the business-insight needs of engineers at Netflix. We have come a long way, but we have only just begun our journey in the quest for fast data. If you are on a similar journey, have ideas, or would like to collaborate to move Flink forward, we would like to hear from you
