
Beam on Kubernetes (ApacheCon NA 2019)


Access to real-time data is increasingly important for many organizations. At Lyft, we process millions of events per second in real time to compute prices, balance marketplace dynamics, and detect fraud, among many other use cases. To do so, we run dozens of Apache Flink and Apache Beam pipelines. Flink provides a powerful framework that makes it easy for non-experts to write correct, high-scale streaming jobs, while Beam extends that power to our large base of Python programmers.

Historically, we have run Flink clusters on bare, custom-managed EC2 instances. In order to achieve greater elasticity and reliability, we decided to rebuild our streaming platform on top of Kubernetes. In this session, I’ll cover how we designed and built an open-source Kubernetes operator for Flink and Beam, some of the unique challenges of running a complex, stateful application on Kubernetes, and some of the lessons we learned along the way.
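
The operator described above extends the Kubernetes API with a FlinkApplication custom resource (shown on slides 18 and 22 below), so deploying and managing a pipeline becomes an ordinary kubectl workflow. A minimal sketch, assuming the operator and its CRD are installed and register the flinkapplication/flinkapplications resource names:

$ kubectl apply -f wordcount.yaml              # wordcount.yaml holds a FlinkApplication manifest
$ kubectl get flinkapplications                # list managed applications and their current phase
$ kubectl delete flinkapplication wordcount    # remove the application; the operator tears down the job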


Beam on Kubernetes (ApacheCon NA 2019)

  4. SQL
  5. (Diagram: ELB → JobManager, TaskManagers (TM))
  6. (Diagram: TM/JM, Jenkins, S3)
  11. $ aws ec2 run-instances --image-id ami-xxxxxxxx --instance-type c5.xlarge
  12. apiVersion: v1
      kind: Pod
      metadata:
        name: wordcount
      spec:
        containers:
          - image: lyft/wordcount:latest
            name: wordcount
            resources:
              requests:
                cpu: "16"
                memory: 32Gi

      $ aws ec2 run-instances --image-id ami-xxxxxxxx --instance-type c5.xlarge
  18. apiVersion: flink.k8s.io/v1beta1
      kind: FlinkApplication
      metadata:
        name: wordcount
      spec:
        image: lyft/wordcount:latest
        jarName: "wordcount-1.0.0-SNAPSHOT.jar"
        parallelism: 30
        entryClass: "com.lyft.WordCount"
        flinkVersion: "1.8"
        flinkConfig:
          state.backend: filesystem
        jobManagerConfig:
          resources:
            requests:
              memory: "8Gi"
              cpu: "4"
        taskManagerConfig:
          resources:
            requests:
              memory: "8Gi"
              cpu: "8"
  19. $ flink run WordCount.jar
      $ curl -XPUT
  20. $ flink stop <jobid>
      $ curl -XDELETE wordcount
  21. 1. Update jar on jobmanager
      2. Cancel job with savepoint
      3. Restart cluster
      4. Submit job with savepoint

      $ curl -XPATCH wordcount -d'{"spec/image": "wordcount:v2"}'
      (kubectl equivalents are sketched after the slide list)
  22. Two FlinkApplication manifests, identical except for the image tag:

      apiVersion: flink.k8s.io/v1beta1
      kind: FlinkApplication
      metadata:
        name: wordcount
      spec:
        image: lyft/wordcount:v1   # second manifest: lyft/wordcount:v2
        jarName: "wordcount-1.0.0-SNAPSHOT.jar"
        parallelism: 30
        entryClass: "com.lyft.WordCount"
        flinkVersion: "1.8"
        flinkConfig:
          state.backend: filesystem
        jobManagerConfig:
          resources:
            requests:
              memory: "8Gi"
              cpu: "4"
        taskManagerConfig:
          resources:
            requests:
              memory: "8Gi"
              cpu: "8"
  23. $ flink cancel <jobid>
      $ curl -XDELETE wordcount
  24. $ flink cancel <jobid>
      $ curl -XPATCH wordcount -d'{"spec/deleteMode": "ForceCancel"}'
      $ curl -XDELETE wordcount
  25. $ sp restart
      ???
  26. $ sp restart
      $ curl -XPATCH wordcount -d'{"spec/restartNonce": "xxx"}'
      (see the kubectl sketch after the slide list)
  27. Status:
        Deploy Hash:         b6c4bb26
        Failed Deploy Hash:
        Job Status:
          Completed Checkpoint Count:  3908
          Entry Class:                 com.lyft.wordcount.WordCount
          Failed Checkpoint Count:     0
          Health:                      Green
          Jar Name:                    wordcount-1.0-SNAPSHOT.jar
          Job ID:                      1ebd3cd9445dda09d1ebe5b28b1661ee
          Job Restart Count:           1
          Last Checkpoint Time:        2019-09-12T01:04:24Z
          Last Failing Time:           <nil>
          Parallelism:                 8
          Restore Time:                2019-09-10T21:18:42Z
          Start Time:                  2019-09-10T21:18:40Z
          State:                       RUNNING
        Last Seen Error:     <nil>
        Last Updated At:     2019-09-12T01:04:38Z
        Phase:               Running
        Retry Count:         0
      (a kubectl describe sketch for viewing this status follows the slide list)
  28. Events:
        Type    Reason           Age    From              Message
        ----    ------           ---    ----              -------
        Normal  CreatingCluster  2m29s  flinkK8sOperator  Creating Flink cluster for deploy a931a6c4
        Normal  CancellingJob    93s    flinkK8sOperator  Cancelling job 1ebd3cd9445dda09d1ebe5b28b1661ee with a final savepoint
        Normal  CanceledJob      63s    flinkK8sOperator  Canceled job with savepoint s3://streamingplatform/wordcount/savepoints/savepoint-1ebd3c-4e3f35489444
        Normal  JobSubmitted     59s    flinkK8sOperator  Flink job submitted to cluster with id 0c1790aa33fdc8dd2798bad1d55ddfa8
        Normal  ToreDownCluster  33s    flinkK8sOperator  Deleted old cluster with hash b6c4bb26
  29. sp
  30. (Diagram: Beam portability. Pipeline Construction (Runner API): Beam Go, Beam Java, Beam Python. Execution (Fn API): Apache Flink, Apache Spark, Cloud Dataflow.)
  31. p = beam.Pipeline(options=pipeline_options)
      messages = (p | FlinkStreamingImpulseSource()
                        .set_message_count(known_args.count)
                        .set_interval_ms(known_args.interval_ms))
      _ = (messages
           | 'decode' >> beam.Map(lambda x: ('', 1))
           | 'window' >> beam.WindowInto(window.GlobalWindows())
           | 'group' >> beam.GroupByKey()
           | 'count' >> beam.Map(count)
           | 'log' >> beam.Map(lambda x: logging.info("%d" % x[1])))
      result = p.run()
      result.wait_until_finish()
      (a command-line submission sketch follows the slide list)
  32. Kubernetes / Offline build
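
The curl -XPATCH commands on slides 21, 24, and 26 are shorthand for patching the FlinkApplication resource. With kubectl, the equivalent calls might look like the following sketch (the flinkapplication resource name and the nonce value are assumptions; the slides do not show literal kubectl syntax):

# Roll out a new image; per slides 21 and 28, the operator takes a savepoint,
# stands up a new cluster, and resubmits the job from that savepoint.
$ kubectl patch flinkapplication wordcount --type merge \
    -p '{"spec": {"image": "lyft/wordcount:v2"}}'

# Force a restart without changing the rest of the spec (slide 26's restartNonce).
$ kubectl patch flinkapplication wordcount --type merge \
    -p '{"spec": {"restartNonce": "2019-09-12-001"}}'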

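The Status and Events blocks on slides 27 and 28 are the kind of output you get when inspecting the custom resource directly, for example (again assuming the flinkapplication resource name):

$ kubectl describe flinkapplication wordcount     # Status fields plus recent operator events
$ kubectl get flinkapplication wordcount -o yaml  # full spec and status in machine-readable form
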
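Slide 31's Python pipeline targets Beam's portable Flink runner. A hedged sketch of submitting it by hand, assuming the code lives in wordcount.py, a Beam job server for the Flink cluster is reachable at localhost:8099, and the script exposes --count and --interval_ms via argparse (the runner flags follow the Beam Python SDK's portable runner options; Lyft's production setup drives this through the operator and the sp tooling instead):

$ python wordcount.py \
    --runner=PortableRunner \
    --job_endpoint=localhost:8099 \
    --streaming \
    --count=100 \
    --interval_ms=1000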