Airflow Summit 2020 - Migrating Airflow-based Spark jobs to Kubernetes - the native way

Roi Teveth (Data Engineer) and Itai Yaffe (Tech Lead, Big Data group) @ Nielsen:
At Nielsen Identity Engine, we use Spark to process tens of TBs of data. Our ETLs, orchestrated by Airflow, spin up AWS EMR clusters with thousands of nodes per day.

In this talk, we’ll guide you through migrating Spark workloads to Kubernetes with minimal changes to Airflow DAGs, using the open-sourced GCP Spark-on-K8s operator and the native integration we recently contributed to the Airflow project.

  1. 1. @ItaiYaffe, @RTeveth
  2. 2. $30,000/month Savings; Open your eyes; Robust
  4. 4. Agenda: 01. ETLs with Airflow & …; 02. … = …; 03. …-on-…; 04. Airflow Integration & Contribution
  7. 7. ~30 TB ingested/day; S3; ~3,000 nodes/day; tens of TB ingested/day; Druid; $100Ks/month
  8. 8. Scalability; Cost Efficiency; Fault-tolerance
  10. 10. ~20 automatic DAG deployments/day; ~1000 DAG Runs/day; ~2 years in production; Met all requirements & more; ~40 users across 4 groups; 6 contributions to open-source
  17. 17. EMR is an AWS managed service to run Hadoop & Spark clusters. Allows you to reduce costs by using Spot instances. Charges a management cost for each instance in a cluster.
  18. 18. emr_create_job_flow_operator: creates a new EMR cluster; emr_add_steps_operator: adds a Spark step to the cluster; emr_step_sensor: checks if the step succeeded
  20. 20. $$$; Visibility; Robustness
  23. 23. Cluster; Control plane; Worker nodes (EC2 in our case); Pods: group of one or more containers (such as Docker containers)
  29. 29. Cluster; Control plane
  39. 39. ./bin/spark-submit --master k8s://https://<k8s-apiserver-host>:<k8s-apiserver-port> --deploy-mode cluster --name spark-pi --class org.apache.spark.examples.SparkPi --conf spark.executor.instances=3 --conf spark.kubernetes.container.image=<spark-image> local:///path/to/examples.jar
      Diagram: Kubernetes control plane; Kubernetes cluster (SparkPi driver, Executor 1, Executor 2, Executor 3)
  40. 40. https://github.com/GoogleCloudPlatform/spark-on-k8s-operator
  41. 41. Spark-pi.yaml:
      apiVersion: "sparkoperator.k8s.io/v1beta2"
      kind: SparkApplication
      metadata:
        name: "spark-pi"
        namespace: default
      spec:
        ...
        driver:
          ...
        executor:
          ...
      Diagram: kubectl; Spark-on-K8s operator; Kubernetes control plane; Kubernetes cluster (SparkPi driver, Executor 1, Executor 2, Executor 3)
  45. 45. https://github.com/apache/airflow/pull/7163
  47. 47. KubernetesHook; SparkKubernetes operator; SparkKubernetes sensor
  49. 49. SparkKubernetes operator:
      apiVersion: "sparkoperator.k8s.io/v1beta2"
      kind: SparkApplication
      metadata:
        name: "spark-pi-{{ ds }}-{{ task_instance.try_number }}"
        namespace: default
      spec:
        ...
        driver:
          ...
        executor:
          ...
  59. 59. on_success_callback; on_failure_callback; default_args
  61. 61. Reduce costs; Open your eyes; Robust
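
The spark-submit invocation on slide 39 can be assembled programmatically, which may help when the same job is submitted with different images or executor counts. The following is a minimal sketch using only the Python standard library; the function name `build_spark_submit_argv` and its parameters are hypothetical, and the API server address and image remain the placeholders shown on the slide.

```python
import shlex


def build_spark_submit_argv(api_server, image, app_jar, main_class,
                            name="spark-pi", executors=3):
    """Assemble a spark-submit argument list for Kubernetes cluster mode.

    Hypothetical helper: it only builds the argument list and does not
    run anything. The flags mirror the command shown on slide 39.
    """
    return [
        "./bin/spark-submit",
        "--master", f"k8s://{api_server}",
        "--deploy-mode", "cluster",
        "--name", name,
        "--class", main_class,
        "--conf", f"spark.executor.instances={executors}",
        "--conf", f"spark.kubernetes.container.image={image}",
        app_jar,
    ]


# Placeholders are kept exactly as on the slide; fill them in for a real cluster.
argv = build_spark_submit_argv(
    api_server="https://<k8s-apiserver-host>:<k8s-apiserver-port>",
    image="<spark-image>",
    app_jar="local:///path/to/examples.jar",
    main_class="org.apache.spark.examples.SparkPi",
)

# shlex.join quotes each argument so the printed line is safe to paste into a shell.
print(shlex.join(argv))
```

Keeping the arguments as a list rather than one string avoids shell-quoting mistakes if the list is later passed to `subprocess.run`.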

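
The SparkApplication manifest on slide 49 names the application with the template `spark-pi-{{ ds }}-{{ task_instance.try_number }}`. Kubernetes object names must be unique per kind within a namespace, so a per-run, per-attempt name presumably lets retries create a fresh application instead of colliding with an earlier one. The sketch below imitates that rendering with plain Python rather than Airflow's templating. The helper names and the sample date are hypothetical, and the `spec`, `driver` and `executor` sections are left empty because the slides elide them.

```python
import json


def render_app_name(ds, try_number, base="spark-pi"):
    """Mimic the template spark-pi-{{ ds }}-{{ task_instance.try_number }}."""
    return f"{base}-{ds}-{try_number}"


def build_spark_application(name, namespace="default"):
    """Build the manifest skeleton from the slides as a plain dict.

    The driver and executor sections are intentionally empty: the slides
    show them only as "...", so nothing is filled in here.
    """
    return {
        "apiVersion": "sparkoperator.k8s.io/v1beta2",
        "kind": "SparkApplication",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {"driver": {}, "executor": {}},
    }


# Hypothetical run date; two attempts of the same run get distinct names.
first_try = render_app_name("2020-07-01", 1)
second_try = render_app_name("2020-07-01", 2)

manifest = build_spark_application(first_try)

# JSON keeps the example within the standard library; kubectl accepts
# JSON manifests as well as YAML.
print(json.dumps(manifest, indent=2))
```

In an Airflow deployment the rendering would be done by the templating engine at task run time, so the manifest file itself can keep the template expression in the `name` field.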