
Migrating Airflow-based Apache Spark Jobs to Kubernetes – the Native Way


At Nielsen Identity, we use Apache Spark to process tens of terabytes of data, running on AWS EMR. We started at a point where Spark was not even supported out of the box by EMR, and today we're spinning up clusters with thousands of nodes on a daily basis, orchestrated by Airflow. A few months ago, we embarked on a journey to evaluate Kubernetes as our Spark infrastructure, mainly to reduce operational costs and improve stability (as we rely heavily on Spot Instances for our clusters). To achieve those goals, we combined the open-source GCP Spark-on-K8s operator (https://github.com/GoogleCloudPlatform/spark-on-k8s-operator) with a native Airflow integration we developed and recently contributed back to the Airflow project (https://issues.apache.org/jira/browse/AIRFLOW-6542). Finally, we were able to migrate our existing Airflow DAGs from AWS EMR to K8s with minimal changes.



  1. Data Teams Unite!
  2. @ItaiYaffe, @RTeveth Take a look at your data pipeline...
  3. @ItaiYaffe, @RTeveth … Now back to me
  4. @ItaiYaffe, @RTeveth … Now back at your pipeline
  5. @ItaiYaffe, @RTeveth … Now back to me
  6. @ItaiYaffe, @RTeveth Sadly, your pipeline isn’t me...
  7. @ItaiYaffe, @RTeveth But if you migrate your Airflow Spark jobs to Kubernetes...
  8. @ItaiYaffe, @RTeveth You can boost your data pipeline like me: SAVE $10,000’s/month, GAIN visibility, MAKE your systems robust
  9. Migrating Airflow-based Spark jobs to K8s the native way - Roi Teveth, Nielsen; Itai Yaffe, Imply
  10. @ItaiYaffe, @RTeveth Introduction: Itai Yaffe (@ItaiYaffe) - Principal Solutions Architect @ Imply, prev. Big Data Tech Lead @ Nielsen; Roi Teveth (@RTeveth) - Big Data developer @ Nielsen Identity, Kubernetes evangelist
  11. @ItaiYaffe, @RTeveth What will you learn? How to easily migrate your Spark jobs to K8s
  12. @ItaiYaffe, @RTeveth What will you learn? How to easily migrate your Spark jobs to K8s to reduce costs, gain visibility and robustness
  13. @ItaiYaffe, @RTeveth What will you learn? How to easily migrate your Spark jobs to K8s to reduce costs, gain visibility and robustness, using Airflow as your workflow management platform
  14. @ItaiYaffe, @RTeveth Nielsen Identity ● Data and measurement company ● Media consumption ● Single source of truth of individuals and households ○ Unifies many proprietary datasets ○ Generates a holistic view of a consumer
  15. @ItaiYaffe, @RTeveth Nielsen Identity in numbers: >10B events/day, 60TB/day to S3, 6,000 nodes/day, 10’s of TB/day ingested to Druid
  16. @ItaiYaffe, @RTeveth The challenges: Scalability, Cost Efficiency, Fault-tolerance
  17. @ItaiYaffe, @RTeveth Why do we need Airflow? ● Dozens of ETL workflows running around the clock ● Originally used AWS Data Pipeline for workflow management ● But we also wanted: ○ Better visibility of configuration and workflow ○ Better monitoring and statistics ○ To share common configuration/code between workflows
  18. @ItaiYaffe, @RTeveth Why do we love Airflow? Met all requirements & more: ~2 years in production, ~20 automatic DAG deployments/day, ~1000 DAG Runs/day, ~40 users across 4 groups, 6 contributions to open source
  19. @ItaiYaffe, @RTeveth Common data pipeline pattern - Airflow DAG
  20. @ItaiYaffe, @RTeveth Common data pipeline pattern - high-level architecture: 1. Read input files from the Data Lake → Data Processing → 2. Write output files to Intermediate Storage → 3. Ingest to the OLAP DB
  21. @ItaiYaffe, @RTeveth Spark clusters ● Available cluster managers ○ Mesos, YARN, Standalone and K8s ● Managed Spark on public clouds ○ AWS EMR, Databricks, GCP Dataproc, etc.
  22. @ItaiYaffe, @RTeveth Common data pipeline pattern - high-level architecture: 1. Read input files from the Data Lake → Data Processing → 2. Write output files to Intermediate Storage → 3. Ingest to the OLAP DB
  23. @ItaiYaffe, @RTeveth What is EMR? EMR is an AWS managed service to run Hadoop & Spark clusters
  24. @ItaiYaffe, @RTeveth What is EMR? EMR is an AWS managed service to run Hadoop & Spark clusters. Allows you to reduce costs by using Spot Instances
  25. @ItaiYaffe, @RTeveth What is EMR? EMR is an AWS managed service to run Hadoop & Spark clusters. Allows you to reduce costs by using Spot Instances. Charges a management cost for each instance in a cluster
  26. @ItaiYaffe, @RTeveth EMR pricing - example: Cluster cost = $1000
  27. @ItaiYaffe, @RTeveth EMR pricing - example*: Cluster cost $1000 = EC2 cost $650 (* based on current i3.8xlarge Spot pricing; this may vary depending on the region, instance type, etc.)
  28. @ItaiYaffe, @RTeveth EMR pricing - example*: Cluster cost $1000 = EC2 cost $650 + EMR cost $350 (* based on current i3.8xlarge Spot pricing; this may vary depending on the region, instance type, etc.)
  29. @ItaiYaffe, @RTeveth Running Airflow-based Spark jobs on EMR ● EMR has official Airflow support: emr_create_job_flow_operator (creates a new EMR cluster), emr_add_steps_operator (adds a Spark step to the cluster), emr_step_sensor (checks if the step succeeded) ● Open-source, remember? ○ Allows us to fix existing components ■ EmrStepSensor fixes (AIRFLOW-3297) ○ … As well as add new components ■ AWS Athena Sensor (AIRFLOW-3403) ■ OpenFaaS hook (AIRFLOW-3411)
  30. @ItaiYaffe, @RTeveth Running Airflow-based Spark jobs on EMR ● EMR has official Airflow support: emr_create_job_flow_operator (creates a new EMR cluster), emr_add_steps_operator (adds a Spark step to the cluster), emr_step_sensor (checks if the step succeeded) ● Open-source, remember? ○ Allows us to fix existing components ■ EmrStepSensor fixes (AIRFLOW-3297) ○ … As well as add new components ■ AWS Athena Sensor (AIRFLOW-3403) ■ OpenFaaS hook (AIRFLOW-3411) — This was great... (a minimal DAG sketch follows below)
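
To make the EMR flow concrete, here is a minimal sketch of such a DAG, using Airflow 1.10-era contrib import paths; the cluster and step definitions (JOB_FLOW_OVERRIDES, SPARK_STEPS) and the jar/class names are illustrative placeholders, not the actual Nielsen configuration:

    from airflow import DAG
    from airflow.contrib.operators.emr_create_job_flow_operator import EmrCreateJobFlowOperator
    from airflow.contrib.operators.emr_add_steps_operator import EmrAddStepsOperator
    from airflow.contrib.sensors.emr_step_sensor import EmrStepSensor
    from airflow.utils.dates import days_ago

    # Illustrative cluster definition - instance types, counts and Spot
    # market settings are placeholders.
    JOB_FLOW_OVERRIDES = {
        "Name": "spark-cluster",
        "ReleaseLabel": "emr-5.30.0",
        "Applications": [{"Name": "Spark"}],
        "Instances": {
            "InstanceGroups": [
                {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge",
                 "InstanceCount": 1, "Market": "SPOT"},
                {"InstanceRole": "CORE", "InstanceType": "i3.8xlarge",
                 "InstanceCount": 10, "Market": "SPOT"},
            ],
            "KeepJobFlowAliveWhenNoSteps": False,
        },
    }

    # Illustrative Spark step; the class and jar path are placeholders.
    SPARK_STEPS = [{
        "Name": "process-events",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "--class", "com.example.ProcessEvents",
                     "s3://my-bucket/jars/app.jar"],
        },
    }]

    with DAG("emr_spark_pipeline", schedule_interval="@daily",
             start_date=days_ago(1)) as dag:
        # Spins up a new EMR cluster and pushes its cluster ID to XCom.
        create_cluster = EmrCreateJobFlowOperator(
            task_id="create_cluster",
            job_flow_overrides=JOB_FLOW_OVERRIDES,
        )
        # Adds the Spark step to the cluster created above.
        add_step = EmrAddStepsOperator(
            task_id="add_step",
            job_flow_id="{{ task_instance.xcom_pull(task_ids='create_cluster') }}",
            steps=SPARK_STEPS,
        )
        # Polls the step until it succeeds or fails.
        watch_step = EmrStepSensor(
            task_id="watch_step",
            job_flow_id="{{ task_instance.xcom_pull(task_ids='create_cluster') }}",
            step_id="{{ task_instance.xcom_pull(task_ids='add_step')[0] }}",
        )
        create_cluster >> add_step >> watch_step

The three tasks mirror the slide exactly: create the cluster, add the Spark step, and watch it, with XCom passing the cluster and step IDs along the chain.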
  31. @ItaiYaffe, @RTeveth But we wanted MORE! $$$, Visibility, Robustness
  32. @ItaiYaffe, @RTeveth Introducing - Spark-on-Kubernetes
  33. @ItaiYaffe, @RTeveth Let’s explain what Kubernetes (a.k.a. K8s) is ● Open-source platform for running and managing containerized workloads ● Includes ○ Built-in controllers to support various workloads (e.g. micro-services) ○ Additional extensions (called “operators”) to support custom workloads ● Highly scalable
  34. @ItaiYaffe, @RTeveth Basic Kubernetes terminology: a Cluster consists of a Control plane and Worker nodes (EC2 in our case); Pods are groups of one or more containers (such as Docker containers)
  35. @ItaiYaffe, @RTeveth Basic Kubernetes terminology ● kubectl - the K8s CLI
  36. @ItaiYaffe, @RTeveth Basic Kubernetes terminology ● The term “operator” exists both in Airflow and in Kubernetes
  37. @ItaiYaffe, @RTeveth Basic Kubernetes terminology ● The term “operator” exists both in Airflow and in Kubernetes ● An Airflow operator ○ Represents a single task ○ Operators determine what is actually executed when your DAG runs ○ Example: ■ bash-operator - executes a bash command
  38. @ItaiYaffe, @RTeveth Basic Kubernetes terminology ● The term “operator” exists both in Airflow and in Kubernetes ● A Kubernetes operator ○ An additional extension to Kubernetes ○ Holds the knowledge of how to manage a specific application ○ Example: ■ postgres-operator - defines and manages a PostgreSQL cluster
  39. @ItaiYaffe, @RTeveth Basic Kubernetes terminology ● The term “operator” exists both in Airflow and in Kubernetes ● A Kubernetes operator ○ A non-core Kubernetes controller ○ Holds the knowledge of how to manage a specific application ○ Example: ■ postgres-operator - defines and manages a PostgreSQL cluster ● Operator != Operator (see the snippet below)
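
To keep the two meanings apart, here is what the Airflow kind of operator looks like in code - a minimal sketch in which the DAG id and command are illustrative:

    from airflow import DAG
    from airflow.operators.bash_operator import BashOperator
    from airflow.utils.dates import days_ago

    # In Airflow, an operator is a task template: instantiating one inside
    # a DAG creates a single task.
    with DAG("operator_demo", schedule_interval=None, start_date=days_ago(1)) as dag:
        hello = BashOperator(task_id="hello",
                             bash_command="echo 'hello from an Airflow operator'")

A Kubernetes operator, by contrast, is never instantiated inside a DAG: it is a controller deployed into the cluster itself, like the postgres-operator above or the Spark operator introduced below.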
  40. @ItaiYaffe, @RTeveth Kubernetes auto-scale - Phase 1: no applications are running on the cluster
  41. @ItaiYaffe, @RTeveth Kubernetes auto-scale - Phase 2: application #1 starts running
  42. @ItaiYaffe, @RTeveth Kubernetes auto-scale - Phase 3: the cluster scales up as needed
  43. @ItaiYaffe, @RTeveth Kubernetes auto-scale - Phase 4: application #2 starts running
  44. @ItaiYaffe, @RTeveth Kubernetes auto-scale - Phase 5: the cluster scales up as needed
  45. @ItaiYaffe, @RTeveth Kubernetes auto-scale - Phase 6: applications finished running, the cluster scales down
  46. @ItaiYaffe, @RTeveth Kubernetes in a nutshell ● A platform for running and managing containerized workloads ● Each cluster has ○ 1 control plane ○ 0..X worker nodes ○ 0..Y pods ○ 0..Z applications running concurrently ● Kubernetes operator != Airflow operator ● Automatically scales out and in
  47. @ItaiYaffe, @RTeveth Cool, so… back to Spark-on-Kubernetes?
  48. @ItaiYaffe, @RTeveth Spark-On-Kubernetes overview ● Since Spark 2.3.0, K8s is supported as a cluster manager ● No additional management cost per instance ○ You only pay a small fee for the K8s cluster itself (e.g. $60/month on AWS) ● This is still experimental, and some features are missing ○ E.g. Dynamic Resource Allocation and the External Shuffle Service
  49. @ItaiYaffe, @RTeveth Submitting a Spark application to Kubernetes - alternatives: 1. Using the spark-submit script 2. Using the Spark-On-Kubernetes operator
  50. @ItaiYaffe, @RTeveth Spark-submit example - SparkPi: ./bin/spark-submit --master k8s://https://<k8s-apiserver-host>:<k8s-apiserver-port> --deploy-mode cluster --name spark-pi --class org.apache.spark.examples.SparkPi --conf spark.executor.instances=3 --conf spark.kubernetes.container.image=<spark-image> local:///path/to/examples.jar
  51. @ItaiYaffe, @RTeveth Spark-submit example - SparkPi: ./bin/spark-submit --master k8s://https://<k8s-apiserver-host>:<k8s-apiserver-port> --deploy-mode cluster --name spark-pi --class org.apache.spark.examples.SparkPi --conf spark.executor.instances=3 --conf spark.kubernetes.container.image=<spark-image> local:///path/to/examples.jar
  52. @ItaiYaffe, @RTeveth Spark-submit example - SparkPi: ./bin/spark-submit --master k8s://https://<k8s-apiserver-host>:<k8s-apiserver-port> --deploy-mode cluster --name spark-pi --class org.apache.spark.examples.SparkPi --conf spark.executor.instances=3 --conf spark.kubernetes.container.image=<spark-image> local:///path/to/examples.jar (diagram: the SparkPi driver and Executors 1-3 run as pods in the Kubernetes cluster, coordinated by the Kubernetes control plane)
  53. @ItaiYaffe, @RTeveth Spark-On-Kubernetes operator ● A Kubernetes operator ● Extends the Kubernetes API to support Spark applications natively ● Built by GCP as an open-source project: github.com/GoogleCloudPlatform/spark-on-k8s-operator
  54. @ItaiYaffe, @RTeveth Spark-On-Kubernetes operator example - SparkPi (spark-pi.yaml): apiVersion: "sparkoperator.k8s.io/v1beta2" kind: SparkApplication metadata: name: "spark-pi" namespace: default spec: ... driver: ... executor: ...
  55. @ItaiYaffe, @RTeveth Spark-On-Kubernetes operator example - SparkPi (spark-pi.yaml): apiVersion: "sparkoperator.k8s.io/v1beta2" kind: SparkApplication metadata: name: "spark-pi" namespace: default spec: ... driver: ... executor: ... (diagram: kubectl submits the manifest to the Kubernetes control plane, where the Spark-on-K8s operator picks it up)
  56. @ItaiYaffe, @RTeveth Spark-On-Kubernetes operator example - SparkPi (spark-pi.yaml): apiVersion: "sparkoperator.k8s.io/v1beta2" kind: SparkApplication metadata: name: "spark-pi" namespace: default spec: ... driver: ... executor: ... (diagram: the operator then launches the SparkPi driver and Executors 1-3 as pods in the cluster; a programmatic submission sketch follows below)
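
The manifest can be submitted with kubectl apply -f spark-pi.yaml, or programmatically. As an illustration, here is the equivalent call with the official Kubernetes Python client; the abbreviated spec fields merely stand in for the parts elided ("...") on the slide and are not the operator's full schema:

    from kubernetes import client, config

    config.load_kube_config()  # reads ~/.kube/config, just like kubectl does

    # Abbreviated SparkApplication object; driver/executor settings are
    # illustrative placeholders.
    spark_pi = {
        "apiVersion": "sparkoperator.k8s.io/v1beta2",
        "kind": "SparkApplication",
        "metadata": {"name": "spark-pi", "namespace": "default"},
        "spec": {
            "type": "Scala",
            "mode": "cluster",
            "image": "<spark-image>",
            "mainClass": "org.apache.spark.examples.SparkPi",
            "mainApplicationFile": "local:///path/to/examples.jar",
            "driver": {"cores": 1, "memory": "512m"},
            "executor": {"instances": 3, "cores": 1, "memory": "512m"},
        },
    }

    # Creating the custom resource is what 'kubectl apply' does; the
    # Spark-on-K8s operator reacts by launching the driver and executor pods.
    client.CustomObjectsApi().create_namespaced_custom_object(
        group="sparkoperator.k8s.io",
        version="v1beta2",
        namespace="default",
        plural="sparkapplications",
        body=spark_pi,
    )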
  57. @ItaiYaffe, @RTeveth Submitting a Spark application to K8s - spark-submit: Airflow built-in integration: V | Customize Spark pods: X* | Easy access to Spark UI: X | Submit and view application from kubectl: X
  58. @ItaiYaffe, @RTeveth Submitting a Spark application to K8s (Spark-submit / Spark-On-K8s operator): Airflow built-in integration: V / X | Customize Spark pods: X* / V | Easy access to Spark UI: X / V | Submit and view application from kubectl: X / V
  59. @ItaiYaffe, @RTeveth Integrate it with Airflow
  60. @ItaiYaffe, @RTeveth So… we decided to take the road less traveled
  61. @ItaiYaffe, @RTeveth So… we decided to take the road less traveled: github.com/apache/airflow/pull/7163
  62. @ItaiYaffe, @RTeveth A special thanks to Airflow committers @CzerwonyElmo (Kamil Breguła), @kaxil (Kaxil Naik), @AshBerlin (Ash Berlin-Taylor), @higrys (Jarek Potiuk)
  63. @ItaiYaffe, @RTeveth Airflow SparkKubernetes integration: KubernetesHook, SparkKubernetesOperator, SparkKubernetesSensor (a usage sketch follows below)
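
Put together, a DAG using this integration can look like the following minimal sketch, modeled on the provider's example DAG; the import paths are those of Airflow 2.0 / the backport package, and spark-pi.yaml is the (templatable) SparkApplication manifest from the earlier slides:

    from airflow import DAG
    from airflow.providers.cncf.kubernetes.operators.spark_kubernetes import SparkKubernetesOperator
    from airflow.providers.cncf.kubernetes.sensors.spark_kubernetes import SparkKubernetesSensor
    from airflow.utils.dates import days_ago

    with DAG("spark_pi_on_k8s", schedule_interval="@daily",
             start_date=days_ago(1)) as dag:
        # Submits the SparkApplication manifest to the cluster, using
        # credentials stored in an Airflow connection.
        submit = SparkKubernetesOperator(
            task_id="spark_pi_submit",
            namespace="default",
            application_file="spark-pi.yaml",
            kubernetes_conn_id="kubernetes_default",
            do_xcom_push=True,
        )
        # Polls the SparkApplication's status through the Kubernetes API
        # until the driver reports success or failure.
        monitor = SparkKubernetesSensor(
            task_id="spark_pi_monitor",
            namespace="default",
            application_name="{{ task_instance.xcom_pull(task_ids='spark_pi_submit')['metadata']['name'] }}",
            kubernetes_conn_id="kubernetes_default",
        )
        submit >> monitor

Note how the operator/sensor pair mirrors the EMR pattern from earlier (submit a job, then watch it), which is what makes the migration of existing DAGs mostly a drop-in change.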
  64. @ItaiYaffe, @RTeveth What have we gained by building this integration? ● Official built-in Airflow support ● Security ○ Kubernetes credentials are saved using Airflow’s connection mechanism ● Portability ○ The Kubernetes object is templated, so the same application can easily be migrated to Airflow and also be run manually ● Kubernetes-native ○ Communicates directly with the Kubernetes API
  65. @ItaiYaffe, @RTeveth Common data pipeline pattern - revised
  66. @ItaiYaffe, @RTeveth Common data pipeline pattern - revised: 1. Read input files from the Data Lake → Data Processing → 2. Write output files to Intermediate Storage → 3. Ingest to the OLAP DB
  67. @ItaiYaffe, @RTeveth Common data pipeline pattern - revised: 1. Read input files from the Data Lake → Data Processing → 2. Write output files to Intermediate Storage → 3. Ingest to the OLAP DB
  68. @ItaiYaffe, @RTeveth Common data pipeline pattern - revised: 1. Read input files from the Data Lake → Data Processing → 2. Write output files to Intermediate Storage → 3. Ingest to the OLAP DB — What’s missing?
  69. @ItaiYaffe, @RTeveth Connecting the dots… making it production-ready
  70. @ItaiYaffe, @RTeveth Visibility ● Spark History Server ○ Each K8s namespace has a dedicated Spark History Server ● Metrics ○ Spark metrics are exposed via JmxSink (github.com/prometheus/jmx_exporter) ○ System metrics are collected using github.com/kubernetes/kube-state-metrics ● Dashboards ○ Aggregating both Spark and system metrics
  71. @ItaiYaffe, @RTeveth Visibility ● Logging ○ All logs are collected with Filebeat into Elasticsearch ● Alerting ○ Airflow callbacks emit metrics which trigger alerts when needed (see the sketch below)
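
As a sketch of that alerting hook (the StatsD client, metric name and DAG are illustrative, not the exact Nielsen setup), an Airflow failure callback that emits a metric might look like this:

    import statsd  # assumes the 'statsd' client package; any metrics client works

    from airflow import DAG
    from airflow.operators.bash_operator import BashOperator
    from airflow.utils.dates import days_ago

    stats = statsd.StatsClient("localhost", 8125)

    def on_failure(context):
        # Count failures per task; a downstream system (e.g. a StatsD or
        # Prometheus pipeline) turns the counter into an alert.
        stats.incr("airflow.task_failure.%s" % context["task_instance"].task_id)

    with DAG(
        "alerting_demo",
        schedule_interval="@daily",
        start_date=days_ago(1),
        # default_args are applied to every task in the DAG, so each task
        # failure fires the callback.
        default_args={"on_failure_callback": on_failure},
    ) as dag:
        flaky = BashOperator(task_id="flaky_step", bash_command="exit 1")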
  72. @ItaiYaffe, @RTeveth Robustness ● Running a Spark job across multiple AZs ○ Can be beneficial when using Spot Instances (depending on the amount of shuffling) ● AWS Node Termination Handler ○ Allows K8s to gracefully handle events such as EC2 Spot interruptions ○ Open source (github.com/aws/aws-node-termination-handler)
  73. @ItaiYaffe, @RTeveth Benefits from migrating to Kubernetes ● ~30% cost reduction ○ No additional cost per instance ● Better visibility ● Robustness
  74. @ItaiYaffe, @RTeveth Airflow integration current status ● Will be available in Airflow 2.0 ● Can’t wait? Check out the backport package for Airflow 1.10.12: tinyurl.com/y6xb7s3h
  75. @ItaiYaffe, @RTeveth So with minimal changes...
  76. @ItaiYaffe, @RTeveth You can boost your data pipeline like me: SAVE $10,000’s/month, GAIN visibility, MAKE your systems robust
  77. @ItaiYaffe, @RTeveth Want to know more? ● Women in Big Data ○ A world-wide program that aims to inspire, connect, grow, and champion the success of women in the Big Data & analytics field ○ 30+ chapters and 17,000+ members world-wide ○ Everyone can join (regardless of gender), so find a chapter near you - https://www.womeninbigdata.org/wibd-structure/ ● Our Tech Blog - medium.com/nmc-techblog ○ Spark Dynamic Partition Inserts Part 1 - https://tinyurl.com/yd94ztz5 ○ Spark Dynamic Partition Inserts Part 2 - https://tinyurl.com/y8uembml
  78. QUESTIONS
  79. THANK YOU - Roi Teveth, Itai Yaffe
  80. Feedback: Your feedback is important to us. Don’t forget to rate and review the sessions.
