E2E Data Pipeline - Apache Spark/Airflow/Livy
Confidential
End to End Pipelines Using Apache Spark/Livy/Airflow
An integrated solution to batch data processing
Rikin Tanna and Karunasri Maram
Capital One Auto Finance
February 12, 2020
Solution Requirements
• Scalable: handle jobs with growing data sets
• Dynamic: generate pipelines on demand to support varying job characteristics
• Parallel Execution: run multiple jobs in parallel
• Dependency Enabled: support ordering of tasks based on dependencies
• Open-Source Support: active contributions to the components used, to stay efficient
Solution: Brief
A fully integrated big data pipeline... with just 3 components!
• Apache Spark
  • Unified data analytics engine for large-scale data processing
  • Served on an EMR cluster
• Apache Livy
  • REST interface to enable easy interaction with Apache Spark
  • Served on the master node of the EMR cluster
• Apache Airflow
  • Workflow management system (WMS) to schedule, trigger, and monitor workflows from a single compute resource
  • Served on a single compute instance, with the metadata DB on a separate RDS instance
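To sketch how the three components connect: the snippet below submits a Spark batch job to Livy's POST /batches endpoint and polls it to completion. This is a minimal sketch; the Livy URL, bucket path, and job class used in the test are made-up placeholders, and the endpoint paths follow Livy's documented REST API.

```python
import time

import requests  # simple HTTP client for Livy's REST API

# Placeholder endpoint: Livy listens on the EMR master node, port 8998 by default.
LIVY_URL = "http://localhost:8998"


def build_batch_payload(file, class_name=None, args=None):
    """Build the JSON body for Livy's POST /batches endpoint."""
    payload = {"file": file}
    if class_name:
        payload["className"] = class_name
    if args:
        payload["args"] = args
    return payload


def submit_and_wait(payload, poll_seconds=10):
    """Submit a batch job and poll until Livy reports a terminal state."""
    resp = requests.post(f"{LIVY_URL}/batches", json=payload)
    resp.raise_for_status()
    batch_id = resp.json()["id"]
    while True:
        state = requests.get(f"{LIVY_URL}/batches/{batch_id}/state").json()["state"]
        if state in ("success", "dead", "killed"):
            return state
        time.sleep(poll_seconds)
```

Inside Airflow, a call like `submit_and_wait` would typically live in a task (e.g. a PythonOperator), so the scheduler handles ordering and retries around it.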
What is Airflow?
An open-source platform to programmatically author, schedule, and monitor workflows.
• Dynamic: Airflow pipelines are configured as code, allowing for dynamic pipeline generation as DAGs
• Extensible: easily extend the library by creating your own operators and executors
• General-Purpose: Airflow is written in Python, and all pipelines are configured in Python
• Accessible: a rich UI allows non-technical users to monitor workflows
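"Pipelines as code" is the key point: a DAG file is ordinary Python, so tasks can be generated in a loop from configuration. The sketch below illustrates the idea in plain Python with made-up table names; a real DAG file would instantiate Airflow operators and set dependencies inside the same kind of loop.

```python
# Config-driven pipeline generation: one extract -> transform -> load chain
# per table. In an Airflow DAG file, each name here would become an operator.
TABLES = ["accounts", "loans", "payments"]  # hypothetical table names


def build_pipeline(tables):
    """Return (tasks, edges): task names plus dependency pairs per table."""
    tasks, edges = [], []
    for t in tables:
        steps = [f"extract_{t}", f"transform_{t}", f"load_{t}"]
        tasks.extend(steps)
        # Each step depends on the previous one in its chain.
        edges.extend(zip(steps, steps[1:]))
    return tasks, edges
```

Adding a table to the config grows the pipeline automatically the next time the scheduler parses the file; nothing about the DAG structure is hard-coded.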
Why Airflow?
Comparison of common open-source workflow management systems (Airflow, Luigi, Oozie, Azkaban) along these criteria:
• Dynamic pipelines
• Rich, interactive UI
• General-purpose usability
• Scalability
• Dependency management
• Maturity/support
Apache Airflow Architecture
Source: https://medium.com/@dustinstansbury/understanding-apache-airflows-key-concepts-a96efed52b1a
● Metadata Database
○ Stores information needed by the scheduler/executor and webserver
○ Task state, DAG definitions, log locations, etc.
● Scheduler/Executor
○ Process that uses DAG definitions and task states to push tasks onto a queue for execution
● Workers
○ Process(es) that execute the logic of the tasks
● Webserver
○ Process that renders the web UI, interacting with the metadata database to let users monitor and interact with workflows
How to Get Started with Apache Airflow
1. Install Python
2. "pip install apache-airflow"
   a. Install from PyPI using pip
   b. AIRFLOW_HOME defaults to ~/airflow
3. "airflow initdb"
   a. Initialize the metadata database
4. "airflow webserver -p 8080"
   a. Start the web server; the default port is 8080
5. "airflow scheduler"
   a. Start the scheduler (which also starts executor processes)
6. Visit localhost:8080 in a browser and enable the example DAGs
Deeper Understanding
1. Connect to the metadata database (using DataGrip or DBeaver) and view its tables. See how the data changes as workflows execute.
2. Dig into the source code (https://github.com/apache/airflow) and trace the actions triggered by the scheduler CLI command.
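Step 1 under "Deeper Understanding" can also be done from Python: with the default SQLite backend, the metadata lives in ~/airflow/airflow.db. A minimal sketch, assuming the task_instance table as it exists in Airflow 1.10 (column names may differ in other versions):

```python
import sqlite3


def task_states(db_path):
    """List (dag_id, task_id, state) rows from Airflow's metadata DB.

    Assumes the default SQLite backend and the task_instance table used
    by Airflow 1.10; the schema may differ across Airflow versions.
    """
    with sqlite3.connect(db_path) as conn:
        rows = conn.execute(
            "SELECT dag_id, task_id, state FROM task_instance ORDER BY dag_id, task_id"
        ).fetchall()
    return rows
```

Running this before and after triggering a DAG makes the scheduler's state transitions (queued, running, success, failed) directly visible.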
Why Spark?
Comparison of common open-source big data processing systems (Spark, Flink, Storm) along these criteria:
• Streaming
• Batch/interactive/iterative processing
• General-purpose usability
• Scalability
• Product maturity
• Community support
Why Livy?
Comparison of common open-source Spark interfaces (Livy, spark-jobserver, Mist, Apache Toree) along these criteria:
• Streaming jobs
• Batch jobs
• General-purpose usability
• High availability
• Support for major languages (Scala/Java/Python)
• Dependency management (no code changes required)
A REST Service for Spark Jobs
○ Apache Livy is a service that enables easy interaction with a Spark cluster over a REST interface.
○ It enables easy submission of Spark jobs or snippets of Spark code, synchronous or asynchronous result retrieval, and Spark context management, all via a simple REST interface or an RPC client library.
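Snippet submission maps onto Livy's sessions API: create an interactive session, then post code statements to it. A minimal sketch, where the host is a placeholder and the endpoint paths follow Livy's documented REST API:

```python
import requests  # HTTP client for Livy's REST API

LIVY_URL = "http://localhost:8998"  # placeholder; 8998 is Livy's default port


def session_payload(kind="pyspark"):
    """JSON body for POST /sessions (kind: pyspark, spark, sparkr, ...)."""
    return {"kind": kind}


def statement_payload(code):
    """JSON body for POST /sessions/{id}/statements."""
    return {"code": code}


def run_snippet(session_id, code):
    """Post a code snippet to an existing interactive session.

    Returns the statement id; the result is retrieved asynchronously via
    GET /sessions/{id}/statements/{statement_id}.
    """
    resp = requests.post(
        f"{LIVY_URL}/sessions/{session_id}/statements",
        json=statement_payload(code),
    )
    resp.raise_for_status()
    return resp.json()["id"]
```

Because the session (and its Spark context) persists between statements, repeated snippets avoid paying Spark's startup cost each time.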
Failure Resiliency
● Current weakness
○ The current solution lacks resiliency in Airflow (single EC2 instance)
○ Solution: containerize Airflow, deploy it on a pod with separate worker pods, and distribute tasks using an external queue
● Livy
○ Supports session recovery using ZooKeeper, reconnecting to the existing session even if Livy fails while executing a job
● Spark
○ Failed tasks can be re-launched in parallel on the other nodes in the cluster, distributing the recomputation across many nodes and recovering from failures quickly
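For reference, Livy's session recovery is switched on in livy.conf roughly as follows (key names per Livy's configuration template; the ZooKeeper addresses are placeholders):

```
# Persist session state so sessions survive a Livy server restart
livy.server.recovery.mode = recovery
livy.server.recovery.state-store = zookeeper
livy.server.recovery.state-store.url = zk-host1:2181,zk-host2:2181
```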