SlideShare a Scribd company logo
1 of 23
Download to read offline
Confidential
End to End Pipelines Using Apache Spark/Livy/Airflow
An integrated solution to batch data processing
Rikin Tanna and Karunasri Maram
Capital One Auto Finance
February 12, 2020
2
Agenda
1. Problem Statement
2. Integrated Solution, Brief Overview
3. Explanation of Components
• Apache Spark
• Apache Livy
• Apache Airflow
4. Integrated Solution, Fully Explained
5. Demo
3
Problem StatementNeed: Batch data processing on schedule
4
Solution Requirements
?
Scalable
Handle jobs with
growing data sets
End to End Data Pipeline
Parallel Execution
Ability to run multiple jobs
in parallel
Open-Source Support
Active contributions to
components used to
stay efficient
Dynamic
Generation of pipeline on
demand to support varying
characteristics
Dependency Enabled
Support ordering of tasks based
on dependency
5
A fully integrated big data pipeline…. with just 3
components!
• Apache Spark
• Unified data analytics engine for large-scale data processing
• Served on EMR cluster
• Apache Livy
• REST Interface to enable easy interaction with Apache Spark
• Served on master node of EMR cluster
• Apache Airflow
• WMS to schedule, trigger, and monitor workflows on a single
compute resource
• Served on single compute instance, with metadata DB on
separate RDS instance
Solution: Brief
5
Confidential
7
What is Airflow?
An open source platform to programatically author, schedule, and monitor workflows
Dynamic: Airflow pipelines are
configured as code, allowing
for dynamic pipeline
generation as DAGs
Extensible: Easily extend the
library and usability by
creating your own operators
and executors
General-Purpose: Airflow is
written in Python, and all
pipelines are configured in
Python
Accessible: Rich UI allows for
non-technical users to
monitor workflows
8
Airflow Luigi Oozie Azkaban
Dynamic Pipelines
Rich, Interactable UI
General Purpose
Usability
Scalability
Dependency
Management
Maturity/Support
Why Airflow?
Comparison of common open source workflow management systems
Use this box for citations, sources, statements, notes, and legal disclaimers that are required.
Use this box for citations, sources, statements, notes, and legal disclaimers that are required.
9
Apache Airflow Architecture
Source: https://medium.com/@dustinstansbury/understanding-apache-airflows-key-concepts-a96efed52b1a
● Metadata Database
○ stores information necessary for scheduler/executor
and webserver
○ task state, DAG definitions, log locations, etc
● Scheduler/Executor
○ process that uses DAG definitions and task states to push
tasks onto queue for execution
● Workers
○ process(es) that execute the logic of the tasks
● Webserver
○ process that renders web UI, interacting with metadata
database to allow user monitoring and interaction with
workflows
10
How to Get Started with Apache Airflow
1. Install Python
2. “pip install apache-airflow”
a. install from pypi using pip
b. AIRFLOW_HOME = ~/airflow (default)
3. “airflow initdb”
a. initialize database
4. “airflow webserver -p 8080”
a. start web server, default port is 8080
5. “airflow scheduler”
a. start scheduler (also starts executor processes)
6. visit localhost:8080 in browser and enable example dags
Deeper Understanding
1. Connect to database (using Datagrip or DBeaver)
and view tables. See how the data is altered as
workflows execute and changes are made
2. Dig into source code
(https://github.com/apache/airflow) and view
actions triggered by scheduler CLI command
Confidential
12
Spark Flink Storm
Streaming
Batch/Interactive/Iterative
Processing
General Purpose Usability
Scalability
Product Maturity
Community Support
Why Spark?
Comparison of common open source big data processing systems.
Use this box for citations, sources, statements, notes, and legal disclaimers that are required.
Use this box for citations, sources, statements, notes, and legal disclaimers that are required.
13
14
15
Apache Spark Architecture
Confidential
What is that?
Why do we need?
17
Livy spark-jobserver Mist Apache Toree
Streaming jobs
Batch Jobs
General Purpose
Usability
High Availability
Supports major languages
(Scala/Java/Python)
Dependency (No Code
changes required)
Why Livy?
Comparison of common open source Spark Interfaces.
Use this box for citations, sources, statements, notes, and legal disclaimers that are required.
Use this box for citations, sources, statements, notes, and legal disclaimers that are required.
18
○ Apache Livy is a service that enables
easy interaction with a Spark cluster
over a REST interface.
○ It enables easy submission of Spark jobs
or snippets of Spark code, synchronous
or asynchronous result retrieval, as well
as Spark Context management, all via a
simple REST interface or an RPC client
library.
A Rest Service for Spark Jobs
18Confidential
19
Sample Request
20
• Workflow Management
• Schedules, monitors, and triggers
workflows
• Characteristics
• Dynamic
• Dependency Enabled
• Open-Source
Apache Airflow
• REST Interface to Interact
with Apache Spark
Apache Livy
• Big-Data Processing
• platform to execute large scale
data processing
• Characteristics
• Parallel jobs
• Scalable
• Open-Source
Apache Spark
Summary of Components
21
Current Solution
22
Failure Resiliency
● Current weakness
○ Current solution lacks resiliency in Airflow (single EC2 instance)
○ solution:
■ containerize Airflow, deploy on pod with separate worker pod,
distribute tasks using external queue
● Livy
○ It supports session recovery using Zookeeper and reconnects to the
existing session even if its fails while executing the job.
● Spark
○ Failed tasks can be re-launched in parallel on all the other nodes in the cluster and distribute the recomputations across many nodes,and
recovering from the failures very fast.
Confidential
Thank you!
rikin.tanna@capitalone.com
https://www.linkedin.com/in/rikin-tanna/

More Related Content

What's hot

Overview of Apache Flink: Next-Gen Big Data Analytics Framework
Overview of Apache Flink: Next-Gen Big Data Analytics FrameworkOverview of Apache Flink: Next-Gen Big Data Analytics Framework
Overview of Apache Flink: Next-Gen Big Data Analytics FrameworkSlim Baltagi
 
Fluentd v1.0 in a nutshell
Fluentd v1.0 in a nutshellFluentd v1.0 in a nutshell
Fluentd v1.0 in a nutshellN Masahiro
 
Bridge to Cloud: Using Apache Kafka to Migrate to GCP
Bridge to Cloud: Using Apache Kafka to Migrate to GCPBridge to Cloud: Using Apache Kafka to Migrate to GCP
Bridge to Cloud: Using Apache Kafka to Migrate to GCPconfluent
 
Apache Hadoop Security - Ranger
Apache Hadoop Security - RangerApache Hadoop Security - Ranger
Apache Hadoop Security - RangerIsheeta Sanghi
 
A Kafka journey and why migrate to Confluent Cloud?
A Kafka journey and why migrate to Confluent Cloud?A Kafka journey and why migrate to Confluent Cloud?
A Kafka journey and why migrate to Confluent Cloud?confluent
 
Building an open data platform with apache iceberg
Building an open data platform with apache icebergBuilding an open data platform with apache iceberg
Building an open data platform with apache icebergAlluxio, Inc.
 
Delta lake and the delta architecture
Delta lake and the delta architectureDelta lake and the delta architecture
Delta lake and the delta architectureAdam Doyle
 
Apache Iceberg Presentation for the St. Louis Big Data IDEA
Apache Iceberg Presentation for the St. Louis Big Data IDEAApache Iceberg Presentation for the St. Louis Big Data IDEA
Apache Iceberg Presentation for the St. Louis Big Data IDEAAdam Doyle
 
Managing 2000 Node Cluster with Ambari
Managing 2000 Node Cluster with AmbariManaging 2000 Node Cluster with Ambari
Managing 2000 Node Cluster with AmbariDataWorks Summit
 
Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Etha...
Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Etha...Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Etha...
Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Etha...HostedbyConfluent
 
Apache Flink and what it is used for
Apache Flink and what it is used forApache Flink and what it is used for
Apache Flink and what it is used forAljoscha Krettek
 
NATS Streaming - an alternative to Apache Kafka?
NATS Streaming - an alternative to Apache Kafka?NATS Streaming - an alternative to Apache Kafka?
NATS Streaming - an alternative to Apache Kafka?Anton Zadorozhniy
 
Dapr - A 10x Developer Framework for Any Language
Dapr - A 10x Developer Framework for Any LanguageDapr - A 10x Developer Framework for Any Language
Dapr - A 10x Developer Framework for Any LanguageBilgin Ibryam
 
On-boarding with JanusGraph Performance
On-boarding with JanusGraph PerformanceOn-boarding with JanusGraph Performance
On-boarding with JanusGraph PerformanceChin Huang
 

What's hot (20)

Overview of Apache Flink: Next-Gen Big Data Analytics Framework
Overview of Apache Flink: Next-Gen Big Data Analytics FrameworkOverview of Apache Flink: Next-Gen Big Data Analytics Framework
Overview of Apache Flink: Next-Gen Big Data Analytics Framework
 
Apache Airflow
Apache AirflowApache Airflow
Apache Airflow
 
Fluentd v1.0 in a nutshell
Fluentd v1.0 in a nutshellFluentd v1.0 in a nutshell
Fluentd v1.0 in a nutshell
 
Dataflow with Apache NiFi
Dataflow with Apache NiFiDataflow with Apache NiFi
Dataflow with Apache NiFi
 
Nifi workshop
Nifi workshopNifi workshop
Nifi workshop
 
Bridge to Cloud: Using Apache Kafka to Migrate to GCP
Bridge to Cloud: Using Apache Kafka to Migrate to GCPBridge to Cloud: Using Apache Kafka to Migrate to GCP
Bridge to Cloud: Using Apache Kafka to Migrate to GCP
 
Apache Spark Architecture
Apache Spark ArchitectureApache Spark Architecture
Apache Spark Architecture
 
Apache Hadoop Security - Ranger
Apache Hadoop Security - RangerApache Hadoop Security - Ranger
Apache Hadoop Security - Ranger
 
The basics of fluentd
The basics of fluentdThe basics of fluentd
The basics of fluentd
 
A Kafka journey and why migrate to Confluent Cloud?
A Kafka journey and why migrate to Confluent Cloud?A Kafka journey and why migrate to Confluent Cloud?
A Kafka journey and why migrate to Confluent Cloud?
 
Building an open data platform with apache iceberg
Building an open data platform with apache icebergBuilding an open data platform with apache iceberg
Building an open data platform with apache iceberg
 
Delta lake and the delta architecture
Delta lake and the delta architectureDelta lake and the delta architecture
Delta lake and the delta architecture
 
Apache Iceberg Presentation for the St. Louis Big Data IDEA
Apache Iceberg Presentation for the St. Louis Big Data IDEAApache Iceberg Presentation for the St. Louis Big Data IDEA
Apache Iceberg Presentation for the St. Louis Big Data IDEA
 
Managing 2000 Node Cluster with Ambari
Managing 2000 Node Cluster with AmbariManaging 2000 Node Cluster with Ambari
Managing 2000 Node Cluster with Ambari
 
Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Etha...
Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Etha...Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Etha...
Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Etha...
 
Flink vs. Spark
Flink vs. SparkFlink vs. Spark
Flink vs. Spark
 
Apache Flink and what it is used for
Apache Flink and what it is used forApache Flink and what it is used for
Apache Flink and what it is used for
 
NATS Streaming - an alternative to Apache Kafka?
NATS Streaming - an alternative to Apache Kafka?NATS Streaming - an alternative to Apache Kafka?
NATS Streaming - an alternative to Apache Kafka?
 
Dapr - A 10x Developer Framework for Any Language
Dapr - A 10x Developer Framework for Any LanguageDapr - A 10x Developer Framework for Any Language
Dapr - A 10x Developer Framework for Any Language
 
On-boarding with JanusGraph Performance
On-boarding with JanusGraph PerformanceOn-boarding with JanusGraph Performance
On-boarding with JanusGraph Performance
 

Similar to E2E Data Pipeline - Apache Spark/Airflow/Livy

ApacheCon 2021 - Apache NiFi Deep Dive 300
ApacheCon 2021 - Apache NiFi Deep Dive 300ApacheCon 2021 - Apache NiFi Deep Dive 300
ApacheCon 2021 - Apache NiFi Deep Dive 300Timothy Spann
 
AIDevWorldApacheNiFi101
AIDevWorldApacheNiFi101AIDevWorldApacheNiFi101
AIDevWorldApacheNiFi101Timothy Spann
 
Using Apache NiFi with Apache Pulsar for Fast Data On-Ramp
Using Apache NiFi with Apache Pulsar for Fast Data On-RampUsing Apache NiFi with Apache Pulsar for Fast Data On-Ramp
Using Apache NiFi with Apache Pulsar for Fast Data On-RampTimothy Spann
 
Webinar: What's new in CDAP 3.5?
Webinar: What's new in CDAP 3.5?Webinar: What's new in CDAP 3.5?
Webinar: What's new in CDAP 3.5?Cask Data
 
What are Hadoop Components? Hadoop Ecosystem and Architecture | Edureka
What are Hadoop Components? Hadoop Ecosystem and Architecture | EdurekaWhat are Hadoop Components? Hadoop Ecosystem and Architecture | Edureka
What are Hadoop Components? Hadoop Ecosystem and Architecture | EdurekaEdureka!
 
Kafka to Hadoop Ingest with Parsing, Dedup and other Big Data Transformations
Kafka to Hadoop Ingest with Parsing, Dedup and other Big Data TransformationsKafka to Hadoop Ingest with Parsing, Dedup and other Big Data Transformations
Kafka to Hadoop Ingest with Parsing, Dedup and other Big Data TransformationsApache Apex
 
Presto for the Enterprise @ Hadoop Meetup
Presto for the Enterprise @ Hadoop MeetupPresto for the Enterprise @ Hadoop Meetup
Presto for the Enterprise @ Hadoop MeetupWojciech Biela
 
Java One 2017: Open Source Big Data in the Cloud: Hadoop, M/R, Hive, Spark an...
Java One 2017: Open Source Big Data in the Cloud: Hadoop, M/R, Hive, Spark an...Java One 2017: Open Source Big Data in the Cloud: Hadoop, M/R, Hive, Spark an...
Java One 2017: Open Source Big Data in the Cloud: Hadoop, M/R, Hive, Spark an...Frank Munz
 
Real time cloud native open source streaming of any data to apache solr
Real time cloud native open source streaming of any data to apache solrReal time cloud native open source streaming of any data to apache solr
Real time cloud native open source streaming of any data to apache solrTimothy Spann
 
ApacheCon 2021: Apache NiFi 101- introduction and best practices
ApacheCon 2021:   Apache NiFi 101- introduction and best practicesApacheCon 2021:   Apache NiFi 101- introduction and best practices
ApacheCon 2021: Apache NiFi 101- introduction and best practicesTimothy Spann
 
xPatterns on Spark, Shark, Mesos, Tachyon
xPatterns on Spark, Shark, Mesos, TachyonxPatterns on Spark, Shark, Mesos, Tachyon
xPatterns on Spark, Shark, Mesos, TachyonClaudiu Barbura
 
Cloud lunch and learn real-time streaming in azure
Cloud lunch and learn real-time streaming in azureCloud lunch and learn real-time streaming in azure
Cloud lunch and learn real-time streaming in azureTimothy Spann
 
October 2016 HUG: Architecture of an Open Source RDBMS powered by HBase and ...
October 2016 HUG: Architecture of an Open Source RDBMS powered by HBase and ...October 2016 HUG: Architecture of an Open Source RDBMS powered by HBase and ...
October 2016 HUG: Architecture of an Open Source RDBMS powered by HBase and ...Yahoo Developer Network
 
Spark introduction and architecture
Spark introduction and architectureSpark introduction and architecture
Spark introduction and architectureSohil Jain
 
Spark introduction and architecture
Spark introduction and architectureSpark introduction and architecture
Spark introduction and architectureSohil Jain
 
Intro to Big Data Analytics using Apache Spark and Apache Zeppelin
Intro to Big Data Analytics using Apache Spark and Apache ZeppelinIntro to Big Data Analytics using Apache Spark and Apache Zeppelin
Intro to Big Data Analytics using Apache Spark and Apache ZeppelinAlex Zeltov
 
Transitioning Compute Models: Hadoop MapReduce to Spark
Transitioning Compute Models: Hadoop MapReduce to SparkTransitioning Compute Models: Hadoop MapReduce to Spark
Transitioning Compute Models: Hadoop MapReduce to SparkSlim Baltagi
 
Present and future of unified, portable, and efficient data processing with A...
Present and future of unified, portable, and efficient data processing with A...Present and future of unified, portable, and efficient data processing with A...
Present and future of unified, portable, and efficient data processing with A...DataWorks Summit
 

Similar to E2E Data Pipeline - Apache Spark/Airflow/Livy (20)

ApacheCon 2021 - Apache NiFi Deep Dive 300
ApacheCon 2021 - Apache NiFi Deep Dive 300ApacheCon 2021 - Apache NiFi Deep Dive 300
ApacheCon 2021 - Apache NiFi Deep Dive 300
 
AIDevWorldApacheNiFi101
AIDevWorldApacheNiFi101AIDevWorldApacheNiFi101
AIDevWorldApacheNiFi101
 
Using Apache NiFi with Apache Pulsar for Fast Data On-Ramp
Using Apache NiFi with Apache Pulsar for Fast Data On-RampUsing Apache NiFi with Apache Pulsar for Fast Data On-Ramp
Using Apache NiFi with Apache Pulsar for Fast Data On-Ramp
 
Webinar: What's new in CDAP 3.5?
Webinar: What's new in CDAP 3.5?Webinar: What's new in CDAP 3.5?
Webinar: What's new in CDAP 3.5?
 
What are Hadoop Components? Hadoop Ecosystem and Architecture | Edureka
What are Hadoop Components? Hadoop Ecosystem and Architecture | EdurekaWhat are Hadoop Components? Hadoop Ecosystem and Architecture | Edureka
What are Hadoop Components? Hadoop Ecosystem and Architecture | Edureka
 
Kafka to Hadoop Ingest with Parsing, Dedup and other Big Data Transformations
Kafka to Hadoop Ingest with Parsing, Dedup and other Big Data TransformationsKafka to Hadoop Ingest with Parsing, Dedup and other Big Data Transformations
Kafka to Hadoop Ingest with Parsing, Dedup and other Big Data Transformations
 
Presto for the Enterprise @ Hadoop Meetup
Presto for the Enterprise @ Hadoop MeetupPresto for the Enterprise @ Hadoop Meetup
Presto for the Enterprise @ Hadoop Meetup
 
Java One 2017: Open Source Big Data in the Cloud: Hadoop, M/R, Hive, Spark an...
Java One 2017: Open Source Big Data in the Cloud: Hadoop, M/R, Hive, Spark an...Java One 2017: Open Source Big Data in the Cloud: Hadoop, M/R, Hive, Spark an...
Java One 2017: Open Source Big Data in the Cloud: Hadoop, M/R, Hive, Spark an...
 
Real time cloud native open source streaming of any data to apache solr
Real time cloud native open source streaming of any data to apache solrReal time cloud native open source streaming of any data to apache solr
Real time cloud native open source streaming of any data to apache solr
 
ApacheCon 2021: Apache NiFi 101- introduction and best practices
ApacheCon 2021:   Apache NiFi 101- introduction and best practicesApacheCon 2021:   Apache NiFi 101- introduction and best practices
ApacheCon 2021: Apache NiFi 101- introduction and best practices
 
xPatterns on Spark, Shark, Mesos, Tachyon
xPatterns on Spark, Shark, Mesos, TachyonxPatterns on Spark, Shark, Mesos, Tachyon
xPatterns on Spark, Shark, Mesos, Tachyon
 
Cloud lunch and learn real-time streaming in azure
Cloud lunch and learn real-time streaming in azureCloud lunch and learn real-time streaming in azure
Cloud lunch and learn real-time streaming in azure
 
Apache phoenix
Apache phoenixApache phoenix
Apache phoenix
 
October 2016 HUG: Architecture of an Open Source RDBMS powered by HBase and ...
October 2016 HUG: Architecture of an Open Source RDBMS powered by HBase and ...October 2016 HUG: Architecture of an Open Source RDBMS powered by HBase and ...
October 2016 HUG: Architecture of an Open Source RDBMS powered by HBase and ...
 
Spark introduction and architecture
Spark introduction and architectureSpark introduction and architecture
Spark introduction and architecture
 
Spark introduction and architecture
Spark introduction and architectureSpark introduction and architecture
Spark introduction and architecture
 
Intro to Big Data Analytics using Apache Spark and Apache Zeppelin
Intro to Big Data Analytics using Apache Spark and Apache ZeppelinIntro to Big Data Analytics using Apache Spark and Apache Zeppelin
Intro to Big Data Analytics using Apache Spark and Apache Zeppelin
 
Spark Workshop
Spark WorkshopSpark Workshop
Spark Workshop
 
Transitioning Compute Models: Hadoop MapReduce to Spark
Transitioning Compute Models: Hadoop MapReduce to SparkTransitioning Compute Models: Hadoop MapReduce to Spark
Transitioning Compute Models: Hadoop MapReduce to Spark
 
Present and future of unified, portable, and efficient data processing with A...
Present and future of unified, portable, and efficient data processing with A...Present and future of unified, portable, and efficient data processing with A...
Present and future of unified, portable, and efficient data processing with A...
 

Recently uploaded

Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Wonjun Hwang
 
The Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfThe Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfSeasiaInfotech2
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
Vector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesVector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesZilliz
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 

Recently uploaded (20)

Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
 
The Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfThe Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdf
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
Vector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesVector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector Databases
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 

E2E Data Pipeline - Apache Spark/Airflow/Livy

  • 1. Confidential End to End Pipelines Using Apache Spark/Livy/Airflow An integrated solution to batch data processing Rikin Tanna and Karunasri Maram Capital One Auto Finance February 12, 2020
  • 2. 2 Agenda 1. Problem Statement 2. Integrated Solution, Brief Overview 3. Explanation of Components • Apache Spark • Apache Livy • Apache Airflow 4. Integrated Solution, Fully Explained 5. Demo
  • 3. 3 Problem StatementNeed: Batch data processing on schedule
  • 4. 4 Solution Requirements ? Scalable Handle jobs with growing data sets End to End Data Pipeline Parallel Execution Ability to run multiple jobs in parallel Open-Source Support Active contributions to components used to stay efficient Dynamic Generation of pipeline on demand to support varying characteristics Dependency Enabled Support ordering of tasks based on dependency
  • 5. 5 A fully integrated big data pipeline…. with just 3 components! • Apache Spark • Unified data analytics engine for large-scale data processing • Served on EMR cluster • Apache Livy • REST Interface to enable easy interaction with Apache Spark • Served on master node of EMR cluster • Apache Airflow • WMS to schedule, trigger, and monitor workflows on a single compute resource • Served on single compute instance, with metadata DB on separate RDS instance Solution: Brief 5
  • 7. 7 What is Airflow? An open source platform to programatically author, schedule, and monitor workflows Dynamic: Airflow pipelines are configured as code, allowing for dynamic pipeline generation as DAGs Extensible: Easily extend the library and usability by creating your own operators and executors General-Purpose: Airflow is written in Python, and all pipelines are configured in Python Accessible: Rich UI allows for non-technical users to monitor workflows
  • 8. 8 Airflow Luigi Oozie Azkaban Dynamic Pipelines Rich, Interactable UI General Purpose Usability Scalability Dependency Management Maturity/Support Why Airflow? Comparison of common open source workflow management systems Use this box for citations, sources, statements, notes, and legal disclaimers that are required. Use this box for citations, sources, statements, notes, and legal disclaimers that are required.
  • 9. 9 Apache Airflow Architecture Source: https://medium.com/@dustinstansbury/understanding-apache-airflows-key-concepts-a96efed52b1a ● Metadata Database ○ stores information necessary for scheduler/executor and webserver ○ task state, DAG definitions, log locations, etc ● Scheduler/Executor ○ process that uses DAG definitions and task states to push tasks onto queue for execution ● Workers ○ process(es) that execute the logic of the tasks ● Webserver ○ process that renders web UI, interacting with metadata database to allow user monitoring and interaction with workflows
  • 10. 10 How to Get Started with Apache Airflow 1. Install Python 2. “pip install apache-airflow” a. install from pypi using pip b. AIRFLOW_HOME = ~/airflow (default) 3. “airflow initdb” a. initialize database 4. “airflow webserver -p 8080” a. start web server, default port is 8080 5. “airflow scheduler” a. start scheduler (also starts executor processes) 6. visit localhost:8080 in browser and enable example dags Deeper Understanding 1. Connect to database (using Datagrip or DBeaver) and view tables. See how the data is altered as workflows execute and changes are made 2. Dig into source code (https://github.com/apache/airflow) and view actions triggered by scheduler CLI command
  • 12. 12 Spark Flink Storm Streaming Batch/Interactive/Iterative Processing General Purpose Usability Scalability Product Maturity Community Support Why Spark? Comparison of common open source big data processing systems. Use this box for citations, sources, statements, notes, and legal disclaimers that are required. Use this box for citations, sources, statements, notes, and legal disclaimers that are required.
  • 13. 13
  • 14. 14
  • 17. 17 Livy spark-jobserver Mist Apache Toree Streaming jobs Batch Jobs General Purpose Usability High Availability Supports major languages (Scala/Java/Python) Dependency (No Code changes required) Why Livy? Comparison of common open source Spark Interfaces. Use this box for citations, sources, statements, notes, and legal disclaimers that are required. Use this box for citations, sources, statements, notes, and legal disclaimers that are required.
  • 18. 18 ○ Apache Livy is a service that enables easy interaction with a Spark cluster over a REST interface. ○ It enables easy submission of Spark jobs or snippets of Spark code, synchronous or asynchronous result retrieval, as well as Spark Context management, all via a simple REST interface or an RPC client library. A Rest Service for Spark Jobs 18Confidential
  • 20. 20 • Workflow Management • Schedules, monitors, and triggers workflows • Characteristics • Dynamic • Dependency Enabled • Open-Source Apache Airflow • REST Interface to Interact with Apache Spark Apache Livy • Big-Data Processing • platform to execute large scale data processing • Characteristics • Parallel jobs • Scalable • Open-Source Apache Spark Summary of Components
  • 22. 22 Failure Resiliency ● Current weakness ○ Current solution lacks resiliency in Airflow (single EC2 instance) ○ solution: ■ containerize Airflow, deploy on pod with separate worker pod, distribute tasks using external queue ● Livy ○ It supports session recovery using Zookeeper and reconnects to the existing session even if its fails while executing the job. ● Spark ○ Failed tasks can be re-launched in parallel on all the other nodes in the cluster and distribute the recomputations across many nodes,and recovering from the failures very fast.