Building Scalable, Robust Data Pipelines with Apache Airflow
Agenda
❖ A brief introduction to Qubole
❖ Apache Airflow
❖ Operational Challenges in managing an ETL
❖ Alerts and Monitoring
❖ Quality Assurance in ETLs
About Qubole Data Service
❖ A self-service platform for big data analytics.
❖ Delivers best-in-class Apache tools such as Hadoop, Hive, Spark, etc., integrated into a feature-rich enterprise platform optimized to run in the cloud.
❖ Enables users to focus on their data rather than the platform.
Data Team @ Qubole
❖ Data Warehouse for Qubole
❖ Provides Insights and Recommendations to users
❖ Just Another Qubole Account
❖ Enabling data-driven features within QDS
Multi-Tenant Nature of the Data Team
[Diagram: Qubole Distribution 1 (api.qubole.com) and Distribution 2 (azure.qubole.com), each with its own Data Warehouse]
Apache Airflow For ETL
❖ Developer friendly.
❖ A rich collection of operators, CLI utilities, and a UI to author and manage your data pipeline.
❖ Horizontally scalable.
❖ Tight integration with Qubole.
DAG creation in Airflow
Operational Challenges in the ETL World
❖ How do we achieve continuous integration and deployment for ETLs?
❖ How do we effectively manage configuration for ETLs in a multi-tenant environment?
❖ How do we make ETLs aware of Data Warehouse migrations?
Configuration Management
IDEA! Use Airflow Variables for saving ETL configuration.
Airflow Variables for ETL Configuration
❖ Stores information as key-value pairs in Airflow.
❖ Extensive support (CLI, UI, and API) for managing the variables.
❖ Can be read from within an Airflow script with Variable.get("variable_name").
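As a minimal sketch of per-tenant configuration lookup, the function below reads one JSON document per ETL and distribution. The key layout (`<etl>__<distribution>`), the ETL name, and the config fields are assumptions made for illustration, not taken from the talk. A plain dict stands in for the Airflow Variables store so the example runs without Airflow; inside a DAG file, `Variable.get` could be passed as the getter instead.

```python
import json


def load_etl_config(get_variable, etl_name, distribution):
    """Read one ETL's configuration for one distribution (tenant).

    `get_variable` is any callable that maps a key to a JSON string.
    The key layout used here is hypothetical.
    """
    key = f"{etl_name}__{distribution}"
    raw = get_variable(key)
    if raw is None:
        raise KeyError(f"no configuration stored under {key!r}")
    return json.loads(raw)


# Stand-in for the Airflow Variables store (hypothetical keys and values).
fake_variables = {
    "usage_etl__api": json.dumps(
        {"warehouse_db": "dw_api", "start_date": "2018-01-01"}
    ),
    "usage_etl__azure": json.dumps(
        {"warehouse_db": "dw_azure", "start_date": "2018-02-01"}
    ),
}

config = load_etl_config(fake_variables.get, "usage_etl", "azure")
print(config["warehouse_db"])
```

Keeping one variable per ETL and distribution means the same ETL script can run unchanged against each tenant's warehouse.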
Warehouse Management
❖ A leaf out of Ruby on Rails: Active Record Migrations.
❖ Each migration is tagged and committed as a single commit to version control, along with the ETL changes.
The Process Is Easy
1. Check out the target tag from version control.
2. Fetch the current migration number from Airflow Variables.
3. Run any new relevant migrations.
4. Update the migration number.
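The migration loop can be sketched roughly as follows. This is an illustration under stated assumptions, not the team's actual tooling: a dict stands in for Airflow Variables, the variable name `warehouse_migration_number` is hypothetical, and each migration is modelled as a numbered Python callable.

```python
def run_pending_migrations(variables, migrations,
                           key="warehouse_migration_number"):
    """Apply migrations newer than the stored number, in order.

    `variables` is a dict standing in for Airflow Variables and
    `migrations` maps a migration number to a callable (both hypothetical).
    """
    current = int(variables.get(key, 0))   # fetch current migration number
    applied = []
    for number in sorted(migrations):
        if number > current:
            migrations[number]()            # run the new migration
            variables[key] = number         # update the migration number
            applied.append(number)
    return applied


schema = []
migrations = {
    1: lambda: schema.append("create table usage"),
    2: lambda: schema.append("add column engine"),
    3: lambda: schema.append("add partition hour"),
}
variables = {"warehouse_migration_number": 1}

print(run_pending_migrations(variables, migrations))  # [2, 3]
```

Updating the stored number after each migration, rather than once at the end, means a failure part-way through leaves the record pointing at the last migration that actually succeeded.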
Deployment
❖ Traditional deployment is too messy when multiple users are handling Airflow.
❖ Data Apps for ETL deployment.
❖ Provides a CLI option like <ETL_NAME> deploy -r <version_tag> -d <start_date>
1. Check out the Airflow template file from version control.
2. Read config values from Airflow and translate the config values.
3. Copy the final script file to the Airflow directory.
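The translate-and-copy steps might look like the sketch below. It assumes the template uses `$placeholder` markers that `string.Template` can fill; the talk does not say which templating mechanism is used. The function name, file name, and config keys are hypothetical.

```python
import tempfile
from pathlib import Path
from string import Template


def deploy_etl(template_text, config, dags_dir, etl_name):
    """Fill a DAG template with config values and write it to the DAG folder."""
    # substitute() raises KeyError if the template needs a value that
    # the config does not provide, so a broken deploy fails early.
    script = Template(template_text).substitute(config)
    target = Path(dags_dir) / f"{etl_name}.py"
    target.write_text(script)
    return target


template_text = 'WAREHOUSE_DB = "$warehouse_db"\nSTART_DATE = "$start_date"\n'
config = {"warehouse_db": "dw_api", "start_date": "2018-01-01"}

# A temporary directory stands in for the Airflow DAG folder.
with tempfile.TemporaryDirectory() as dags_dir:
    deployed = deploy_etl(template_text, config, dags_dir, "usage_etl")
    deployed_text = deployed.read_text()

print(deployed_text)
```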
Alerts and Monitoring
DAG in Qubole
❖ This graph has 90+ operators!
❖ 8-9 different operator types.
❖ Clearly, error prone!
DATA QUALITY ISSUES
❖ Missing data
❖ Data corruption
❖ Data duplication
❖ System issues
IMPORTANCE OF DATA VALIDATION
❖ An application’s correctness depends on the correctness of its data.
❖ Increase confidence in data by quantifying data quality.
❖ Correcting existing data can be expensive; prevention is better than cure!
❖ Stop critical downstream tasks if the data is invalid.
TREND MONITORING
❖ Monitor dips, peaks, anomalies.
❖ Hard problem!
❖ Not real time.
❖ One size doesn’t fit all: different ETLs manipulate data in different ways.
❖ Difficult to maintain.
IDEA! Use assert queries for data validation!
Using Apache Airflow Check operators
Approach:
1. Extend the open source Airflow check operator for queries running on the Qubole platform.
2. Run data validation queries.
3. Fail the operator if the validation fails.
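To make "fail the operator if the validation fails" concrete, here is a plain-Python sketch of the two comparison rules that Airflow's check operators are documented to apply: a check passes only if every value in the first result row is truthy, and a value check passes only if each value matches `pass_value`, optionally within a relative tolerance. It illustrates the semantics and is not the operators' source code; the function and exception names are made up for this example.

```python
class CheckFailed(Exception):
    """Raised when a validation query does not pass."""


def check_first_row(row):
    """Fail unless every value in the first result row is truthy."""
    if not row:
        raise CheckFailed("the query returned no rows")
    if not all(bool(value) for value in row):
        raise CheckFailed(f"check failed for row {row!r}")


def check_value(row, pass_value, tolerance=None):
    """Fail unless every value matches pass_value, within tolerance if given."""
    for value in row:
        if tolerance is None:
            ok = float(value) == float(pass_value)
        else:
            low = pass_value * (1 - tolerance)
            high = pass_value * (1 + tolerance)
            ok = low <= float(value) <= high
        if not ok:
            raise CheckFailed(f"{value!r} does not match {pass_value!r}")


# An assert query such as "SELECT COUNT(*) > 0, MIN(amount) >= 0 FROM t"
# would yield a row of booleans; here the rows are written out by hand.
check_first_row((True, True))
check_value((98,), pass_value=100, tolerance=0.05)
```

Raising an exception is what makes the task fail, which in turn stops downstream tasks that depend on it.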
Creating QuboleCheck operator
Limitations and Enhancements to the open source Apache Airflow Operator
1. Compare data across engines
Problem: Airflow check operators required pass_value to be defined before the ETL starts.
Use case: Validating data import logic.
Solution: Make pass_value an Airflow template field. This way it can be configured at run time. The pass value can be injected through multiple mechanisms once it is an Airflow template field.
Pass value as an Airflow Template field
2. Validate multiline results
Problem: Currently, Apache Airflow check operators consider only a single row for comparison.
Use case: Run group-by queries and compare each of the values against the pass_value.
Solution: Qubole_check_operator adds a `results_parser_callable` parameter. The function pointed to by `results_parser_callable` holds the logic to return a list of records on which the checks are performed.
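The parser-callable idea can be illustrated without Airflow. In this sketch the raw query output is assumed to be tab-separated text with one `key<TAB>count` pair per line; that format, the account names, and the helper `failing_records` are assumptions for illustration. The parser turns the output into a list of records, and every record is checked rather than only the first row.

```python
def parse_group_counts(raw_output):
    """Hypothetical parser: 'key<TAB>count' lines -> list of (key, count)."""
    records = []
    for line in raw_output.strip().splitlines():
        key, count = line.split("\t")
        records.append((key, int(count)))
    return records


def failing_records(raw_output, results_parser, is_valid):
    """Parse the output, then return every record that fails the check."""
    parsed = results_parser(raw_output)
    return [record for record in parsed if not is_valid(record)]


raw_output = "acct_1\t120\nacct_2\t0\nacct_3\t45\n"
bad = failing_records(raw_output, parse_group_counts,
                      lambda record: record[1] > 0)
print(bad)  # [('acct_2', 0)]
```

Passing the parser in as a parameter keeps the check logic generic while each ETL decides how its own query output should be read.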
Parser function as parameter to Check operator
Integration of Apache Airflow Check Operators with Qubole ETLs
ETL #1: Data Ingestion
Imports data from RDS tables into the Data Warehouse for analysis purposes.
Historical issues: mismatch with source data
1. Data duplication
2. Data missing for certain durations
Checks employed:
- Count comparison across the two data stores, source and destination.
How checks have helped us:
- Verify and rectify upsert logic (which is not a plain copy of RDS).
PS: Runtime fetching of expected values!
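A count comparison with a run-time expected value can be sketched with two in-memory SQLite databases standing in for the source (RDS) and the destination (Data Warehouse). The table name and rows are made up, and the destination deliberately contains a duplicate so that the check fails.

```python
import sqlite3


def row_count(connection, table):
    # The table name comes from trusted ETL config in this sketch,
    # not from user input, so string formatting is acceptable here.
    return connection.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]


source = sqlite3.connect(":memory:")
destination = sqlite3.connect(":memory:")
for connection in (source, destination):
    connection.execute("CREATE TABLE accounts (id INTEGER)")

source.executemany("INSERT INTO accounts VALUES (?)", [(1,), (2,), (3,)])
# The destination holds a duplicated row, as a faulty upsert might produce.
destination.executemany(
    "INSERT INTO accounts VALUES (?)", [(1,), (2,), (2,), (3,)]
)

expected = row_count(source, "accounts")     # pass value fetched at run time
actual = row_count(destination, "accounts")
counts_match = expected == actual
print(expected, actual, counts_match)  # 3 4 False
```

Because the expected value is only known once the source has been queried, it cannot be hard-coded when the DAG is written, which is the motivation for making `pass_value` configurable at run time.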
ETL #2: Data Transformation
Repartitions a day’s worth of data into hourly partitions.
Historical issues:
1. Data ending up in a single partition (the default Hive partition).
2. Wrong ordering of values in fields.
Checks employed:
1. The number of partitions created is 24 (one for every hour).
2. Check the value of the critical field, “source”.
How checks have helped us: Verify and rectify repartitioning logic.
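The partition-count check could be expressed along these lines. The partition naming scheme (`dt=<date>/hour=<HH>`) is an assumption for illustration; `__HIVE_DEFAULT_PARTITION__` is the name Hive gives to its default partition.

```python
EXPECTED_HOURS = {f"{hour:02d}" for hour in range(24)}


def partition_problems(partition_names):
    """Return (missing_hours, unexpected_partitions) for one day's partitions."""
    hours = set()
    unexpected = []
    for name in partition_names:
        hour = name.rsplit("hour=", 1)[-1]
        if hour in EXPECTED_HOURS:
            hours.add(hour)
        else:
            unexpected.append(name)   # e.g. rows that fell into the default partition
    return sorted(EXPECTED_HOURS - hours), unexpected


good_day = [f"dt=2018-03-01/hour={hour:02d}" for hour in range(24)]
bad_day = good_day[:23] + ["dt=2018-03-01/hour=__HIVE_DEFAULT_PARTITION__"]

print(partition_problems(good_day))  # ([], [])
print(partition_problems(bad_day))
```

A plain count of 24 would pass the second day as well, since it also has 24 partitions; listing the missing hours and the unexpected names makes the failure easier to diagnose.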
ETL #3: Cost Computation
Computes Qubole Compute Unit Hour (QCUH).
Situation: We are narrowing down the granularity of cost computation from daily to hourly.
How checks have helped: They monitor new data and raise an alarm in case of a mismatch in trends between old and new data.
ETL #4: Data Transformation
Parses customer queries and outputs table usage information.
Historical issues:
1. Data missing for a customer account.
2. Data loss due to different syntaxes across engines.
3. Data loss due to query syntax changes across different versions of data engines.
Checks employed:
1. Group by account IDs; if any count is 0, raise an alert.
2. Group by engine type and account IDs; if the error % is high, raise an alert.
How checks have helped us:
- Insights into the amount of data loss.
- Provided feedback that helped us make syntax checking more robust.
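The second check, grouping by engine type and account and alerting on a high error percentage, might be sketched as below. The row layout `(engine, account_id, parsed_ok)` and the 10% threshold are assumptions made for illustration; the talk does not state the actual threshold.

```python
from collections import defaultdict


def high_error_groups(rows, max_error_pct=10.0):
    """Return {(engine, account_id): error_pct} for groups over the threshold.

    Each row is (engine, account_id, parsed_ok); this layout is hypothetical.
    """
    totals = defaultdict(lambda: [0, 0])      # group -> [total, errors]
    for engine, account_id, parsed_ok in rows:
        bucket = totals[(engine, account_id)]
        bucket[0] += 1
        bucket[1] += 0 if parsed_ok else 1

    alerts = {}
    for group, (total, errors) in totals.items():
        pct = 100.0 * errors / total
        if pct > max_error_pct:
            alerts[group] = pct
    return alerts


rows = [
    ("hive", 1, True),
    ("hive", 1, True),
    ("presto", 1, False),
    ("presto", 1, True),
    ("spark", 2, True),
]
alerts = high_error_groups(rows)
print(alerts)  # {('presto', 1): 50.0}
```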
FEATURES
❖ Ability to plug in different alerting mechanisms.
❖ Dependency management and failure handling.
❖ Ability to parse the output of an assert query in a user-defined manner.
❖ Run-time fetching of the pass_value against which the comparison is made.
❖ Ability to generate a failure/success report.
LESSONS LEARNT
❖ One size doesn’t fit all: estimation of data trends is a difficult problem.
❖ Delegate the validation task to the ETL itself.
Source code has been contributed to Apache Airflow
AIRFLOW-2228: Enhancements in Check operator
AIRFLOW-2213: Adding Qubole Check Operator
In data we trust!
THANKS!
Any questions?
You can find us at:
sakshib@qubole.com
sreenathk@qubole.com
AIRflow at Scale

Business update Q1 2024 Lar España Real Estate SOCIMIBusiness update Q1 2024 Lar España Real Estate SOCIMI
Business update Q1 2024 Lar España Real Estate SOCIMI
 
Investigate & Recover / StarCompliance.io / Crypto_Crimes
Investigate & Recover / StarCompliance.io / Crypto_CrimesInvestigate & Recover / StarCompliance.io / Crypto_Crimes
Investigate & Recover / StarCompliance.io / Crypto_Crimes
 
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
 
社内勉強会資料_LLM Agents                              .
社内勉強会資料_LLM Agents                              .社内勉強会資料_LLM Agents                              .
社内勉強会資料_LLM Agents                              .
 
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
 
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
 
Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)
 
Adjusting primitives for graph : SHORT REPORT / NOTES
Adjusting primitives for graph : SHORT REPORT / NOTESAdjusting primitives for graph : SHORT REPORT / NOTES
Adjusting primitives for graph : SHORT REPORT / NOTES
 
Innovative Methods in Media and Communication Research by Sebastian Kubitschk...
Innovative Methods in Media and Communication Research by Sebastian Kubitschk...Innovative Methods in Media and Communication Research by Sebastian Kubitschk...
Innovative Methods in Media and Communication Research by Sebastian Kubitschk...
 
The affect of service quality and online reviews on customer loyalty in the E...
The affect of service quality and online reviews on customer loyalty in the E...The affect of service quality and online reviews on customer loyalty in the E...
The affect of service quality and online reviews on customer loyalty in the E...
 
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
 
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
 
Best best suvichar in gujarati english meaning of this sentence as Silk road ...
Best best suvichar in gujarati english meaning of this sentence as Silk road ...Best best suvichar in gujarati english meaning of this sentence as Silk road ...
Best best suvichar in gujarati english meaning of this sentence as Silk road ...
 
Q1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year ReboundQ1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year Rebound
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
 
Empowering Data Analytics Ecosystem.pptx
Empowering Data Analytics Ecosystem.pptxEmpowering Data Analytics Ecosystem.pptx
Empowering Data Analytics Ecosystem.pptx
 
tapal brand analysis PPT slide for comptetive data
tapal brand analysis PPT slide for comptetive datatapal brand analysis PPT slide for comptetive data
tapal brand analysis PPT slide for comptetive data
 

AIRflow at Scale

  • 2. Building Scalable Robust Data Pipelines with Apache Airflow
  • 3. Agenda ❖ A brief introduction to Qubole ❖ Apache Airflow ❖ Operational Challenges in managing an ETL ❖ Alerts and Monitoring ❖ Quality Assurance in ETLs
  • 4. About Qubole Data Service ❖ A self-service platform for big data analytics. ❖ Delivers best-in-class Apache tools such as Hadoop, Hive, and Spark, integrated into a feature-rich enterprise platform optimized to run in the cloud. ❖ Enables users to focus on their data rather than the platform.
  • 5. Data Team @ Qubole ❖ Data Warehouse for Qubole ❖ Provides Insights and Recommendations to users ❖ Just Another Qubole Account ❖ Enables data-driven features within QDS
  • 6. Multi-Tenant Nature of the Data Team ❖ Qubole Distribution 1 (api.qubole.com): Data Warehouse ❖ Qubole Distribution 2 (azure.qubole.com): Data Warehouse
  • 7. Apache Airflow For ETL ❖ Developer Friendly ❖ A rich collection of Operators, CLI utilities, and a UI to author and manage your Data Pipeline. ❖ Horizontally Scalable. ❖ Tight Integration With Qubole
  • 8. DAG creation in Airflow
  • 9. Operational Challenges in the ETL World ❖ How do we achieve continuous integration and deployment for ETLs? ❖ How do we effectively manage configuration for ETLs in a multi-tenant environment? ❖ How do we make ETLs aware of the Data Warehouse migrations?
  • 11. IDEA! Use Airflow Variables for saving ETL configuration.
  • 12. Airflow Variables for ETL Configuration ❖ Stores information as key-value pairs in Airflow. ❖ Extensive support (CLI, UI, and API) to manage the variables. ❖ Can be read from within an Airflow script as Variable.get("variable_name").
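As a minimal sketch of how per-tenant configuration might be read from key-value variables, the snippet below uses an in-memory dictionary as a stand-in for the Airflow variable store, so it runs without Airflow installed. The key naming scheme (`etl_config_<distribution>`), the config fields, and the helper names are all hypothetical. Inside a real DAG file, the getter passed in would typically be `Variable.get` from `airflow.models`.

```python
import json
from typing import Callable, Dict

# Stand-in for the Airflow variable store (hypothetical keys and values).
FAKE_VARIABLES: Dict[str, str] = {
    "etl_config_api": json.dumps({"warehouse_schema": "dw_api"}),
    "etl_config_azure": json.dumps({"warehouse_schema": "dw_azure"}),
}


def fake_variable_get(key: str) -> str:
    """Mimic a key-value lookup; raises KeyError for unknown keys."""
    return FAKE_VARIABLES[key]


def load_etl_config(
    distribution: str,
    get_variable: Callable[[str], str] = fake_variable_get,
) -> dict:
    """Load the JSON config stored under a per-distribution key."""
    return json.loads(get_variable(f"etl_config_{distribution}"))


# The same ETL code resolves a different schema for each distribution.
print(load_etl_config("api")["warehouse_schema"])
print(load_etl_config("azure")["warehouse_schema"])
```

Keeping one JSON document per distribution is only one possible layout; separate flat keys per setting would work equally well with the same getter.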
  • 13. Warehouse Management ❖ A leaf out of Ruby on Rails: Active Record Migrations. ❖ Each migration is tagged and committed as a single commit to version control along with the ETL changes.
  • 14. The PROCESS IS EASY: (1) Check out the target tag from version control. (2) Fetch the current migration number from Airflow Variables. (3) Run any new relevant migrations. (4) Update the migration number.
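The migration steps on this slide can be sketched roughly as follows. This is an illustration under stated assumptions, not the team's actual tooling: a plain dictionary stands in for Airflow Variables, the variable key `migration_number` is hypothetical, and each migration is modeled as a numbered Python callable.

```python
from typing import Callable, Dict, List, Tuple

# A migration is a (number, action) pair; the action applies the schema change.
Migration = Tuple[int, Callable[[], None]]


def run_pending_migrations(
    variables: Dict[str, str],
    migrations: List[Migration],
    key: str = "migration_number",
) -> List[int]:
    """Apply migrations newer than the stored number, in ascending order."""
    current = int(variables.get(key, "0"))
    applied: List[int] = []
    for number, action in sorted(migrations, key=lambda m: m[0]):
        if number <= current:
            continue  # already applied in an earlier deployment
        action()
        # Record progress after each success so a failed run can resume here.
        variables[key] = str(number)
        applied.append(number)
    return applied


# Example: the warehouse is at migration 2 and the target tag ships up to 4.
log: List[str] = []
store = {"migration_number": "2"}
steps = [(n, lambda n=n: log.append(f"applied {n}")) for n in (1, 2, 3, 4)]
print(run_pending_migrations(store, steps), store["migration_number"])
```

Updating the stored number after every migration, rather than once at the end, is a design choice made here so that a partially failed run does not reapply earlier migrations.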
  • 15. Deployment ❖ Traditional deployment is too messy when multiple users are handling Airflow. ❖ Data Apps for ETL deployment. ❖ Provides a CLI option like <ETL_NAME> deploy -r <version_tag> -d <start_date>. Steps: (1) Check out the Airflow template file from version control. (2) Read config values from Airflow and translate the config values. (3) Copy the final script file to the Airflow directory.
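A command line of the shape `<ETL_NAME> deploy -r <version_tag> -d <start_date>` could be parsed along these lines. This is only a sketch: the destination names, the example ETL name `usage_etl`, and the `YYYY-MM-DD` date format are assumptions, and the real Data App may behave differently.

```python
import argparse
from datetime import date, datetime


def parse_date(text: str) -> date:
    """Parse a start date; the YYYY-MM-DD format is an assumption."""
    return datetime.strptime(text, "%Y-%m-%d").date()


def build_parser(etl_name: str) -> argparse.ArgumentParser:
    """Build a parser for '<etl_name> deploy -r <tag> -d <start_date>'."""
    parser = argparse.ArgumentParser(prog=etl_name)
    commands = parser.add_subparsers(dest="command", required=True)
    deploy = commands.add_parser("deploy", help="deploy a tagged ETL version")
    deploy.add_argument("-r", dest="version_tag", required=True,
                        help="version control tag to check out")
    deploy.add_argument("-d", dest="start_date", required=True,
                        type=parse_date, help="start date for the ETL")
    return parser


# Parse an explicit argument list instead of sys.argv for demonstration.
args = build_parser("usage_etl").parse_args(
    ["deploy", "-r", "v1.2.0", "-d", "2018-03-01"]
)
print(args.command, args.version_tag, args.start_date)
```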
  • 17. DAG in Qubole ❖ This graph has 90+ operators! ❖ 8-9 different types. ❖ Clearly, error-prone!
  • 19. IMPORTANCE OF DATA VALIDATION ❖ An application’s correctness depends on the correctness of its data. ❖ Increase confidence in data by quantifying data quality. ❖ Correcting existing data can be expensive: prevention is better than cure! ❖ Stop critical downstream tasks if the data is invalid.
  • 20. TREND MONITORING ❖ Monitor dips, peaks, anomalies. ❖ Hard problem! ❖ Not real time. ❖ One size doesn’t fit all: different ETLs manipulate data in different ways. ❖ Difficult to maintain.
  • 21. IDEA! Use assert queries for data validation!
  • 22. Using Apache Airflow Check Operators. Approach: ❖ Extend the open source Airflow check operator for queries running on the Qubole platform. ❖ Run data validation queries. ❖ Fail the operator if the validation fails.
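The "run an assert query, fail if validation fails" approach can be illustrated without Airflow. The sketch below follows the general convention of Airflow's open source check operator, where the first result row is fetched and the check fails if any value in it is falsy. The `run_check` helper, the exception class, and the `events` table are hypothetical, and an in-memory SQLite database stands in for the warehouse.

```python
import sqlite3
from typing import Callable, Optional, Sequence


class DataValidationError(Exception):
    """Raised when an assert query reports invalid data."""


def run_check(run_query: Callable[[str], Optional[Sequence]], sql: str) -> Sequence:
    """Fail unless the first result row exists and every value is truthy."""
    row = run_query(sql)
    if not row:
        raise DataValidationError(f"no rows returned by: {sql}")
    if not all(bool(value) for value in row):
        raise DataValidationError(f"check failed, got {tuple(row)} from: {sql}")
    return row


# Hypothetical table standing in for a warehouse table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER, source TEXT)")
conn.executemany("INSERT INTO events VALUES (?, ?)", [(1, "ui"), (2, "api")])


def first_row(sql: str) -> Optional[Sequence]:
    """Run a query and return only its first row."""
    return conn.execute(sql).fetchone()


# Passes: the table is non-empty and no row has a NULL source.
run_check(first_row, "SELECT COUNT(*) > 0, SUM(source IS NULL) = 0 FROM events")
```

Raising an exception is what lets a scheduler mark the task as failed and stop downstream tasks that depend on it.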
  • 24. Limitations and Enhancements to open source Apache Airflow Operator
  • 25. 1. Compare data across engines. Problem: Airflow check operators required pass_value to be defined before the ETL starts. Use case: Validating data import logic. Solution: Make pass_value an Airflow template field so that it can be configured at run time. Once it is a template field, the pass value can be injected through multiple mechanisms.
  • 26. Pass value as an Airflow template field
  • 27. 2. Validate multiline results. Problem: Currently, Apache Airflow check operators consider only a single row for comparison. Use case: Run group queries and compare each of the values against the pass_value. Solution: Qubole_check_operator adds a `results_parser_callable` parameter. The function pointed to by `results_parser_callable` holds the logic to return a list of records on which the checks are performed.
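The idea of a parser callable that turns multi-row results into a list of records, each compared against the pass value, might look like the sketch below. It does not use the Qubole operator itself: the raw result format (tab-separated lines), `parse_counts`, and `failing_records` are hypothetical stand-ins chosen for illustration.

```python
from typing import Callable, List


def parse_counts(raw_results: str) -> List[int]:
    """Hypothetical parser: one tab-separated 'group<TAB>count' pair per line."""
    return [int(line.split("\t")[1]) for line in raw_results.strip().splitlines()]


def failing_records(
    raw_results: str,
    pass_value: int,
    results_parser: Callable[[str], List[int]],
) -> List[int]:
    """Return every parsed record that does not equal pass_value."""
    return [record for record in results_parser(raw_results) if record != pass_value]


# Example: partition counts grouped by day; 24 is expected for each group.
raw = "2018-03-01\t24\n2018-03-02\t24\n2018-03-03\t23"
print(failing_records(raw, 24, parse_counts))
```

Passing the parser as a parameter keeps the comparison logic generic while each ETL decides how its own query output maps to records.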
  • 28. Parser function as parameter to Check operator
  • 29. Integration of Apache Airflow Check Operators with Qubole ETLs
  • 30. ETL #1: Data Ingestion. Imports data from RDS tables into the Data Warehouse for analysis purposes. Historical issues: mismatch with source data: (1) data duplication; (2) data missing for a certain duration. Checks employed: count comparison across the two data stores, source and destination. How checks have helped us: verify and rectify the upsert logic (which is not a plain copy of RDS). PS: Runtime fetching of expected values!
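A count comparison across source and destination, with the expected value fetched at run time, can be sketched as follows. Two in-memory SQLite databases stand in for RDS and the data warehouse; the `accounts` table and the helper names are hypothetical.

```python
import sqlite3


def row_count(conn: sqlite3.Connection, table: str) -> int:
    """Count rows in a table (the table name is trusted, not user input)."""
    return conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]


def counts_match(
    source: sqlite3.Connection,
    destination: sqlite3.Connection,
    table: str,
) -> bool:
    """Fetch the expected count from the source at run time and compare."""
    expected = row_count(source, table)  # plays the role of pass_value
    return row_count(destination, table) == expected


# Two in-memory databases stand in for RDS and the data warehouse.
source = sqlite3.connect(":memory:")
destination = sqlite3.connect(":memory:")
for conn in (source, destination):
    conn.execute("CREATE TABLE accounts (id INTEGER)")
source.executemany("INSERT INTO accounts VALUES (?)", [(1,), (2,), (3,)])
destination.executemany("INSERT INTO accounts VALUES (?)", [(1,), (2,)])

# One row is missing in the destination, so the check reports a mismatch.
print(counts_match(source, destination, "accounts"))
```

Because the expected count is read from the source when the check runs, nothing has to be hard-coded before the ETL starts.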
  • 31. ETL #2: Data Transformation. Repartitions a day’s worth of data into hourly partitions. Historical issues: (1) Data ending up in a single partition field (the default Hive partition). (2) Wrong ordering of values in fields. Checks employed: (1) The number of partitions created is 24 (one for every hour). (2) Check the value of the critical field, “source”. How checks have helped us: verify and rectify the repartitioning logic.
  • 32. ETL #3: Cost Computation. Computes Qubole Compute Unit Hour (QCUH). Situation: We are narrowing the granularity of cost computation from daily to hourly. How checks have helped: they monitor new data and raise an alarm in case of a mismatch in trends between old and new data.
  • 33. ETL #4: Data Transformation. Parses customer queries and outputs table usage information. Historical issues: (1) Data missing for a customer account. (2) Data loss due to different syntaxes across engines. (3) Data loss due to query syntax changes across different versions of data engines. Checks employed: (1) Group by account IDs; if any count is 0, raise an alert. (2) Group by engine type and account ID; if the error % is high, raise an alert. How checks have helped us: insights into the amount of data loss; feedback that helped us make syntax checking more robust.
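The second check on this slide, grouping by engine type and account ID and alerting on a high error percentage, could be approximated as below. The 10% default threshold, the record layout, and the function name are assumptions for illustration; the slides do not state what counts as "high".

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Key = Tuple[str, int]  # (engine type, account ID)


def high_error_groups(
    records: Iterable[Tuple[str, int, bool]],
    max_error_pct: float = 10.0,
) -> List[Key]:
    """Return (engine, account) groups whose parse error % exceeds the limit."""
    totals: Dict[Key, int] = defaultdict(int)
    errors: Dict[Key, int] = defaultdict(int)
    for engine, account_id, parsed_ok in records:
        totals[(engine, account_id)] += 1
        if not parsed_ok:
            errors[(engine, account_id)] += 1
    return sorted(
        key for key, total in totals.items()
        if 100.0 * errors[key] / total > max_error_pct
    )


# Hypothetical parse outcomes: (engine, account ID, parsed successfully?).
sample = [
    ("hive", 1, True), ("hive", 1, True), ("hive", 1, False),
    ("presto", 2, True), ("presto", 2, True),
]
print(high_error_groups(sample))
```

In a pipeline, a non-empty result would be the condition that triggers the alert.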
  • 34. FEATURES ❖ Ability to plug in different alerting mechanisms. ❖ Dependency management and failure handling. ❖ Ability to parse the output of an assert query in a user-defined manner. ❖ Run-time fetching of the pass_value against which the comparison is made. ❖ Ability to generate a failure/success report.
  • 35. LESSONS LEARNT ❖ One size doesn’t fit all: estimation of data trends is a difficult problem. ❖ Delegate the validation task to the ETL itself.
  • 36. Source code has been contributed to Apache Airflow: AIRFLOW-2228: Enhancements in Check operator; AIRFLOW-2213: Adding Qubole Check Operator.
  • 37. In data we trust! THANKS! Any questions? You can find us at: sakshib@qubole.com sreenathk@qubole.com