SlideShare a Scribd company logo
Resilient Predictive Data
Pipelines
Sid Anand (@r39132)
QCon London 2016
1
About Me
2
Work [ed | s] @
Maintainer on
Report to
Co-Chair for
airbnb/airflow
Motivation
Why is a Data Pipeline talk in this High
Availability Track?
3
Different Types of Data Pipelines
4
ETL
• used for : loading data related
to business health into a Data
Warehouse
• user-engagement stats (e.g.
social networking)
• product success stats (e.g.
e-commerce)
• audience : Business, BizOps
• downtime? : 1-3 days
Predictive
• used for :
•building recommendation
products (e.g. social
networking, shopping)
•updating fraud prevention
endpoints (e.g. security,
payments, e-commerce)
• audience : Customers
• downtime? : < 1 hour
VS
5
ETL
• used for
Data Warehouse —> Reports
• user-engagement stats (
social networking
• product success stats (
e-commerce
• audience
• downtime?
Predictive
• used for :
•building recommendation
products (e.g. social
networking, shopping)
•updating fraud prevention
endpoints (e.g. security,
payments, e-commerce)
• audience : Customers
• downtime? : < 1 hour
VS
Different Types of Data Pipelines
Why Do We Care About Resilience?
6
d4d5d6d7d8
DB
Search
Engines
Recommenders
d1 d2 d3
DB
Why Do We Care About Resilience?
7
d4
d6d7d8
Search
Engines
Recommenders
d1 d2 d3
d5
Why Do We Care About Resilience?
8
d7d8
DB
Search
Engines
Recommenders
d1 d2 d3 d4
d5
d6
Why Do We Care About Resilience?
9
d7d8
DB
Search
Engines
Recommenders
d1 d2 d3 d4
d5
d6
Any Take-aways?
10
d7d8
DB
Search
Engines
Recommenders
d1 d2 d3 d4
d5
d6
• Bugs happen!
• Bugs in Predictive Data Pipelines have a large blast
radius
• The bugs can affect customers and a company’s
profits & reputation!
Design Goals
Desirable Qualities of a Resilient Data Pipeline
11
12
• Scalable
• Available
• Instrumented, Monitored, & Alert-enabled
• Quickly Recoverable
Desirable Qualities of a Resilient
Data Pipeline
13
• Scalable
• Build your pipelines using [infinitely] scalable components
• The scalability of your system is determined by its least-scalable
component
• Available
• Instrumented, Monitored, & Alert-enabled
• Quickly Recoverable
Desirable Qualities of a Resilient
Data Pipeline
Desirable Qualities of a Resilient
Data Pipeline
14
• Scalable
• Build your pipelines using [infinitely] scalable components
• The scalability of your system is determined by its least-scalable
component
• Available
• Ditto
• Instrumented, Monitored, & Alert-enabled
• Quickly Recoverable
Instrumented
15
Instrumentation must reveal SLA metrics at each stage of the pipeline!
What SLA metrics do we care about? Correctness & Timeliness
• Correctness
• No Data Loss
• No Data Corruption
• No Data Duplication
• A Defined Acceptable Staleness of Intermediate Data
• Timeliness
• A late result == a useless result
• Delayed processing of now()’s data may delay the processing of future
data
Instrumented, Monitored, & Alert-
enabled
16
• Instrument
• Instrument Correctness & Timeliness SLA metrics at each stage of the
pipeline
• Monitor
• Continuously monitor that SLA metrics fall within acceptable bounds (i.e.
pre-defined SLAs)
• Alert
• Alert when we miss SLAs
Desirable Qualities of a Resilient
Data Pipeline
17
• Scalable
• Build your pipelines using [infinitely] scalable components
• The scalability of your system is determined by its least-scalable
component
• Available
• Ditto
• Instrumented, Monitored, & Alert-enabled
• Quickly Recoverable
Quickly Recoverable
18
• Bugs happen!
• Bugs in Predictive Data Pipelines have a large blast radius
• Optimize for MTTR
Implementation
Using AWS to meet Design Goals
19
SQS
Simple Queue Service
20
SQS - Overview
21
AWS’s low-latency, highly scalable, highly available message queue
Infinitely Scalable Queue (though not FIFO)
Low End-to-end latency (generally sub-second)
Pull-based
How it Works!
Producers publish messages, which can be batched, to an SQS queue
Consumers
consume messages, which can be batched, from the queue
commit message contents to a data store
ACK the messages as a batch
visibility
timer
SQS - Typical Operation Flow
22
Producer
Producer
Producer
m1m2m3m4m5
Consumer
Consumer
Consumer
DB
m1
SQS
Step 1: A consumer reads a message from
SQS. This starts a visibility timer!
visibility
timer
SQS - Typical Operation Flow
23
Producer
Producer
Producer
m1m2m3m4m5
Consumer
Consumer
Consumer
DB
m1
SQS
Step 2: Consumer persists message
contents to DB
visibility
timer
SQS - Typical Operation Flow
24
Producer
Producer
Producer
m1m2m3m4m5
Consumer
Consumer
Consumer
DB
m1
SQS
Step 3: Consumer ACKs message in SQS
visibility
timer
SQS - Time Out Example
25
Producer
Producer
Producer
m1m2m3m4m5
Consumer
Consumer
Consumer
DB
m1
SQS
Step 1: A consumer reads a message from
SQS
visibility
timer
SQS - Time Out Example
26
Producer
Producer
Producer
m1m2m3m4m5
Consumer
Consumer
Consumer
DB
m1
SQS
Step 2: Consumer attempts persists
message contents to DB
visibility
time out
SQS - Time Out Example
27
Producer
Producer
Producer
m1m2m3m4m5
Consumer
Consumer
Consumer
DB
m1
SQS
Step 3: A Visibility Timeout occurs & the
message becomes visible again.
visibility
timer
SQS - Time Out Example
28
Producer
Producer
Producer
m1m2m3m4m5
Consumer
Consumer
Consumer
DB
m1
m1
SQS
Step 4: Another consumer reads and
persists the same message
visibility
timer
SQS - Time Out Example
29
Producer
Producer
Producer
m1m2m3m4m5
Consumer
Consumer
Consumer
DB
m1
SQS
Step 5: Consumer ACKs message in SQS
SQS - Dead Letter Queue
30
SQS - DLQ
visibility
timer
Producer
Producer
Producer
m2m3m4m5
Consumer
Consumer
Consumer
DB
m1
SQS
Redrive
rule : 2x
m1
SNS
Simple Notification Service
31
SNS - Overview
32
Highly Scalable, Highly Available, Push-based Topic Service
Whereas SQS ensures each message is seen by at least 1 consumer
SNS ensures that each message is seen by every consumer
Reliable Multi-Push
Whereas SQS is pull-based, SNS is push-based
There is no message retention & there is a finite retry count
No Reliable Message Delivery
Can we work around this limitation?
SNS + SQS Design Pattern
33
m1m2
m1m2
m1m2
SQS Q1
SQS Q2
SNS T1
Reliable
Multi
Push
Reliable
Message
Delivery
SNS + SQS
34
Producer
Producer
Producer
m1m2
Consumer
Consumer
Consumer
DB
m1
m1m2
m1m2
SQS Q1
SQS Q2
SNS T1
Consumer
Consumer
Consumer
ES
m1
S3 + SNS + SQS Design Pattern
35
m1m2
m1m2
m1m2
SQS Q1
SQS Q2
SNS T1
Reliable
Multi
Push
Transactions
S3
d1
d2
Batch Pipeline Architecture
Putting the Pieces Together
36
Architecture
37
Architectural Elements
38
A Schema-aware Data format for all data (Avro)
The entire pipeline is built from Highly-Available/Highly-Scalable
components
S3, SNS, SQS, ASG, EMR Spark (exception DB)
The pipeline is never blocked because we use a DLQ for messages we
cannot process
We use queue-based auto-scaling to get high on-demand ingest rates
We manage everything with Airflow
Every stage in the pipeline is idempotent
Every stage in the pipeline is instrumented
ASG
Auto Scaling Group
39
ASG - Overview
40
What is it?
A means to automatically scale out/in
clusters to handle variable load/traffic
A means to keep a cluster/service always
up
Fulfills AWS’s pay-per-use promise!
When to use it?
Feed-processing, web traffic load
balancing, zone outage, etc…
ASG - Data Pipeline
41
importer
importer
importer
importer
Importer
ASG
scaleout/in
SQS
DB
ASG : CPU-based
42
Sent
CPU
ACKd/Recvd
CPU-based auto-scaling is
good at scaling in/out to
keep the average CPU
constant
ASG : CPU-based
43
Sent
CPU
Recv
Premature
Scale-in
Premature Scale-in: The CPU drops to noise-levels before all
messages are consumed. This causes scale in to occur while the
last few messages are still being committed resulting in a long time-
to-drain for the queue!
ASG - Queue-based
44
Scale-out: When Visible Messages > 0 (a.k.a. when queue depth > 0)
Scale-in: When Invisible Messages = 0 (a.k.a. when the last in-flight message is
ACK’d)
This causes the
ASG to grow
This causes the
ASG to shrink
Architecture
45
Reliable Hourly Job
Scheduling
Workflow Automation & Scheduling
46
47
Historical Context
Our first cut at the pipeline used cron to schedule hourly runs of Spark
Problem
We only knew if Spark succeeded. What if a downstream task failed?
We needed something smarter than cron that
Reliably managed a graph of tasks (DAG - Directed Acyclic Graph)
Orchestrated hourly runs of that DAG
Retried failed tasks
Tracked the performance of each run and its tasks
Reported on failure/success of runs
Our Needs
Airflow
Workflow Automation & Scheduling
48
Airflow - DAG Dashboard
49
Airflow: It’s easy to manage multiple DAGs
Airflow - Authoring DAGs
50
Airflow: Visualizing a DAG
51
Airflow: Author DAGs in Python! No need to bundle many config files!
Airflow - Authoring DAGs
Airflow - Performance Insights
52
Airflow: Gantt chart view reveals the slowest tasks for a run!
53
Airflow: …And we can easily see performance trends over time
Airflow - Performance Insights
Airflow - Alerting
54
Airflow: …And easy to integrate with Ops tools!
Airflow - Monitoring
55
56
Airflow - Join the Community
With >30 Companies, >100 Contributors , and >2500 Commits,
Airflow is growing rapidly!
We are looking for more contributors to help support the
community!
Disclaimer : I’m a maintainer on the project
Design Goal Scorecard
Are We Meeting Our Design Goals?
57
58
• Scalable
• Available
• Instrumented, Monitored, & Alert-enabled
• Quickly Recoverable
Desirable Qualities of a Resilient
Data Pipeline
59
• Scalable
• Build using scalable components from AWS
• SQS, SNS, S3, ASG, EMR Spark
• Exception = DB (WIP)
• Available
• Build using available components from AWS
• Airflow for reliable job scheduling
• Instrumented, Monitored, & Alert-enabled
• Airflow
• Quickly Recoverable
• Airflow, DLQs, ASGs, Spark & DB
Desirable Qualities of a Resilient
Data Pipeline
Questions? (@r39132)
60

More Related Content

What's hot

Data Processing with Apache Spark Meetup Talk
Data Processing with Apache Spark Meetup TalkData Processing with Apache Spark Meetup Talk
Data Processing with Apache Spark Meetup Talk
Eren Avşaroğulları
 
Lambda at Weather Scale by Robbie Strickland
Lambda at Weather Scale by Robbie StricklandLambda at Weather Scale by Robbie Strickland
Lambda at Weather Scale by Robbie Strickland
Spark Summit
 
8 Lessons Learned from Using Kafka in 1500 microservices - confluent streamin...
8 Lessons Learned from Using Kafka in 1500 microservices - confluent streamin...8 Lessons Learned from Using Kafka in 1500 microservices - confluent streamin...
8 Lessons Learned from Using Kafka in 1500 microservices - confluent streamin...
Natan Silnitsky
 
What Crimean War gunboats teach us about the need for schema registries
What Crimean War gunboats teach us about the need for schema registriesWhat Crimean War gunboats teach us about the need for schema registries
What Crimean War gunboats teach us about the need for schema registries
Alexander Dean
 
Extending the Yahoo Streaming Benchmark + MapR Benchmarks
Extending the Yahoo Streaming Benchmark + MapR BenchmarksExtending the Yahoo Streaming Benchmark + MapR Benchmarks
Extending the Yahoo Streaming Benchmark + MapR Benchmarks
Jamie Grier
 
From Zero to Hadoop: a tutorial for getting started writing Hadoop jobs on Am...
From Zero to Hadoop: a tutorial for getting started writing Hadoop jobs on Am...From Zero to Hadoop: a tutorial for getting started writing Hadoop jobs on Am...
From Zero to Hadoop: a tutorial for getting started writing Hadoop jobs on Am...
Alexander Dean
 
Extending the Yahoo Streaming Benchmark
Extending the Yahoo Streaming BenchmarkExtending the Yahoo Streaming Benchmark
Extending the Yahoo Streaming Benchmark
Jamie Grier
 
Spark streaming + kafka 0.10
Spark streaming + kafka 0.10Spark streaming + kafka 0.10
Spark streaming + kafka 0.10
Joan Viladrosa Riera
 
Being Ready for Apache Kafka - Apache: Big Data Europe 2015
Being Ready for Apache Kafka - Apache: Big Data Europe 2015Being Ready for Apache Kafka - Apache: Big Data Europe 2015
Being Ready for Apache Kafka - Apache: Big Data Europe 2015
Michael Noll
 
Building Stream Processing Applications with Apache Kafka Using KSQL (Robin M...
Building Stream Processing Applications with Apache Kafka Using KSQL (Robin M...Building Stream Processing Applications with Apache Kafka Using KSQL (Robin M...
Building Stream Processing Applications with Apache Kafka Using KSQL (Robin M...
confluent
 
ApacheCon Berlin 2019: Kongo:Building a Scalable Streaming IoT Application us...
ApacheCon Berlin 2019: Kongo:Building a Scalable Streaming IoT Application us...ApacheCon Berlin 2019: Kongo:Building a Scalable Streaming IoT Application us...
ApacheCon Berlin 2019: Kongo:Building a Scalable Streaming IoT Application us...
Paul Brebner
 
[Spark Summit EU 2017] Apache spark streaming + kafka 0.10 an integration story
[Spark Summit EU 2017] Apache spark streaming + kafka 0.10  an integration story[Spark Summit EU 2017] Apache spark streaming + kafka 0.10  an integration story
[Spark Summit EU 2017] Apache spark streaming + kafka 0.10 an integration story
Joan Viladrosa Riera
 
Flink at netflix paypal speaker series
Flink at netflix   paypal speaker seriesFlink at netflix   paypal speaker series
Flink at netflix paypal speaker series
Monal Daxini
 
Blueflood: Open Source Metrics Processing at CassandraEU 2013
Blueflood: Open Source Metrics Processing at CassandraEU 2013Blueflood: Open Source Metrics Processing at CassandraEU 2013
Blueflood: Open Source Metrics Processing at CassandraEU 2013
gdusbabek
 
Blueflood and Beyond: The Future of Metrics - Berlin Buzzwords 2014
Blueflood and Beyond: The Future of Metrics - Berlin Buzzwords 2014Blueflood and Beyond: The Future of Metrics - Berlin Buzzwords 2014
Blueflood and Beyond: The Future of Metrics - Berlin Buzzwords 2014
gdusbabek
 
Apache Kafka - Scalable Message-Processing and more !
Apache Kafka - Scalable Message-Processing and more !Apache Kafka - Scalable Message-Processing and more !
Apache Kafka - Scalable Message-Processing and more !
Guido Schmutz
 
Apache Storm vs. Spark Streaming – two Stream Processing Platforms compared
Apache Storm vs. Spark Streaming – two Stream Processing Platforms comparedApache Storm vs. Spark Streaming – two Stream Processing Platforms compared
Apache Storm vs. Spark Streaming – two Stream Processing Platforms compared
Guido Schmutz
 
Large-Scale ETL Data Flows With Data Pipeline and Dataduct
Large-Scale ETL Data Flows With Data Pipeline and DataductLarge-Scale ETL Data Flows With Data Pipeline and Dataduct
Large-Scale ETL Data Flows With Data Pipeline and Dataduct
Sourabh Bajaj
 
Ultimate journey towards realtime data platform with 2.5M events per sec
Ultimate journey towards realtime data platform with 2.5M events per secUltimate journey towards realtime data platform with 2.5M events per sec
Ultimate journey towards realtime data platform with 2.5M events per sec
b0ris_1
 
Moon soo Lee – Data Science Lifecycle with Apache Flink and Apache Zeppelin
Moon soo Lee – Data Science Lifecycle with Apache Flink and Apache ZeppelinMoon soo Lee – Data Science Lifecycle with Apache Flink and Apache Zeppelin
Moon soo Lee – Data Science Lifecycle with Apache Flink and Apache Zeppelin
Flink Forward
 

What's hot (20)

Data Processing with Apache Spark Meetup Talk
Data Processing with Apache Spark Meetup TalkData Processing with Apache Spark Meetup Talk
Data Processing with Apache Spark Meetup Talk
 
Lambda at Weather Scale by Robbie Strickland
Lambda at Weather Scale by Robbie StricklandLambda at Weather Scale by Robbie Strickland
Lambda at Weather Scale by Robbie Strickland
 
8 Lessons Learned from Using Kafka in 1500 microservices - confluent streamin...
8 Lessons Learned from Using Kafka in 1500 microservices - confluent streamin...8 Lessons Learned from Using Kafka in 1500 microservices - confluent streamin...
8 Lessons Learned from Using Kafka in 1500 microservices - confluent streamin...
 
What Crimean War gunboats teach us about the need for schema registries
What Crimean War gunboats teach us about the need for schema registriesWhat Crimean War gunboats teach us about the need for schema registries
What Crimean War gunboats teach us about the need for schema registries
 
Extending the Yahoo Streaming Benchmark + MapR Benchmarks
Extending the Yahoo Streaming Benchmark + MapR BenchmarksExtending the Yahoo Streaming Benchmark + MapR Benchmarks
Extending the Yahoo Streaming Benchmark + MapR Benchmarks
 
From Zero to Hadoop: a tutorial for getting started writing Hadoop jobs on Am...
From Zero to Hadoop: a tutorial for getting started writing Hadoop jobs on Am...From Zero to Hadoop: a tutorial for getting started writing Hadoop jobs on Am...
From Zero to Hadoop: a tutorial for getting started writing Hadoop jobs on Am...
 
Extending the Yahoo Streaming Benchmark
Extending the Yahoo Streaming BenchmarkExtending the Yahoo Streaming Benchmark
Extending the Yahoo Streaming Benchmark
 
Spark streaming + kafka 0.10
Spark streaming + kafka 0.10Spark streaming + kafka 0.10
Spark streaming + kafka 0.10
 
Being Ready for Apache Kafka - Apache: Big Data Europe 2015
Being Ready for Apache Kafka - Apache: Big Data Europe 2015Being Ready for Apache Kafka - Apache: Big Data Europe 2015
Being Ready for Apache Kafka - Apache: Big Data Europe 2015
 
Building Stream Processing Applications with Apache Kafka Using KSQL (Robin M...
Building Stream Processing Applications with Apache Kafka Using KSQL (Robin M...Building Stream Processing Applications with Apache Kafka Using KSQL (Robin M...
Building Stream Processing Applications with Apache Kafka Using KSQL (Robin M...
 
ApacheCon Berlin 2019: Kongo:Building a Scalable Streaming IoT Application us...
ApacheCon Berlin 2019: Kongo:Building a Scalable Streaming IoT Application us...ApacheCon Berlin 2019: Kongo:Building a Scalable Streaming IoT Application us...
ApacheCon Berlin 2019: Kongo:Building a Scalable Streaming IoT Application us...
 
[Spark Summit EU 2017] Apache spark streaming + kafka 0.10 an integration story
[Spark Summit EU 2017] Apache spark streaming + kafka 0.10  an integration story[Spark Summit EU 2017] Apache spark streaming + kafka 0.10  an integration story
[Spark Summit EU 2017] Apache spark streaming + kafka 0.10 an integration story
 
Flink at netflix paypal speaker series
Flink at netflix   paypal speaker seriesFlink at netflix   paypal speaker series
Flink at netflix paypal speaker series
 
Blueflood: Open Source Metrics Processing at CassandraEU 2013
Blueflood: Open Source Metrics Processing at CassandraEU 2013Blueflood: Open Source Metrics Processing at CassandraEU 2013
Blueflood: Open Source Metrics Processing at CassandraEU 2013
 
Blueflood and Beyond: The Future of Metrics - Berlin Buzzwords 2014
Blueflood and Beyond: The Future of Metrics - Berlin Buzzwords 2014Blueflood and Beyond: The Future of Metrics - Berlin Buzzwords 2014
Blueflood and Beyond: The Future of Metrics - Berlin Buzzwords 2014
 
Apache Kafka - Scalable Message-Processing and more !
Apache Kafka - Scalable Message-Processing and more !Apache Kafka - Scalable Message-Processing and more !
Apache Kafka - Scalable Message-Processing and more !
 
Apache Storm vs. Spark Streaming – two Stream Processing Platforms compared
Apache Storm vs. Spark Streaming – two Stream Processing Platforms comparedApache Storm vs. Spark Streaming – two Stream Processing Platforms compared
Apache Storm vs. Spark Streaming – two Stream Processing Platforms compared
 
Large-Scale ETL Data Flows With Data Pipeline and Dataduct
Large-Scale ETL Data Flows With Data Pipeline and DataductLarge-Scale ETL Data Flows With Data Pipeline and Dataduct
Large-Scale ETL Data Flows With Data Pipeline and Dataduct
 
Ultimate journey towards realtime data platform with 2.5M events per sec
Ultimate journey towards realtime data platform with 2.5M events per secUltimate journey towards realtime data platform with 2.5M events per sec
Ultimate journey towards realtime data platform with 2.5M events per sec
 
Moon soo Lee – Data Science Lifecycle with Apache Flink and Apache Zeppelin
Moon soo Lee – Data Science Lifecycle with Apache Flink and Apache ZeppelinMoon soo Lee – Data Science Lifecycle with Apache Flink and Apache Zeppelin
Moon soo Lee – Data Science Lifecycle with Apache Flink and Apache Zeppelin
 

Similar to Resilient Predictive Data Pipelines (QCon London 2016)

Introduction to Apache Airflow - Data Day Seattle 2016
Introduction to Apache Airflow - Data Day Seattle 2016Introduction to Apache Airflow - Data Day Seattle 2016
Introduction to Apache Airflow - Data Day Seattle 2016
Sid Anand
 
AIRflow at Scale
AIRflow at ScaleAIRflow at Scale
AIRflow at Scale
Digital Vidya
 
Intro to Apache Apex - Next Gen Platform for Ingest and Transform
Intro to Apache Apex - Next Gen Platform for Ingest and TransformIntro to Apache Apex - Next Gen Platform for Ingest and Transform
Intro to Apache Apex - Next Gen Platform for Ingest and Transform
Apache Apex
 
Monitoring Large-Scale Apache Spark Clusters at Databricks
Monitoring Large-Scale Apache Spark Clusters at DatabricksMonitoring Large-Scale Apache Spark Clusters at Databricks
Monitoring Large-Scale Apache Spark Clusters at Databricks
Anyscale
 
Managing Performance Globally with MySQL
Managing Performance Globally with MySQLManaging Performance Globally with MySQL
Managing Performance Globally with MySQL
Daniel Austin
 
Citi Tech Talk: Monitoring and Performance
Citi Tech Talk: Monitoring and PerformanceCiti Tech Talk: Monitoring and Performance
Citi Tech Talk: Monitoring and Performance
confluent
 
Scylla Summit 2022: Building Zeotap's Privacy Compliant Customer Data Platfor...
Scylla Summit 2022: Building Zeotap's Privacy Compliant Customer Data Platfor...Scylla Summit 2022: Building Zeotap's Privacy Compliant Customer Data Platfor...
Scylla Summit 2022: Building Zeotap's Privacy Compliant Customer Data Platfor...
ScyllaDB
 
(BDT404) Large-Scale ETL Data Flows w/AWS Data Pipeline & Dataduct
(BDT404) Large-Scale ETL Data Flows w/AWS Data Pipeline & Dataduct(BDT404) Large-Scale ETL Data Flows w/AWS Data Pipeline & Dataduct
(BDT404) Large-Scale ETL Data Flows w/AWS Data Pipeline & Dataduct
Amazon Web Services
 
Cloud Native Data Pipelines
Cloud Native Data PipelinesCloud Native Data Pipelines
Cloud Native Data Pipelines
Bill Liu
 
A Big Data Lake Based on Spark for BBVA Bank-(Oscar Mendez, STRATIO)
A Big Data Lake Based on Spark for BBVA Bank-(Oscar Mendez, STRATIO)A Big Data Lake Based on Spark for BBVA Bank-(Oscar Mendez, STRATIO)
A Big Data Lake Based on Spark for BBVA Bank-(Oscar Mendez, STRATIO)
Spark Summit
 
Jeremy Edberg (MinOps ) - How to build a solid infrastructure for a startup t...
Jeremy Edberg (MinOps ) - How to build a solid infrastructure for a startup t...Jeremy Edberg (MinOps ) - How to build a solid infrastructure for a startup t...
Jeremy Edberg (MinOps ) - How to build a solid infrastructure for a startup t...
Startupfest
 
Azure Monitoring Overview
Azure Monitoring OverviewAzure Monitoring Overview
Azure Monitoring Overview
gjuljo
 
Enterprise Use Case Webinar - PaaS Metering and Monitoring
Enterprise Use Case Webinar - PaaS Metering and Monitoring Enterprise Use Case Webinar - PaaS Metering and Monitoring
Enterprise Use Case Webinar - PaaS Metering and Monitoring WSO2
 
Big Data Berlin v8.0 Stream Processing with Apache Apex
Big Data Berlin v8.0 Stream Processing with Apache Apex Big Data Berlin v8.0 Stream Processing with Apache Apex
Big Data Berlin v8.0 Stream Processing with Apache Apex
Apache Apex
 
Thomas Weise, Apache Apex PMC Member and Architect/Co-Founder, DataTorrent - ...
Thomas Weise, Apache Apex PMC Member and Architect/Co-Founder, DataTorrent - ...Thomas Weise, Apache Apex PMC Member and Architect/Co-Founder, DataTorrent - ...
Thomas Weise, Apache Apex PMC Member and Architect/Co-Founder, DataTorrent - ...
Dataconomy Media
 
Large scale, distributed access management deployment with aruba clear pass
Large scale, distributed access management deployment with aruba clear passLarge scale, distributed access management deployment with aruba clear pass
Large scale, distributed access management deployment with aruba clear pass
Aruba, a Hewlett Packard Enterprise company
 
ADF Performance Monitor
ADF Performance MonitorADF Performance Monitor
Real Time Test Data with Grafana
Real Time Test Data with GrafanaReal Time Test Data with Grafana
Real Time Test Data with Grafana
Ioannis Papadakis
 
Big Data Analytics on the Cloud Oracle Applications AWS Redshift & Tableau
Big Data Analytics on the Cloud Oracle Applications AWS Redshift & TableauBig Data Analytics on the Cloud Oracle Applications AWS Redshift & Tableau
Big Data Analytics on the Cloud Oracle Applications AWS Redshift & Tableau
Sam Palani
 
Apache Pulsar Overview
Apache Pulsar OverviewApache Pulsar Overview
Apache Pulsar Overview
Streamlio
 

Similar to Resilient Predictive Data Pipelines (QCon London 2016) (20)

Introduction to Apache Airflow - Data Day Seattle 2016
Introduction to Apache Airflow - Data Day Seattle 2016Introduction to Apache Airflow - Data Day Seattle 2016
Introduction to Apache Airflow - Data Day Seattle 2016
 
AIRflow at Scale
AIRflow at ScaleAIRflow at Scale
AIRflow at Scale
 
Intro to Apache Apex - Next Gen Platform for Ingest and Transform
Intro to Apache Apex - Next Gen Platform for Ingest and TransformIntro to Apache Apex - Next Gen Platform for Ingest and Transform
Intro to Apache Apex - Next Gen Platform for Ingest and Transform
 
Monitoring Large-Scale Apache Spark Clusters at Databricks
Monitoring Large-Scale Apache Spark Clusters at DatabricksMonitoring Large-Scale Apache Spark Clusters at Databricks
Monitoring Large-Scale Apache Spark Clusters at Databricks
 
Managing Performance Globally with MySQL
Managing Performance Globally with MySQLManaging Performance Globally with MySQL
Managing Performance Globally with MySQL
 
Citi Tech Talk: Monitoring and Performance
Citi Tech Talk: Monitoring and PerformanceCiti Tech Talk: Monitoring and Performance
Citi Tech Talk: Monitoring and Performance
 
Scylla Summit 2022: Building Zeotap's Privacy Compliant Customer Data Platfor...
Scylla Summit 2022: Building Zeotap's Privacy Compliant Customer Data Platfor...Scylla Summit 2022: Building Zeotap's Privacy Compliant Customer Data Platfor...
Scylla Summit 2022: Building Zeotap's Privacy Compliant Customer Data Platfor...
 
(BDT404) Large-Scale ETL Data Flows w/AWS Data Pipeline & Dataduct
(BDT404) Large-Scale ETL Data Flows w/AWS Data Pipeline & Dataduct(BDT404) Large-Scale ETL Data Flows w/AWS Data Pipeline & Dataduct
(BDT404) Large-Scale ETL Data Flows w/AWS Data Pipeline & Dataduct
 
Cloud Native Data Pipelines
Cloud Native Data PipelinesCloud Native Data Pipelines
Cloud Native Data Pipelines
 
A Big Data Lake Based on Spark for BBVA Bank-(Oscar Mendez, STRATIO)
A Big Data Lake Based on Spark for BBVA Bank-(Oscar Mendez, STRATIO)A Big Data Lake Based on Spark for BBVA Bank-(Oscar Mendez, STRATIO)
A Big Data Lake Based on Spark for BBVA Bank-(Oscar Mendez, STRATIO)
 
Jeremy Edberg (MinOps ) - How to build a solid infrastructure for a startup t...
Jeremy Edberg (MinOps ) - How to build a solid infrastructure for a startup t...Jeremy Edberg (MinOps ) - How to build a solid infrastructure for a startup t...
Jeremy Edberg (MinOps ) - How to build a solid infrastructure for a startup t...
 
Azure Monitoring Overview
Azure Monitoring OverviewAzure Monitoring Overview
Azure Monitoring Overview
 
Enterprise Use Case Webinar - PaaS Metering and Monitoring
Enterprise Use Case Webinar - PaaS Metering and Monitoring Enterprise Use Case Webinar - PaaS Metering and Monitoring
Enterprise Use Case Webinar - PaaS Metering and Monitoring
 
Big Data Berlin v8.0 Stream Processing with Apache Apex
Big Data Berlin v8.0 Stream Processing with Apache Apex Big Data Berlin v8.0 Stream Processing with Apache Apex
Big Data Berlin v8.0 Stream Processing with Apache Apex
 
Thomas Weise, Apache Apex PMC Member and Architect/Co-Founder, DataTorrent - ...
Thomas Weise, Apache Apex PMC Member and Architect/Co-Founder, DataTorrent - ...Thomas Weise, Apache Apex PMC Member and Architect/Co-Founder, DataTorrent - ...
Thomas Weise, Apache Apex PMC Member and Architect/Co-Founder, DataTorrent - ...
 
Large scale, distributed access management deployment with aruba clear pass
Large scale, distributed access management deployment with aruba clear passLarge scale, distributed access management deployment with aruba clear pass
Large scale, distributed access management deployment with aruba clear pass
 
ADF Performance Monitor
ADF Performance MonitorADF Performance Monitor
ADF Performance Monitor
 
Real Time Test Data with Grafana
Real Time Test Data with GrafanaReal Time Test Data with Grafana
Real Time Test Data with Grafana
 
Big Data Analytics on the Cloud Oracle Applications AWS Redshift & Tableau
Big Data Analytics on the Cloud Oracle Applications AWS Redshift & TableauBig Data Analytics on the Cloud Oracle Applications AWS Redshift & Tableau
Big Data Analytics on the Cloud Oracle Applications AWS Redshift & Tableau
 
Apache Pulsar Overview
Apache Pulsar OverviewApache Pulsar Overview
Apache Pulsar Overview
 

More from Sid Anand

Building High Fidelity Data Streams (QCon London 2023)
Building High Fidelity Data Streams (QCon London 2023)Building High Fidelity Data Streams (QCon London 2023)
Building High Fidelity Data Streams (QCon London 2023)
Sid Anand
 
Building & Operating High-Fidelity Data Streams - QCon Plus 2021
Building & Operating High-Fidelity Data Streams - QCon Plus 2021Building & Operating High-Fidelity Data Streams - QCon Plus 2021
Building & Operating High-Fidelity Data Streams - QCon Plus 2021
Sid Anand
 
Low Latency Fraud Detection & Prevention
Low Latency Fraud Detection & PreventionLow Latency Fraud Detection & Prevention
Low Latency Fraud Detection & Prevention
Sid Anand
 
YOW! Data Keynote (2021)
YOW! Data Keynote (2021)YOW! Data Keynote (2021)
YOW! Data Keynote (2021)
Sid Anand
 
Big Data, Fast Data @ PayPal (YOW 2018)
Big Data, Fast Data @ PayPal (YOW 2018)Big Data, Fast Data @ PayPal (YOW 2018)
Big Data, Fast Data @ PayPal (YOW 2018)
Sid Anand
 
Building Better Data Pipelines using Apache Airflow
Building Better Data Pipelines using Apache AirflowBuilding Better Data Pipelines using Apache Airflow
Building Better Data Pipelines using Apache Airflow
Sid Anand
 
Cloud Native Data Pipelines (DataEngConf SF 2017)
Cloud Native Data Pipelines (DataEngConf SF 2017)Cloud Native Data Pipelines (DataEngConf SF 2017)
Cloud Native Data Pipelines (DataEngConf SF 2017)
Sid Anand
 
LinkedIn's Segmentation & Targeting Platform (Hadoop Summit 2013)
LinkedIn's Segmentation & Targeting Platform (Hadoop Summit 2013)LinkedIn's Segmentation & Targeting Platform (Hadoop Summit 2013)
LinkedIn's Segmentation & Targeting Platform (Hadoop Summit 2013)
Sid Anand
 
Hands On with Maven
Hands On with MavenHands On with Maven
Hands On with Maven
Sid Anand
 
Learning git
Learning gitLearning git
Learning git
Sid Anand
 
LinkedIn Data Infrastructure Slides (Version 2)
LinkedIn Data Infrastructure Slides (Version 2)LinkedIn Data Infrastructure Slides (Version 2)
LinkedIn Data Infrastructure Slides (Version 2)
Sid Anand
 
LinkedIn Data Infrastructure (QCon London 2012)
LinkedIn Data Infrastructure (QCon London 2012)LinkedIn Data Infrastructure (QCon London 2012)
LinkedIn Data Infrastructure (QCon London 2012)
Sid Anand
 
Linked in nosql_atnetflix_2012_v1
Linked in nosql_atnetflix_2012_v1Linked in nosql_atnetflix_2012_v1
Linked in nosql_atnetflix_2012_v1
Sid Anand
 
Keeping Movies Running Amid Thunderstorms!
Keeping Movies Running Amid Thunderstorms!Keeping Movies Running Amid Thunderstorms!
Keeping Movies Running Amid Thunderstorms!
Sid Anand
 
OSCON Data 2011 -- NoSQL @ Netflix, Part 2
OSCON Data 2011 -- NoSQL @ Netflix, Part 2OSCON Data 2011 -- NoSQL @ Netflix, Part 2
OSCON Data 2011 -- NoSQL @ Netflix, Part 2
Sid Anand
 
Intuit CTOF 2011 - Netflix for Mobile in the Cloud
Intuit CTOF 2011 - Netflix for Mobile in the CloudIntuit CTOF 2011 - Netflix for Mobile in the Cloud
Intuit CTOF 2011 - Netflix for Mobile in the Cloud
Sid Anand
 
Svccg nosql 2011_v4
Svccg nosql 2011_v4Svccg nosql 2011_v4
Svccg nosql 2011_v4
Sid Anand
 
Netflix's Transition to High-Availability Storage (QCon SF 2010)
Netflix's Transition to High-Availability Storage (QCon SF 2010)Netflix's Transition to High-Availability Storage (QCon SF 2010)
Netflix's Transition to High-Availability Storage (QCon SF 2010)
Sid Anand
 

More from Sid Anand (18)

Building High Fidelity Data Streams (QCon London 2023)
Building High Fidelity Data Streams (QCon London 2023)Building High Fidelity Data Streams (QCon London 2023)
Building High Fidelity Data Streams (QCon London 2023)
 
Building & Operating High-Fidelity Data Streams - QCon Plus 2021
Building & Operating High-Fidelity Data Streams - QCon Plus 2021Building & Operating High-Fidelity Data Streams - QCon Plus 2021
Building & Operating High-Fidelity Data Streams - QCon Plus 2021
 
Low Latency Fraud Detection & Prevention
Low Latency Fraud Detection & PreventionLow Latency Fraud Detection & Prevention
Low Latency Fraud Detection & Prevention
 
YOW! Data Keynote (2021)
YOW! Data Keynote (2021)YOW! Data Keynote (2021)
YOW! Data Keynote (2021)
 
Big Data, Fast Data @ PayPal (YOW 2018)
Big Data, Fast Data @ PayPal (YOW 2018)Big Data, Fast Data @ PayPal (YOW 2018)
Big Data, Fast Data @ PayPal (YOW 2018)
 
Building Better Data Pipelines using Apache Airflow
Building Better Data Pipelines using Apache AirflowBuilding Better Data Pipelines using Apache Airflow
Building Better Data Pipelines using Apache Airflow
 
Cloud Native Data Pipelines (DataEngConf SF 2017)
Cloud Native Data Pipelines (DataEngConf SF 2017)Cloud Native Data Pipelines (DataEngConf SF 2017)
Cloud Native Data Pipelines (DataEngConf SF 2017)
 
LinkedIn's Segmentation & Targeting Platform (Hadoop Summit 2013)
LinkedIn's Segmentation & Targeting Platform (Hadoop Summit 2013)LinkedIn's Segmentation & Targeting Platform (Hadoop Summit 2013)
LinkedIn's Segmentation & Targeting Platform (Hadoop Summit 2013)
 
Hands On with Maven
Hands On with MavenHands On with Maven
Hands On with Maven
 
Learning git
Learning gitLearning git
Learning git
 
LinkedIn Data Infrastructure Slides (Version 2)
LinkedIn Data Infrastructure Slides (Version 2)LinkedIn Data Infrastructure Slides (Version 2)
LinkedIn Data Infrastructure Slides (Version 2)
 
LinkedIn Data Infrastructure (QCon London 2012)
LinkedIn Data Infrastructure (QCon London 2012)LinkedIn Data Infrastructure (QCon London 2012)
LinkedIn Data Infrastructure (QCon London 2012)
 
Linked in nosql_atnetflix_2012_v1
Linked in nosql_atnetflix_2012_v1Linked in nosql_atnetflix_2012_v1
Linked in nosql_atnetflix_2012_v1
 
Keeping Movies Running Amid Thunderstorms!
Keeping Movies Running Amid Thunderstorms!Keeping Movies Running Amid Thunderstorms!
Keeping Movies Running Amid Thunderstorms!
 
OSCON Data 2011 -- NoSQL @ Netflix, Part 2
OSCON Data 2011 -- NoSQL @ Netflix, Part 2OSCON Data 2011 -- NoSQL @ Netflix, Part 2
OSCON Data 2011 -- NoSQL @ Netflix, Part 2
 
Intuit CTOF 2011 - Netflix for Mobile in the Cloud
Intuit CTOF 2011 - Netflix for Mobile in the CloudIntuit CTOF 2011 - Netflix for Mobile in the Cloud
Intuit CTOF 2011 - Netflix for Mobile in the Cloud
 
Svccg nosql 2011_v4
Svccg nosql 2011_v4Svccg nosql 2011_v4
Svccg nosql 2011_v4
 
Netflix's Transition to High-Availability Storage (QCon SF 2010)
Netflix's Transition to High-Availability Storage (QCon SF 2010)Netflix's Transition to High-Availability Storage (QCon SF 2010)
Netflix's Transition to High-Availability Storage (QCon SF 2010)
 

Recently uploaded

test test test test testtest test testtest test testtest test testtest test ...
test test  test test testtest test testtest test testtest test testtest test ...test test  test test testtest test testtest test testtest test testtest test ...
test test test test testtest test testtest test testtest test testtest test ...
Arif0071
 
How to Use Contact Form 7 Like a Pro.pptx
How to Use Contact Form 7 Like a Pro.pptxHow to Use Contact Form 7 Like a Pro.pptx
How to Use Contact Form 7 Like a Pro.pptx
Gal Baras
 
This 7-second Brain Wave Ritual Attracts Money To You.!
This 7-second Brain Wave Ritual Attracts Money To You.!This 7-second Brain Wave Ritual Attracts Money To You.!
This 7-second Brain Wave Ritual Attracts Money To You.!
nirahealhty
 
Multi-cluster Kubernetes Networking- Patterns, Projects and Guidelines
Multi-cluster Kubernetes Networking- Patterns, Projects and GuidelinesMulti-cluster Kubernetes Networking- Patterns, Projects and Guidelines
Multi-cluster Kubernetes Networking- Patterns, Projects and Guidelines
Sanjeev Rampal
 
History+of+E-commerce+Development+in+China-www.cfye-commerce.shop
History+of+E-commerce+Development+in+China-www.cfye-commerce.shopHistory+of+E-commerce+Development+in+China-www.cfye-commerce.shop
History+of+E-commerce+Development+in+China-www.cfye-commerce.shop
laozhuseo02
 
1比1复刻(bath毕业证书)英国巴斯大学毕业证学位证原版一模一样
1比1复刻(bath毕业证书)英国巴斯大学毕业证学位证原版一模一样1比1复刻(bath毕业证书)英国巴斯大学毕业证学位证原版一模一样
1比1复刻(bath毕业证书)英国巴斯大学毕业证学位证原版一模一样
3ipehhoa
 
Living-in-IT-era-Module-7-Imaging-and-Design-for-Social-Impact.pptx
Living-in-IT-era-Module-7-Imaging-and-Design-for-Social-Impact.pptxLiving-in-IT-era-Module-7-Imaging-and-Design-for-Social-Impact.pptx
Living-in-IT-era-Module-7-Imaging-and-Design-for-Social-Impact.pptx
TristanJasperRamos
 
Output determination SAP S4 HANA SAP SD CC
Output determination SAP S4 HANA SAP SD CCOutput determination SAP S4 HANA SAP SD CC
Output determination SAP S4 HANA SAP SD CC
ShahulHameed54211
 
原版仿制(uob毕业证书)英国伯明翰大学毕业证本科学历证书原版一模一样
原版仿制(uob毕业证书)英国伯明翰大学毕业证本科学历证书原版一模一样原版仿制(uob毕业证书)英国伯明翰大学毕业证本科学历证书原版一模一样
原版仿制(uob毕业证书)英国伯明翰大学毕业证本科学历证书原版一模一样
3ipehhoa
 
BASIC C++ lecture NOTE C++ lecture 3.pptx
BASIC C++ lecture NOTE C++ lecture 3.pptxBASIC C++ lecture NOTE C++ lecture 3.pptx
BASIC C++ lecture NOTE C++ lecture 3.pptx
natyesu
 
The+Prospects+of+E-Commerce+in+China.pptx
The+Prospects+of+E-Commerce+in+China.pptxThe+Prospects+of+E-Commerce+in+China.pptx
The+Prospects+of+E-Commerce+in+China.pptx
laozhuseo02
 
guildmasters guide to ravnica Dungeons & Dragons 5...
guildmasters guide to ravnica Dungeons & Dragons 5...guildmasters guide to ravnica Dungeons & Dragons 5...
guildmasters guide to ravnica Dungeons & Dragons 5...
Rogerio Filho
 
急速办(bedfordhire毕业证书)英国贝德福特大学毕业证成绩单原版一模一样
急速办(bedfordhire毕业证书)英国贝德福特大学毕业证成绩单原版一模一样急速办(bedfordhire毕业证书)英国贝德福特大学毕业证成绩单原版一模一样
急速办(bedfordhire毕业证书)英国贝德福特大学毕业证成绩单原版一模一样
3ipehhoa
 
ER(Entity Relationship) Diagram for online shopping - TAE
ER(Entity Relationship) Diagram for online shopping - TAEER(Entity Relationship) Diagram for online shopping - TAE
ER(Entity Relationship) Diagram for online shopping - TAE
Himani415946
 
Latest trends in computer networking.pptx
Latest trends in computer networking.pptxLatest trends in computer networking.pptx
Latest trends in computer networking.pptx
JungkooksNonexistent
 
1.Wireless Communication System_Wireless communication is a broad term that i...
1.Wireless Communication System_Wireless communication is a broad term that i...1.Wireless Communication System_Wireless communication is a broad term that i...
1.Wireless Communication System_Wireless communication is a broad term that i...
JeyaPerumal1
 

Recently uploaded (16)

test test test test testtest test testtest test testtest test testtest test ...
test test  test test testtest test testtest test testtest test testtest test ...test test  test test testtest test testtest test testtest test testtest test ...
test test test test testtest test testtest test testtest test testtest test ...
 
How to Use Contact Form 7 Like a Pro.pptx
How to Use Contact Form 7 Like a Pro.pptxHow to Use Contact Form 7 Like a Pro.pptx
How to Use Contact Form 7 Like a Pro.pptx
 
This 7-second Brain Wave Ritual Attracts Money To You.!
This 7-second Brain Wave Ritual Attracts Money To You.!This 7-second Brain Wave Ritual Attracts Money To You.!
This 7-second Brain Wave Ritual Attracts Money To You.!
 
Multi-cluster Kubernetes Networking- Patterns, Projects and Guidelines
Multi-cluster Kubernetes Networking- Patterns, Projects and GuidelinesMulti-cluster Kubernetes Networking- Patterns, Projects and Guidelines
Multi-cluster Kubernetes Networking- Patterns, Projects and Guidelines
 
History+of+E-commerce+Development+in+China-www.cfye-commerce.shop
History+of+E-commerce+Development+in+China-www.cfye-commerce.shopHistory+of+E-commerce+Development+in+China-www.cfye-commerce.shop
History+of+E-commerce+Development+in+China-www.cfye-commerce.shop
 
1比1复刻(bath毕业证书)英国巴斯大学毕业证学位证原版一模一样
1比1复刻(bath毕业证书)英国巴斯大学毕业证学位证原版一模一样1比1复刻(bath毕业证书)英国巴斯大学毕业证学位证原版一模一样
1比1复刻(bath毕业证书)英国巴斯大学毕业证学位证原版一模一样
 
Living-in-IT-era-Module-7-Imaging-and-Design-for-Social-Impact.pptx
Living-in-IT-era-Module-7-Imaging-and-Design-for-Social-Impact.pptxLiving-in-IT-era-Module-7-Imaging-and-Design-for-Social-Impact.pptx
Living-in-IT-era-Module-7-Imaging-and-Design-for-Social-Impact.pptx
 
Output determination SAP S4 HANA SAP SD CC
Output determination SAP S4 HANA SAP SD CCOutput determination SAP S4 HANA SAP SD CC
Output determination SAP S4 HANA SAP SD CC
 
原版仿制(uob毕业证书)英国伯明翰大学毕业证本科学历证书原版一模一样
原版仿制(uob毕业证书)英国伯明翰大学毕业证本科学历证书原版一模一样原版仿制(uob毕业证书)英国伯明翰大学毕业证本科学历证书原版一模一样
原版仿制(uob毕业证书)英国伯明翰大学毕业证本科学历证书原版一模一样
 
BASIC C++ lecture NOTE C++ lecture 3.pptx
BASIC C++ lecture NOTE C++ lecture 3.pptxBASIC C++ lecture NOTE C++ lecture 3.pptx
BASIC C++ lecture NOTE C++ lecture 3.pptx
 
The+Prospects+of+E-Commerce+in+China.pptx
The+Prospects+of+E-Commerce+in+China.pptxThe+Prospects+of+E-Commerce+in+China.pptx
The+Prospects+of+E-Commerce+in+China.pptx
 
guildmasters guide to ravnica Dungeons & Dragons 5...
guildmasters guide to ravnica Dungeons & Dragons 5...guildmasters guide to ravnica Dungeons & Dragons 5...
guildmasters guide to ravnica Dungeons & Dragons 5...
 
急速办(bedfordhire毕业证书)英国贝德福特大学毕业证成绩单原版一模一样
急速办(bedfordhire毕业证书)英国贝德福特大学毕业证成绩单原版一模一样急速办(bedfordhire毕业证书)英国贝德福特大学毕业证成绩单原版一模一样
急速办(bedfordhire毕业证书)英国贝德福特大学毕业证成绩单原版一模一样
 
ER(Entity Relationship) Diagram for online shopping - TAE
ER(Entity Relationship) Diagram for online shopping - TAEER(Entity Relationship) Diagram for online shopping - TAE
ER(Entity Relationship) Diagram for online shopping - TAE
 
Latest trends in computer networking.pptx
Latest trends in computer networking.pptxLatest trends in computer networking.pptx
Latest trends in computer networking.pptx
 
1.Wireless Communication System_Wireless communication is a broad term that i...
1.Wireless Communication System_Wireless communication is a broad term that i...1.Wireless Communication System_Wireless communication is a broad term that i...
1.Wireless Communication System_Wireless communication is a broad term that i...
 

Resilient Predictive Data Pipelines (QCon London 2016)

  • 1. Resilient Predictive Data Pipelines Sid Anand (@r39132) QCon London 2016 1
  • 2. About Me 2 Work [ed | s] @ Maintainer on Report to Co-Chair for airbnb/airflow
  • 3. Motivation Why is a Data Pipeline talk in this High Availability Track? 3
  • 4. Different Types of Data Pipelines 4 ETL • used for : loading data related to business health into a Data Warehouse • user-engagement stats (e.g. social networking) • product success stats (e.g. e-commerce) • audience : Business, BizOps • downtime? : 1-3 days Predictive • used for : •building recommendation products (e.g. social networking, shopping) •updating fraud prevention endpoints (e.g. security, payments, e-commerce) • audience : Customers • downtime? : < 1 hour VS
  • 5. 5 ETL • used for Data Warehouse —> Reports • user-engagement stats ( social networking • product success stats ( e-commerce • audience • downtime? Predictive • used for : •building recommendation products (e.g. social networking, shopping) •updating fraud prevention endpoints (e.g. security, payments, e-commerce) • audience : Customers • downtime? : < 1 hour VS Different Types of Data Pipelines
  • 6. Why Do We Care About Resilience? 6 d4d5d6d7d8 DB Search Engines Recommenders d1 d2 d3
  • 7. DB Why Do We Care About Resilience? 7 d4 d6d7d8 Search Engines Recommenders d1 d2 d3 d5
  • 8. Why Do We Care About Resilience? 8 d7d8 DB Search Engines Recommenders d1 d2 d3 d4 d5 d6
  • 9. Why Do We Care About Resilience? 9 d7d8 DB Search Engines Recommenders d1 d2 d3 d4 d5 d6
  • 10. Any Take-aways? 10 d7d8 DB Search Engines Recommenders d1 d2 d3 d4 d5 d6 • Bugs happen! • Bugs in Predictive Data Pipelines have a large blast radius • The bugs can affect customers and a company’s profits & reputation!
  • 11. Design Goals Desirable Qualities of a Resilient Data Pipeline 11
  • 12. 12 • Scalable • Available • Instrumented, Monitored, & Alert-enabled • Quickly Recoverable Desirable Qualities of a Resilient Data Pipeline
  • 13. 13 • Scalable • Build your pipelines using [infinitely] scalable components • The scalability of your system is determined by its least-scalable component • Available • Instrumented, Monitored, & Alert-enabled • Quickly Recoverable Desirable Qualities of a Resilient Data Pipeline
  • 14. Desirable Qualities of a Resilient Data Pipeline 14 • Scalable • Build your pipelines using [infinitely] scalable components • The scalability of your system is determined by its least-scalable component • Available • Ditto • Instrumented, Monitored, & Alert-enabled • Quickly Recoverable
  • 15. Instrumented 15 Instrumentation must reveal SLA metrics at each stage of the pipeline! What SLA metrics do we care about? Correctness & Timeliness • Correctness • No Data Loss • No Data Corruption • No Data Duplication • A Defined Acceptable Staleness of Intermediate Data • Timeliness • A late result == a useless result • Delayed processing of now()’s data may delay the processing of future data
  • 16. Instrumented, Monitored, & Alert- enabled 16 • Instrument • Instrument Correctness & Timeliness SLA metrics at each stage of the pipeline • Monitor • Continuously monitor that SLA metrics fall within acceptable bounds (i.e. pre-defined SLAs) • Alert • Alert when we miss SLAs
  • 17. Desirable Qualities of a Resilient Data Pipeline 17 • Scalable • Build your pipelines using [infinitely] scalable components • The scalability of your system is determined by its least-scalable component • Available • Ditto • Instrumented, Monitored, & Alert-enabled • Quickly Recoverable
  • 18. Quickly Recoverable 18 • Bugs happen! • Bugs in Predictive Data Pipelines have a large blast radius • Optimize for MTTR
  • 19. Implementation Using AWS to meet Design Goals 19
  • 21. SQS - Overview 21 AWS’s low-latency, highly scalable, highly available message queue Infinitely Scalable Queue (though not FIFO) Low End-to-end latency (generally sub-second) Pull-based How it Works! Producers publish messages, which can be batched, to an SQS queue Consumers consume messages, which can be batched, from the queue commit message contents to a data store ACK the messages as a batch
  • 22. visibility timer SQS - Typical Operation Flow 22 Producer Producer Producer m1m2m3m4m5 Consumer Consumer Consumer DB m1 SQS Step 1: A consumer reads a message from SQS. This starts a visibility timer!
  • 23. visibility timer SQS - Typical Operation Flow 23 Producer Producer Producer m1m2m3m4m5 Consumer Consumer Consumer DB m1 SQS Step 2: Consumer persists message contents to DB
  • 24. visibility timer SQS - Typical Operation Flow 24 Producer Producer Producer m1m2m3m4m5 Consumer Consumer Consumer DB m1 SQS Step 3: Consumer ACKs message in SQS
  • 25. visibility timer SQS - Time Out Example 25 Producer Producer Producer m1m2m3m4m5 Consumer Consumer Consumer DB m1 SQS Step 1: A consumer reads a message from SQS
  • 26. visibility timer SQS - Time Out Example 26 Producer Producer Producer m1m2m3m4m5 Consumer Consumer Consumer DB m1 SQS Step 2: Consumer attempts persists message contents to DB
  • 27. visibility time out SQS - Time Out Example 27 Producer Producer Producer m1m2m3m4m5 Consumer Consumer Consumer DB m1 SQS Step 3: A Visibility Timeout occurs & the message becomes visible again.
  • 28. visibility timer SQS - Time Out Example 28 Producer Producer Producer m1m2m3m4m5 Consumer Consumer Consumer DB m1 m1 SQS Step 4: Another consumer reads and persists the same message
  • 29. visibility timer SQS - Time Out Example 29 Producer Producer Producer m1m2m3m4m5 Consumer Consumer Consumer DB m1 SQS Step 5: Consumer ACKs message in SQS
  • 30. SQS - Dead Letter Queue 30 SQS - DLQ visibility timer Producer Producer Producer m2m3m4m5 Consumer Consumer Consumer DB m1 SQS Redrive rule : 2x m1
  • 32. SNS - Overview 32 Highly Scalable, Highly Available, Push-based Topic Service Whereas SQS ensures each message is seen by at least 1 consumer SNS ensures that each message is seen by every consumer Reliable Multi-Push Whereas SQS is pull-based, SNS is push-based There is no message retention & there is a finite retry count No Reliable Message Delivery Can we work around this limitation?
  • 33. SNS + SQS Design Pattern 33 m1m2 m1m2 m1m2 SQS Q1 SQS Q2 SNS T1 Reliable Multi Push Reliable Message Delivery
  • 35. S3 + SNS + SQS Design Pattern 35 m1m2 m1m2 m1m2 SQS Q1 SQS Q2 SNS T1 Reliable Multi Push Transactions S3 d1 d2
  • 36. Batch Pipeline Architecture Putting the Pieces Together 36
  • 38. Architectural Elements 38 A Schema-aware Data format for all data (Avro) The entire pipeline is built from Highly-Available/Highly-Scalable components S3, SNS, SQS, ASG, EMR Spark (exception DB) The pipeline is never blocked because we use a DLQ for messages we cannot process We use queue-based auto-scaling to get high on-demand ingest rates We manage everything with Airflow Every stage in the pipeline is idempotent Every stage in the pipeline is instrumented
  • 40. ASG - Overview 40 What is it? A means to automatically scale out/in clusters to handle variable load/traffic A means to keep a cluster/service always up Fulfills AWS’s pay-per-use promise! When to use it? Feed-processing, web traffic load balancing, zone outage, etc…
  • 41. ASG - Data Pipeline 41 importer importer importer importer Importer ASG scaleout/in SQS DB
  • 42. ASG : CPU-based 42 Sent CPU ACKd/Recvd CPU-based auto-scaling is good at scaling in/out to keep the average CPU constant
  • 43. ASG : CPU-based 43 Sent CPU Recv Premature Scale-in Premature Scale-in: The CPU drops to noise-levels before all messages are consumed. This causes scale in to occur while the last few messages are still being committed resulting in a long time- to-drain for the queue!
  • 44. ASG - Queue-based 44 Scale-out: When Visible Messages > 0 (a.k.a. when queue depth > 0) Scale-in: When Invisible Messages = 0 (a.k.a. when the last in-flight message is ACK’d) This causes the ASG to grow This causes the ASG to shrink
  • 46. Reliable Hourly Job Scheduling Workflow Automation & Scheduling 46
  • 47. 47 Historical Context Our first cut at the pipeline used cron to schedule hourly runs of Spark Problem We only knew if Spark succeeded. What if a downstream task failed? We needed something smarter than cron that Reliably managed a graph of tasks (DAG - Directed Acyclic Graph) Orchestrated hourly runs of that DAG Retried failed tasks Tracked the performance of each run and its tasks Reported on failure/success of runs Our Needs
  • 49. Airflow - DAG Dashboard 49 Airflow: It’s easy to manage multiple DAGs
  • 50. Airflow - Authoring DAGs 50 Airflow: Visualizing a DAG
  • 51. 51 Airflow: Author DAGs in Python! No need to bundle many config files! Airflow - Authoring DAGs
  • 52. Airflow - Performance Insights 52 Airflow: Gantt chart view reveals the slowest tasks for a run!
  • 53. 53 Airflow: …And we can easily see performance trends over time Airflow - Performance Insights
  • 54. Airflow - Alerting 54 Airflow: …And easy to integrate with Ops tools!
  • 56. 56 Airflow - Join the Community With >30 Companies, >100 Contributors , and >2500 Commits, Airflow is growing rapidly! We are looking for more contributors to help support the community! Disclaimer : I’m a maintainer on the project
  • 57. Design Goal Scorecard Are We Meeting Our Design Goals? 57
  • 58. 58 • Scalable • Available • Instrumented, Monitored, & Alert-enabled • Quickly Recoverable Desirable Qualities of a Resilient Data Pipeline
  • 59. 59 • Scalable • Build using scalable components from AWS • SQS, SNS, S3, ASG, EMR Spark • Exception = DB (WIP) • Available • Build using available components from AWS • Airflow for reliable job scheduling • Instrumented, Monitored, & Alert-enabled • Airflow • Quickly Recoverable • Airflow, DLQs, ASGs, Spark & DB Desirable Qualities of a Resilient Data Pipeline