Slides from "Cloud Native Data Pipelines" talk given @ QCon Tokyo 2016. The slides are in both English and Japanese. Thanks to Kiro Harada (https://jp.linkedin.com/in/haradakiro) for the translation.
31. Data Pipelines : BI (Common) vs. Predictive (Focus of this talk)
BI pipeline components:
• Web Servers
• OLTP DB : MySQL, Oracle, Cassandra
• ETL (batch)
• Data Warehouse : Teradata, Redshift, BigQuery
• Reporting Tools, Query Browsers
Predictive pipeline components:
• Data Source
• ETL (batch or streaming) : Spark, Flink, Beam, Storm
• OLTP DB or cache : MySQL, Oracle, Cassandra, Redis
• Web Servers
Predictive use cases : Ranking (Search, News Feed), Recommender Products, Fraud Detection / Prevention
34. Cloud Native Data Pipelines
Big Data Companies like LinkedIn, Facebook, Twitter, & Google
build custom, large scale data pipelines that run in their own
Data Centers
Most start-ups run in the public cloud. Can they leverage
aspects of the public cloud to build comparable pipelines?
44. Desirable Qualities of a Resilient Data Pipeline
Correctness
• Data Integrity (no loss, etc.)
• Expected data distributions
Timeliness
• All output within time-bound SLAs
Operability
• Fine-grained Monitoring & Alerting of Correctness & Timeliness SLAs
• Quick Recoverability
Cost
• Pay-as-you-go
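The correctness and timeliness qualities above can be turned into automated checks. The following is a minimal sketch, not the pipeline's actual monitoring code: the `RunReport` type, its fields, and the `check_run` function are hypothetical names introduced only for illustration, and "data integrity" is reduced here to a simple record-count comparison.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass
class RunReport:
    """Hypothetical summary of one pipeline run."""
    records_in: int
    records_out: int
    started_at: datetime
    finished_at: datetime


def check_run(report: RunReport, sla: timedelta) -> List[str]:
    """Return the violated qualities; an empty list means the run looks healthy."""
    violations = []
    # Correctness: data integrity means no records were lost along the way.
    if report.records_out != report.records_in:
        violations.append("correctness: record count mismatch")
    # Timeliness: all output must land within the time-bound SLA.
    if report.finished_at - report.started_at > sla:
        violations.append("timeliness: SLA exceeded")
    return violations


if __name__ == "__main__":
    start = datetime(2016, 1, 1, 0, 0)
    report = RunReport(100, 99, start, start + timedelta(hours=2))
    for violation in check_run(report, sla=timedelta(hours=1)):
        print(violation)
```

Feeding the returned list into an alerting tool is one way to cover the operability quality as well.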
62. Use-Case : Message Scoring
[Diagram components: enterprise A, enterprise B, enterprise C, S3 (shown twice), SNS, SQS, Importers ASG]
An Auto Scaling Group (ASG) of Importers spins up when it detects SQS messages
64. Use-Case : Message Scoring
[Diagram components: enterprise A, enterprise B, enterprise C, S3 (shown twice), SNS, SQS, Importers ASG, DB]
The importers rapidly ingest scored messages and aggregate statistics into the DB
66. Use-Case : Message Scoring
[Diagram components: enterprise A, enterprise B, enterprise C, S3 (shown twice), SNS, SQS, Importers ASG, DB]
Users receive alerts of untrusted emails & can review them in the web app
74. Tackling Cost
[Figure panels: Between Hourly Runs / During Hourly Runs]
When running daily, we didn’t pay for instances in the ASG or EMR for 23 hours of each day
This does not help when runs are hourly, since AWS charges for EC2 instances at an hourly rate!
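The arithmetic behind this point can be sketched as follows, assuming each run is billed in whole instance-hours (as the slide describes) and instances are shut down between runs. The run durations and the instance count are made-up illustrative numbers, and `billed_instance_hours` is a hypothetical helper.

```python
import math


def billed_instance_hours(runs_per_day: int, run_minutes: float, instances: int) -> int:
    """Instance-hours billed per day when every run is rounded up to whole hours."""
    hours_per_run = math.ceil(run_minutes / 60)
    return runs_per_day * hours_per_run * instances


INSTANCES = 10  # assumed cluster size

# One 45-minute run per day: only 1 of 24 hours is billed per instance.
daily = billed_instance_hours(runs_per_day=1, run_minutes=45, instances=INSTANCES)

# Twenty-four 10-minute runs per day: every hour is billed per instance.
hourly = billed_instance_hours(runs_per_day=24, run_minutes=10, instances=INSTANCES)

# Leaving the cluster up all day costs the same as the hourly schedule.
always_on = 24 * INSTANCES

print(daily, hourly, always_on)  # prints "10 240 240"
```

Under hourly billing, shutting the cluster down between hourly runs saves nothing compared with leaving it running.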
78. ASG - Overview
What is it?
A means to automatically scale out/in clusters to handle
variable load/traffic
A means to keep a cluster/service of a fixed size always up
86. ASG : Queue-based
Scale-out: When Visible Messages > 0 (a.k.a. when queue depth > 0). This causes the ASG to grow
Scale-in: When Invisible Messages = 0 (a.k.a. when the last in-flight message is ACK’d). This causes the ASG to shrink
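The scale-out and scale-in rules above can be written as a small decision function. This is a sketch of the logic only, not the CloudWatch alarm configuration used in the talk. The function name and the "hold" state are assumptions, as is the choice to let scale-out win when both conditions hold at once (new messages visible, none in flight yet).

```python
def scaling_action(visible_messages: int, in_flight_messages: int) -> str:
    """Decide what the ASG should do based on SQS queue counters.

    visible_messages:   messages waiting in the queue (queue depth); SQS reports
                        this as ApproximateNumberOfMessagesVisible.
    in_flight_messages: messages received but not yet ACK'd (invisible); SQS
                        reports this as ApproximateNumberOfMessagesNotVisible.
    """
    # Scale-out rule: any visible message means there is work to do.
    if visible_messages > 0:
        return "scale-out"
    # Scale-in rule: queue is empty and the last in-flight message was ACK'd.
    if in_flight_messages == 0:
        return "scale-in"
    # Queue is empty but workers are still processing: leave the ASG alone.
    return "hold"


if __name__ == "__main__":
    for visible, in_flight in [(5, 0), (0, 3), (0, 0)]:
        print(visible, in_flight, scaling_action(visible, in_flight))
```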
94. Tackling Operability : Requirements
• A simple way to author and manage workflows
• Provides visual insight into the state & performance of workflow runs
• Integrates with our alerting and monitoring tools
118. Use-Case : Message Scoring
[Diagram components: enterprise A, enterprise B, enterprise C, K (Kinesis), Scorers ASG]
An ASG of scorers is scaled up to one process per core per Kinesis shard
120. Use-Case : Message Scoring
[Diagram components: enterprise A, enterprise B, enterprise C, K (Kinesis), Scorers ASG, Kinesis]
Scorers apply the trust model and send scored messages downstream
122. Use-Case : Message Scoring
[Diagram components: enterprise A, enterprise B, enterprise C, K (Kinesis), Scorers ASG, Kinesis, Importers ASG, DB]
An ASG of importers is scaled up to rapidly import messages
124. Use-Case : Message Scoring
[Diagram components: enterprise A, enterprise B, enterprise C, K (Kinesis), Scorers ASG, Kinesis, Importers ASG, DB, K (Kinesis), Alerters ASG]
Imported messages are also consumed by the alerter
126. Use-Case : Message Scoring
[Diagram components: enterprise A, enterprise B, enterprise C, K (Kinesis), Scorers ASG, Kinesis, Importers ASG, DB, K (Kinesis), Alerters ASG, Quarantine Email]
Imported messages are also consumed by the alerter
132. What is Avro?
Avro is a self-describing serialization format that supports
• primitive data types : int, long, boolean, float, string, bytes, etc.
• complex data types : records, arrays, unions, maps, enums, etc.
• many language bindings : Java, Scala, Python, Ruby, etc.
The most common format for storing structured Big Data at rest in
HDFS, S3, Google Cloud Storage, etc…
Supports Schema Evolution!
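Schema evolution means a reader with a newer schema can still consume records written with an older one. The sketch below imitates two of Avro's schema resolution rules in plain Python, without using the Avro library: a field the writer did not know about takes the reader's default, and a field the reader no longer declares is ignored. The `resolve` function, the schema versions, and the `favorite_food` and `legacy_id` fields are hypothetical.

```python
def resolve(record: dict, reader_schema: dict) -> dict:
    """Project a decoded record onto the reader's schema."""
    resolved = {}
    for field in reader_schema["fields"]:
        name = field["name"]
        if name in record:
            # Field known to both writer and reader: keep the written value.
            resolved[name] = record[name]
        elif "default" in field:
            # Field added after the record was written: use the reader's default.
            resolved[name] = field["default"]
        else:
            raise ValueError(f"no value or default for field {name!r}")
    # Fields present in the record but absent from the reader schema are dropped.
    return resolved


# Reader schema (v2): adds "favorite_food" with a default, drops "legacy_id".
READER_SCHEMA_V2 = {
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "favorite_food", "type": ["null", "string"], "default": None},
    ],
}

# A record that was written with the older (v1) schema.
old_record = {"name": "Ann", "legacy_id": 42}

print(resolve(old_record, READER_SCHEMA_V2))
```

A real deployment would rely on an Avro library for this resolution; the sketch only shows why adding a field with a default is a backward-compatible change.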
144. Avro Schema Data File Example
Schema:
{"namespace": "agari",
 "type": "record",
 "name": "User",
 "fields": [
   {"name": "name", "type": "string"},
   {"name": "favorite_number", "type": ["int", "null"]},
   {"name": "favorite_color", "type": ["string", "null"]}
 ]
}
Data: x 1,000,000,000 records
Share of the file: Schema 0.0001 %, Data 99.999 %
146. Avro Schema Streaming Example
Schema:
{"namespace": "agari",
 "type": "record",
 "name": "User",
 "fields": [
   {"name": "name", "type": "string"},
   {"name": "favorite_number", "type": ["int", "null"]},
   {"name": "favorite_color", "type": ["string", "null"]}
 ]
}
Data: a single binary data block
Share of the message: Schema 99 %, Data 1 %
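The contrast between the data-file and streaming cases comes down to how many records share one copy of the schema. A rough sketch of that arithmetic follows. The 20-byte record size is an assumption chosen for illustration, so the printed percentages will not match the slide's figures exactly.

```python
import json

# The "User" schema from the slides.
USER_SCHEMA = {
    "namespace": "agari",
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "favorite_number", "type": ["int", "null"]},
        {"name": "favorite_color", "type": ["string", "null"]},
    ],
}


def schema_share(schema_bytes: int, record_bytes: int, records: int) -> float:
    """Fraction of the total payload taken up by the schema."""
    return schema_bytes / (schema_bytes + record_bytes * records)


SCHEMA_BYTES = len(json.dumps(USER_SCHEMA).encode("utf-8"))
RECORD_BYTES = 20  # assumed average size of one binary-encoded record

# Data file: one schema header amortized over a billion records.
file_share = schema_share(SCHEMA_BYTES, RECORD_BYTES, 1_000_000_000)

# Streaming: the schema travels with every single record.
stream_share = schema_share(SCHEMA_BYTES, RECORD_BYTES, 1)

print(f"data file : schema is {file_share:.8%} of the payload")
print(f"streaming : schema is {stream_share:.0%} of the payload")
```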
162. Innovation 1 : Avro Schema Registry
[Diagram components: enterprise A, enterprise B, enterprise C, K (Kinesis), Scorers ASG, Kinesis, Importers ASG, DB, K (Kinesis), Alerters ASG, SR (Schema Registry, shown three times)]
Imported messages are also consumed by the alerter
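A schema registry lets each streamed message carry a short schema identifier instead of the full schema, which consumers resolve with a lookup. The following is a minimal in-memory sketch of that idea, not the Lambda-based registry from the talk. The `SchemaRegistry` class, its methods, and the ID scheme (a truncated SHA-256 of the sorted-key JSON, which is not Avro's own canonical form or fingerprint) are all assumptions.

```python
import hashlib
import json


class SchemaRegistry:
    """Hypothetical in-memory registry mapping schema IDs to schemas."""

    def __init__(self) -> None:
        self._schemas = {}

    def register(self, schema: dict) -> str:
        """Store a schema and return a stable, content-derived ID."""
        canonical = json.dumps(schema, sort_keys=True, separators=(",", ":"))
        schema_id = hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]
        self._schemas[schema_id] = schema
        return schema_id

    def lookup(self, schema_id: str) -> dict:
        """Return the schema for an ID; raises KeyError if it is unknown."""
        return self._schemas[schema_id]


if __name__ == "__main__":
    registry = SchemaRegistry()
    schema = {"type": "record", "name": "User",
              "fields": [{"name": "name", "type": "string"}]}

    # Producer side: register once, then ship only the ID with each payload.
    schema_id = registry.register(schema)
    message = {"schema_id": schema_id, "payload": b"\x06Ann"}

    # Consumer side: fetch the schema needed to decode the payload.
    print(registry.lookup(message["schema_id"])["name"])
```

Deriving the ID from the schema content makes registration idempotent: registering the same schema twice yields the same ID.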
164. Innovation 2 : Repeatable Units
The Architecture is composed of repeated patterns of :
• ASG-based compute consumer
• Kinesis transport streams (i.e. AWS’ managed “Kafka”)
• A Lambda-based Avro Schema Registry
[Diagram: one unit i = Compute i (ASG i) + Kinesis i + SR]
165. Innovation 2 : Repeatable Units
You can chain these repeatable units together to make arbitrary DAGs (Directed Acyclic Graphs)
Use HashiCorp’s Terraform to compose your DAG through automation
The example below is a simple Linear DAG with 3 units
[Diagram: three chained units, each = Compute i (ASG i) + Kinesis i + SR]
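The chaining of repeatable units can be illustrated in a few lines of Python. The slides use Terraform for the real composition; this sketch only models the shape of the resulting graph. The `Unit` class, the `chain` function, the stream-naming convention, and the unit names are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Unit:
    """One repeatable unit: ASG-based compute plus its output Kinesis stream."""
    name: str

    @property
    def output_stream(self) -> str:
        # Assumed naming convention, for illustration only.
        return f"{self.name}-stream"


def chain(units: List[Unit]) -> Dict[str, List[str]]:
    """Build a linear DAG as an adjacency list: each unit feeds the next one."""
    dag = {unit.name: [] for unit in units}
    for upstream, downstream in zip(units, units[1:]):
        dag[upstream.name].append(downstream.name)
    return dag


if __name__ == "__main__":
    pipeline = chain([Unit("scorers"), Unit("importers"), Unit("alerters")])
    for name, consumers in pipeline.items():
        print(name, "->", consumers)
```

A non-linear DAG would follow the same pattern, with a unit listing more than one downstream consumer.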
167. Innovation 3 : Reactive-Scaling (WIP)
Airflow Job Reactively Scales
[Diagram components: enterprise A, enterprise B, enterprise C, K (Kinesis), Scorers ASG, Kinesis, Importers ASG, DB, K (Kinesis), Alerters ASG, SR (shown three times)]
169. Innovation 4 : Anomaly-based Rollback (WIP)
[Diagram components: Compute 1 ASG, Kinesis, Compute 2 ASG, SR, Anomaly-detector & Reverter (ADR)]
If the ADR is triggered and a model build or code push was recently done to Compute 1, the ADR will revert the last code or model push to the Compute 1 ASG
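The rollback rule described for the Anomaly-detector & Reverter can be sketched as a single predicate. This is an illustration of the stated rule, not the (work-in-progress) implementation. The function name, its parameters, and the notion of a configurable "recent" window are assumptions, since the slides do not say how recent a push must be.

```python
from datetime import datetime, timedelta
from typing import Optional


def should_revert(anomaly_detected: bool,
                  last_push_at: Optional[datetime],
                  now: datetime,
                  recent_window: timedelta) -> bool:
    """Revert only if an anomaly fired and a code or model push happened recently."""
    # No anomaly, or nothing was ever pushed: there is nothing to roll back.
    if not anomaly_detected or last_push_at is None:
        return False
    # An anomaly shortly after a push points at that push as the likely cause.
    return now - last_push_at <= recent_window


if __name__ == "__main__":
    now = datetime(2016, 4, 1, 12, 0)
    pushed = now - timedelta(minutes=20)
    print(should_revert(True, pushed, now, recent_window=timedelta(hours=1)))
```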
171. Open Source Plans
Follow @AgariEng & @r39132 to be notified when the following are open-sourced
• Avro Schema Registry
• Agari (Kinesis+ASG) scaling tool (Airflow Job)
• Anomaly-detector & Reverter
173. Acknowledgments
None of this work would be possible without the contributions of the strong team below
• Vidur Apparao
• Stephen Cattaneo
• Jon Chase
• Andrew Flury
• William Forrester
• Chris Haag
• Mike Jones
• Scot Kennedy
• Thede Loder
• Paul Lorence
• Kevin Mandich
• Gabriel Ortiz
• Jacob Rideout
• Josh Yang
• Julian Mehnle