Cloud Native Data Pipelines (in Eng & Japanese) - QCon Tokyo

Cloud Native Data Pipelines
Sid Anand
QCon Shanghai & Tokyo 2016
Japanese Translation: Kiro Harada (@haradakiro)

About Me
Work [ed | s] @
Committer & PPMC on Apache Airflow
Co-Chair for
Father of 2

Agari : What We Do
Agari’s Previous EP Version (Batch)
Enterprise Customers -> email metadata -> apply trust models -> email md + trust score

Agari : What We Do
Agari’s Current EP Version (Near-real time)
Enterprise Customers -> email metadata -> apply trust models -> email md + trust score -> Quarantine

Data Pipelines
BI vs Predictive

Data Pipelines (BI)
Web Servers
OLTP DB: MySQL, Oracle, Cassandra
ETL (batch)
Data Warehouse: Teradata, Redshift, BigQuery
Reporting Tools, Query Browsers

Data Pipelines (Predictive)
Data Source
ETL (batch or streaming): Spark, Flink, Beam, Storm
OLTP DB or cache: MySQL, Oracle, Cassandra, Redis
Web Servers
Data Products: Ranking (Search, News Feed), Recommender Products, Fraud Detection / Prevention

Data Pipelines: BI and Predictive
Common Focus of this talk
(The BI and Predictive pipeline diagrams shown side by side)

Motivation
Cloud Native Data Pipelines

Big Data Companies like LinkedIn, Facebook, Twitter, & Google
build custom, large scale data pipelines that run in their own
Data Centers

Most start-ups run in the public cloud. Can they leverage
aspects of the public cloud to build comparable pipelines?

Cloud Native Techniques + Open Source Technologies ~ Custom Data Pipeline Stacks seen in Big Data companies

Design Goals
Desirable Qualities of a Resilient Data Pipeline

Desirable Qualities of a Resilient Data Pipeline
Correctness
• Data Integrity (no loss, etc…)
• Expected data distributions
Timeliness
• All output within time-bound SLAs
Operability
• Fine-grained Monitoring & Alerting of Correctness & Timeliness SLAs
• Quick Recoverability
Cost
• Pay-as-you-go

Quickly Recoverable
• Bugs happen!
• Bugs in Predictive Data Pipelines have a large blast radius
• Optimize for MTTR (mean time to recovery)

Predictive Analytics @ Agari
Use Cases

Use Cases
• Apply trust models (message scoring): batch + near real time
• Build trust models (Enterprise Protect): batch

Use-Case : Message Scoring (batch)
Batch Pipeline Architecture

Use-Case : Message Scoring
Diagram components: enterprises A, B, C; S3 (input); S3 (output); SNS; SQS; Importers ASG; DB; WebApp; Airflow
• An Avro file is uploaded to S3 every 15 minutes
• Airflow kicks off a Spark message scoring job every hour (EMR)
• The Spark job writes scored messages and stats to another S3 bucket
• This triggers SNS/SQS message events
• An Auto Scaling Group (ASG) of Importers spins up when it detects SQS messages
• The importers rapidly ingest scored messages and aggregate statistics into the DB
• Users receive alerts of untrusted emails & can review them in the web app
• Airflow manages the entire process
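To make the S3 to SNS/SQS hand-off in the steps above more concrete, here is a minimal sketch of how an importer might extract the bucket and key of a newly written file from one SQS message body. The slides only say that the write triggers SNS/SQS messages; the exact wiring (an S3 event notification published to SNS and fanned out to SQS), the function name, and the bucket and key names are all assumptions made for illustration.

```python
import json
from urllib.parse import unquote_plus


def s3_objects_from_sqs_body(body):
    """Return (bucket, key) pairs named in one SQS message body."""
    envelope = json.loads(body)
    # With SNS in front of SQS (raw message delivery off), the S3 event
    # arrives as a JSON string under the "Message" key of the SNS envelope.
    event = json.loads(envelope["Message"]) if "Message" in envelope else envelope
    return [
        # Object keys in S3 event notifications are URL-encoded.
        (record["s3"]["bucket"]["name"], unquote_plus(record["s3"]["object"]["key"]))
        for record in event.get("Records", [])
    ]


# Hypothetical event for one scored-message file (all names are made up).
s3_event = {"Records": [{"s3": {"bucket": {"name": "scored-messages"},
                                "object": {"key": "enterprise-a/part+0001.avro"}}}]}
sqs_body = json.dumps({"Type": "Notification", "Message": json.dumps(s3_event)})
objects = s3_objects_from_sqs_body(sqs_body)
print(objects)
```

The same function also accepts a body where S3 notifies SQS directly, since it falls back to treating the body itself as the event.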

Tackling Cost & Timeliness
Leveraging the AWS Cloud

Tackling Cost
Figures: Between Daily Runs | During Daily Runs
When running daily, for 23 hours of a day, we didn’t pay for instances in the ASG or EMR

Tackling Cost
Figures: Between Hourly Runs | During Hourly Runs
When running daily, for 23 hours of a day, we didn’t pay for instances in the ASG or EMR
This does not help when runs are hourly since AWS charges at an hourly rate for EC2 instances!
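The cost argument above can be checked with a little arithmetic. The sketch below assumes the billing model the slide describes (every started instance-hour is charged in full); the run lengths and instance count are hypothetical numbers, not figures from the talk.

```python
import math


def billed_instance_hours(run_minutes, runs_per_day, instances):
    """Instance-hours billed per day when each started hour is charged in full."""
    hours_per_run = math.ceil(run_minutes / 60)
    return hours_per_run * runs_per_day * instances


# Hypothetical workload: 10 instances, one 45-minute run per day
# versus twenty-four 10-minute runs per day.
daily = billed_instance_hours(run_minutes=45, runs_per_day=1, instances=10)
hourly = billed_instance_hours(run_minutes=10, runs_per_day=24, instances=10)
always_on = 24 * 10  # the same 10 instances left running all day

print(daily, hourly, always_on)
```

Under these assumptions the hourly schedule is billed exactly like an always-on cluster, which is the "no savings" effect the slide points out.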

Tackling Timeliness
Auto Scaling Group (ASG)

ASG - Overview
What is it?
A means to automatically scale out/in clusters to handle variable load/traffic
A means to keep a cluster/service of a fixed size always up

ASG - Data Pipeline
SQS -> Importer ASG (importers, scale out/in) -> DB

ASG : CPU-based
Chart series: Sent, CPU, ACKd/Recvd
CPU-based auto-scaling is good at scaling in/out to keep the average CPU constant

ASG : CPU-based
Chart series: Sent, CPU, Recv (annotated: Premature Scale-in)
Premature Scale-in:
• The CPU drops to noise-levels before all messages are consumed
• This causes scale in to occur while the last few messages are still being committed

ASG : Queue-based
Scale-out: When Visible Messages > 0 (a.k.a. when queue depth > 0). This causes the ASG to grow
Scale-in: When Invisible Messages = 0 (a.k.a. when the last in-flight message is ACK’d). This causes the ASG to shrink
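The two queue-based rules above can be sketched as a small decision function. This is an illustration rather than Agari's implementation: the names `QueueStats` and `desired_action` are hypothetical, and giving scale-out priority when both conditions hold is an assumption. With SQS, the two counts would plausibly come from the `ApproximateNumberOfMessagesVisible` and `ApproximateNumberOfMessagesNotVisible` metrics.

```python
from dataclasses import dataclass


@dataclass
class QueueStats:
    visible: int    # messages waiting in the queue (queue depth)
    in_flight: int  # messages received but not yet ACK'd ("invisible")


def desired_action(stats):
    """Apply the queue-based rules: grow on backlog, shrink only when drained."""
    if stats.visible > 0:
        return "scale-out"
    if stats.in_flight == 0:
        return "scale-in"
    # Queue is empty but some messages are still being committed:
    # holding here avoids the premature scale-in seen with CPU-based rules.
    return "hold"


print(desired_action(QueueStats(visible=5, in_flight=2)))
print(desired_action(QueueStats(visible=0, in_flight=3)))
print(desired_action(QueueStats(visible=0, in_flight=0)))
```

The middle case is the one that matters: CPU may already be near idle, yet the group is kept up until the last in-flight message is ACK'd.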

ASG : Queue-based
Shoyu Koto Da!!!! (roughly, "So that's how it is!")

Data Pipeline
Timeliness Cost
• ASG
• EMR Spark
Daily
• ASG
• EMR Spark
Hourly ASG
• No Cost Savings

Tackling Operability & Correctness
Leveraging Tooling

Tackling Operability : Requirements
• A simple way to author and manage workflows
• Provides visual insight into the state & performance of workflow runs
• Integrates with our alerting and monitoring tools

Apache Airflow
Workflow Automation & Scheduling

Apache Airflow - Authoring DAGs
Airflow: Author DAGs in Python! No need to bundle many config files!
Apache Airflow - Authoring DAGs
Airflow: Visualizing a DAG

Apache Airflow - Managing DAGs
Airflow: It’s easy to manage multiple DAGs

Apache Airflow - Perf. Insights
Airflow: Gantt chart view reveals the slowest tasks for a run!

Apache Airflow - Perf. Insights
Airflow: Task Duration chart view shows task completion time trends!

Apache Airflow - Alerting
Airflow: …And easy to integrate with Ops tools!

Apache Airflow - Correctness

Data Pipeline
Timeliness Cost

Use-Case : Message Scoring (near-real time)
NRT Pipeline Architecture

Diagram components: enterprises A, B, C; Kinesis (K); Scorers ASG; Kinesis; Importers ASG; DB; Kinesis (K); Alerters ASG; Quarantine Email
• Kinesis batch put every second
• An ASG of scorers is scaled up to one process per core per Kinesis shard
• Scorers apply the trust model and send scored messages downstream
• An ASG of importers is scaled up to rapidly import messages
• Imported messages are also consumed by the alerter
• Quarantine Email

Innovations
NRT Pipeline Architecture

What is Avro?
Avro is a self-describing serialization format that supports
• primitive data types : int, long, boolean, float, string, bytes, etc…
• complex data types : records, arrays, unions, maps, enums, etc…
• many language bindings : Java, Scala, Python, Ruby, etc…
The most common format for storing structured Big Data at rest in HDFS, S3, Google Cloud Storage, etc…
Supports Schema Evolution!

Avro Schema Example
{"namespace": "agari",
 "type": "record",
 "name": "User",
 "fields": [
     {"name": "name", "type": "string"},
     {"name": "favorite_number", "type": ["int", "null"]},
     {"name": "favorite_color", "type": ["string", "null"]}
 ]
}
• complex type (record)
• Schema name : User
• 3 fields in the record: 1 required, 2 optional
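As a small illustration of the "1 required, 2 optional" reading of the User schema, the sketch below parses the schema with the standard `json` module and treats any field whose type is a union containing "null" as optional. The helper name `split_fields` is hypothetical, and this is only a reading of the schema JSON, not a use of an Avro library.

```python
import json

USER_SCHEMA = """
{"namespace": "agari",
 "type": "record",
 "name": "User",
 "fields": [
     {"name": "name", "type": "string"},
     {"name": "favorite_number", "type": ["int", "null"]},
     {"name": "favorite_color", "type": ["string", "null"]}
 ]
}
"""


def split_fields(schema):
    """Split record fields into (required, optional) lists of names."""
    required, optional = [], []
    for field in schema["fields"]:
        field_type = field["type"]
        # A union type is written as a JSON list; "null" in it makes the field nullable.
        if isinstance(field_type, list) and "null" in field_type:
            optional.append(field["name"])
        else:
            required.append(field["name"])
    return required, optional


required, optional = split_fields(json.loads(USER_SCHEMA))
print(required, optional)
```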

Avro Schema Data File Example
A data file carries the schema once, followed by the data records (x 1,000,000,000)
• Schema: 0.0001 % of the file
• Data: 99.999 % of the file

Avro Schema Streaming Example
A streamed message carries the schema plus a single binary data block
• Schema: 99 % of the message
• Data: 1 % of the message
OVERHEAD!!
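The contrast between the two cases comes down to how many records share one copy of the schema. A rough sketch, using hypothetical sizes (a 1,000-byte schema and 10-byte records) that are not taken from the talk:

```python
def schema_fraction(schema_bytes, record_bytes, records):
    """Fraction of the total payload taken up by the embedded schema."""
    total = schema_bytes + record_bytes * records
    return schema_bytes / total


# Hypothetical sizes: 1,000-byte schema, 10-byte records.
data_file = schema_fraction(schema_bytes=1000, record_bytes=10, records=1_000_000_000)
streamed_message = schema_fraction(schema_bytes=1000, record_bytes=10, records=1)

print(f"data file: {data_file:.8%} schema")
print(f"streamed message: {streamed_message:.0%} schema")
```

With a billion records the schema is a negligible share of the file, while a single streamed record is dominated by it, which motivates sending a schema identifier instead of the schema itself.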

Innovation 1 : Avro Schema Registry
Diagram components: Message Producer (P); Schema Registry (Lambda); Message Consumer (C)
• The Message Producer calls register_schema with its schema
• register_schema returns a UUID
• Message Producer sends UUID + Data
• The Message Consumer calls getSchemaById (UUID) and gets the schema back
• Message Consumers download & cache the schema, then decode the data
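The register, send, look up, and cache flow above can be sketched with an in-memory stand-in for the Lambda-based registry. Everything here is illustrative: the class names are hypothetical, `get_schema_by_id` is a Python-style spelling of the slide's getSchemaById, and returning the same UUID when a schema is registered twice is an assumption.

```python
import uuid


class InMemorySchemaRegistry:
    """Stand-in for the registry: maps schema text to a UUID and back."""

    def __init__(self):
        self._schema_by_id = {}
        self._id_by_schema = {}

    def register_schema(self, schema_json):
        # Assumption: registering an already-known schema returns its existing UUID.
        if schema_json not in self._id_by_schema:
            schema_id = str(uuid.uuid4())
            self._id_by_schema[schema_json] = schema_id
            self._schema_by_id[schema_id] = schema_json
        return self._id_by_schema[schema_json]

    def get_schema_by_id(self, schema_id):
        return self._schema_by_id[schema_id]


class CachingConsumer:
    """Consumer that downloads each schema once, then serves it from a cache."""

    def __init__(self, registry):
        self._registry = registry
        self._cache = {}
        self.registry_lookups = 0

    def schema_for(self, schema_id):
        if schema_id not in self._cache:
            self._cache[schema_id] = self._registry.get_schema_by_id(schema_id)
            self.registry_lookups += 1
        return self._cache[schema_id]


registry = InMemorySchemaRegistry()
schema_text = '{"type": "record", "name": "User", "fields": []}'
schema_id = registry.register_schema(schema_text)  # producer side

consumer = CachingConsumer(registry)
for _ in range(3):  # three messages, each carrying only the UUID plus data
    assert consumer.schema_for(schema_id) == schema_text
print(consumer.registry_lookups)
```

Only the first message triggers a registry call; later messages with the same UUID are decoded using the cached schema.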

Innovation 1 : Avro Schema Registry
Diagram components: enterprises A, B, C; Kinesis (K); Scorers ASG; Kinesis; Importers ASG; DB; Kinesis (K); Alerters ASG; alerter; Schema Registry (SR) x 3

Innovation 2 : Repeatable Units
The Architecture is composed of repeated patterns of :
• ASG-based compute consumer
• Kinesis transport streams (i.e. AWS’ managed “Kafka”)
• A Lambda-based Avro Schema Registry
Unit i: Compute i (ASG i), Kinesis i, SR

Innovation 2 : Repeatable Units
Diagram: three units chained, each made of Compute i (ASG i), Kinesis i, SR
You can chain these repeatable units together to make arbitrary DAGs (Directed Acyclic Graphs)
Use HashiCorp’s Terraform to compose your DAG through automation
The example above is a simple Linear DAG with 3 units
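To illustrate what chaining repeatable units means, here is a plain-Python sketch of a linear DAG with three units, where each unit's output stream is the next unit's input stream. The talk composes these units with Terraform; the `Unit` type, the `chain` helper, and the stream names below are hypothetical and only model the wiring.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Unit:
    compute: str                          # name of the ASG-based consumer
    input_stream: str                     # stream the ASG reads from
    output_stream: Optional[str] = None   # stream it writes to; None for the last unit


def chain(compute_names: List[str]) -> List[Unit]:
    """Wire units into a linear DAG: unit i writes to the stream unit i+1 reads."""
    units = []
    for i, name in enumerate(compute_names):
        is_last = i + 1 == len(compute_names)
        units.append(Unit(compute=name,
                          input_stream=f"stream-{i}",
                          output_stream=None if is_last else f"stream-{i + 1}"))
    return units


pipeline = chain(["scorers", "importers", "alerters"])
for unit in pipeline:
    print(unit)
```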

Innovation 3 : Reactive-Scaling (WIP)
Airflow Job Reactively Scales
Diagram components: enterprises A, B, C; Kinesis (K); Scorers ASG; Kinesis; Importers ASG; DB; Kinesis (K); Alerters ASG; Schema Registry (SR) x 3

Innovation 4 : Anomaly-based Rollback (WIP)
Diagram components: ASG Compute 1; Kinesis; ASG Compute 2; SR; Anomaly-detector & Reverter (ADR)
If the ADR is triggered and a model build or code push was recently done to Compute 1, ADR will revert the last code or model push to ASG Compute 1
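The rollback rule above can be sketched as a small decision function. Since the feature is described as work in progress, this is only a guess at its shape: the `Push` type, the 60-minute meaning of "recently", and the "alert-only" fallback are all assumptions, not details from the talk.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Push:
    kind: str           # "code" or "model"
    version: str
    age_minutes: float  # minutes since the push reached the ASG


def adr_decision(anomaly_detected: bool, last_push: Optional[Push],
                 recent_window_minutes: float = 60.0) -> str:
    """Decide what the Anomaly-detector & Reverter should do."""
    if not anomaly_detected:
        return "no-op"
    if last_push is not None and last_push.age_minutes <= recent_window_minutes:
        return f"revert {last_push.kind} push {last_push.version}"
    # Assumption: with no recent push to blame, only raise an alert.
    return "alert-only"


print(adr_decision(True, Push(kind="model", version="v42", age_minutes=10)))
print(adr_decision(True, Push(kind="code", version="v7", age_minutes=600)))
print(adr_decision(False, None))
```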

Open Source Plans
Follow @AgariEng & @r39132 to be notified when the following are open-sourced
• Avro Schema Registry
• Agari (Kinesis+ASG) scaling tool (Airflow Job)
• Anomaly-detector & Reverter

Acknowledgments
None of this work would be possible without the contributions of the strong team below
• Vidur Apparao
• Stephen Cattaneo
• Jon Chase
• Andrew Flury
• William Forrester
• Chris Haag
• Mike Jones
• Scot Kennedy
• Thede Loder
• Paul Lorence
• Kevin Mandich
• Gabriel Ortiz
• Jacob Rideout
• Josh Yang
• Julian Mehnle
