16. Data Pipelines
BI (Common): Web Servers → OLTP DB (MySQL, Oracle, Cassandra) → ETL (batch) → Data Warehouse (Teradata, Redshift, BigQuery) → Reporting Tools, Query Browsers
Predictive (Focus of this talk): Data Source → ETL (batch or streaming: Spark, Flink, Beam, Storm) → OLTP DB or cache (MySQL, Oracle, Cassandra, Redis) → Web Servers → Ranking (Search, News Feed), Recommender Products, Fraud Detection / Prevention
18. Cloud Native Data Pipelines
Big Data Companies like LinkedIn, Facebook, Twitter, & Google
build custom, large scale data pipelines that run in their own
Data Centers
19. Cloud Native Data Pipelines
Most start-ups run in the public cloud. Can they leverage
aspects of the public cloud to build comparable pipelines?
20. Cloud Native Data Pipelines
Cloud Native Techniques + Open Source Technologies ~ Custom Data Pipeline Stacks seen in Big Data companies
23. Desirable Qualities of a Resilient Data Pipeline
Correctness
• Data Integrity (no loss, etc…)
• Expected data distributions
Timeliness
• All output within time-bound SLAs
Operability
• Fine-grained Monitoring & Alerting of Correctness & Timeliness SLAs
• Quick Recoverability
Cost
• Pay-as-you-go
24. Quickly Recoverable
• Bugs happen!
• Bugs in Predictive Data Pipelines have a large blast radius
• Optimize for MTTR
28. Use-Case : Message Scoring
[Diagram: enterprise A, B, C → S3]
An Avro file is uploaded to S3 every 15 minutes
29. Use-Case : Message Scoring
[Diagram: enterprise A, B, C → S3]
Airflow kicks off a Spark message scoring job every hour (EMR)
30. Use-Case : Message Scoring
[Diagram: enterprise A, B, C → S3 → S3]
Spark job writes scored messages and stats to another S3 bucket
31. Use-Case : Message Scoring
[Diagram: enterprise A, B, C → S3 → S3 → SNS → SQS]
This triggers SNS/SQS message events
32. Use-Case : Message Scoring
[Diagram: enterprise A, B, C → S3 → S3 → SNS → SQS → Importers (ASG)]
An Auto Scaling Group (ASG) of Importers spins up when it detects SQS messages
33. Use-Case : Message Scoring
[Diagram: enterprise A, B, C → S3 → S3 → SNS → SQS → Importers (ASG) → DB]
The importers rapidly ingest scored messages and aggregate statistics into the DB
34. Use-Case : Message Scoring
[Diagram: enterprise A, B, C → S3 → S3 → SNS → SQS → Importers (ASG) → DB]
Users receive alerts of untrusted emails & can review them in the web app
37. Tackling Cost
[Figure panels: Between Daily Runs; During Daily Runs]
When running daily, for 23 hours of a day, we didn’t
pay for instances in the ASG or EMR
38. Tackling Cost
[Figure panels: Between Hourly Runs; During Hourly Runs]
This does not help when runs are hourly since AWS charges at
an hourly rate for EC2 instances!
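The billing effect described above can be illustrated with a small calculation. This is a minimal sketch that assumes each started instance-hour is billed in full, as the slide states; the instance count and the 45-minute run length are hypothetical values chosen only for illustration.

```python
import math


def billed_instance_hours(run_minutes: int, runs_per_day: int, instances: int) -> int:
    """Instance-hours billed per day when each started hour is charged in full."""
    billed_hours_per_run = math.ceil(run_minutes / 60)
    return billed_hours_per_run * runs_per_day * instances


# Hypothetical workload: 10 instances, each run lasting 45 minutes.
INSTANCES = 10
RUN_MINUTES = 45

always_on = 24 * INSTANCES
daily = billed_instance_hours(RUN_MINUTES, runs_per_day=1, instances=INSTANCES)
hourly = billed_instance_hours(RUN_MINUTES, runs_per_day=24, instances=INSTANCES)

print(f"always on:   {always_on} instance-hours/day")
print(f"daily runs:  {daily} instance-hours/day")
print(f"hourly runs: {hourly} instance-hours/day")
```

Under these assumptions, hourly runs are billed the same as leaving the instances up all day, so tearing instances down between runs only pays off when runs are less frequent than the billing granularity.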
40. ASG - Overview
What is it?
A means to automatically scale out/in clusters to handle
variable load/traffic
A means to keep a cluster/service of a fixed size always up
41. ASG - Data Pipeline
[Diagram: SQS → Importer ASG (several importers, scaling out/in) → DB]
44. ASG : Queue-based
Scale-out: When Visible Messages > 0 (a.k.a. when queue depth > 0). This causes the ASG to grow.
Scale-in: When Invisible Messages = 0 (a.k.a. when the last in-flight message is ACK’d). This causes the ASG to shrink.
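The two queue-based rules can be sketched as a single decision function. This is an illustrative sketch, not the deck's implementation: the `Action` enum and `decide` function are hypothetical names, and giving scale-out precedence when both conditions hold is an assumption. In practice the two inputs would come from queue metrics such as the visible and not-visible (in-flight) message counts.

```python
from enum import Enum


class Action(Enum):
    SCALE_OUT = "scale_out"
    SCALE_IN = "scale_in"
    HOLD = "hold"


def decide(visible_messages: int, in_flight_messages: int) -> Action:
    """Apply the queue-based rules: grow on backlog, shrink once nothing is in flight."""
    if visible_messages > 0:
        # Queue depth > 0: there is work waiting, so grow the ASG.
        # Assumption: this rule wins if both conditions are true at once.
        return Action.SCALE_OUT
    if in_flight_messages == 0:
        # The last in-flight message has been ACK'd, so shrink the ASG.
        return Action.SCALE_IN
    # No backlog, but importers are still working on in-flight messages.
    return Action.HOLD


for visible, in_flight in [(5, 0), (0, 3), (0, 0)]:
    print(visible, in_flight, decide(visible, in_flight).value)
```

Waiting for in-flight messages to reach zero before scaling in avoids terminating an importer that is still processing a message.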
48. Tackling Operability : Requirements
• A simple way to author and manage workflows
• Provides visual insight into the state & performance of workflow runs
• Integrates with our alerting and monitoring tools
59. Use-Case : Message Scoring
[Diagram: enterprise A, B, C → Kinesis]
Kinesis batch put every second
60. Use-Case : Message Scoring
[Diagram: enterprise A, B, C → Kinesis → Scorers (ASG)]
An ASG of scorers is scaled up to one process per core per Kinesis shard
61. Use-Case : Message Scoring
[Diagram: enterprise A, B, C → Kinesis → Scorers (ASG) → Kinesis]
Scorers apply the trust model and send scored messages downstream
62. Use-Case : Message Scoring
[Diagram: enterprise A, B, C → Kinesis → Scorers (ASG) → Kinesis → Importers (ASG) → DB]
An ASG of importers is scaled up to rapidly import messages
63. Use-Case : Message Scoring
[Diagram: enterprise A, B, C → Kinesis → Scorers (ASG) → Kinesis → Importers (ASG) → DB; Importers → Kinesis → Alerters (ASG)]
Imported messages are also consumed by the alerter
64. Use-Case : Message Scoring
[Diagram: enterprise A, B, C → Kinesis → Scorers (ASG) → Kinesis → Importers (ASG) → DB; Importers → Kinesis → Alerters (ASG) → Quarantine Email]
67. What is Avro?
Avro is a self-describing serialization format that supports
• primitive data types : int, long, boolean, float, string, bytes, etc…
• complex data types : records, arrays, unions, maps, enums, etc…
• many language bindings : Java, Scala, Python, Ruby, etc…
68. What is Avro?
The most common format for storing structured Big Data at rest in
HDFS, S3, Google Cloud Storage, etc…
Supports Schema Evolution!
71. Avro Schema Example
{"namespace": "agari",
 "type": "record",
 "name": "User",
 "fields": [
     {"name": "name", "type": "string"},
     {"name": "favorite_number", "type": ["int", "null"]},
     {"name": "favorite_color", "type": ["string", "null"]}
 ]
}
complex type (record)
Schema name : User
72. Avro Schema Example
3 fields in the record: 1 required, 2 optional
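The required/optional split can be checked directly from the schema JSON. The sketch below uses only the Python standard library and treats a field as optional when its type is a union that includes "null", which is how the slide uses the term; the helper name `split_fields` is hypothetical.

```python
import json

SCHEMA_JSON = """
{"namespace": "agari",
 "type": "record",
 "name": "User",
 "fields": [
     {"name": "name", "type": "string"},
     {"name": "favorite_number", "type": ["int", "null"]},
     {"name": "favorite_color", "type": ["string", "null"]}
 ]
}
"""


def split_fields(schema: dict):
    """Return (required, optional) field names; optional means the union allows null."""
    required, optional = [], []
    for field in schema["fields"]:
        field_type = field["type"]
        nullable = isinstance(field_type, list) and "null" in field_type
        (optional if nullable else required).append(field["name"])
    return required, optional


required, optional = split_fields(json.loads(SCHEMA_JSON))
print("required:", required)
print("optional:", optional)
```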
73. Avro Schema Data File Example
[Diagram: a data file holding the User schema once, followed by 1,000,000,000 data records]
Schema: 0.0001 % of the file
Data: 99.999 % of the file
74. Avro Schema Streaming Example
[Diagram: a streamed message holding the User schema followed by one binary data block]
Schema: 99 % of the message
Data: 1 % of the message
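The contrast between the data-file and streaming cases comes down to how many records share one copy of the schema. The sketch below shows the trend only: the byte sizes are hypothetical and are not meant to reproduce the percentages on the slides.

```python
def schema_overhead(schema_bytes: int, record_bytes: int, records: int) -> float:
    """Fraction of the total payload taken up by a single embedded schema."""
    return schema_bytes / (schema_bytes + record_bytes * records)


# Hypothetical sizes: a 250-byte schema and 20-byte encoded records.
SCHEMA_BYTES = 250
RECORD_BYTES = 20

file_overhead = schema_overhead(SCHEMA_BYTES, RECORD_BYTES, records=1_000_000_000)
stream_overhead = schema_overhead(SCHEMA_BYTES, RECORD_BYTES, records=1)

print(f"data file (1e9 records): {file_overhead:.8%} schema")
print(f"streamed single record:  {stream_overhead:.1%} schema")
```

When every streamed message carries its own schema, most of the bytes on the wire are schema rather than data, which is the cost a schema registry is meant to remove.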
82. Innovation 1 : Avro Schema Registry
[Diagram: enterprise A, B, C → Kinesis → Scorers (ASG) → Kinesis → Importers (ASG) → DB; Importers → Kinesis → Alerters (ASG); a Schema Registry (SR) is attached to each of the three stages]
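The idea behind a schema registry can be sketched in a few lines: producers register a schema once and send only a short schema ID with each message, and consumers look the schema up by ID. This is an in-memory illustration, not the Lambda-based registry from the talk; the class name, the ID format, and the JSON-based canonicalization are all assumptions (Avro defines its own canonical form and fingerprints, which this sketch does not implement).

```python
import hashlib
import json


class InMemorySchemaRegistry:
    """Hypothetical registry mapping a short schema ID to a schema definition."""

    def __init__(self) -> None:
        self._schemas = {}

    def register(self, schema: dict) -> str:
        # Simplified canonical form: sorted keys, no whitespace.
        canonical = json.dumps(schema, sort_keys=True, separators=(",", ":"))
        schema_id = hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]
        self._schemas[schema_id] = canonical
        return schema_id

    def lookup(self, schema_id: str) -> dict:
        return json.loads(self._schemas[schema_id])


user_schema = {
    "namespace": "agari",
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "favorite_number", "type": ["int", "null"]},
        {"name": "favorite_color", "type": ["string", "null"]},
    ],
}

registry = InMemorySchemaRegistry()
schema_id = registry.register(user_schema)

# A message now carries a 16-character ID instead of the full schema text.
envelope = {"schema_id": schema_id, "payload": b"<binary data block>"}
print(envelope["schema_id"], registry.lookup(envelope["schema_id"])["name"])
```

Registering the same schema twice yields the same ID, so producers can register on startup without coordinating with each other.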
83. Innovation 2 : Repeatable Units
The architecture is composed of repeated patterns of:
• ASG-based compute consumer
• Kinesis transport streams (i.e. AWS’ managed “Kafka”)
• A Lambda-based Avro Schema Registry
[Diagram: one unit i, made of Compute i (ASG i), Kinesis i, and SR]
84. Innovation 2 : Repeatable Units
You can chain these repeatable units together to make arbitrary DAGs (Directed Acyclic Graphs)
Use HashiCorp’s Terraform to compose your DAG through automation
[Diagram: three units chained in sequence, each made of Compute i (ASG i), Kinesis i, and SR]
The example above is a simple linear DAG with 3 units
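Chaining units into a DAG can be sketched as follows. The talk composes the real infrastructure with Terraform; this Python model is only an illustration of the composition idea, and the `Unit` class, its `then` method, and `topological_order` are hypothetical names. The unit names follow the scorer, importer, and alerter stages from the use case.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Unit:
    """One repeatable unit: an ASG-based consumer plus its output stream."""
    name: str
    downstream: List["Unit"] = field(default_factory=list)

    def then(self, nxt: "Unit") -> "Unit":
        """Connect this unit's output stream to the next unit and return it."""
        self.downstream.append(nxt)
        return nxt


def topological_order(root: Unit) -> List[str]:
    """Return unit names in dependency order; raise if the graph has a cycle."""
    state: Dict[str, str] = {}
    finished: List[str] = []

    def visit(unit: Unit) -> None:
        if state.get(unit.name) == "done":
            return
        if state.get(unit.name) == "visiting":
            raise ValueError(f"cycle detected at {unit.name}")
        state[unit.name] = "visiting"
        for child in unit.downstream:
            visit(child)
        state[unit.name] = "done"
        finished.append(unit.name)

    visit(root)
    return finished[::-1]


# A simple linear DAG with 3 units.
scorer, importer, alerter = Unit("scorer"), Unit("importer"), Unit("alerter")
scorer.then(importer).then(alerter)
print(topological_order(scorer))
```

The cycle check matters because the units are meant to form a DAG: a stream that feeds back into an upstream unit would make the deployment order undefined.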
85. Innovation 3 : Reactive-Scaling (WIP)
Airflow Job Reactively Scales
[Diagram: enterprise A, B, C → Kinesis → Scorers (ASG) → Kinesis → Importers (ASG) → DB; Importers → Kinesis → Alerters (ASG); an SR is attached to each of the three stages]
86. Innovation 4 : Anomaly-based Rollback (WIP)
If the ADR (Anomaly-detector & Reverter) is triggered and a model build or code push was recently done to Compute 1, the ADR will revert the last code or model push to ASG Compute 1
[Diagram: Compute 1 (ASG), Kinesis, Compute 2 (ASG), SR, and the Anomaly-detector & Reverter]
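The rollback rule can be expressed as a small predicate. The feature is marked work-in-progress in the talk, so this is only a sketch of the stated rule: the `Deployment` record, the `should_revert` function, and the one-hour definition of "recently" are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Deployment:
    kind: str            # "code" or "model"
    version: str
    deployed_at: datetime


def should_revert(
    anomaly_detected: bool,
    last_push: Optional[Deployment],
    now: datetime,
    recent_window: timedelta = timedelta(hours=1),  # assumed meaning of "recently"
) -> bool:
    """Revert only if an anomaly fired and the last push falls inside the window."""
    if not anomaly_detected or last_push is None:
        return False
    return now - last_push.deployed_at <= recent_window


now = datetime(2016, 1, 1, 12, 0)
recent_model = Deployment("model", "v42", now - timedelta(minutes=20))
old_code = Deployment("code", "v7", now - timedelta(days=3))

print(should_revert(True, recent_model, now))
print(should_revert(True, old_code, now))
```

Tying the revert to a recent push keeps the tool from rolling back a long-stable deployment when the anomaly has some other cause, such as a change in upstream data.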
87. Open Source Plans
Follow us to be notified when the following are open-sourced
• Avro Schema Registry
• Agari (Kinesis+ASG) scaling tool (Airflow Job)
• Anomaly-detector & Reverter
To be notified, follow @AgariEng & @r39132
88. Acknowledgments
None of this work would be possible without the contributions of the strong team below
• Vidur Apparao
• Stephen Cattaneo
• Jon Chase
• Andrew Flury
• William Forrester
• Chris Haag
• Mike Jones
• Scot Kennedy
• Thede Loder
• Paul Lorence
• Kevin Mandich
• Gabriel Ortiz
• Jacob Rideout
• Josh Yang
• Julian Mehnle