Designing Data Architectures for Robust Decision Making
Gwen Shapira / Software Engineer
About Me
• 15 years of moving data around
• Formerly consultant
• Now Cloudera Engineer:
– Sqoop Committer
– Kafka
– Flume
• @gwenshap
There’s a book on that!
About you:
You know Hadoop
“Big Data” is stuck at The Lab.
We want to move to The Factory
What does it mean to “Systemize”?
• Ability to easily add new data sources
• Easily improve and expand analytics
• Ease data access by standardizing metadata and storage
• Ability to discover mistakes and to recover from them
• Ability to safely experiment with new approaches
We will not discuss:
• Actual decision making
• Data Science
• Machine learning
• Algorithms
We will discuss:
• Architectures
• Patterns
• Ingest
• Storage
• Schemas
• Metadata
• Streaming
• Experimenting
• Recovery
So how do we build real data architectures?
The Data Bus
Data pipelines start like this: [diagram — a single client connected to a single source]
Then we reuse them: [diagram — several clients all connected to the same source]
Then we add consumers to the existing sources: [diagram — the clients and backend from before, with another backend added as a consumer]
Then it starts to look like this: [diagram — many clients and many backends wired together point-to-point]
With maybe some of this: [diagram — the same tangle of clients and backends, with still more connections layered on top]
Adding applications should be easier
We need:
• Shared infrastructure for sending records
• Infrastructure must scale
• Set of agreed-upon record schemas
Kafka Based Ingest Architecture
Kafka decouples data pipelines: [diagram — source systems act as Kafka producers publishing to the brokers; consumers such as Hadoop, security systems, real-time monitoring, and the data warehouse read from the brokers]
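For a sense of what "shared infrastructure for sending records" looks like in code, here is a minimal producer sketch (not from the deck, assuming the 0.8.2+ Java producer API used from Scala); the broker address, topic name, key, and payload are placeholders:

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

val props = new Properties()
props.put("bootstrap.servers", "broker1:9092") // placeholder broker list
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

val producer = new KafkaProducer[String, String](props)
// The source system just publishes; it doesn't know or care who consumes the topic.
producer.send(new ProducerRecord[String, String]("pharmacy.fraud.orders.raw", "order-42", "{...}"))
producer.close()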
Retain All Data
Data Pipeline – Traditional View: [diagram — input flows from raw data through clean, enriched, and aggregated stages to the output; keeping the intermediate data is seen as a waste of disk space]
It is all valuable data: [diagram — the same pipeline, but every stage (raw, clean, enriched, filtered, aggregated) feeds someone: dashboards, reports, data scientists, and alerts]
Hadoop Based ETL – The FileSystem is the DB
/user/…
/user/gshapira/testdata/orders
/data/<database>/<table>/<partition>
/data/<biz unit>/<app>/<dataset>/partition
/data/pharmacy/fraud/orders/date=20131101
/etl/<biz unit>/<app>/<dataset>/<stage>
/etl/pharmacy/fraud/orders/validated
Store intermediate data
/etl/<biz unit>/<app>/<dataset>/<stage>/<dataset_id>
/etl/pharmacy/fraud/orders/raw/date=20131101
/etl/pharmacy/fraud/orders/deduped/date=20131101
/etl/pharmacy/fraud/orders/validated/date=20131101
/etl/pharmacy/fraud/orders_labs/merged/date=20131101
/etl/pharmacy/fraud/orders_labs/aggregated/date=20131101
/etl/pharmacy/fraud/orders_labs/ranked/date=20131101
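A tiny helper makes it harder to drift from this convention; a hypothetical sketch (names and signature chosen for illustration, not from the talk):

// Build a stage path following /etl/<biz unit>/<app>/<dataset>/<stage>/<partition>
def etlPath(bizUnit: String, app: String, dataset: String, stage: String, date: String): String =
  s"/etl/$bizUnit/$app/$dataset/$stage/date=$date"

// etlPath("pharmacy", "fraud", "orders", "deduped", "20131101")
//   => /etl/pharmacy/fraud/orders/deduped/date=20131101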
Batch ETL is old news
Small Problem!
• HDFS is optimized for large chunks of data
• Don’t write individual events or micro-batches
• Think 100MB–2GB batches
• What do we do with small events?
Well, we have this data bus…
[Diagram — a Kafka topic with three partitions; each partition is an ordered, append-only log of numbered offsets, and writes always go to the new end of the log]
Kafka has topics
How about?
<biz unit>.<app>.<dataset>.<stage>
pharmacy.fraud.orders.raw
pharmacy.fraud.orders.deduped
pharmacy.fraud.orders.validated
pharmacy.fraud.orders_labs.merged
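Since the coordinates match the HDFS layout, the topic names can come from the same kind of helper; again a hypothetical sketch:

// Same coordinates as the HDFS layout, rendered as a dotted Kafka topic name
def topicName(bizUnit: String, app: String, dataset: String, stage: String): String =
  s"$bizUnit.$app.$dataset.$stage"

// topicName("pharmacy", "fraud", "orders", "deduped") => pharmacy.fraud.orders.deduped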
It’s (almost) all topics: [diagram — the same pipeline, with raw, clean, enriched, filtered, and aggregated data each flowing through a topic; dashboards, reports, data scientists, and alerts consume the topics they need]
Benefits
• Recover from accidents
• Debug suspicious results
• Fix algorithm errors
• Experiment with new algorithms
• Expand pipelines
• Jump-start expanded pipelines
Kinda Lambda
Lambda Architecture
• Immutable events
• Store intermediate stages
• Combine Batches and Streams
• Reprocessing
What we don’t like
Maintaining two applications
Often in two languages
That do the same thing
Click to enter confidentiality information
33
Pain Avoidance #1 – Use Spark + Spark Streaming
• Spark is awesome for batch, so why not?
– The New Kid that isn’t that New Anymore
– Easily 10x less code
– Extremely Easy and Powerful API
– Very good for machine learning
– Scala, Java, and Python
– RDDs
– DAG Engine
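To make the "less code" point concrete, a minimal batch sketch (not from the deck): it counts error lines in one day's raw orders. The input path reuses the layout from earlier, and the ERROR marker is just an assumption.

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("BatchErrorCount"))
// Count the lines flagged as errors in one day's raw data (path and marker are placeholders)
val errors = sc.textFile("/etl/pharmacy/fraud/orders/raw/date=20131101")
  .filter(line => line.contains("ERROR"))
  .count()
println(s"Errors in the batch: $errors")
sc.stop()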
Spark Streaming
• Calling Spark in a Loop
• Extends RDDs with DStream
• Very Little Code Changes from ETL to Streaming
Spark Streaming: [diagram — a receiver reads from the source into RDDs; before the first batch interval nothing is processed, then each batch of accumulated RDDs is run through a single pass of filter → count → print]
Small Example
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}

val sparkConf = new SparkConf()
  .setMaster(args(0)).setAppName(this.getClass.getCanonicalName)
val ssc = new StreamingContext(sparkConf, Seconds(10))
ssc.checkpoint("/tmp/checkpoint") // updateStateByKey needs a checkpoint dir (placeholder path)
// Create the DStream from data sent over the network
val dStream = ssc.socketTextStream(args(1), args(2).toInt, StorageLevel.MEMORY_AND_DISK_SER)
// Count the errors in each RDD in the stream
// (ErrorCount.countErrors and updateFunc are defined elsewhere in the full example)
val errCountStream = dStream.transform(rdd => ErrorCount.countErrors(rdd))
val stateStream = errCountStream.updateStateByKey[Int](updateFunc)
errCountStream.foreachRDD(rdd => {
  System.out.println("Errors this minute:%d".format(rdd.first()._2))
})
ssc.start() // nothing runs until the context is started
ssc.awaitTermination()
Pain Avoidance #2 – Split the Stream
Why do we even need stream + batch?
• Batch efficiencies
• Re-process to fix errors
• Re-process after delayed arrival
What if we could re-play data?
Let’s re-process with a new algorithm: [diagram — Streaming App v1 and Streaming App v2 both read the same partitioned topic; v1 keeps producing result set 1 while v2 replays the log and produces result set 2 for the application to compare]
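One way to build "Streaming App v2" in this picture is sketched below (an assumption, not from the deck): assuming the spark-streaming-kafka integration is on the classpath and ssc is the StreamingContext from the earlier example, a receiver with a brand-new consumer group and auto.offset.reset set to smallest re-reads the topic from the oldest data Kafka has retained.

import kafka.serializer.StringDecoder
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.kafka.KafkaUtils

val kafkaParams = Map(
  "zookeeper.connect" -> "zk1:2181",  // placeholder ZooKeeper quorum
  "group.id"          -> "fraud-v2",  // new group => offsets independent of the v1 app
  "auto.offset.reset" -> "smallest")  // start from the oldest retained messages
val replayStream = KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, Map("pharmacy.fraud.orders.raw" -> 1), StorageLevel.MEMORY_AND_DISK_SER)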
Oh no, we just got a bunch of data for yesterday! [diagram — two instances of the same streaming app read the partitioned topic: one processes today’s data as it arrives, the other replays the log to reprocess yesterday’s]
Note: No need to choose between the approaches. There are good reasons to do both.
Prediction: The batch vs. streaming distinction is going away.
Yes, you really need a Schema
Schema is a MUST HAVE for data integration
[Diagram — the earlier point-to-point tangle of clients and backends, shown again]
Remember that we want this? [diagram — the Kafka architecture again: source systems publish to the brokers as producers, and Hadoop, security systems, real-time monitoring, and the data warehouse consume]
This means we need this: [diagram — the same architecture with a Schema Repository alongside Kafka, shared by the source systems and all the consumers]
We can do it in a few ways
• People go around asking each other:
“So, what does the 5th field of the messages in topic Blah contain?”
• There’s utility code for reading/writing messages that everyone
reuses
• Schema embedded in the message
• A centralized repository for schemas
– Each message has Schema ID
– Each topic has Schema ID
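For the "each message has a schema ID" option, one possible framing is sketched below (hypothetical; the talk doesn't specify a wire format): a small integer prefix that consumers use to look the schema up in the repository.

import java.nio.ByteBuffer

// Prepend a 4-byte schema ID to the serialized payload
def frame(schemaId: Int, payload: Array[Byte]): Array[Byte] =
  ByteBuffer.allocate(4 + payload.length).putInt(schemaId).put(payload).array()

// Split a framed message back into (schema ID, payload)
def unframe(message: Array[Byte]): (Int, Array[Byte]) = {
  val buf = ByteBuffer.wrap(message)
  val schemaId = buf.getInt()
  val payload = new Array[Byte](buf.remaining())
  buf.get(payload)
  (schemaId, payload)
}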
I ♥ Avro
• Define Schema
• Generate code for objects
• Serialize / Deserialize into Bytes or JSON
• Embed schema in files / records… or not
• Support for our favorite languages… Except Go.
• Schema Evolution
– Add and remove fields without breaking anything
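A minimal Avro round trip in Scala, as a sketch (the Order schema and its field names are invented for illustration):

import java.io.ByteArrayOutputStream
import org.apache.avro.Schema
import org.apache.avro.generic.{GenericData, GenericDatumReader, GenericDatumWriter, GenericRecord}
import org.apache.avro.io.{DecoderFactory, EncoderFactory}

// Define the schema (hypothetical record for the fraud pipeline)
val schema = new Schema.Parser().parse(
  """{"type":"record","name":"Order","namespace":"pharmacy.fraud","fields":[
    |  {"name":"order_id","type":"long"},
    |  {"name":"amount","type":"double"}
    |]}""".stripMargin)

val order = new GenericData.Record(schema)
order.put("order_id", 42L)
order.put("amount", 19.99)

// Serialize to bytes (no schema embedded — pair with a schema ID as in the framing sketch above)
val out = new ByteArrayOutputStream()
val encoder = EncoderFactory.get().binaryEncoder(out, null)
new GenericDatumWriter[GenericRecord](schema).write(order, encoder)
encoder.flush()
val bytes = out.toByteArray

// Deserialize back into a record
val decoder = DecoderFactory.get().binaryDecoder(bytes, null)
val back = new GenericDatumReader[GenericRecord](schema).read(null, decoder)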
Schemas are Agile
• Leave out MySQL and your favorite DBA for a second
• Schemas allow adding readers and writers easily
• Schemas allow modifying readers and writers independently
• Schemas can evolve as the system grows
• Allows validating data soon after it’s written
– No need to throw away data that doesn’t fit!
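Continuing the sketch above, schema evolution in practice: a reader with a newer schema (one added field with a default, invented here for illustration) can still decode bytes written with the old schema.

// `schema` and `bytes` come from the round-trip sketch above
val newSchema = new Schema.Parser().parse(
  """{"type":"record","name":"Order","namespace":"pharmacy.fraud","fields":[
    |  {"name":"order_id","type":"long"},
    |  {"name":"amount","type":"double"},
    |  {"name":"channel","type":"string","default":"unknown"}
    |]}""".stripMargin)

// Pass both the writer's schema and the reader's schema; Avro resolves the difference
val evolvedReader = new GenericDatumReader[GenericRecord](schema, newSchema)
val upgraded = evolvedReader.read(null, DecoderFactory.get().binaryDecoder(bytes, null))
// upgraded.get("channel") is "unknown", filled in from the default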
Whoa, that was lots of stuff!
Recap – if you remember nothing else…
• After the POC, it’s time for production
• Goal: Evolve fast without breaking things
For this you need:
• Keep all data
• Design pipeline for error recovery – batch or stream
• Integrate with a data bus
• And Schemas
Thank you

Editor's Notes

  • #3 This gives me a lot of perspective regarding the use of Hadoop
  • #6 Not everyone, obviously. But I see a lot of “POC” type use-cases. 1 use case, maybe 3 data sources, 2 interesting insights from analysis. Everything requires lots of manual labor.
  • #8 Shikumika means “systemize.” This is the step that is crucial to improvement at any large entity. Shikumika means creating a base on which you can continue the improvement process. Because at an individual level, the original three steps are sufficient: build a hypothesis; act on it; and verify the results. If the validation proves the hypothesis to be the right one, you can simply continue acting on it. But for an entire organization, that’s not enough. The steps could end up as a hollow slogan. From the viewpoint of an organization, the cycle of hypothesizing, practicing and validating, conducted by an employee or a department, is a small experiment. If a hypothesis holds true in a small experiment, we can run with that hypothesis on a larger, organization-wide scale.
  • #9 We are looking for AGILE. The ability to expand, grow and evolve. To be flexible without adding tons of risk and overhead.
  • #14 Then we end up adding clients to use that source.
  • #15 But as we start to deploy our applications we realize that clients need data from a number of sources. So we add them as needed.
  • #17 But over time, particularly if we are segmenting services by function, we have stuff all over the place, and the dependencies are a nightmare. This makes for a fragile system.
  • #19 Kafka is a pub/sub messaging system that can decouple your data pipelines. Most of you are probably familiar with its history at LinkedIn, where they use it as a high-throughput, relatively low-latency commit log. It allows sources to push data without worrying about what clients are reading it. Note that producers push, and consumers pull. Kafka itself is a cluster of brokers, which handles both persisting data to disk and serving that data to consumer requests.
  • #42 Approach #1 – easier to develop and deploy in production; doesn’t require a set of “spare” servers for the second stream. Approach #2 – allows for real-time experiments.
  • #43 There will be tools and patterns to move seamlessly between the two. Perhaps you won’t even need to care – just say how often you want the data refreshed – every day? Hour? 5 minutes? 5 seconds? 5 milliseconds?
  • #45 Sorry, but “Schema on Read” is kind of B.S. We admit that there is a schema, but we want to “ingest fast”, so we shift the burden to the readers. But the data is written once and read many many times by many different people. They each need to figure this out on their own? This makes no sense. Also, how are you going to validate the data without a schema?
  • #46 But over time, particularly if we are segmenting services by function, we have stuff all over the place, and the dependencies are a nightmare. This makes for a fragile system.
  • #47 Kafka is a pub/sub messaging system that can decouple your data pipelines. Most of you are probably familiar with its history at LinkedIn, where they use it as a high-throughput, relatively low-latency commit log. It allows sources to push data without worrying about what clients are reading it. Note that producers push, and consumers pull. Kafka itself is a cluster of brokers, which handles both persisting data to disk and serving that data to consumer requests.
  • #49 https://github.com/schema-repo/schema-repo