This document discusses building a data platform in the cloud. It covers the evolution of data platforms from monolithic architectures to distributed event-driven architectures using a data lake. Key aspects of a cloud data platform include collecting and persisting all data in a data lake for standardized access, near real-time processing using streaming technologies, and building the platform using either fully managed or DIY/hybrid approaches on AWS. Design principles focus on event-driven separation of data producers and consumers and choosing the right technology for the problem.
3. Agenda
● The evolution of a data platform
● Data platform design principles
● Data platform technologies
● Data platform in the cloud
○ Data Lake - How to build
○ Data Lake - Technology selection
○ Data Propagation and Near real time processing - How to build
4. The Data Platform
● A unified platform for collecting, accessing and processing ALL of NI data
○ Collection - collect and persist
○ Standardized - consistent business data
○ Access - Standardized, Optimized, Ad-hoc, Applicative
● All in a stable, flexible, monitored, fast and cost effective data platform
● Making all of the company’s business related data available quickly for easy consumption for creating insights and driving the business forward.
5. Data Platform Evolution
“You have to be careful if you don’t know where you are going because you might not get there!”
– Yogi Berra
“Technology always develops from the primitive, via the complicated, to the simple.”
– Antoine de Saint-Exupéry
7. Data Platform Evolution - The monolith grows
● A Bigger monolith with a DB
Issues:
● Deployments start to slow down
8. Data Platform Evolution
● A bigger monolith with a DB
● With some data tool
Issues:
● Dependency between monolith and data service
● Monolith
9. Data Platform Evolution
● A bigger monolith with a DB
● With some data tool
Issues:
● Dependency between monolith and data service
● Changes in the monolith break the data tools
● Data tools impact performance
10. Data Platform Evolution
● A distributed Monolith with a DB
● With more data tools
Issues:
● Changes in DB schema break the data tools
● Data tools impact performance
● The new services lock each other
● Monolith
Refactoring
11. Data Platform Evolution
● A distributed Monolith with a DB
● Data tools read from replica
Issues:
● Changes in DB schema break the data tools
● Replica fails
● The new services lock each other
● Monolith
● Data freshness
Refactoring
12. Data Platform Evolution
● A distributed Monolith with a DB
● ETL
● With more data tools
Issues:
● Changes in DB schema break the data tools
● The new services lock each other
● Monolith
● Data freshness
13. Data Platform Evolution
● Microservices
● Monolith DB + replica
● With more data tools
● Data warehouse
Issues:
● Changes in DB schema break the ETL
● Getting data from microservices
● Data warehouse flexibility + performance
● Data freshness
Breaks all the time
14. Data Platform Evolution
● Applications events
● Event Bus
● ETL
● Data Warehouse
● More data tools
Issues:
● Data warehouse flexibility + performance
● Events consistency
● Data freshness
15. Data Platform Evolution
● Applications events
● Event Bus
● ETL
● Data Lake
○ Metastore
○ Processing Engines
○ Data Stores
○ SQL access
● Any data application
Issues:
● Events consistency
● Data freshness
16. Data Platform Evolution
● Applications events
● Event Bus
● Near Real Time
● ETL
● Data Lake
○ Metastore
○ Processing Engines
○ Data Stores
○ SQL access
● Any data application
Issues:
● Events consistency
17. “Any problem in Computer Science can be solved with another
level of indirection”
– David Wheeler
“Except the problem of indirection complexity”
– Bob Morgan
Base principle used in the data platform evolution ...
18. Data Platform - Design Principles
● Event driven separation between producers and consumers of data
● Use the suitable technology for the problem
● Near real time access to all data
● Data Lake
○ All data goes to the data lake
○ The data lake exposes data as the main flow of data
○ SQL/API/file access
○ Data is immutable
○ The data lake is the “source of truth”; no other DB is!
20. Data Platform Facets
● Data Propagation
○ Events Bus and Event Structuring
● Data Persistence
○ Durability, Partitioning and Formatting
● Data Access
○ Give users/applications access to data at any SLA needed
● Data Standardization
○ Unified business data
● Data Processing
○ ETLs, algorithms and apps processing infra
22. Data Lake - Core Parts
● Scalable object store
● Data digest ETLs
● Data
○ format and partition
● A metastore/Dictionary
● Processing Engines
● Data Lake APIs
○ SQL accessible
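The "format and partition" item above can be illustrated with a minimal sketch. It assumes a Hive-style `key=value` directory layout, partitioned by event type, date and hour; this is a common convention on object stores, not something the slides specify. The bucket name, prefix and the `partition_path` helper are hypothetical.

```python
from datetime import datetime, timezone


def partition_path(base: str, event_type: str, event_time: str) -> str:
    """Return a Hive-style partition directory for one event.

    Assumed layout (not taken from the slides):
    <base>/event_type=<type>/dt=<YYYY-MM-DD>/hour=<HH>
    """
    # Parse an ISO-8601 UTC timestamp such as 2017-09-07T07:17:31.503Z.
    ts = datetime.strptime(event_time, "%Y-%m-%dT%H:%M:%S.%fZ").replace(
        tzinfo=timezone.utc
    )
    return f"{base}/event_type={event_type}/dt={ts:%Y-%m-%d}/hour={ts:%H}"


# Hypothetical bucket and event type, for illustration only.
path = partition_path(
    "s3://example-data-lake/events", "page_view", "2017-09-07T07:17:31.503Z"
)
print(path)
```

Engines that read from a metastore can typically prune such partitions when a query filters on `dt` or `event_type`, which is one reason the partition keys are worth choosing around the most common query filters.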
23. Data Lake - Technologies - DIY
● HDFS
● Hive MetaStore
● Processing
○ Spark
○ Tez
○ M/R
● Data Access
○ Spark SQL
○ Impala
○ Presto
● Parquet formatting
Cloudera, Hortonworks, MapR
25. Data Lake - Technologies - AWS - DIY Hybrid
● S3
● Spark on EMR -
○ ETL and Processing
● Athena
● Redshift & Spectrum
● AWS Glue Metastore
● Parquet
26. Cloud Data Lake - DIY vs. AWS vs. ...
Aspect | AWS | Open - DIY | Hybrid DIY | Hybrid Fully Managed | Proprietary
Features | 7 | 10 | 10 | 8 | 8
Scalability | 9 | 10 | 10 | 9 | 9
Operation | Easy | Hard | Medium | Easy | Easy
Availability | 10 | 9-10 | 9-10 | 9-10 | 6
Flexibility | 7 | 10 | 10 | 7 | 6
Dev effort | Medium | Hard | Medium | Medium | Easy
Testability | 7 | 10 | 8 | 8 | 4
Cost | Start - Low, Run - High | Start - High, Run - Medium | Start - Medium, Run - Medium | Start - Low, Run - High | Start - Low, Run - High
Vendor Lock | High | None | Low | Low | Damn
Acronym: DVOF-FACTS :)
28. Data Propagation
● Event structure and format
○ JSON, Avro, Protobuf...
● Event bus
○ Event-based flow of information between the systems
○ Integration with external systems using the events
○ Decouple data construction from data consumption
○ Kinesis/Firehose
○ Kafka/Confluent
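The decoupling described above can be sketched with a toy in-memory bus. This is only an illustration of the publish/subscribe idea, not how Kinesis or Kafka work internally; the `InMemoryEventBus` class, the `orders` topic and the two consumers are all hypothetical.

```python
from collections import defaultdict
from typing import Callable, Dict, List

Event = dict
Handler = Callable[[Event], None]


class InMemoryEventBus:
    """Toy stand-in for an event bus: routes events by topic to subscribers."""

    def __init__(self) -> None:
        self._subscribers: Dict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Handler) -> None:
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, event: Event) -> None:
        # The producer only knows the topic, never the consumers behind it.
        for handler in self._subscribers[topic]:
            handler(event)


bus = InMemoryEventBus()
lake: List[Event] = []                      # stands in for the data lake sink
counts: Dict[str, int] = defaultdict(int)   # stands in for a second consumer


def count_by_action(event: Event) -> None:
    counts[event["action"]] += 1


# Two independent consumers of the same topic.
bus.subscribe("orders", lake.append)
bus.subscribe("orders", count_by_action)

# The producer publishes without knowing who is listening.
bus.publish("orders", {"id": "1", "action": "publish"})
bus.publish("orders", {"id": "2", "action": "publish"})
```

Adding a third consumer here would require no change to the producer, which is the property the event bus is meant to provide.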
29. Event structure
A single event:
{
  // Event header
  "event_header": {
    "id": "{guid}",
    "event_type": "{map the schema}",
    "action": "publish",
    "schema_version": "{schema evolution}",
    "event_time": "2017-09-07T07:17:31.503Z"
  },
  // Platform header
  "platform_header": {
    "platform": "{system}",
    "service": "{service name}"
  },
  // Other optional headers
  "some_header": {
    "from": "2017-04-01",
    "to": "2017-04-01",
    "someType": "bla"
  },
  // Specific event data
  "data": {
    // all other specific fields of the event
    …
  }
}
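A producer could assemble this envelope with a small helper along the following lines. This is a sketch under assumptions: the `build_event` function and its parameters are hypothetical, the event type key is written as `event_type` with an underscore, and the optional headers are left out.

```python
import json
import uuid
from datetime import datetime, timezone


def build_event(platform, service, event_type, schema_version, data, event_time=None):
    """Wrap event-specific data in the header layout from the slide."""
    now = event_time or datetime.now(timezone.utc)
    return {
        "event_header": {
            "id": str(uuid.uuid4()),
            "event_type": event_type,  # assumed key name (underscore)
            "action": "publish",
            "schema_version": schema_version,
            # Millisecond-precision UTC timestamp, e.g. 2017-09-07T07:17:31.503Z
            "event_time": now.strftime("%Y-%m-%dT%H:%M:%S.")
            + f"{now.microsecond // 1000:03d}Z",
        },
        "platform_header": {"platform": platform, "service": service},
        "data": data,
    }


# Hypothetical values, for illustration only.
event = build_event(
    platform="web",
    service="checkout",
    event_type="order_created",
    schema_version="1",
    data={"order_id": 42},
    event_time=datetime(2017, 9, 7, 7, 17, 31, 503000, tzinfo=timezone.utc),
)
print(json.dumps(event, indent=2))
```

Keeping the headers identical across all producers lets consumers route and version events without inspecting the `data` payload.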
30. Near Real Time - Core Parts
● Event Bus
● Streaming processing engines
● NoSQL DBs
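As a rough illustration of what a streaming processing engine does between the event bus and a NoSQL store, the sketch below counts events in fixed one-minute windows. It is plain Python rather than Spark Streaming or Flink, it works on a finite list instead of an unbounded stream, and the flat event shape and the `tumbling_window_counts` name are assumptions made for the example.

```python
from collections import Counter
from datetime import datetime

EPOCH = datetime(1970, 1, 1)


def tumbling_window_counts(events, window_seconds=60):
    """Count events per (window start, event type) in fixed, non-overlapping windows."""
    counts = Counter()
    for event in events:
        ts = datetime.strptime(event["event_time"], "%Y-%m-%dT%H:%M:%S.%fZ")
        seconds = int((ts - EPOCH).total_seconds())
        # Round the timestamp down to the start of its window.
        window_start = seconds - seconds % window_seconds
        counts[(window_start, event["event_type"])] += 1
    return counts


# Hypothetical events: two fall in the 07:17 window, one in the 07:18 window.
sample = [
    {"event_type": "click", "event_time": "2017-09-07T07:17:31.503Z"},
    {"event_type": "click", "event_time": "2017-09-07T07:17:59.000Z"},
    {"event_type": "click", "event_time": "2017-09-07T07:18:00.000Z"},
]
window_counts = tumbling_window_counts(sample)
```

In a real pipeline the resulting counts would be written to a key-value store keyed by window start and event type, so that dashboards can read them with low latency.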
31. Near Real Time - DIY
● Amazon
○ Kinesis Firehose - write to S3/Redshift warehouse
○ Kinesis Analytics
○ DynamoDB
● Streaming processing engines
○ Spark Streaming
○ Flink
○ Confluent Kafka
○ Kinesis Streams
○ ...
● Proprietary NoSQL DBs
○ MemSQL
○ Snowflake
○ Couchbase
○ Aerospike
○ Cassandra
○ Elastic
32. Near Real Time - AWS
● Data propagation
○ Kinesis Firehose - write to S3/Redshift warehouse
○ DynamoDB
○ Redshift
● Streaming processing engines
○ Kinesis Analytics
○ ...
● NoSQL DBs
○ Managed Elastic
○ DynamoDB
33. Near Real Time - AWS - DIY Hybrid
● Data propagation
○ Confluent Kafka
● Streaming processing engines
○ EMR + Spark Streaming
○ EMR + Flink
● NoSQL DBs
○ Managed Elastic
○ DynamoDB
○ MemSQL
○ Snowflake
○ Couchbase
○ Aerospike
○ Cassandra
34. Near Real Time - DIY vs. AWS vs. ...
Aspect | AWS | Open - DIY | Hybrid DIY | Hybrid Fully Managed | Proprietary
Features | 5 | 10 | 10 | 9 | ?
Scalability | 8 | 10 | 10 | 9 | 8
Operation | Easy | Hard | Medium | Easy | Easy
Availability | 10 | 9-10 | 9-10 | 9-10 | 6
Flexibility | 6 | 10 | 10 | 9 | 6
Dev effort | Medium | Hard | Medium | Medium | Easy
Testability | 7 | 10 | 10 | 9 | 4
Cost | Start - Low, Run - High | Start - High, Run - Medium | Start - Medium, Run - Medium | Start - Low, Run - High | Start - Low, Run - High
Vendor Lock | High | None | Low | Low | Damn
35. Bottom Line
● A data platform in the cloud is the same as a private data platform, but with the option of using managed solutions!
● Structure your data from your producers - remember: garbage in, garbage out!
● Pick the right technology for your problem!
● Choose your solution using these aspects:
○ Dev effort
○ Vendor lock-in
○ Operation effort
○ Flexibility
○ Features
○ Availability
○ Cost
○ Testability
○ Scalability
Acronym: DVOF-FACTS :)
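One way to turn these aspects into a comparison is a weighted average, sketched below. It uses only the four fully numeric rows of the data lake comparison (Features, Scalability, Flexibility, Testability); the column names reflect one reading of that table's header, and the weights are hypothetical. The qualitative aspects (operation, dev effort, cost, vendor lock) are not captured, so the result is an aid to discussion rather than a recommendation.

```python
OPTIONS = ["AWS", "Open - DIY", "Hybrid DIY", "Hybrid Fully Managed", "Proprietary"]

# Numeric rows from the data lake comparison, one value per option.
SCORES = {
    "Features":    [7, 10, 10, 8, 8],
    "Scalability": [9, 10, 10, 9, 9],
    "Flexibility": [7, 10, 10, 7, 6],
    "Testability": [7, 10, 8, 8, 4],
}


def weighted_scores(scores, weights):
    """Return the weighted average score per option."""
    total_weight = sum(weights[aspect] for aspect in scores)
    return {
        option: sum(scores[a][i] * weights[a] for a in scores) / total_weight
        for i, option in enumerate(OPTIONS)
    }


# Equal weights; a team would substitute its own priorities here.
equal = weighted_scores(SCORES, {aspect: 1 for aspect in SCORES})
for option, score in equal.items():
    print(f"{option}: {score:.2f}")
```

Raising the weight of a single aspect, for example Testability, shifts the ranking toward the options that score well on it, which makes the trade-offs between the columns explicit.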