Data Platform in the Cloud

Amihay Zer-Kavod, Code Naturally, Apr 2018
Date: Apr-2018

Who Am I
Amihay Zer-Kavod
Software Architect
In software since 1989

Agenda
● The evolution of a data platform
● Data platform design principles
● Data platform technologies
● Data platform in the cloud
○ Data Lake - How to build
○ Data Lake - Technology selection
○ Data Propagation and Near real time processing - How to build

The Data Platform
● A unified platform for collecting, accessing and processing ALL of NI data
○ Collection - collect and persist
○ Standardized - consistent business data
○ Access - Standardized, Optimized, Ad-hoc, Applicative
● All in a stable, flexible, monitored, fast and cost-effective data platform
● Making all of the company’s business-related data quickly available for easy consumption, to create insights and drive the business forward.

Data Platform Evolution
“You have to be careful if you don’t know where you are going, because you might not get there!”
Yogi Berra
“Technology always develops from the primitive, via the complicated, to the simple.”
Antoine de Saint-Exupéry

● A monolith with a DB
Issues:
● “All is good in the land of monolith”

Data Platform Evolution - The monolith grows
● A Bigger monolith with a DB
Issues:
● Deployments start to slow down

● A bigger monolith with a DB
● With some data tool
Issues:
● Dependency between monolith and data service
● Changes in the monolith break the data tools
● Data tools impact performance

● A distributed monolith with a DB
● With more data tools
Issues:
● Changes in DB schema break the data tools
● Data tools impact performance
● The new services lock each other
● Monolith refactoring

● Data tools read from replica
Issues:
● Changes in DB schema break the data tools
● Replica fails
● Monolith refactoring
● Data freshness

● ETL
Issues:
The data tools
● Monolith
● Data freshness

● Microservices
● Monolith DB + replica
● Data warehouse
Issues:
● The ETL breaks all the time
● Getting data from microservices
● Data warehouse flexibility + performance
● Data freshness

● Application events
● Event Bus
● ETL
● Data Warehouse
● More data tools
Issues:
● Data warehouse flexibility + performance
● Event consistency
● Data freshness

● Event Bus
● ETL
● Data Lake
○ Metastore
○ Processing Engines
○ Data Stores
○ SQL access
● Any data application
Issues:
● Data freshness

● Event Bus
● Near Real Time
● ETL
● Data Lake
○ Metastore
○ Processing Engines
○ Data Stores
○ SQL access
● Any data application
Issues:
Real Time

Base principle used in the data platform evolution ...
“Any problem in Computer Science can be solved with another level of indirection”
– David Wheeler
“Except the problem of indirection complexity”
– Bob Morgan

Data Platform - Design Principles
● Event-driven separation between producers and consumers of data
● Use a suitable technology for each problem
● Near real time access to all data
● Data Lake
○ All data goes to the data lake
○ The data lake exposes data as the main flow of data
○ SQL/API/File access
○ Data is immutable
○ The data lake is the “source of truth”, not any other DB!

Data Platform Facets
● Data Propagation
○ Event bus and event structuring
● Data Persistence
○ Durability, partitioning and formatting
● Data Access
○ Give users/applications access to data at whatever SLA is needed
● Data Standardization
○ Unified business data
● Data Processing
○ Infrastructure for ETLs, algorithms and app processing

Data Lake - Core Parts
● Scalable object store
● Data digest ETLs
● Data
○ format and partition
● A metastore/Dictionary
● Processing Engines
● Data Lake APIs
○ SQL accessible
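The "format and partition" part can be sketched as follows. This is a minimal illustration, assuming Hive-style `key=value` partition folders under an object-store prefix; the function name `partition_path`, the base prefix `s3://my-lake/events` and the event type `page_view` are hypothetical and not taken from the slides.

```python
from datetime import datetime, timezone


def partition_path(base: str, event_type: str, event_time: str) -> str:
    """Build a Hive-style partition prefix (year=/month=/day=) for one event.

    Assumes event_time is ISO-8601 in UTC with fractional seconds and a
    trailing 'Z', e.g. 2017-09-07T07:17:31.503Z.
    """
    ts = datetime.strptime(event_time, "%Y-%m-%dT%H:%M:%S.%fZ")
    ts = ts.replace(tzinfo=timezone.utc)
    return (
        f"{base}/{event_type}"
        f"/year={ts.year:04d}/month={ts.month:02d}/day={ts.day:02d}"
    )


# Hypothetical bucket and event type, for illustration only.
print(partition_path("s3://my-lake/events", "page_view", "2017-09-07T07:17:31.503Z"))
```

Partitioning by date keeps each day's files under one prefix, so engines that understand this layout can skip folders outside the queried date range.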

Data Lake - Technologies - DIY
● HDFS
● Hive MetaStore
● Processing
○ Spark
○ Tez
○ M/R
● Data Access
○ Spark SQL
○ Impala
○ Presto
● Parquet formatting
● Distributions: Cloudera, Hortonworks, MapR

Data Lake - Technologies - AWS
● S3
● EMR + Spark
● Athena
● Redshift & Spectrum
● AWS Glue Metastore
● AWS Glue ETL

Data Lake - Technologies - AWS - DIY Hybrid
● S3
● Spark on EMR
○ ETL and processing
● Athena
● Redshift & Spectrum
● AWS Glue Metastore
● Parquet

Cloud Data Lake - DIY vs. AWS vs. ...
               AWS      Open - DIY   Hybrid DIY   Hybrid Fully Managed   Proprietary
Features       7        10           10           8                      8
Scalability    9        10           10           9                      9
Operation      Easy     Hard         Medium       Easy                   Easy
Availability   10       9-10         9-10         9-10                   6
Flexibility    7        10           10           7                      6
Dev effort     Medium   Hard         Medium       Medium                 Easy
Testability    7        10           8            8                      4
Cost - Start   Low      High         Medium       Low                    Low
Cost - Run     High     Medium       Medium       High                   High
Vendor Lock    High     None         Low          Low                    Damn
Acronym: DVOF-FACTS :)
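One way to turn the comparison above into a ranking is a weighted average over the rows that have a single numeric score. The sketch below is only an illustration: the scores are copied from the Features, Scalability, Flexibility and Testability rows, the column names are as read from the table header, and the equal weights are an assumption, not something the slides prescribe.

```python
from typing import Dict, List

# Numeric rows copied from the data lake comparison table.
SCORES: Dict[str, Dict[str, int]] = {
    "AWS": {"Features": 7, "Scalability": 9, "Flexibility": 7, "Testability": 7},
    "Open - DIY": {"Features": 10, "Scalability": 10, "Flexibility": 10, "Testability": 10},
    "Hybrid DIY": {"Features": 10, "Scalability": 10, "Flexibility": 10, "Testability": 8},
    "Hybrid Fully Managed": {"Features": 8, "Scalability": 9, "Flexibility": 7, "Testability": 8},
    "Proprietary": {"Features": 8, "Scalability": 9, "Flexibility": 6, "Testability": 4},
}

# Assumption: every aspect matters equally. Adjust to your own priorities.
WEIGHTS: Dict[str, float] = {
    "Features": 1.0, "Scalability": 1.0, "Flexibility": 1.0, "Testability": 1.0,
}


def weighted_score(scores: Dict[str, int], weights: Dict[str, float]) -> float:
    """Weighted average of the aspect scores."""
    total_weight = sum(weights.values())
    return sum(scores[aspect] * w for aspect, w in weights.items()) / total_weight


def rank(options: Dict[str, Dict[str, int]], weights: Dict[str, float]) -> List[str]:
    """Option names ordered from highest to lowest weighted score."""
    return sorted(options, key=lambda name: weighted_score(options[name], weights), reverse=True)


for name in rank(SCORES, WEIGHTS):
    print(f"{name}: {weighted_score(SCORES[name], WEIGHTS):.2f}")
```

The qualitative rows (Operation, Dev effort, Cost, Vendor Lock) are left out on purpose; mapping them to numbers is a judgment call that each team would have to make for itself.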

Data Propagation
● Event Structure and Format
○ JSON, Avro, Protobuf...
● Event bus
○ Event-based flow of information between the systems
○ Integration with external systems using the events
○ Decouple data construction from data consumption
○ Kinesis/Firehose
○ Kafka/Confluent
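The decoupling idea can be shown with a toy in-memory bus. This is a sketch only: `InMemoryEventBus`, the `clicks` topic and both consumers are hypothetical stand-ins, and a real deployment would use Kafka or Kinesis rather than in-process callbacks.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, List

Handler = Callable[[dict], None]


class InMemoryEventBus:
    """Toy stand-in for an event bus: routes events to subscribers by topic."""

    def __init__(self) -> None:
        self._handlers: DefaultDict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Handler) -> None:
        self._handlers[topic].append(handler)

    def publish(self, topic: str, event: dict) -> int:
        """Deliver the event to every subscriber and return the delivery count."""
        handlers = self._handlers.get(topic, [])
        for handler in handlers:
            handler(event)
        return len(handlers)


bus = InMemoryEventBus()
lake_rows: List[dict] = []   # hypothetical consumer 1: persists raw events
click_count = {"n": 0}       # hypothetical consumer 2: keeps a running count


def count_click(event: dict) -> None:
    click_count["n"] += 1


bus.subscribe("clicks", lake_rows.append)
bus.subscribe("clicks", count_click)

# The producer only knows the topic name, not who consumes the event.
bus.publish("clicks", {"page": "/home"})
```

Adding a third consumer needs no change on the producer side, which is the point of decoupling data construction from data consumption.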

Event structure
A single event
{
  // Event Header
  "event_header": {
    "id": "{guid}",
    "event_type": "{map the schema}",
    "action": "publish",
    "schema_version": "{schema evolution}",
    "event_time": "2017-09-07T07:17:31.503Z"
  },
  // Platform Header
  "platform_header": {
    "platform": "{system}",
    "service": "{service name}"
  },
  // Other Optional Headers
  "some_header": {
    "from": "2017-04-01",
    "to": "2017-04-01",
    "someType": "bla"
  },
  // Specific Event Data
  "data": {
    // all other specific fields of the event
    …
  }
}
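A producer could assemble this envelope with a small helper. The sketch below is one possible shape, not the author's implementation: `build_event` and its parameters are hypothetical, the key `event_type` is assumed to be the intended spelling of the event-type field, and the default schema version `"1"` is a placeholder.

```python
import json
import uuid
from datetime import datetime, timezone


def build_event(event_type: str, data: dict, platform: str, service: str,
                schema_version: str = "1") -> dict:
    """Wrap event-specific data in the event and platform headers."""
    now = datetime.now(timezone.utc)
    # Millisecond-precision UTC timestamp, e.g. 2017-09-07T07:17:31.503Z
    event_time = now.strftime("%Y-%m-%dT%H:%M:%S.") + f"{now.microsecond // 1000:03d}Z"
    return {
        "event_header": {
            "id": str(uuid.uuid4()),
            "event_type": event_type,          # assumed field name
            "action": "publish",
            "schema_version": schema_version,  # placeholder default
            "event_time": event_time,
        },
        "platform_header": {"platform": platform, "service": service},
        "data": data,
    }


# Hypothetical producer values, for illustration only.
event = build_event("page_view", {"page": "/home"}, platform="web", service="frontend")
print(json.dumps(event, indent=2))
```

Keeping the headers identical across all event types lets the bus, the ETLs and the data lake route and partition events without knowing anything about the `data` payload.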

Near Real Time - Core Parts
● Event Bus
● Streaming processing engines
● NoSQL DBs
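How the three parts fit together can be sketched in a few lines: events arrive from the bus, a streaming step aggregates them, and the result lands in a key-value store. Everything here is a simplification under stated assumptions: a plain iterable stands in for the event bus, a `Counter` stands in for the NoSQL DB, and the `event_header` fields `event_type` and `event_time` are assumed names.

```python
from collections import Counter
from typing import Iterable


def minute_bucket(event_time: str) -> str:
    """Truncate an ISO-8601 timestamp to the minute, e.g. 2017-09-07T07:17."""
    return event_time[:16]


def count_per_minute(events: Iterable[dict], store: Counter) -> Counter:
    """Count events per (event_type, minute) and write the counts to the store."""
    for event in events:
        header = event["event_header"]
        key = (header["event_type"], minute_bucket(header["event_time"]))
        store[key] += 1
    return store


# Hypothetical events, shaped like the event structure used in this deck.
stream = [
    {"event_header": {"event_type": "click", "event_time": "2017-09-07T07:17:31.503Z"}},
    {"event_header": {"event_type": "click", "event_time": "2017-09-07T07:17:45.000Z"}},
    {"event_header": {"event_type": "click", "event_time": "2017-09-07T07:18:01.250Z"}},
]
store = count_per_minute(stream, Counter())
print(store)
```

A real streaming engine such as Spark Streaming or Flink adds what this sketch leaves out: distribution, fault tolerance and handling of late events.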

Near Real Time - DIY
● Amazon
○ Kinesis Firehose - write to S3/Redshift warehouse
○ Kinesis Analytics
○ DynamoDB
○ Spark Streaming
○ Flink
○ Confluent Kafka
○ Kinesis Streams
○ ...
● Proprietary NoSQL DBs
○ MemSQL
○ Snowflake
○ Couchbase
○ Aerospike
○ Cassandra
○ Elastic

Near Real Time - AWS
● Data propagation
○ Kinesis Firehose - write to S3/Redshift warehouse
○ DynamoDB
○ Redshift
○ Kinesis Analytics
○ ...
● NoSQL DBs
○ Managed Elastic
○ DynamoDB

Near Real Time - AWS - DIY Hybrid
● Data propagation
○ Confluent Kafka
○ EMR + Spark Streaming
○ EMR + Flink
● NoSQL DBs
○ Managed Elastic
○ DynamoDB
○ MemSQL
○ Snowflake
○ Couchbase
○ Aerospike
○ Cassandra

Near Real Time - DIY vs. AWS vs. ...
               AWS      Open - DIY   Hybrid DIY   Hybrid Fully Managed   Proprietary
Features       5        10           10           9                      ?
Scalability    8        10           10           9                      8
Operation      Easy     Hard         Medium       Easy                   Easy
Availability   10       9-10         9-10         9-10                   6
Flexibility    6        10           10           9                      6
Dev effort     Medium   Hard         Medium       Medium                 Easy
Testability    7        10           10           9                      4
Cost - Start   Low      High         Medium       Low                    Low
Cost - Run     High     Medium       Medium       High                   High
Vendor Lock    High     None         Low          Low                    Damn

Bottom Line
● A data platform in the cloud is the same as a private data platform, but with the option of using managed solutions!
● Structure your data starting from your producers - remember: garbage in, garbage out!
● Pick the right technology for your problem!
● Choose your solution using these aspects:
○ Dev effort
○ Vendor lock
○ Operation effort
○ Flexibility
○ Features
○ Availability
○ Cost
○ Testability
○ Scalability
Acronym: DVOF-FACTS :)
