Data Platform in the Cloud
Amihay Zer-Kavod, Code Naturally, Apr 2018
Who Am I
Amihay Zer-Kavod
Software Architect
In software since 1989
Agenda
● The evolution of a data platform
● Data platform design principles
● Data platform technologies
● Data platform in the cloud
○ Data Lake - How to build
○ Data Lake - Technology selection
○ Data Propagation and Near real time processing - How to build
The Data Platform
● A unified platform for collecting, accessing and processing ALL of NI data
○ Collection - collect and persist
○ Standardized - consistent business data
○ Access - Standardized, Optimized, Ad-hoc, Applicative
● All in a stable, flexible, monitored, fast and cost-effective data platform
● Making all of the company’s business-related data quickly available and easy to consume, for creating insights and driving the business forward.
Data Platform Evolution
“You have to be careful if you don’t know where you are going, because you might not get there!”
– Yogi Berra
“Technology always develops from the primitive, via the complicated, to the simple.”
– Antoine de Saint-Exupéry
Data Platform Evolution
● A monolith with a DB
Issues:
● “All is good in the land of monolith”
Data Platform Evolution - The monolith grows
● A Bigger monolith with a DB
Issues:
● Deployments start to slow down
Data Platform Evolution
● A bigger monolith with a DB
● With some data tool
Issues:
● Dependency between monolith and data service
● Monolith
Data Platform Evolution
● A bigger monolith with a DB
● With some data tool
Issues:
● Dependency between monolith and data service
● Changes in the monolith break the data tools
● Data tools impact performance
Data Platform Evolution
● A distributed Monolith with a DB
● With more data tools
Issues:
● Changes in DB schema break the data tools
● Data tools impact performance
● The new services lock each other
● Monolith
Refactoring
Data Platform Evolution
● A distributed Monolith with a DB
● Data tools read from replica
Issues:
● Changes in DB schema break the data tools
● Replica fails
● The new services lock each other
● Monolith
● Data freshness
Refactoring
Data Platform Evolution
● A distributed Monolith with a DB
● ETL
● With more data tools
Issues:
● Changes in DB schema break the data tools
● The new services lock each other
● Monolith
● Data freshness
Data Platform Evolution
● Microservices
● Monolith DB + replica
● With more data tools
● Data warehouse
Issues:
● Changes in DB schema break the ETL
● Getting data from Microservices
● Data warehouse flexibility + performance
● Data freshness
Breaks all
the time
Data Platform Evolution
● Applications events
● Event Bus
● ETL
● Data Warehouse
● More data tools
Issues:
● Data warehouse flexibility + performance
● Events consistency
● Data freshness
Data Platform Evolution
● Applications events
● Event Bus
● ETL
● Data Lake
○ Metastore
○ Processing Engines
○ Data Stores
○ SQL access
● Any data application
Issues:
● Events consistency
● Data freshness
Data Platform Evolution
● Applications events
● Event Bus
● Near Real Time
● ETL
● Data Lake
○ Metastore
○ Processing Engines
○ Data Stores
○ SQL access
● Any data application
Issues:
● Events consistency
“Any problem in Computer Science can be solved with another
level of indirection”
– David Wheeler
“Except the problem of indirection complexity”
– Bob Morgan
Base principle used in the data platform evolution ...
Data Platform - Design Principles
● Event driven separation between producers and consumers of data
● Use the suitable technology for the problem
● Near real time access to all data
● Data Lake
○ All data goes to the data lake
○ The data lake exposes data as the main flow of data
○ SQL/API/File access
○ Data is immutable
○ The data lake is the “source of truth”, not any other DB!
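A minimal sketch of the first and last principles together: producers only append events, consumers only read from their own position, and stored events cannot be changed. The names EventLog and Consumer are hypothetical and the log lives in memory; a real platform would put an event bus and the data lake behind the same idea.

```python
from types import MappingProxyType


class EventLog:
    """Append-only, in-memory stand-in for the event bus plus data lake."""

    def __init__(self):
        self._events = []

    def publish(self, event_type, data):
        # Keep a read-only copy so nobody can alter history after publishing.
        event = MappingProxyType(
            {"event_type": event_type, "data": MappingProxyType(dict(data))}
        )
        self._events.append(event)
        return len(self._events) - 1  # offset of the new event

    def read_from(self, offset):
        # Return a copy of the tail of the log starting at the given offset.
        return self._events[offset:]


class Consumer:
    """Reads at its own pace; knows the log, never the producers."""

    def __init__(self, log):
        self._log = log
        self._offset = 0

    def poll(self):
        events = self._log.read_from(self._offset)
        self._offset += len(events)
        return events


if __name__ == "__main__":
    log = EventLog()
    reporting = Consumer(log)
    log.publish("order_created", {"order_id": 1})
    log.publish("order_paid", {"order_id": 1})
    for event in reporting.poll():
        print(event["event_type"], dict(event["data"]))
```

A consumer added later starts from offset 0 and sees the full history, which is what lets new data applications be attached without touching the producers.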
Data Platform Technologies
Data Platform Facets
● Data Propagation
○ Event bus and event structuring
● Data Persistence
○ Durability, Partitioning and Formatting
● Data Access
○ Allow users/applications access to data at any SLA needed
● Data Standardization
○ Unified business data
● Data Processing
○ ETLs, algorithms and apps processing infra
The Data Lake
Data Lake - Core Parts
● Scalable object store
● Data digest ETLs
● Data
○ format and partition
● A metastore/Dictionary
● Processing Engines
● Data Lake APIs
○ SQL accessible
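A sketch of what "format and partition" can look like on an object store. The Hive-style key=value folder layout, the events/ prefix and the function name partition_key are assumptions for illustration; the slides do not prescribe a layout.

```python
from datetime import datetime, timezone


def partition_key(event_type, event_time, file_name):
    """Build an object-store key partitioned by event type and event date.

    event_time is an ISO-8601 UTC timestamp such as 2017-09-07T07:17:31.503Z.
    """
    ts = datetime.strptime(event_time, "%Y-%m-%dT%H:%M:%S.%fZ").replace(
        tzinfo=timezone.utc
    )
    return (
        f"events/event_type={event_type}"
        f"/year={ts:%Y}/month={ts:%m}/day={ts:%d}/{file_name}"
    )


if __name__ == "__main__":
    print(partition_key("order_created", "2017-09-07T07:17:31.503Z", "part-0000.parquet"))
    # events/event_type=order_created/year=2017/month=09/day=07/part-0000.parquet
```

Partitioning by date lets SQL engines that read the metastore skip whole folders when a query filters on time.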
Data Lake - Technologies - DIY
● HDFS
● Hive MetaStore
● Processing
○ Spark
○ Tez
○ M/R
● Data Access
○ Spark SQL
○ Impala
○ Presto
● Parquet formatting
Distributions: Cloudera, Hortonworks, MapR
Data Lake - Technologies - AWS
● S3
● EMR + Spark
● Athena
● Redshift & Spectrum
● AWS Glue Metastore
● AWS Glue ETL
Data Lake - Technologies - AWS - DIY Hybrid
● S3
● Spark on EMR
○ ETL and processing
● Athena
● Redshift & Spectrum
● AWS Glue Metastore
● Parquet
Cloud Data Lake - DIY vs. AWS vs. ...
             | AWS    | Open - DIY | Hybrid DIY | Hybrid Fully Managed | Proprietary
Features     | 7      | 10         | 10         | 8                    | 8
Scalability  | 9      | 10         | 10         | 9                    | 9
Operation    | Easy   | Hard       | Medium     | Easy                 | Easy
Availability | 10     | 9-10       | 9-10       | 9-10                 | 6
Flexibility  | 7      | 10         | 10         | 7                    | 6
Dev effort   | Medium | Hard       | Medium     | Medium               | Easy
Testability  | 7      | 10         | 8          | 8                    | 4
Cost - Start | Low    | High       | Medium     | Low                  | Low
Cost - Run   | High   | Medium     | Medium     | High                 | High
Vendor Lock  | High   | None       | Low        | Low                  | Damn
Acronym: DVOF-FACTS :)
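One way to turn the comparison above into a decision is a weighted sum over the aspects. The sketch below uses only the four fully numeric rows of the data lake comparison; the option names follow one reading of the column header (an assumption), and the weights and the function name rank_options are hypothetical.

```python
# Columns, in the order the scores appear in the comparison (assumed reading).
OPTIONS = ["AWS", "Open - DIY", "Hybrid DIY", "Hybrid Fully Managed", "Proprietary"]

# Numeric rows copied from the data lake comparison.
SCORES = {
    "Features": [7, 10, 10, 8, 8],
    "Scalability": [9, 10, 10, 9, 9],
    "Flexibility": [7, 10, 10, 7, 6],
    "Testability": [7, 10, 8, 8, 4],
}


def rank_options(weights=None):
    """Return (option, weighted total) pairs, best first.

    Aspects missing from weights count with weight 1.0.
    """
    weights = weights or {}
    totals = []
    for column, option in enumerate(OPTIONS):
        total = sum(
            weights.get(aspect, 1.0) * row[column] for aspect, row in SCORES.items()
        )
        totals.append((option, total))
    return sorted(totals, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Example: a team that cares twice as much about testability.
    for option, total in rank_options({"Testability": 2.0}):
        print(f"{option:22} {total:5.1f}")
```

The non-numeric aspects (operation, dev effort, cost, vendor lock) still need a judgment call; the sum only orders what was scored.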
Near Real Time processing
Data Propagation
● Event Structure and Format
○ JSON, Avro, Protobuf...
● Event bus
○ Event-based flow of information between the systems
○ Integration with external systems using the events
○ Decouple data construction from data consumption
○ Kinesis/Firehose
○ Kafka/Confluent
Event structure
A single event is one JSON object: an event header, a platform header, other optional headers and the specific event data.
{
  // Event Header
  "event_header": {
    "id": "{guid}",
    "event_type": "{map the schema}",
    "action": "publish",
    "schema_version": "{schema evolution}",
    "event_time": "2017-09-07T07:17:31.503Z"
  },
  // Platform Header
  "platform_header": {
    "platform": "{system}",
    "service": "{service name}"
  },
  // Other Optional Headers
  "some_header": {
    "from": "2017-04-01",
    "to": "2017-04-01",
    "someType": "bla"
  },
  // Specific Event Data
  "data": {
    // all other specific fields of the event
    …
  }
}
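A producer-side helper for this envelope might look like the sketch below. The function name build_event is hypothetical, the key is spelled event_type here, and placing platform_header next to event_header at the top level is an assumption about how the pieces on the slide fit together.

```python
import json
import uuid
from datetime import datetime, timezone


def build_event(event_type, schema_version, platform, service, data):
    """Wrap event-specific data in the common event envelope."""
    now = datetime.now(timezone.utc)
    return {
        "event_header": {
            "id": str(uuid.uuid4()),
            "event_type": event_type,
            "action": "publish",
            "schema_version": schema_version,
            # ISO-8601 UTC with millisecond precision, e.g. 2017-09-07T07:17:31.503Z
            "event_time": now.isoformat(timespec="milliseconds").replace("+00:00", "Z"),
        },
        "platform_header": {"platform": platform, "service": service},
        "data": data,
    }


if __name__ == "__main__":
    event = build_event("order_created", "1", "shop", "checkout", {"order_id": 42})
    print(json.dumps(event, indent=2))
```

Building the headers in one shared helper keeps every producer consistent, which is where the "events consistency" issue from the evolution slides is easiest to address.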
Near Real Time - Core Parts
● Event Bus
● Streaming processing engines
● NoSQL DBs
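To make the role of a streaming processing engine concrete, here is a toy tumbling-window count over events that carry an event_header with event_type and event_time. It is plain in-memory Python with hypothetical function names, not the API of Spark Streaming, Flink or Kinesis; in a real setup the result would be written to a NoSQL DB for fast reads.

```python
from collections import Counter
from datetime import datetime, timezone


def window_start(event_time, window_seconds):
    """Epoch second at which the tumbling window containing event_time begins."""
    ts = datetime.strptime(event_time, "%Y-%m-%dT%H:%M:%S.%fZ").replace(
        tzinfo=timezone.utc
    )
    epoch = int(ts.timestamp())
    return epoch - epoch % window_seconds


def count_per_window(events, window_seconds=60):
    """Count events per (window start, event type)."""
    counts = Counter()
    for event in events:
        header = event["event_header"]
        key = (window_start(header["event_time"], window_seconds), header["event_type"])
        counts[key] += 1
    return counts


if __name__ == "__main__":
    sample = [
        {"event_header": {"event_type": "click", "event_time": "2017-09-07T07:17:31.503Z"}},
        {"event_header": {"event_type": "click", "event_time": "2017-09-07T07:17:59.000Z"}},
        {"event_header": {"event_type": "click", "event_time": "2017-09-07T07:18:01.000Z"}},
    ]
    for (start, event_type), n in sorted(count_per_window(sample).items()):
        print(start, event_type, n)
```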
Near Real Time - DIY
● Amazon
○ Kinesis Firehose - write to S3/Redshift warehouse
○ Kinesis Analytics
○ DynamoDB
● Streaming processing engines
○ Spark Streaming
○ Flink
○ Confluent Kafka
○ Kinesis Streams
○ ...
● Proprietary NoSQL DBs
○ MemSQL
○ Snowflake
○ Couchbase
○ Aerospike
○ Cassandra
○ Elastic
Near Real Time - AWS
● Data propagation
○ Kinesis Firehose - write to S3/Redshift warehouse
○ DynamoDB
○ Redshift
● Streaming processing engines
○ Kinesis Analytics
○ ...
● NoSQL DBs
○ Managed Elastic
Near Real Time - AWS - DIY Hybrid
● Data propagation
○ Confluent Kafka
● Streaming processing engines
○ EMR + Spark Streaming
○ EMR + Flink
● NoSQL DBs
○ Managed Elastic
○ DynamoDB
○ MemSQL
○ Snowflake
○ Couchbase
○ Aerospike
○ Cassandra
Near Real Time - DIY vs. AWS vs. ...
             | AWS    | Open - DIY | Hybrid DIY | Hybrid Fully Managed | Proprietary
Features     | 5      | 10         | 10         | 9                    | ?
Scalability  | 8      | 10         | 10         | 9                    | 8
Operation    | Easy   | Hard       | Medium     | Easy                 | Easy
Availability | 10     | 9-10       | 9-10       | 9-10                 | 6
Flexibility  | 6      | 10         | 10         | 9                    | 6
Dev effort   | Medium | Hard       | Medium     | Medium               | Easy
Testability  | 7      | 10         | 10         | 9                    | 4
Cost - Start | Low    | High       | Medium     | Low                  | Low
Cost - Run   | High   | Medium     | Medium     | High                 | High
Vendor Lock  | High   | None       | Low        | Low                  | Damn
Bottom Line
● A data platform in the cloud is the same as a private data platform, but with the option of using managed solutions!
● Structure your data at the producers - remember: garbage in, garbage out!
● Pick the right technology for your problem!
● Choose your solution using these aspects:
○ Dev effort
○ Vendor lock
○ Operation effort
○ Flexibility
○ Features
○ Availability
○ Cost
○ Testability
○ Scalability
Acronym: DVOF-FACTS :)
Thank You!
