Data Platform in the Cloud
Amihay Zer-Kavod, Code Naturally, Apr 2018
Who Am I
Amihay Zer-Kavod
Software Architect
In software since 1989
Agenda
● The evolution of a data platform
● Data platform design principles
● Data platform technologies
● Data platform in the cloud
○ Data Lake - How to build
○ Data Lake - Technology selection
○ Data Propagation and Near real time processing - How to build
The Data Platform
● A unified platform for collecting, accessing and processing ALL of NI data
○ Collection - collect and persist
○ Standardized - consistent business data
○ Access - Standardized, Optimized, Ad-hoc, Applicative
● All in a stable, flexible, monitored, fast and cost-effective data platform
● Making all of the company’s business-related data available quickly for easy consumption, for creating insights and driving the business forward.
“You have to be careful if you don’t know where you are going
because you might not get there!”
Yogi Berra
Data Platform Evolution
Technology always develops from the primitive, via the
complicated, to the simple.
Antoine de Saint-Exupéry
Data Platform Evolution
● A monolith with a DB
Issues:
● “All is good in the land of monolith”
Data Platform Evolution - The monolith grows
● A Bigger monolith with a DB
Issues:
● Deployments start to slow down
Data Platform Evolution
● A bigger monolith with a DB
● With some data tool
Issues:
● Dependency between monolith and data service
● Monolith
Data Platform Evolution
● A bigger monolith with a DB
● With some data tool
Issues:
● Dependency between monolith and data service
● Changes in the monolith break the data tools
● Data tools impact performance
Data Platform Evolution
● A distributed Monolith with a DB
● With more data tools
Issues:
● Changes in DB schema break the data tools
● Data tools impact performance
● The new services lock each other
● Monolith refactoring
Data Platform Evolution
● A distributed Monolith with a DB
● Data tools read from replica
Issues:
● Changes in DB schema break the data tools
● Replica fails
● The new services lock each other
● Monolith refactoring
● Data freshness
Data Platform Evolution
● A distributed Monolith with a DB
● ETL
● With more data tools
Issues:
● Changes in DB schema break the data tools
● The new services lock each other
● Monolith
● Data freshness
Data Platform Evolution
● Microservices
● Monolith DB + replica
● With more data tools
● Data warehouse
Issues:
● Changes in DB schema break the ETL
● Getting data from microservices
● Data warehouse flexibility + performance
● Data freshness
Breaks all the time
Data Platform Evolution
● Applications events
● Event Bus
● ETL
● Data Warehouse
● More data tools
Issues:
● Data warehouse flexibility + performance
● Events consistency
● Data freshness
Data Platform Evolution
● Applications events
● Event Bus
● ETL
● Data Lake
○ Metastore
○ Processing Engines
○ Data Stores
○ SQL access
● Any data application
Issues:
● Events consistency
● Data freshness
Data Platform Evolution
● Applications events
● Event Bus
● Near Real Time
● ETL
● Data Lake
○ Metastore
○ Processing Engines
○ Data Stores
○ SQL access
● Any data application
Issues:
● Events consistency
“Any problem in Computer Science can be solved with another
level of indirection”
– David Wheeler
“Except the problem of indirection complexity”
– Bob Morgan
Base principle used in the data platform evolution ...
Data Platform - Design Principles
● Event driven separation between producers and consumers of data
● Use the suitable technology for the problem
● Near real time access to all data
● Data Lake
○ All data goes to the data lake
○ The data lake is the main flow through which data is exposed
○ SQL/API/File access
○ Data is immutable
○ The data lake is the “source of truth”; no other DB is!
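The "data is immutable" principle can be illustrated with a small sketch. It is not taken from the talk: `Record` and `AppendOnlyLog` are hypothetical names, and an in-memory list stands in for the data lake. The idea shown is that a correction is written as a new version rather than as an update in place, so consumers can always read both the latest value and the full history.

```python
from dataclasses import dataclass, FrozenInstanceError
from typing import List, Optional


@dataclass(frozen=True)
class Record:
    """One immutable fact; frozen=True blocks attribute assignment after creation."""
    key: str
    value: str
    version: int


class AppendOnlyLog:
    """Hypothetical stand-in for a data lake table: records are only ever appended."""

    def __init__(self) -> None:
        self._records: List[Record] = []

    def append(self, key: str, value: str) -> Record:
        # A change to an existing key becomes a new version, never an overwrite.
        version = sum(1 for r in self._records if r.key == key) + 1
        record = Record(key=key, value=value, version=version)
        self._records.append(record)
        return record

    def latest(self, key: str) -> Optional[Record]:
        # Scan from the newest record backwards.
        for record in reversed(self._records):
            if record.key == key:
                return record
        return None

    def history(self, key: str) -> List[Record]:
        return [r for r in self._records if r.key == key]
```

With this shape, "what did we know at the time" stays answerable, because earlier versions are never overwritten.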
Data Platform Technologies
Data Platform Facets
● Data Propagation
○ Events Bus and Event Structuring
● Data Persistence
○ Durability, Partitioning and Formatting
● Data Access
○ Allow users/applications access to data in any SLA needed
● Data Standardization
○ Unified business data
● Data Processing
○ ETLs, algorithms and apps processing infra
The Data Lake
Data Lake - Core Parts
● Scalable object store
● Data digest ETLs
● Data
○ format and partition
● A metastore/Dictionary
● Processing Engines
● Data Lake APIs
○ SQL accessible
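As a rough illustration of the "format and partition" item, the sketch below builds an object-store key using the Hive-style `key=value` directory convention, which Hive-compatible metastores and SQL engines can map to partition columns. The `events/` prefix, the choice of `event_type`, `dt` and `hour` as partition columns, and the function name are assumptions made for this example, not details from the talk.

```python
from datetime import datetime, timezone


def partition_key(event_type: str, event_time: str, file_name: str) -> str:
    """Build a Hive-style partitioned object key from an ISO-8601 UTC timestamp.

    The layout (events/event_type=.../dt=.../hour=...) is an assumed example.
    """
    ts = datetime.strptime(event_time, "%Y-%m-%dT%H:%M:%S.%fZ").replace(
        tzinfo=timezone.utc
    )
    return (
        f"events/event_type={event_type}"
        f"/dt={ts:%Y-%m-%d}/hour={ts:%H}/{file_name}"
    )
```

For example, `partition_key("order_created", "2017-09-07T07:17:31.503Z", "part-0000.parquet")` returns `events/event_type=order_created/dt=2017-09-07/hour=07/part-0000.parquet`. Partitioning by date lets a query engine skip objects outside the requested time range.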
Data Lake - Technologies - DIY
● HDFS
● Hive MetaStore
● Processing
○ Spark
○ Tez
○ M/R
● Data Access
○ Spark SQL
○ Impala
○ Presto
● Parquet formatting
Cloudera, Hortonworks, MapR
Data Lake - Technologies - AWS
● S3
● EMR + Spark
● Athena
● Redshift & Spectrum
● AWS Glue Metastore
● AWS Glue ETL
Data Lake - Technologies - AWS - DIY Hybrid
● S3
● Spark on EMR
○ ETL and Processing
● Athena
● Redshift & Spectrum
● AWS Glue Metastore
● Parquet
Cloud Data Lake - DIY vs. AWS vs. ...
Aspect        | AWS    | Open - DIY | Hybrid DIY | Hybrid Fully Managed | Proprietary
Features      | 7      | 10         | 10         | 8                    | 8
Scalability   | 9      | 10         | 10         | 9                    | 9
Operation     | Easy   | Hard       | Medium     | Easy                 | Easy
Availability  | 10     | 9-10       | 9-10       | 9-10                 | 6
Flexibility   | 7      | 10         | 10         | 7                    | 6
Dev effort    | Medium | Hard       | Medium     | Medium               | Easy
Testability   | 7      | 10         | 8          | 8                    | 4
Cost - start  | Low    | High       | Medium     | Low                  | Low
Cost - run    | High   | Medium     | Medium     | High                 | High
Vendor lock   | High   | None       | Low        | Low                  | Damn
Acronym: DVOF-FACTS :)
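One possible way to turn the comparison into a number is a weighted average over the aspects, sketched below. The scores are the numeric data lake rows from the comparison above, with the five columns read as AWS, Open - DIY, Hybrid DIY, Hybrid Fully Managed and Proprietary (that reading of the flattened header is an assumption). The equal weights and the function names are hypothetical. The qualitative rows (Operation, Dev effort, Cost, Vendor lock) and the Availability ranges are left out, so the result is an illustration of the method and not a recommendation.

```python
from typing import Dict, List, Tuple

# Column order as read from the slide's table header (an assumption).
OPTIONS = ["AWS", "Open - DIY", "Hybrid DIY", "Hybrid Fully Managed", "Proprietary"]

# Numeric rows copied from the data lake comparison table.
SCORES: Dict[str, List[int]] = {
    "Features":    [7, 10, 10, 8, 8],
    "Scalability": [9, 10, 10, 9, 9],
    "Flexibility": [7, 10, 10, 7, 6],
    "Testability": [7, 10, 8, 8, 4],
}

# Hypothetical weights; tune them to what matters for your own platform.
WEIGHTS: Dict[str, float] = {
    "Features": 1.0,
    "Scalability": 1.0,
    "Flexibility": 1.0,
    "Testability": 1.0,
}


def weighted_scores(
    scores: Dict[str, List[int]], weights: Dict[str, float]
) -> Dict[str, float]:
    """Return the weighted average score per option."""
    total_weight = sum(weights[aspect] for aspect in scores)
    return {
        option: sum(weights[aspect] * row[i] for aspect, row in scores.items())
        / total_weight
        for i, option in enumerate(OPTIONS)
    }


def rank(scores: Dict[str, float]) -> List[Tuple[str, float]]:
    """Sort options from highest to lowest weighted score."""
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

Raising the weight of one aspect, or adding numeric stand-ins for the qualitative rows, changes the ranking, which is the point of making the weights explicit.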
Near Real Time processing
Data Propagation
● Event Structure and Format
○ JSON, Avro, Protobuf...
● Event bus
○ Event-based flow of information between the systems
○ Integration with external systems using the events
○ Decouple data construction from data consumption
○ Kinesis/Firehose
○ Kafka/Confluent
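The decoupling described above can be sketched with a toy in-memory bus. This is only an illustration: `InMemoryEventBus` is a hypothetical class, and a real deployment would put a durable bus such as Kafka or Kinesis in its place. What it shows is that the producer publishes to a topic without knowing which consumers exist, and every subscriber receives the same event.

```python
from collections import defaultdict
from typing import Any, Callable, DefaultDict, Dict, List

Event = Dict[str, Any]
Handler = Callable[[Event], None]


class InMemoryEventBus:
    """Toy publish/subscribe bus; no durability, ordering or retry guarantees."""

    def __init__(self) -> None:
        self._subscribers: DefaultDict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Handler) -> None:
        # Consumers register themselves; producers never see this list.
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, event: Event) -> int:
        # Deliver the event to every subscriber and report how many received it.
        handlers = self._subscribers[topic]
        for handler in handlers:
            handler(event)
        return len(handlers)
```

Adding a new consumer, for example a near real time processor next to the data lake writer, is then one more `subscribe` call and needs no change on the producer side.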
Event structure
A single event:
{
  // Event Header
  "event_header": {
    "id": "{guid}",
    "event_type": "{map the schema}",
    "action": "publish",
    "schema_version": "{schema evolution}",
    "event_time": "2017-09-07T07:17:31.503Z"
  },
  // Platform Header
  "platform_header": {
    "platform": "{system}",
    "service": "{service name}"
  },
  // Other Optional Headers
  "some_header": {
    "from": "2017-04-01",
    "to": "2017-04-01",
    "someType": "bla"
  },
  // Specific Event Data
  "data": {
    // all other specific fields of the event
    …
  }
}
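A minimal sketch of how a producer might build and check this envelope follows. The helper names `build_event` and `missing_fields` are hypothetical, the set of required fields is an assumption based on the header shown above, and the event type key is assumed to be spelled `event_type` with an underscore, like the other keys.

```python
import json
import uuid
from datetime import datetime, timezone
from typing import Any, Dict, List

# Assumed to be mandatory, based on the event header shown above.
REQUIRED_HEADER_FIELDS = ("id", "event_type", "action", "schema_version", "event_time")


def build_event(
    event_type: str,
    schema_version: str,
    platform: str,
    service: str,
    data: Dict[str, Any],
    action: str = "publish",
) -> Dict[str, Any]:
    """Wrap event-specific data in the standard headers."""
    now = datetime.now(timezone.utc)
    return {
        "event_header": {
            "id": str(uuid.uuid4()),
            "event_type": event_type,
            "action": action,
            "schema_version": schema_version,
            # Millisecond precision with a trailing Z, as in the example event.
            "event_time": now.isoformat(timespec="milliseconds").replace("+00:00", "Z"),
        },
        "platform_header": {"platform": platform, "service": service},
        "data": data,
    }


def missing_fields(event: Dict[str, Any]) -> List[str]:
    """List required parts that are absent; an empty list means the envelope is complete."""
    header = event.get("event_header", {})
    missing = [f"event_header.{name}" for name in REQUIRED_HEADER_FIELDS if name not in header]
    if "data" not in event:
        missing.append("data")
    return missing


def to_json(event: Dict[str, Any]) -> str:
    """Serialize the event for the bus."""
    return json.dumps(event)
```

Checking the envelope at the producer is one way to act on "garbage in, garbage out": a malformed event is rejected before it reaches the bus.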
Near Real Time - Core Parts
● Event Bus
● Streaming processing engines
● NoSQL DBs
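How the three parts fit together can be sketched in a few lines. Here a plain list stands in for the event bus, a function plays the streaming engine by counting events per tumbling window, and the returned dict stands in for a NoSQL key-value store. The function names, the 60-second window and the `event_header` field names are assumptions made for this illustration, not details from the talk.

```python
from collections import Counter
from datetime import datetime, timezone
from typing import Any, Dict, Iterable, Tuple


def window_start(event_time: str, window_seconds: int = 60) -> int:
    """Return the epoch second at which the event's tumbling window begins."""
    ts = datetime.strptime(event_time, "%Y-%m-%dT%H:%M:%S.%fZ").replace(
        tzinfo=timezone.utc
    )
    epoch = int(ts.timestamp())
    return epoch - (epoch % window_seconds)


def count_per_window(
    events: Iterable[Dict[str, Any]], window_seconds: int = 60
) -> Dict[Tuple[str, int], int]:
    """Count events per (event_type, window start); the dict mimics a key-value store."""
    counts: Counter = Counter()
    for event in events:
        header = event["event_header"]
        key = (header["event_type"], window_start(header["event_time"], window_seconds))
        counts[key] += 1
    return dict(counts)
```

A real streaming engine adds what this sketch leaves out, such as late-arriving events, state checkpointing and parallelism.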
Near Real Time - DIY
● Amazon
○ Kinesis Firehose - write to S3/Redshift warehouse
○ Kinesis Analytics
○ DynamoDB
● Streaming processing engines
○ Spark Streaming
○ Flink
○ Confluent Kafka
○ Kinesis Streams
○ ...
● Proprietary NoSQL DBs
○ MemSQL
○ Snowflake
○ Couchbase
○ Aerospike
○ Cassandra
○ Elastic
Near Real Time - AWS
● Data propagation
○ Kinesis Firehose - write to S3/Redshift warehouse
○ DynamoDB
○ Redshift
● Streaming processing engines
○ Kinesis Analytics
○ ...
● NoSQL DBs
○ Managed Elastic
○ DynamoDB
Near Real Time - AWS - DIY Hybrid
● Data propagation
○ Confluent Kafka
● Streaming processing engines
○ EMR + Spark Streaming
○ EMR + Flink
● NoSQL DBs
○ Managed Elastic
○ DynamoDB
○ MemSQL
○ Snowflake
○ Couchbase
○ Aerospike
○ Cassandra
Near Real Time - DIY vs. AWS vs. ...
Aspect        | AWS    | Open - DIY | Hybrid DIY | Hybrid Fully Managed | Proprietary
Features      | 5      | 10         | 10         | 9                    | ?
Scalability   | 8      | 10         | 10         | 9                    | 8
Operation     | Easy   | Hard       | Medium     | Easy                 | Easy
Availability  | 10     | 9-10       | 9-10       | 9-10                 | 6
Flexibility   | 6      | 10         | 10         | 9                    | 6
Dev effort    | Medium | Hard       | Medium     | Medium               | Easy
Testability   | 7      | 10         | 10         | 9                    | 4
Cost - start  | Low    | High       | Medium     | Low                  | Low
Cost - run    | High   | Medium     | Medium     | High                 | High
Vendor lock   | High   | None       | Low        | Low                  | Damn
Bottom Line
● A data platform in the cloud is the same as a private data platform, but with the option of using managed solutions!
● Structure your data from your producers - remember: garbage in, garbage out!
● Pick the right technology for your problem!
● Choose your solution using these aspects:
○ Dev effort
○ Vendor lock-in
○ Operation effort
○ Flexibility
○ Features
○ Availability
○ Cost
○ Testability
○ Scalability
Acronym: DVOF-FACTS :)
Thank You!

A Comprehensive Look at Generative AI in Retail App Testing.pdf
 
De mooiste recreatieve routes ontdekken met RouteYou en FME
De mooiste recreatieve routes ontdekken met RouteYou en FMEDe mooiste recreatieve routes ontdekken met RouteYou en FME
De mooiste recreatieve routes ontdekken met RouteYou en FME
 
Understanding Globus Data Transfers with NetSage
Understanding Globus Data Transfers with NetSageUnderstanding Globus Data Transfers with NetSage
Understanding Globus Data Transfers with NetSage
 
Explore Modern SharePoint Templates for 2024
Explore Modern SharePoint Templates for 2024Explore Modern SharePoint Templates for 2024
Explore Modern SharePoint Templates for 2024
 
Dominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdf
Dominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdfDominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdf
Dominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdf
 
How Does XfilesPro Ensure Security While Sharing Documents in Salesforce?
How Does XfilesPro Ensure Security While Sharing Documents in Salesforce?How Does XfilesPro Ensure Security While Sharing Documents in Salesforce?
How Does XfilesPro Ensure Security While Sharing Documents in Salesforce?
 
How Recreation Management Software Can Streamline Your Operations.pptx
How Recreation Management Software Can Streamline Your Operations.pptxHow Recreation Management Software Can Streamline Your Operations.pptx
How Recreation Management Software Can Streamline Your Operations.pptx
 
TROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERROR
TROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERRORTROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERROR
TROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERROR
 
Globus Compute wth IRI Workflows - GlobusWorld 2024
Globus Compute wth IRI Workflows - GlobusWorld 2024Globus Compute wth IRI Workflows - GlobusWorld 2024
Globus Compute wth IRI Workflows - GlobusWorld 2024
 

Data Platform in the Cloud

  • 1. Data Platform in the Cloud Amihay Zer-Kavod, Code Naturally, Apr 2018 Date: Apr-2018
  • 2. Amihay Zer-Kavod Software Architect Been in software Since 1989 Who Am I
  • 3. Agenda ● The evolution of a data platform ● Data platform design principles ● Data platform technologies ● Data platform in the cloud ○ Data Lake - How to build ○ Data Lake - Technology selection ○ Data Propagation and Near real time processing - How to build
  • 4. ● A unified platform for collecting, accessing and processing ALL of NI data ○ Collection - collect and persist ○ Standardized - consistent business data ○ Access - Standardized, Optimized, Ad-hoc, Applicative ● All in a stable, flexible, monitored, fast and cost effective data platform ● Making all of the company’s business related data available quickly for easy consumption for creating insights and driving the business forward. The Data Platform
  • 5. Data Platform Evolution “You have to be careful if you don’t know where you are going because you might not get there!” – Yogi Berra “Technology always develops from the primitive, via the complicated, to the simple.” – Antoine de Saint-Exupéry
  • 6. Data Platform Evolution ● A monolith with a DB Issues: ● “All is good in the land of monolith”
  • 7. Data Platform Evolution - The monolith grows ● A Bigger monolith with a DB Issues: ● Deployments start to slow down
  • 8. Data Platform Evolution ● A bigger monolith with a DB ● With some data tool Issues: ● Dependency between Monolith and data service ● Monolith
  • 9. Data Platform Evolution ● A bigger monolith with a DB ● With some data tool Issues: ● Dependency between monolith and data service ● Changes in the monolith break the data tools ● Data tools impact performance
  • 10. Data Platform Evolution ● A distributed monolith with a DB ● With more data tools Issues: ● Changes in DB schema break the data tools ● Data tools impact performance ● The new services lock each other ● Monolith Refactoring
  • 11. Data Platform Evolution ● A distributed monolith with a DB ● Data tools read from replica Issues: ● Changes in DB schema break the data tools ● Replica fails ● The new services lock each other ● Monolith ● Data freshness Refactoring
  • 12. Data Platform Evolution ● A distributed monolith with a DB ● ETL ● With more data tools Issues: ● Changes in DB schema break the data tools ● The new services lock each other ● Monolith ● Data freshness
  • 13. Data Platform Evolution ● Microservices ● Monolith DB + replica ● With more data tools ● Data warehouse Issues: ● Changes in DB schema break the ETL ● Getting data from microservices ● Data warehouse flexibility + performance ● Data freshness Breaks all the time
  • 14. Data Platform Evolution ● Applications events ● Event Bus ● ETL ● Data Warehouse ● More data tools Issues: ● Data warehouse flexibility + performance ● Events consistency ● Data freshness
  • 15. Data Platform Evolution ● Applications events ● Event Bus ● ETL ● Data Lake ○ Metastore ○ Processing Engines ○ Data Stores ○ SQL access ● Any data application Issues: ● Events consistency ● Data freshness
  • 16. Data Platform Evolution ● Applications events ● Event Bus ● Near Real Time ● ETL ● Data Lake ○ Metastore ○ Processing Engines ○ Data Stores ○ SQL access ● Any data application Issues: ● Events consistency Real Time
  • 17. “Any problem in Computer Science can be solved with another level of indirection” – David Wheeler “Except the problem of indirection complexity” – Bob Morgan Base principle used in the data platform evolution ...
  • 18. Data Platform - Design Principles ● Event-driven separation between producers and consumers of data ● Use the suitable technology for the problem ● Near real time access to all data ● Data Lake ○ All data goes to the data lake ○ The data lake exposes data through SQL/API/file access ○ Data is immutable ○ The data lake is the “source of truth”; no other DB! Main flow of data
  • 20. Data Platform Facets ● Data Propagation ○ Event bus and event structuring ● Data Persistence ○ Durability, partitioning and formatting ● Data Access ○ Give users/applications access to data at whatever SLA they need ● Data Standardization ○ Unified business data ● Data Processing ○ ETLs, algorithms and app processing infrastructure Real Time
  • 22. Data Lake - Core Parts ● Scalable object store ● Data digest ETLs ● Data ○ format and partition ● A metastore/Dictionary ● Processing Engines ● Data Lake APIs ○ SQL accessible
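The "format and partition" item above can be illustrated with a small sketch. It assumes Hive-style `key=value` directory partitioning on an object store, which is a common convention but is not stated on the slide; the bucket name, the partition keys (`event_type`, `dt`, `hour`) and the function name are hypothetical.

```python
from datetime import datetime


def partition_prefix(base: str, event_type: str, event_time: str) -> str:
    """Build a Hive-style partition prefix for one event.

    Assumption: event_time is an ISO-8601 UTC timestamp with fractional
    seconds and a trailing "Z", e.g. "2017-09-07T07:17:31.503Z".
    """
    ts = datetime.strptime(event_time, "%Y-%m-%dT%H:%M:%S.%fZ")
    # Partition by event type, then by day and hour, so that queries which
    # filter on these columns only scan the matching directories.
    return (
        f"{base.rstrip('/')}/event_type={event_type}"
        f"/dt={ts:%Y-%m-%d}/hour={ts:%H}/"
    )


# "s3://my-lake/events" is a made-up location used only for illustration.
prefix = partition_prefix(
    "s3://my-lake/events", "order_created", "2017-09-07T07:17:31.503Z"
)
print(prefix)
```

Files written under such a prefix (for example in Parquet) can then be registered in a metastore so that SQL engines prune partitions instead of scanning the whole store.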
  • 23. Data Lake - Technologies - DIY ● HDFS ● Hive Metastore ● Processing ○ Spark ○ Tez ○ M/R ● Data Access ○ Spark SQL ○ Impala ○ Presto ● Parquet formatting Cloudera, Hortonworks, MapR
  • 24. Data Lake - Technologies - AWS ● S3 ● EMR + Spark ● Athena ● Redshift & Spectrum ● AWS Glue Metastore ● AWS Glue ETL
  • 25. Data Lake - Technologies - AWS - DIY Hybrid ● S3 ● Spark on EMR ○ ETL and processing ● Athena ● Redshift & Spectrum ● AWS Glue Metastore ● Parquet
  • 26. Cloud Data Lake - DIY vs. AWS vs. ...
      Criterion    | AWS                    | Open - DIY                | Hybrid DIY                  | Hybrid Fully Managed   | Proprietary
      Features     | 7                      | 10                        | 10                          | 8                      | 8
      Scalability  | 9                      | 10                        | 10                          | 9                      | 9
      Operation    | Easy                   | Hard                      | Medium                      | Easy                   | Easy
      Availability | 10                     | 9-10                      | 9-10                        | 9-10                   | 6
      Flexibility  | 7                      | 10                        | 10                          | 7                      | 6
      Dev effort   | Medium                 | Hard                      | Medium                      | Medium                 | Easy
      Testability  | 7                      | 10                        | 8                           | 8                      | 4
      Cost         | Start - Low, Run - High | Start - High, Run - Medium | Start - Medium, Run - Medium | Start - Low, Run - High | Start - Low, Run - High
      Vendor Lock  | High                   | None                      | Low                         | Low                    | Damn
      Acronym: DVOF-FACTS :)
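One way to use a comparison like this is to weight the criteria by how much they matter to you and compare the totals. The sketch below is only an illustration: it uses the numeric rows from the data lake comparison (Features, Scalability, Availability, Flexibility, Testability), assumes that a range such as "9-10" can be represented by its midpoint 9.5, and leaves out the qualitative rows (Operation, Dev effort, Cost, Vendor Lock), which would need a scale of your own. The function name and the weights are hypothetical.

```python
# Numeric rows copied from the data lake comparison.
# Assumption: a range such as "9-10" is represented by its midpoint (9.5).
OPTIONS = ["AWS", "Open - DIY", "Hybrid DIY", "Hybrid Fully Managed", "Proprietary"]
SCORES = {
    "Features":     [7, 10, 10, 8, 8],
    "Scalability":  [9, 10, 10, 9, 9],
    "Availability": [10, 9.5, 9.5, 9.5, 6],
    "Flexibility":  [7, 10, 10, 7, 6],
    "Testability":  [7, 10, 8, 8, 4],
}


def weighted_scores(scores, weights):
    """Return the weighted average score of each option."""
    total_weight = sum(weights[criterion] for criterion in scores)
    result = {}
    for i, option in enumerate(OPTIONS):
        weighted_sum = sum(weights[c] * row[i] for c, row in scores.items())
        result[option] = weighted_sum / total_weight
    return result


# Equal weights are a placeholder; set them to reflect your own priorities.
equal_weights = {criterion: 1.0 for criterion in SCORES}
totals = weighted_scores(SCORES, equal_weights)
ranking = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
for option, score in ranking:
    print(f"{option}: {score:.2f}")
```

With equal weights the numeric criteria alone favor the DIY options, which is exactly why the qualitative rows (operation effort, dev effort, cost, vendor lock) still have to be weighed before choosing.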
  • 27. Near Real Time processing
  • 28. Data Propagation ● Event structure and format ○ JSON, Avro, Protobuf... ● Event bus ○ Event-based flow of information between the systems ○ Integration with external systems using the events ○ Decouple data construction from data consumption ○ Kinesis/Firehose ○ Kafka/Confluent
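The decoupling of data construction from data consumption can be sketched with a toy in-memory bus. This is not Kafka or Kinesis and does not use their APIs; the class name, the event type and the two consumers are hypothetical, and a real bus would add durability, ordering and replay.

```python
from collections import defaultdict


class InMemoryEventBus:
    """Toy stand-in for an event bus: routes events by type to subscribers."""

    def __init__(self):
        self._handlers = defaultdict(list)

    def subscribe(self, event_type, handler):
        # Consumers register interest without knowing who produces the event.
        self._handlers[event_type].append(handler)

    def publish(self, event_type, event):
        # Producers publish without knowing who consumes the event.
        handlers = self._handlers.get(event_type, [])
        for handler in handlers:
            handler(event)
        return len(handlers)


bus = InMemoryEventBus()

# Consumer 1: persists every event (a stand-in for the data lake sink).
lake_sink = []
bus.subscribe("order_created", lake_sink.append)

# Consumer 2: keeps a running count per service (a stand-in for a
# near-real-time view).
realtime_counts = defaultdict(int)


def count_by_service(event):
    realtime_counts[event["service"]] += 1


bus.subscribe("order_created", count_by_service)

delivered = bus.publish("order_created", {"service": "checkout", "order_id": 42})
print(delivered, lake_sink, dict(realtime_counts))
```

Adding a third consumer requires no change to the producer, which is the property the event bus is there to provide.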
  • 29. Event structure
      A single event:
        {
          // Event Header
          "event_header": {
            "id": "{guid}",
            "event_type": "{map the schema}",
            "action": "publish",
            "schema_version": "{schema evolution}",
            "event_time": "2017-09-07T07:17:31.503Z"
          },
          // Platform Header
          "platform_header": {
            "platform": "{system}",
            "service": "{service name}"
          },
          // Other Optional Headers
          "some_header": {
            "from": "2017-04-01",
            "to": "2017-04-01",
            "someType": "bla"
          },
          // Specific Event Data
          "data": {
            // all other specific fields of the event …
          }
        }
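A producer-side helper for building this envelope might look like the sketch below. It is a minimal illustration, not the author's implementation: the function name, its parameters and the `event_type` key spelling (with an underscore) are assumptions, and only the header and field names shown on the slide are taken from the source.

```python
import json
import uuid
from datetime import datetime, timezone


def build_event(event_type, data, platform, service,
                schema_version="1", extra_headers=None):
    """Wrap event-specific data in the common envelope (hypothetical helper)."""
    now = datetime.now(timezone.utc)
    event = {
        "event_header": {
            "id": str(uuid.uuid4()),
            "event_type": event_type,          # maps the event to its schema
            "action": "publish",
            "schema_version": schema_version,  # supports schema evolution
            # Millisecond-precision UTC timestamp with a trailing "Z".
            "event_time": now.isoformat(timespec="milliseconds").replace("+00:00", "Z"),
        },
        "platform_header": {"platform": platform, "service": service},
        "data": data,
    }
    # Optional headers are added next to the fixed ones, never over them.
    for name, header in (extra_headers or {}).items():
        if name in event:
            raise ValueError(f"header name {name!r} is reserved")
        event[name] = header
    return event


event = build_event(
    "order_created",
    {"order_id": 42},
    platform="shop",
    service="checkout",
    extra_headers={"some_header": {"from": "2017-04-01", "to": "2017-04-01"}},
)
print(json.dumps(event, indent=2))
```

Keeping the headers identical across all producers lets generic consumers (routing, persistence, partitioning) work from the envelope alone, without knowing each event's `data` schema.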
  • 30. Near Real Time - Core Parts ● Event Bus ● Streaming processing engines ● NoSQL DBs Real Time
  • 31. Near Real Time - DIY ● Amazon ○ Kinesis Firehose - write to S3/Redshift warehouse ○ Kinesis Analytics ○ DynamoDB ● Streaming processing engines ○ Spark Streaming ○ Flink ○ Confluent Kafka ○ Kinesis Streams ○ ... ● Proprietary NoSQL DBs ○ MemSQL ○ Snowflake ○ Couchbase ○ Aerospike ○ Cassandra ○ Elastic Real Time
  • 32. Near Real Time - AWS ● Data propagation ○ Kinesis Firehose - write to S3/Redshift warehouse ○ DynamoDB ○ Redshift ● Streaming processing engines ○ Kinesis Analytics ○ ... ● NoSQL DBs ○ Managed Elastic DynamoDB Firehose Real Time
  • 33. Near Real Time - AWS - DIY Hybrid ● Data propagation ○ Confluent Kafka ● Streaming processing engines ○ EMR + Spark Streaming ○ EMR + Flink ● NoSQL DBs ○ Managed Elastic ○ DynamoDB ○ MemSQL ○ Snowflake ○ Couchbase ○ Aerospike ○ Cassandra DynamoDB EMR Real Time
  • 34. Near Real Time - DIY vs. AWS vs. ...
      Criterion    | AWS                    | Open - DIY                | Hybrid DIY                  | Hybrid Fully Managed   | Proprietary
      Features     | 5                      | 10                        | 10                          | 9                      | ?
      Scalability  | 8                      | 10                        | 10                          | 9                      | 8
      Operation    | Easy                   | Hard                      | Medium                      | Easy                   | Easy
      Availability | 10                     | 9-10                      | 9-10                        | 9-10                   | 6
      Flexibility  | 6                      | 10                        | 10                          | 9                      | 6
      Dev effort   | Medium                 | Hard                      | Medium                      | Medium                 | Easy
      Testability  | 7                      | 10                        | 10                          | 9                      | 4
      Cost         | Start - Low, Run - High | Start - High, Run - Medium | Start - Medium, Run - Medium | Start - Low, Run - High | Start - Low, Run - High
      Vendor Lock  | High                   | None                      | Low                         | Low                    | Damn
  • 35. Bottom Line ● A data platform in the cloud is the same as a private data platform, but with the option of using managed solutions! ● Structure your data at the producers - remember: garbage in, garbage out! ● Pick the right technology for your problem! ● Choose your solution using these aspects: ○ Dev effort ○ Vendor lock-in ○ Operation effort ○ Flexibility ○ Features ○ Availability ○ Cost ○ Testability ○ Scalability Acronym: DVOF-FACTS :)