Data Lakes on Public Cloud: Breaking Data Management Monoliths

Data Lakes in the Public Cloud:
Breaking Data Management Monoliths
Sharon Dashet, Sr. Data Analytics Solution Lead, GCP
https://il.linkedin.com/in/sharon-dashet

Data Management timeline
● 80s: Relational OLTP
● ~1995: Traditional EDW players
● ~2005: Big data vendors
● ~2010: Cloud platform vendors
● ~2012: Specialized Cloud vendors
Roles and disciplines along the timeline: Database Developer, Backend Developer, Application DBA, Production DBA, Data Scientist, Data Analysts, BI/OLAP Expert, SQL Expert, Governance, MDM, Big Data Developer, Big Data Architect, ML Engineer, Hadoop Admin, Hadoop Expert, AI Scientist, CDO, Cloud Data Engineer, Cloud Data Architect

The Hadoop ecosystem is very popular for Big Data workloads
● HDFS (Hadoop Distributed File System)
● YARN (Cluster resource management)
● MapReduce (Cluster data processing)
● HCatalog (Metadata)
● HBase (NoSQL datastore)
● Flume (Log aggregation and transport)
● Sqoop (Import and export of relational data)
● Ambari (Management and monitoring)
● Oozie (Workflow automation)
● Zookeeper (Coordination)
● Pig (Scripting)
● Flink (Streams)
● Mahout & Spark ML (Machine learning)
● Presto (Distributed SQL query)
(Cluster data processing)
● Hive (SQL DW)

Typical on-premises deployment: Multi-User, Shared Hadoop Cluster
● Compute: YARN (Hive, Spark, MR, R)
● Data (HDFS)
● Temp Data (HDFS)
● Metadata (Hive metastore, RDBMS)
● AuthZ Policies, Audit, Governance (Ranger, Atlas)
● AuthN (Kerberos, LDAP)
● Kafka, Storm, Flume, Cassandra, HBase, ELK etc.

The Apache data-processing ecosystem

On-prem Data Lakes are struggling to deliver value
● TCO Challenges: Resource utilization and overall TCO of on-prem data lakes become unmanageable
● Governance Challenges: Data governance and security issues open up compliance concerns
● Scaling Challenges: Resource-intensive data and analytics processing can lead to missed SLAs
● Agility Challenges: Analytics experimentation is slow due to resource provisioning time

Key market players are
struggling to convert
customers.

The need is still there
AI is now capable of extracting
value from unstructured data
Cloud is faster, simpler to
operate, and less expensive
“80 percent of
worldwide data will be
unstructured by 2025”
Data Lakes are shifting to the cloud
“By connecting data points, we can offer advice like hygiene laws for certain foods, or information on provenance. We can even integrate their local weather forecast so a store doesn't run out of ice cream on a sunny day.”
Sven Lipowski, Unit Owner Customer Solutions at METRONOM
“The ability to spin up purpose
driven Hadoop clusters against our
shared datasets and scale them
up/down with demand is a game
changer for us…”
Brett Uyeshiro, VP Platform Services, Pandora

02
Patterns for Data Lakes in
Public Cloud

Beyond HDFS: Storage and Compute Separation
Keep your storage on GCS instead of HDFS
Benefits:
● Separation of compute and storage
● Full HDFS-compliant GCS connector
● Facilitates job-scoped, cost-effective workloads (plus ephemeral clusters)
● No need to provision 3x storage for replication
● No unused bytes on disks
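The last two benefits come down to simple arithmetic, which a rough sketch can make concrete. The function names and the 25% free-space headroom below are illustrative assumptions, not figures from the talk. The replication factor of 3 is the HDFS default.

```python
def hdfs_raw_tb(logical_tb: float, replication: int = 3,
                free_headroom: float = 0.25) -> float:
    """Raw disk (TB) to provision so that `logical_tb` of data fits on HDFS.

    replication:   copies kept of every block (HDFS default is 3).
    free_headroom: fraction of disk kept empty for growth and temp data
                   (hypothetical value, tune for your cluster).
    """
    if not 0 <= free_headroom < 1:
        raise ValueError("free_headroom must be in [0, 1)")
    return logical_tb * replication / (1 - free_headroom)


def object_store_billed_tb(logical_tb: float) -> float:
    """With an object store such as GCS you pay for the bytes stored;
    durability is handled by the service, not by user-provisioned replicas."""
    return logical_tb


# 100 TB of data: disks to buy for HDFS vs. bytes billed in object storage.
print(hdfs_raw_tb(100))             # 400.0
print(object_store_billed_tb(100))  # 100
```

Under these assumptions a 100 TB dataset needs roughly four times its size in provisioned disk on HDFS, while the object store bills only for the bytes actually stored.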

Job-Scoped Clusters: Beyond complicated YARN queues
● Step away from complicated YARN queues and multi-tenancy
● Control cost and performance per workload:
  ○ Ephemeral clusters
  ○ Mix regular and preemptible VMs in the worker pool
  ○ Different VM types
Diagram: Cloud Dataproc clusters per workload (Hive Analytics, Business Reporting, MapReduce ETL, Machine Learning) over Cloud Storage and a Hive Metastore
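A job-scoped cluster can be described as a small, per-workload specification instead of a queue in a shared cluster. The sketch below is plain Python that only builds a dictionary. The helper names are hypothetical, and the field names (`worker_config`, `secondary_worker_config`, `lifecycle_config`) are modeled on the Dataproc cluster configuration as an assumption; check them against the current API reference before use.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkerPool:
    """Per-workload sizing: VM type plus regular/preemptible worker counts."""
    machine_type: str
    regular: int
    preemptible: int = 0


def job_scoped_cluster(job_name: str, pool: WorkerPool,
                       max_idle_minutes: int = 10) -> dict:
    """Build a cluster spec for one job (hypothetical helper).

    The cluster is meant to be created for the job and deleted afterwards;
    the idle TTL is a safety net in case the delete step never runs.
    """
    if pool.regular < 1:
        raise ValueError("need at least one regular worker")
    return {
        "cluster_name": f"{job_name}-ephemeral",
        "config": {
            "worker_config": {
                "num_instances": pool.regular,
                "machine_type_uri": pool.machine_type,
            },
            "secondary_worker_config": {
                "num_instances": pool.preemptible,
                "preemptibility": "PREEMPTIBLE",
            },
            "lifecycle_config": {
                "idle_delete_ttl": {"seconds": max_idle_minutes * 60},
            },
        },
    }


# Two workloads, two differently shaped clusters, no shared YARN queue.
etl = job_scoped_cluster("nightly-etl", WorkerPool("n1-standard-4", 2, 6))
ml = job_scoped_cluster("model-training", WorkerPool("n1-highmem-8", 4))
```

Each workload gets its own VM type and its own mix of regular and preemptible workers, so cost and performance are tuned per job rather than negotiated through queue capacity.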

Beyond YARN and into a Modern Service Mesh

Converged Smart Analytics
1. Data sources
2. Data Lake storage
3. Data Pipelines
4. Data Warehouse/Lake
5. ML and analytics workloads
Diagram labels: AI Platform, AI Platform Notebooks
