Data Lakes in the Public Cloud:
Breaking Data Management Monoliths
Sharon Dashet, Sr. Data Analytics Solution Lead, GCP
https://il.linkedin.com/in/sharon-dashet
01 Intro to Data Lakes
It all started with RDBMS…
Data Management timeline
80s: Relational OLTP
~1995: Traditional EDW players
~2005: Big data vendors
~2010: Cloud platform vendors
~2012: Specialized Cloud vendors
Database Developer
Backend Developer
Application DBA
Production DBA
Data Scientist
Data Analysts
BI/OLAP Expert
SQL Expert
Governance
MDM
Big Data Developer
Big Data Architect
ML Engineer
Hadoop Admin
Hadoop Expert
AI Scientist
CDO
Cloud Data Engineer
Cloud Data Architect
HBase (NoSQL datastore)
Flume (Log aggregation and transport)
Sqoop (Import and export of relational data)
Ambari (Management and monitoring)
MapReduce (Cluster data processing)
YARN (Cluster resource management)
HDFS (Hadoop Distributed File System)
HCatalog (Metadata)
Oozie (Workflow automation)
ZooKeeper (Coordination)
Pig (Scripting)
Flink (Streams)
Mahout & Spark ML (Machine learning)
Presto (Distributed SQL query)
(Cluster data processing)
Hive (SQL DW)
The Hadoop ecosystem is very popular for Big Data workloads
Typical on-premises deployment
Multi-User, Shared Hadoop Cluster
Data (HDFS)
Temp Data (HDFS)
Metadata (Hive metastore, RDBMS)
AuthZ Policies, Audit, Governance (Ranger, Atlas)
Compute: YARN (Hive, Spark, MR, R)
AuthN: Kerberos, LDAP
Kafka, Storm, Flume, Cassandra, HBase, ELK, etc.
The Apache data-processing ecosystem
On-prem Data Lakes are struggling to deliver value
TCO Challenges: Resource utilization and overall TCO of on-prem data lakes become unmanageable
Governance Challenges: Data governance and security issues open up compliance concerns
Scaling Challenges: Resource-intensive data and analytics processing can lead to missed SLAs
Agility Challenges: Analytics experimentation is slow due to resource provisioning time
Key market players are struggling to convert customers.
Data Lakes are shifting to the cloud
The need is still there
AI is now capable of extracting value from unstructured data
Cloud is faster, simpler to operate, and less expensive
“80 percent of worldwide data will be unstructured by 2025”
IDC (source)
“By connecting data points, we can offer advice like hygiene laws for certain foods, or information on provenance. We can even integrate their local weather forecast so a store doesn't run out of ice cream on a sunny day.”
Sven Lipowski, Unit Owner Customer Solutions at METRONOM
“The ability to spin up purpose driven Hadoop clusters against our shared datasets and scale them up/down with demand is a game changer for us…”
Brett Uyeshiro, VP Platform Services, Pandora
02 Patterns for Data Lakes in Public Cloud
Beyond HDFS: Storage and Compute Separation
Keep your storage on GCS instead of HDFS
Benefits:
● Separation of Compute/Storage
● Full HDFS-compliant GCS connector
● Facilitates job-scoped, cost-effective workloads (+ ephemeral clusters)
● No need to provision 3x storage for replication
● No unused bytes on disks
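The storage benefits listed above can be made concrete with a little arithmetic and a path rewrite. The following is a minimal sketch, not a sizing tool: the 75% disk-utilization ceiling, the bucket name, and both helper names are assumptions introduced for illustration. The only facts relied on are that HDFS replicates blocks three times by default and that the GCS connector addresses objects with `gs://` URIs.

```python
from urllib.parse import urlparse

HDFS_DEFAULT_REPLICATION = 3  # HDFS default block replication factor


def provisioned_hdfs_tb(logical_tb, replication=HDFS_DEFAULT_REPLICATION,
                        max_disk_utilization=0.75):
    """Raw disk (TB) to provision on-prem for a given logical dataset size.

    max_disk_utilization is an assumed headroom ceiling (hypothetical value);
    it models the "unused bytes on disks" that a cluster keeps in reserve.
    """
    if not 0 < max_disk_utilization <= 1:
        raise ValueError("max_disk_utilization must be in (0, 1]")
    return logical_tb * replication / max_disk_utilization


def to_gcs_uri(hdfs_uri, bucket):
    """Rewrite an hdfs:// URI to the gs:// URI a job would use instead.

    The bucket name is supplied by the caller; the object path is kept as-is.
    """
    parsed = urlparse(hdfs_uri)
    if parsed.scheme != "hdfs":
        raise ValueError(f"not an HDFS URI: {hdfs_uri}")
    return f"gs://{bucket}{parsed.path}"


if __name__ == "__main__":
    # 100 TB of logical data on HDFS vs. the same data in object storage,
    # where only the logical bytes are billed.
    print(provisioned_hdfs_tb(100))
    print(to_gcs_uri("hdfs://namenode:8020/warehouse/sales/part-0000.parquet",
                     "example-datalake"))
```

Under these assumptions, 100 TB of logical data needs 400 TB of raw on-prem disk, while the job itself changes only the URI scheme and authority it reads from.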
Workloads: Hive Analytics, Business Reporting, MapReduce ETL, Machine Learning
Storage: Cloud Storage
Hive Metastore
Cloud Dataproc Clusters
Job-Scoped Clusters: Beyond Complicated YARN Queues
● Step away from complicated YARN queues and multi-tenancy
● Control cost and performance per workload:
○ Ephemeral clusters
○ Mix regular and preemptible VMs in the worker pool
○ Different VM types
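One way to picture a job-scoped cluster is as a small, per-workload specification instead of a queue in a shared cluster. The sketch below builds such a specification as a plain dictionary. The field names (`worker_config`, `secondary_worker_config`, `preemptibility`, `lifecycle_config`, `idle_delete_ttl`) follow my understanding of the Cloud Dataproc cluster resource and should be checked against the current API reference; the helper name, job names, machine types chosen, and the 10-minute idle timeout are all illustrative assumptions. Nothing here calls a cloud API.

```python
def job_scoped_cluster_spec(job_name, worker_machine_type, regular_workers,
                            preemptible_workers=0, idle_delete_seconds=600):
    """Build a per-job cluster description (hypothetical helper).

    Each workload gets its own VM type and its own mix of regular and
    preemptible workers, and the cluster is marked for deletion once idle.
    """
    if regular_workers < 1 or preemptible_workers < 0:
        raise ValueError("worker counts must be positive / non-negative")

    config = {
        "master_config": {"num_instances": 1,
                          "machine_type_uri": "n1-standard-4"},
        # Regular (non-preemptible) workers sized for this workload.
        "worker_config": {"num_instances": regular_workers,
                          "machine_type_uri": worker_machine_type},
        # Ephemeral behaviour: delete the cluster after it sits idle.
        "lifecycle_config": {"idle_delete_ttl": {"seconds": idle_delete_seconds}},
    }
    if preemptible_workers:
        # Cheaper, reclaimable capacity added alongside the regular workers.
        config["secondary_worker_config"] = {
            "num_instances": preemptible_workers,
            "preemptibility": "PREEMPTIBLE",
        }
    return {"cluster_name": f"ephemeral-{job_name}", "config": config}


if __name__ == "__main__":
    # Two workloads, two differently shaped clusters, no shared YARN queue.
    etl = job_scoped_cluster_spec("nightly-etl", "n1-standard-8", 2,
                                  preemptible_workers=8)
    ml = job_scoped_cluster_spec("ml-training", "n1-highmem-16", 4)
    print(etl["cluster_name"], etl["config"]["secondary_worker_config"])
    print(ml["cluster_name"], ml["config"]["worker_config"])
```

The design point is that cost and performance tuning moves from queue configuration inside one long-lived cluster to the shape of each short-lived cluster.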
Beyond YARN and into Modern Service Mesh
Converged Smart Analytics
1. Data sources
2. Data Lake storage
3. Data Pipelines
4. Data Warehouse/Lake
5. ML and analytics workloads
[Diagram labels: AI Platform, AI Platform Notebooks]
Thank you
