DATA ORCHESTRATION SUMMIT
2020
Hybrid Data Lake on Google Cloud
With Alluxio and Dataproc
Roderick Yao | Google Cloud
Resource utilization and overall
TCO of on-prem data lakes
become unmanageable
Data governance and security issues open up
compliance concerns
Resource-intensive data and
analytics processing can
cause missed SLAs
Analytics experimentation is slow due
to resource provisioning time
On-Prem Data Lakes are struggling to
deliver value
TCO Challenges Governance Challenges
Agility Challenges Scaling Challenges
Data Ingestion
Quickly ingest any volume of real-time or batch data from any system
(Cloud Dataflow, Data Fusion, Pub/Sub)
Data Storage
Cost effective storage for any type or volume of data
(GCS, BigQuery)
Data & Analytics Processing
Large scale processing engines that support any language or application
(Cloud Dataproc, BigQuery)
Business
Intelligence
(Looker)
Data Science
(AI Platform)
Data
Engineering
(Data Fusion)
Partner
App
Secured and
Governed
Data Lake Solution
Data Lake
solution on
Google
Cloud
Migrate your data lake to GCP in phases to
optimize risk, TCO and value
Lift & Shift
● Minimize risk and disruption
● Fast migration from Cloudera,
MapR, etc.
● Fully managed
● Lower TCO
Zero-Copy Burst
● A foot in both worlds
● Take advantage of some cloud
capabilities (GCS, Ephemeral
clusters)
● 57% lower TCO than on-prem
Modernize
● Cloud-native, clean break from the past
● From fully managed to serverless
● Greatest development velocity and agility.
● 60-88% lower TCO than on-prem, plus value
from Google AI on unstructured data
What is Cloud Dataproc?
Rapid cluster creation
Familiar open source tools
Google Cloud Platform’s
fully-managed Apache Spark
and Apache Hadoop service
Ephemeral clusters on-demand
Customizable machines
Tightly Integrated
with other Google Cloud
Platform services
Google Cloud Dataproc value
Fast
Things take seconds
to minutes, not
hours or weeks
Easy
Be an expert with
your data, not your
data infrastructure
Cost-effective
Pay for exactly what
you use to process
your data, not more
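The ephemeral, pay-for-what-you-use model described above could be scripted roughly as follows. This is a minimal sketch and not taken from the talk: the cluster name, region, jar path, and main class are hypothetical placeholders, and the script only prints the gcloud commands unless DRY_RUN=0 is set.

```shell
#!/usr/bin/env bash
# Sketch of an ephemeral Dataproc cluster lifecycle: create, run one job, delete.
# All values below are hypothetical placeholders, not from the presentation.
CLUSTER="${CLUSTER:-ephemeral-demo}"
REGION="${REGION:-us-central1}"
JOB_JAR="${JOB_JAR:-gs://example-bucket/jobs/example-job.jar}"
JOB_CLASS="${JOB_CLASS:-com.example.ExampleJob}"

# Print each command by default; set DRY_RUN=0 to actually execute it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

ephemeral_job() {
  # 1. Create a small cluster just for this job.
  run gcloud dataproc clusters create "$CLUSTER" \
    --region="$REGION" --num-workers=2
  # 2. Submit a Spark job to it.
  run gcloud dataproc jobs submit spark \
    --cluster="$CLUSTER" --region="$REGION" \
    --class="$JOB_CLASS" --jars="$JOB_JAR"
  # 3. Delete the cluster so nothing keeps billing after the job.
  run gcloud dataproc clusters delete "$CLUSTER" \
    --region="$REGION" --quiet
}

ephemeral_job
```

Because the script does not abort on errors, the delete step still runs if the job submission fails, which is usually the desired behavior for a throwaway cluster.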
Cloud Dataproc on
Kubernetes (BETA)
Combining the best of
open source and cloud.
Jan ‘19 - Kubernetes
Operator for Apache Spark
Open Sourced
Sept ‘19 - Kubernetes
Operator for Apache Flink
Open Sourced
What does this
mean for data
scientists and
data engineers?
Moving to a
container-first world
Cross-cloud and hybrid cloud
support
Better OSS component isolation
Support for vendor-supported
components on Cloud Dataproc
Faster development of cluster / task
management vs. Apache YARN
Cloud
Dataproc
Machine
Learning
ETL/ ELT SQL
Partner
Component
Secure Manage Support
Streaming
▪ On-demand Capacity
▪ Keep existing investments. When you need compute capacity, expand your cloud
footprint.
▪ Leverage cloud flexibility for bursty workloads
▪ Reduce overload on existing infrastructure by moving ephemeral workloads or
workloads with unpredictable resource utilization to the cloud
▪ Intermediate step before migrating to the cloud
▪ Lower the risk of a full cloud data migration by starting with compute in the cloud
and data on-prem. Full migration can be slow!
Why Hybrid Cloud?
Zero-copy Hybrid Bursting Architecture
Migration Challenges
Data Orchestration
● Global Catalog simplifies data
discovery
● Data On-demand
Security & Governance
● Authentication - Kerberos, Delegation
token, LDAP, AD
● Authorization - File System security
model, Apache Ranger integration
● Encryption - TLS, encryption at rest
● Audit - access logs
A Real World Example
DEMO
Get Started with Alluxio on Dataproc
Tutorial: Getting started with Dataproc and Alluxio
https://www.alluxio.io/products/google-cloud/gcp-dataproc-tutorial/
For this Demo:
$ gcloud dataproc clusters create roderickyao-alluxio --initialization-actions gs://alluxio-public/enterprise-dataproc/2.1.0-1.0/alluxio-dat
alluxio_root_ufs_uri=gs://ryao-test/alluxio-test/,alluxio_site_properties="alluxio.master.mount.table.root.option.fs.gcs.accessKeyId=<KEYID>
on.fs.gcs.secretAccessKey=<SECRET>",alluxio_license_base64=$(cat alluxio-enterprise-license.json | base64 | tr -d "\n"),alluxio_download_path=gs://ryao-test/alluxio-enterprise-2.1.0-1.0.tar.gz
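The demo command passes its settings to the initialization action as comma-separated key=value pairs, including a base64-encoded license. The sketch below shows one way to assemble such a string. It assumes the pairs are handed to gcloud through its --metadata flag (that part of the command is cut off on the slide), and the helper function names and the init-action argument are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: assemble the metadata string that a Dataproc initialization action reads.
# Assumption: the values are passed with gcloud's --metadata flag.

# Encode a file as single-line base64. Some base64 builds wrap long output,
# so any newlines are stripped.
encode_license() {
  base64 < "$1" | tr -d '\n'
}

# Join the key=value pairs with commas.
build_metadata() {
  local ufs_uri="$1" license_file="$2"
  printf 'alluxio_root_ufs_uri=%s,alluxio_license_base64=%s' \
    "$ufs_uri" "$(encode_license "$license_file")"
}

# Print (do not run) a create command. init_action is a placeholder for the
# gs:// path of the initialization script.
print_create_command() {
  local cluster="$1" init_action="$2" ufs_uri="$3" license_file="$4"
  printf 'gcloud dataproc clusters create %s --initialization-actions %s --metadata %s\n' \
    "$cluster" "$init_action" "$(build_metadata "$ufs_uri" "$license_file")"
}
```

Calling print_create_command with a cluster name, an init-action path, a root UFS URI, and a license file prints a command that can be reviewed before it is run by hand.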
Resources
▪ Burst data lake processing to Dataproc using on-prem Hadoop data
https://cloud.google.com/blog/products/data-analytics/burst-data-lake-processing-dataproc-using-prem-hadoop-data
▪ Tutorial: Hybrid Cloud Bursting with GCP and Alluxio
https://docs.alluxio.io/ee/user/stable/en/tutorials/GCP-Tutorial.html
