Accelerating workloads and
bursting data with Google
Dataproc & Alluxio
Dipti Borkar | VP, Product | Alluxio
Roderick Yao | Strategic Cloud Engineer | Google
▪ What’s Google Dataproc?
▪ What’s Alluxio?
▪ Alluxio in Dataproc
▪ Demo
GCP Overview
[Diagram: Google's portfolio across SaaS (G Suite, Maps, Drive), PaaS (App Engine, Container Engine, Cloud ML Engine, BigQuery), and IaaS (Compute Engine), plus next-gen devices (Android, Chrome)]
Building what's next
Everything You Need To Build And Scale
Compute
From virtual machines with proven
price/performance advantages to
a fully managed app development
platform.
Compute Engine
App Engine
Container Engine
Container Registry
Cloud Functions
Storage and Databases
Scalable, resilient, high
performance object storage and
databases for your applications.
Cloud Storage
Cloud Bigtable
Cloud Datastore
Cloud SQL
Cloud Spanner
Networking
State-of-the-art software-defined
networking products on Google’s
private fiber network.
Cloud Virtual Network
Cloud Load Balancing
Cloud CDN
Cloud Interconnect
Cloud DNS
Management Tools
Monitoring, logging, diagnostics,
and more, all in an easy-to-use web
management console or mobile app.
Stackdriver Overview
Monitoring
Logging
Error Reporting
Debugger
Deployment Manager & More
Big Data
Fully managed data warehousing,
batch and stream processing, data
exploration, Hadoop/Spark, and
reliable messaging.
BigQuery
Cloud Dataflow
Cloud Dataproc
Cloud Dataprep
Cloud Datalab
Cloud Pub/Sub
Genomics
Machine Learning
Fast, scalable, easy to use ML
services. Use our pre-trained models
or train custom models on your data.
Cloud Machine Learning Platform
Vision API
Video Intelligence API
Speech API
Translate API
NLP API
Developer Tools
Develop and deploy your applications
using our command-line interface and
other developer tools.
Cloud SDK
Deployment Manager
Cloud Source Repositories
Cloud Endpoints
Cloud Tools for Android Studio
Cloud Tools for IntelliJ
Google Plugin for Eclipse
Cloud Test Lab
Cloud Container Builder
Identity & Security
Control access and visibility to
resources running on a platform
protected by Google’s security model.
Cloud IAM
Cloud IAP
Cloud KMS
Cloud Resource Manager
Cloud Security Scanner
Cloud Platform Security Overview
Google has 20+ years of experience solving data problems
[Timeline, 2002–2016: Google research (GFS, MapReduce, BigTable, Dremel, FlumeJava, Millwheel, TensorFlow), open source (Apache Beam), and Google Cloud products (BigQuery, Pub/Sub, Dataflow, Bigtable, ML, Dataproc)]
Benefits of migrating Hadoop to GCP
1. Pay for use; scale anytime
2. Managed hardware and software
3. Run existing jobs with minimal changes
4. Flexible job configuration
5. Make HDFS data widely available
External validation of cost-effectiveness
“Analyzing the economic
benefits of Google Cloud
Dataproc Cloud-native
Hadoop and Spark Platform”
57% less expensive than on-premises
32% less expensive than EMR
“...customers also reported substantial benefits
in the strategic value they were able to pull out
of the data hosted in the Google Cloud.”
Why Enterprises Migrate to GCP
▪ To reduce infrastructure costs, improve reliability, and scale smoothly
▪ To gain more value from data and predict business outcomes
▪ To more rapidly build new apps and experiences
▪ To connect to business platforms of services and partners
▪ To make teams productive with secure mobile devices
Combining the best of
open source and cloud.
Cloud Dataproc
What is Cloud Dataproc?
Google Cloud Platform's fully managed Apache Spark and Apache Hadoop service
▪ Rapid cluster creation
▪ Familiar open source tools
▪ Ephemeral clusters on-demand
▪ Customizable machines
▪ Tightly integrated with other Google Cloud Platform services
Ephemeral Dataproc Clusters
[Diagram: each submitted job gets its own short-lived cluster: Submit Job 1 → Deploy Cluster 1, Submit Job 2 → Deploy Cluster 2, Submit Job 3 → Deploy Cluster 3]
Long-Standing Dataproc Clusters
01 Deploy a small cluster
02 Users submit jobs as themselves
03 Cluster scales up to meet demand within a predetermined budget
04 Cluster scales down as demand recedes
HDFS to GCS
Use Google Cloud Storage as your primary data source and sink
1. Decouple storage and compute
2. Un-silo your data
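Because the GCS connector understands gs:// URIs natively, pointing an existing job at Cloud Storage is often just a path change. A minimal sketch of that rewrite (the namenode host and bucket name here are hypothetical):

```shell
# Rewrite an HDFS URI to its Cloud Storage equivalent.
# "my-datalake" is a hypothetical bucket standing in for your own.
HDFS_PATH="hdfs://namenode:8020/warehouse/sales/2019"
GCS_PATH=$(echo "$HDFS_PATH" | sed 's|^hdfs://[^/]*|gs://my-datalake|')
echo "$GCS_PATH"   # gs://my-datalake/warehouse/sales/2019
```

The same substitution works in job configs and Hive table locations, which is what makes the storage/compute decoupling above a low-friction change.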
Use Cloud Storage as a Destination
Cloud Storage provides strong global consistency for the following operations, including both data and metadata:
● Read-after-write
● Read-after-metadata-update
● Read-after-delete
● Bucket listing
● Object listing
● Granting access to resources
Renames are metadata operations
Take Advantage of Cloud Storage Classes for Hadoop
REGIONAL STORAGE: universal cloud storage for any workload.
● Can be Multi-Regional or Regional
● Use for interactive Hive/Spark analysis or batch jobs that occur more than once a month
NEARLINE STORAGE: cloud storage for use cases that don't require high availability.
● Use for batch jobs that only need the data for historical reporting/aggregations (at most once a month)
COLDLINE STORAGE: cloud storage for long-term, less frequently accessed content.
● Use for post-processed data with no expectation to use it again (no more than once per year)
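One way to act on this split is to keep separate buckets per storage class and route data by access pattern. The sketch below only prints the `gsutil mb` commands it would run rather than executing them (bucket names and location are hypothetical):

```shell
# Print bucket-creation commands, one per storage class.
# Bucket names (example-hadoop-*) and the location are illustrative only.
LOCATION="us-central1"
for CLASS in regional nearline coldline; do
  echo "gsutil mb -c $CLASS -l $LOCATION gs://example-hadoop-$CLASS/"
done
```

Jobs can then write hot interactive data to the regional bucket and age results down to nearline/coldline per the guidance above.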
Use Cloud Storage to make HDFS widely available
[Diagram: Cloud Storage at the center, accessed via the GCS Connector from Cloud Dataproc, Hortonworks Data Platform, and Cloudera Data Platform on Compute Engine; via external tables from Google BigQuery; and via Apache Beam input/output from Cloud Dataflow]
Using Alluxio to Burst Workloads
[Diagram: an on-premises cluster (HDFS DataNodes and YARN NodeManagers on worker nodes) connects to GCP region us-central1, zone us-central1-f. Cloud Dataproc runs Alluxio across multiple instances; batch jobs call data from on-prem HDFS (hdfs://) or Cloud Storage (gs://), storing results in regional GCS and caching them in Alluxio. Data can also be migrated from HDFS to GCS]
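In a layout like this, Alluxio can expose both the on-prem HDFS and the GCS bucket under one namespace through mount points. The sketch below prints the `alluxio fs mount` commands rather than executing them, since they need a live Alluxio master; the namenode host and bucket are hypothetical:

```shell
# Print Alluxio mount commands that would stitch on-prem HDFS and GCS
# into one namespace. Endpoints are illustrative, not real clusters.
for MOUNT in "/mnt/hdfs hdfs://onprem-namenode:8020/" \
             "/mnt/gcs gs://example-regional-bucket/"; do
  echo "alluxio fs mount $MOUNT"
done
```

Jobs then read /mnt/hdfs/... or /mnt/gcs/... through Alluxio, which handles the caching shown in the diagram.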
Introduction to Alluxio
Open source data orchestration
The Alluxio Story
Originated as the Tachyon project at UC Berkeley's AMPLab by then-Ph.D. student and now Alluxio CTO, Haoyuan (H.Y.) Li
2014–2015: Open source project established & company founded to commercialize Alluxio
Goal: Orchestrate data for the cloud for data-driven apps such as big data analytics, ML, and AI
Focus: Accelerating modern app frameworks running on HDFS/S3/GCS-based data lakes or warehouses
Data Orchestration for the Cloud
[Diagram: lines of business access Alluxio through its interfaces (Java File API, HDFS interface, S3 interface, REST API, POSIX interface), while Alluxio reaches storage through drivers (HDFS, Swift, S3, NFS)]
Alluxio Reference Architecture
[Diagram: applications use Alluxio clients to reach, over the WAN, an Alluxio master (with a standby master coordinated via Zookeeper/RAFT) and Alluxio workers that tier data across RAM / SSD / HDD, backed by multiple under stores (Under Store 1, Under Store 2)]
Enterprise Cloud Compute & Storage is Great… but Data got left behind
[Diagram: compute and storage each provision in 2–5 minutes and are elastic ✓; the dataset is not elastic ! Getting data takes 2–4 weeks: request data → request review → find dataset → grant permissions → code script/job → run ETL jobs]
Accelerating Analytics in the Cloud
Problem: Object stores have inconsistent performance for analytics and AI workloads
▪ SLAs are hard to achieve
▪ Metadata operations are expensive
▪ Copied-data storage costs add up, making the solution expensive
Solution: Consistent high performance
• Performance increases range from 1.5X to 10X
• AWS EMR & Google Dataproc integrations
• Fewer copies of data means lower costs
[Diagram: Spark, Presto, Hive, and TensorFlow on public cloud IaaS, running on the Alluxio Data Orchestration and Control Service. Alluxio enables compute!]
Walmart | High Performance Cloud Analytics
Project:
• Utilize Presto for interactive queries on cloud object store
Problem:
• Low performance: queries too slow to be usable
• Inconsistent performance of queries
Alluxio solution:
• Alluxio provides an intelligent distributed caching layer between Presto and the object store
Result:
• High performance queries
• Consistent performance
• Interactive query performance for analysts
Presto & Alluxio work well together…
[Charts: small range query response time (lower is better), large scan query response time (lower is better), and concurrency (higher is better), comparing Presto alone vs. Presto + Alluxio]
• Query performance bottlenecks: unpredictable network I/O
• Query pattern: datasets modelled in a star schema can benefit from dimension table caching
• Presto + Alluxio avoids the unpredictable network, giving consistent query latency, higher throughput, and better concurrency
Alluxio in Dataproc
Google Dataproc
[Diagram: two Dataproc clusters, each running Presto and Hive directly against Google Cloud Store]
Using Alluxio with Google Dataproc
[Diagram: two Dataproc clusters, each running Presto and Hive on top of an Alluxio metadata & data cache, with compute-driven continuous sync to Google Cloud Store]
A single-command initialization action brings up Alluxio in Dataproc.
Alluxio Initialization Action - https://github.com/GoogleCloudPlatform/dataproc-initialization-actions/tree/master/alluxio
What about remote data?
Bursting workloads to the cloud with remote data
Typical Restrictions
▪ Data cannot be persisted in a public cloud
▪ Additional I/O capacity cannot be added to existing Hadoop infrastructure
▪ On-prem level security needs to be maintained
▪ Network bandwidth utilization needs to be minimal
Options:
▪ Lift and shift
▪ Data copy by workload
▪ “Zero-copy” bursting
“Zero-copy” bursting to scale to the cloud
Problem: The on-prem HDFS cluster is compute-bound & complex to maintain
[Diagram: Spark, Presto, Hive, and TensorFlow run both in public cloud IaaS (AWS in the diagram) and in the on-premises datacenter, each on an Alluxio Data Orchestration and Control Service, with connectivity between the two sites]
Barrier 1: Prohibitive network latency and bandwidth limits
• Makes hybrid analytics unfeasible
Barrier 2: Copying data to the cloud
• Difficult to maintain copies
• Data security and governance
• Costs of another silo
Step 1: Hybrid cloud for burst compute capacity
• Orchestrates compute access to on-prem data
• Working set of data, not the FULL set of data
• Local performance
• Scales elastically
• On-prem cluster offload (both compute & I/O)
Step 2: Online migration of data per policy
• Flexible timing to migrate, with fewer dependencies
• Instead of a hard switch-over, migrate at your own pace
• Moves the data per policy (e.g. last 7 days)
“Zero-copy” bursting under the hood
[Diagram: Spark, Presto, Hive, and TensorFlow issue data requests to the framework. A first read of a file such as /trades/us fetches it from the remote Trades directory over a link with variable latency and throttling; subsequent reads of /trades/us, and reads of other hot files such as /trades/top, are served from Alluxio's RAM cache]
Feature Highlight: Policy-driven Data Management
[Diagram: Spark, Presto, Hive, and TensorFlow on a framework tiering data across RAM, SSD, and disk. New trades land in the hot tiers; a defined policy, "move data > 90 days old to GCS", is applied every day at the configured policy interval]
DEMO
Demo: initialization action installs Alluxio in Dataproc
#1 - Access data in Google Cloud Store
[Diagram: a Dataproc cluster running Presto and Hive on an Alluxio metadata & data cache, with compute-driven continuous sync to Google Cloud Store]
Demo: initialization action installs Alluxio in Dataproc
#2 - Access data from a remote Hadoop cluster (simulated as Dataproc)
[Diagram: a Dataproc cluster running Presto and Hive on an Alluxio metadata & data cache, with compute-driven continuous sync to the remote cluster]
Get Started with Alluxio on Dataproc
A single command creates a Dataproc cluster with Alluxio installed:

$ gcloud dataproc clusters create roderickyao-alluxio \
    --initialization-actions gs://alluxio-public/enterprise-dataproc/2.1.0-1.0/alluxio-dataproc.sh \
    --metadata alluxio_root_ufs_uri=gs://ryao-test/alluxio-test/,\
alluxio_site_properties="alluxio.master.mount.table.root.option.fs.gcs.accessKeyId=<KEYID>;alluxio.master.mount.table.root.option.fs.gcs.secretAccessKey=<SECRET>",\
alluxio_license_base64=$(cat alluxio-enterprise-license.json | base64 | tr -d "\n"),\
alluxio_download_path=gs://ryao-test/alluxio-enterprise-2.1.0-1.0.tar.gz
Tutorial: Getting started with Dataproc and Alluxio
https://www.alluxio.io/products/google-cloud/gcp-dataproc-tutorial/
Resources
Alluxio Initialization Action
- https://github.com/GoogleCloudPlatform/dataproc-initialization-actions/tree/master/alluxio
Alluxio with Google Cloud Storage documentation
- https://docs.alluxio.io/ee/user/stable/en/ufs/GCS.html
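For reference, the same root-mount settings passed via gcloud metadata above can instead live in alluxio-site.properties (a sketch; the bucket, directory, and credential placeholders are illustrative, and the linked GCS documentation is the authority on these properties):

```
alluxio.master.mount.table.root.ufs=gs://<bucket>/<dir>/
alluxio.master.mount.table.root.option.fs.gcs.accessKeyId=<KEYID>
alluxio.master.mount.table.root.option.fs.gcs.secretAccessKey=<SECRET>
```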
Questions?
alluxio.io
