
Integrating Google Cloud Dataproc with Alluxio for faster performance in the cloud


Published on

Alluxio Tech Talk
Dec 10, 2019

Chris Crosbie and Roderick Yao from the Google Dataproc team and Dipti Borkar of Alluxio will demo how to set up Google Cloud Dataproc with Alluxio so jobs can seamlessly read from and write to Cloud Storage. They’ll also show how to run Dataproc Spark against a remote HDFS cluster.

For more Alluxio events:

Published in: Software


  1. 1. Accelerating workloads and bursting data with Google Dataproc & Alluxio. Dipti Borkar | VP, Product | Alluxio; Roderick Yao | Strategic Cloud Engineer | Google
  2. 2. ▪ What’s Google Dataproc? ▪ What’s Alluxio? ▪ Alluxio in Dataproc ▪ Demo
  3. 3. Placeholder: GCP Overview
  4. 4. Android, Chrome, Next-Gen Devices, G Suite, Maps, Compute Engine, App Engine, Cloud ML Engine, SaaS, PaaS, IaaS, Drive, Container Engine, BigQuery
  5. 5. Everything You Need To Build And Scale. Compute: From virtual machines with proven price/performance advantages to a fully managed app development platform (Compute Engine, App Engine, Container Engine, Container Registry, Cloud Functions). Storage and Databases: Scalable, resilient, high performance object storage and databases for your applications (Cloud Storage, Cloud Bigtable, Cloud Datastore, Cloud SQL, Cloud Spanner). Networking: State-of-the-art software-defined networking products on Google’s private fiber network (Cloud Virtual Network, Cloud Load Balancing, Cloud CDN, Cloud Interconnect, Cloud DNS). Management Tools: Monitoring, logging, diagnostics, and more, all in an easy-to-use web management console or mobile app (Stackdriver Overview, Monitoring, Logging, Error Reporting, Debugger, Deployment Manager & More). Big Data: Fully managed data warehousing, batch and stream processing, data exploration, Hadoop/Spark, and reliable messaging (BigQuery, Cloud Dataflow, Cloud Dataproc, Cloud Dataprep, Cloud Datalab, Cloud Pub/Sub, Genomics). Machine Learning: Fast, scalable, easy to use ML services. Use our pre-trained models or train custom models on your data (Cloud Machine Learning Platform, Vision API, Video Intelligence API, Speech API, Translate API, NLP API). Developer Tools: Develop and deploy your applications using our command-line interface and other developer tools (Cloud SDK, Deployment Manager, Cloud Source Repositories, Cloud Endpoints, Cloud Tools for Android Studio, Cloud Tools for IntelliJ, Google Plugin for Eclipse, Cloud Test Lab, Cloud Container Builder). Identity & Security: Control access and visibility to resources running on a platform protected by Google’s security model (Cloud IAM, Cloud IAP, Cloud KMS, Cloud Resource Manager, Cloud Security Scanner, Cloud Platform Security Overview).
  6. 6. Google has 20+ years of experience solving data problems. Timeline axis: 2002, 2004, 2005, 2006, 2008, 2010, 2012, 2014, 2015, 2016. Rows: Google Research, Open Source, Google Cloud Products. Items shown: BigQuery, Pub/Sub, Dataflow, Bigtable, ML; GFS, MapReduce, Bigtable, Dremel, FlumeJava, MillWheel, TensorFlow; Apache Beam, Pub/Sub, Dataproc.
  7. 7. Benefits of migrating Hadoop to GCP: (1) Pay for use; scale anytime. (2) Managed hardware and software. (3) Run existing jobs with minimal changes. (4) Flexible job configuration. (5) Make HDFS data widely available.
  8. 8. External validation of cost-effectiveness: “Analyzing the economic benefits of Google Cloud Dataproc Cloud-native Hadoop and Spark Platform”. 57% less expensive than on-premises; 32% less expensive than EMR. “...customers also reported substantial benefits in the strategic value they were able to pull out of the data hosted in the Google Cloud.”
  9. 9. Why Enterprises Migrate to GCP: to reduce infrastructure costs, improve reliability and scale smoothly; to gain more value from data and predict business outcomes; to more rapidly build new apps and experiences; to connect to business platforms of services and partners; to make teams productive with secure mobile / devices.
  10. 10. Combining the best of open source and cloud. Cloud Dataproc
  11. 11. What is Cloud Dataproc? Google Cloud Platform’s fully managed Apache Spark and Apache Hadoop service. Rapid cluster creation; familiar open source tools; ephemeral clusters on demand; customizable machines; tightly integrated with other Google Cloud Platform services.
  12. 12. Ephemeral Dataproc Clusters: Submit Job 1, Deploy Cluster 1; Submit Job 2, Deploy Cluster 2; Submit Job 3, Deploy Cluster 3.
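The job-per-cluster pattern on this slide can be sketched as three `gcloud` invocations per job: create a cluster, submit the job, delete the cluster. The sketch below only builds the command lines and does not run them; the cluster names, region, and job file paths are hypothetical, and the deck itself does not show these commands.

```python
from typing import List


def ephemeral_job_commands(cluster: str, region: str, job_file: str) -> List[List[str]]:
    """Return create / submit / delete command lines for one ephemeral cluster.

    All three arguments are hypothetical inputs chosen by the caller.
    """
    return [
        # 1. Deploy a cluster dedicated to this job.
        ["gcloud", "dataproc", "clusters", "create", cluster, f"--region={region}"],
        # 2. Submit the job (a PySpark file here) to that cluster.
        ["gcloud", "dataproc", "jobs", "submit", "pyspark", job_file,
         f"--cluster={cluster}", f"--region={region}"],
        # 3. Tear the cluster down once the job has finished.
        ["gcloud", "dataproc", "clusters", "delete", cluster,
         f"--region={region}", "--quiet"],
    ]


if __name__ == "__main__":
    for i in (1, 2, 3):
        commands = ephemeral_job_commands(
            f"job-{i}-cluster", "us-central1", f"gs://example-bucket/job{i}.py"
        )
        for argv in commands:
            print(" ".join(argv))
```

Because each job gets its own cluster, the input and output data must live outside the cluster, which is why the later slides move storage to Cloud Storage.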
  13. 13. Long Standing Dataproc Clusters: 01 Deploy small Cloud Dataproc cluster. 02 Users submit jobs as themselves. 03 Cluster scales up to meet demand within predetermined budget. 04 Cluster scales down as demand recedes.
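The scale-up and scale-down steps above can be illustrated with a toy sizing rule. This is a minimal sketch under stated assumptions, not Dataproc's autoscaling algorithm: `jobs_per_worker`, `min_workers`, and `max_workers` (standing in for the "predetermined budget") are hypothetical parameters.

```python
def target_workers(pending_jobs: int, jobs_per_worker: int,
                   min_workers: int, max_workers: int) -> int:
    """Pick a worker count that follows demand but stays inside fixed bounds."""
    # Ceiling division: workers needed to cover all pending jobs.
    needed = -(-pending_jobs // jobs_per_worker)
    # Clamp between the small baseline cluster and the budget ceiling.
    return max(min_workers, min(max_workers, needed))


if __name__ == "__main__":
    for demand in (0, 17, 100, 3):
        print(demand, "pending jobs ->", target_workers(demand, 4, 2, 10), "workers")
```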
  14. 14. HDFS to GCS: Use Google Cloud Storage as your primary data source and sink. Decouple storage and compute. Un-silo your data.
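Since the Cloud Storage connector exposes buckets under the `gs://` scheme, moving a job from HDFS to Cloud Storage is often mostly a matter of changing path prefixes. The helper below is a hedged sketch of that rewrite; the function name, the namenode address, and the bucket name are illustrative and not taken from the talk.

```python
from urllib.parse import urlsplit


def to_gcs_uri(hdfs_uri: str, bucket: str) -> str:
    """Map an hdfs:// path onto the same path inside a Cloud Storage bucket."""
    parts = urlsplit(hdfs_uri)
    if parts.scheme != "hdfs":
        raise ValueError(f"expected an hdfs:// URI, got {hdfs_uri!r}")
    # Drop the namenode host and port; keep the file path unchanged.
    return f"gs://{bucket}{parts.path}"


if __name__ == "__main__":
    print(to_gcs_uri("hdfs://namenode:8020/data/trades/us", "example-bucket"))
```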
  15. 15. Use Cloud Storage as a Destination. Cloud Storage provides strong global consistency for the following operations, including both data and metadata: ● Read-after-write ● Read-after-metadata-update ● Read-after-delete ● Bucket listing ● Object listing ● Granting access to resources. Renames are metadata operations.
  16. 16. Take Advantage of Cloud Storage Classes for Hadoop. REGIONAL STORAGE: ● Can be Multi-Regional or Regional. ● Use for interactive Hive/Spark analysis or batch jobs that occur more than once a month. NEARLINE STORAGE: ● Use for batch jobs that only need the data in historical reporting/aggregations (at most once a month). COLDLINE STORAGE: ● Use for post-processed data ● No expectation to use again (no more than once per year). Slide taglines: “Universal cloud storage for any workload.” “Cloud storage for use cases that don't require high availability.” “Cloud storage for long term, less frequently accessed content.”
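The access-frequency guidance on this slide can be written down as a small decision rule. This is a sketch that encodes only the thresholds stated above (more than once a month, at most once a month, no more than once a year); the function name and the reads-per-year input are illustrative, and real class selection would also weigh cost and retrieval fees.

```python
def suggest_storage_class(reads_per_year: float) -> str:
    """Suggest a Cloud Storage class from how often a dataset is read."""
    if reads_per_year > 12:
        # Read more than once a month: interactive analysis or frequent batch jobs.
        return "REGIONAL"
    if reads_per_year > 1:
        # At most once a month: historical reporting and aggregations.
        return "NEARLINE"
    # No more than once per year: post-processed data unlikely to be reused.
    return "COLDLINE"


if __name__ == "__main__":
    for reads in (52, 12, 4, 1, 0):
        print(reads, "reads/year ->", suggest_storage_class(reads))
```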
  17. 17. Use Cloud Storage to make HDFS widely available. Diagram labels: Cloud Storage; GCS Connector; External tables; Apache Beam input/output; Google BigQuery; Cloud Dataproc; Hortonworks Data Platform; Compute Engine; Cloud Dataflow; Cloudera Data Platform.
  18. 18. Using Alluxio to Burst Workloads. Diagram labels: On-Premises Datacenter (On-Premises Cluster; Worker Nodes: HDFS DataNode, YARN NodeManager; hdfs://); Region us-central1, Zone us-central1-f (Alluxio; Cloud Dataproc, Multiple Instances; Cloud Storage, Regional; gs://); Migrate Data; Batch Jobs; Cloud Dataproc; Store in GCS and cache in Alluxio; Calls data from on-prem HDFS or Cloud Storage.
  19. 19. Introduction to Alluxio Open source data orchestration
  20. 20. The Alluxio Story. Originated as the Tachyon project at UC Berkeley’s AMP Lab, by then Ph.D. student and now Alluxio CTO Haoyuan (H.Y.) Li. 2014, 2015: Open source project established and company to commercialize Alluxio founded. Goal: Orchestrate data for the cloud for data-driven apps such as Big Data analytics, ML and AI. Focus: Accelerating modern app frameworks running on HDFS/S3/GCS-based data lakes or warehouses.
  21. 21. Data Orchestration for the Cloud. Interfaces: Java File API, HDFS Interface, S3 Interface, REST API, POSIX Interface. Drivers: HDFS Driver, Swift Driver, S3 Driver, NFS Driver. Lines of Business.
  22. 22. Data Orchestration for the Cloud
  23. 23. Alluxio Reference Architecture. Diagram labels: Application with Alluxio Client (shown twice, …); Alluxio Master; Standby Master; Zookeeper / RAFT; Alluxio Worker with RAM / SSD / HDD (shown twice, …); WAN; Under Store 1; Under Store 2.
  24. 24. Enterprise Cloud Compute & Storage is Great… but Data got left behind. Compute: 2–5 mins, elastic ✓. Storage: 2–5 mins, elastic ✓. Dataset: 2–4 weeks, not elastic! Steps to get a dataset: request data, request review, find dataset, code script/job, run ETL jobs, grant permissions.
  25. 25. Accelerating Analytics in the cloud. Problem: Object stores have inconsistent performance for analytics and AI workloads ▪ SLAs are hard to achieve ▪ Metadata operations are expensive ▪ Copied data storage costs add up, making the solution expensive. Solution: Consistent high performance • Performance increases range from 1.5X to 10X • AWS EMR & Google Dataproc integrations • Fewer copies of data means lower costs. Alluxio enables compute! Diagram labels: Spark, Presto, Hive, TensorFlow; Alluxio Data Orchestration and Control Service; Public Cloud IaaS.
  26. 26. Walmart | High Performance Cloud analytics. Project: • Utilize Presto for interactive queries on cloud object store. Problem: • Queries too slow to be usable • Inconsistent query performance. Alluxio solution: • Alluxio provides an intelligent distributed caching layer for object storage. Result: • High performance queries • Consistent performance • Interactive query performance for analysts. Diagram labels: Presto, object store, public cloud (before); Presto, Alluxio, object store, public cloud (after).
  27. 27. Presto & Alluxio on Works well together… Charts (series: Presto, Presto + Alluxio): small range query response time (lower is better); large scan query response time (lower is better); concurrency (higher is better). Presto: • Query performance bottlenecks • Unpredictable network IO • Query pattern: datasets modelled in star schema could benefit from dimension table caching. Presto + Alluxio: • Avoids unpredictable network • Consistent query latency • Higher throughput and better concurrency.
  28. 28. Alluxio in Dataproc
  29. 29. Google Dataproc. Diagram labels: Google Dataproc Cluster with Presto and Hive; Google Cloud Store.
  30. 30. Using Alluxio with Google Dataproc. Diagram labels: Google Dataproc Cluster with Presto and Hive; Metadata & Data cache; Compute-driven, Continuous sync; Google Cloud Store. A single-command initialization action brings up Alluxio in Dataproc. Alluxio Initialization Action -
  31. 31. What about remote data?
  32. 32. Bursting workloads to the cloud with remote data. Typical restrictions: ▪ Data cannot be persisted in a public cloud ▪ Additional I/O capacity cannot be added to existing Hadoop infrastructure ▪ On-prem level security needs to be maintained ▪ Network bandwidth utilization needs to be minimal. Options: Lift and Shift; Data copy by workload; “Zero-copy” Bursting.
  33. 33. “Zero-copy” bursting to scale to the cloud. Problem: HDFS cluster is compute-bound & complex to maintain. Barrier 1: Prohibitive network latency and bandwidth limits • Makes hybrid analytics unfeasible. Barrier 2: Copying data to cloud • Difficult to maintain copies • Data security and governance • Costs of another silo. Step 1: Hybrid Cloud for Burst Compute Capacity • Orchestrates compute access to on-prem data • Working set of data, not FULL set of data • Local performance • Scales elastically • On-Prem Cluster Offload (both Compute & I/O). Step 2: Online Migration of Data Per Policy • Flexible timing to migrate, with fewer dependencies • Instead of a hard switchover, migrate at own pace • Moves the data per policy, e.g. last 7 days. Diagram labels: AWS Public Cloud IaaS and On-Premises Datacenter, each showing Spark, Presto, Hive, TensorFlow and the Alluxio Data Orchestration and Control Service; Connectivity.
  34. 34. “Zero-copy” bursting under the hood. Diagram labels: Spark, Presto, Hive, TensorFlow; RAM; Framework; Data requests; Trades Directory; Customers Directory; Variable latency with throttling. Request sequence (repeated on the slide): Read file /trades/us; Read file /trades/us again; Read file /trades/top; Read file /trades/top.
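The request sequence on this slide behaves like a read-through cache: the first read of a file goes to the remote store over the slow link, and repeated reads are served locally. The sketch below models only that idea. The class name, the LRU eviction rule, and the in-memory "remote" dictionary are assumptions made for illustration and do not describe Alluxio's actual implementation.

```python
from collections import OrderedDict
from typing import Callable


class ReadThroughCache:
    """Serve repeated reads locally; fetch from the remote store on a miss."""

    def __init__(self, fetch_remote: Callable[[str], bytes], capacity: int) -> None:
        self._fetch = fetch_remote
        self._capacity = capacity
        self._data: "OrderedDict[str, bytes]" = OrderedDict()
        self.remote_reads = 0  # how many reads crossed the slow link

    def read(self, path: str) -> bytes:
        if path in self._data:
            # Cache hit: mark as most recently used and return the local copy.
            self._data.move_to_end(path)
            return self._data[path]
        # Cache miss: fetch once from the remote store and keep a copy.
        self.remote_reads += 1
        blob = self._fetch(path)
        self._data[path] = blob
        if len(self._data) > self._capacity:
            # Assumed policy: evict the least recently used entry.
            self._data.popitem(last=False)
        return blob


if __name__ == "__main__":
    remote = {"/trades/us": b"us-data", "/trades/top": b"top-data"}
    cache = ReadThroughCache(remote.__getitem__, capacity=2)
    for path in ("/trades/us", "/trades/us", "/trades/top", "/trades/top"):
        cache.read(path)
    print("remote reads:", cache.remote_reads)
```

Only the working set crosses the network, which is the property the "zero-copy" slides rely on.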
  35. 35. Feature Highlight – Policy-driven Data Management. New Trades policy defined: move data > 90 days old to GCS. Policy interval: every day. Policy applied every day. Diagram labels: Spark, Presto, Hive, TensorFlow; RAM, SSD, Disk; Framework.
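The "move data older than 90 days" policy can be sketched as a daily selection over file modification times. This is an illustration under stated assumptions, not the Alluxio policy engine: the function name, the dictionary of timestamps, and the file paths are hypothetical, and the actual move to GCS is left out.

```python
from datetime import datetime, timedelta
from typing import Dict, List


def files_to_migrate(last_modified: Dict[str, datetime], now: datetime,
                     max_age_days: int = 90) -> List[str]:
    """Return the paths whose last modification is older than the policy age."""
    cutoff = now - timedelta(days=max_age_days)
    return sorted(path for path, ts in last_modified.items() if ts < cutoff)


if __name__ == "__main__":
    # One evaluation of the policy; the slide applies it once per day.
    catalog = {
        "/trades/2019-01.parquet": datetime(2019, 1, 31),
        "/trades/2019-11.parquet": datetime(2019, 11, 30),
    }
    print(files_to_migrate(catalog, now=datetime(2019, 12, 10)))
```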
  36. 36. DEMO
  37. 37. Demo: initialization action installs Alluxio in Dataproc. #1 - Access data in Google Cloud Store. Diagram labels: Google Dataproc Cluster with Presto and Hive; Metadata & Data cache; Compute-driven, Continuous sync.
  38. 38. Demo: initialization action installs Alluxio in Dataproc. #2 - Access data from remote Hadoop cluster (simulated as Dataproc). Diagram labels: Google Dataproc Cluster with Presto and Hive; Metadata & Data cache; Compute-driven, Continuous sync.
  39. 39. Get Started with Alluxio on Dataproc. A single command creates a Dataproc cluster with Alluxio installed: $ gcloud dataproc clusters create roderickyao-alluxio --initialization-actions gs://alluxio-public/enterprise-dataproc/2.1.0-1.0/ --metadata alluxio_root_ufs_uri=gs://ryao-test/alluxio-test/,alluxio_site_properties="alluxio.master.mount.table.root.option.fs.gcs.accessKeyId=<KEYID>;alluxio.master.mount.table.root.option.fs.gcs.secretAccessKey=<SECRET>",alluxio_license_base64=$(cat alluxio-enterprise-license.json | base64 | tr -d "\n"),alluxio_download_path=gs://ryao-test/alluxio-enterprise-2.1.0-1.0.tar.gz Tutorial: Getting started with Dataproc and Alluxio
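For scripted cluster creation, the same command line can be assembled programmatically. The sketch below builds the argument list and does not execute it. The metadata keys are the ones shown on the slide; the helper names, cluster name, and bucket paths are hypothetical. It assumes no metadata value contains a comma, since `--metadata` is a comma-separated list of `KEY=VALUE` pairs.

```python
import base64
from typing import Dict, List


def encode_license(raw: bytes) -> str:
    """Base64-encode a license file as a single line (no embedded newlines)."""
    return base64.b64encode(raw).decode("ascii")


def build_create_command(cluster: str, init_action: str,
                         metadata: Dict[str, str]) -> List[str]:
    """Build the gcloud argument list for a cluster with an initialization action."""
    # Join KEY=VALUE pairs with commas and no spaces, as gcloud expects.
    metadata_arg = ",".join(f"{key}={value}" for key, value in metadata.items())
    return [
        "gcloud", "dataproc", "clusters", "create", cluster,
        "--initialization-actions", init_action,
        "--metadata", metadata_arg,
    ]


if __name__ == "__main__":
    command = build_create_command(
        "example-alluxio-cluster",                 # hypothetical cluster name
        "gs://example-bucket/path/to/init-action",  # hypothetical location
        {
            "alluxio_root_ufs_uri": "gs://example-bucket/alluxio-root/",
            "alluxio_license_base64": encode_license(b'{"license": "placeholder"}'),
            "alluxio_download_path": "gs://example-bucket/alluxio-enterprise.tar.gz",
        },
    )
    print(" ".join(command))
```

Passing the list to `subprocess.run` instead of printing it would avoid shell quoting issues with the semicolon-separated `alluxio_site_properties` value.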
  40. 40. Resources Alluxio Initialization Action - Alluxio with Google Cloud Storage documentation -
  41. 41. Questions?