
Accelerating workloads and bursting data with Google Dataproc & Alluxio

Published on

Big Data Application Meetup @ Google
Nov 21, 2019

Speakers:
Dipti Borkar, Alluxio
Roderick Yao, Google

Published in: Software

  1. 1. Accelerating workloads and bursting data with Google Dataproc & Alluxio Dipti Borkar | VP, Product | Alluxio Roderick Yao | Strategic Cloud Engineer | Google
  2. 2. ▪ What’s Google Dataproc? ▪ What’s Alluxio? ▪ Alluxio in Dataproc ▪ Demo
  3. 3. Enterprises are telling us they need: To respond to different business data needs with different urgency and emphasis ● Create bespoke Hadoop clusters customized for any workload ● Use them for a minute or a year A faster, more scalable way to get insights from data ● Get up and running without waiting for hardware or software to be installed or configured To get their people out of owning and monitoring technology and back to innovating ● Design workflows that create clusters, complete jobs end-to-end, and then delete themselves To spend less money ● Create clusters in seconds ● Pay only for when the cluster is running ● Take advantage of preemptible VM instances
  4. 4. Enterprise Hadoop cluster woes: You know that managing a Hadoop cluster can be frustrating and time-consuming ▪ It’s a hassle to renew the license on your on-premises system ▪ It’s hard to scale compute or storage on demand ▪ Maintaining the operations of your Hadoop cluster takes too much time ▪ Your system can’t keep up with forecasted usage and data growth ▪ Your legacy system busts your budget
  5. 5. What is Cloud Dataproc? Google Cloud Platform’s fully managed Apache Spark and Apache Hadoop service ▪ Rapid cluster creation ▪ Familiar open source tools ▪ Ephemeral clusters on demand ▪ Customizable machines ▪ Tightly integrated with other Google Cloud Platform services
  6. 6. Google Cloud Dataproc vision ▪ Fast: Things take seconds to minutes, not hours or weeks ▪ Easy: Be an expert with your data, not your data infrastructure ▪ Cost-effective: Pay for exactly what you use to process your data, not more
  7. 7. Disaggregation of storage and compute [Diagram labels: Data sources; Storage (Cloud Storage, BigQuery, Cloud Bigtable); Development & Test (Development Cloud Dataproc, Test Cloud Dataproc); Production (Cloud Dataproc); Data sinks; External applications; Application Logs; Data science; Analysis (Cloud Datalab); Cluster monitoring; Monitor and Logging (Stackdriver)]
  8. 8. Ephemeral and long-lived clusters [Diagram labels: Clusters per job; Semi-long-lived clusters, grouped and selected by label; Cloud Dataproc clusters; Cloud Storage; Edge Nodes (Compute Engine); Clients; Development (Preview): Dev cluster; Production (1.2): Prod 1, Prod 2]
  9. 9. Customers using Dataproc to Scale
  10. 10. Traditional brick-and-mortar retailer (under NDA) ▪ Challenge: To build machine learning models focused on fraud detection and inventory management ▪ How Google helped: Partnered with the retailer to think about both the digital experience and the in-store customer experience, especially to help them manage major retail events like Black Friday ▪ What they are running: 67 avg. clusters per day, 513 nodes per cluster ▪ Products & Services: BigQuery, Stackdriver, Compute, Cloud Storage, Bigtable, Dataflow, Dataproc, Pub/Sub, PSO & Support
  11. 11. Combining the best of open source and cloud. Cloud Dataproc
  12. 12. Introduction to Alluxio Open source data orchestration
  13. 13. The Alluxio Story ▪ Originated as the Tachyon project at UC Berkeley’s AMP Lab by then Ph.D. student and now Alluxio CTO, Haoyuan (H.Y.) Li ▪ Open source project established and company to commercialize Alluxio founded ▪ Timeline markers: 2014, 2015 ▪ Goal: Orchestrate data for the cloud for data-driven apps such as big data analytics, ML, and AI ▪ Focus: Accelerating modern app frameworks running on HDFS/S3/GCS-based data lakes or warehouses
  14. 14. Data Orchestration for the Cloud [Diagram labels: Interfaces: Java File API, HDFS Interface, S3 Interface, REST API, POSIX Interface; Drivers: HDFS Driver, Swift Driver, S3 Driver, NFS Driver; Lines of Business]
  15. 15. Data Orchestration for the Cloud
  16. 16. Alluxio Reference Architecture [Diagram labels: Application with Alluxio Client (two shown); Alluxio Master; Standby Master; Zookeeper / RAFT; Alluxio Worker with RAM / SSD / HDD (two shown); WAN; Under Store 1; Under Store 2]
  17. 17. Enterprise Cloud Compute & Storage is Great… but Data got left behind ▪ Compute: 2–5 Mins, Elastic ✓ ▪ Storage: 2–5 Mins, Elastic ✓ ▪ Dataset: 2–4 Weeks, Not Elastic! (Request Data, Request Review, Find Dataset, Code Script/Job, Run ETL jobs, Grant Permissions)
  18. 18. Accelerating Analytics in the cloud. Problem: Object stores have inconsistent performance for analytics and AI workloads ▪ SLAs are hard to achieve ▪ Metadata operations are expensive ▪ Copied data storage costs add up, making the solution expensive. Solution: Consistent high performance • Performance increases range from 1.5X to 10X • AWS EMR & Google Dataproc integrations • Fewer copies of data means lower costs. Alluxio enables compute! [Diagram labels: Spark, Presto, Hive, TensorFlow; Alluxio Data Orchestration and Control Service; Public Cloud IaaS]
  19. 19. Walmart | High Performance Cloud analytics. Project: • Utilize Presto for interactive queries on a cloud object store. Problem: • Queries too slow to be usable • Inconsistent query performance. Alluxio solution: • Alluxio provides an intelligent distributed caching layer for object storage. Result: • High performance queries • Consistent performance • Interactive query performance for analysts. [Diagram labels: Public Cloud with Presto and Object Store; Public Cloud with Presto, Alluxio, and Object Store]
  20. 20. Presto & Alluxio on […] Works well together… [Charts: Small range query response time (lower is better); Large scan query response time (lower is better); Concurrency (higher is better); series: Presto, Presto + Alluxio] • Query performance bottlenecks • Unpredictable network IO • Query pattern: datasets modelled in star schema could benefit from dimension table caching. Presto + Alluxio: • Avoids unpredictable network • Consistent query latency • Higher throughput and better concurrency
  21. 21. Alluxio in Dataproc
  22. 22. Google Dataproc [Diagram labels: Presto and Hive on a Google Dataproc Cluster; Google Cloud Store]
  23. 23. Using Alluxio with Google Dataproc [Diagram labels: Presto and Hive on a Google Dataproc Cluster; Metadata & Data cache; Compute-driven Continuous sync; Google Cloud Store] A single-command initialization action brings up Alluxio in Dataproc. Alluxio Initialization Action - https://github.com/GoogleCloudPlatform/dataproc-initialization-actions/tree/master/alluxio
  24. 24. What about remote data?
  25. 25. Bursting workloads to the cloud with remote data. Typical Restrictions: ▪ Data cannot be persisted in a public cloud ▪ Additional I/O capacity cannot be added to existing Hadoop infrastructure ▪ On-prem level security needs to be maintained ▪ Network bandwidth utilization needs to be minimal. Options: ▪ Lift and Shift ▪ Data copy by workload ▪ “Zero-copy” Bursting
  26. 26. “Zero-copy” bursting to scale to the cloud. Problem: HDFS cluster is compute-bound and complex to maintain. Barrier 1: Prohibitive network latency and bandwidth limits • Makes hybrid analytics unfeasible. Barrier 2: Copying data to cloud • Difficult to maintain copies • Data security and governance • Costs of another silo. Step 1: Hybrid Cloud for Burst Compute Capacity • Orchestrates compute access to on-prem data • Working set of data, not FULL set of data • Local performance • Scales elastically • On-Prem Cluster Offload (both Compute & I/O). Step 2: Online Migration of Data Per Policy • Flexible timing to migrate, with fewer dependencies • Instead of a hard switchover, migrate at your own pace • Moves the data per policy, e.g. last 7 days. [Diagram labels: AWS Public Cloud IaaS and On-Premises Datacenter, each with Spark, Presto, Hive, TensorFlow and Alluxio Data Orchestration and Control Service; Connectivity]
  27. 27. “Zero-copy” bursting under the hood [Animation labels: Framework (Spark, Presto, Hive, TensorFlow); RAM; Trades Directory; Customers Directory; Data requests: Read file /trades/us, Read file /trades/us again, Read file /trades/top; Variable latency with throttling]
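The read sequence on this slide (a first read of /trades/us, followed by repeated reads of the same files) can be illustrated with a minimal read-through cache sketch. This is not Alluxio's implementation or API: the `ReadThroughCache` class, the dictionary standing in for the remote store, and the file contents are all hypothetical and only show the general idea that the first read of a file goes to the remote store while later reads are served locally.

```python
class ReadThroughCache:
    """Toy read-through cache (hypothetical; not Alluxio's API)."""

    def __init__(self, remote_store):
        self.remote_store = remote_store  # stands in for the remote, on-prem store
        self.cache = {}                   # stands in for local RAM on the compute side
        self.remote_reads = 0             # how many reads crossed the network

    def read(self, path):
        # Fetch from the remote store only on the first request for a path.
        if path not in self.cache:
            self.remote_reads += 1
            self.cache[path] = self.remote_store[path]
        return self.cache[path]


# Hypothetical files mirroring the paths named on the slide.
remote = {"/trades/us": b"us trades", "/trades/top": b"top trades"}
cache = ReadThroughCache(remote)

for path in ["/trades/us", "/trades/us", "/trades/top", "/trades/top", "/trades/us"]:
    cache.read(path)

print(cache.remote_reads)  # 2: one remote fetch per distinct file
```

Only the working set (the files actually requested) ends up in the cache, which matches the "working set of data, not FULL set of data" point made earlier in the deck.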
  28. 28. Feature Highlight: Policy-driven Data Management. Policy defined: Move data > 90 days old to GCS. Policy interval: Every day (policy applied every day). [Diagram labels: Framework (Spark, Presto, Hive, TensorFlow); RAM, SSD, Disk; New Trades]
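The policy on this slide (move data older than 90 days to GCS, evaluated daily) amounts to an age filter over file timestamps. The sketch below is only an illustration of that selection step under stated assumptions: the function `select_for_migration`, the file names, and the timestamps are hypothetical, and the actual move to GCS is not modelled.

```python
from datetime import datetime, timedelta


def select_for_migration(files, now, max_age_days=90):
    """Return paths last modified more than max_age_days before now."""
    cutoff = now - timedelta(days=max_age_days)
    return sorted(path for path, modified in files.items() if modified < cutoff)


# Hypothetical file listing: path -> last-modified time.
files = {
    "/trades/2019-01.parquet": datetime(2019, 1, 31),
    "/trades/2019-08.parquet": datetime(2019, 8, 31),
    "/trades/2019-11.parquet": datetime(2019, 11, 20),
}

# One daily policy run, evaluated on the date of the talk.
now = datetime(2019, 11, 21)
to_move = select_for_migration(files, now)
print(to_move)  # ['/trades/2019-01.parquet']
```

Running the same function once per day with the current date reproduces the "policy interval: every day" behaviour described on the slide.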
  29. 29. DEMO
  30. 30. Demo: initialization action installs Alluxio in Dataproc. #1 - Access data in Google Cloud Store [Diagram labels: Presto and Hive on a Google Dataproc Cluster; Metadata & Data cache; Compute-driven Continuous sync]
  31. 31. Demo: initialization action installs Alluxio in Dataproc. #2 - Access data from remote Hadoop cluster (simulated as Dataproc) [Diagram labels: Presto and Hive on a Google Dataproc Cluster; Metadata & Data cache; Compute-driven Continuous sync]
  32. 32. Get Started with Alluxio on Dataproc. A single command creates a Dataproc cluster with Alluxio installed: $ gcloud dataproc clusters create roderickyao-alluxio --initialization-actions gs://alluxio-public/enterprise-dataproc/2.1.0-1.0/alluxio-dataproc.sh --metadata alluxio_root_ufs_uri=gs://ryao-test/alluxio-test/,alluxio_site_properties="alluxio.master.mount.table.root.option.fs.gcs.accessKeyId=<KEYID>;alluxio.master.mount.table.root.option.fs.gcs.secretAccessKey=<SECRET>",alluxio_license_base64=$(cat alluxio-enterprise-license.json | base64 | tr -d "\n"),alluxio_download_path=gs://ryao-test/alluxio-enterprise-2.1.0-1.0.tar.gz Tutorial: Getting started with Dataproc and Alluxio: https://www.alluxio.io/products/google-cloud/gcp-dataproc-tutorial/
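The `--metadata` argument in this command is a single comma-separated string, and the license is passed as base64 with newlines stripped. As a rough sketch of how that string could be assembled programmatically, the Python below builds the argument list without running it. The metadata keys and the `;` separator between site properties are copied from the slide; the helper name `build_create_command`, the cluster name, the bucket paths, and the license bytes are placeholders introduced here for illustration.

```python
import base64


def build_create_command(cluster, init_action, root_ufs_uri,
                         site_properties, license_bytes, download_path):
    """Assemble the gcloud argument list shown on the slide (not executed here)."""
    # b64encode emits no line breaks, matching `base64 | tr -d "\n"` in the shell form.
    license_b64 = base64.b64encode(license_bytes).decode("ascii")
    properties = ";".join(f"{key}={value}" for key, value in site_properties.items())
    metadata = ",".join([
        f"alluxio_root_ufs_uri={root_ufs_uri}",
        f"alluxio_site_properties={properties}",
        f"alluxio_license_base64={license_b64}",
        f"alluxio_download_path={download_path}",
    ])
    return ["gcloud", "dataproc", "clusters", "create", cluster,
            "--initialization-actions", init_action,
            "--metadata", metadata]


# Placeholder values; substitute real bucket paths, keys, and license contents.
command = build_create_command(
    cluster="example-alluxio",
    init_action="gs://alluxio-public/enterprise-dataproc/2.1.0-1.0/alluxio-dataproc.sh",
    root_ufs_uri="gs://example-bucket/alluxio-test/",
    site_properties={
        "alluxio.master.mount.table.root.option.fs.gcs.accessKeyId": "<KEYID>",
        "alluxio.master.mount.table.root.option.fs.gcs.secretAccessKey": "<SECRET>",
    },
    license_bytes=b'{"license": "placeholder"}\n',
    download_path="gs://example-bucket/alluxio-enterprise-2.1.0-1.0.tar.gz",
)
print(" ".join(command))
```

Passing the list to `subprocess.run(command, check=True)` would invoke gcloud without shell quoting issues, assuming the gcloud CLI is installed and authenticated.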
  33. 33. Resources Alluxio Initialization Action - https://github.com/GoogleCloudPlatform/dataproc-initialization-actions/tree/master/alluxio Alluxio with Google Cloud Storage documentation - https://docs.alluxio.io/ee/user/stable/en/ufs/GCS.html
  34. 34. Questions? alluxio.io
