
Bursting Apache Spark Workloads to the Cloud on Remote Data


Published on

Alluxio Community Office Hour
Mar 10, 2020

For more Alluxio events:

Speaker: Bin Fan

Running analytic workloads in Spark against data that lives in other data centers or clouds can be challenging. Network I/O can also bottleneck Spark jobs that need to read large amounts of data. A common solution is to deploy an HDFS cluster close to Spark as a caching layer, manually copy the input data into HDFS first, and purge it afterward. But this ETL process is both time-consuming and error-prone.

A simpler and more efficient solution is to run Spark on Alluxio as a distributed cache on top of the remote data source. By caching data transparently based on access patterns and keeping the working set close to compute, Alluxio gives Spark jobs much higher I/O throughput and better data locality. Alluxio also provides data accessibility and abstraction for deployments in hybrid and multi-cloud environments.
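
The caching idea in the paragraph above can be illustrated with a toy read-through cache. This is only a sketch of the general technique (serve repeated reads locally, fetch from the remote source on a miss, evict the least recently used data when full); it is not Alluxio's implementation, and the class name, the `fetch_remote` callback, and the sample paths are all hypothetical.

```python
from collections import OrderedDict
from typing import Callable


class ReadThroughCache:
    """Toy read-through cache with LRU eviction (illustrative, not Alluxio code)."""

    def __init__(self, fetch_remote: Callable[[str], bytes], capacity_bytes: int):
        self._fetch_remote = fetch_remote      # hypothetical call to the remote store
        self._capacity = capacity_bytes
        self._blocks: "OrderedDict[str, bytes]" = OrderedDict()
        self._used = 0
        self.hits = 0
        self.misses = 0

    def read(self, path: str) -> bytes:
        if path in self._blocks:
            self._blocks.move_to_end(path)     # mark as most recently used
            self.hits += 1
            return self._blocks[path]
        self.misses += 1
        data = self._fetch_remote(path)        # slow path: go to the remote source
        self._admit(path, data)
        return data

    def _admit(self, path: str, data: bytes) -> None:
        if len(data) > self._capacity:
            return                             # too large to cache; serve uncached
        while self._used + len(data) > self._capacity:
            _, evicted = self._blocks.popitem(last=False)  # drop least recently used
            self._used -= len(evicted)
        self._blocks[path] = data
        self._used += len(data)


# Hypothetical remote data set: three 4-byte files, with room to cache two.
remote = {"/sales/day1": b"a" * 4, "/sales/day2": b"b" * 4, "/sales/day3": b"c" * 4}
cache = ReadThroughCache(remote.__getitem__, capacity_bytes=8)
for p in ["/sales/day1", "/sales/day2", "/sales/day1", "/sales/day3", "/sales/day1"]:
    cache.read(p)
print(cache.hits, cache.misses)  # 2 3
```

The frequently read file stays cached while the one-off reads are evicted, which is the "working set" behavior the abstract describes.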

In this Office Hour, we will go over how to:
- Burst on-prem Spark workloads to the cloud with Alluxio so Spark can seamlessly read from and write to remote data storage
- Use Alluxio as the input/output for Spark applications
- Save and load Spark RDDs and DataFrames with Alluxio
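
The last two items can be sketched in PySpark roughly as follows. This is a minimal sketch under stated assumptions, not the code shown in the office hour: it assumes the Alluxio client jar is already on the Spark driver and executor classpath, that an Alluxio master is reachable at the placeholder host `alluxio-master` on port 19998 (Alluxio's default master RPC port), and that the hypothetical directory `/office-hour-demo` is writable.

```python
def alluxio_uri(path: str, host: str = "alluxio-master", port: int = 19998) -> str:
    """Build an alluxio:// URI. The host is a placeholder for your own master."""
    return f"alluxio://{host}:{port}/{path.lstrip('/')}"


def round_trip(spark, base_dir: str = "/office-hour-demo"):
    """Save and load an RDD and a DataFrame through Alluxio paths."""
    sc = spark.sparkContext

    # RDD: write as text files, then read them back.
    # saveAsTextFile fails if the target directory already exists.
    rdd_path = alluxio_uri(f"{base_dir}/rdd_text")
    sc.parallelize(["alpha", "beta", "gamma"]).saveAsTextFile(rdd_path)
    loaded_rdd = sc.textFile(rdd_path)

    # DataFrame: write as Parquet, then read it back.
    df_path = alluxio_uri(f"{base_dir}/df_parquet")
    df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "label"])
    df.write.mode("overwrite").parquet(df_path)
    loaded_df = spark.read.parquet(df_path)

    return loaded_rdd.count(), loaded_df.count()


def main():
    # Imported here so the helpers above can be used without Spark installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("alluxio-io-sketch").getOrCreate()
    try:
        print(round_trip(spark))
    finally:
        spark.stop()
```

To try it, call `main()` from a script submitted with `spark-submit`. The only Alluxio-specific part is the `alluxio://` URI scheme; the Spark read and write calls are the same ones used for HDFS or S3 paths.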

Published in: Software

  1. Office Hour: Bursting Apache Spark Workloads to the Cloud on Remote Data. 2020/03/10. Bin Fan | Founding Engineer | Alluxio
  2. Big data journey & innovation for enterprises. Co-located: co-located compute & HDFS on the same cluster (MR / Hive, HDFS). Disaggregated: disaggregated compute & HDFS on the same cluster (Hive, HDFS). HDFS for Hybrid Cloud: burst HDFS data in the cloud, public or private. Support analytics across datacenters: enable & accelerate access to big data across data centers.
  3. Challenge: Data Gets Increasingly Remote from Compute ▪ Challenging Scenarios ▪ Data-driven initiatives in need of more compute ▪ Hadoop system on-prem, but it’s remote ▪ Object data growth in a cloud region, but it’s remote ▪ How to make remote data local to the compute without copies? ▪ Business benefits ▪ Data immediately available for quicker data-driven insights ▪ More cloud computing power to solve problems quicker ▪ Up to 80% lower egress costs. Diagram label: Datacenter.
  4. Solution: “Zero-copy” bursting to scale to the cloud. Accelerate big data frameworks on the public cloud (Spark and Alluxio on the same instance / container). Burst big data workloads in hybrid cloud environments (Spark and Alluxio on the same instance / container; on premise).
  5. Alluxio is Open-Source Data Orchestration. Data Orchestration for the Cloud. Client-side interfaces: Java File API, HDFS Interface, S3 Interface, REST API, POSIX Interface. Storage drivers: HDFS Driver, GCS Driver, S3 Driver, Azure Driver.
  6. Zero-Copy Burst: View the I/O Stack. FAST: 10^4 - 10^5 MB/s (often). MODERATE: 10^3 - 10^4 MB/s (limited). SLOW: 10 - 10^3 MB/s (only when necessary). Storage media shown: Mem, SSD, HDD.
  7. The Alluxio Story. 2013: Originated as the Tachyon project at UC Berkeley AMPLab by Ph.D. student Haoyuan (H.Y.) Li, now Alluxio CTO. 2015: Open source project established & company to commercialize Alluxio founded. Goal: Orchestrate Data at Memory Speed for the Cloud for data-driven apps such as Big Data Analytics, ML and AI. 2018, 2019: recognition including 2019 Top 10 Big Data and 2019 Top 10 Cloud Software.
  8. Fast-growing Open Source Community. 4000+ GitHub stars. 1000+ contributors. Apache 2.0 licensed. Join the community on Slack (FAQ for this office hour). Contribute to source code.
  9. Alluxio: Key innovations. Data Elasticity with a unified namespace: abstract data silos & storage systems to independently scale data on-demand with compute. Data Accessibility for popular APIs & API translation: run Spark, Hive, Presto, ML workloads on your data located anywhere. Data Locality with Intelligent Multi-tiering: accelerate big data workloads with transparent tiered local data.
  10. Data Locality with Intelligent Multi-tiering. Local performance from remote data using multi-tier storage. Tiers: Hot (RAM), Warm (SSD), Cold (HDD). Read & write buffering, transparent to the app. Policies for pinning, promotion/demotion, TTL.
  11. Data Accessibility via popular APIs and API Translation. Convert from client-side interface to native storage interface. Client-side interfaces: Java File API, HDFS Interface, S3 Interface, REST API, FUSE Interface. Storage drivers: HDFS Driver, Swift Driver, S3 Driver, NFS Driver.
  12. Data Elasticity via Unified Namespace. Enables effective data management across different under stores. Uses mounting with transparent naming.
  13. Unified Namespace: Global Data Accessibility. Transparent access to under storage makes all enterprise data available locally. SUPPORTS • HDFS • NFS • OpenStack • Ceph • Amazon S3 • Azure • Google Cloud. IT OPS FRIENDLY • Storage mounted into Alluxio by central IT • Security in Alluxio mirrors source data • Authentication through LDAP/AD • Wireline encryption. Diagram labels: HDFS #1, Object Store, NFS, HDFS #2.
  14. Use case | Cloud bursting on-premise data. Leading Hedge Fund: fastest growing big hedge fund managing $46 billion for investors ▪ Compute scales elastically independent of storage ▪ Faster time to insights with seamless data orchestration ▪ Accelerated workloads with memory-first data approach. Diagram labels: DATA ORCHESTRATION, SPARK, HDFS, Public Cloud.
  15. Machine Learning Case Study. Challenge: Gain end-to-end view of business with large volume of data. Queries were slow / not interactive, resulting in operational inefficiency. Solution: ETL data from Teradata to Alluxio. Impact: Faster time to market. “Now we don’t have to work Sundays.” Diagram labels: SPARK, TERADATA.
  16. Walmart Use case. Why Walmart chose Alluxio’s “Zero-Copy” burst solution: • No requirement to persist data into the cloud • Improved query performance and no network hops on recurrent queries • Lower costs without the need for creating data copies
  17. Enterprises moving towards independent compute & storage
  18. Incredible Open Source Momentum with growing community. 1000+ contributors & growing. 4.5K+ GitHub stars. Apache 2.0 licensed. Hundreds of thousands of downloads. Join the conversation on Slack.
  19. Questions? Join the Alluxio Community | Twitter: @alluxio
  20. Problem: HDFS cluster is compute-bound & complex to maintain. Solution: “Zero-copy” bursting to scale to the cloud. Barrier 1: Prohibitive network latency and bandwidth limits • Makes hybrid analytics unfeasible. Barrier 2: Copying data to cloud • Difficult to maintain copies • Data security and governance • Costs of another silo. Step 1: Hybrid Cloud for Burst Compute Capacity • Orchestrates compute access to on-prem data • Working set of data, not FULL set of data • Local performance • Scales elastically • On-Prem Cluster Offload (both Compute & I/O). Step 2: Online Migration of Data Per Policy • Flexible timing to migrate, with fewer dependencies • Instead of hard switch over, migrate at own pace • Moves the data per policy, e.g. last 7 days. Diagram labels: AWS Public Cloud IaaS (Spark, Presto, Hive, TensorFlow, Alluxio Data Orchestration and Control Service); Connectivity; On Premises Datacenter (Spark, Presto, Hive, TensorFlow, Alluxio Data Orchestration and Control Service).
  21. Use case | Data orchestration for agility. China Unicom: leading Chinese telco serving 320 million subscribers ▪ Single namespace to access & address all data ▪ Data local to compute accelerates workloads. Diagram labels: DATA ORCHESTRATION, SPARK, HDFS, Kubernetes, OBJECT, HBASE, ETL.
  22. Analytics Use Case: Top Retailer. Challenge: Bottleneck in trend analysis of mission-critical daily sales and inventory management. Queries were slow / not interactive, resulting in operational inefficiency. Solution: With Alluxio, data queries are 10X faster. Impact: Higher operational efficiency. Diagram labels: SPARK, HDFS.
  23. Customer Insights Use Case: Top Telecom. Challenge: Desired a central view of consumer information in near real time for proactive support. Many HDFS clusters, different distributions, many incompatible versions. On-prem & cloud. Integration through heavy ETL. Solution: Alluxio integrates data into central catalog for fast access to consumer interaction records. Impact: Reduced integration time; faster data speed & freshness. Diagram labels: HADOOP, ML, HDFS, ETL, HDP HDFS, CDH HDFS, MAPR HDFS.
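
The unified namespace described in slides 12, 13, and 23 (several HDFS clusters and object stores mounted under one path tree) can be pictured as a mount table with longest-prefix matching. The sketch below is a simplified model of that idea only; it is not Alluxio's code, and the `MountTable` class, host names, and bucket name are hypothetical.

```python
from typing import Dict, Optional


class MountTable:
    """Toy mount table: maps namespace paths to under-store URIs (illustrative)."""

    def __init__(self) -> None:
        self._mounts: Dict[str, str] = {}

    def mount(self, alluxio_path: str, ufs_uri: str) -> None:
        # Normalize so "/archive/" and "/archive" are the same mount point.
        self._mounts[alluxio_path.rstrip("/") or "/"] = ufs_uri.rstrip("/")

    def resolve(self, alluxio_path: str) -> Optional[str]:
        """Return the under-store URI for a path, using the longest matching mount."""
        best = None
        for mount_point in self._mounts:
            prefix = "/" if mount_point == "/" else mount_point + "/"
            if alluxio_path == mount_point or alluxio_path.startswith(prefix):
                if best is None or len(mount_point) > len(best):
                    best = mount_point
        if best is None:
            return None
        suffix = alluxio_path[len(best):].lstrip("/")
        base = self._mounts[best]
        return f"{base}/{suffix}" if suffix else base


# Hypothetical layout: two HDFS clusters and one object store under one tree.
table = MountTable()
table.mount("/", "hdfs://namenode-1:8020/warehouse")
table.mount("/archive", "s3://example-bucket/archive")
table.mount("/archive/hdp", "hdfs://namenode-2:8020/data")

print(table.resolve("/sales/2020/part-0"))
print(table.resolve("/archive/logs/a.gz"))
print(table.resolve("/archive/hdp/x"))
```

An application only ever sees paths such as `/archive/logs/a.gz`; which storage system backs a path is decided by the mount table, so data can be moved or remounted without changing application code.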