Be the first to like this
Alluxio Community Office Hour
Mar 10, 2020
For more Alluxio events: https://www.alluxio.io/events/
Speaker: Bin Fan
Accessing data to run analytic workloads in Spark across data centers and/or clouds can be challenging. Additionally, network I/O can bottleneck Spark jobs that need to read a large amount of data. A common solution is to deploy an HDFS cluster closer to Spark as a caching layer and manually copy the input data to HDFS first, purging it afterward. But this ETL process can be both time-consuming and also error-prone.
A more efficient and simpler solution is to run Spark on Alluxio as a distributed cache on top of the remote data source. While caching data transparently based on access patterns and storing the working set closer, Alluxio provides Spark jobs much higher I/O throughput with enhanced data locality. In addition, Alluxio also provides data accessibility and abstraction for deployments in hybrid and multi-cloud environments.
In this Office Hour, we will go over how to:
- Burst on-prem Spark workloads to the cloud with Alluxio so Spark can seamlessly read from and write to remote data storage
- Use Alluxio as the input/output for Spark applications
- Save and load Spark RDDs and Dataframes with Alluxio