Accelerate Spark Workloads on S3

Alluxio Webinar
Dipti Borkar
06/27/2019

Published in: Technology


  1. Accelerate Spark workloads on S3. Dipti Borkar | Product @ Alluxio
  2. Analytics on cloud object storage: becoming increasingly popular • HDFS is complex and was not created for cloud environments • S3 is very easy to use, with a simple API, and is cost effective [Diagram: big data frameworks (Spark) on the public cloud]
  3. Running Spark on S3 – Different Options: #1 Native Apache Spark install on AWS EC2; #2 Bundled into services like AWS EMR; #3 Managed services
  4. Challenges with Analytics on S3 • Not built for interactive analytics • Expensive metadata operations such as list and rename • Eventual consistency • Inconsistent performance [Diagram: big data frameworks (Spark) on the public cloud]
  5. #2 Data access options on AWS EMR [Diagram: instances running Presto and Hive, with HDFS and EMRFS]
  6. Using Spark with HDFS on AWS EMR [Diagram: instances running Presto and Hive, with HDFS and EMRFS; annotation: "Manual distcp / No sync"]
  7. Using EMRFS on AWS EMR [Diagram: instances running Presto and Hive, with HDFS and EMRFS; annotation: "No data caching"]
  8. Spark workloads on S3 with Alluxio: compute caching for Spark on S3 • Provides a data caching layer for Spark • Provides strong consistency for metadata operations and faster performance • Provides API compatibility with HDFS & S3 • S3 is eventually consistent, making it hard to predict query results • Allows data outside of S3 to be analyzed as well [Diagram: Spark and Alluxio on the same instance / container, accelerating big data frameworks on the public cloud]
  9. Using Alluxio with AWS EMR [Diagram: instances running Presto and Hive with a metadata & data cache, plus HDFS and EMRFS; annotation: "Compute-driven continuous sync"]
  10. Alluxio for Spark • Data sharing between jobs • Data resilience during application crashes • Consolidate memory usage and alleviate GC issues
  11. Data Sharing Between Jobs: storage engine & execution engine in the same process. Inter-process sharing is slowed down by network I/O [Diagram: in-memory storage with blocks 1–4]
  12. Data Sharing Between Jobs: storage & execution engine separated. Inter-process sharing can happen at memory speed [Diagram: blocks on HDFS disk and in memory]
  13. Data Resilience during Crashes: storage engine & execution engine in the same process. A process crash requires network I/O to re-read the data [Diagram: in-memory storage with blocks 1–4]
  14. Data Resilience during Crashes: storage engine & execution engine in the same process. A process crash requires network I/O to re-read the data [Diagram: in-memory storage with blocks 1–4; crash]
  15. Data Resilience during Crashes: storage engine & execution engine in the same process. A process crash requires network I/O to re-read the data [Diagram: blocks 1–4; crash]
  16. Data Resilience during Crashes: storage & execution engine separated. A process crash only needs memory I/O to re-read the data [Diagram: blocks on HDFS disk and in memory]
  17. Data Resilience during Crashes: storage & execution engine separated. A process crash only needs memory I/O to re-read the data [Diagram: blocks on HDFS disk and in memory; crash]
  18. The Alluxio Story • Originated as the Tachyon project at UC Berkeley's AMPLab, by then Ph.D. student and now Alluxio CTO Haoyuan (H.Y.) Li • Open source project established & company to commercialize Alluxio founded • Goal: orchestrate data at memory speed for the cloud, for data-driven apps such as big data analytics, ML, and AI [Timeline: 2014, 2015, 2018, 2019]
  19. Data Ecosystem - Beta; Data Ecosystem 1.0 [Diagram: COMPUTE and STORAGE in each]
  20. Data Orchestration for the Cloud: independent scaling of compute & storage [Diagram: client-side interfaces (Java File API, HDFS Interface, S3 Interface, REST API, POSIX Interface) and storage drivers (HDFS Driver, Swift Driver, S3 Driver, NFS Driver)]
  21. Alluxio – Key innovations • Data Locality with Intelligent Multi-tiering: accelerate big data workloads with transparent tiered local data • Data Accessibility for popular APIs & API translation: run Spark, Hive, Presto, and ML workloads on your data located anywhere • Data Elasticity with a unified namespace: abstract data silos & storage systems to independently scale data on-demand with compute
  22. Data Locality with Intelligent Multi-tiering: local performance from remote data using multi-tier storage • Hot: RAM • Warm: SSD • Cold: HDD • Read & write buffering • Transparent to app • Policies for pinning, promotion/demotion, TTL
  23. Data Accessibility via popular APIs and API Translation: convert from client-side interface to native storage interface [Diagram: client-side interfaces (Java File API, HDFS Interface, S3 Interface, REST API, POSIX Interface) and storage drivers (HDFS Driver, Swift Driver, S3 Driver, NFS Driver)]
  24. Data Elasticity via Unified Namespace: enables effective data management across different under stores; uses mounting with transparent naming
  25. Unified Namespace: Global Data Accessibility. Transparent access to under storage makes all enterprise data available locally. SUPPORTS • HDFS • NFS • OpenStack • Ceph • Amazon S3 • Azure • Google Cloud. IT OPS FRIENDLY • Storage mounted into Alluxio by central IT • Security in Alluxio mirrors source data • Authentication through LDAP/AD • Wireline encryption [Diagram: HDFS #1, Object Store, NFS, HDFS #2]
  26. Challenges with Hybrid Cloud & Alluxio Solution: burst big data workloads in hybrid cloud environments. Challenges • Accessing data over WAN is too slow • Copying data to the compute cloud is time consuming and complex • Using another storage system like S3 means expensive application changes • Using S3 via the HDFS connector leads to extremely low performance. Solution benefits • Same performance as local • Same end-user experience • 100% of I/O is offloaded [Diagram: Spark and Alluxio on the same instance / container; HDFS for hybrid cloud]
  27. Challenges with supporting more frameworks & Alluxio Solution: enable big data on object stores across single or multiple clouds • Running new frameworks on an existing HDFS cluster can dramatically affect performance of existing workloads • In a disaggregated environment, copying data to multiple compute clouds is time consuming and error prone • Migrating applications to new storage systems is complex & time consuming • Storing and managing multiple copies of the data becomes expensive [Diagram: Presto or Spark with Alluxio in the same data center / region, with any object store or HDFS; caption: "Support more frameworks"]
  28. Multi-cloud access for your Spark workloads: orchestrate data frameworks on the public cloud [Diagram: Spark, Presto, and Hive with Alluxio on any public / private cloud]
  29. Alluxio Reference Architecture [Diagram: applications with Alluxio clients; Alluxio master and standby master (ZooKeeper / RAFT); Alluxio workers with RAM / SSD / HDD; WAN; Under Store 1 and Under Store 2]
  30. Demo: Bootstrapping Alluxio with AWS EMR
      aws emr create-cluster \
        --release-label ${RELEASE_LABEL} \
        --instance-count ${NUM_INSTANCES} \
        --instance-type ${INSTANCE_TYPE} \
        --applications Name=Presto Name=Hive Name=Spark \
        --name "${CLUSTER_NAME}" \
        --bootstrap-actions Path=${BOOTSTRAP_PATH},Args=[${ALLUXIO_DOWNLOAD_URL},${ROOT_UFS_URI},${ADDITIONAL_PROPERTIES}] \
        --configurations file:///Users/diptiborkar/Downloads/alluxio-emr.json \
        --ec2-attributes KeyName=${KEY_PAIR}
  31. Bootstrapping Alluxio with AWS EMR
      # KEY_PAIR and S3_ID are referenced but not defined on these slides; set them before running.
      RELEASE_LABEL="emr-5.23.0"
      NUM_INSTANCES=5
      INSTANCE_TYPE="m4.xlarge"
      CLUSTER_NAME="emr-v12"
      BOOTSTRAP_PATH="s3://dipti-alx-2019/emr/alluxio-emr.sh"
      ALLUXIO_DOWNLOAD_URL="https://downloads.alluxio.io/downloads/files/2.0.0/alluxio-2.0.0-RC3-bin.tar.gz"
      ROOT_UFS_URI="s3a://dipti-alx-2019/emr/ufs/"
      ADDITIONAL_PROPERTIES="alluxio.underfs.s3.owner.id.to.username.mapping=${S3_ID}=hadoop;alluxio.user.file.writetype.default=ASYNC_THROUGH"
  32. Enterprises moving towards independent compute & storage. Learn more
  33. Incredible Open Source Momentum with growing community • 1000+ contributors & growing • 4000+ Git Stars • Apache 2.0 Licensed • Hundreds of thousands of downloads • Join the conversation on Slack: alluxio.io/slack
  34. Questions? Join the Alluxio Community: http://alluxio.io/ | @alluxio
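
Slide 4 lists rename among the expensive metadata operations on S3. A minimal Python sketch of why, assuming a flat key-value object store with no native rename: moving a "directory" becomes one list call plus a copy and a delete per object. `ToyObjectStore` and `rename_prefix` are hypothetical names for illustration and are not part of S3, EMRFS, or Alluxio; real listings are also paginated, which this toy ignores.

```python
class ToyObjectStore:
    """In-memory stand-in for a flat object store; counts every request made."""

    def __init__(self):
        self.objects = {}
        self.requests = 0

    def put(self, key, data):
        self.requests += 1
        self.objects[key] = data

    def list(self, prefix):
        # Simplification: one request no matter how many keys match.
        self.requests += 1
        return sorted(k for k in self.objects if k.startswith(prefix))

    def copy(self, src, dst):
        self.requests += 1
        self.objects[dst] = self.objects[src]

    def delete(self, key):
        self.requests += 1
        del self.objects[key]


def rename_prefix(store, src_prefix, dst_prefix):
    """Emulate a directory rename: list, then copy + delete each object."""
    moved = 0
    for key in store.list(src_prefix):
        new_key = dst_prefix + key[len(src_prefix):]
        store.copy(key, new_key)
        store.delete(key)
        moved += 1
    return moved


store = ToyObjectStore()
for i in range(4):
    store.put(f"tmp/job1/part-{i}", b"data")
store.requests = 0  # count only the rename itself
moved = rename_prefix(store, "tmp/job1/", "out/job1/")
print(moved, store.requests)  # 4 objects: 1 list + 4 copies + 4 deletes = 9 requests
```

In this model the request count grows linearly with the number of objects (1 + 2n), whereas a file system with real directories can rename with a single metadata update.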

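Slide 22 describes hot, warm, and cold tiers (RAM, SSD, HDD) with promotion and demotion policies. The sketch below is a toy Python model of that idea, assuming LRU ordering within each tier, promotion to the top tier on every read, and demotion of the least recently used block on overflow. `TieredCache` is a hypothetical class, not Alluxio's implementation; pinning and TTL are left out.

```python
from collections import OrderedDict


class TieredCache:
    """Toy multi-tier block cache: tier 0 is hottest (e.g. RAM), the last is coldest (e.g. HDD)."""

    def __init__(self, capacities):
        # One LRU-ordered dict per tier; capacities are counted in blocks.
        self.capacities = list(capacities)
        self.tiers = [OrderedDict() for _ in self.capacities]

    def tier_of(self, block_id):
        """Return the index of the tier holding the block, or None if it is not cached."""
        for level, tier in enumerate(self.tiers):
            if block_id in tier:
                return level
        return None

    def put(self, block_id, data, level=0):
        """Insert into a tier; on overflow, demote that tier's LRU block one tier down."""
        tier = self.tiers[level]
        tier[block_id] = data
        tier.move_to_end(block_id)  # most recently used sits at the end
        if len(tier) > self.capacities[level]:
            victim, victim_data = tier.popitem(last=False)  # least recently used
            if level + 1 < len(self.tiers):
                self.put(victim, victim_data, level + 1)
            # Otherwise the block leaves the cache; the under store still has it.

    def get(self, block_id):
        """Read a block and promote it to the hottest tier; None signals a cache miss."""
        level = self.tier_of(block_id)
        if level is None:
            return None
        data = self.tiers[level].pop(block_id)
        self.put(block_id, data, 0)
        return data


cache = TieredCache([2, 2, 2])  # RAM, SSD, HDD capacities in blocks
for block in ["a", "b", "c"]:
    cache.put(block, b"...")
print(cache.tier_of("a"))  # 1: demoted to the warm tier when "c" arrived
cache.get("a")
print(cache.tier_of("a"))  # 0: promoted back to the hot tier on read
```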
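
Slides 24 and 25 describe a unified namespace built by mounting under stores. A minimal sketch of path resolution, assuming a mount table where the longest matching mount point wins. `MountTable` is a hypothetical name, the HDFS address is made up for illustration, and the root URI is the `ROOT_UFS_URI` value from slide 31; Alluxio's actual resolution logic may differ.

```python
import posixpath


class MountTable:
    """Toy unified namespace: maps namespace paths onto under-store URIs."""

    def __init__(self, root_ufs):
        # The root mount always exists, so every absolute path resolves somewhere.
        self.mounts = {"/": root_ufs.rstrip("/")}

    def mount(self, namespace_path, ufs_uri):
        self.mounts[posixpath.normpath(namespace_path)] = ufs_uri.rstrip("/")

    def resolve(self, path):
        """Translate a namespace path to an under-store URI (longest mount point wins)."""
        path = posixpath.normpath(path)
        for mount_point in sorted(self.mounts, key=len, reverse=True):
            prefix = mount_point.rstrip("/") + "/"
            if path == mount_point or path.startswith(prefix):
                rest = path[len(mount_point):].lstrip("/")
                base = self.mounts[mount_point]
                return f"{base}/{rest}" if rest else base
        raise KeyError(path)


table = MountTable("s3a://dipti-alx-2019/emr/ufs/")          # root under store (slide 31)
table.mount("/warehouse", "hdfs://namenode:8020/warehouse")  # hypothetical second store
print(table.resolve("/logs/app.log"))
print(table.resolve("/warehouse/t1/part-0"))
```

Matching on `mount_point + "/"` rather than on the bare prefix keeps a path such as `/warehouse2/x` from being routed to the `/warehouse` mount.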