Building a Cloud Native Stack with EMR Spark, Alluxio, and S3

Published on

Alluxio Community Office Hour
Aug 27, 2019

Speakers:
Bin Fan
Nakkul Sreenivas

Published in: Software

  1. 2019/08/26 Office Hour | Website: www.alluxio.io | Q&A: https://alluxio.io/slack
     Building a Cloud Native Stack with EMR Spark, Alluxio, and S3
     Bin Fan, Nakkul Sreenivas
  2. AWS S3: The Default Data Lake on AWS
     ▪ Why we love it
       ▪ Cheap
       ▪ Highly available
       ▪ Fully managed
       ▪ Really large scale
     ▪ Still, limitations & differences:
       ▪ Slow object listing
       ▪ Expensive rename
       ▪ Throughput throttling
       ▪ Variable performance
       ▪ No data locality on computation
       ▪ No user-managed cache
  3. Alluxio is Open-Source Data Orchestration
     Data Orchestration for the Cloud
     ▪ Interfaces: Java File API, HDFS Interface, S3 Interface, REST API, POSIX Interface
     ▪ Drivers: HDFS Driver, GCS Driver, S3 Driver, Azure Driver
  4. Why put Alluxio in AWS
     ▪ Provide better or consistent performance
     ▪ Add a data caching tier to S3: cache hot data/metadata
     ▪ Familiar FS semantics: listing, rename
     ▪ Keep data local to applications like Spark
     ▪ Compatible with other existing services like Hadoop, Hive, Presto
     ▪ Mount multiple data sources into the namespace
       ▪ Files/objects in different storage: GCS, Azure, HDFS
       ▪ Objects in other S3 buckets
  5. The Alluxio Story (timeline: 2013, 2015, 2018, 2019)
     ▪ Originated as the Tachyon project at UC Berkeley AMPLab by Ph.D. student Haoyuan (H.Y.) Li, now Alluxio CTO
     ▪ Open source project established & company to commercialize Alluxio founded
     ▪ Goal: Orchestrate data at memory speed for the cloud, for data-driven apps such as big data analytics, ML, and AI
     ▪ 2019 Top 10 Big Data; 2019 Top 10 Cloud Software
  6. Fast-growing Open Source Community
     ▪ 4000+ GitHub stars
     ▪ 1000+ contributors
     ▪ Apache 2.0 licensed
     ▪ Join the community on Slack (FAQ for this office hour): alluxio.io/slack
     ▪ Contribute to source code: github.com/alluxio/alluxio
  7. Data Locality via Intelligent Multi-tiering
     ▪ Local performance from remote data using multi-tier storage
     ▪ Tiers: RAM (hot), SSD (warm), HDD (cold)
     ▪ Read & write buffering, transparent to the app
     ▪ Policies for pinning, promotion/demotion, TTL
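The promotion/demotion idea on slide 7 can be illustrated with a toy model. This is only a sketch under simplifying assumptions, not Alluxio's implementation: the `TieredStore` class, its method names, and the LRU eviction rule are all hypothetical, and pinning and TTL are left out.

```python
from collections import OrderedDict


class TieredStore:
    """Toy multi-tier block store; tier 0 is the fastest (e.g. RAM)."""

    def __init__(self, capacities):
        # capacities: list of (tier_name, max_blocks), fastest tier first.
        self.names = [name for name, _ in capacities]
        self.caps = [cap for _, cap in capacities]
        self.tiers = [OrderedDict() for _ in capacities]

    def _insert(self, level, block_id, data):
        tier = self.tiers[level]
        tier[block_id] = data  # new entries sit at the most-recently-used end
        if len(tier) > self.caps[level]:
            victim, victim_data = tier.popitem(last=False)  # least recently used
            if level + 1 < len(self.tiers):
                self._insert(level + 1, victim, victim_data)  # demotion
            # Otherwise the block simply leaves the cache.

    def write(self, block_id, data):
        for tier in self.tiers:
            tier.pop(block_id, None)  # drop any stale copy first
        self._insert(0, block_id, data)

    def read(self, block_id):
        for tier in self.tiers:
            if block_id in tier:
                data = tier.pop(block_id)
                self._insert(0, block_id, data)  # promotion to the top tier
                return data
        return None  # not cached in any tier

    def tier_of(self, block_id):
        for name, tier in zip(self.names, self.tiers):
            if block_id in tier:
                return name
        return None


store = TieredStore([("RAM", 2), ("SSD", 2), ("HDD", 2)])
for block in ["a", "b", "c"]:
    store.write(block, block.upper())
print(store.tier_of("a"))  # "a" was the oldest block, so it was demoted
store.read("a")            # reading it promotes it again
print(store.tier_of("a"), store.tier_of("b"))
```

In this model a read of a cold block pulls it back to the top tier and pushes the least recently used block down one level, which is the behavior the "Hot / Warm / Cold" labels on the slide suggest.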
  8. Data Accessibility via popular APIs
     ▪ Bash: ~$ cat /mnt/alluxio/myInput
     ▪ Spark: > rdd = sc.textFile("alluxio://master:19998/myInput")
     ▪ Presto: > CREATE SCHEMA hive.web WITH (location = 'alluxio://master:19998/my-table/')
     ▪ Tensorflow: ~$ python classify_image.py --model_dir /mnt/fuse/imagenet/
     ▪ Java: FileSystem fs = FileSystem.Factory.get(); FileInStream in = fs.openFile(new AlluxioURI("/myInput"));
  9. Data Abstraction via Unified Namespace
     ▪ Enables effective data management across different Under Stores
     ▪ $ ./bin/alluxio fs mount /Data s3://bucket/directory
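To make the mount command on slide 9 more concrete, here is a minimal sketch of how a mount table could translate a namespace path into an under-store URI. The `MountTable` class and the longest-prefix rule are assumptions for illustration, not Alluxio's actual code; the `hdfs://namenode:8020/alluxio` root mount is a made-up example.

```python
class MountTable:
    """Toy mount table mapping namespace paths to under-store URIs."""

    def __init__(self):
        self.mounts = {}

    def mount(self, alluxio_path, ufs_uri):
        point = alluxio_path.rstrip("/") or "/"
        self.mounts[point] = ufs_uri.rstrip("/")

    def resolve(self, path):
        # The longest matching mount point wins, so nested mounts work.
        for point in sorted(self.mounts, key=len, reverse=True):
            if point == "/":
                return self.mounts[point] + path
            if path == point or path.startswith(point + "/"):
                return self.mounts[point] + path[len(point):]
        raise KeyError(f"no mount point covers {path}")


table = MountTable()
table.mount("/", "hdfs://namenode:8020/alluxio")  # hypothetical root under store
table.mount("/Data", "s3://bucket/directory")     # the mount from slide 9
print(table.resolve("/Data/2019/part-0.parquet"))
print(table.resolve("/tmp/scratch"))
```

Applications only ever see paths such as `/Data/...`; which storage system backs a path is decided by the mount table.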
  10. Typical Alluxio Use Cases
      • Cloud Analytics Caching: Get in-memory data access for Spark, Presto, or any analytics framework on Cloud storage
      • Hybrid Cloud Analytics: Get in-memory data access for Spark, Presto, or any analytics framework on Cloud storage
  11. Deployment Approaches
      ▪ Co-locate Alluxio Workers with Spark for optimal I/O performance (Spark and Alluxio on the same instance, in front of AWS S3)
      ▪ Deploy Alluxio as a standalone cluster between Spark and storage (Spark and Presto, Alluxio, and AWS S3 in the same data center / region)
  12. Alluxio-EMR Prerequisites and Design Considerations
      ▪ IAM Account with the default EMR Roles
      ▪ S3 Bucket to host Bootstrap script and to act as a UFS
      ▪ Key Pair for EC2
      ▪ AWS CLI
      ▪ Leverage AWS Glue/RDS to persist Hive Metastore State
      ▪ Bootstrap Scripts
  13. Alluxio EMR Service Integration: Bootstrap Actions
      ▪ EMR provides hooks into the main configuration files for Hadoop services:
        ▪ hive-site.xml, core-site.xml, hadoop-env.sh, hive.properties
      ▪ Bootstrap Actions
        ▪ Up to 10 shell scripts specified by the user
        ▪ Run before Hadoop service installation
        ▪ Shutdown actions are offered as well
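As a rough illustration of slide 13, the sketch below builds one entry of the `BootstrapActions` list in the shape used by the EMR `RunJobFlow` API (`Name` plus `ScriptBootstrapAction` with `Path` and `Args`). The helper name, the bucket `my-bucket`, the script name, and the script argument are all hypothetical placeholders, and no AWS call is made here.

```python
import json


def bootstrap_action(name, script_s3_path, args=()):
    """Build one BootstrapActions entry in the EMR RunJobFlow request shape."""
    if not script_s3_path.startswith("s3://"):
        # Assumption of this sketch: the script is hosted in an S3 bucket.
        raise ValueError("this sketch expects an s3:// script path")
    return {
        "Name": name,
        "ScriptBootstrapAction": {"Path": script_s3_path, "Args": list(args)},
    }


# Hypothetical bucket, script name, and argument (the root under-store URI).
actions = [
    bootstrap_action(
        "Install Alluxio",
        "s3://my-bucket/bootstrap/alluxio-emr.sh",
        ["s3://my-bucket/alluxio-ufs/"],
    )
]
print(json.dumps(actions, indent=2))
```

A list like this could then be passed as the `BootstrapActions` parameter when creating the cluster, for example through an AWS SDK.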
  14. Demo
  15. Alluxio Reference Architecture
      ▪ Masters: Alluxio Master, Standby Master (ZooKeeper / RAFT)
      ▪ Workers: Alluxio Worker, Alluxio Worker, …
      ▪ Clients: Application, Application, …
      ▪ Storage: Under Store 1, Under Store 2
  16. Read data in Alluxio, on same node as client
      ▪ Memory Speed Read of Data
      ▪ Components: Application, Alluxio Client, Alluxio Master, Alluxio Worker (RAM / SSD / HDD)
  17. Read data not in Alluxio
      ▪ Network / Disk Speed Read of Data
      ▪ Components: Application, Alluxio Client, Alluxio Master, Alluxio Worker (RAM / SSD / HDD), Under Store
  18. Write data only to Alluxio on same node as client
      ▪ Memory Speed Write of Data
      ▪ Components: Application, Alluxio Client, Alluxio Master, Alluxio Worker (RAM / SSD / HDD)
  19. Write data to Alluxio and Under Store synchronously
      ▪ Network / Disk Speed Write of Data
      ▪ Components: Application, Alluxio Client, Alluxio Master, Alluxio Worker (RAM / SSD / HDD), Under Store
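The four read and write paths on slides 16 to 19 can be summarized in a small toy model. This is a sketch under simplifying assumptions, not Alluxio's client or worker code: `ToyWorker` is hypothetical and plain dictionaries stand in for worker storage and for S3. The names `MUST_CACHE` and `CACHE_THROUGH` mirror two of Alluxio's write types.

```python
from enum import Enum


class WriteType(Enum):
    MUST_CACHE = "MUST_CACHE"        # write to Alluxio only (slide 18)
    CACHE_THROUGH = "CACHE_THROUGH"  # write to Alluxio and under store (slide 19)


class ToyWorker:
    def __init__(self, under_store):
        self.cache = {}                 # stands in for RAM / SSD / HDD on the worker
        self.under_store = under_store  # stands in for S3; a plain dict here

    def read(self, path):
        if path in self.cache:
            return self.cache[path], "alluxio"  # slide 16: served locally
        data = self.under_store[path]           # slide 17: fetched remotely
        self.cache[path] = data                 # cached for later reads
        return data, "under store"

    def write(self, path, data, write_type):
        self.cache[path] = data
        if write_type is WriteType.CACHE_THROUGH:
            self.under_store[path] = data       # persisted synchronously


s3 = {"/myInput": b"hello"}
worker = ToyWorker(s3)
print(worker.read("/myInput")[1])  # first read goes to the under store
print(worker.read("/myInput")[1])  # second read is served from Alluxio
worker.write("/scratch", b"tmp", WriteType.MUST_CACHE)
worker.write("/result", b"out", WriteType.CACHE_THROUGH)
print(sorted(s3))                  # only the CACHE_THROUGH write reached S3
```

The trade-off the slides point at is visible here: writing only to the cache is fast but leaves the data unpersisted, while the synchronous path pays under-store latency in exchange for durability.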
  20. Alluxio 2.0 & Coming in 2.1 Release
      ▪ Alluxio 2.0: Released in July
        ▪ Metadata scales to 1 billion files or more (based on RocksDB)
        ▪ Self-managed metadata service based on quorum
        ▪ Async writes, distributed load
        ▪ Many more: https://www.alluxio.io/download/releases/alluxio-2-0-0-release/
      ▪ Alluxio 2.1: Scheduled for September
        ▪ A Presto-Alluxio Connector with Iceberg Integration
        ▪ Use Alluxio as a caching layer without modifying HMS
  21. Next steps - Try it out!
      • Getting Started
      • Spark Performance Tuning Tips
      • Accelerate Spark and Hive Jobs on AWS S3: Use case from Bazaarvoice
      • Spark + Alluxio: Tencent Use Case
      Questions or Suggestions? Engage with us at alluxio.io/slack!
  22. Questions
      Slides will be available in the Slack channel (https://alluxio.io/slack)
