Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Spark Pipelines in the Cloud with Alluxio with Gene Pang

614 views

Published on

Organizations commonly use Apache Spark to gain actionable insight from their large amounts of data. Often, these analytics are in the form of data processing pipelines, where there are a series of processing stages, and each stage performs a particular function, and the output of one stage is the input of the next stage. There are several examples of pipelines, such as log processing, IoT pipelines, and machine learning. The common attribute among different pipelines is the sharing of data between stages. It is also common for Spark pipelines to process data stored in the public cloud, such as Amazon S3, Microsoft Azure Blob Storage, or Google Cloud Storage. The global availability and cost effectiveness of these public cloud storage services make them the preferred storage for data. However, running pipeline jobs while sharing data via cloud storage can be expensive in terms of increased network traffic, and slower data sharing and job completion times. Using Alluxio, a memory speed virtual distributed storage system, enables sharing data between different stages or jobs at memory speed. By reading and writing data in Alluxio, the data can stay in memory for the next stage of the pipeline, and this result in great performance gains. In this talk, we discuss how Alluxio can be deployed and used with a Spark data processing pipeline in the cloud. We show how pipeline stages can share data with Alluxio memory for improved performance benefits, and how Alluxio can improves completion times and reduces performance variability for Spark pipelines in the cloud.

Published in: Data & Analytics
  • Be the first to comment

Spark Pipelines in the Cloud with Alluxio with Gene Pang

  1. 1. Spark Pipelines in the Cloud with Alluxio Gene Pang,Alluxio, Inc. Spark Summit EU - October 2017
  2. 2. About Me •  Gene Pang •  Software engineer @ Alluxio, Inc. •  Alluxio open source PMC member •  Ph.D. from AMPLab @ UC Berkeley •  Worked at Google before UC Berkeley •  Twitter: @unityxx •  Github: @gpang ©2017 Alluxio, Inc.All Rights Reserved 2
  3. 3. Outline ©2017 Alluxio, Inc.All Rights Reserved 3 1 2 3 Alluxio Overview Data Pipelines Experiments
  4. 4. History of Alluxio Started at UC Berkeley AMPLab In Summer 2012 •  Originally named as Tachyon •  Rebranded to Alluxio in early 2016 Open Sourced in 2013 •  Apache License 2.0 •  Latest Release:Alluxio 1.6.0 ©2017 Alluxio, Inc.All Rights Reserved 4
  5. 5. 5©2017 Alluxio, Inc.All Rights Reserved Alluxio: Unify Data at Memory Speed Namespace Unification Architecture Flexibility IO Performance
  6. 6. Data Ecosystem with Alluxio •  Apps only talk to Alluxio •  Simple Add/Remove •  No App Changes •  In-Memory Performance Native File System Hadoop Compatible File System Native Key-Value Interface Fuse Compatible File System HDFS Interface Amazon S3 Interface Swift Interface GlusterFS Interface ©2017 Alluxio, Inc.All Rights Reserved 6
  7. 7. Next Gen Analytics with Alluxio Native File System Hadoop Compatible File System Native Key-Value Interface Fuse Compatible File System HDFS Interface Amazon S3 Interface Swift Interface GlusterFS Interface Apps, Data & Storage at Memory Speed ü  Big Data/IoT ü  AI/ML ü  Deep Learning ü  Cloud Migration ü  Multi Platform ü  Autonomous ©2017 Alluxio, Inc.All Rights Reserved 7
  8. 8. Fastest Growing Big Data Open Source Projects Fastest Growing open- source project in the big data ecosystem Running in large production clusters Now: 600+ Contributors from 100+ organizations 0 100 200 300 400 500 0 10 20 30 40 45 NumberofContributors Github Open Source Contributors by Month Alluxio Spark Kafka Redis HDFS Cassandra Hive ©2017 Alluxio, Inc.All Rights Reserved 8
  9. 9. Outline ©2017 Alluxio, Inc.All Rights Reserved 9 1 2 3 Alluxio Overview Data Pipelines Experiments
  10. 10. 1 0©2017 Alluxio, Inc.All Rights Reserved Data Processing Pipeline Stage 1 Stage 2 Stage 3 Stage 4 Output of stage is input of next stage
  11. 11. 1 1©2017 Alluxio, Inc.All Rights Reserved Data Processing Pipeline Sharing via common storage Shared Storage Stage 1 Stage 2 Stage 3 Stage 4
  12. 12. 1 2©2017 Alluxio, Inc.All Rights Reserved What about pipelines in the cloud?
  13. 13. 1 3©2017 Alluxio, Inc.All Rights Reserved Data Processing Pipeline in the Cloud S3 (object store) Stage 1 Stage 2 Stage 3 Stage 4
  14. 14. 1 4©2017 Alluxio, Inc.All Rights Reserved Sharing data via cloud storage slows down performance
  15. 15. 1 5©2017 Alluxio, Inc.All Rights Reserved Cloud Pipeline with Alluxio Sharing via Alluxio memory S3 (object store) Stage 1 Stage 2 Stage 3 Stage 4
  16. 16. Sharing Data in the Cloud Previous stage writes output to storage Next stage reads input from storage … ©2017 Alluxio, Inc.All Rights Reserved 1 6
  17. 17. Sharing Data in the Cloud with Alluxio Previous stage writes output to storage memory Next stage reads input from storage memory … ©2017 Alluxio, Inc.All Rights Reserved 1 7
  18. 18. 1 8©2017 Alluxio, Inc.All Rights Reserved Alluxio enables in-memory data sharing Faster pipeline performance
  19. 19. Alluxio – Fast Durable Writes Improves write performance, without sacrificing fault tolerance ©2017 Alluxio, Inc.All Rights Reserved 1 9
  20. 20. Alluxio – Fast Durable Writes Synchronously write to replicas in Alluxio memory Asynchronously write to underlying storage ©2017 Alluxio, Inc.All Rights Reserved 2 0
  21. 21. Alluxio – Fast Durable Writes ©2017 Alluxio, Inc.All Rights Reserved 2 1 client Storage Write is completed
  22. 22. Alluxio – Fast Durable Writes ©2017 Alluxio, Inc.All Rights Reserved 2 2 client Storage Asynchronously written to storage
  23. 23. Outline ©2017 Alluxio, Inc.All Rights Reserved 2 3 1 2 3 Alluxio Overview Data Pipelines Experiments
  24. 24. Log Pipeline in Amazon Web Services ©2017 Alluxio, Inc.All Rights Reserved 2 4 Generate Parquet Transform Aggregate Generate: [MapReduce] Create random csv log data Parquet: [MapReduce] Convert csv to parquet format Transform: [Spark] Update column values Aggregate: [Spark] Compute group by / aggregate
  25. 25. Log Pipeline Environment •  r4.2xlarge instances (61 GB ram, 8 CPUs) •  1 master, 3 workers •  Apache Spark 2.2.0 •  Apache Hadoop 2.7.2 •  Alluxio 1.6.0 •  Generate 12 GB of logs •  Compare AWS S3 vs Alluxio w/ Fast Durable Writes ©2017 Alluxio, Inc.All Rights Reserved 2 5
  26. 26. 2 6©2017 Alluxio, Inc.All Rights Reserved Average Stage Completion Time
  27. 27. 2 7©2017 Alluxio, Inc.All Rights Reserved Pipeline Completion Time Over 9x speedup!
  28. 28. Alluxio and Pipelines in the Cloud Alluxio enables in-memory sharing for data pipelines in the cloud Alluxio’s Fast Durable Write feature increases performance without sacrificing fault tolerance ©2017 Alluxio, Inc.All Rights Reserved 2 8
  29. 29. Thank you! Gene Pang gene@alluxio.com Twitter: @unityxx Twi$er.com/alluxio   Linkedin.com/alluxio     Website www.alluxio.com E-mail info@alluxio.com @ Social Media á ™ ©2017 Alluxio, Inc.All Rights Reserved 2 9

×