Best Practices for Using Alluxio with Apache Spark with Gene Pang

308 views

Published on

Alluxio, formerly Tachyon, is a memory speed virtual distributed storage system and leverages memory for storing data and accelerating access to data in different storage systems. Many organizations and deployments use Alluxio with Apache Spark, and some of them scale out to over PB’s of data. Alluxio can enable Spark to be even more effective, in both on-premise deployments and public cloud deployments. Alluxio bridges Spark applications with various storage systems and further accelerates data intensive applications. In this talk, we briefly introduce Alluxio, and present different ways how Alluxio can help Spark jobs. We discuss best practices of using Alluxio with Spark, including RDDs and DataFrames, as well as on-premise deployments and public cloud deployments.

Published in: Data & Analytics
  • Be the first to like this

Best Practices for Using Alluxio with Apache Spark with Gene Pang

  1. 1. Best Practices for Using Alluxio with Spark Gene Pang,Alluxio, Inc. Spark Summit EU - October 2017
  2. 2. About Me •  Gene Pang •  Software engineer @ Alluxio, Inc. •  Alluxio open source PMC member •  Ph.D. from AMPLab @ UC Berkeley •  Worked at Google before UC Berkeley •  Twitter: @unityxx •  Github: @gpang ©2017 Alluxio, Inc.All Rights Reserved 2
  3. 3. Outline Alluxio Overview Alluxio + Spark Use Cases Alluxio Architecture Using Spark with Alluxio Experiments 1 2 3 4 5 ©2017 Alluxio, Inc.All Rights Reserved 3
  4. 4. Data EcosystemYesterday •  One Compute Framework •  Single Storage System •  Co-located ©2017 Alluxio, Inc.All Rights Reserved 4
  5. 5. Data Ecosystem Today •  Many Compute Frameworks •  Multiple Storage Systems •  Most not co-located ©2017 Alluxio, Inc.All Rights Reserved 5
  6. 6. Data Ecosystem Issues •  Each application manage multiple data sources •  Add/Removing data sources require application changes •  Storage optimizations requires application change •  Lower performance due to lack of locality ©2017 Alluxio, Inc.All Rights Reserved 6
  7. 7. Data Ecosystem with Alluxio •  Apps only talk to Alluxio •  Simple Add/Remove •  No App Changes •  Memory Performance Native File System Hadoop Compatible File System Native Key-Value Interface Fuse Compatible File System HDFS Interface Amazon S3 Interface Swift Interface GlusterFS Interface ©2017 Alluxio, Inc.All Rights Reserved 7
  8. 8. Next Gen Analytics with Alluxio Native File System Hadoop Compatible File System Native Key-Value Interface Fuse Compatible File System HDFS Interface Amazon S3 Interface Swift Interface GlusterFS Interface Apps, Data & Storage at Memory Speed ü  Big Data/IoT ü  AI/ML ü  Deep Learning ü  Cloud Migration ü  Multi Platform ü  Autonomous ©2017 Alluxio, Inc.All Rights Reserved 8
  9. 9. Fastest Growing Big Data Open Source Projects Fastest Growing open- source project in the big data ecosystem Running in large production clusters 600+ Contributors from 100+ organizations 0 100 200 300 400 500 0 10 20 30 40 45 NumberofContributors Github Open Source Contributors by Month Alluxio Spark Kafka Redis HDFS Cassandra Hive ©2017 Alluxio, Inc.All Rights Reserved 9
  10. 10. Outline Alluxio Overview Alluxio + Spark Use Cases Alluxio Architecture Using Spark with Alluxio Experiments 1 2 3 4 5 ©2017 Alluxio, Inc.All Rights Reserved 1 0
  11. 11. Big Data Case Study – 1 110/30/17 ©2017 Alluxio, Inc.All Rights Reserved Challenge – Gain end to end view of business with large volume of data Queries were slow / not interactive, resulting in operational inefficiency SPARK TERADATA SPARK TERADATA Solution – ETL Data from Teradata to Alluxio Impact – Faster Time to Market – “Now we don’t have to work Sundays” http://bit.ly/2oMx95W
  12. 12. Big Data Case Study – 1 210/30/17 ©2017 Alluxio, Inc.All Rights Reserved Challenge – Gain end to end view of business with large volume of data Queries were slow / not interactive, resulting in operational inefficiency SPARK Baidu File System SPARK Baidu File System Solution – With Alluxio, data queries are 30X faster Impact – Higher operational efficiency http://bit.ly/2pDHS3O
  13. 13. Big Data Case Study – Challenge – Gain end to end view of business with large volume of data for $5B Travel Site Queries were slow / not interactive, resulting in operational inefficiency SPARK HDFS Solution – With Alluxio, 300x improvement in performance Impact – Increased revenue from immediate response to user behavior Use case: http://bit.ly/2pDJdrq CEPH HDFS CEPH FLINK SPARK FLINK ©2017 Alluxio, Inc.All Rights Reserved 1 3
  14. 14. Machine Learning Case Study – 1 410/30/17 ©2017 Alluxio, Inc.All Rights Reserved Challenge – Disparate Data both on-prem and Cloud. Heterogeneous types of data. Scaling of Exabyte size data. Slow due to disk based approach. SPARK HDFS SPARK MINIO Solution – Using Alluxio to prevent I/O bottlenecks Impact – Orders of magnitude higher performance than before. http://bit.ly/2p18ds3 MESOS
  15. 15. Outline Alluxio Overview Alluxio + Spark Use Cases Alluxio Architecture Using Spark with Alluxio Experiments 1 2 3 4 5 ©2017 Alluxio, Inc.All Rights Reserved 1 5
  16. 16. 1 6©2017 Alluxio, Inc.All Rights Reserved Application AlluxioClient Alluxio Master Alluxio Worker Alluxio Worker … Storage Storage … Alluxio Architecture
  17. 17. Alluxio Client Applications interact with Alluxio via the Alluxio client •  Java Native Alluxio Filesystem Client •  Alluxio specific operations like [un]pin, [un]mount, [un]set TTL •  HDFS-Compatible Filesystem Client •  No code change necessary •  S3 API ©2017 Alluxio, Inc.All Rights Reserved 1 7
  18. 18. Alluxio Master Master is responsible for managing metadata •  Filesystem namespace metadata •  Blocks / workers metadata Primary master writes journal for durable operations •  Secondary masters replay journal entries ©2017 Alluxio, Inc.All Rights Reserved 1 8
  19. 19. Alluxio Worker Worker is responsible for managing block data Worker stores block data on various storage media •  HDD, SSD, Memory Reads and writes data to underlying storage systems ©2017 Alluxio, Inc.All Rights Reserved 1 9
  20. 20. Outline Alluxio Overview Alluxio + Spark Use Cases Alluxio Architecture Using Spark with Alluxio Experiments 1 2 3 4 5 ©2017 Alluxio, Inc.All Rights Reserved 2 0
  21. 21. Sharing Data via Memory Storage Engine & Execution Engine Same Process •  Two copies of data in memory – double the memory used •  Sharing Slowed Down by Network / Disk I/O Spark Compute Spark Storage block 1 block 3 HDFS / Amazon S3 block 1 block 3 block 2 block 4 Spark Compute Spark Storage block 1 block 3 ©2017 Alluxio, Inc.All Rights Reserved 2 1
  22. 22. Sharing Data via Memory Storage Engine & Execution Engine Different process •  Half the memory used •  Sharing Data at Memory Speed Spark Compute Spark Storage HDFS / Amazon S3 block 1 block 3 block 2 block 4 HDFS disk block 1 block 3 block 2 block 4 Alluxio block 1 block 3 block 4 Spark Compute Spark Storage ©2017 Alluxio, Inc.All Rights Reserved 2 2
  23. 23. Data Resilience During Crash Spark Compute Spark Storage block 1 block 3 HDFS / Amazon S3 block 1 block 3 block 2 block 4 Storage Engine & Execution Engine Same Process ©2017 Alluxio, Inc.All Rights Reserved 2 3
  24. 24. Data Resilience During Crash CRASH Spark Storage block 1 block 3 HDFS / Amazon S3 block 1 block 3 block 2 block 4 •  Process Crash Requires Network and/or Disk I/O to Re-read Data Storage Engine & Execution Engine Same Process ©2017 Alluxio, Inc.All Rights Reserved 2 4
  25. 25. Data Resilience During Crash CRASH HDFS / Amazon S3 block 1 block 3 block 2 block 4 Storage Engine & Execution Engine Same Process •  Process Crash Requires Network and/or Disk I/O to Re-read Data ©2017 Alluxio, Inc.All Rights Reserved 2 5
  26. 26. Data Resilience During Crash Spark Compute Spark Storage HDFS / Amazon S3 block 1 block 3 block 2 block 4 HDFS disk block 1 block 3 block 2 block 4 Alluxio block 1 block 3 block 4 Storage Engine & Execution Engine Different process ©2017 Alluxio, Inc.All Rights Reserved 2 6
  27. 27. Data Resilience During Crash •  Process Crash – Data is Re-read at Memory Speed HDFS / Amazon S3 block 1 block 3 block 2 block 4 HDFS disk block 1 block 3 block 2 block 4 Alluxio block 1 block 3 block 4 CRASH Storage Engine & Execution Engine Different process ©2017 Alluxio, Inc.All Rights Reserved 2 7
  28. 28. Accessing Alluxio Data From Spark Writing Data Write to an Alluxio file Reading Data Read from an Alluxio file ©2017 Alluxio, Inc.All Rights Reserved 2 8
  29. 29. Code Example for Spark RDDs Writing RDD to Alluxio rdd.saveAsTextFile(alluxioPath)! rdd.saveAsObjectFile(alluxioPath)! Reading RDD from Alluxio rdd = sc.textFile(alluxioPath)! rdd = sc.objectFile(alluxioPath)! ©2017 Alluxio, Inc.All Rights Reserved 2 9
  30. 30. Code Example for Spark DataFrames Writing to Alluxio df.write.parquet(alluxioPath)! Reading from Alluxio df = sc.read.parquet(alluxioPath)! ©2017 Alluxio, Inc.All Rights Reserved 3 0
  31. 31. Deploying Alluxio with Spark ©2017 Alluxio, Inc.All Rights Reserved 3 1 Spark Alluxio Storage Spark Alluxio Storage Colocate Alluxio Workers with Spark for optimal I/O performance Deploy Alluxio between Spark and Storage
  32. 32. Outline Alluxio Overview Alluxio + Spark Use Cases Alluxio Architecture Using Spark with Alluxio Experiments 1 2 3 4 5 ©2017 Alluxio, Inc.All Rights Reserved 3 2
  33. 33. Experiments Spark 2.2.0 + Alluxio 1.6.0 Single worker:Amazon r3.2xlarge Compare reading cached parquet files ©2017 Alluxio, Inc.All Rights Reserved 3 3
  34. 34. Reading Cached DataFrame (parquet) ©2017 Alluxio, Inc.All Rights Reserved 3 4
  35. 35. New Context: 50 GB DataFrame (S3) 6x – 8x speedup ©2017 Alluxio, Inc.All Rights Reserved 3 5
  36. 36. Conclusion Easy to use Alluxio with Spark Alluxio enables improved I/O performance Easily interact with various storage systems with Alluxio ©2017 Alluxio, Inc.All Rights Reserved 3 6
  37. 37. Thank you! Gene Pang gene@alluxio.com Twitter: @unityxx Twi$er.com/alluxio   Linkedin.com/alluxio     Website www.alluxio.com E-mail info@alluxio.com @ Social Media á ™ ©2017 Alluxio, Inc.All Rights Reserved 3 7

×