Best Practices for Using Alluxio with Spark

245 views

Published on

Strata New York
September 2017
Haoyuan Li, Ancil McBarnett

Published in: Technology
1 Comment
0 Likes
Statistics
Notes
  • Be the first to like this

No Downloads
Views
Total views
245
On SlideShare
0
From Embeds
0
Number of Embeds
2
Actions
Shares
0
Downloads
18
Comments
1
Likes
0
Embeds 0
No embeds

No notes for slide

Best Practices for Using Alluxio with Spark

  1. 1. Best Practices for Using Alluxio with Spark Haoyuan Li, Ancil McBarnett Strata NewYork, Sept 2017
  2. 2. Confidential © Alluxio, Inc.All Rights Reserved. 2 Outline Alluxio Overview Alluxio + Spark Use Cases Using Spark with Alluxio Performance Evaluation Demo 1 2 3 4 5
  3. 3. Confidential © Alluxio, Inc.All Rights Reserved. 3 Data EcosystemYesterday 3 •  One Compute Framework •  Single Storage System •  Co-located
  4. 4. Confidential © Alluxio, Inc.All Rights Reserved. 4 Data Ecosystem Today •  Many Compute Frameworks •  Multiple Storage Systems •  Most not co-located
  5. 5. Confidential © Alluxio, Inc.All Rights Reserved. 5 Data Ecosystem Issues 5 •  Each application manage multiple data sources •  Add/Removing data sources require application changes •  Storage optimizations requires application change •  Lower performance due to lack of locality
  6. 6. Confidential © Alluxio, Inc.All Rights Reserved. 6 Data Ecosystem Challenges 2 Data Freshness •  Cross-network movement is slow •  Each ETL creates more lag 4 Security & Governance •  Data security & governance is increasingly complex 1 Speed & Complexity •  Numerous storage & compute systems •  Integration and interoperability issues (on prem, hybrid, cloud) •  Many departments & groups 3 Cost •  Data duplication •  Data and App explosion driving cost up 6 Heavy integrations create painful organizational drag
  7. 7. Confidential © Alluxio, Inc.All Rights Reserved. 7 Data Ecosystem with Alluxio 7 •  Apps only talk to Alluxio •  Simple Add/Remove •  No App Changes •  Highest performance in Memory •  No Lock in Native File System Hadoop Compatible File System Native Key-Value Interface Fuse Compatible File System HDFS Interface Amazon S3 Interface Swift Interface GlusterFS Interface
  8. 8. Confidential © Alluxio, Inc.All Rights Reserved. 8 Alluxio Design Principles 2 Optimize Data Access •  Remote data •  Service-oriented & microservices •  Hot/warm/cold data •  Temporary data 4 Enterprise Class •  Distributed Architecture •  Commodity Hardware •  High Availability •  Security 1 Big Data & Machine Learning •  Interoperability with leading projects •  Large scale data sets •  High IO 3 Application Data Sharing •  Multiple compute frameworks within a node or cluster •  Shared storage •  Read/write support 8
  9. 9. Confidential © Alluxio, Inc.All Rights Reserved. 9 Alluxio Innovation: Server-side API Translation Convert from Client-side Interface to Native Storage Interface HDFS Interface HDFS Interface S3A Interface Swift Interface Google Cloud Interface
  10. 10. Confidential © Alluxio, Inc.All Rights Reserved. 10 Alluxio Innovation: Server-side API Translation Convert between different versions of HDFS HDFS 2.7 Interface HDP 2.4 InterfaceCDH 5.6 Interface MAPR 5.2 Interface
  11. 11. Confidential © Alluxio, Inc.All Rights Reserved. 11 Alluxio Innovation: Unified Namespace Enables effective data management across different Under Stores Uses Mounting with Transparent Naming
  12. 12. Confidential © Alluxio, Inc.All Rights Reserved. 12 Alluxio Innovation: Unified Namespace Create a catalog of available data sources for Data Scientists /finance/customer-transactions/ /finance/vendor-transactions/ /operations/device-logs/ /operations/phone-call-recordings/ /operations/check-images/ /research/us-economic-data/ /research/intl-economic-data/ /marketing/advertising-dataset/ /marketing/marketing-funnel-dataset/ alluxio://
  13. 13. Confidential © Alluxio, Inc.All Rights Reserved. 13 Alluxio Innovation: Intelligent Cache Local performance from remote data using native multi-tier storage RAM SSD HDD Hot Warm Cold
  14. 14. Confidential © Alluxio, Inc.All Rights Reserved. 14 Where to use Alluxio Finding high-fit Alluxio use-cases Compute Zone Standalone or managed with Mesos orYarn Storage in Different Availability Zone Either on-prem or cloud Alluxio is installed with or near compute to unify data stores, stage remote data, and improve system performance. Spark Tensorflow Presto HDFS Guidelines ü  Compute separated from storage ü  Distributed compute ü  I/O or network latency exists ü  Unification of many storage systems ü  Applications sharing long lived data More checks result in higher fit applications
  15. 15. Confidential © Alluxio, Inc.All Rights Reserved. 15 Fastest Growing Big Data Open Source Projects Fastest Growing open-source project in the big data ecosystem Running in large production clusters 500+ Contributors from 100+ organizations 0 100 200 300 400 500 0 10 20 30 40 45 NumberofContributors Github Open Source Contributors by Month Alluxio Spark Kafka Redis HDFS Cassandra Hive 15
  16. 16. Confidential © Alluxio, Inc.All Rights Reserved. 16 Outline Alluxio Overview Alluxio + Spark Use Cases Using Spark with Alluxio Performance Evaluation Demo 1 2 3 4 5
  17. 17. Confidential © Alluxio, Inc.All Rights Reserved. 17 Big Data Case Study – 17 Challenge – Gain end to end view of business with large volume of data Queries were slow / not interactive, resulting in operational inefficiency SPARK TERADATA SPARK TERADATA Solution – ETL Data from Teradata to Alluxio Impact – Faster Time to Market – “Now we don’t have to work Sundays” http://bit.ly/2oMx95W
  18. 18. Confidential © Alluxio, Inc.All Rights Reserved. 18 Big Data Case Study – 18 Challenge – Gain end to end view of business with large volume of data Queries were slow / not interactive, resulting in operational inefficiency SPARK Baidu File System SPARK Baidu File System Solution – With Alluxio, data queries are 30X faster Impact – Higher operational efficiency http://bit.ly/2pDHS3O
  19. 19. Confidential © Alluxio, Inc.All Rights Reserved. 19 Big Data Case Study – 19 Challenge – Gain end to end view of business with large volume of data for $5B Travel Site Queries were slow / not interactive, resulting in operational inefficiency SPARK HDFS Solution – With Alluxio, 300x improvement in performance Impact – Increased revenue from immediate response to user behavior Use case: http://bit.ly/2pDJdrq CEPH HDFS CEPH FLINK SPARK FLINK
  20. 20. Confidential © Alluxio, Inc.All Rights Reserved. 20 Machine Learning Case Study – 20 Challenge – Disparate Data both on-prem and Cloud. Heterogeneous types of data. Scaling of Exabyte size data. Slow due to disk based approach. SPARK HDFS SPARK MINIO Solution – Using Alluxio to prevent I/O bottlenecks Impact – Orders of magnitude higher performance than before. http://bit.ly/2p18ds3 MESOS
  21. 21. Confidential © Alluxio, Inc.All Rights Reserved. 21 Outline Alluxio Overview Alluxio + Spark Use Cases Using Spark with Alluxio Performance Evaluation Demo 1 2 3 4 5
  22. 22. Confidential © Alluxio, Inc.All Rights Reserved. 22 Consolidating Memory 22 Storage Engine & Execution Engine Same Process •  Two copies of data in memory – double the memory used •  Inter-process Sharing Slowed Down by Network / Disk I/O Spark Compute Spark Storage block 1 block 3 HDFS / Amazon S3 block 1 block 3 block 2 block 4 Spark Compute Spark Storage block 1 block 3
  23. 23. Confidential © Alluxio, Inc.All Rights Reserved. 23 Consolidating Memory 23 Storage Engine & Execution Engine Different process •  Half the memory used •  Inter-process Sharing Happens at Memory Speed Spark Compute Spark Storage HDFS / Amazon S3 block 1 block 3 block 2 block 4 HDFS disk block 1 block 3 block 2 block 4 Alluxio block 1 block 3 block 4 Spark Compute Spark Storage
  24. 24. Confidential © Alluxio, Inc.All Rights Reserved. 24 Data Resilience During Crash 24 Spark Compute Spark Storage block 1 block 3 HDFS / Amazon S3 block 1 block 3 block 2 block 4 Storage Engine & Execution Engine Same Process
  25. 25. Confidential © Alluxio, Inc.All Rights Reserved. 25 Data Resilience During Crash 25 CRASH Spark Storage block 1 block 3 HDFS / Amazon S3 block 1 block 3 block 2 block 4 •  Process Crash Requires Network and/or Disk I/O to Re-read Data Storage Engine & Execution Engine Same Process
  26. 26. Confidential © Alluxio, Inc.All Rights Reserved. 26 Data Resilience During Crash 26 CRASH HDFS / Amazon S3 block 1 block 3 block 2 block 4 Storage Engine & Execution Engine Same Process •  Process Crash Requires Network and/or Disk I/O to Re-read Data
  27. 27. Confidential © Alluxio, Inc.All Rights Reserved. 27 Data Resilience During Crash 27 Spark Compute Spark Storage HDFS / Amazon S3 block 1 block 3 block 2 block 4 HDFS disk block 1 block 3 block 2 block 4 Alluxio block 1 block 3 block 4 Storage Engine & Execution Engine Different process
  28. 28. Confidential © Alluxio, Inc.All Rights Reserved. 28 Data Resilience During Crash 28 •  Process Crash – Data is Re-read at Memory Speed HDFS / Amazon S3 block 1 block 3 block 2 block 4 HDFS disk block 1 block 3 block 2 block 4 Alluxio block 1 block 3 block 4 CRASH Storage Engine & Execution Engine Different process
  29. 29. Confidential © Alluxio, Inc.All Rights Reserved. 29 Accessing Alluxio Data From Spark 29 Writing Data Write to an Alluxio file Reading Data Read from an Alluxio file
  30. 30. Confidential © Alluxio, Inc.All Rights Reserved. 30 Code Example for Spark RDDs 30 Writing RDD to Alluxio rdd.saveAsTextFile(alluxioPath)! rdd.saveAsObjectFile(alluxioPath)! Reading RDD from Alluxio rdd = sc.textFile(alluxioPath)! rdd = sc.objectFile(alluxioPath)!
  31. 31. Confidential © Alluxio, Inc.All Rights Reserved. 31 Code Example for Spark DataFrames 31 Writing to Alluxio df.write.parquet(alluxioPath)! Reading from Alluxio df = sc.read.parquet(alluxioPath)!
  32. 32. Confidential © Alluxio, Inc.All Rights Reserved. 32 Outline Alluxio Overview Alluxio + Spark Use Cases Using Spark with Alluxio Performance Evaluation Demo 1 2 3 4 5
  33. 33. Confidential © Alluxio, Inc.All Rights Reserved. 33 Experiments Spark 2.0.0 + Alluxio 1.2.0 Single worker:Amazon r3.2xlarge Comparisons: Alluxio Spark Storage Level: MEMORY_ONLY Spark Storage Level: MEMORY_ONLY_SER Spark Storage Level: DISK_ONLY
  34. 34. Confidential © Alluxio, Inc.All Rights Reserved. 34 0 50 100 150 200 250 0 5 10 15 20 25 30 35 40 45 50 Time[seconds] RDD Size [GB] Alluxio (textFile) Alluxio (objectFile) DISK_ONLY MEMORY_ONLY_SER MEMORY_ONLY Reading Cached RDD 34
  35. 35. Confidential © Alluxio, Inc.All Rights Reserved. 35 0 100 200 300 400 500 600 700 800 Alluxio (textFile) Alluxio (objectFile) No Alluxio Time [seconds] 7x speedup 16x speedup New Context: Read 50 GB RDD (S3) 35
  36. 36. Confidential © Alluxio, Inc.All Rights Reserved. 36 Reading Cached DataFrame (parquet) 36 0 50 100 150 200 250 0 5 10 15 20 25 30 35 40 45 50 Time[seconds] DataFrame Size [GB] Alluxio (textFile) MEMORY_ONLY_SER MEMORY_ONLY
  37. 37. Confidential © Alluxio, Inc.All Rights Reserved. 37 New Context: Read 50 GB DataFrame (S3) 37 0 250 500 750 1000 1250 1500 1750 Alluxio No Alluxio Time [seconds] 10x average speedup, 17x peak speedup
  38. 38. Confidential © Alluxio, Inc.All Rights Reserved. 38 Outline Alluxio Overview Alluxio + Spark Use Cases Using Spark with Alluxio Performance Evaluation Demo 1 2 3 4 5
  39. 39. Confidential © Alluxio, Inc.All Rights Reserved. 39 Demo Environment 39 Spark Alluxio
  40. 40. Confidential © Alluxio, Inc.All Rights Reserved. 40 Conclusion Easy to use Alluxio with Spark Predictable and improved performance Easily connect to various storages
  41. 41. Twi$er.com/alluxio Linkedin.com/alluxio Website www.alluxio.com E-mail info@alluxio.com @ Social Media á ™ Confidential © Alluxio, Inc.All Rights Reserved. 41 Thank you! Haoyuan Li Ancil McBarnett haoyuan@alluxio.com ancil@alluxio.com Twitter: @haoyuan Twitter: @ Twi$er.com/alluxio Linkedin.com/alluxio Website www.alluxio.com E-mail info@alluxio.com @ Social Media á ™

×