Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Effective Spark with Alluxio at Strata+Hadoop World San Jose 2017

520 views

Published on

Effective Spark with Alluxio at Strata+Hadoop World San Jose 2017

Published in: Technology
  • Be the first to comment

Effective Spark with Alluxio at Strata+Hadoop World San Jose 2017

  1. 1. EFFECTIVE SPARK WITH ALLUXIO Strata + Hadoop World San Jose 2017 Calvin Jia, Alluxio Inc.
  2. 2. OUTLINE • Alluxio Overview • Alluxio + Spark Use Cases • Using Alluxio with Spark • Performance Evaluation 2
  3. 3. BIG DATA ECOSYSTEM TODAYBIG DATA ECOSYSTEM WITH ALLUXIO … … FUSE Compatible File SystemHadoop Compatible File System Native Key-Value InterfaceNative File System Unifying Data at Memory Speed BIG DATA ECOSYSTEM ISSUES GlusterFS InterfaceAmazon S3 Interface Swift InterfaceHDFS Interface 3 BIG DATA ECOSYSTEM YESTERDAY
  4. 4. 4 • Fastest growing open- source project in the big data ecosystem • 400+ contributors from 100+ organizations • Running in large production clusters • Community members are welcome! FASTEST GROWING BIG DATA PROJECTS Popular Open Source Projects’ Growth Months NumberofContributors
  5. 5. OUTLINE • Alluxio Overview • Alluxio + Spark Use Cases • Using Alluxio with Spark • Performance Evaluation 5
  6. 6. ACCELERATE I/O TO/FROM REMOTE STORAGE The performance was amazing. With Spark SQL alone, it took 100-150 seconds to finish a query; using Alluxio, where data may hit local or remote Alluxio nodes, it took 10-15 seconds. - Baidu RESULTS • Data queries are now 30x faster with Alluxio • Alluxio cluster runs stably, providing over 50TB of RAM space • By using Alluxio, batch queries usually lasting over 15 minutes were transformed into an interactive query taking less than 30 seconds Baidu’s PMs and analysts run interactive queries to gain insights into their products and business • 200+ nodes deployment • 2+ petabytes of storage • Mix of memory + HDD ALLUXIO Baidu File System 6
  7. 7. GUARANTEE PERFORMANCE TO MEET SLAS RESULTS • Introducing Alluxio provides the system with smooth performance in an environment with highly variable load • Without Alluxio, the performance of critical jobs could take more than 4x the expected time • With Alluxio’s guaranteed performance, business logic could be reliably be executed leading to improved user experience • Real-time machine learning • 10+ TB of storage • Mix of Memory + HDD A leading online retailer uses Alluxio to compute conversion attribution 7 ALLUXIO
  8. 8. SHARE DATA ACROSS JOBS @ MEMORY SPEED Thanks to Alluxio, we now have the raw data immediately available at every iteration and we can skip the costs of loading in terms of time waiting, network traffic, and RDBMS activity. - Barclays RESULTS • Barclays workflow iteration time decreased from hours to seconds • Alluxio enabled workflows that were impossible before • By keeping data only in memory, the I/O cost of loading and storing in Alluxio is now on the order of seconds Barclays uses query and machine learning to train models for risk management • 6 node deployment • 1TB of storage • Memory only ALLUXIO Relational Database 8
  9. 9. TRANSPARENTLY MANAGE DATA ACROSS STORAGE SYSTEMS We’ve been running Alluxio in production for over 9 months, Alluxio’s unified namespace enable different applications and frameworks to easily interact with data from different storage systems. - Qunar RESULTS • Data sharing among Spark Streaming, Spark batch and Flink jobs provide efficient data sharing • Improved the performance of their system with 15x – 300x speedups • Tiered storage feature manages storage resources including memory, SSD and disk • 200+ nodes deployment • 6 billion logs (4.5 TB) daily • Mix of Memory + HDD ALLUXIO Qunar uses real-time machine learning for their website ads. 9
  10. 10. OUTLINE • Alluxio Overview • Alluxio + Spark Use Cases • Using Alluxio with Spark • Performance Evaluation 10
  11. 11. CONSOLIDATING MEMORY 11 Storage Engine & Execution Engine Same Process • Two copies of data in memory – double the memory used • Inter-process Sharing Slowed Down by Network / Disk I/O Spark Compute Spark Storage block 1 block 3 HDFS / Amazon S3 block 1 block 3 block 2 block 4 Spark Compute Spark Storage block 1 block 3
  12. 12. CONSOLIDATING MEMORY 12 Storage Engine & Execution Engine Different process • Half the memory used • Inter-process Sharing Happens at Memory Speed Spark Compute Spark Storage HDFS / Amazon S3 block 1 block 3 block 2 block 4 HDFS disk block  1 block  3 block  2 block  4 Alluxio block 1 block 3 block 4 Spark Compute Spark Storage
  13. 13. DATA RESILIENCE DURING CRASH 13 Spark Compute Spark Storage block 1 block 3 HDFS / Amazon S3 block 1 block 3 block 2 block 4 Storage Engine & Execution Engine Same Process
  14. 14. DATA RESILIENCE DURING CRASH 14 CRASH Spark Storage block 1 block 3 HDFS / Amazon S3 block 1 block 3 block 2 block 4 • Process Crash Requires Network and/or Disk I/O to Re-read Data Storage Engine & Execution Engine Same Process
  15. 15. DATA RESILIENCE DURING CRASH 15 CRASH HDFS / Amazon S3 block 1 block 3 block 2 block 4 Storage Engine & Execution Engine Same Process • Process Crash Requires Network and/or Disk I/O to Re-read Data
  16. 16. DATA RESILIENCE DURING CRASH 16 Spark Compute Spark Storage HDFS / Amazon S3 block 1 block 3 block 2 block 4 HDFS disk block  1 block  3 block  2 block  4 Alluxio block 1 block 3 block 4 Storage Engine & Execution Engine Different process
  17. 17. DATA RESILIENCE DURING CRASH 17 • Process Crash – Data is Re-read at Memory Speed HDFS / Amazon S3 block 1 block 3 block 2 block 4 HDFS disk block  1 block  3 block  2 block  4 Alluxio block 1 block 3 block 4 CRASH Storage Engine & Execution Engine Different process
  18. 18. ACCESSING ALLUXIO DATA FROM SPARK 18 Writing Data Write to an Alluxio file Reading Data Read from an Alluxio file
  19. 19. CODE EXAMPLE FOR SPARK RDDS 19 Writing RDD to Alluxio rdd.saveAsTextFile(alluxioPath) rdd.saveAsObjectFile(alluxioPath) Reading RDD from Alluxio rdd = sc.textFile(alluxioPath) rdd = sc.objectFile(alluxioPath)
  20. 20. CODE EXAMPLE FOR SPARK DATAFRAMES 20 Writing to Alluxio df.write.parquet(alluxioPath) Reading from Alluxio df = sc.read.parquet(alluxioPath)
  21. 21. OUTLINE • Alluxio Overview • Alluxio + Spark Use Cases • Using Alluxio with Spark • Performance Evaluation 21
  22. 22. ENVIRONMENT Spark 2.0.0 + Alluxio 1.2.0 Single worker: Amazon r3.2xlarge Comparisons: • Alluxio • Spark Storage Level: MEMORY_ONLY • Spark Storage Level: MEMORY_ONLY_SER • Spark Storage Level: DISK_ONLY 22
  23. 23. 0 50 100 150 200 250 0 10 20 30 40 50 Time[seconds] RDD Size [GB] READING CACHEDRDD Alluxio (textFile) Alluxio (objectFile) DISK_ONLY MEMORY_ONLY_SER MEMORY_ONLY 23
  24. 24. 24 0 50 100 150 200 250 Alluxio (textFile) Alluxio (objectFile) No Alluxio Time [seconds] NEW CONTEXT: READ 50 GB RDD (SSD)
  25. 25. 25 0 100 200 300 400 500 600 700 800 Alluxio (textFile) Alluxio (objectFile) No Alluxio Time [seconds] NEW CONTEXT: READ 50 GB RDD (S3) 7x speedup 16x speedup
  26. 26. 26 0 50 100 150 200 250 0 10 20 30 40 50 Time[seconds] DataFrame Size [GB] READING CACHED DATAFRAME (PARQUET) Alluxio (textFile) MEMORY_ONLY_SER MEMORY_ONLY
  27. 27. 27 0 50 100 150 200 250 Alluxio No Alluxio Time [seconds] NEW CONTEXT: READ 50 GB DATAFRAME (SSD)
  28. 28. 28 0 250 500 750 1000 1250 1500 1750 Alluxio No Alluxio Time [seconds] NEW CONTEXT: READ 50 GB DATAFRAME (S3) 10x average speedup, 17x peak speedup
  29. 29. CONCLUSION • Easy to use with Spark • Predictable and improved performance • Easily connect to various storages 29
  30. 30. Contact: calvin@alluxio.com Twitter: @jiacalvin Websites: www.alluxio.com and www.alluxio.org Thank you! 30
  31. 31. 31 Unification New workflows across any data in any storage system Orders of magnitude improvement in run time Choice in compute and storage – grow each independently, buy only what is needed Performance Flexibility BENEFITS

×