Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Spark Summit EU talk by Jiri Simsa

245 views

Published on

Effective Spark with Alluxio

Published in: Software
  • Be the first to comment

  • Be the first to like this

Spark Summit EU talk by Jiri Simsa

  1. 1. EFFECTIVE SPARK WITH ALLUXIO Spark Summit EU Jiri Simsa, Alluxio, Inc.
  2. 2. OUTLINE • Alluxio Overview • Alluxio + Spark Use Cases • Using Alluxio with Spark • Performance Evaluation 2
  3. 3. 3 BIG DATA ECOSYSTEM YESTERDAY
  4. 4. BIG DATA ECOSYSTEM TODAY … … 3
  5. 5. … … BIG DATA ECOSYSTEM ISSUES 3
  6. 6. BIG DATA ECOSYSTEM WITH ALLUXIO … … FUSE Compatible File SystemHadoop Compatible File System Native Key-Value InterfaceNative File System GlusterFS InterfaceAmazon S3 Interface Swift InterfaceHDFS Interface 3
  7. 7. BIG DATA ECOSYSTEM WITH ALLUXIO … … FUSE Compatible File SystemHadoop Compatible File System Native Key-Value InterfaceNative File System Unifying Data at Memory Speed GlusterFS InterfaceAmazon S3 Interface Swift InterfaceHDFS Interface 3
  8. 8. WHY ALLUXIO 4 Co-located compute and data with memory-speed access to data Virtualized across different storage systems under a unified namespace Scale-out architecture File system API, software only
  9. 9. 5 Unification New workflows across any data in any storage system Orders of magnitude improvement in run time Choice in compute and storage – grow each independently, buy only what is needed Performance Flexibility BENEFITS
  10. 10. 6 Popular Open Source Projects’ Growth Spark Kafka Cassandra HDFS FASTEST GROWING BIG DATA PROJECTS
  11. 11. 6 • Fastest growing open-source project in the big data ecosystem • 300+ contributors from 100+ organizations • Running in large production clusters Popular Open Source Projects’ Growth Alluxio Spark Kafka Cassandra HDFS FASTEST GROWING BIG DATA PROJECTS
  12. 12. OUTLINE • Alluxio Overview • Alluxio + Spark Use Cases • Using Alluxio with Spark • Performance Evaluation 7
  13. 13. ACCELERATE I/O TO/FROM REMOTE STORAGE Baidu’s PMs and analysts run 
 interactive queries to gain insights 
 into their products and business • 200+ nodes deployment • 2+ petabytes of storage • Mix of memory + HDD ALLUXIO Baidu File System 8
  14. 14. ACCELERATE I/O TO/FROM REMOTE STORAGE The performance was amazing. With Spark SQL alone, it took 100-150 seconds to finish a query; using Alluxio, where data may hit local or remote Alluxio nodes, it took 10-15 seconds. - Baidu RESULTS • Data queries are now 30x faster with Alluxio • Alluxio cluster runs stably, providing over 50TB of RAM space • By using Alluxio, batch queries usually lasting over 15 minutes were transformed into an interactive query taking less than 30 seconds Baidu’s PMs and analysts run 
 interactive queries to gain insights 
 into their products and business • 200+ nodes deployment • 2+ petabytes of storage • Mix of memory + HDD ALLUXIO Baidu File System 8
  15. 15. SHARE DATA ACROSS JOBS AT MEMORY SPEED Barclays uses query and machine learning to train models for risk management • 6 node deployment • 1TB of storage • Memory only ALLUXIO Relational Database 9
  16. 16. SHARE DATA ACROSS JOBS AT MEMORY SPEED Thanks to Alluxio, we now have the raw data immediately available at every iteration and we can skip the costs of loading in terms of time waiting, network traffic, and RDBMS activity. - Barclays RESULTS • Barclays workflow iteration time decreased from hours to seconds • Alluxio enabled workflows that were impossible before • By keeping data only in memory, the I/O cost of loading and storing in Alluxio is now on the order of seconds Barclays uses query and machine learning to train models for risk management • 6 node deployment • 1TB of storage • Memory only ALLUXIO Relational Database 9
  17. 17. MANAGE DATA ACROSS STORAGE SYSTEMS • 200+ nodes deployment • 6 billion logs (4.5 TB) daily • Mix of Memory + HDD ALLUXIO Qunar uses real-time machine learning for their website ads. 10
  18. 18. MANAGE DATA ACROSS STORAGE SYSTEMS We’ve been running Alluxio in production for over 9 months, Alluxio’s unified namespace enable different applications and frameworks to easily interact with data from different storage systems. - Qunar RESULTS • Data sharing among Spark Streaming, Spark batch and Flink jobs provide efficient data sharing • Improved the performance of their system with 15x – 300x speedups • Tiered storage feature manages storage resources including memory and HDD • 200+ nodes deployment • 6 billion logs (4.5 TB) daily • Mix of Memory + HDD ALLUXIO Qunar uses real-time machine learning for their website ads. 10
  19. 19. OUTLINE • Alluxio Overview • Alluxio + Spark Use Cases • Using Alluxio with Spark • Performance Evaluation 11
  20. 20. CONSOLIDATING MEMORY 12 Storage Engine & Execution Engine Same Process • Two copies of data in memory – double the memory used • Inter-process Sharing Slowed Down by Network / Disk I/O Spark Compute Spark Storage block 1 block 3 HDFS / Amazon S3 block 1 block 3 block 2 block 4 Spark Compute Spark Storage block 1 block 3
  21. 21. CONSOLIDATING MEMORY 13 Storage Engine & Execution Engine Different process • Half the memory used • Inter-process Sharing Happens at Memory Speed Spark Compute Spark Storage HDFS / Amazon S3 block 1 block 3 block 2 block 4 HDFS disk block 1 block 3 block 2 block 4 Alluxio block 1 block 3 block 4 Spark Compute Spark Storage
  22. 22. DATA RESILIENCE DURING CRASH 14 Spark Compute Spark Storage block 1 block 3 HDFS / Amazon S3 block 1 block 3 block 2 block 4 Storage Engine & Execution Engine Same Process
  23. 23. DATA RESILIENCE DURING CRASH 15 CRASH Spark Storage block 1 block 3 HDFS / Amazon S3 block 1 block 3 block 2 block 4 • Process Crash Requires Network and/or Disk I/O to Re-read Data Storage Engine & Execution Engine Same Process
  24. 24. DATA RESILIENCE DURING CRASH 16 CRASH HDFS / Amazon S3 block 1 block 3 block 2 block 4 Storage Engine & Execution Engine Same Process • Process Crash Requires Network and/or Disk I/O to Re-read Data
  25. 25. DATA RESILIENCE DURING CRASH 17 Spark Compute Spark Storage HDFS / Amazon S3 block 1 block 3 block 2 block 4 HDFS disk block 1 block 3 block 2 block 4 Alluxio block 1 block 3 block 4 Storage Engine & Execution Engine Different process
  26. 26. DATA RESILIENCE DURING CRASH 18 • Process Crash – Data is Re-read at Memory Speed HDFS / Amazon S3 block 1 block 3 block 2 block 4 HDFS disk block 1 block 3 block 2 block 4 Alluxio block 1 block 3 block 4 CRASH Storage Engine & Execution Engine Different process
  27. 27. ACCESSING ALLUXIO DATA FROM SPARK 19 Writing Data Write to an Alluxio file Reading Data Read from an Alluxio file
  28. 28. CODE EXAMPLE FOR SPARK RDDS 20 Writing RDD to Alluxio rdd.saveAsTextFile(alluxioPath) rdd.saveAsObjectFile(alluxioPath) Reading RDD from Alluxio rdd = sc.textFile(alluxioPath) rdd = sc.objectFile(alluxioPath)
  29. 29. CODE EXAMPLE FOR SPARK DATAFRAMES 21 Writing DataFrame to Alluxio df.write.parquet(alluxioPath) Reading DataFrame from Alluxio df = sc.read.parquet(alluxioPath)
  30. 30. OUTLINE • Alluxio Overview • Alluxio + Spark Use Cases • Using Alluxio with Spark • Performance Evaluation 22
  31. 31. ENVIRONMENT Spark 2.0.0 + Alluxio 1.2.0 Single r3.2xlarge instance (61GB RAM) Comparisons: • Alluxio • Spark Storage Level: MEMORY_ONLY • Spark Storage Level: MEMORY_ONLY_SER • Spark Storage Level: DISK_ONLY 23
  32. 32. Reading Cached RDDTime[seconds] 0 75 150 225 300 RDD Size [GB] 0 13 25 38 50 Alluxio (textFile) Alluxio (objectFile) DISK_ONLY MEMORY_ONLY_SER MEMORY_ONLY 24
  33. 33. 25 New Context: Read 50 GB RDD (SSD) No Alluxio Alluxio (objectFile) Alluxio (textFile) Time [seconds] 0 55 110 165 220 2x speedup 4x speedup
  34. 34. 26 New Context: Read 50 GB RDD (S3) No Alluxio Alluxio (objectFile) Alluxio (textFile) Time [seconds] 0 200 400 600 800 7x speedup 16x speedup
  35. 35. 27 Reading CACHED DATAFRAME (parquet)Time[seconds] 0 62.5 125 187.5 250 DataFrame Size [GB] 0 13 25 38 50 Alluxio (textFile) DISK_ONLY MEMORY_ONLY_SER MEMORY_ONLY
  36. 36. 28 New Context: Read 50 GB DATAFRAME (SSD) No Alluxio Alluxio Time [seconds] 0 60 120 180 240 2.5x speedup
  37. 37. 29 New CONTEXT: Read 50 GB DataFrame (S3) No Alluxio Alluxio Time [seconds] 0 250 500 750 1000 1250 1500 1750 10x speedup (avg), 17x speedup (peak)
  38. 38. CONCLUSION 30 • Easy to use with Spark • Better Performance • Predictable Performance
  39. 39. Contact: jiri@alluxio.com Twitter: @jsimsa Websites: www.alluxio.com and www.alluxio.org Thank you! 31

×