Best Practices for Using Alluxio with Spark
Haoyuan Li, Ancil McBarnett
Strata New York, Sept 2017
Confidential © Alluxio, Inc. All Rights Reserved.
Published on

Strata New York
September 2017
Haoyuan Li, Ancil McBarnett

Published in: Technology

Best Practices for Using Alluxio with Spark

  1. Best Practices for Using Alluxio with Spark
     Haoyuan Li, Ancil McBarnett
     Strata New York, Sept 2017
  2. Outline
     1. Alluxio Overview
     2. Alluxio + Spark Use Cases
     3. Using Spark with Alluxio
     4. Performance Evaluation
     5. Demo
  3. Data Ecosystem Yesterday
     •  One compute framework
     •  Single storage system
     •  Co-located
  4. Data Ecosystem Today
     •  Many compute frameworks
     •  Multiple storage systems
     •  Most not co-located
  5. Data Ecosystem Issues
     •  Each application manages multiple data sources
     •  Adding or removing data sources requires application changes
     •  Storage optimizations require application changes
     •  Lower performance due to lack of locality
  6. Data Ecosystem Challenges
     1. Speed & Complexity
        •  Numerous storage & compute systems
        •  Integration and interoperability issues (on prem, hybrid, cloud)
        •  Many departments & groups
     2. Data Freshness
        •  Cross-network movement is slow
        •  Each ETL creates more lag
     3. Cost
        •  Data duplication
        •  Data and app explosion driving cost up
     4. Security & Governance
        •  Data security & governance is increasingly complex
     Heavy integrations create painful organizational drag.
  7. Data Ecosystem with Alluxio
     •  Apps only talk to Alluxio
     •  Simple add/remove
     •  No app changes
     •  Highest performance in memory
     •  No lock-in
     Application-facing interfaces: Native File System, Hadoop Compatible File System, Native Key-Value Interface, FUSE Compatible File System
     Storage-facing interfaces: HDFS Interface, Amazon S3 Interface, Swift Interface, GlusterFS Interface
  8. Alluxio Design Principles
     1. Big Data & Machine Learning
        •  Interoperability with leading projects
        •  Large scale data sets
        •  High IO
     2. Optimize Data Access
        •  Remote data
        •  Service-oriented & microservices
        •  Hot/warm/cold data
        •  Temporary data
     3. Application Data Sharing
        •  Multiple compute frameworks within a node or cluster
        •  Shared storage
        •  Read/write support
     4. Enterprise Class
        •  Distributed architecture
        •  Commodity hardware
        •  High availability
        •  Security
  9. Alluxio Innovation: Server-side API Translation
     Convert from client-side interface to native storage interface.
     Diagram labels: HDFS Interface, HDFS Interface, S3A Interface, Swift Interface, Google Cloud Interface
  10. Alluxio Innovation: Server-side API Translation
      Convert between different versions of HDFS.
      Diagram labels: HDFS 2.7 Interface, HDP 2.4 Interface, CDH 5.6 Interface, MAPR 5.2 Interface
  11. Alluxio Innovation: Unified Namespace
      Enables effective data management across different under stores.
      Uses mounting with transparent naming.
  12. Alluxio Innovation: Unified Namespace
      Create a catalog of available data sources for data scientists, all under alluxio://
      •  /finance/customer-transactions/
      •  /finance/vendor-transactions/
      •  /operations/device-logs/
      •  /operations/phone-call-recordings/
      •  /operations/check-images/
      •  /research/us-economic-data/
      •  /research/intl-economic-data/
      •  /marketing/advertising-dataset/
      •  /marketing/marketing-funnel-dataset/
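The catalog above relies on the mounting idea from slide 11: each Alluxio path prefix stands in for a location in some under store. As a rough illustration only (this is not Alluxio's implementation), the sketch below resolves an Alluxio path to an under-store URI by longest-prefix match. The mount table, host names, and bucket names are hypothetical.

```python
from typing import Dict, Optional

# Hypothetical mount table: Alluxio path prefix -> under-store URI.
# None of these hosts or buckets come from the slides.
MOUNTS: Dict[str, str] = {
    "/finance": "hdfs://namenode:8020/warehouse/finance",
    "/operations": "s3a://example-ops-bucket/raw",
    "/operations/device-logs": "hdfs://namenode:8020/logs/devices",
}


def resolve(alluxio_path: str, mounts: Dict[str, str] = MOUNTS) -> Optional[str]:
    """Map an Alluxio path to its under-store URI, or None if nothing is mounted there."""
    best = None
    for prefix in mounts:
        # Match whole path components only, so "/financex" does not match "/finance".
        if alluxio_path == prefix or alluxio_path.startswith(prefix + "/"):
            if best is None or len(prefix) > len(best):
                best = prefix  # keep the longest (most specific) mount point
    if best is None:
        return None
    return mounts[best] + alluxio_path[len(best):]
```

With this table, `/operations/device-logs/part-0` resolves through the more specific HDFS mount, while other `/operations/...` paths fall back to the S3 bucket. The application only ever sees the `alluxio://` path.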
  13. Alluxio Innovation: Intelligent Cache
      Local performance from remote data using native multi-tier storage.
      Tiers: RAM (hot), SSD (warm), HDD (cold)
  14. Where to use Alluxio
      Finding high-fit Alluxio use cases
      •  Compute zone: standalone or managed with Mesos or YARN
      •  Storage in a different availability zone: either on-prem or cloud
      Alluxio is installed with or near compute to unify data stores, stage remote data, and improve system performance.
      Diagram labels: Spark, TensorFlow, Presto, HDFS
      Guidelines:
      ✓  Compute separated from storage
      ✓  Distributed compute
      ✓  I/O or network latency exists
      ✓  Unification of many storage systems
      ✓  Applications sharing long-lived data
      More checks result in higher-fit applications.
  15. Fastest Growing Big Data Open Source Projects
      •  Fastest-growing open-source project in the big data ecosystem
      •  Running in large production clusters
      •  500+ contributors from 100+ organizations
      Chart: GitHub open-source contributors by month (y-axis: number of contributors); series: Alluxio, Spark, Kafka, Redis, HDFS, Cassandra, Hive
  16. Outline (same five sections as slide 2)
  17. Big Data Case Study
      Challenge: Gain end-to-end view of business with large volume of data. Queries were slow / not interactive, resulting in operational inefficiency.
      Solution: ETL data from Teradata to Alluxio.
      Impact: Faster time to market. “Now we don’t have to work Sundays”
      Diagram labels: Spark, Teradata
      http://bit.ly/2oMx95W
  18. Big Data Case Study
      Challenge: Gain end-to-end view of business with large volume of data. Queries were slow / not interactive, resulting in operational inefficiency.
      Solution: With Alluxio, data queries are 30X faster.
      Impact: Higher operational efficiency.
      Diagram labels: Spark, Baidu File System
      http://bit.ly/2pDHS3O
  19. Big Data Case Study
      Challenge: Gain end-to-end view of business with large volume of data for $5B travel site. Queries were slow / not interactive, resulting in operational inefficiency.
      Solution: With Alluxio, 300x improvement in performance.
      Impact: Increased revenue from immediate response to user behavior.
      Diagram labels: Spark, Flink, HDFS, Ceph
      Use case: http://bit.ly/2pDJdrq
  20. Machine Learning Case Study
      Challenge: Disparate data both on-prem and cloud. Heterogeneous types of data. Scaling of exabyte-size data. Slow due to disk-based approach.
      Solution: Using Alluxio to prevent I/O bottlenecks.
      Impact: Orders of magnitude higher performance than before.
      Diagram labels: Spark, HDFS, Minio, Mesos
      http://bit.ly/2p18ds3
  22. Consolidating Memory
      Storage engine & execution engine in the same process
      •  Two copies of data in memory: double the memory used
      •  Inter-process sharing slowed down by network / disk I/O
      Diagram: two Spark processes (Spark Compute + Spark Storage), each holding block 1 and block 3; HDFS / Amazon S3 holding blocks 1 to 4
  23. Consolidating Memory
      Storage engine & execution engine in different processes
      •  Half the memory used
      •  Inter-process sharing happens at memory speed
      Diagram: two Spark processes (Spark Compute + Spark Storage); Alluxio holding blocks 1, 3, and 4; HDFS / Amazon S3 (HDFS disk) holding blocks 1 to 4
  24. Data Resilience During Crash
      Storage engine & execution engine in the same process
      Diagram: Spark Compute with Spark Storage holding block 1 and block 3; HDFS / Amazon S3 holding blocks 1 to 4
  25. Data Resilience During Crash
      Storage engine & execution engine in the same process
      •  Process crash requires network and/or disk I/O to re-read data
      Diagram: CRASH in place of Spark Compute; Spark Storage holding block 1 and block 3; HDFS / Amazon S3 holding blocks 1 to 4
  26. Data Resilience During Crash
      Storage engine & execution engine in the same process
      •  Process crash requires network and/or disk I/O to re-read data
      Diagram: CRASH in place of the Spark process; HDFS / Amazon S3 holding blocks 1 to 4
  27. Data Resilience During Crash
      Storage engine & execution engine in different processes
      Diagram: Spark Compute + Spark Storage; Alluxio holding blocks 1, 3, and 4; HDFS / Amazon S3 (HDFS disk) holding blocks 1 to 4
  28. Data Resilience During Crash
      Storage engine & execution engine in different processes
      •  Process crash: data is re-read at memory speed
      Diagram: CRASH in place of the Spark process; Alluxio holding blocks 1, 3, and 4; HDFS / Amazon S3 (HDFS disk) holding blocks 1 to 4
  29. Accessing Alluxio Data From Spark
      •  Writing data: write to an Alluxio file
      •  Reading data: read from an Alluxio file
  30. Code Example for Spark RDDs
      Writing an RDD to Alluxio:
          rdd.saveAsTextFile(alluxioPath)
          rdd.saveAsObjectFile(alluxioPath)
      Reading an RDD from Alluxio:
          rdd = sc.textFile(alluxioPath)
          rdd = sc.objectFile(alluxioPath)
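The RDD calls on slide 30 can be wrapped into a small round trip. The sketch below is a minimal PySpark-flavored version under several assumptions: `sc` is an existing SparkContext, the Alluxio client is already available to Spark, and the host name is a placeholder (19998 is Alluxio's default master RPC port). `saveAsObjectFile` and `objectFile` belong to the Scala/Java RDD API; in PySpark the closest counterparts are `saveAsPickleFile` and `pickleFile`, which are used here instead. The helper names are made up for illustration.

```python
def alluxio_uri(path, host="localhost", port=19998):
    """Build an alluxio:// URI. The host is a placeholder for the Alluxio master."""
    return "alluxio://{}:{}/{}".format(host, port, path.lstrip("/"))


def round_trip_text(sc, records, path):
    """Write records to Alluxio as a text file, then read them back as a new RDD."""
    rdd = sc.parallelize(records)
    rdd.saveAsTextFile(path)  # Spark refuses to write if the target directory exists
    return sc.textFile(path)  # each element comes back as a string


def round_trip_pickle(sc, records, path):
    """Same round trip with PySpark's pickled format, which preserves Python objects."""
    rdd = sc.parallelize(records)
    rdd.saveAsPickleFile(path)
    return sc.pickleFile(path)


# Usage, assuming a running Spark + Alluxio deployment:
#   from pyspark import SparkContext
#   sc = SparkContext(appName="alluxio-rdd-sketch")
#   lines = round_trip_text(sc, ["a", "b"], alluxio_uri("/tmp/rdd-text"))
#   print(lines.collect())
```

Because the data lives in Alluxio rather than in the Spark executor, a second Spark application (or a restarted one) can call `sc.textFile` on the same URI without re-reading the under store.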
  31. Code Example for Spark DataFrames
      Writing to Alluxio:
          df.write.parquet(alluxioPath)
      Reading from Alluxio (spark is a SparkSession):
          df = spark.read.parquet(alluxioPath)
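The DataFrame calls on slide 31 follow the same pattern. The sketch below is a hedged PySpark-style helper, assuming `spark` is an existing SparkSession, `df` is a DataFrame, and `path` is an `alluxio://` URI supplied by the caller. The function name and the `overwrite` flag are illustrative additions, not part of the talk.

```python
def round_trip_parquet(spark, df, path, overwrite=False):
    """Write df to path as Parquet, then load it back as a new DataFrame."""
    writer = df.write
    if overwrite:
        # Without an explicit mode, Spark raises an error if the path already exists.
        writer = writer.mode("overwrite")
    writer.parquet(path)
    return spark.read.parquet(path)


# Usage, assuming a running Spark + Alluxio deployment and a placeholder host:
#   from pyspark.sql import SparkSession
#   spark = SparkSession.builder.appName("alluxio-df-sketch").getOrCreate()
#   df = spark.range(10)
#   df2 = round_trip_parquet(spark, df, "alluxio://localhost:19998/tmp/df", overwrite=True)
#   df2.show()
```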
  32. Outline (same five sections as slide 2)
  33. Experiments
      •  Spark 2.0.0 + Alluxio 1.2.0
      •  Single worker: Amazon r3.2xlarge
      •  Comparisons: Alluxio; Spark storage level MEMORY_ONLY; Spark storage level MEMORY_ONLY_SER; Spark storage level DISK_ONLY
  34. Reading Cached RDD
      Chart: time [seconds] against RDD size [GB]; series: Alluxio (textFile), Alluxio (objectFile), DISK_ONLY, MEMORY_ONLY_SER, MEMORY_ONLY
  35. New Context: Read 50 GB RDD (S3)
      Chart: time [seconds] for Alluxio (textFile), Alluxio (objectFile), and No Alluxio; annotations: 7x speedup, 16x speedup
  36. Reading Cached DataFrame (parquet)
      Chart: time [seconds] against DataFrame size [GB]; series: Alluxio (textFile), MEMORY_ONLY_SER, MEMORY_ONLY
  37. New Context: Read 50 GB DataFrame (S3)
      Chart: time [seconds] for Alluxio and No Alluxio; annotation: 10x average speedup, 17x peak speedup
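The speedup figures on the result slides are ratios of read times: time without Alluxio divided by time with Alluxio. The sketch below shows one way such a comparison could be timed. It is not the harness used for the talk; the helper names are hypothetical, and the two read actions are left for the caller to supply.

```python
import time
from typing import Callable, List


def time_runs(action: Callable[[], object], runs: int = 3) -> List[float]:
    """Run action several times and return the wall-clock duration of each run in seconds."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        action()  # e.g. a function that reads a dataset and forces evaluation with count()
        timings.append(time.perf_counter() - start)
    return timings


def speedup(baseline_seconds: float, candidate_seconds: float) -> float:
    """Return how many times faster the candidate is than the baseline."""
    if candidate_seconds <= 0:
        raise ValueError("candidate_seconds must be positive")
    return baseline_seconds / candidate_seconds


# Usage with hypothetical read functions supplied by the caller:
#   baseline = time_runs(read_from_s3)        # no Alluxio
#   candidate = time_runs(read_from_alluxio)  # same data served by Alluxio
#   print(speedup(sum(baseline) / len(baseline), sum(candidate) / len(candidate)))
```

Spark evaluates lazily, so each timed action needs to trigger a job (for example with `count()`); otherwise only the cost of building the plan is measured.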
  38. Outline (same five sections as slide 2)
  39. Demo Environment
      Spark, Alluxio
  40. Conclusion
      •  Easy to use Alluxio with Spark
      •  Predictable and improved performance
      •  Easily connect to various storages
  41. Thank you!
      Haoyuan Li: haoyuan@alluxio.com, Twitter: @haoyuan
      Ancil McBarnett: ancil@alluxio.com, Twitter: @
      Website: www.alluxio.com
      E-mail: info@alluxio.com
      Social media: Twitter.com/alluxio, Linkedin.com/alluxio