Getting Started with Alluxio + Spark + S3

Bay Area Meetup presentation (6/15/16)

Published in: Technology

  1. Alluxio (formerly Tachyon): Getting Started with Alluxio + Spark + S3 • Calvin Jia • June 15, 2016 @ Alluxio Meetup (hosted by Intel) • Related Blog Post: http://goo.gl/MUpL0O
  2. Who Am I? • Calvin Jia • SWE @ Alluxio, Inc. • Alluxio PMC Member • Twitter: @JiaCalvin
  3. Outline • Technology Overview • Alluxio + Spark + S3 • Demo
  4. Alluxio Ecosystem
  5. Why Alluxio? • Data sharing between jobs • Data resilience during application crashes • Consolidate memory usage and alleviate GC issues
  6. Data Sharing Between Jobs • [Diagram: storage engine & execution engine in the same process; two In-Memory Storage instances, each holding block 1 and block 3, alongside the full set of blocks 1 to 4] • Inter-process sharing slowed down by network I/O
  7. Data Sharing Between Jobs • [Diagram: storage & execution engine separated; HDFS / disk and In-Memory tiers holding blocks 1 to 4] • Inter-process sharing can happen at memory speed
  8. Data Resilience during Crashes • [Diagram: storage engine & execution engine in the same process; In-Memory Storage holding block 1 and block 3, alongside the full set of blocks 1 to 4] • Process crash requires network I/O to re-read the data
  9. Data Resilience during Crashes • [Diagram: as on the previous slide, with the process marked "Crash"] • Process crash requires network I/O to re-read the data
  10. Data Resilience during Crashes • [Diagram: storage engine & execution engine in the same process, marked "Crash"; only the set of blocks 1 to 4 is shown] • Process crash requires network I/O to re-read the data
  11. Data Resilience during Crashes • [Diagram: storage & execution engine separated; HDFS / disk and In-Memory tiers holding blocks 1 to 4] • Process crash only needs memory I/O to re-read the data
  12. Data Resilience during Crashes • [Diagram: as on the previous slide, with the process marked "Crash"] • Process crash only needs memory I/O to re-read the data
  13. Consolidating Memory • [Diagram: storage engine & execution engine in the same process; two In-Memory Storage instances, each holding block 1 and block 3, alongside the full set of blocks 1 to 4] • Data duplicated at memory level
  14. Consolidating Memory • [Diagram: storage & execution engine separated; HDFS / disk and In-Memory tiers holding blocks 1 to 4] • Data not duplicated at memory level
  15. Outline • Technology Overview • Alluxio + Spark + S3 • Demo
  16. Visualizing the Stack • [Diagram labels] Speed tiers: FAST 10^4 to 10^5 MB/s, MODERATE 10^3 to 10^4 MB/s, SLOW 10^2 to 10^3 MB/s • Access frequency: Only when necessary, Limited, Often • Storage media: SSD, HDD, Mem
  17. When to use Alluxio • Two or more jobs access the same dataset • Job(s) may not always succeed • Dataset larger than Spark JVM • Jobs are pipelined • Resulting data does not need to be immediately persisted
  18. Version Selection • Alluxio 1.1.0 – Latest released version – Many improvements, upgrade recommended • Spark 1.6.1 – Latest released version – Remember to use the Spark-specific Alluxio client, i.e. build with -Pspark – Spark 2.0 is coming out soon; we will recommend the best way to integrate with Alluxio
  19. API Selection • Access data directly through the FileSystem API, but change the scheme to alluxio:// – Minimal code change – Do not need to reason about logic • Example: – val file = sc.textFile("s3n://my-bucket/myFile") – val file = sc.textFile("alluxio://master:19998/myFile")
  20. Outline • Technology Overview • Alluxio + Spark + S3 • Demo
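
The scheme change on the API Selection slide can be sketched as a small path-rewriting helper. This is a minimal illustration, not part of the talk: the `AlluxioPaths` object and its `toAlluxio` method are hypothetical names. It assumes the S3 bucket root is mounted at the root of the Alluxio namespace, so the bucket name is dropped and only the object key is kept. The default host `master` and port `19998` are taken from the slide's example.

```scala
import java.net.URI

// Hypothetical helper (not an Alluxio or Spark API).
object AlluxioPaths {
  private val s3Schemes = Set("s3", "s3n", "s3a")

  // Rewrites an S3 URI to an alluxio:// URI with the same object key.
  // Assumption: the bucket root is mounted at the Alluxio namespace root,
  // so the bucket name does not appear in the result.
  def toAlluxio(s3Uri: String, master: String = "master", port: Int = 19998): String = {
    val uri = new URI(s3Uri)
    require(s3Schemes.contains(uri.getScheme), s"not an S3 URI: $s3Uri")
    // uri.getPath is the object key with its leading slash, e.g. "/myFile"
    new URI("alluxio", null, master, port, uri.getPath, null, null).toString
  }
}
```

With the slide's example path, `AlluxioPaths.toAlluxio("s3n://my-bucket/myFile")` yields `"alluxio://master:19998/myFile"`, which could then be passed to `sc.textFile(...)` in a Spark shell that has the Alluxio client on its classpath. Keeping the rewrite in one place means the rest of the job code stays unchanged, which is the "minimal code change" point the slide makes.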
