Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Alluxio Use Cases at Strata+Hadoop World Beijing 2016

1,796 views

Published on

Alluxio Use Cases at Strata+Hadoop World Beijing 2016

Published in: Technology
  • Be the first to comment

Alluxio Use Cases at Strata+Hadoop World Beijing 2016

  1. 1. Memory-Speed Virtual Distributed Storage: Latest Features and Use Cases August 2016 @ Strata+Hadoop World Beijing Bin Fan, Yupeng Fu 1
  2. 2. About Us •  Bin Fan •  Software Engineer @ Alluxio, Inc. •  Alluxio PMC •  Worked at Google, MSR •  CMU, CUHK, USTC •  Wechat @ apc999_fb •  Yupeng Fu •  Software Engineer @ Alluxio, Inc. •  Alluxio PMC •  Worked at Palantir, Google •  UCSD, Tsinghua •  Wechat @richbird9 2
  3. 3. •  Team consists of Alluxio creators and top committers •  Invested by •  Committed to Alluxio Open Source •  http://www.alluxio.com Alluxio Inc. 3
  4. 4. Outline •  What is Alluxio •  Example use cases 1.  Off-heap memory to alleviate resource pressure 2.  Fast data sharing between jobs 3.  Accelerate access to remote storage 4.  Unified namespace •  Summary 4
  5. 5. Non-persistent data-storage software. ATTRIBUTES Memory-Speed Virtual Distributed Storage Scale out architecture. Virtualized across different storage types under a unified namespace. Memory-speed access to data. 5
  6. 6. OPEN SOURCE ALLUXIO: History • Started in the AMPLab at UC Berkeley –  Summer of 2012 • Open Sourced as Tachyon (renamed to Alluxio) –  Apache 2.0 License –  Latest Release: version 1.2.0 (July, 2016) 6
  7. 7. OPEN SOURCE ALLUXIO v0.2 v0.3 v0.4 v0.5 v0.6 v0.7 v0.8 v1.0 v1.1 7
  8. 8. OPEN SOURCE ALLUXIO The fastest growing open-source project in the big data ecosystem. Currently over 300 contributors from over 100 organizations. 8
  9. 9. Outline •  What is Alluxio •  Example use cases 1.  Off-heap memory to alleviate resource pressure 2.  Fast data sharing between jobs 3.  Accelerate access to remote storage 4.  Unified namespace •  Summary 9
  10. 10. Application 1 Alluxio CASE 1: Off-heap Memory Storage to Alleviate Resource Pressure 10
  11. 11. HDFS / Amazon S3 Consolidating Memory Spark Job1 Spark Memory block 1 block 3 Spark Job2 Spark Memory block 3 block 1 block 1 block 3 block 2 block 4 storage engine & execution engine same process Data duplicated at memory-level 11
  12. 12. Spark Job1 Spark mem Spark Job2 Spark mem HDFS / Amazon S3 block 1 block 3 block 2 block 4 HDFS disk block 1 block 3 block 2 block 4 Alluxio In-Memory block 1 block 3 block 4 Data not duplicated at memory-level Consolidating Memory storage engine & execution engine same process 12
  13. 13. Data Resilience during Crashes Spark Task Spark Memory block manager block 1 block 3 HDFS / Amazon S3 block 1 block 3 block 2 block 4 Process crash requires network and/or disk I/O to re-read the data storage engine & execution engine same process 13
  14. 14. Data Resilience during Crashes Crash Spark Memory block manager block 1 block 3 HDFS / Amazon S3 block 1 block 3 block 2 block 4 Process crash requires network and/or disk I/O to re-read the data storage engine & execution engine same process 14
  15. 15. HDFS / Amazon S3 Data Resilience during Crashes block 1 block 3 block 2 block 4 Crash storage engine & execution engine same process Process crash requires network and/or disk I/O to re-read the data 15
  16. 16. Spark Task Spark Memory block manager HDFS disk block 1 block 3 block 2 block 4 Alluxio In-Memory block 1 block 3 block 4 Process crash only needs memory I/O to re-read the data Data Resilience during Crashes storage engine & execution engine same process 16
  17. 17. Crash Process crash only needs memory I/O to re-read the data HDFS disk block 1 block 3 block 2 block 4 Alluxio In-Memory block 1 block 3 block 4 Data Resilience during Crashes storage engine & execution engine same process 17
  18. 18. CASE STUDY Thanks to Alluxio, we now have the raw data immediately available at every iteration and we can skip the costs of loading in terms of time waiting, network traffic, and RDBMS activity. - Henry Powell, Barclays RESULTS •  Barclays workflow iteration time decreased from hours to seconds •  Alluxio enabled workflows that were impossible before •  By keeping data only in memory, the I/O cost of loading and storing in Alluxio is now on the order of seconds Relational Database Share Data Across Jobs at Memory Speed •  Reducing time from hours to seconds •  Solving issues of Teradata as under storage 18
  19. 19. Barclays Previous Architecture Spark Teradata Slow ETL, 30+ minutes Each context restart requires ETL process again h"ps://dzone.com/ar1cles/Accelerate-­‐In-­‐Memory-­‐Processing-­‐with-­‐Spark-­‐from-­‐Hours-­‐to-­‐Seconds-­‐With-­‐Tachyon   19
  20. 20. Barclays Architecture with Alluxio Spark Teradata Alluxio ETL only once Data still available in Alluxio memory after Job crash 20
  21. 21. CASE 2: Fast Data sharing between apps Application 1 Alluxio Application2 HDFS 21
  22. 22. Spark Job Spark Memory block 1 block 3 Hadoop MR Job YARN HDFS / Amazon S3 block 1 block 3 block 2 block 4 storage engine & execution engine same process Data Sharing Between Frameworks Inter-process sharing slowed down by network and/or disk I/O 22
  23. 23. Data Sharing Between Frameworks Spark Job Spark Memory Hadoop MR Job YARN HDFS / Amazon S3 block 1 block 3 block 2 block 4 HDFS disk block 1 block 3 block 2 block 4 Alluxio In-Memory block 1 block 3 block 4 storage engine & execution engine same process Inter-process sharing can happen at memory speed 23
  24. 24. CASE STUDY We’ve been running Alluxio in production for over 9 months, resulting in 15x speedup on average, and 300x speedup at peak service times. - Xueyan Li, Qunar RESULTS •  Multiple computing frameworks can share data at memory speed with Alluxio •  Improved the performance of their system with 15x – 300x speedups •  Alluxio provides various user-friendly APIs, which simplifies the transition to Alluxio. Sharing Data Across Different Compute Frameworks •  6 billion logs (4.5 TB) daily •  Mix of Memory + HDD 24
  25. 25. Qunar Previous Architecture Flink Streaming Spark Spark Streaming Flink HDFS Cluster Some jobs cannot finish Slow access to the storage 25
  26. 26. Qunar Architecture with Alluxio Flink Streaming HDFS Cluster Spark Spark Streaming Flink Alluxio Share in-memory data across frameworks 26
  27. 27. CASE 3: Accelerate access to remote storage Application Alluxio Remote Storage 27
  28. 28. •  Different compute and storage hardware requirements •  Scale compute and storage resources independently •  Cost effective object stores (Amazon S3, Google GCS, Microsoft Azure Blob Store) are inherently decoupled Decoupling Compute from Storage Accessing data requires remote I/O! 28
  29. 29. Remote I/O problems Application Remote Storage every data operation requires data transfer, sometimes over the WAN high latency, network throughput 29
  30. 30. Visualizing the Stack with Alluxio FAST 104 - 105 MB/s Application Remote Storage MODERATE 103 - 104 MB/s SLOW 102 - 103 MB/s Alluxio MEM Often Only When Necessary Alluxio SSD/HDD Limited 30
  31. 31. What if memory capacity is still not enough? 31
  32. 32. Alluxio Manages All Local Storage MEM SSD HDD Faster Higher Capacity 32
  33. 33. Configurable Storage Tiers MEM only MEM + HDD SSD only 33
  34. 34. Pluggable Tier Management Policies Evict stale data to slower tier Promote hot data to faster tier 34
  35. 35. CASE STUDY Baidu File System The performance was amazing. With Spark SQL alone, it took 100-150 seconds to finish a query; using Alluxio, where data may hit local or remote Alluxio nodes, it took 10-15 seconds. - Shaoshan Liu, Baidu RESULTS •  Data queries are now 30x faster with Alluxio •  1000-worker Alluxio cluster run stably, providing over 50TB of RAM space •  By using Alluxio, batch queries usually lasting over 15 minutes were transformed into an interactive query taking less than 30 seconds Accelerate Access to Remote Storage •  1000+ nodes deployment •  2+ petabytes of storage •  Mix of memory + HDD •  1000-worker cluster 35
  36. 36. CASE 4: Unified Name Space 36   Application Alluxio AWS S3 HDFS
  37. 37. Common Scenarios •  At large organizations, data spans many storage systems •  Legacy storage systems •  Application logic needs to integrate with different storage systems 37
  38. 38. Unified Name Space An abstraction for applications to access different storage systems through the same interface •  Benefits –  Application written once with Allluxio API –  Transparently add new storage systems –  Seamlessly integrate with existing systems 38
  39. 39. Transparent Naming Operations over persisted Alluxio objects mapped transparently to underlying storage alluxio://Data/Sales <-> hdfs://Data/Sales Alluxio Storage System (HDFS, S3, …) alluxio://host:port/ Data Users Reports Sales Alice Bob hdfs://host:port/ Data Users Reports Sales Alice Bob 39 mount
  40. 40. Multiple Storage Systems •  Unified namespace for multiple data sources alluxio://Data/Sales -> s3://bucket/Sales alluxio://Users/Alice -> hdfs://Users/Alice •  API for on-the-fly mounting / unmounting Alluxio Storage System A alluxio://host:port/ Data Users Alice Bob hdfs://host:port/ Users Alice Bob Storage System B s3://host/bucket Reports Sales Reports Sales 40 mount mount
  41. 41. A Proliferation of Under Storage Systems MORE… 41
  42. 42. CASE STUDY RESULTS •  Alluxio’s unified namespace enables different applications and frameworks to easily interact with their data from different storage systems •  Global data centers across North America and Asia •  Tiered storage feature manages various storage resources including memory, SSD and disk Transparently Manage Data Across Different Storage Systems •  Storage medium: MEM •  Clusters in different regions An Internet Company Machine Learning Pipelines 42 42
  43. 43. Outline •  What is Alluxio •  Example use cases 1.  Off-heap memory to alleviate resource pressure 2.  Fast data sharing between jobs 3.  Accelerate access to remote storage 4.  Unified namespace •  Summary 43
  44. 44. Takeaway: When to use Alluxio •  Two or more jobs access the same dataset •  Job(s) may not always succeed •  Spark with memory/GC pressure •  Jobs/Apps are pipelined •  Resulting data does not need to be immediately persisted •  Interact with remote data •  Data stored across different storage systems 44
  45. 45. Try out Alluxio 1.2.0 http://www.alluxio.org/releases 45
  46. 46. First China Meetups! •  Nanjing (Saturday, 7/30) •  Shanghai (Sunday, 7/31) •  Beijing (Sunday, 8/7) 46
  47. 47. Resources •  Alluxio Project: http://www.alluxio.org •  Development: https://github.com/Alluxio/alluxio •  Meet Friends: http://www.meetup.com/Alluxio •  Alluxio, Inc.: http://www.alluxio.com •  Training: http://www.alluxio.com/alluxio-training/ •  Contact us: info@alluxio.com •  Alluxio Wechat: Alluxio 47

×