
Alluxio: Unify Data at Memory Speed at Strata and Hadoop World San Jose 2017


Published in: Technology

  1. ALLUXIO (FORMERLY TACHYON): UNIFY DATA AT MEMORY SPEED • Strata and Hadoop World San Jose, March 2017 • Haoyuan Li, Alluxio, Inc. • Calvin Jia, Alluxio, Inc.
  2. HISTORY • Started at UC Berkeley AMPLab in summer 2012 • Originally named Tachyon • Rebranded to Alluxio in early 2016 • Open sourced in 2013 • Apache License 2.0 • Latest stable release: Alluxio 1.4.0 • Alluxio 1.5.0 planned for Q2 2017
  3. ALLUXIO DEPLOYMENTS © 2016 Alluxio Confidential
  4. BIG DATA ECOSYSTEM YESTERDAY
  5. BIG DATA ECOSYSTEM TODAY
  6. BIG DATA ECOSYSTEM ISSUES
  7. BIG DATA ECOSYSTEM WITH ALLUXIO • FUSE Compatible File System • Hadoop Compatible File System • Native Key-Value Interface • Native File System • GlusterFS Interface • Amazon S3 Interface • Swift Interface • HDFS Interface
  8. BIG DATA ECOSYSTEM WITH ALLUXIO • FUSE Compatible File System • Hadoop Compatible File System • Native Key-Value Interface • Native File System • Unifying Data at Memory Speed • GlusterFS Interface • Amazon S3 Interface • Swift Interface • HDFS Interface
  11. FASTEST GROWING BIG DATA PROJECTS • Fastest growing open-source project in the big data ecosystem • 500+ contributors from 100+ organizations • Running in large production clusters • Community members are welcome! [Chart: Popular Open Source Projects' Growth; x-axis: Months; y-axis: Number of Contributors]
  12. WHY ALLUXIO • Co-located compute and data with memory-speed access to data • Virtualized across different storage systems under a unified namespace • Scale-out architecture • File system API, software only
  13. BENEFITS • Unification: new workflows across any data in any storage system • Performance: orders of magnitude improvement in run time • Flexibility: choice in compute and storage; grow each independently, buy only what is needed
  14. #1: ACCELERATING REMOTE STORAGE I/O • Scenario: compute and storage separation (meet different compute and storage hardware requirements; scale compute and storage independently; store data in traditional filers/SANs and object stores; analyze existing data with big data compute frameworks) • Limitation: accessing data requires remote I/O
  15. I/O WITHOUT ALLUXIO [Diagram: Spark, Storage]
  16. I/O WITHOUT ALLUXIO [Diagram: Spark, Storage] • Low latency, memory throughput • High latency, network throughput
  17. I/O WITH ALLUXIO [Diagram: Spark, Alluxio, Storage]
  18. I/O WITH ALLUXIO [Diagram: Spark, Alluxio, Storage] • Keeping data in Alluxio accelerates data access
  19. CASE STUDY: BAIDU • Use case: Baidu's PMs and analysts run interactive queries to gain insights into their products and business • Deployment: 200+ nodes; 2+ petabytes of storage; mix of memory + HDD [Diagram: Alluxio, Baidu File System] • Quote: "The performance was amazing. With Spark SQL alone, it took 100-150 seconds to finish a query; using Alluxio, where data may hit local or remote Alluxio nodes, it took 10-15 seconds." (Baidu) • Results: data queries are now 30x faster with Alluxio; the Alluxio cluster runs stably, providing over 50 TB of RAM space; batch queries that usually lasted over 15 minutes became interactive queries taking less than 30 seconds
  20. #2: SHARING DATA AT MEMORY SPEED AMONG APPLICATIONS • Scenario: data sharing architecture (pipelines where the output of one job is the input of the next job; applications, jobs, and contexts reading the same data) • Limitation: sharing data requires I/O
  21. SHARING WITHOUT ALLUXIO [Diagram: Spark, MapReduce, Spark, Storage]
  22. SHARING WITHOUT ALLUXIO [Diagram: Spark, MapReduce, Spark, Storage] • Network I/O • Disk I/O • I/O slows down sharing
  23. SHARING WITH ALLUXIO [Diagram: Spark, MapReduce, Spark, Alluxio, Storage]
  24. SHARING WITH ALLUXIO [Diagram: Spark, MapReduce, Spark, Alluxio, Storage] • Memory-speed sharing
  25. CASE STUDY: BARCLAYS • Use case: Barclays uses query and machine learning to train models for risk management • Deployment: 6 nodes; 1 TB of storage; memory only [Diagram: Alluxio, Relational Database] • Quote: "Thanks to Alluxio, we now have the raw data immediately available at every iteration and we can skip the costs of loading in terms of time waiting, network traffic, and RDBMS activity." (Barclays) • Results: Barclays workflow iteration time decreased from hours to seconds; Alluxio enabled workflows that were impossible before; by keeping data only in memory, the I/O cost of loading and storing in Alluxio is now on the order of seconds
  26. #3: UNIFYING DATA ACCESS FROM DIFFERENT STORAGE • Scenario: multiple storage systems (most enterprises have multiple storage systems; new storage systems that are better, faster, or cheaper keep arising) • Limitation: accessing data from different systems requires different APIs
  27. ACCESSING DATA THROUGH ALLUXIO [Diagram: Spark, MapReduce, Spark, Alluxio, Storage B]
  28. ACCESSING DATA THROUGH ALLUXIO [Diagram: Spark, MapReduce, Spark, Alluxio, Storage B]
  29. ACCESSING DATA THROUGH ALLUXIO [Diagram: Spark, MapReduce, Spark, Alluxio, Storage A, Storage B, Storage C] • Flexible and simple: no application changes, only a new mount point
  30. CASE STUDY: QUNAR • Use case: Qunar uses real-time machine learning for their website ads • Deployment: 200+ nodes; 6 billion logs (4.5 TB) daily; mix of memory + HDD • Quote: "We've been running Alluxio in production for over 9 months. Alluxio's unified namespace enables different applications and frameworks to easily interact with data from different storage systems." (Qunar) • Results: efficient data sharing among Spark Streaming, Spark batch, and Flink jobs; system performance improved with 15x to 300x speedups; the tiered storage feature manages storage resources including memory and HDD
  31. DEMO AND MORE CASES
  32. SUMMARY • Adopted by industry leaders • Unified, memory-speed data access across compute frameworks (Spark, Presto, MapReduce) and storage systems (S3, GCS, ECS, Ceph, HDFS, NFS, etc.) • Rapidly growing open-source community
  33. JOIN THE COMMUNITY • Contribute @ www.alluxio.org/contribute • Get started @ goo.gl/55ApFx
  34. Thank you! • Contact: {haoyuan, calvin}@alluxio.com • Twitter: @haoyuan • Websites: www.alluxio.com and www.alluxio.org • We are hiring! • Demo: Spark + Alluxio + S3; Alluxio Unified Namespace
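
The "new mount point, no application changes" idea from slide 29 can be illustrated with a small sketch. This is a toy model in Python, not the Alluxio API: the `MountTable` class, its method names, and the example URIs are all hypothetical and chosen only to show how one namespace path could be translated to a location in whichever storage system is mounted there.

```python
from typing import Dict


class MountTable:
    """Toy unified namespace (hypothetical, not the Alluxio API)."""

    def __init__(self) -> None:
        # Mount point in the unified namespace -> root URI in a storage system.
        self._mounts: Dict[str, str] = {}

    def mount(self, mount_point: str, storage_uri: str) -> None:
        # Normalize trailing slashes; the root mount point "/" is kept as "/".
        key = mount_point.rstrip("/") or "/"
        self._mounts[key] = storage_uri.rstrip("/")

    def resolve(self, path: str) -> str:
        """Translate a namespace path to a storage URI by longest-prefix match."""
        best = None
        for mount_point in self._mounts:
            prefix = mount_point.rstrip("/") + "/"
            if path == mount_point or path.startswith(prefix):
                if best is None or len(mount_point) > len(best):
                    best = mount_point
        if best is None:
            raise KeyError(f"no mount point covers {path}")
        rest = path[len(best):].lstrip("/")
        root = self._mounts[best]
        return f"{root}/{rest}" if rest else root


# Example URIs are made up for illustration.
table = MountTable()
table.mount("/", "hdfs://namenode:9000")
table.mount("/s3", "s3a://bucket/data")

hdfs_location = table.resolve("/logs/a.txt")
s3_location = table.resolve("/s3/2017/x.parquet")
print(hdfs_location)  # hdfs://namenode:9000/logs/a.txt
print(s3_location)    # s3a://bucket/data/2017/x.parquet
```

In this sketch an application keeps using paths such as `/s3/2017/x.parquet`; adding another storage system is one more `mount` call rather than a change to application code.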

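
The claims on slides 18 and 24 (keeping data in a memory layer accelerates access, and lets jobs share data without repeated storage I/O) can be sketched as a read-through cache. This is a minimal Python illustration under stated assumptions, not how Alluxio is implemented: `ReadThroughCache` and `fake_remote_store` are hypothetical names, and the "remote store" is a stand-in function rather than a real storage client.

```python
from typing import Callable, Dict


class ReadThroughCache:
    """Toy in-memory layer in front of a slow store (hypothetical, not the Alluxio API)."""

    def __init__(self, fetch_remote: Callable[[str], bytes]) -> None:
        self._fetch_remote = fetch_remote
        self._memory: Dict[str, bytes] = {}
        self.remote_reads = 0  # how many reads actually reached remote storage

    def read(self, path: str) -> bytes:
        # Only the first read of a path goes to remote storage;
        # later reads are served from the in-memory copy.
        if path not in self._memory:
            self.remote_reads += 1
            self._memory[path] = self._fetch_remote(path)
        return self._memory[path]


def fake_remote_store(path: str) -> bytes:
    """Hypothetical stand-in for remote storage such as an object store."""
    return f"contents of {path}".encode()


cache = ReadThroughCache(fake_remote_store)
job_a = cache.read("/warehouse/events.csv")  # first job pays the remote I/O
job_b = cache.read("/warehouse/events.csv")  # second job is served from memory
print(cache.remote_reads)  # prints 1
```

Two jobs read the same path, but only one remote read occurs; the second read is the memory-speed sharing case from the slides, in miniature.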