
Alluxio (formerly Tachyon): Open Source Memory Speed Virtual Distributed Storage



Presented by Gene Pang, Alluxio
Introduction to Alluxio Meetup at Princeton

Published in: Software

  1. 1. Alluxio (formerly Tachyon): Open Source Memory Speed Virtual Distributed Storage • October 2016 • Gene Pang
  2. 2. About Me and Alluxio, Inc. • Gene Pang: Software Engineer, Alluxio Maintainer; Ph.D. from UC Berkeley AMPLab; previously on the Google F1 team; Twitter: @unityxx • Alluxio Team: team members from Google, Palantir, Uber, and Yahoo with years of distributed systems development experience; graduated from Stanford University, UC Berkeley, CMU, Peking University, and Tsinghua with CS master's degrees or PhDs; top 9 committers of the Alluxio open source project • Investors: Andreessen Horowitz
  3. 3. AGENDA • Alluxio Open Source Status and History • Alluxio Overview • Alluxio Use Cases • What’s Next?
  4. 4. HISTORY • Started at UC Berkeley AMPLab in summer 2012 • Originally named Tachyon • Open sourced in 2013 • Apache License 2.0 • Latest stable release: Alluxio 1.2.0 • Next release (Alluxio 1.3.0) soon! • Rebranded as Alluxio in 2016
  5. 5. OPEN SOURCE ALLUXIO • One of the fastest-growing open source projects in the big data ecosystem • Currently over 300 contributors from over 100 organizations • Welcome to join our community! • [Chart: Popular Open Source Projects’ Growth, Year 1 to Year 3, scale 0 to 350; series: Spark, Kafka, Cassandra, HDFS, Alluxio]
  6. 6. BIG DATA ECOSYSTEM YESTERDAY • BIG DATA ECOSYSTEM TODAY • BIG DATA ECOSYSTEM ISSUES • BIG DATA ECOSYSTEM WITH ALLUXIO • Enabling any application to access data from any storage system at memory speed • [Diagram labels: Native File System, Hadoop Compatible File System, Native Key-Value Interface, FUSE Compatible File System; HDFS Interface, Amazon S3 Interface, Swift Interface, GlusterFS Interface]
  7. 7. TECHNOLOGY TRENDS • Memory is getting faster, larger, and cheaper • Memory price is halving every 18 months • Disk throughput is increasing slowly • Top left chart: 20-years-of-samsung-new-management-as-manifested-by-the-latest-june-20th-galaxy-ativ-innovations/ • Top right chart: s294/ 15/notes/02-TechnologyTrends.ppt • Bottom chart: DDR performance over time, in GB/second, 2005 to 2014, for DDR2, DDR3, and DDR4
  8. 8. ALLUXIO ATTRIBUTES • Memory-Speed Virtual Distributed Storage • File System API • Software only • Scale-out architecture • Virtualizes across different storage systems, providing a unified namespace • Memory-speed access to data
  9. 9. ALLUXIO SOLUTION DEPLOYMENT • [Diagram: applications on Server A, Server B, Server C, and Server Z, each with an Alluxio layer, above Storage A, Storage B, Storage C, and Storage Z]
  10. 10. ALLUXIO BENEFITS • Unification: new workflows across any data in any storage system • Performance: high-performance data access • Flexibility: work with the compute and storage frameworks of your choice • Cost: grow compute and storage systems independently
  11. 11. USE CASE 1 – Accelerate I/O to/from Remote Storage • Compute and Storage Separation • Advantages: meet different compute and storage hardware requirements efficiently; scale compute and storage independently; store data in traditional filers/SANs and object stores cost-effectively; compute on data in existing storage via big data computational frameworks • Disadvantage: accessing data requires remote I/O
  12. 12. Use Case without Alluxio • [Diagram: Spark, Storage] • Low latency, memory throughput • High latency, network throughput
  13. 13. Use Case with Alluxio • [Diagram: Spark, Alluxio, Storage] • Keeping data in Alluxio accelerates data access
  14. 14. CASE STUDY: Baidu • Accelerate Access to Remote Storage • "The performance was amazing. With Spark SQL alone, it took 100-150 seconds to finish a query; using Alluxio, where data may hit local or remote Alluxio nodes, it took 10-15 seconds." - Shaoshan Liu, Baidu • RESULTS • Data queries are now 30x faster with Alluxio • The Alluxio cluster runs stably, providing over 50 TB of RAM space • With Alluxio, batch queries that usually took over 15 minutes became interactive queries taking less than 30 seconds • Deployment: 200+ nodes; 2+ petabytes of storage; mix of memory + HDD • [Diagram label: Baidu File System]
  15. 15. USE CASE 2 – Share Data Across Jobs at Memory Speed • Architectures requiring shared data: pipelines, where the output of one job is the input of the next job; different applications, jobs, or contexts that read the same data • Disadvantage: sharing data requires I/O
  16. 16. Use Case without Alluxio • [Diagram: Spark, MapReduce, Spark, Storage] • Network I/O • Disk I/O • I/O slows down sharing
  17. 17. Use Case with Alluxio • [Diagram: Spark, MapReduce, Spark, Alluxio, Storage] • Sharing data with Alluxio via memory
  18. 18. CASE STUDY: Barclays • Share Data Across Jobs at Memory Speed • "Thanks to Alluxio, we now have the raw data immediately available at every iteration and we can skip the costs of loading in terms of time waiting, network traffic, and RDBMS activity." - Henry Powell, Barclays • RESULTS • Barclays workflow iteration time decreased from hours to seconds • Alluxio enabled workflows that were impossible before • By keeping data only in memory, the I/O cost of loading and storing in Alluxio is now on the order of seconds • Deployment: 6 nodes; 1 TB of storage; memory only • [Diagram label: Relational Database]
  19. 19. USE CASE 3 – Transparently Manage Data Across Storage Systems • Reasons: most enterprises have multiple storage systems; new (better, faster, cheaper) storage systems arise • Disadvantage: managing data across systems can be difficult
  20. 20. Use Case Explained • [Diagram: Spark, MapReduce, Spark, Alluxio, Storage, Storage, Storage] • Flexible, simple • No application changes, new mount point
  21. 21. CASE STUDY: Qunar • Transparently Manage Data Across Different Storage Systems • "We’ve been running Alluxio in production for over 9 months, resulting in 15x speedup on average, and 300x speedup at peak service times." - Xueyan Li, Qunar • RESULTS • Alluxio’s unified namespace enables different applications and frameworks to easily interact with their data from different storage systems • Improved the performance of their system with 15x to 300x speedups • Tiered storage feature manages various storage resources including memory, SSD, and disk • Deployment: 200+ nodes; 6 billion logs (4.5 TB) daily; mix of memory + HDD
  22. 22. What’s Next?
  23. 23. Thank you! • Contact: or • Twitter: @Alluxio • Websites: and • Alluxio GitHub:
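
The "keeping data in Alluxio accelerates data access" idea from Use Cases 1 and 2 (slides 11 to 17) can be illustrated with a toy read-through cache. This is only a sketch of the general caching pattern, not Alluxio's implementation or API: the class `ReadThroughCache`, the `fetch_remote` callback, and the example path are all hypothetical names chosen for illustration.

```python
from typing import Callable, Dict


class ReadThroughCache:
    """Toy in-memory layer in front of a slow remote store (hypothetical, not Alluxio's API)."""

    def __init__(self, fetch_remote: Callable[[str], bytes]) -> None:
        self._fetch_remote = fetch_remote      # stands in for remote storage I/O
        self._memory: Dict[str, bytes] = {}    # stands in for the memory tier
        self.remote_reads = 0                  # counts how often remote I/O was needed

    def read(self, path: str) -> bytes:
        # Only the first read of a path pays the remote I/O cost.
        if path not in self._memory:
            self._memory[path] = self._fetch_remote(path)
            self.remote_reads += 1
        return self._memory[path]


def fake_remote_fetch(path: str) -> bytes:
    # Hypothetical remote store: returns bytes derived from the path.
    return f"contents of {path}".encode()


cache = ReadThroughCache(fake_remote_fetch)
first_job = cache.read("/warehouse/events.csv")   # goes to remote storage
second_job = cache.read("/warehouse/events.csv")  # served from memory
```

Two jobs reading the same path trigger a single remote read here, which is the sharing behaviour slide 17 describes at a much larger scale.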
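
The "new mount point, no application changes" idea from Use Case 3 (slides 19 to 21) can be sketched as a lookup table that maps logical paths to locations in different storage systems. This is a minimal illustration of a unified namespace under stated assumptions, not Alluxio's actual mount mechanism: `MountTable`, its methods, and the storage URIs below are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class MountTable:
    """Toy unified namespace: logical path prefixes mapped to storage URIs (hypothetical)."""

    mounts: Dict[str, str] = field(default_factory=dict)

    def mount(self, logical_prefix: str, storage_uri: str) -> None:
        # Strip trailing slashes so prefixes and targets join cleanly; keep "/" for the root.
        self.mounts[logical_prefix.rstrip("/") or "/"] = storage_uri.rstrip("/")

    def resolve(self, logical_path: str) -> str:
        # Choose the longest mounted prefix that contains the path.
        best: Optional[str] = None
        for prefix in self.mounts:
            inside = (
                prefix == "/"
                or logical_path == prefix
                or logical_path.startswith(prefix + "/")
            )
            if inside and (best is None or len(prefix) > len(best)):
                best = prefix
        if best is None:
            raise KeyError(f"no mount point covers {logical_path}")
        suffix = logical_path if best == "/" else logical_path[len(best):]
        return self.mounts[best] + suffix


table = MountTable()
table.mount("/", "hdfs://namenode/data")    # hypothetical existing storage
table.mount("/logs", "s3://bucket/logs")    # hypothetical newly added storage
```

Applications keep using logical paths such as `/logs/2016/10.log`; adding or moving a storage system only changes the table, which is the flexibility slide 20 points to.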