Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Unify Data at Memory Speed - Alluxio Overview


Published on

Haoyuan Li and Bin Fan
Alluxio NYC meetup 20180914

Published in: Technology
  • Be the first to comment

Unify Data at Memory Speed - Alluxio Overview

  1. 1. Unify Data at Memory Speed Alluxio Overview Haoyuan Li – Founder & CEO, Alluxio Inc. Bin Fan – Founding Member, Alluxio Inc.
  2. 2. Agenda Why Alluxio?1 Technology Overview2 Architecture4 Use Cases3
  3. 3. Company Overview • Founded Feb. 2015 – Haoyuan Li • PhD research project “Tachyon” at UC Berkeley AMPLab • Venture Backed • Andreesen Horowitz etc. • Open Source Business Model • Tachyon Open Sourced in Dec. 2012 • Open source v1.0 released Feb. 2016 • Enterprise Edition released Oct. 2016 • Office in San Mateo, CA • Team: Google, Palantir, Vmware, AMD, Cisco…
  4. 4. Data Ecosystem Today Many Compute Frameworks Many Storage Systems Most not co-located 9/17/2018 4Confidential © Alluxio, Inc. All Rights Reserved.
  5. 5. Migration to Cloud / Object Storage © Alluxio, Inc. All Rights Reserved. 5 • Decoupling of compute and storage • Enterprise move from Cloud provider turnkey solution to self managed data platforms on IaaS • Lacking agility at Data Storage level • Requires Storage Abstraction
  6. 6. Data Ecosystem Challenges • Complexity • Costly to integrate new compute or storage • Hard to maintain data sources plug-and-play • Complicated to create data pipelines • Efficiency • Slow and expensive to accessing remote data repeatedly • Data locality remains questionable; • Potential performance penalty and semantics mismatch
  7. 7. This is why we built Alluxio A unified data solution for the digital economy
  8. 8. VFS OS Buffer Cache Disk Device Local Application VDFS (Alluxio) Under Store Distributed Application Alluxio as a New VDFS Layer
  9. 9. Data Ecosystem with Alluxio Apps only talk to Alluxio Simple Add/Remove No App Changes Highest performance in Memory No Lock in Alluxio, a Virtual Distributed File System (VDFS) Java File API HDFS Interface S3 Interface REST API HDFS Driver S3 Driver Swift Driver NFS Driver FUSE Interface 9/17/2018 9Confidential © Alluxio, Inc. All Rights Reserved.
  10. 10. Fastest Growing Open Source Project in Data Eco-System Fastest Growing open-source project in the data ecosystem Running in world’s largest production clusters 800+ Contributors from 100+ organizations 0 100 200 300 400 500 600 700 800 0 10 20 30 40 45 50 55 NumberofContributors Open Source Contributors by Month (Github) Alluxio Spark Kafka Redis HDFS Cassandra Hive 9/17/2018 10Confidential © Alluxio, Inc. All Rights Reserved.
  11. 11. Technology Overview A unified data solution for the digital economy
  12. 12. Alluxio Innovations Unified Namespace Bring all files into a single interface Interact with data using any API Accelerate slow data transparently API Translation Intelligent Cache 9/17/2018 12Confidential © Alluxio, Inc. All Rights Reserved.
  13. 13. Alluxio Innovation: Unified Namespace Enables effective data management across different Under Stores Uses Mounting with Transparent Naming 9/17/2018 13Confidential © Alluxio, Inc. All Rights Reserved.
  14. 14. Alluxio Innovation: Server-side API Translation Convert from Client-side Interface to native Storage Interface HDFS Interface HDFS Interface S3A Interface Swift Interface Google Cloud Interface 9/17/2018 15Confidential © Alluxio, Inc. All Rights Reserved.
  15. 15. Alluxio Innovation: Intelligent Cache Local performance from remote data using multi-tier storage RAM SSD HDD Hot Warm Cold Read & Write Buffering Transparent to App Policies for pinning, promotion/demotion, TTL 9/17/2018 16Confidential © Alluxio, Inc. All Rights Reserved.
  16. 16. Provide a Common (HDFS) Interface 17 • Alluxio provides an HDFS compatible interface • Just change hdfs://foo/bar to alluxio://foo/bar • ALLUXIO-3287: aims to allow URIs unchanged • Native Alluxio Java FS, FUSE interface also available. • Choice of Under Stores: independent and transparent to Apps Compute Zone Standalone or managed with Mesos or Yarn Storage in Different Availability Zone Either on-prem or cloud TensorflowPrestoSpark HDFS API FUSE API
  17. 17. Use Cases
  18. 18. 100+ Known Production Deployments AND MORE! 9/17/2018 20Confidential © Alluxio, Inc. All Rights Reserved.
  19. 19. Machine Learning Case Study – Challenge – Slow training of model for algorithmic trading in $46B data driven Hedge Fund Data access was slow, costing them $$ in compute cost and lower modeler productivity SPARK HDFS SPARK HDFS Solution – With Alluxio, data access are 10- 30X faster Impact – Increased efficiency on training of ML algorithm, lowered compute cost and increased modeler productivity, resulting in 14 day ROI of Alluxio MESOS MESOS Public Internet Public Internet 9/17/2018 21Confidential © Alluxio, Inc. All Rights Reserved.
  20. 20. Big Data Case Study – Challenge – Gain end to end view of business with large volume of data Queries were slow / not interactive, resulting in operational inefficiency Solution – ETL Data from Teradata to Alluxio Impact – Faster Time to Market – “Now we don’t have to work Sundays” Use Case: SPARK TERADATA SPARK TERADATA 9/17/2018 22Confidential © Alluxio, Inc. All Rights Reserved.
  21. 21. Big Data Case Study – Top 3 Retailer Challenge – Bottleneck in Trend Analysis of mission critical daily sales and inventory management Queries were slow / not interactive, resulting in operational inefficiency Solution – With Alluxio, data queries are 10X faster Impact – Higher operational efficiency Use case: SPARK HDFS SPARK HDFS 9/17/2018 23Confidential © Alluxio, Inc. All Rights Reserved.
  22. 22. Consumer Intelligence Use Case – Top 3 Telco Challenge – Desired a central view of consumer information in near real time for proactive support. Many HDFS, different distributions, many incompatible versions. On-prem & cloud. Integration through heavy ETL. Solution – Alluxio integrates data into central catalog for fast access to consumer interaction records. Impact – Reduced integration time Faster data speed & freshness HADOOP ML HADOOP HDFS HDFS HDFS ML ETL HDP HDFS CDH HDFS MAPR HDFS HDFS 9/17/2018 24Confidential © Alluxio, Inc. All Rights Reserved.
  23. 23. Starburst Presto + Alluxio: Fast, Scalable Analytics © 2018 Presto: Open source distributed SQL • Originally developed by Facebook • Separate storage & compute for cloud data analytics • Scale interactive SQL over petabytes of data Starburst --- the Presto company • Leading Presto contributor (3 years in community) • Enterprise-grade Presto – production supported • PrestoCare - fully managed service • Best Presto in cloud: Starburst advantage • Full security model: Ranger, Sentry, etc • BI tools via enterprise ODBC & JDBC drivers • Cost-Based Optimizer for best performance Twitter: @starburstdata Blog: ... Caching Storage Compute (SQL Query)
  24. 24. Kyligence + Alluxio • Leverage Alluxio as the cache layer over S3/ADLS for Kyligence Cloud • Cache hot data in memory/SSD to gain high speed & throughput • Transparent to applications
  25. 25. HPC/Deep Learning Partnership – Alluxio maximizes GPU investment: • Self-serve data access for data scientists • Rapid integration of new data sources • Improved memory management & performance 9/17/2018 28Confidential © Alluxio, Inc. All Rights Reserved.
  26. 26. Architecture A Scalable Distributed File Systems
  27. 27. Architecture
  28. 28. Read Data not Cached in Alluxio + Caching 31 RAM / SSD / HDD Application Alluxio Client Alluxio WorkerUnder Store
  29. 29. Read Cached Data in Alluxio Alluxio Worker RAM / SSD / HDD Application Alluxio Client
  30. 30. Write data only to Alluxio Alluxio Worker RAM / SSD / HDD Application Alluxio Client
  31. 31. Write to Alluxio and Under Store Synchronously RAM / SSD / HDD Application Alluxio Client Alluxio Worker Under Store
  32. 32. What’s New
  33. 33. Alluxio FUSE Alluxio’s FUSE Interface makes all enterprise data available locally SUPPORTS • HDFS • NFS • OpenStack • Ceph • Amazon S3 • Azure • Google Cloud IT OPS FRIENDLY • Storage mounted into Alluxio by central IT • Security in Alluxio mirrors source data • Authentication through LDAP/AD • Wireline encryption HDFS #1 Obj Store NFS HDFS #2 9/17/2018 36Confidential © Alluxio, Inc. All Rights Reserved.
  34. 34. Deep Learning Input Pipeline Deep Learning training involves three stages of utilizing different resources: • Data reads (I/O): e.g. choose and read image files from source. • Data Preprocessing (CPU): e.g. decode image records into images, preprocess, and organize into mini-batches. • Modeling training (GPU): Calculate and update the parameters in the multiple convolutional layers
  35. 35. Alluxio overcomes I/O bottleneck
  36. 36. Thank you