Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

The Architecture of Decoupling Compute and Storage with Alluxio

355 views

Published on

Strata Singapore 2017
Haoyuan Li, Alluxio, Inc.

Published in: Technology
  • Be the first to comment

The Architecture of Decoupling Compute and Storage with Alluxio

  1. 1. The Architecture of Decoupling Compute and Storage with Alluxio Haoyuan Li, Alluxio Inc. December 2017 @ Strata Singapore
  2. 2. Confidential © Alluxio, Inc.All Rights Reserved. 2 About Me •  Haoyuan (H.Y.) Li •  Founder and CEO at Alluxio, Inc. •  Created Alluxio (as name Tachyon) at UC Berkeley AMPLab, as Ph.D. candidate. •  Google, Cornell University, Peking University
  3. 3. Confidential © Alluxio, Inc.All Rights Reserved. 3 Decoupling Compute From Storage Benefits – Different compute and storage hardware requirements Scale compute and storage resources independently Traditional filers/SANs and cost effective object stores (Amazon S3, Google GCS, Microsoft Azure Blob Store) are inherently decoupled Fast-evolving big data eco-system Challenges – Accessing data requires remote I/O
  4. 4. Confidential © Alluxio, Inc.All Rights Reserved. 4 Remote I/O Spark Amazon S3 Every data operation requires data transfer, sometimes over the WAN High latency, network throughput
  5. 5. Confidential © Alluxio, Inc.All Rights Reserved. 5 Data Operations with Alluxio Spark Amazon S3 Alluxio Low latency, memory throughput High latency, network throughput Keeping data in Alluxio accelerates data access
  6. 6. Confidential © Alluxio, Inc.All Rights Reserved. 6 Data Ecosystem Develops • One Compute Framework • Single Storage System • Co-locatedETL ETL ETL
  7. 7. Confidential © Alluxio, Inc.All Rights Reserved. 7 Data Ecosystem Explodes •  Many Compute Frameworks •  Many Storage Systems •  Most not co-located
  8. 8. Confidential © Alluxio, Inc.All Rights Reserved. 8 Data Ecosystem Issues •  Each app manages multiple data sources •  Data source changes require global updates •  Storage optimizations requires app change •  Poor performance due to lack of locality
  9. 9. Confidential © Alluxio, Inc.All Rights Reserved. 9 Data Ecosystem with Alluxio •  Apps only talk to Alluxio •  Simple Add/Remove •  No App Changes •  Highest performance in Memory •  No Lock in Native File System Hadoop Compatible File System Amazon S3 Interface REST Web Service HDFS Interface Amazon S3 Interface Swift Interface NFS Interface
  10. 10. Confidential © Alluxio, Inc.All Rights Reserved. 10 Use Case – Multi -Cluster Federated Query
  11. 11. Confidential © Alluxio, Inc.All Rights Reserved. 11 11
  12. 12. Confidential © Alluxio, Inc.All Rights Reserved. 12 History Started at UC Berkeley AMPLab In Summer 2012 Originally named as Tachyon Rebranded to Alluxio in early 2016 Open Sourced in 2013 Apache License 2.0 Latest Stable Release:Alluxio 1.6.1 (Nov 2017)
  13. 13. Confidential © Alluxio, Inc.All Rights Reserved. 13 Fastest Growing Big Data Open Source Project Fastest Growing open-source project in the big data ecosystem Running world’s largest production clusters 600+ Contributors from 100+ organizations 0 100 200 300 400 500 600 700 800 0 10 20 30 40 45 50 55 NumberofContributors Open Source Contributors by Month (Github) Alluxio Spark Kafka Redis HDFS Cassandra Hive
  14. 14. Confidential © Alluxio, Inc.All Rights Reserved. 14 Non-persistent data-storage software. What’s Alluxio Memory-Speed Virtual Distributed Storage Scale out architecture. Virtualized across different storage types under a unified namespace. Memory-speed access to data.
  15. 15. Confidential © Alluxio, Inc.All Rights Reserved. 15 Alluxio Innovation: Unified Namespace Enables effective data management across different Under Stores Uses Mounting with Transparent Naming
  16. 16. Confidential © Alluxio, Inc.All Rights Reserved. 16 Alluxio Innovation: Unified Namespace Create a catalog of available data sources for Data Scientists /finance/customer-transactions/ /finance/vendor-transactions/ /operations/device-logs/ /operations/phone-call-recordings/ /operations/check-images/ /research/us-economic-data/ /research/intl-economic-data/ /marketing/advertising-dataset/ /marketing/marketing-funnel-dataset/ alluxio://
  17. 17. Confidential © Alluxio, Inc.All Rights Reserved. 17 Alluxio Innovation: Server-side API Translation Convert from Client-side Interface to native Storage Interface HDFS Interface HDFS Interface S3A Interface Swift Interface Google Cloud Interface
  18. 18. Confidential © Alluxio, Inc.All Rights Reserved. 18 Alluxio Innovation: Intelligent Cache Local performance from remote data using multi-tier storage RAM SSD HDD Hot Warm Cold Read &Write Buffering Transparent to App Policies for pinning, promotion/ demotion,TTL
  19. 19. Confidential © Alluxio, Inc.All Rights Reserved. 19 Alluxio Benefits Unification New workflows across any data in any storage system Orders of magnitude improvement in run time Choice in compute and storage – grow each independently, buy only what is needed Performance Flexibility
  20. 20. Confidential © Alluxio, Inc.All Rights Reserved. 20 100+ known production deployments AND MORE!
  21. 21. Confidential © Alluxio, Inc.All Rights Reserved. 21 Big Data Case Study – Challenge – Gain end to end view of business with large volume of data Queries were slow / not interactive, resulting in operational inefficiency SPARK TERADATA SPARK TERADATA Solution – ETL Data from Teradata to Alluxio Impact – Faster Time to Market – “Now we don’t have to work Sundays” Use Case: http://bit.ly/2oMx95W
  22. 22. Confidential © Alluxio, Inc.All Rights Reserved. 22 Big Data Case Study – Top 3 Retailer Challenge – Bottleneck in Trend Analysis of mission critical daily sales and inventory management Queries were slow / not interactive, resulting in operational inefficiency SPARK HDFS SPARK HDFS Solution – With Alluxio, data queries are 10X faster Impact – Higher operational efficiency Use case: http://bit.ly/2ook8Nh
  23. 23. Confidential © Alluxio, Inc.All Rights Reserved. 23 Consumer Intelligence Use Case – Top 3 Telco Challenge – Desired a central view of consumer information in near real time for proactive support. Many HDFS, different distributions, many incompatible versions. On- prem & cloud. Integration through heavy ETL. HADOOP Solution – Alluxio integrates data into central catalog for fast access to consumer interaction records. Impact – Reduced integration time Faster data speed & freshness ML HADOOP HDFS HDFS HDFS ML ETL HDP HDFS CDH HDFS MAPR HDFS HDFS
  24. 24. Confidential © Alluxio, Inc.All Rights Reserved. 24 Big Data Case Study – Challenge – Gain end to end view of business with large volume of data Queries were slow / not interactive, resulting in operational inefficiency SPARK Baidu File System SPARK Baidu File System Solution – With Alluxio, data queries are 30X faster Impact – Higher operational efficiency http://bit.ly/2pDHS3O
  25. 25. Confidential © Alluxio, Inc.All Rights Reserved. 25 Big Data Case Study – Challenge – Gain end to end view of business with data distributed across geographies Data ETL was not possible due to regulatory concerns SPARK SPARK Solution – With Alluxio, data can be accessed without storing locally Impact – Higher operational efficiency and solved regulatory concerns HDFS HDFS HDFS HDFS HDFS HDFS
  26. 26. Confidential © Alluxio, Inc.All Rights Reserved. 26 Big Data Case Study – Challenge – Gain end to end view of business with large volume of data for $5B Travel Site Queries were slow / not interactive, resulting in operational inefficiency SPARK HDFS Solution – With Alluxio, 300x improvement in performance Impact – Increased revenue from immediate response to user behavior Use case: http://bit.ly/2pDJdrq CEPH HDFS CEPH FLINK SPARK FLINK
  27. 27. Confidential © Alluxio, Inc.All Rights Reserved. 27 Machine Learning Case Study – Challenge – Disparate Data both on Prem and Cloud. Heterogeneous types of data. Scaling of Exabyte size data. Slow due to disk based approach. SPARK HDFS SPARK MINIO Solution – Using Alluxio to prevent I/O bottlenecks Impact – Orders of magnitude higher performance than before. http://bit.ly/2p18ds3 MESOS
  28. 28. Confidential © Alluxio, Inc.All Rights Reserved. 28 Visualizing the Stack with Alluxio FAST 104 - 105 MB/s Application Remote Storage MODERATE 103 - 104 MB/s SLOW 102 - 103 MB/s Alluxio MEM Often Only When Necessary Alluxio SSD/HDD Limited
  29. 29. Confidential © Alluxio, Inc.All Rights Reserved. 29 1.6.0/1 Release HIGHLIGHTS Ecosystem Integrations S3 client Python client Performance Improvement Avoid unnecessary read when closing a file with partial caching on Improve data distribution when using DeterministicHashPolicy Usability Improvements Audit logging Third party UFS connector management Dynamically adjusting log levels Remote logging And Many More! http://www.alluxio.org/releases
  30. 30. Twi$er.com/alluxio Linkedin.com/alluxio Website www.alluxio.com E-mail info@alluxio.com @ Social Media á ™ Confidential © Alluxio, Inc.All Rights Reserved. 30 Contact: haoyuan@alluxio.com Websites: www.alluxio.com and www.alluxio.org Demo: Spark + Alluxio + S3 https://youtu.be/QVtxDpA-jis Alluxio Unified Namespace https://youtu.be/lIXpNK4VxqE

×