Alluxio @ Uber Seattle Meetup

Over the past two decades, the Big Data stack has been reshaped and has evolved quickly, with numerous innovations driven by the rise of many open source projects and communities. In this meetup, speakers from Uber, Alibaba, and Alluxio share best practices for addressing the challenges and opportunities in developing data architectures using new and emerging open source building blocks. Topics include data format (ORC) optimization, storage security (HDFS), data format (Parquet) layers, and unified data access (Alluxio) layers.

Published in: Technology

  1. 1. Unify Data at Memory Speed. Bin Fan | Founding Engineer & VP of Open Source | Alluxio | binfan@Alluxio.com
  2. 2. Agenda: (1) Why storage-independent compute? (2) Alluxio Overview (3) Demo: Spark + Alluxio + S3 (4) Real-world Use Cases
  3. 3. From mainframes to Big Data: moving from tightly integrated to loosely integrated architectures. 1970s: application, processing, data storage and hardware all-in-one, tightly coupled. 1980s: client-server architecture drives application separation; processing and data storage still tightly coupled. 2000s: data growth drives distributed MPP architectures, but processing and data storage still tightly coupled. 2010s: further data growth drives distributed file system architecture; processing and data storage co-located but loosely coupled.
  4. 4. The Big Data Ecosystem Explodes: moving from tightly integrated to loosely integrated architectures (diagram labels: STORAGE, COMPUTE)
  5. 5. Why independently scale compute and storage for data-driven applications? • Flexible compute scaling based on application demands (compute is CPU bound) • Flexible storage scaling based on data growth patterns (storage is I/O bound)
  6. 6. Why independently scale compute and storage for data-driven applications? • Flexibility of using hardware / instances that match the workload (CPU / GPU + IO + Memory) • Leverage cheaper and newer storage like object stores (S3) for big data / AI workloads • Orchestrate & automate compute for greater operational efficiency • Protect & control your data on premises and leverage public cloud for compute
  7. 7. The challenges of independent scaling for data-driven workloads: Data Locality, Data Accessibility, Data Abstraction. • Data is no longer local to compute, so workload processing time increases, particularly in hybrid cloud deployments • Data is in multiple storage systems in multiple locations, which becomes highly complex when all compute frameworks talk to all storage systems • Data can still only be accessed using the specific storage system APIs
  8. 8. Virtual Unified File System. Interfaces: Java File API, HDFS Interface, S3 Interface, REST API, FUSE Interface. Drivers: HDFS Driver, Swift Driver, S3 Driver, NFS Driver
  9. 9. Alluxio Key Innovations
  10. 10. Key Innovations of the Virtual Unified File System. • Unified Namespace: bring all files into a single interface • API Translation: interact with data using any API • Intelligent Multi-tiering: accelerate & tier data transparently
  11. 11. Data Locality via Intelligent Multi-tiering: local performance from remote data using multi-tier storage. Tiers: Hot (RAM), Warm (SSD), Cold (HDD). • Read & write buffering • Transparent to the app • Policies for pinning, promotion/demotion, TTL
  12. 12. Data Accessibility via Server-side API Translation: convert from client-side interface to native storage interface. Client-side interfaces: Java File API, HDFS Interface, S3 Interface, REST API, FUSE Interface. Storage drivers: HDFS Driver, Swift Driver, S3 Driver, NFS Driver
  13. 13. Data Abstraction via Unified Namespace: enables effective data management across different Under Stores; uses mounting with transparent naming
  14. 14. Unified Namespace: Global Data Accessibility. Transparent access to under storage makes all enterprise data available locally. SUPPORTS • HDFS • NFS • OpenStack • Ceph • Amazon S3 • Azure • Google Cloud. IT OPS FRIENDLY • Storage mounted into Alluxio by central IT • Security in Alluxio mirrors source data • Authentication through LDAP/AD • Wireline encryption. (Diagram labels: HDFS #1, Object Store, NFS, HDFS #2)
  15. 15. Demo: Spark + Alluxio + S3
  16. 16. Deployment Approaches. (1) Co-locate Alluxio Workers with Spark for optimal I/O performance (diagram labels: Spark, Alluxio, Storage; Any Cloud; Same instance / container). (2) Deploy Alluxio as a standalone cluster between Spark and Storage (diagram labels: Spark, Presto, Alluxio, Storage; Any Cloud; Same data center / region)
  17. 17. Alluxio Reference Architecture (diagram labels: Alluxio Master; Zookeeper / RAFT; Standby Master; Application with Alluxio Client, shown twice; Alluxio Worker with RAM / SSD / HDD, shown twice; WAN; Under Store 1; Under Store 2; Object Store)
  18. 18. Data Flow In Alluxio. Read Scenarios • Data not in Alluxio (i.e. first time, or no cache) • Data on same node as client • Data on different node from client. Write Scenarios • Write only to Alluxio • Write only to Under Store • Write synchronously to Alluxio and Under Store • Write to Alluxio and asynchronously write to Under Store • Write to Alluxio and replicate to N other workers • Write to Alluxio and asynchronously write to multiple Under Stores. Applications have great flexibility to read / write data with many options
  19. 19. Real-world Use Cases
  20. 20. Alluxio Production Deployments: truly independent scaling of the data stack • Huya: 1300s (100% Hadoop nodes) • Sogou: 1000s (60% Hadoop nodes) • Momo: 850 nodes • JD: 600 nodes • Tencent: 400 nodes • Huawei Cloud: 150 nodes • China Unicom: 100 nodes • …
  21. 21. Popular Technical Use Cases: Virtual Data Lake; Machine Learning Productivity; Self-service data across hybrid cloud • Accelerate batch, micro-batch & streaming jobs • Slowly transition to lower cost object stores • Run in hybrid cloud environment with compute in the cloud • Accelerate ML jobs running on object stores or file systems • Provide consistent performance to data scientists • Provide unified interface to access all data • Accelerate & tier data transparently across storage tiers • Co-locate remote data with compute for performance
  22. 22. Elastic Model Training. Challenge: algorithmic trading in a $46B data-driven hedge fund; model training in the cloud for bursty workloads; data access was slow, costing them $$ in compute cost and lower modeler productivity. Solution: with Alluxio, data access is 10-30X faster. Impact: increased efficiency in training ML algorithms, lower compute cost and higher modeler productivity, resulting in a 14-day ROI of Alluxio. Talk: https://www.meetup.com/Two-Sigma-Open-Source-Meetup/events/259368502 (Mar 25, NYC). (Diagram labels: SPARK, HDFS, Public Cloud)
  23. 23. Analytics Use Case: Top Retailer. Challenge: bottleneck in trend analysis of mission-critical daily sales and inventory management; queries were slow / not interactive, resulting in operational inefficiency. Solution: with Alluxio, data queries are 10X faster. Impact: higher operational efficiency. Use case: http://bit.ly/2ook8Nh (Diagram labels: SPARK, HDFS)
  24. 24. Incredible Open Source Momentum with growing community • 900+ contributors & growing • 3900+ Git Stars • Apache 2.0 Licensed • Hundreds of thousands of downloads • https://www.alluxio.org/slack
  25. 25. Thank You. Join the Alluxio Community: www.alluxio.org | www.alluxio.com | Twitter: @Alluxio
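
Slide 11 describes multi-tier storage (RAM, SSD, HDD) with promotion, demotion and pinning. The Python sketch below is a toy model of that idea and not Alluxio's actual eviction logic: the class `TieredCache`, its method names, the block-count capacities and the LRU demotion rule are all assumptions made for illustration. Pinning here means "never demote this block from its current tier", and TTL is not modelled.

```python
from collections import OrderedDict


class TieredCache:
    """Toy multi-tier block cache (hypothetical, not Alluxio code)."""

    def __init__(self, capacities):
        # capacities: mapping of tier name -> max number of blocks,
        # fastest tier first, e.g. {"RAM": 2, "SSD": 2, "HDD": 4}.
        self.names = list(capacities)
        self.caps = [capacities[name] for name in self.names]
        # One OrderedDict per tier; its order doubles as LRU order.
        self.tiers = [OrderedDict() for _ in self.names]
        self.pinned = set()

    def pin(self, block_id):
        """Mark a block so that it is never chosen for demotion."""
        self.pinned.add(block_id)

    def tier_of(self, block_id):
        """Return the name of the tier holding the block, or None."""
        for name, tier in zip(self.names, self.tiers):
            if block_id in tier:
                return name
        return None

    def _insert(self, level, block_id, data):
        if level >= len(self.tiers):
            return  # fell off the slowest tier: dropped from the cache
        tier = self.tiers[level]
        tier[block_id] = data
        tier.move_to_end(block_id)
        while len(tier) > self.caps[level]:
            # Demote the least recently used block that is not pinned.
            victim = next((b for b in tier if b not in self.pinned), None)
            if victim is None:
                break  # everything is pinned; this toy model tolerates overflow
            self._insert(level + 1, victim, tier.pop(victim))

    def read(self, block_id, load_from_under_store):
        """Return the block, promoting it to the fastest tier."""
        for tier in self.tiers:
            if block_id in tier:
                data = tier.pop(block_id)
                break
        else:
            # Cache miss: fetch from the (simulated) under store.
            data = load_from_under_store(block_id)
        self._insert(0, block_id, data)
        return data
```

Reading three blocks through a cache whose RAM tier holds two pushes the oldest block down to SSD; reading that block again brings it back to RAM without another under-store fetch.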

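
Slides 13 and 14 describe a unified namespace in which several under stores are mounted into one path tree. A minimal sketch of how such a mount table could resolve a path by longest-prefix match follows. The class `MountTable`, the host `nn1:8020` and the bucket name are hypothetical, and the code illustrates the concept only; it is not how Alluxio implements mounting.

```python
import posixpath


class MountTable:
    """Toy mount table mapping namespace paths to under-store URIs."""

    def __init__(self):
        self._mounts = {}  # mount point -> under-store base URI

    def mount(self, mount_point, ufs_uri):
        # Normalise so "/s3/" and "/s3" are the same mount point.
        key = mount_point.rstrip("/") or "/"
        self._mounts[key] = ufs_uri.rstrip("/")

    def resolve(self, path):
        """Translate a namespace path into the under-store URI behind it."""
        path = posixpath.normpath(path)
        best = None
        for mount_point in self._mounts:
            prefix = mount_point if mount_point.endswith("/") else mount_point + "/"
            if path == mount_point or path.startswith(prefix):
                # Prefer the deepest (longest) matching mount point.
                if best is None or len(mount_point) > len(best):
                    best = mount_point
        if best is None:
            raise KeyError(f"no mount covers {path}")
        rest = path[len(best):].lstrip("/")
        base = self._mounts[best]
        return f"{base}/{rest}" if rest else base


# Example with made-up endpoints: one HDFS root and one S3 bucket.
table = MountTable()
table.mount("/", "hdfs://nn1:8020/alluxio")
table.mount("/s3", "s3://example-bucket/data")
print(table.resolve("/s3/logs/a.txt"))
print(table.resolve("/tmp/x"))
```

An application only ever sees paths such as `/s3/logs/a.txt`; which storage system sits behind each subtree is decided by whoever maintains the mount table, which matches the "storage mounted by central IT" point on slide 14.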
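
The first four write scenarios on slide 18, and the cold-read versus cached-read distinction, can be sketched with two in-memory dictionaries. This is a simplified illustration under stated assumptions: the enum member names echo Alluxio's write types, but `ToyFileSystem`, `flush` and the returned status strings are invented here, and replication to other workers and node locality are not modelled.

```python
from enum import Enum


class WriteType(Enum):
    MUST_CACHE = "write only to Alluxio"
    THROUGH = "write only to the under store"
    CACHE_THROUGH = "write synchronously to both"
    ASYNC_THROUGH = "write to Alluxio, persist to the under store later"


class ToyFileSystem:
    """Hypothetical stand-in for a cache layer in front of an under store."""

    def __init__(self):
        self.cache = {}        # stands in for Alluxio worker storage
        self.under_store = {}  # stands in for S3, HDFS, ...
        self.pending = []      # paths waiting for asynchronous persistence

    def write(self, path, data, write_type):
        if write_type is not WriteType.THROUGH:
            self.cache[path] = data
        if write_type in (WriteType.THROUGH, WriteType.CACHE_THROUGH):
            self.under_store[path] = data
        if write_type is WriteType.ASYNC_THROUGH:
            self.pending.append(path)

    def flush(self):
        """Persist queued paths; return how many were written."""
        count = 0
        while self.pending:
            path = self.pending.pop(0)
            if path in self.cache:
                self.under_store[path] = self.cache[path]
                count += 1
        return count

    def read(self, path):
        """Return (data, source); a miss loads from the under store and caches."""
        if path in self.cache:
            return self.cache[path], "cache hit"
        data = self.under_store[path]
        self.cache[path] = data
        return data, "loaded from under store"
```

The trade-off the slide points at is visible in the model: `MUST_CACHE` is fastest but leaves nothing durable in the under store, `CACHE_THROUGH` pays the under-store latency on every write, and `ASYNC_THROUGH` defers that cost until `flush` runs.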