DATA ORCHESTRATION SUMMIT
2019
Alluxio 2 Community Update
Calvin Jia & Bin Fan | Founding Engineers | Alluxio
Alluxio 2 Series Directions
§ Seamless Operations
§ Advanced Data Management
§ Hyper-scale Architecture
Available Today
Alluxio
Structured Data Management
Alluxio 2.1.0 Developer Preview
Seamless Operations
Cloud Native on AWS: AMI, CFT, EMR
[Diagram: Presto / Hive cluster; Alluxio metadata & data cache; compute-driven continuous sync]
§ Alluxio AMI in the Marketplace
§ Alluxio Cloud Formation Template for cluster deployment
§ AWS EMR with Alluxio via bootstrap script
Enable one-click deployment of Alluxio on AWS
Cloud Native on Google Cloud: Dataproc
[Diagram: Presto / Hive; Alluxio metadata & data cache; compute-driven continuous sync]
§ Google Dataproc with Alluxio (init action integration available)
Enable one-click deployment of Alluxio on Google Cloud
Native Deployment with Kubernetes
[Diagram: Kubernetes cluster of host machines running the Alluxio Master (with journal volume), Alluxio Workers, and applications]
Self-Managed Quorum
Available in 2.0.0
[Diagram: before, the Alluxio Master and Standby Master depend on distributed storage (e.g., HDFS) and a distributed quorum (ZooKeeper); after, the Alluxio Master and two Standby Masters form their own quorum using Raft]
No major external dependencies
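The fault tolerance of a majority-based quorum such as Raft follows from simple arithmetic. The sketch below is illustrative only; the function names are hypothetical and are not Alluxio APIs.

```python
def majority(masters: int) -> int:
    """Smallest number of masters that forms a majority of the quorum."""
    return masters // 2 + 1


def tolerated_failures(masters: int) -> int:
    """Masters that can fail while a majority is still reachable."""
    return masters - majority(masters)


# One primary plus two standbys, as drawn on the slide, gives three masters.
for n in (1, 3, 5):
    print(f"{n} masters: majority={majority(n)}, tolerated failures={tolerated_failures(n)}")
```

With three masters, the quorum stays available after losing any single master.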
Hyper-scale Architecture
§ Challenge:
• Metadata for one file takes 1 KB of on-heap storage
• 1 billion files would take 1 TB of heap space, and GC becomes a big problem
§ Solution:
• Add new tier with embedded RocksDB to store inode tree
• Keep an in-memory cache of frequently used inodes
Scaling to 1 Billion+ Files
Scale to one billion files and beyond, with performance comparable to the previous on-heap implementation
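The heap estimate and the "cache in front of an on-disk store" idea can be sketched as follows. This is a minimal illustration under stated assumptions: a plain dict stands in for the embedded RocksDB tier, the eviction policy is LRU (the slides do not say which policy Alluxio uses), and the class name is hypothetical.

```python
from collections import OrderedDict

BYTES_PER_INODE = 1_000        # 1 KB per file, as stated on the slide
FILES = 1_000_000_000          # one billion files
print(BYTES_PER_INODE * FILES / 1e12, "TB")  # prints "1.0 TB"


class CachedInodeStore:
    """LRU cache of hot inodes in front of a slower backing store.

    `backing` is any mapping from inode ID to metadata; here it is a
    stand-in for the on-disk tier, not a real RocksDB handle.
    """

    def __init__(self, backing, capacity):
        self.backing = backing
        self.capacity = capacity
        self.cache = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, inode_id):
        if inode_id in self.cache:
            # Cache hit: mark the inode as most recently used.
            self.cache.move_to_end(inode_id)
            self.hits += 1
            return self.cache[inode_id]
        # Cache miss: read from the backing store and cache the result.
        self.misses += 1
        value = self.backing[inode_id]
        self.cache[inode_id] = value
        if len(self.cache) > self.capacity:
            # Evict the least recently used inode.
            self.cache.popitem(last=False)
        return value
```

Only the working set of inodes occupies heap; the full inode tree lives in the backing store.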
Scaling to 1 Billion+ Files
Available in 2.0.0
Alluxio Master
Local Disk
RocksDB (Embedded)
● Inode Table
● Edge Table
● Block Table
● Block to Worker Table
● Worker to Block Table
On Heap
● Inode Cache
● Mount Table
● Locks
Inode ID    Metadata (Binary)
12392       010101101101
12393       110110110100
…           …

Edge (ID, name)    Inode ID
12392,foo          12393
…                  …
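The inode and edge tables above suggest how a path can be resolved one component at a time. The sketch below uses in-memory dicts in place of the RocksDB tables. The root inode ID, the `data` directory, and the metadata bytes are hypothetical; only the `(12392, foo) → 12393` edge comes from the slide.

```python
ROOT_ID = 0  # hypothetical root inode ID

# Edge table: (parent inode ID, child name) -> child inode ID
edges = {
    (ROOT_ID, "data"): 12392,   # hypothetical parent directory
    (12392, "foo"): 12393,      # row shown on the slide
}

# Inode table: inode ID -> serialized metadata (placeholder bytes)
inodes = {
    12392: b"\x01",
    12393: b"\x02",
}


def resolve(path):
    """Return the inode ID for `path`, or None if a component is missing."""
    current = ROOT_ID
    for name in path.strip("/").split("/"):
        if not name:
            continue  # handles "/" and repeated slashes
        current = edges.get((current, name))
        if current is None:
            return None
    return current


print(resolve("/data/foo"))
```

Each path component costs one edge lookup, followed by one inode lookup for the final metadata.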
Efficient cluster communication with gRPC
Available in 2.0.0
[Diagram: before, the Alluxio Client, Master, and Workers communicate over Thrift (metadata) and Netty (IO); after, they communicate over gRPC (metadata + IO)]
Advanced Data Management
Replicated Asynchronous Writes
Available in 2.0.0
[Diagram: the application's Alluxio Client writes data at network speed to Alluxio Workers (RAM / SSD / HDD); the data is then written to the under store]
Fast and reliable writes to Alluxio, with data persisted in background
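A rough sketch of the write path described above: the write returns once the data is on several workers, and a background thread persists it to the under store. All names are hypothetical, and dicts stand in for workers and the under store; this is not Alluxio's implementation.

```python
import queue
import threading


class AsyncWriter:
    """Write to replicas synchronously, persist to the under store in background."""

    def __init__(self, workers, under_store):
        self.workers = workers            # list of dicts standing in for workers
        self.under_store = under_store    # dict standing in for the under store
        self._pending = queue.Queue()
        thread = threading.Thread(target=self._persist_loop, daemon=True)
        thread.start()

    def write(self, path, data, replicas=2):
        # The caller only waits for the replicated write to the workers.
        for worker in self.workers[:replicas]:
            worker[path] = data
        self._pending.put(path)

    def _persist_loop(self):
        while True:
            path = self._pending.get()
            # Copy from a worker replica to the under store.
            self.under_store[path] = self.workers[0][path]
            self._pending.task_done()

    def flush(self):
        """Block until every queued path has been persisted."""
        self._pending.join()
```

Replication covers the window in which the data exists only in Alluxio, before the background persist completes.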
Policy Driven Data Management
Available in 2.0.0
[Diagram: Application, Alluxio Master, Alluxio Policy Engine]
Example Policy: Move files older than 90 days from HDFS to S3
Apps access the same path regardless of where the actual data is stored
Decouple the logical file system namespace from physical storage systems
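The example policy can be sketched as a rule evaluated over file metadata. The data model below is hypothetical, and measuring age by last-modified time is an assumption; the slides do not describe the policy engine's actual interface.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class FileEntry:
    path: str            # logical path seen by applications
    store: str           # physical store currently holding the data
    modified: datetime   # assumed basis for "older than"


def apply_policy(files, now, max_age=timedelta(days=90), src="hdfs", dst="s3"):
    """Move files older than `max_age` from `src` to `dst`.

    Only the physical location changes; the logical path stays the same.
    Returns the logical paths of the files that were moved.
    """
    moved = []
    for entry in files:
        if entry.store == src and now - entry.modified > max_age:
            entry.store = dst
            moved.append(entry.path)
    return moved
```

Because `path` never changes, an application reading a moved file needs no reconfiguration.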
Structured Data Management
§ New Alluxio Catalog Service
• Provides the Abstraction of Structured Data
• Attach a Hive Metastore Like Mounting a File System
• Understand and Serve Schema of Files or Objects
§ New Alluxio Data Transformation Service
• Transform CSV → Parquet
• Compact many files → fewer files
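The compaction idea can be illustrated with the standard library alone. This sketch merges CSV files that share a header into a single file; it is not the Alluxio transformation service, and the function name is hypothetical. Converting CSV to Parquet would need a Parquet library and is not shown.

```python
import csv


def compact_csv(inputs, output):
    """Concatenate CSV files with identical headers into one file.

    Returns the number of data rows written (the header is not counted).
    """
    rows_written = 0
    header = None
    with open(output, "w", newline="") as out:
        writer = csv.writer(out)
        for path in inputs:
            with open(path, newline="") as src:
                reader = csv.reader(src)
                file_header = next(reader, None)
                if file_header is None:
                    continue  # skip empty files
                if header is None:
                    header = file_header
                    writer.writerow(header)
                elif file_header != header:
                    raise ValueError(f"header mismatch in {path}")
                for row in reader:
                    writer.writerow(row)
                    rows_written += 1
    return rows_written
```

Fewer, larger files mean fewer metadata entries and fewer open/close calls for the query engine.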
Deeper Integration with Presto
Presto Alluxio Connector Based off the Hive Connector
Now available as Developer Preview
Community Update
Fast Growing Developer Community
Started as Haoyuan Li’s PhD project “Tachyon”
Open sourced under Apache 2.0 License

Version    Date       Contributors
v0.1       Dec ’12    1
v0.2       Apr ’13    3
v0.6       Mar ’15    70
v1.0       Feb ’16    210
v1.8       Jul ’18    750
v2.1       Nov ’19    1080
§ Deeper Integration with Presto
• Collaboration w/ Presto maintainers
§ Kubernetes Helm Chart
• Collaboration w/
§ Improved Alluxio POSIX interface and distributed operations
• Contributed by
§ Kubernetes Container Storage Interface (CSI) Implementation
• Contributed by individual contributor Mingfang
Great Community Collaborations
Available in 2.1.0
Deployed in Hundreds of Companies
• Consumer
• Travel & Transportation
• Telco & Media
• Technology
• Financial Services
• Retail & Entertainment
• Data & Analytics Services
https://www.alluxio.io/powered-by-alluxio/
Deployed at Scale in Different Environments
On-Prem
• Huya: 1300 nodes
• Sogou: 1000 nodes
• JD.com: 1000 nodes
• Momo: 850 nodes
• Tencent: 400 nodes
Single Cloud
• Bazaarvoice: AWS
• Ryte: AWS
• Myntra: AWS
• Cuelogic: AWS
• Walmart Labs: GCP
Hybrid Cloud
• DBS Bank
• ING Bank
• Comcast
• Ligadata
• Qiniu Cloud
Community Activities Around the World
New York, March 2019
Seattle, March 2019
Singapore, April 2019
Bay Area, Jun 2019
Beijing, Jun 2019
Austin, Aug 2019
Join Our User Community
Join Slack channel
alluxio.io/slack
WeChat Public Account
Join meetup groups near you
alluxio-open-source-community/
