Building a Distributed File System for the Cloud-Native Era
Bin Fan, 05-19-2022 @ Big Data Bellevue
About Me
Bin Fan (https://www.linkedin.com/in/bin-fan/)
● Founding Engineer, VP Open Source @ Alluxio
● Alluxio PMC Co-Chair, Presto TSC/committer
● Email: binfan@alluxio.com
● PhD in CS @ Carnegie Mellon University
Alluxio Overview
● Originally a research project (Tachyon) in the UC Berkeley AMPLab, led by then-PhD student Haoyuan Li (Alluxio's founder and CEO)
● Backed by top VCs (e.g., Andreessen Horowitz) with $70M raised in total; the Series C ($50M) was announced in 2021
● Deployed in production at large scale at Facebook, Uber, Microsoft, Tencent, TikTok, and others
● More than 1,100 contributors on GitHub; in 2021, more than 40% of commits were contributed by community users
● Ranked the 9th most critical Java-based open-source project on GitHub by Google/OpenSSF [1]
[1] Google Comes Up With A Metric For Gauging Critical Open-Source Projects
7 Years Ago
[Screenshot of a Tachyon talk at an AMPLab event]
Alluxio (Tachyon) in 2015
Alluxio (Tachyon) in 2015: Enable Data Sharing Among Spark Jobs
[Diagram: Spark Task 1 and Spark Task 2 share an in-memory RDD through Tachyon, with blocks 1-4 persisted on disk in HDFS / Amazon S3]
Alluxio (Tachyon) in 2015: Fast Checkpoint for Job Reliability
[Diagram: a Spark task keeps its block manager, execution engine, and storage engine in the same process, while Tachyon holds blocks in memory and checkpoints them to HDFS / Amazon S3]
Today
Companies Using Alluxio
[Logo wall of adopters across internet, public cloud providers, technology, financial services, telco & media, e-commerce, and other sectors]
The Evolution from Hadoop to the Cloud-Native Era
Topology
● On-prem Hadoop → cloud-native, multi-cloud or hybrid-cloud, multi-datacenter
Computation
● MR/Spark → Spark, Presto, Hive, TensorFlow, PyTorch, …
● More mature frameworks (e.g., less frequent OOMs)
Data access pattern
● Sequential reads (e.g., scanning) over unstructured files → ad-hoc reads into structured/columnar data
● Hundreds to thousands of big files → millions of small files
The Evolution from Hadoop to the Cloud-Native Era (continued)
Data storage
● On-prem, colocated HDFS → S3 and other object stores (possibly across regions such as us-east and us-west), with legacy on-prem HDFS still in service
Resource/job orchestration
● YARN → Kubernetes
○ Data locality is no longer a first-class concern
Analytics & AI in the Hybrid and Multi-Cloud Era
Unified Namespace
Provides users with a single, consistent logical filesystem.
• Extension: a single logical Alluxio path can be backed by multiple storage systems (see the sketch below)
○ Example custom data policy: migrate data older than 7 days from HDFS to S3
[Diagram: Reports and Sales paths in the Alluxio namespace mounted from underlying storage such as hdfs://host:port/directory/]
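As a rough illustration of the unified namespace (a sketch of my own, not code from the talk, assuming the Alluxio 2.x Java client API), the snippet below mounts an S3 bucket into the Alluxio namespace next to data already mounted from HDFS and lists everything through one logical tree; the bucket name and paths are hypothetical.

```java
import java.util.List;

import alluxio.AlluxioURI;
import alluxio.client.file.FileSystem;
import alluxio.client.file.URIStatus;

public class UnifiedNamespaceExample {
  public static void main(String[] args) throws Exception {
    // Obtain a client for the Alluxio filesystem (configuration comes from alluxio-site.properties).
    FileSystem fs = FileSystem.Factory.get();

    // Mount an S3 bucket into the logical namespace; other paths may already be
    // backed by HDFS or other under stores. Bucket and paths are placeholders.
    fs.mount(new AlluxioURI("/sales"), new AlluxioURI("s3://example-bucket/sales"));

    // Applications see a single logical tree, regardless of which under store holds the data.
    List<URIStatus> entries = fs.listStatus(new AlluxioURI("/"));
    for (URIStatus status : entries) {
      System.out.println(status.getPath());
    }
  }
}
```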
Data Locality with Scale-Out Workers
Local I/O performance for remote data with intelligent multi-tiering
● Hot / warm / cold data placed on RAM / SSD / HDD tiers
● Read & write buffering, transparent to the application
● Policies for pinning, promotion/demotion, and TTL
[Diagram: big data ETL, big data query, and model training workloads on-premises and in the public cloud access remote data through tiered Alluxio workers]
Metadata Locality with Scalable Masters
Synchronization of changes across clusters
[Diagram: a mutation replaces the old file at /file1 with a new one; the Alluxio master (backed by RocksDB) synchronizes the metadata change across on-premises and public-cloud deployments running big data ETL, big data query, and model training]
Case study: Tencent (HKG: 0700)
Tencent has deployed a 1000-node Alluxio cluster and designed a scalable, robust, and performant architecture to speed up Ceph storage for game AI training.

                Jobs       Failures   Failure rate
  alluxio-fuse  250,000    1,848      0.73%
  ceph-fuse     250,000    7,070      2.8%

https://www.alluxio.io/blog/thousand-node-alluxio-cluster-powers-game-ai-platform-production-case-study-from-tencent/
Case study: Top Online Travel Shopping Site in the US
This company consolidated its compute platform onto a main data lake in one AWS region. Previously, this required copying data daily from satellite data lakes in other regions to the main region for joint queries. That process was expensive and time consuming and, more importantly, could yield inconsistent query results. Alluxio provides a consistent view and "cheap" data access to remote AWS data.
[Diagram: Hive workloads from Team A and Team C query the main data lake (us-east-1), with the satellite data lakes in us-west-1 and us-west-2 mounted into the same namespace]
Under the Hood
Metadata Storage Challenges
• Storing the raw metadata becomes a problem with a large number of files
• On average, each file takes 1 KB of on-heap storage
• 1 billion files would take 1 TB of heap space!
• A typical JVM runs with < 64 GB of heap space
• GC becomes a big problem when using larger heaps
Metadata Serving Challenges
• Common file operations (e.g., getStatus, create) need to be fast
• On-heap data structures excel in this case
• Operations need to be optimized for high concurrency
• Generally many readers and few writers for large-scale analytics
• The metadata service also needs to sustain high load
• A cluster of 100 machines can easily house over 5k concurrent clients!
• Connection life cycles need to be managed well
• The connection handshake is expensive
• Holding an idle connection is also detrimental
Solving Scalable Metadata Storage Using RocksDB
Tiered Metadata Storage => 1 Billion Files
Alluxio Master:
● On heap: inode cache, mount table, locks
● On local disk (embedded RocksDB): inode tree, block map, worker block locations
RocksDB
• https://rocksdb.org/
• RocksDB is an embeddable persistent key-value store for fast storage
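For readers unfamiliar with RocksDB, here is a minimal sketch (not from the talk) of the embedded usage pattern with the official RocksDB Java bindings; the database path, keys, and values are placeholders.

```java
import org.rocksdb.Options;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;

public class RocksDbSketch {
  public static void main(String[] args) throws RocksDBException {
    RocksDB.loadLibrary(); // load the bundled native library

    try (Options options = new Options().setCreateIfMissing(true);
         // RocksDB runs embedded in the JVM process and persists to a local directory.
         RocksDB db = RocksDB.open(options, "/tmp/metadata-db")) {
      db.put("inode:2".getBytes(), "serialized inode metadata".getBytes());
      byte[] value = db.get("inode:2".getBytes());
      System.out.println(new String(value));
    }
  }
}
```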
Working with RocksDB
• Abstract the metadata storage layer
• Redesign the data structure representation of the filesystem tree
• Each inode is represented by a numerical ID
• The edge table maps <ID, child name> to <ID of child>, e.g., <1, "foo"> → 2
• The inode table maps <ID> to <metadata blob of inode>, e.g., <2> → proto
• The two-table solution provides good performance for common operations (see the sketch below)
• Listing a directory is a single prefix scan
• Tree traversal takes one lookup per level of path depth
• Updates/deletes/creates take a constant number of inserts
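To make the two-table layout concrete, the sketch below (my own simplification, not Alluxio's actual code) stores an edge table and an inode table in a single RocksDB instance using key prefixes, and lists a directory with one prefix scan; the key encodings and paths are hypothetical.

```java
import java.nio.charset.StandardCharsets;

import org.rocksdb.Options;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;
import org.rocksdb.RocksIterator;

// Illustrative only: "e:" keys form the edge table, "i:" keys form the inode table.
public class InodeTablesSketch {
  static byte[] b(String s) {
    return s.getBytes(StandardCharsets.UTF_8);
  }

  public static void main(String[] args) throws RocksDBException {
    RocksDB.loadLibrary();
    try (Options opts = new Options().setCreateIfMissing(true);
         RocksDB db = RocksDB.open(opts, "/tmp/inode-db")) {

      // Edge table: <parentId, childName> -> childId, e.g. <1, "foo"> -> 2.
      db.put(b("e:1/foo"), b("2"));
      db.put(b("e:1/bar"), b("3"));
      // Inode table: <id> -> serialized metadata blob, e.g. <2> -> proto bytes.
      db.put(b("i:2"), b("{mode:0755,owner:alice}"));
      db.put(b("i:3"), b("{mode:0644,owner:bob}"));

      // Listing the children of inode 1 is a single prefix scan over the edge table.
      try (RocksIterator it = db.newIterator()) {
        for (it.seek(b("e:1/")); it.isValid() && new String(it.key()).startsWith("e:1/"); it.next()) {
          System.out.println(new String(it.key()) + " -> " + new String(it.value()));
        }
      }
    }
  }
}
```

Resolving a path such as /foo/file repeats the edge-table lookup once per path component, and creating, renaming, or deleting a file touches only a constant number of keys.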
Effects of the Inode Cache
• Can generally hold up to 10M inodes
• Caching the top levels of the filesystem tree greatly speeds up reads
• 20-50% performance loss when the filesystem tree does not mostly fit into memory
• Writes can be buffered in the cache and asynchronously flushed to RocksDB (see the sketch below)
• No durability requirement here; durability is handled by the journal
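As a rough illustration of that write-back behavior (again my own simplification, not the actual Alluxio cache; class and field names are hypothetical), the sketch below buffers inode writes in an on-heap map and flushes dirty entries to the embedded RocksDB store in the background.

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

import org.rocksdb.Options;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;

// Illustrative write-back inode cache: reads are served from the on-heap map when
// possible; writes are buffered and flushed to RocksDB asynchronously.
public class WriteBackInodeCache implements AutoCloseable {
  private final RocksDB db;
  private final Map<Long, String> dirty = new ConcurrentHashMap<>();
  private final ScheduledExecutorService flusher = Executors.newSingleThreadScheduledExecutor();

  public WriteBackInodeCache(String path) throws RocksDBException {
    RocksDB.loadLibrary();
    db = RocksDB.open(new Options().setCreateIfMissing(true), path);
    // Flush buffered writes periodically; durability comes from the journal, not this flush.
    flusher.scheduleAtFixedRate(this::flush, 1, 1, TimeUnit.SECONDS);
  }

  public void put(long inodeId, String metadata) {
    dirty.put(inodeId, metadata); // buffered on heap first
  }

  public String get(long inodeId) throws RocksDBException {
    String buffered = dirty.get(inodeId);
    if (buffered != null) {
      return buffered;
    }
    byte[] value = db.get(key(inodeId));
    return value == null ? null : new String(value, StandardCharsets.UTF_8);
  }

  private void flush() {
    dirty.forEach((id, meta) -> {
      try {
        db.put(key(id), meta.getBytes(StandardCharsets.UTF_8));
        dirty.remove(id, meta); // drop only if not modified again since the write
      } catch (RocksDBException e) {
        // Keep the entry dirty; it will be retried on the next flush cycle.
      }
    });
  }

  private static byte[] key(long inodeId) {
    return Long.toString(inodeId).getBytes(StandardCharsets.UTF_8);
  }

  @Override
  public void close() {
    flusher.shutdown();
    flush();
    db.close();
  }
}
```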
Self-Managed Quorum for Leader Election and Journal Fault Tolerance Using Raft
Built-in Fault Tolerance
• Alluxio masters run as a quorum for journal fault tolerance
• Metadata can be recovered, solving the durability problem
• This was previously done using external fault-tolerant storage
• Alluxio masters leverage the same quorum to elect a leader
• Enables hot standbys for rapid recovery in case of a single-node failure
Raft
• https://raft.github.io/
• Raft is a consensus algorithm designed to be easy to understand. It is equivalent to Paxos in fault tolerance and performance.
• Implemented using https://github.com/atomix/copycat
A New HA Mode with Self-Managed Services
• Consensus achieved internally
  • The leading master commits state changes (see the sketch below)
• Benefits
  • Local disk for the journal
• Challenges
  • Performance tuning
[Diagram: the leading master replicates each state change through Raft to two standby masters]
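The hypothetical sketch below (my own types, not the copycat or Alluxio API) illustrates the flow the slide describes: the leading master proposes each metadata change to the Raft log, and every master, leader and standbys alike, applies committed entries to its local journal state machine.

```java
// Hypothetical interfaces for illustration; the real implementation relies on a
// Raft library (copycat at the time of this talk) rather than these types.
interface RaftLog {
  // Replicates the entry and returns once a quorum of masters has durably logged it.
  void commit(byte[] entry);
}

interface JournalStateMachine {
  // Invoked on every master, in log order, for each committed entry.
  void apply(byte[] entry);
}

final class LeadingMaster {
  private final RaftLog raftLog;

  LeadingMaster(RaftLog raftLog) {
    this.raftLog = raftLog;
  }

  /** Example state change: rename a path. Only the elected leader proposes entries. */
  void rename(String src, String dst) {
    byte[] journalEntry = ("RENAME " + src + " " + dst).getBytes();
    // Once this returns, the change survives the loss of any minority of masters.
    raftLog.commit(journalEntry);
  }
}

final class MasterProcess implements JournalStateMachine {
  @Override
  public void apply(byte[] entry) {
    // Leader and hot standbys replay committed entries into their local metadata
    // (on-heap cache + embedded RocksDB), keeping standbys ready for fast failover.
    System.out.println("applying: " + new String(entry));
  }
}
```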
Alluxio + Raft architecture
Advantages
- Simplicity
  - No external systems (Raft is colocated with the masters)
  - Raft takes care of logging, snapshotting, recovery, etc.
- Performance
  - Journal stored directly on the masters
  - RocksDB key-value store + on-heap cache
Questions?
● Website: www.alluxio.io
● Slack: http://slackin.alluxio.io/
● Social media: Twitter.com/alluxio, Linkedin.com/alluxio
