Building a Distributed File System for the Cloud-Native Era
Bin Fan, 05-19-2022 @ Big Data Bellevue
About Me
Bin Fan (https://www.linkedin.com/in/bin-fan/)
● Founding Engineer, VP Open Source @ Alluxio
● Alluxio PMC Co-Chair, Presto TSC/committer
● Email: binfan@alluxio.com
● PhD in CS @ Carnegie Mellon University
Alluxio Overview
● Originally a research project (Tachyon) in the UC Berkeley AMPLab, led by then-PhD student Haoyuan Li (Alluxio's founder and CEO)
● Backed by top VCs (e.g., Andreessen Horowitz) with $70M raised in total; the Series C ($50M) was announced in 2021
● Deployed in production at large scale at Facebook, Uber, Microsoft, Tencent, TikTok, and others
● More than 1,100 contributors on GitHub; in 2021, more than 40% of commits were contributed by community users
● Ranked the 9th most critical Java-based open-source project on GitHub by Google/OpenSSF [1]
[1] Google Comes Up With A Metric For Gauging Critical Open-Source Projects
7 Years Ago
[Screenshot of a Tachyon talk at an AMPLab event]
Alluxio (Tachyon) in 2015
Alluxio (Tachyon) in 2015: Enable Data Sharing Among Spark Jobs
[Diagram: Spark Task 1 and Spark Task 2 share an in-memory RDD through Tachyon, with blocks 1-4 persisted on disk in HDFS / Amazon S3]
Alluxio (Tachyon) in 2015: Fast Checkpoint for Job Reliability
[Diagram: a Spark task keeps its block manager, execution engine, and storage engine in the same process, while Tachyon holds blocks in memory and checkpoints them to HDFS / Amazon S3]
Today
Companies Using Alluxio
[Logo wall of adopters across internet, public cloud providers, technology, financial services, telco & media, e-commerce, and other sectors]
The Evolution from Hadoop to the Cloud-Native Era
Topology
● On-prem Hadoop → cloud-native, multi-cloud or hybrid-cloud, multi-datacenter
Computation
● MR/Spark → Spark, Presto, Hive, TensorFlow, PyTorch, …
● More mature frameworks (e.g., less frequent OOMs)
Data access pattern
● Sequential reads (e.g., scanning) over unstructured files → ad-hoc reads into structured/columnar data
● Hundreds to thousands of big files → millions of small files
The Evolution from Hadoop to the Cloud-Native Era (continued)
Data storage
● On-prem, colocated HDFS → S3 and other object stores (possibly across regions such as us-east and us-west), with legacy on-prem HDFS still in service
Resource/job orchestration
● YARN → Kubernetes
○ Data locality is no longer a first-class concern
Analytics & AI in the Hybrid and Multi-Cloud Era
Unified Namespace
Provides users with a single, consistent logical filesystem.
• Extension: a single logical Alluxio path can be backed by multiple storage systems (see the sketch below)
○ Example custom data policy: migrate data older than 7 days from HDFS to S3
[Diagram: Reports and Sales paths in the Alluxio namespace mounted from underlying storage such as hdfs://host:port/directory/]
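As a rough illustration of the unified namespace (a sketch of my own, not code from the talk, assuming the Alluxio 2.x Java client API), the snippet below mounts an S3 bucket into the Alluxio namespace next to data already mounted from HDFS and lists everything through one logical tree; the bucket name and paths are hypothetical.

```java
import java.util.List;

import alluxio.AlluxioURI;
import alluxio.client.file.FileSystem;
import alluxio.client.file.URIStatus;

public class UnifiedNamespaceExample {
  public static void main(String[] args) throws Exception {
    // Obtain a client for the Alluxio filesystem (configuration comes from alluxio-site.properties).
    FileSystem fs = FileSystem.Factory.get();

    // Mount an S3 bucket into the logical namespace; other paths may already be
    // backed by HDFS or other under stores. Bucket and paths are placeholders.
    fs.mount(new AlluxioURI("/sales"), new AlluxioURI("s3://example-bucket/sales"));

    // Applications see a single logical tree, regardless of which under store holds the data.
    List<URIStatus> entries = fs.listStatus(new AlluxioURI("/"));
    for (URIStatus status : entries) {
      System.out.println(status.getPath());
    }
  }
}
```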
Data Locality with Scale-Out Workers
Local I/O performance for remote data with intelligent multi-tiering
● Hot / warm / cold data placed on RAM / SSD / HDD tiers
● Read & write buffering, transparent to the application
● Policies for pinning, promotion/demotion, and TTL
[Diagram: big data ETL, big data query, and model training workloads on-premises and in the public cloud access remote data through tiered Alluxio workers]
Metadata Locality with Scalable Masters
Synchronization of changes across clusters
[Diagram: a mutation replaces the old file at /file1 with a new one; the Alluxio master (backed by RocksDB) synchronizes the metadata change across on-premises and public-cloud deployments running big data ETL, big data query, and model training]
Case study: Tencent (HKG: 0700)
Tencent has deployed a 1000-node Alluxio cluster and designed a scalable, robust, and performant architecture to speed up Ceph storage for game AI training.

                Jobs       Failures   Failure rate
  alluxio-fuse  250,000    1,848      0.73%
  ceph-fuse     250,000    7,070      2.8%

https://www.alluxio.io/blog/thousand-node-alluxio-cluster-powers-game-ai-platform-production-case-study-from-tencent/
Case study: Top Online Travel Shopping Site in the US
This company consolidated its compute platform onto a main data lake in one AWS region. Previously, this required copying data daily from satellite data lakes in other regions to the main region for joint queries. That process was expensive and time consuming and, more importantly, could yield inconsistent query results. Alluxio provides a consistent view and "cheap" data access to remote AWS data.
[Diagram: Hive workloads from Team A and Team C query the main data lake (us-east-1), with the satellite data lakes in us-west-1 and us-west-2 mounted into the same namespace]
Under the Hood
Metadata Storage Challenges
• Storing the raw metadata becomes a problem with a large number of files
• On average, each file takes 1 KB of on-heap storage
• 1 billion files would take 1 TB of heap space!
• A typical JVM runs with < 64 GB of heap space
• GC becomes a big problem when using larger heaps
Metadata Serving Challenges
• Common file operations (e.g., getStatus, create) need to be fast
• On-heap data structures excel in this case
• Operations need to be optimized for high concurrency
• Generally many readers and few writers for large-scale analytics
• The metadata service also needs to sustain high load
• A cluster of 100 machines can easily house over 5k concurrent clients!
• Connection life cycles need to be managed well
• The connection handshake is expensive
• Holding an idle connection is also detrimental
Solving Scalable Metadata Storage Using RocksDB
Tiered Metadata Storage => 1 Billion Files
Alluxio Master:
● On heap: inode cache, mount table, locks
● On local disk (embedded RocksDB): inode tree, block map, worker block locations
RocksDB
• https://rocksdb.org/
• RocksDB is an embeddable persistent key-value store for fast storage
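For readers unfamiliar with RocksDB, here is a minimal sketch (not from the talk) of the embedded usage pattern with the official RocksDB Java bindings; the database path, keys, and values are placeholders.

```java
import org.rocksdb.Options;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;

public class RocksDbSketch {
  public static void main(String[] args) throws RocksDBException {
    RocksDB.loadLibrary(); // load the bundled native library

    try (Options options = new Options().setCreateIfMissing(true);
         // RocksDB runs embedded in the JVM process and persists to a local directory.
         RocksDB db = RocksDB.open(options, "/tmp/metadata-db")) {
      db.put("inode:2".getBytes(), "serialized inode metadata".getBytes());
      byte[] value = db.get("inode:2".getBytes());
      System.out.println(new String(value));
    }
  }
}
```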
Working with RocksDB
• Abstract the metadata storage layer
• Redesign the data structure representation of the filesystem tree
• Each inode is represented by a numerical ID
• The edge table maps <ID, child name> to <ID of child>, e.g., <1, "foo"> → 2
• The inode table maps <ID> to <metadata blob of inode>, e.g., <2> → proto
• The two-table solution provides good performance for common operations (see the sketch below)
• Listing a directory is a single prefix scan
• Tree traversal takes one lookup per level of path depth
• Updates/deletes/creates take a constant number of inserts
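To make the two-table layout concrete, the sketch below (my own simplification, not Alluxio's actual code) stores an edge table and an inode table in a single RocksDB instance using key prefixes, and lists a directory with one prefix scan; the key encodings and paths are hypothetical.

```java
import java.nio.charset.StandardCharsets;

import org.rocksdb.Options;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;
import org.rocksdb.RocksIterator;

// Illustrative only: "e:" keys form the edge table, "i:" keys form the inode table.
public class InodeTablesSketch {
  static byte[] b(String s) {
    return s.getBytes(StandardCharsets.UTF_8);
  }

  public static void main(String[] args) throws RocksDBException {
    RocksDB.loadLibrary();
    try (Options opts = new Options().setCreateIfMissing(true);
         RocksDB db = RocksDB.open(opts, "/tmp/inode-db")) {

      // Edge table: <parentId, childName> -> childId, e.g. <1, "foo"> -> 2.
      db.put(b("e:1/foo"), b("2"));
      db.put(b("e:1/bar"), b("3"));
      // Inode table: <id> -> serialized metadata blob, e.g. <2> -> proto bytes.
      db.put(b("i:2"), b("{mode:0755,owner:alice}"));
      db.put(b("i:3"), b("{mode:0644,owner:bob}"));

      // Listing the children of inode 1 is a single prefix scan over the edge table.
      try (RocksIterator it = db.newIterator()) {
        for (it.seek(b("e:1/")); it.isValid() && new String(it.key()).startsWith("e:1/"); it.next()) {
          System.out.println(new String(it.key()) + " -> " + new String(it.value()));
        }
      }
    }
  }
}
```

Resolving a path such as /foo/file repeats the edge-table lookup once per path component, and creating, renaming, or deleting a file touches only a constant number of keys.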
Effects of the Inode Cache
• Can generally hold up to 10M inodes
• Caching the top levels of the filesystem tree greatly speeds up reads
• 20-50% performance loss when the filesystem tree does not mostly fit into memory
• Writes can be buffered in the cache and asynchronously flushed to RocksDB (see the sketch below)
• No durability requirement here; durability is handled by the journal
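As a rough illustration of that write-back behavior (again my own simplification, not the actual Alluxio cache; class and field names are hypothetical), the sketch below buffers inode writes in an on-heap map and flushes dirty entries to the embedded RocksDB store in the background.

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

import org.rocksdb.Options;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;

// Illustrative write-back inode cache: reads are served from the on-heap map when
// possible; writes are buffered and flushed to RocksDB asynchronously.
public class WriteBackInodeCache implements AutoCloseable {
  private final RocksDB db;
  private final Map<Long, String> dirty = new ConcurrentHashMap<>();
  private final ScheduledExecutorService flusher = Executors.newSingleThreadScheduledExecutor();

  public WriteBackInodeCache(String path) throws RocksDBException {
    RocksDB.loadLibrary();
    db = RocksDB.open(new Options().setCreateIfMissing(true), path);
    // Flush buffered writes periodically; durability comes from the journal, not this flush.
    flusher.scheduleAtFixedRate(this::flush, 1, 1, TimeUnit.SECONDS);
  }

  public void put(long inodeId, String metadata) {
    dirty.put(inodeId, metadata); // buffered on heap first
  }

  public String get(long inodeId) throws RocksDBException {
    String buffered = dirty.get(inodeId);
    if (buffered != null) {
      return buffered;
    }
    byte[] value = db.get(key(inodeId));
    return value == null ? null : new String(value, StandardCharsets.UTF_8);
  }

  private void flush() {
    dirty.forEach((id, meta) -> {
      try {
        db.put(key(id), meta.getBytes(StandardCharsets.UTF_8));
        dirty.remove(id, meta); // drop only if not modified again since the write
      } catch (RocksDBException e) {
        // Keep the entry dirty; it will be retried on the next flush cycle.
      }
    });
  }

  private static byte[] key(long inodeId) {
    return Long.toString(inodeId).getBytes(StandardCharsets.UTF_8);
  }

  @Override
  public void close() {
    flusher.shutdown();
    flush();
    db.close();
  }
}
```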
Self-Managed Quorum for Leader Election and Journal Fault Tolerance Using Raft
Built-in Fault Tolerance
• Alluxio masters run as a quorum for journal fault tolerance
• Metadata can be recovered, solving the durability problem
• This was previously done using external fault-tolerant storage
• Alluxio masters leverage the same quorum to elect a leader
• Enables hot standbys for rapid recovery in case of a single-node failure
Raft
• https://raft.github.io/
• Raft is a consensus algorithm designed to be easy to understand. It is equivalent to Paxos in fault tolerance and performance.
• Implemented using https://github.com/atomix/copycat
A New HA Mode with Self-Managed Services
• Consensus achieved internally
  • The leading master commits state changes (see the sketch below)
• Benefits
  • Local disk for the journal
• Challenges
  • Performance tuning
[Diagram: the leading master replicates each state change through Raft to two standby masters]
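The hypothetical sketch below (my own types, not the copycat or Alluxio API) illustrates the flow the slide describes: the leading master proposes each metadata change to the Raft log, and every master, leader and standbys alike, applies committed entries to its local journal state machine.

```java
// Hypothetical interfaces for illustration; the real implementation relies on a
// Raft library (copycat at the time of this talk) rather than these types.
interface RaftLog {
  // Replicates the entry and returns once a quorum of masters has durably logged it.
  void commit(byte[] entry);
}

interface JournalStateMachine {
  // Invoked on every master, in log order, for each committed entry.
  void apply(byte[] entry);
}

final class LeadingMaster {
  private final RaftLog raftLog;

  LeadingMaster(RaftLog raftLog) {
    this.raftLog = raftLog;
  }

  /** Example state change: rename a path. Only the elected leader proposes entries. */
  void rename(String src, String dst) {
    byte[] journalEntry = ("RENAME " + src + " " + dst).getBytes();
    // Once this returns, the change survives the loss of any minority of masters.
    raftLog.commit(journalEntry);
  }
}

final class MasterProcess implements JournalStateMachine {
  @Override
  public void apply(byte[] entry) {
    // Leader and hot standbys replay committed entries into their local metadata
    // (on-heap cache + embedded RocksDB), keeping standbys ready for fast failover.
    System.out.println("applying: " + new String(entry));
  }
}
```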
Alluxio + Raft architecture
Advantages
- Simplicity
  - No external systems (Raft is colocated with the masters)
  - Raft takes care of logging, snapshotting, recovery, etc.
- Performance
  - Journal stored directly on the masters
  - RocksDB key-value store + on-heap cache
Questions?
● Website: www.alluxio.io
● Slack: http://slackin.alluxio.io/
● Social media: Twitter.com/alluxio, Linkedin.com/alluxio
