Alluxio Community Office Hour: Getting Started with Alluxio Open Source
1. Getting Started with Alluxio Open Source
2019/01/28 Office Hour
Follow us | @alluxio
Download Alluxio | www.alluxio.org
Questions? | http://alluxio.org/slack
2. About Me
• Bin Fan
• PhD CS@CMU
• Founding Engineer@Alluxio
Email: binfan@alluxio.com
Github: apc999
Twitter: @binfan
3. Company Overview
• Founded Feb. 2015 – Haoyuan Li
• PhD research at UC Berkeley AMPLab
• Initially Tachyon Nexus
• Venture Backed
• Andreessen Horowitz, Seven Seas etc.
• Open Source Business Model
• Tachyon Open Sourced in Dec. 2012
• Open source v1.0 released Feb. 2016
• Commercial product released Oct. 2016
• Office in San Mateo, CA
• Team: Google, Palantir, VMware, AMD, Cisco…
4. Data Access Layer
Data Access Layer: Alluxio
• Security
• Standard APIs
• High Performance
• Compatibility
• Decoupling
• Transparent Migration
5. The Data Access Layer
• Abstraction layer between applications and storage systems
• Present a stable storage interface to applications, including semantics, security, and performance
• Eliminate the weaknesses of data silos rather than the silos themselves
• Enable transparent migration of underlying storage systems
• Enable application API to storage API translation in a single layer
6. Alluxio
• Our implementation of the data access layer: a virtual distributed file system
• Open source project with over 900 contributors from hundreds of organizations worldwide
• Deployed in many top internet and financial companies
8. Install Alluxio
- Install Alluxio using brew on local macOS
- brew install alluxio
- Install Alluxio using Docker on local Linux
- http://www.alluxio.org/docs/1.8/en/deploy/Running-Alluxio-On-Docker.html
10. Read Data not Cached in Alluxio + Caching
[Diagram: Application with Alluxio Client, Alluxio Worker (RAM / SSD / HDD), and Under Store, connected by numbered steps 1 to 4]
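The read path on this slide and the next can be sketched as a small simulation. This is only an illustration, not Alluxio code: the classes UnderStore and Worker and the path /data/part-0 are hypothetical stand-ins. It assumes the simplest behavior consistent with the slides: a miss is served from the under store and then cached on the worker, and a later read of the same path is served from the worker.

```python
class UnderStore:
    """Hypothetical stand-in for remote storage (for example S3 or HDFS)."""

    def __init__(self, files):
        self.files = dict(files)
        self.reads = 0  # counts how often the slow path is taken

    def read(self, path):
        self.reads += 1
        return self.files[path]


class Worker:
    """Hypothetical stand-in for an Alluxio worker with local RAM / SSD / HDD."""

    def __init__(self, under_store):
        self.under_store = under_store
        self.cache = {}

    def read(self, path):
        # Cache hit: serve directly from the worker's local storage.
        if path in self.cache:
            return self.cache[path], "hit"
        # Cache miss: fetch from the under store, then keep a copy locally.
        data = self.under_store.read(path)
        self.cache[path] = data
        return data, "miss"


ufs = UnderStore({"/data/part-0": b"hello"})
worker = Worker(ufs)
first = worker.read("/data/part-0")   # goes to the under store
second = worker.read("/data/part-0")  # served from the worker cache
print(first, second, ufs.reads)
```

The under store is touched once even though the data is read twice, which is the point of caching on the miss path.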
11. Read Cached Data in Alluxio
[Diagram: Application with Alluxio Client and Alluxio Worker (RAM / SSD / HDD), connected by numbered steps 1 to 3]
12. Write data only to Alluxio
[Diagram: Application with Alluxio Client and Alluxio Worker (RAM / SSD / HDD), connected by numbered steps 1 to 3]
13. Write to Alluxio and Under Store Synchronously
[Diagram: Application with Alluxio Client, Alluxio Worker (RAM / SSD / HDD), and Under Store, connected by numbered steps 1 to 3]
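The two write paths on slides 12 and 13 differ only in whether the under store is updated before the write returns. A minimal sketch, assuming plain dictionaries stand in for the worker and the under store; the names WriteMode, ALLUXIO_ONLY, and SYNC_THROUGH are made up for this example and are not Alluxio identifiers.

```python
from enum import Enum


class WriteMode(Enum):
    ALLUXIO_ONLY = "alluxio_only"  # slide 12: write data only to Alluxio
    SYNC_THROUGH = "sync_through"  # slide 13: Alluxio and under store, synchronously


def write(path, data, mode, worker_cache, under_store):
    """Write to the worker; also write to the under store in SYNC_THROUGH mode.

    Returns True if the data is in the under store when the call returns.
    """
    worker_cache[path] = data
    if mode is WriteMode.SYNC_THROUGH:
        under_store[path] = data
    return path in under_store


worker_cache, under_store = {}, {}
persisted_a = write("/tmp/a", b"x", WriteMode.ALLUXIO_ONLY, worker_cache, under_store)
persisted_b = write("/out/b", b"y", WriteMode.SYNC_THROUGH, worker_cache, under_store)
print(persisted_a, persisted_b, sorted(under_store))
```

Writing only to Alluxio is fast but leaves the data unpersisted; the synchronous mode pays the under store latency on every write in exchange for durability.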
14. A Common File System Abstraction
• Common interface across apps
• HDFS-compatible interface: change hdfs://foo/bar to alluxio://foo/bar
• Other interfaces: native Alluxio Java FS, POSIX, and S3
• Cloud storage becomes “hidden” to apps
• Less vendor lock-in!
[Diagram: Compute zone, standalone or managed with Mesos or YARN, running TensorFlow, Presto, and MR through the HDFS API and POSIX API; storage in a different availability zone, either on-prem or cloud]
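The URI change from the slide (hdfs://foo/bar to alluxio://foo/bar) can be applied mechanically to existing job configurations. A small sketch using only the Python standard library; the helper name to_alluxio_uri is hypothetical, and it swaps only the scheme, exactly as the slide shows. In a real deployment the host part may also need to change, which this sketch does not attempt.

```python
from urllib.parse import urlsplit, urlunsplit


def to_alluxio_uri(uri):
    """Swap the hdfs scheme for alluxio, leaving the rest of the URI untouched."""
    parts = urlsplit(uri)
    if parts.scheme != "hdfs":
        raise ValueError("expected an hdfs:// URI, got: " + uri)
    return urlunsplit(("alluxio", parts.netloc, parts.path, parts.query, parts.fragment))


converted = to_alluxio_uri("hdfs://foo/bar")
print(converted)
```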
15. Data Path: Improved I/O Performance
• A New Tier Above Cloud Storage for Compute
• Distributed buffer cache
• Restore locality to compute
• Read:
• Cache-hit read: served by Alluxio workers (local worker preferred)
• Cache-miss read: served by cloud storage, then cached on an Alluxio worker
• Write:
• Burst buffer, then async propagate to S3 (Alluxio 2.0)
• Challenges:
• Locality: expose location information to applications; serve local apps through ramdisk (rather than network)
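The "local worker preferred" rule for cache-hit reads can be illustrated with a tiny selection function. This is a sketch under simple assumptions, not Alluxio's actual policy: workers are identified by hostname only, the function name pick_worker is hypothetical, and any remote worker is considered as good as another.

```python
def pick_worker(client_host, workers_with_block):
    """Choose which worker should serve a cached block.

    Prefers a worker on the same host as the client; otherwise falls back
    to the first remote worker. Returns None when no worker holds the block,
    in which case the caller would take the cache-miss path.
    """
    if not workers_with_block:
        return None
    for host in workers_with_block:
        if host == client_host:
            return host  # local read, no network hop
    return workers_with_block[0]


local_choice = pick_worker("node-2", ["node-1", "node-2"])
remote_choice = pick_worker("node-9", ["node-1", "node-2"])
miss = pick_worker("node-1", [])
print(local_choice, remote_choice, miss)
```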
16. Data Path: Async Persist to S3 (Alluxio 2.0)
[Diagram: Application with Alluxio Client, Alluxio Master, Alluxio Worker (RAM / SSD / HDD), and Under Store]
• Async Writes
• Step1: App writes to Alluxio
• Step2: Alluxio writes to UFS
• Benefits
• Apps write at Alluxio speed
• Data gets persisted
• Challenges
• File rename/delete before persist: 2PC
• Fault-tolerance: journal async requests
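The two-step async write and the idea of journaling persist requests can be sketched as follows. This is a toy model, not the Alluxio 2.0 implementation: the class AsyncPersister is hypothetical, the journal is an in-memory queue standing in for a durable log, and rename or delete before persist (the 2PC challenge above) is deliberately not handled.

```python
from collections import deque


class AsyncPersister:
    """Toy model of async persist: write to Alluxio now, copy to the UFS later."""

    def __init__(self):
        self.alluxio = {}       # data held by Alluxio workers
        self.under_store = {}   # data persisted in the under store (UFS)
        self.journal = deque()  # pending persist requests, recorded at write time

    def write(self, path, data):
        # Step 1: the app's write returns as soon as Alluxio has the data
        # and the persist request has been journaled.
        self.alluxio[path] = data
        self.journal.append(path)

    def persist_pending(self):
        # Step 2: runs in the background, or after a restart to replay
        # journaled requests that had not completed yet.
        persisted = []
        while self.journal:
            path = self.journal[0]
            self.under_store[path] = self.alluxio[path]
            self.journal.popleft()  # drop the entry only after the copy succeeds
            persisted.append(path)
        return persisted


p = AsyncPersister()
p.write("/logs/a", b"1")
pending_before = list(p.journal)
persisted_before = dict(p.under_store)
done = p.persist_pending()
print(pending_before, persisted_before, done, p.under_store)
```

Because the request is journaled before the write returns, a restart can find and finish any persist that was still pending.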
17. Metadata Path: Familiar Semantics
• Listing / renaming on object store can be expensive
• Common operations for batch or SQL analytics
• Overwriting Put is eventually consistent
• Alluxio loads and manages metadata in master
• Apps can continue assuming HDFS-like semantics and performance implications
• Challenges
• Data modification bypassing Alluxio: when and how to re-sync
• Slow lists in object store: batch operations
• Too many objects: off-heap metadata (Alluxio 2.0)
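The trade-off behind "when and how to re-sync" can be shown with a small model of a master that caches directory listings. This is a sketch only: the class MetadataMaster, the sync_interval parameter, and the injected clock are assumptions made for this example, and a time-based refresh is just one possible re-sync policy.

```python
class MetadataMaster:
    """Toy master that caches listings and re-syncs after a fixed interval."""

    def __init__(self, list_under_store, sync_interval, clock):
        self._list = list_under_store   # slow listing call against the object store
        self._interval = sync_interval  # seconds a cached listing is trusted
        self._clock = clock             # injected so the example is deterministic
        self._cache = {}                # directory -> (listing, time loaded)

    def list_dir(self, path):
        now = self._clock()
        entry = self._cache.get(path)
        if entry is None or now - entry[1] >= self._interval:
            # First access or stale entry: reload from the under store.
            entry = (sorted(self._list(path)), now)
            self._cache[path] = entry
        return entry[0]


store = {"/tbl": ["a.parquet"]}
calls = []


def slow_list(path):
    calls.append(path)
    return list(store[path])


now = [0.0]
master = MetadataMaster(slow_list, sync_interval=60.0, clock=lambda: now[0])
first = master.list_dir("/tbl")
store["/tbl"].append("b.parquet")  # modification that bypasses Alluxio
stale = master.list_dir("/tbl")    # still served from the master's cache
now[0] = 61.0
fresh = master.list_dir("/tbl")    # interval elapsed, so the master re-syncs
print(first, stale, fresh, len(calls))
```

A short interval keeps listings fresh at the cost of more object store calls; a long one gives fast listings but delays visibility of changes made outside Alluxio.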
18. Summary
- Alluxio: A New Data Access Layer
- Between compute and storage
- Transparent to big data analytics (HDFS-compatible, POSIX)
- Improve data and metadata performance on cloud storage
- Architecture and Data Flow
- Master, Worker, Under Storage
- Cache-{hit, miss} reads, Sync/Async writes
- More Use Cases: https://www.alluxio.org/community/powered-by-alluxio
- Send your mailing address to mel@alluxio.com to get the T-shirt!