Deep Learning and Gene Computing Acceleration with Alluxio in Kubernetes
Eric Li, Senior Architect, Alibaba Cloud
Agenda
• Why Alluxio on Kubernetes
• Brief introduction to Alibaba Cloud Kubernetes
• Challenges
• Alluxio Helm chart
• Contributions to Alluxio
• Best practices
• Known issues
Kubernetes: Cloud Native OS
Web/mobile applications
- Stateless
- Idempotent
- Horizontal scalable
MySQL, Kafka, TiDB, Elasticsearch, TensorFlow, Spark, Flink, Redis, ZooKeeper
Stateless -> StatefulSet (Enterprise App) -> Data Intelligence
The first choice to train AI models with 32/64/128 V100 GPUs
Why Alluxio on Kubernetes?
More and more data-driven applications run on Kubernetes
Unified Orchestration
Consistent, declarative provisioning
Fastest Growing Community
Disaggregated compute and storage is becoming mainstream in cloud
Flexible
Scalable
Easy to maintain
But data access for applications in Kubernetes is a bottleneck:
Adaptation to different storage systems and computation frameworks
Speed
Efficiency
Overview of ACK (Alibaba Cloud Container Service for Kubernetes)
[Architecture diagram: slide text not recoverable]
The Challenges of Alluxio + Kubernetes
How to deploy Alluxio in Kubernetes way?
How to access data without any change of application?
How to achieve the best performance of Alluxio in Kubernetes?
The Challenges of Alluxio + Kubernetes: The Answers
Deploy in a Kubernetes way: Helm chart / Operator
Access data without application changes: UFS plus POSIX FUSE, lazy loading from OSS
Best performance: optimized OSS SDK and short circuit
Alluxio on Kubernetes Architecture
- Master StatefulSet: Alluxio Master + Job Master pods
- Alluxio Worker DaemonSet: a Worker + Job Worker pod on each node, with tiered storage on RAM/SSD/HDD
- Alluxio FUSE DaemonSet: an alluxio-fuse pod on each node; training frameworks (Caffe, MXNet, TensorFlow) read through the FUSE mount with short circuit to the local worker
- ConfigMap: ALLUXIO_JAVA_OPTS, ALLUXIO_MASTER_JAVA_OPTS, ALLUXIO_WORKER_JAVA_OPTS
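For illustration, the ConfigMap above might look roughly like this minimal sketch; the object name and the option values are assumptions for the example, not the chart's exact output.

# Hypothetical sketch of the ConfigMap described above; name and values are illustrative.
apiVersion: v1
kind: ConfigMap
metadata:
  name: alluxio-config
data:
  # options shared by every Alluxio process
  ALLUXIO_JAVA_OPTS: "-Dalluxio.master.hostname=alluxio-master-0"
  # role-specific JVM options layered on top
  ALLUXIO_MASTER_JAVA_OPTS: "-Xmx8g"
  ALLUXIO_WORKER_JAVA_OPTS: "-Xmx4g"

The master StatefulSet and the two DaemonSets can then pick these variables up as pod environment (e.g. via envFrom).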
OSS SDK Optimization for Alluxio
[Bar chart: time to load ImageNet (143 GB), in minutes, for ossfuse, ossutil, and Alluxio]
One-click Installation with Helm
Value file of the Helm chart: an application-specific YAML file
Free to customize
Simple to deploy
Easy to share through a Helm repo
Moving to an Operator as the next step
Usage of Alluxio Helm Chart
$ cat << EOF > config.yaml
properties:
  fs.oss.accessKeyId: xxx
  fs.oss.accessKeySecret: yyy
  alluxio.master.mount.table.root.ufs: oss://imagenet-huabei5/
EOF
# One click install
$ helm install -f config.yaml alluxio-repo/alluxio --version 2.1.0-SNAPSHOT
# Preload the data
$ helm install --set dir=/images --set threads=54 alluxio-job
Why Choose Alluxio for HPC
Direct access to Alibaba OSS: poor performance, poor scalability
Explicit copying from Alibaba OSS to CPFS: good performance and good scalability, but expensive
Alluxio on top of Alibaba OSS: good performance, good scalability, lazy load, cheap!
Arena for Deep Learning Training
https://github.com/kubeflow/arena
arena CLI on top of Kubeflow and other backend CRDs
Frameworks: TensorFlow, Caffe, PyTorch, MPI, Horovod (plus Flink, Spark)
Infrastructure: Kubernetes / Docker; CPU/GPU/FPGA; Ethernet/RDMA; Hadoop/OSS/CPFS
Run Deep Learning Job with Alluxio
$ arena submit mpi \
    --name alluxio-4x8-cold \
    --gpus=8 \
    --workers=4 \
    --data-dir /alluxio-fuse/images:/data/imagenet \
    -e DATA_DIR=/data/imagenet \
    --image=registry.cn-hangzhou.aliyuncs.com/tensorflow-samples/perseus-benchmark \
    ./launch-example.sh 4 8
2019-10-24T07:51:42.021611213Z ----------------------------------------------------------------
2019-10-24T07:51:42.024245962Z 1000 images/sec: 234.2 +/- 0.7 (jitter = 8.3) 5.781
2019-10-24T07:51:42.024259919Z ----------------------------------------------------------------
2019-10-24T07:51:42.024264488Z total images/sec: 7492.44
2019-10-24T07:51:42.024267687Z ----------------------------------------------------------------
100% Faster (alluxio-fuse vs ossgw-nfs)
Training throughput of Alluxio vs OSS (ResNet50, batch size 128), in images/second:

GPUs   ossgw-nfs (ossmounter)   alluxio-fuse
1      309.79                   209.82
4      569.8                    1154.8
8      699.2                    2244.3
16     1349.87                  3868.79
32     3478.98                  7492.44
50% Faster (alluxio-fuse vs ossfs-fuse)
Training throughput of Alluxio vs OSS (ResNet50, batch size 128), in images/second:

GPUs   ossfs-fuse   alluxio-fuse
1      284.05       209.82
4      833.6        1154.8
8      1312.02      2244.3
16     2685.07      3868.79
32     5054.61      7492.44
HPC: Genomic Computing on Kubernetes
[Pipeline diagram: slide text largely not recoverable; storage is mounted into pods through a CSI PVC]
Users submit pipelines
IO Features
1. Small number of files (~100)
2. High throughput
3. Intensive requests (on the order of 10,000)
4. The same reference data (~50 GB) is read frequently across different pipelines
Read/Write-Intensive Throughput
- Leverage Alluxio to reduce read IO for the shared reference data
Best Practice – Cont.: Cache Policy Tradeoff
1. If the data size is less than the whole cache (mem + SSD), use LocalFirstAvoidEvictionPolicy to avoid swapping data between disk and memory frequently.
2. If the data size is larger than the whole cache, keep the default eviction behavior.
alluxio.user.ufs.block.read.location.policy: alluxio.client.block.policy.LocalFirstAvoidEvictionPolicy
alluxio.user.block.avoid.eviction.policy.reserved.size.bytes: 8GB
alluxio.worker.tieredstore.level0.dirs.path: /dev/shm,/var/lib/docker/alluxio-ssd
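As a sketch, the same tuning can sit in the Helm value file next to the OSS credentials, following the properties: layout of the config.yaml example earlier; the quota line is an added illustration with assumed sizes.

properties:
  alluxio.user.ufs.block.read.location.policy: alluxio.client.block.policy.LocalFirstAvoidEvictionPolicy
  alluxio.user.block.avoid.eviction.policy.reserved.size.bytes: 8GB
  # one memory directory and one SSD directory on the top tier
  alluxio.worker.tieredstore.level0.dirs.path: /dev/shm,/var/lib/docker/alluxio-ssd
  # per-directory quotas; sizes here are illustrative
  alluxio.worker.tieredstore.level0.dirs.quota: 16GB,100GB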
ResNet50: UnderFileSystemBlockReader failure
170 images/sec: 191.7 +/- 2.5 (jitter = 36.8) 7.467
170 images/sec: 191.7 +/- 2.5 (jitter = 36.8) 7.299
### Pin + no eviction, data size exceeds the pool size
450 images/sec: 91.8 +/- 3.0 (jitter = 40.1) 5.701
450 images/sec: 91.8 +/- 3.0 (jitter = 40.5) 5.487
650 images/sec: 75.5 +/- 2.9 (jitter = 35.2) 5.455
650 images/sec: 75.5 +/- 2.9 (jitter = 33.1) 5.776
ResNet50: Eviction
950 images/sec: 206.0 +/- 1.2 (jitter = 23.2) 6.197
950 images/sec: 206.0 +/- 1.2 (jitter = 23.2) 6.214
### No pin and no eviction; eviction happened
990 images/sec: 191.1 +/- 1.3 (jitter = 23.5) 6.234
1000 images/sec: 189.5 +/- 1.3 (jitter = 23.5) 6.171
Short Circuit with Local Volume
Tiered storage: capacity, medium type, and quota
hostPath or emptyDir
Different choices for short circuit (see the sketch below):
Unix socket for gRPC
Shared hostPath volume for FUSE
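A minimal sketch of what those volumes could look like in a worker or FUSE pod spec, assuming hostPath for both (emptyDir also works for the socket); names and paths are illustrative.

volumes:
  - name: alluxio-domain          # Unix domain socket for gRPC short circuit
    hostPath:
      path: /tmp/alluxio-domain
      type: DirectoryOrCreate
  - name: alluxio-fuse-mount      # shared hostPath so application pods can see the FUSE mount
    hostPath:
      path: /alluxio-fuse
      type: DirectoryOrCreate

The Unix socket serves short-circuit gRPC between a client and the worker on the same node; the shared hostPath makes the FUSE mount visible outside the alluxio-fuse pod.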
DL: Avoid Passive Cache
Training data is distributed across the Alluxio cluster; the client should not synchronize it to the local worker (passive vs. active caching).
Worker configuration to turn off passive cache (see the sketch below):
alluxio.user.file.passive.cache.enabled: false
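In the Helm value file this is one more entry under properties:, following the config.yaml example earlier (a sketch):

properties:
  alluxio.user.file.passive.cache.enabled: "false"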
25
Not So Cloud Native Yet
• Health check and availability check
• How to leverage an API to detect the health of FUSE and workers?
• Missing liveness probe and readiness probe (a hypothetical sketch follows this list)
• Observability support for Prometheus
• fs report metrics exporter
• Grafana dashboard
• Data-cache-aware scheduling
• Scheduler locality according to block host
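As a hypothetical illustration of the missing probes, a plain TCP check on the worker RPC port (29999 by default) would be the simplest starting point; this is a sketch, not something the current chart ships.

livenessProbe:
  tcpSocket:
    port: 29999        # default alluxio.worker.rpc.port
  initialDelaySeconds: 30
  periodSeconds: 10
readinessProbe:
  tcpSocket:
    port: 29999
  periodSeconds: 10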
Known Issues
1. Performance degrades by 10%-20% during data eviction
2. Append write
3. Intensive write
OOM for JVM/OS
1. Mixed node specifications (e.g., 8c16G and 8c32G nodes): distributed memory must be used effectively (FUSE memory vs. worker memory).
2. FUSE process memory consumption is high:
jvmOptions: "-XX:MaxDirectMemorySize=16g"  (bug: https://github.com/Alluxio/alluxio/issues/9525)
3. Can Alluxio's caching strategy keep the most frequently accessed data in cache?
alluxio.worker.evictor.class: alluxio.worker.block.evictor.LRUEvictor
4. Data refresh strategy:
alluxio.user.ufs.block.read.location.policy: alluxio.client.block.policy.LocalFirstAvoidEvictionPolicy
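A sketch of the two mitigations above expressed in the Helm value file; the fuse.jvmOptions key is an assumed chart layout, and the 16g cap is the value quoted above.

fuse:
  jvmOptions: "-XX:MaxDirectMemorySize=16g"
properties:
  alluxio.worker.evictor.class: alluxio.worker.block.evictor.LRUEvictor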
Takeaways
1. DL: if the sample data size is less than the whole cache (mem + SSD), avoid swapping data between disk and memory frequently.
2. DL: if the sample data size is larger than the whole cache, keep the default eviction behavior.
3. With an SSD tier, enable short circuit with a local volume.
4. HPC: object storage is accelerated for reads only at present.
5. HPC: for small worker nodes, disable passive caching.
6. HPC: always keep frequently accessed data in the memory tier.
7. K8s scheduler locality for MPI/PS jobs.
THANK YOU