Deep Learning and Gene Computing Acceleration with Alluxio in Kubernetes
Eric Li, Senior Architect, Alibaba Cloud
Agenda
• Why Alluxio on Kubernetes
• Brief introduction to Alibaba Cloud Kubernetes
• Challenges
• Alluxio Helm chart
• Contributions to Alluxio
• Best practices
• Known issues
Kubernetes: Cloud Native OS
Web/mobile applications
- Stateless
- Idempotent
- Horizontal scalable
MySQL, Kafka, TiDB, Elasticsearch, TensorFlow, Spark, Flink, Redis, ZooKeeper
Stateless -> StatefulSet (Enterprise App) -> Data Intelligence
The first choice to train AI models with 32/64/128 V100 GPUs
Why Alluxio on Kubernetes?
More and more data-driven applications run on Kubernetes
Unified Orchestration
Consistent, declarative provisioning
Fastest Growing Community
Disaggregated compute and storage is becoming mainstream in cloud
Flexible
Scalable
Easy to maintain
But data access for applications in Kubernetes is a bottleneck:
Adaptation to different storage systems and computation frameworks
Speed
Efficiency
Overview of ACK (Alibaba Cloud Container Service for Kubernetes)
[Architecture diagram: slide text not recoverable]
The Challenges of Alluxio + Kubernetes
How to deploy Alluxio in Kubernetes way?
How to access data without any change of application?
How to achieve the best performance of Alluxio in Kubernetes?
The Challenges of Alluxio + Kubernetes: The Answers
Deploy in a Kubernetes way: Helm chart / Operator
Access data without application changes: UFS plus POSIX FUSE, lazy loading from OSS
Best performance: optimized OSS SDK and short circuit
Alluxio on Kubernetes Architecture
- Master StatefulSet: Alluxio Master + Job Master pods
- Alluxio Worker DaemonSet: a Worker + Job Worker pod on each node, with tiered storage on RAM/SSD/HDD
- Alluxio FUSE DaemonSet: an alluxio-fuse pod on each node; training frameworks (Caffe, MXNet, TensorFlow) read through the FUSE mount with short circuit to the local worker
- ConfigMap: ALLUXIO_JAVA_OPTS, ALLUXIO_MASTER_JAVA_OPTS, ALLUXIO_WORKER_JAVA_OPTS
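For illustration, the ConfigMap above might look roughly like this minimal sketch; the object name and the option values are assumptions for the example, not the chart's exact output.

# Hypothetical sketch of the ConfigMap described above; name and values are illustrative.
apiVersion: v1
kind: ConfigMap
metadata:
  name: alluxio-config
data:
  # options shared by every Alluxio process
  ALLUXIO_JAVA_OPTS: "-Dalluxio.master.hostname=alluxio-master-0"
  # role-specific JVM options layered on top
  ALLUXIO_MASTER_JAVA_OPTS: "-Xmx8g"
  ALLUXIO_WORKER_JAVA_OPTS: "-Xmx4g"

The master StatefulSet and the two DaemonSets can then pick these variables up as pod environment (e.g. via envFrom).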
OSS SDK Optimization for Alluxio
[Bar chart: time to load ImageNet (143 GB), in minutes, for ossfuse, ossutil, and Alluxio]
One-click Installation with Helm
Value file of the Helm chart: an application-specific YAML file
Free to customize
Simple to deploy
Easy to share through a Helm repo
Moving to an Operator as the next step
Usage of Alluxio Helm Chart
$ cat << EOF > config.yaml
properties:
  fs.oss.accessKeyId: xxx
  fs.oss.accessKeySecret: yyy
  alluxio.master.mount.table.root.ufs: oss://imagenet-huabei5/
EOF
# One click install
$ helm install -f config.yaml alluxio-repo/alluxio --version 2.1.0-SNAPSHOT
# Preload the data
$ helm install --set dir=/images --set threads=54 alluxio-job
Why Choose Alluxio for HPC
Direct access to Alibaba OSS: poor performance, poor scalability
Explicit copying from Alibaba OSS to CPFS: good performance and good scalability, but expensive
Alluxio on top of Alibaba OSS: good performance, good scalability, lazy load, cheap!
Arena for Deep Learning Training
https://github.com/kubeflow/arena
arena CLI on top of Kubeflow and other backend CRDs
Frameworks: TensorFlow, Caffe, PyTorch, MPI, Horovod (plus Flink, Spark)
Infrastructure: Kubernetes / Docker; CPU/GPU/FPGA; Ethernet/RDMA; Hadoop/OSS/CPFS
Run Deep Learning Job with Alluxio
$ arena submit mpi \
    --name alluxio-4x8-cold \
    --gpus=8 \
    --workers=4 \
    --data-dir /alluxio-fuse/images:/data/imagenet \
    -e DATA_DIR=/data/imagenet \
    --image=registry.cn-hangzhou.aliyuncs.com/tensorflow-samples/perseus-benchmark \
    ./launch-example.sh 4 8
2019-10-24T07:51:42.021611213Z ----------------------------------------------------------------
2019-10-24T07:51:42.024245962Z 1000 images/sec: 234.2 +/- 0.7 (jitter = 8.3) 5.781
2019-10-24T07:51:42.024259919Z ----------------------------------------------------------------
2019-10-24T07:51:42.024264488Z total images/sec: 7492.44
2019-10-24T07:51:42.024267687Z ----------------------------------------------------------------
100% Faster (alluxio-fuse vs ossgw-nfs)
Training throughput of Alluxio vs OSS (ResNet50, batch size 128), in images/second:

GPUs   ossgw-nfs (ossmounter)   alluxio-fuse
1      309.79                   209.82
4      569.8                    1154.8
8      699.2                    2244.3
16     1349.87                  3868.79
32     3478.98                  7492.44
50% Faster (alluxio-fuse vs ossfs-fuse)
Training throughput of Alluxio vs OSS (ResNet50, batch size 128), in images/second:

GPUs   ossfs-fuse   alluxio-fuse
1      284.05       209.82
4      833.6        1154.8
8      1312.02      2244.3
16     2685.07      3868.79
32     5054.61      7492.44
HPC: Genomic Computing on Kubernetes
[Pipeline diagram: slide text largely not recoverable; storage is mounted into pods through a CSI PVC]
Users submit pipelines
IO Features
1. Small number of files (~100)
2. High throughput
3. Intensive requests (on the order of 10,000)
4. The same reference data (~50 GB) is read frequently across different pipelines
Read/Write-Intensive Throughput
- Leverage Alluxio to reduce read IO for the shared reference data
Best Practice – Cont.: Cache Policy Tradeoff
1. If the data size is less than the whole cache (mem + SSD), use LocalFirstAvoidEvictionPolicy to avoid swapping data between disk and memory frequently.
2. If the data size is larger than the whole cache, keep the default eviction behavior.
alluxio.user.ufs.block.read.location.policy: alluxio.client.block.policy.LocalFirstAvoidEvictionPolicy
alluxio.user.block.avoid.eviction.policy.reserved.size.bytes: 8GB
alluxio.worker.tieredstore.level0.dirs.path: /dev/shm,/var/lib/docker/alluxio-ssd
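As a sketch, the same tuning can sit in the Helm value file next to the OSS credentials, following the properties: layout of the config.yaml example earlier; the quota line is an added illustration with assumed sizes.

properties:
  alluxio.user.ufs.block.read.location.policy: alluxio.client.block.policy.LocalFirstAvoidEvictionPolicy
  alluxio.user.block.avoid.eviction.policy.reserved.size.bytes: 8GB
  # one memory directory and one SSD directory on the top tier
  alluxio.worker.tieredstore.level0.dirs.path: /dev/shm,/var/lib/docker/alluxio-ssd
  # per-directory quotas; sizes here are illustrative
  alluxio.worker.tieredstore.level0.dirs.quota: 16GB,100GB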
ResNet50: UnderFileSystemBlockReader failure
170 images/sec: 191.7 +/- 2.5 (jitter = 36.8) 7.467
170 images/sec: 191.7 +/- 2.5 (jitter = 36.8) 7.299
### Pin + no eviction, data size exceeds the pool size
450 images/sec: 91.8 +/- 3.0 (jitter = 40.1) 5.701
450 images/sec: 91.8 +/- 3.0 (jitter = 40.5) 5.487
650 images/sec: 75.5 +/- 2.9 (jitter = 35.2) 5.455
650 images/sec: 75.5 +/- 2.9 (jitter = 33.1) 5.776
ResNet50: Eviction
950 images/sec: 206.0 +/- 1.2 (jitter = 23.2) 6.197
950 images/sec: 206.0 +/- 1.2 (jitter = 23.2) 6.214
### No pin and no eviction; eviction happened
990 images/sec: 191.1 +/- 1.3 (jitter = 23.5) 6.234
1000 images/sec: 189.5 +/- 1.3 (jitter = 23.5) 6.171
Short Circuit with Local Volume
Tiered storage: capacity, medium type, and quota
hostPath or emptyDir
Different choices for short circuit (see the sketch below):
Unix socket for gRPC
Shared hostPath volume for FUSE
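A minimal sketch of what those volumes could look like in a worker or FUSE pod spec, assuming hostPath for both (emptyDir also works for the socket); names and paths are illustrative.

volumes:
  - name: alluxio-domain          # Unix domain socket for gRPC short circuit
    hostPath:
      path: /tmp/alluxio-domain
      type: DirectoryOrCreate
  - name: alluxio-fuse-mount      # shared hostPath so application pods can see the FUSE mount
    hostPath:
      path: /alluxio-fuse
      type: DirectoryOrCreate

The Unix socket serves short-circuit gRPC between a client and the worker on the same node; the shared hostPath makes the FUSE mount visible outside the alluxio-fuse pod.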
DL: Avoid Passive Cache
Training data is distributed across the Alluxio cluster; the client should not synchronize it to the local worker (passive vs. active caching).
Worker configuration to turn off passive cache (see the sketch below):
alluxio.user.file.passive.cache.enabled: false
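In the Helm value file this is one more entry under properties:, following the config.yaml example earlier (a sketch):

properties:
  alluxio.user.file.passive.cache.enabled: "false"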
25
Not So Cloud Native Yet
• Health check and availability check
• How to leverage an API to detect the health of FUSE and workers?
• Missing liveness probe and readiness probe (a hypothetical sketch follows this list)
• Observability support for Prometheus
• fs report metrics exporter
• Grafana dashboard
• Data-cache-aware scheduling
• Scheduler locality according to block host
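As a hypothetical illustration of the missing probes, a plain TCP check on the worker RPC port (29999 by default) would be the simplest starting point; this is a sketch, not something the current chart ships.

livenessProbe:
  tcpSocket:
    port: 29999        # default alluxio.worker.rpc.port
  initialDelaySeconds: 30
  periodSeconds: 10
readinessProbe:
  tcpSocket:
    port: 29999
  periodSeconds: 10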
Known Issues
1. Performance degrades by 10%-20% during data eviction
2. Append write
3. Intensive write
OOM for JVM/OS
1. Mixed node specifications (e.g., 8c16G and 8c32G nodes): distributed memory must be used effectively (FUSE memory vs. worker memory).
2. FUSE process memory consumption is high:
jvmOptions: "-XX:MaxDirectMemorySize=16g"  (bug: https://github.com/Alluxio/alluxio/issues/9525)
3. Can Alluxio's caching strategy keep the most frequently accessed data in cache?
alluxio.worker.evictor.class: alluxio.worker.block.evictor.LRUEvictor
4. Data refresh strategy:
alluxio.user.ufs.block.read.location.policy: alluxio.client.block.policy.LocalFirstAvoidEvictionPolicy
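A sketch of the two mitigations above expressed in the Helm value file; the fuse.jvmOptions key is an assumed chart layout, and the 16g cap is the value quoted above.

fuse:
  jvmOptions: "-XX:MaxDirectMemorySize=16g"
properties:
  alluxio.worker.evictor.class: alluxio.worker.block.evictor.LRUEvictor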
Takeaways
1. DL: if the sample data size is less than the whole cache (mem + SSD), avoid swapping data between disk and memory frequently.
2. DL: if the sample data size is larger than the whole cache, keep the default eviction behavior.
3. With an SSD tier, enable short circuit with a local volume.
4. HPC: object storage is accelerated for reads only at present.
5. HPC: for small worker nodes, disable passive caching.
6. HPC: always keep frequently accessed data in the memory tier.
7. K8s scheduler locality for MPI/PS jobs.
THANK YOU