Recent years have witnessed exponential growth of model scale in recommendation/ads/search -- from Google's 2016 model with 1 billion parameters to Facebook's latest model with 12 trillion parameters. A significant quality boost has come with each jump in model capacity, which makes people believe the era of 100 trillion parameters is around the corner. To prepare for this exponential growth in model size, an efficient distributed training system is urgently needed. However, training such huge models is challenging even within industrial-scale data centers. In this talk, I will introduce Persia -- an open training system developed by my team -- which resolves this challenge through careful co-design of both the optimization algorithm and the distributed system architecture. Persia admits nearly linear speedup while scaling both the number of workers and the model size. Besides the capability of training models with 100 trillion parameters, it also shows a clear efficiency advantage over other open-sourced engines.
paper link:
https://arxiv.org/pdf/2111.05897.pdf
Speaker: Ji Liu
Dr. Ji Liu received his Ph.D. in computer science from the University of Wisconsin-Madison and his bachelor's degree in automation from the University of Science and Technology of China. After graduation, he joined the University of Rochester as an assistant professor, conducting research in machine learning, optimization, and reinforcement learning. The asynchronous and decentralized algorithms he developed have been widely used in industry, e.g., at IBM and Microsoft. He left academia and joined Tencent in 2017 to explore the boundary of AI. The AI agent TStarBot his team developed was considered a milestone toward mastering the most challenging RTS game -- StarCraft II. His second stop in industry was Kwai, the second largest short-video company in China, where he founded and led multiple international teams with different functionalities: a platform team, a product team, and a research team. His team contributed to 15+% annual revenue growth in Ads. He has published 100+ papers in top-tier CS conferences and journals and received multiple best paper awards (e.g., SIGKDD 2010 and the UAI 2015 Facebook best paper award). He was an awardee of MIT TR 35 Under 35 in China and received an IBM Faculty Award in 2017. He was nominated as one of China's top 5 AI innovators under 35 in 2018.
SF Big Analytics 2022-03-15: Persia: Scaling DL Based Recommenders up to 100 Trillion Parameters
1. Persia: A Hybrid System Scaling Deep Learning-based Recommenders up to 100 Trillion Parameters
Xiangru Lian, Xuefeng Zhu, Yulong Wang, Honghuan Wu, Lei Sun, Haodong Lyu, Chengjun Liu, Xing Dong, Yiqiao
Liao, Mingnan Luo, Congfei Zhang, Jingru Xie, Haonan Li, Lei Chen, Renjie Huang, Jianying Lin, Chengchun Shu,
Xuezhong Qiu, Zhishan Liu, Dongying Kong, Lei Yuan, Hai Yu, Sen Yang, Ji Liu
Binhang Yuan, Yongjun He, Ce Zhang
github.com/PersiaML
Contacts: Ji Liu (ji.liu.uwisc@gmail.com)
Ce Zhang (ce.zhang@inf.ethz.ch)
Dr. Ji Liu
Former Head, AI platform department
Former Director, Seattle AI lab
Kwai Inc.
2. Why Is Recommendation So Important?
Revenue = user engagement ✖️ user consumption
User engagement: e.g., content recommendation, game recommendation, etc.
User consumption: e.g., ads recommendation, personalized anchor recommendation, etc.
3. Challenges to recommender infrastructure: how to accommodate the ever growing recommendation model
- Training (our focus today)
- Inference
- Serving
4. Trend: Data Increases Exponentially, So Does the Model
5. Recommendation Model Evolution – Math View
Naïve logistic regression
Complex logistic regression
DL based model
LR( Linear( Emb(U_id), Emb(I_id) ) )
LR( Linear( Emb(U_id), Emb(I_id); Emb(U_f1), …, Emb(U_fm); Emb(I_f1), …, Emb(I_fn); Emb(U_f1, I_f1), Emb(U_f2, I_f1), … ) )
LR( Nonlinear( Emb(U_id), Emb(I_id); Emb(U_f1), …, Emb(U_fm); Emb(I_f1), …, Emb(I_fn); Emb(U_f1, I_f1), Emb(U_f2, I_f1), … ) )
More ID features Cross ID features
Complicated DNN
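The evolution above can be sketched in a few lines of plain Python. Everything here is illustrative: the feature names, dimensions, and weights are made up, and the "nonlinear" model is a single ReLU hidden layer standing in for a real deep network.

```python
import math
import random

random.seed(0)
DIM = 4  # embedding dimension (tiny, for illustration only)

# Hypothetical embedding tables: one small vector per ID or cross-ID feature.
emb = {f: [random.uniform(-0.1, 0.1) for _ in range(DIM)]
       for f in ["U_id:7", "I_id:42", "U_f1:a", "I_f1:b", "U_f1xI_f1:a|b"]}

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lr_linear(features, w, b):
    """Complex logistic regression: LR(Linear(Emb(...), Emb(...), ...))."""
    x = [v for f in features for v in emb[f]]        # concatenate embeddings
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def lr_nonlinear(features, w1, b1, w2, b2):
    """DL-based model: LR(Nonlinear(...)) with one ReLU hidden layer."""
    x = [v for f in features for v in emb[f]]
    h = [max(0.0, sum(wij * xj for wij, xj in zip(row, x)) + bj)
         for row, bj in zip(w1, b1)]                 # nonlinear hidden layer
    return sigmoid(sum(wi * hi for wi, hi in zip(w2, h)) + b2)
```

The only structural difference between the second and third stages is the nonlinear hidden layer between the concatenated embeddings and the logistic output; real models replace it with a much deeper DNN.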
6. New Challenges for Training Super Large Models
- Key challenge: how to fully utilize the ever-upgrading hardware to meet the ever-growing models

Mio (SOTA; may appear under different names, e.g., XDL)
- Asynchronous update
- Homogeneous
- Bottlenecks: only good for LR; accuracy drop

Mio+
- Asynchronous update
- Homogeneous
- Bottlenecks: high cost; accuracy drop

Persia (by us)
- Hybrid update
- Heterogeneous
- Bottlenecks: ?

Aibox by Baidu
- Synchronous update
- Single big machine
- Bottlenecks: hard to scale; training efficiency
7. Preliminary for DL-Based Models: Parameters

Model parameters = Emb parameters (Emb 0, Emb 1, …, Emb K) + NN parameters
- User ID embedding: characterizes each user's preference
- Gender ID embedding: characterizes each type of gender
- Cross ID embeddings: e.g., user ID by gender ID
- NN: models the nonlinear correlation between the provided user and item, e.g., click-through rate (CTR)

                         Emb parameters   NN parameters
# parameters             10^11 ~ 10^14    10^6 ~ 10^7
# computation per sample 10^3 ~ 10^4      10^6 ~ 10^7

Emb parameters: storage heavy, computation light
NN parameters: storage light, computation heavy
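This asymmetry is easy to see with back-of-envelope arithmetic. The numbers below (vocabulary size, embedding dimension, slot count, layer widths) are hypothetical but land in the slide's orders of magnitude.

```python
# Back-of-envelope illustration of the Emb-vs-NN asymmetry; all sizes are
# hypothetical and chosen only to match the orders of magnitude above.
VOCAB = 10**10          # distinct sparse IDs across all features
DIM = 16                # embedding dimension
SLOTS = 64              # ID features looked up per sample
HIDDEN = [1024, 512, 256, 1]

emb_params = VOCAB * DIM                 # ~1.6e11 parameters: storage heavy

nn_params, prev = 0, SLOTS * DIM
for h in HIDDEN:
    nn_params += prev * h + h            # weights + biases of one dense layer
    prev = h                             # total ~2e6 parameters: storage light

# Per-sample compute: the Emb side only gathers/sums a few vectors, while
# the NN side runs full matrix multiplies for every sample.
emb_ops_per_sample = SLOTS * DIM         # ~1e3 ops
nn_ops_per_sample = 2 * nn_params        # ~4e6 ops
```

The embedding table dwarfs the NN in storage by several orders of magnitude, while the NN dominates per-sample compute, which is exactly why Persia splits them across CPUs and GPUs.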
8. Preliminary for DL-Based Models: Training Protocol

Sample: [ID1, ID2, ID3, …, IDk, label]
Per iteration: feature collection (gather from Emb 0, Emb 1, …, Emb K) → forward → backward → push back emb gradient, plus synchronization of the NN.
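The protocol can be mimicked end to end with a toy model: a dict plays the embedding parameter server, a single linear layer plays the NN, and one training step performs feature collection, forward, backward, and the gradient push-back. All names and sizes are illustrative.

```python
import math
import random

random.seed(0)
DIM = 4
ps = {}                       # toy embedding "parameter server": ID -> vector
w = [0.0] * DIM               # stand-in for the dense NN: one linear layer

def lookup(emb_id):
    """Feature collection: fetch (or lazily create) an embedding vector."""
    return ps.setdefault(emb_id, [random.uniform(-0.01, 0.01) for _ in range(DIM)])

def train_step(sample, lr=0.5):
    *ids, label = sample
    vecs = [lookup(i) for i in ids]                       # feature collection
    x = [sum(v[d] for v in vecs) for d in range(DIM)]     # sum-pool embeddings
    z = sum(wd * xd for wd, xd in zip(w, x))              # forward
    p = 1.0 / (1.0 + math.exp(-z))
    g = p - label                                         # dLoss/dz (logistic loss)
    grad_w = [g * xd for xd in x]                         # backward: NN gradient
    grad_x = [g * wd for wd in w]                         # backward: emb gradient
    for d in range(DIM):
        w[d] -= lr * grad_w[d]                            # update/synchronize NN
    for i in ids:
        for d in range(DIM):
            ps[i][d] -= lr * grad_x[d]                    # push back emb gradient
    return p
```

In a real system the lookup and the gradient push-back are remote calls to the embedding servers, which is where the synchronization cost discussed in the following slides comes from.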
9. Persia's Parameter Placement

- Emb-Cpu-PS: Emb parameters stored across multiple CPUs (CPU 0, CPU 1, …, CPU N-1), like a parameter server
- NN-Gpu-DP: NN parameters replicated across multiple GPUs (GPU 0, …, GPU M-1), like data parallelism
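A minimal sketch of the Emb side of this placement, assuming simple hash partitioning of IDs across CPU shards (the shard count, dimension, and helper names are illustrative, not Persia's actual API):

```python
N_CPU = 8        # number of Emb-Cpu-PS shards (illustrative)
EMB_DIM = 16

# Each CPU node holds one shard of the huge embedding table.
shards = [dict() for _ in range(N_CPU)]

def shard_of(emb_id):
    """IDs are hash-partitioned across the CPU parameter-server shards."""
    return hash(emb_id) % N_CPU

def ps_lookup(emb_id):
    """Fetch (or lazily initialize) an embedding from its owning shard."""
    shard = shards[shard_of(emb_id)]
    return shard.setdefault(emb_id, [0.0] * EMB_DIM)

# The NN parameters, by contrast, are small enough to be fully replicated
# on every GPU and trained data-parallel (allreduce on gradients).
```

Sharding keeps each CPU's memory footprint bounded as the table grows, while replicating the small NN on GPUs lets the dense part use standard data-parallel allreduce.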
10. Why "Emb-Cpu-PS" and "NN-Gpu-DP"?

Alternative placements fall short:
- Emb-Cpu-PS + NN-Gpu-PS: cannot benefit from mature communication toolkits, e.g., DDP, Horovod, Bagua, etc.
- Emb-Cpu-PS + NN-Cpu-DP: cannot benefit from efficient GPU-GPU communication, such as NVLink and GDR
11. Naïve Implementation of Training a Recommendation Model

Embedding server: CPU 0, CPU 1, …, CPU N-1; workers: GPU 0, …, GPU M-1

Stages per iteration:
- EO: feature collection
- FC: forward computation
- BC: backward computation
- PEG: push embedding gradient
- W-E: wait for embedding update done
- SNN: synchronize NN

A naive implementation would not be efficient.
12. Synchronous Updating vs. Asynchronous Updating

Sync (naive): EO → FC → BC → SNN → PEG → W-E, fully serialized each iteration. Ensures model accuracy but with ``low efficiency'' due to synchronization, like XDL-sync and PaddlePaddle.

Sync (opt): overlaps SNN with the other stages but keeps the synchronization barrier (EO → FC → BC → PEG → W-E, with SNN overlapped).

Async (ideal in runtime): EO, FC/BC, and PEG fully overlapped across iterations, with SNN in the background. Uses out-of-date information for computation to get high efficiency, but often causes ``low accuracy'', like XDL-async and Mio (recall that a 0.1% accuracy drop may cause a significant loss in recommendation).

Question: can we design an algorithm/system to make a better tradeoff between efficiency and accuracy?
13. Persia: A Sync-Async-Hybrid Algorithm/System

Timelines (from slowest to fastest):
- Sync: EO → FC → BC → SNN → PEG → W-E, fully serialized.
- Sync-Async-Hybrid: EO and PEG (the Emb side) run asynchronously, overlapping with neighboring iterations; FC → BC → SNN stays on the critical path.
- Persia (Bagua + Sync-Async-Hybrid): additionally overlaps SNN with BC.
- Async (ideal): everything overlapped.

Key idea
o Synchronous update for NN
o Asynchronous update for Emb

Key idea (Bagua)
o Overlap backward computation and synchronization of NN

Only a minor loss in efficiency compared to the ideal (Async) implementation.
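The effect of the hybrid schedule on iteration time can be mimicked with a toy cost model. The stage costs below are arbitrary time units, made up purely for illustration; only the overlap structure reflects the schedules above.

```python
# Toy cost model of one training iteration (arbitrary time units; the stage
# costs are hypothetical and chosen only for illustration).
EO, FC, BC, PEG, WE, SNN = 2.0, 1.0, 1.5, 1.0, 0.5, 1.5

# Fully synchronous: every stage sits on the critical path.
sync_iter = EO + FC + BC + SNN + PEG + WE

# Sync-Async-Hybrid with Bagua: EO/PEG/W-E (the Emb side) run asynchronously
# and overlap with neighboring iterations; SNN stays synchronous but overlaps
# with BC. Critical path: FC plus the longer of BC and SNN.
hybrid_iter = FC + max(BC, SNN)

# Idealized fully asynchronous pipeline: bounded by the slowest overlapped lane.
async_iter = max(EO, FC + BC, PEG)
```

With these numbers the hybrid schedule already matches the idealized async pipeline while keeping the NN update synchronous, which is the "minor loss in efficiency, no loss in accuracy" tradeoff the slide describes.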
14. Convergence of Persia

Theorem (updating of the Sync-Async-Hybrid algorithm): under the standard assumptions, setting the learning rate appropriately, Persia admits a convergence rate equal to that of the purely synchronous algorithm plus a minor term on the order of ϕτ, where
- ϕ ≪ 1: an ID's maximal frequency (within all samples)
- τ: maximal staleness, roughly the number of GPUs in Persia
Since usually ϕτ ≪ 1, the minor term is negligible.

OBS: Persia admits almost the same convergence rate as the purely synchronous algorithm!
15. Other System Optimization

Optimizing communication efficiency:
o Persia-RPC (10+ times faster than gRPC)
o Middleware (saves bandwidth for workers)
o Bagua
o Compact sample batch representation (CSB)

             (de)serialization          compression          thread pool
gRPC         protobuf (high overhead)   gzip/deflate (slow)  no affinity
persia-RPC   zero-copy (no overhead)    lz4/zstd (fast)      NUMA aware, thread affinity
Original sample batch:
[0000000000000001, 0001000000000001, label_1]
[0011000100010001, 0001000000000001, label_2]
[0000000000000001, 0000000010000001, label_3]
…

Compact sample batch:
Batch vocabulary = [0000000000000001: 1, 3; 0001000000000001: 1, 2; 0011000100010001: 2; 0000000010000001: 3]
Labels: [label_1], [label_2], [label_3], …
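The CSB transformation is a straightforward inverted index over the batch. A minimal sketch (the function name and data layout are illustrative, not Persia's actual format):

```python
def compact(batch):
    """Deduplicate IDs in a sample batch into a batch vocabulary mapping
    each distinct ID to the (1-based) sample indices that reference it."""
    vocab, labels = {}, []
    for idx, (*ids, label) in enumerate(batch, start=1):
        labels.append(label)
        for emb_id in ids:
            vocab.setdefault(emb_id, []).append(idx)
    return vocab, labels

# The example batch from the slide above.
batch = [
    ("0000000000000001", "0001000000000001", "label_1"),
    ("0011000100010001", "0001000000000001", "label_2"),
    ("0000000000000001", "0000000010000001", "label_3"),
]
vocab, labels = compact(batch)
```

Each distinct ID is transmitted and looked up once per batch instead of once per sample, which saves bandwidth exactly when hot IDs repeat across samples.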
Middleware data flow: the data loader reads raw data from HDFS/Kafka and sends compact sample batches (CSB) to the middleware; the middleware exchanges IDs and Emb gradients with the embedding CPUs (Cpu_0 … Cpu_3) and receives Emb sums; the GPU workers (Gpu_0, Gpu_1) receive the embeddings from the middleware and push Emb gradients back.
16. Other System Optimization
Optimizing computational efficiency
o PyTorch mixed-precision
o CPU SIMD instruction
o Embedding and optimizer parameter co-location
o Etc.
17. Empirical Study: Sync vs. Async vs. Hybrid
OBS: The efficiency of Persia is comparable to Async, and the accuracy is consistent with Sync.
Setup: 64 NVIDIA V100 GPUs;
100 CPU machines (52 cores & 480 GB RAM each);
GPU machine bandwidth: 100 Gbps;
CPU machine bandwidth: 10 Gbps
19. Scalability (“#workers” and “#parameters”)
OBS: Nearly linear speedup w.r.t. #GPUs; linear scale-up w.r.t. #parameters!
Setup: Google cloud
o 8 a2-highgpu-8g instances (each with 8 Nvidia A100 GPUs)
o 100 c2-standard-30 instances (each with 30 vCPUs, 120 GB RAM)
o 30 m2-ultramem-416 instances (each with 416 vCPUs, 12 TB RAM)
20. Summary for Persia
- Heterogeneous architecture
- Hybrid update (Sync + Async): fast and accurate
- Resource efficient
- Flexible (all components can be integrated into one machine)
- Push the model size to a new magnitude (100 trillion parameters)
- Invited to be integrated into PyTorch Lightning