Recent years have witnessed exponential growth of model scale in recommendation/ads/search -- from Google's 2016 model with 1 billion parameters to Facebook's latest model with 12 trillion parameters. A significant quality boost has come with each jump in model capacity, which makes people believe the era of 100 trillion parameters is around the corner. To prepare for this exponential growth in model size, an efficient distributed training system is urgently needed. However, training such huge models is challenging even within industrial-scale data centers. In this talk, I will introduce Persia -- an open training system developed by my team -- which resolves this challenge through careful co-design of both the optimization algorithm and the distributed system architecture. Persia admits nearly linear speedup while scaling both the number of workers and the model size. Besides the capability of training models with 100 trillion parameters, it also shows a clear efficiency advantage over other open-sourced engines.
paper link:
https://arxiv.org/pdf/2111.05897.pdf
Speaker: Ji Liu
Dr. Ji Liu received his Ph.D. in computer science from the University of Wisconsin-Madison and his bachelor's degree in automation from the University of Science and Technology of China. After graduation, he joined the University of Rochester as an assistant professor, conducting research in machine learning, optimization, and reinforcement learning. The asynchronous and decentralized algorithms he developed have been widely used in industry, e.g., at IBM and Microsoft. He left academia and joined Tencent in 2017 to explore the boundary of AI. The AI agent TStarBot his team developed was considered a milestone toward mastering the most challenging RTS game -- StarCraft II. His second stop in industry was Kwai, the second largest short-video company in China, where he founded and led multiple international teams with different functionalities: a platform team, a product team, and a research team. His team contributed to 15+% annual revenue growth in Ads. He has published 100+ papers in top-tier CS conferences and journals and received multiple best paper awards (e.g., SIGKDD 2010 and the UAI 2015 Facebook best paper award). He was an awardee of MIT TR 35 Under 35 in China and received an IBM Faculty Award in 2017. He was nominated as one of China's top 5 AI innovators under 35 in 2018.
SF Big Analytics 2022-03-15: Persia: Scaling DL Based Recommenders up to 100 Trillion Parameters
1. Persia: A Hybrid System Scaling Deep Learning-based Recommenders up to 100 Trillion Parameters
Xiangru Lian, Xuefeng Zhu, Yulong Wang, Honghuan Wu, Lei Sun, Haodong Lyu, Chengjun Liu, Xing Dong, Yiqiao
Liao, Mingnan Luo, Congfei Zhang, Jingru Xie, Haonan Li, Lei Chen, Renjie Huang, Jianying Lin, Chengchun Shu,
Xuezhong Qiu, Zhishan Liu, Dongying Kong, Lei Yuan, Hai Yu, Sen Yang, Ji Liu
Binhang Yuan, Yongjun He, Ce Zhang
github.com/PersiaML
Contacts: Ji Liu (ji.liu.uwisc@gmail.com)
Ce Zhang (ce.zhang@inf.ethz.ch)
Dr. Ji Liu
Former Head, AI platform department
Former Director, Seattle AI lab
Kwai Inc.
2. Why Is Recommendation So Important?
Revenue = user engagement ✖️ user consumption
User engagement: e.g., content recommendation, game recommendation, etc.
User consumption: e.g., ads recommendation, personalized anchor recommendation, etc.
3. Challenges to recommender infrastructure: how to accommodate the ever growing recommendation model
- Training (our focus today)
- Inference
- Serving
4. Trend: Data Increases Exponentially, So Does the Model
5. Recommendation Model Evolution – Math View
Naïve logistic regression
Complex logistic regression
DL based model
LR( Linear( Emb(U_id), Emb(I_id) ) )
LR( Linear( Emb(U_id), Emb(I_id); Emb(U_f1), …, Emb(U_fm); Emb(I_f1), …, Emb(I_fn); Emb(U_f1, I_f1), Emb(U_f2, I_f1), … ) )
LR( Nonlinear( Emb(U_id), Emb(I_id); Emb(U_f1), …, Emb(U_fm); Emb(I_f1), …, Emb(I_fn); Emb(U_f1, I_f1), Emb(U_f2, I_f1), … ) )
More ID features Cross ID features
Complicated DNN
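The evolution above can be sketched in a few lines of plain Python. Everything here is illustrative: the feature names, dimensions, and weights are made up, and the "nonlinear" model is a single ReLU hidden layer standing in for a real deep network.

```python
import math
import random

random.seed(0)
DIM = 4  # embedding dimension (tiny, for illustration only)

# Hypothetical embedding tables: one small vector per ID or cross-ID feature.
emb = {f: [random.uniform(-0.1, 0.1) for _ in range(DIM)]
       for f in ["U_id:7", "I_id:42", "U_f1:a", "I_f1:b", "U_f1xI_f1:a|b"]}

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lr_linear(features, w, b):
    """Complex logistic regression: LR(Linear(Emb(...), Emb(...), ...))."""
    x = [v for f in features for v in emb[f]]        # concatenate embeddings
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def lr_nonlinear(features, w1, b1, w2, b2):
    """DL-based model: LR(Nonlinear(...)) with one ReLU hidden layer."""
    x = [v for f in features for v in emb[f]]
    h = [max(0.0, sum(wij * xj for wij, xj in zip(row, x)) + bj)
         for row, bj in zip(w1, b1)]                 # nonlinear hidden layer
    return sigmoid(sum(wi * hi for wi, hi in zip(w2, h)) + b2)
```

The only structural difference between the second and third stages is the nonlinear hidden layer between the concatenated embeddings and the logistic output; real models replace it with a much deeper DNN.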
6. New Challenges for Training Super Large Models
- Key challenge: how to fully utilize the ever-upgrading hardware to meet the ever-growing models

Mio (SOTA; may appear under different names, e.g., XDL)
- Asynchronous update
- Homogeneous
- Bottlenecks: only good for LR; accuracy drop

Mio+
- Asynchronous update
- Homogeneous
- Bottlenecks: high cost; accuracy drop

Persia (by us)
- Hybrid update
- Heterogeneous
- Bottlenecks: ?

Aibox by Baidu
- Synchronous update
- Single big machine
- Bottlenecks: hard to scale; training efficiency
7. Preliminary for DL-Based Models: Parameters

Model parameters = Emb parameters (Emb 0, Emb 1, …, Emb K) + NN parameters
- User ID embedding: characterizes each user's preference
- Gender ID embedding: characterizes each type of gender
- Cross ID embeddings: e.g., user ID by gender ID
- NN: models the nonlinear correlation between the provided user and item, e.g., click-through rate (CTR)

                         Emb parameters   NN parameters
# parameters             10^11 ~ 10^14    10^6 ~ 10^7
# computation per sample 10^3 ~ 10^4      10^6 ~ 10^7

Emb parameters: storage heavy, computation light
NN parameters: storage light, computation heavy
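This asymmetry is easy to see with back-of-envelope arithmetic. The numbers below (vocabulary size, embedding dimension, slot count, layer widths) are hypothetical but land in the slide's orders of magnitude.

```python
# Back-of-envelope illustration of the Emb-vs-NN asymmetry; all sizes are
# hypothetical and chosen only to match the orders of magnitude above.
VOCAB = 10**10          # distinct sparse IDs across all features
DIM = 16                # embedding dimension
SLOTS = 64              # ID features looked up per sample
HIDDEN = [1024, 512, 256, 1]

emb_params = VOCAB * DIM                 # ~1.6e11 parameters: storage heavy

nn_params, prev = 0, SLOTS * DIM
for h in HIDDEN:
    nn_params += prev * h + h            # weights + biases of one dense layer
    prev = h                             # total ~2e6 parameters: storage light

# Per-sample compute: the Emb side only gathers/sums a few vectors, while
# the NN side runs full matrix multiplies for every sample.
emb_ops_per_sample = SLOTS * DIM         # ~1e3 ops
nn_ops_per_sample = 2 * nn_params        # ~4e6 ops
```

The embedding table dwarfs the NN in storage by several orders of magnitude, while the NN dominates per-sample compute, which is exactly why Persia splits them across CPUs and GPUs.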
8. Preliminary for DL-Based Models: Training Protocol

Sample: [ID1, ID2, ID3, …, IDk, label]
Per iteration: feature collection (gather from Emb 0, Emb 1, …, Emb K) → forward → backward → push back emb gradient, plus synchronization of the NN.
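The protocol can be mimicked end to end with a toy model: a dict plays the embedding parameter server, a single linear layer plays the NN, and one training step performs feature collection, forward, backward, and the gradient push-back. All names and sizes are illustrative.

```python
import math
import random

random.seed(0)
DIM = 4
ps = {}                       # toy embedding "parameter server": ID -> vector
w = [0.0] * DIM               # stand-in for the dense NN: one linear layer

def lookup(emb_id):
    """Feature collection: fetch (or lazily create) an embedding vector."""
    return ps.setdefault(emb_id, [random.uniform(-0.01, 0.01) for _ in range(DIM)])

def train_step(sample, lr=0.5):
    *ids, label = sample
    vecs = [lookup(i) for i in ids]                       # feature collection
    x = [sum(v[d] for v in vecs) for d in range(DIM)]     # sum-pool embeddings
    z = sum(wd * xd for wd, xd in zip(w, x))              # forward
    p = 1.0 / (1.0 + math.exp(-z))
    g = p - label                                         # dLoss/dz (logistic loss)
    grad_w = [g * xd for xd in x]                         # backward: NN gradient
    grad_x = [g * wd for wd in w]                         # backward: emb gradient
    for d in range(DIM):
        w[d] -= lr * grad_w[d]                            # update/synchronize NN
    for i in ids:
        for d in range(DIM):
            ps[i][d] -= lr * grad_x[d]                    # push back emb gradient
    return p
```

In a real system the lookup and the gradient push-back are remote calls to the embedding servers, which is where the synchronization cost discussed in the following slides comes from.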
9. Persia's Parameter Placement

- Emb-Cpu-PS: Emb parameters stored across multiple CPUs (CPU 0, CPU 1, …, CPU N-1), like a parameter server
- NN-Gpu-DP: NN parameters replicated across multiple GPUs (GPU 0, …, GPU M-1), like data parallelism
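A minimal sketch of the Emb side of this placement, assuming simple hash partitioning of IDs across CPU shards (the shard count, dimension, and helper names are illustrative, not Persia's actual API):

```python
N_CPU = 8        # number of Emb-Cpu-PS shards (illustrative)
EMB_DIM = 16

# Each CPU node holds one shard of the huge embedding table.
shards = [dict() for _ in range(N_CPU)]

def shard_of(emb_id):
    """IDs are hash-partitioned across the CPU parameter-server shards."""
    return hash(emb_id) % N_CPU

def ps_lookup(emb_id):
    """Fetch (or lazily initialize) an embedding from its owning shard."""
    shard = shards[shard_of(emb_id)]
    return shard.setdefault(emb_id, [0.0] * EMB_DIM)

# The NN parameters, by contrast, are small enough to be fully replicated
# on every GPU and trained data-parallel (allreduce on gradients).
```

Sharding keeps each CPU's memory footprint bounded as the table grows, while replicating the small NN on GPUs lets the dense part use standard data-parallel allreduce.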
10. Why "Emb-Cpu-PS" and "NN-Gpu-DP"?

Alternative placements fall short:
- Emb-Cpu-PS + NN-Gpu-PS: cannot benefit from mature communication toolkits, e.g., DDP, Horovod, Bagua, etc.
- Emb-Cpu-PS + NN-Cpu-DP: cannot benefit from efficient GPU-GPU communication, such as NVLink and GDR
11. Naïve Implementation of Training a Recommendation Model

Embedding server: CPU 0, CPU 1, …, CPU N-1; workers: GPU 0, …, GPU M-1

Stages per iteration:
- EO: feature collection
- FC: forward computation
- BC: backward computation
- PEG: push embedding gradient
- W-E: wait for embedding update done
- SNN: synchronize NN

A naive implementation would not be efficient.
12. Synchronous Updating vs. Asynchronous Updating

Sync (naive): EO → FC → BC → SNN → PEG → W-E, fully serialized each iteration. Ensures model accuracy but with ``low efficiency'' due to synchronization, like XDL-sync and PaddlePaddle.

Sync (opt): overlaps SNN with the other stages but keeps the synchronization barrier (EO → FC → BC → PEG → W-E, with SNN overlapped).

Async (ideal in runtime): EO, FC/BC, and PEG fully overlapped across iterations, with SNN in the background. Uses out-of-date information for computation to get high efficiency, but often causes ``low accuracy'', like XDL-async and Mio (recall that a 0.1% accuracy drop may cause a significant loss in recommendation).

Question: can we design an algorithm/system to make a better tradeoff between efficiency and accuracy?
13. Persia: A Sync-Async-Hybrid Algorithm/System

Timelines (from slowest to fastest):
- Sync: EO → FC → BC → SNN → PEG → W-E, fully serialized.
- Sync-Async-Hybrid: EO and PEG (the Emb side) run asynchronously, overlapping with neighboring iterations; FC → BC → SNN stays on the critical path.
- Persia (Bagua + Sync-Async-Hybrid): additionally overlaps SNN with BC.
- Async (ideal): everything overlapped.

Key idea
o Synchronous update for NN
o Asynchronous update for Emb

Key idea (Bagua)
o Overlap backward computation and synchronization of NN

Only a minor loss in efficiency compared to the ideal (Async) implementation.
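The effect of the hybrid schedule on iteration time can be mimicked with a toy cost model. The stage costs below are arbitrary time units, made up purely for illustration; only the overlap structure reflects the schedules above.

```python
# Toy cost model of one training iteration (arbitrary time units; the stage
# costs are hypothetical and chosen only for illustration).
EO, FC, BC, PEG, WE, SNN = 2.0, 1.0, 1.5, 1.0, 0.5, 1.5

# Fully synchronous: every stage sits on the critical path.
sync_iter = EO + FC + BC + SNN + PEG + WE

# Sync-Async-Hybrid with Bagua: EO/PEG/W-E (the Emb side) run asynchronously
# and overlap with neighboring iterations; SNN stays synchronous but overlaps
# with BC. Critical path: FC plus the longer of BC and SNN.
hybrid_iter = FC + max(BC, SNN)

# Idealized fully asynchronous pipeline: bounded by the slowest overlapped lane.
async_iter = max(EO, FC + BC, PEG)
```

With these numbers the hybrid schedule already matches the idealized async pipeline while keeping the NN update synchronous, which is the "minor loss in efficiency, no loss in accuracy" tradeoff the slide describes.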
14. Convergence of Persia

Theorem (updating of the Sync-Async-Hybrid algorithm): under the standard assumptions, setting the learning rate appropriately, Persia admits a convergence rate equal to that of the purely synchronous algorithm plus a minor term on the order of ϕτ, where
- ϕ ≪ 1: an ID's maximal frequency (within all samples)
- τ: maximal staleness, roughly the number of GPUs in Persia
Since usually ϕτ ≪ 1, the minor term is negligible.

OBS: Persia admits almost the same convergence rate as the purely synchronous algorithm!
15. Other System Optimization

Optimizing communication efficiency:
o Persia-RPC (10+ times faster than gRPC)
o Middleware (saves bandwidth for workers)
o Bagua
o Compact sample batch representation (CSB)

             (de)serialization          compression          thread pool
gRPC         protobuf (high overhead)   gzip/deflate (slow)  no affinity
persia-RPC   zero-copy (no overhead)    lz4/zstd (fast)      NUMA aware, thread affinity
Original sample batch:
[0000000000000001, 0001000000000001, label_1]
[0011000100010001, 0001000000000001, label_2]
[0000000000000001, 0000000010000001, label_3]
…

Compact sample batch:
Batch vocabulary = [0000000000000001: 1, 3; 0001000000000001: 1, 2; 0011000100010001: 2; 0000000010000001: 3]
Labels: [label_1], [label_2], [label_3], …
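The CSB transformation is a straightforward inverted index over the batch. A minimal sketch (the function name and data layout are illustrative, not Persia's actual format):

```python
def compact(batch):
    """Deduplicate IDs in a sample batch into a batch vocabulary mapping
    each distinct ID to the (1-based) sample indices that reference it."""
    vocab, labels = {}, []
    for idx, (*ids, label) in enumerate(batch, start=1):
        labels.append(label)
        for emb_id in ids:
            vocab.setdefault(emb_id, []).append(idx)
    return vocab, labels

# The example batch from the slide above.
batch = [
    ("0000000000000001", "0001000000000001", "label_1"),
    ("0011000100010001", "0001000000000001", "label_2"),
    ("0000000000000001", "0000000010000001", "label_3"),
]
vocab, labels = compact(batch)
```

Each distinct ID is transmitted and looked up once per batch instead of once per sample, which saves bandwidth exactly when hot IDs repeat across samples.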
Middleware data flow: the data loader reads raw data from HDFS/Kafka and sends compact sample batches (CSB) to the middleware; the middleware exchanges IDs and Emb gradients with the embedding CPUs (Cpu_0 … Cpu_3) and receives Emb sums; the GPU workers (Gpu_0, Gpu_1) receive the embeddings from the middleware and push Emb gradients back.
16. Other System Optimization
Optimizing computational efficiency
o PyTorch mixed-precision
o CPU SIMD instruction
o Embedding and optimizer parameter co-location
o Etc.
17. Empirical Study: Sync vs. Async vs. Hybrid
OBS: The efficiency of Persia is comparable to Async, and the accuracy is consistent with Sync.
Setup: 64 NVIDIA V100 GPUs;
100 CPU machines (52 cores & 480 GB RAM each);
GPU machine bandwidth: 100 Gbps;
CPU machine bandwidth: 10 Gbps
19. Scalability (“#workers” and “#parameters”)
OBS: Nearly linear speedup w.r.t. #GPUs; linear scale-up w.r.t. #parameters!
Setup: Google cloud
o 8 a2-highgpu-8g instances (each with 8 Nvidia A100 GPUs)
o 100 c2-standard-30 instances (each with 30 vCPUs, 120 GB RAM)
o 30 m2-ultramem-416 instances (each with 416 vCPUs, 12 TB RAM)
20. Summary for Persia
- Heterogeneous architecture
- Hybrid update (Sync + Async): fast and accurate
- Resource efficient
- Flexible (all components can be integrated into one machine)
- Push the model size to a new magnitude (100 trillion parameters)
- Invited to be integrated into PyTorch Lightning