Persia: A Hybrid System Scaling Deep Learning-based Recommenders up to 100 Trillion Parameters
Xiangru Lian, Xuefeng Zhu, Yulong Wang, Honghuan Wu, Lei Sun, Haodong Lyu, Chengjun Liu, Xing Dong, Yiqiao
Liao, Mingnan Luo, Congfei Zhang, Jingru Xie, Haonan Li, Lei Chen, Renjie Huang, Jianying Lin, Chengchun Shu,
Xuezhong Qiu, Zhishan Liu, Dongying Kong, Lei Yuan, Hai Yu, Sen Yang, Ji Liu
Binhang Yuan, Yongjun He, Ce Zhang
github.com/PersiaML
Contacts: Ji Liu (ji.liu.uwisc@gmail.com)
Ce Zhang (ce.zhang@inf.ethz.ch)
Dr. Ji Liu
Former Head, AI platform department
Former Director, Seattle AI lab
Kwai Inc.
Why Is Recommendation So Important?
Revenue = user engagement × user consumption
- User engagement: e.g., content recommendation, game recommendation, etc.
- User consumption: e.g., ads recommendation, personalized anchor recommendation, etc.
Challenges to recommender infrastructure: how to accommodate the ever-growing recommendation model
- Training (our focus today)
- Inference
- Serving
Trend: Data Grows Exponentially, and So Does the Model
Recommendation Model Evolution
Decision tree based models (not popular anymore) → Naïve logistic regression → Complex logistic regression → DL based model
Recommendation Model Evolution – Math View
Naïve logistic regression:
LR( Linear( Emb(U_id), Emb(I_id) ) )
Complex logistic regression (more ID features, cross ID features):
LR( Linear( Emb(U_id), Emb(I_id); Emb(U_f1), …, Emb(U_fm); Emb(I_f1), …, Emb(I_fn); Emb(U_f1, I_f1), Emb(U_f2, I_f1), … ) )
DL based model (complicated DNN):
LR( Nonlinear( Emb(U_id), Emb(I_id); Emb(U_f1), …, Emb(U_fm); Emb(I_f1), …, Emb(I_fn); Emb(U_f1, I_f1), Emb(U_f2, I_f1), … ) )
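As a rough illustration of the three formulas above, the sketch below scores one (user, item) pair with each model form. Everything concrete in it is an assumption made for the example and not Persia code: the embedding dimension, the lazily created random embedding table, the feature names, and the tiny one-layer ReLU network standing in for "Nonlinear".

```python
import math
import random

random.seed(0)
DIM = 4          # hypothetical embedding dimension
_table = {}      # one shared table keyed by (feature kind, value...)

def emb(*key):
    """Return the embedding vector for a key, creating it on first use."""
    if key not in _table:
        _table[key] = [random.uniform(-0.1, 0.1) for _ in range(DIM)]
    return _table[key]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dot(x, w):
    return sum(xi * wi for xi, wi in zip(x, w))

def features(u_id, i_id, u_feats=(), i_feats=(), cross=False):
    """Concatenate ID, feature, and (optionally) cross-feature embeddings."""
    parts = [emb("U", u_id), emb("I", i_id)]
    parts += [emb("Uf", f) for f in u_feats]
    parts += [emb("If", f) for f in i_feats]
    if cross:
        parts += [emb("X", uf, itf) for uf in u_feats for itf in i_feats]
    return [v for p in parts for v in p]

def lr_linear(x, w):
    # LR( Linear( ... ) ): naive LR when x holds only the two ID embeddings,
    # complex LR when x also holds feature and cross-feature embeddings.
    return sigmoid(dot(x, w))

def lr_nonlinear(x, w_hidden, w_out):
    # LR( Nonlinear( ... ) ): same inputs, passed through a ReLU layer first.
    hidden = [max(0.0, dot(x, row)) for row in w_hidden]
    return sigmoid(dot(hidden, w_out))

x_naive = features(7, 42)
x_rich = features(7, 42, ["gender=F", "age=20s"], ["category=game"], cross=True)

p_naive = lr_linear(x_naive, [0.5] * len(x_naive))
p_complex = lr_linear(x_rich, [0.5] * len(x_rich))
w_hidden = [[random.uniform(-0.5, 0.5) for _ in x_rich] for _ in range(3)]
w_out = [random.uniform(-0.5, 0.5) for _ in range(3)]
p_dl = lr_nonlinear(x_rich, w_hidden, w_out)
```

The only structural difference between the complex LR and the DL based model here is the nonlinear layer; the embedding inputs are identical.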
New Challenges for Training Super Large Models
- Key challenge: how to fully utilize ever-upgrading hardware to meet ever-growing models
Mio (SOTA; may go by different names, e.g., XDL)
- Asynchronous update
- Homogeneous
- Bottlenecks: only good for LR, accuracy drop
Mio+
- Asynchronous update
- Homogeneous
- Bottlenecks: high cost, accuracy drop
Persia by us
- Hybrid update
- Heterogeneous
- Bottlenecks: ?
AIBox by Baidu
- Synchronous update
- Single big machine
- Bottlenecks: hard to scale, training efficiency
Preliminary for DL Based Models: Parameters
Model parameters: Emb 0, Emb 1, …, Emb K and NN
- User ID embedding: characterizes each user's preference
- Gender ID embedding: characterizes each type of gender
- Some cross ID embedding, e.g., user ID by gender ID
- NN: models the nonlinear correlation between the provided user and item, e.g., click-through rate (CTR)

                            Emb parameters      NN parameters
# parameters                10^11 to 10^14      10^6 to 10^7
# computation per sample    10^3 to 10^4        10^6 to 10^7

Emb parameters: storage heavy, computation light
NN parameters: storage light, computation heavy
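To make "storage heavy" versus "storage light" concrete, the short calculation below converts the parameter counts on this slide into raw storage. It assumes 4 bytes per parameter (32-bit floats) and decimal units; the slides do not state the precision, so treat the byte counts as an order-of-magnitude sketch only.

```python
BYTES_PER_PARAM = 4  # assumption: 32-bit floats; not stated in the slides

def storage_bytes(num_params, bytes_per_param=BYTES_PER_PARAM):
    """Raw storage for the parameters alone (no optimizer state)."""
    return num_params * bytes_per_param

def human(n_bytes):
    """Format a byte count with decimal (power-of-1000) units."""
    units = ["B", "KB", "MB", "GB", "TB", "PB"]
    value = float(n_bytes)
    for unit in units:
        if value < 1000 or unit == units[-1]:
            return f"{value:.0f} {unit}"
        value /= 1000

emb_range = (human(storage_bytes(10**11)), human(storage_bytes(10**14)))
nn_range = (human(storage_bytes(10**6)), human(storage_bytes(10**7)))

print("Emb parameters:", emb_range)  # ('400 GB', '400 TB')
print("NN parameters: ", nn_range)   # ('4 MB', '40 MB')
```

Under these assumptions the embedding tables outgrow any single machine's memory, while the NN part fits comfortably on one GPU.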
Preliminary for DL Based Models: Training Protocol
Sample: [ID1, ID2, ID3, …, IDk, label]
Steps shown in the diagram (Emb 0, Emb 1, …, Emb K feeding the NN):
- feature collection
- forward
- backward
- push back emb gradient
- Synchronization
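The protocol steps named on this slide can be walked through on a single worker with a toy model. This is a minimal sketch, assuming a one-layer logistic model in place of the NN, plain SGD, and samples with exactly three IDs; `emb_table`, `nn_w`, and `train_step` are hypothetical names, not Persia APIs.

```python
import math
import random

random.seed(0)
DIM = 2                       # hypothetical embedding dimension
emb_table = {}                # stands in for the embedding storage
nn_w = [0.1] * (3 * DIM)      # "NN" reduced to one linear layer over 3 IDs

def lookup(emb_id):
    if emb_id not in emb_table:
        emb_table[emb_id] = [random.uniform(-0.1, 0.1) for _ in range(DIM)]
    return emb_table[emb_id]

def train_step(ids, label, lr=0.5):
    # 1. feature collection: fetch the embedding of every ID in the sample
    x = [v for i in ids for v in lookup(i)]
    # 2. forward: logistic prediction and log loss
    z = sum(xi * wi for xi, wi in zip(x, nn_w))
    p = 1.0 / (1.0 + math.exp(-z))
    loss = -(label * math.log(p) + (1 - label) * math.log(1 - p))
    # 3. backward: gradients w.r.t. the NN weights and the embedding inputs
    dz = p - label
    grad_w = [dz * xi for xi in x]
    grad_x = [dz * wi for wi in nn_w]
    # 4. push back emb gradient: update each embedding that was looked up
    for k, i in enumerate(ids):
        g = grad_x[k * DIM:(k + 1) * DIM]
        emb_table[i] = [e - lr * gi for e, gi in zip(emb_table[i], g)]
    # 5. synchronization: with one worker this is just the local NN update
    for j in range(len(nn_w)):
        nn_w[j] -= lr * grad_w[j]
    return loss

losses = [train_step([1, 2, 3], 1) for _ in range(20)]
```

Repeating the step on the same sample should drive the loss down, which is a quick sanity check that the embedding and NN gradients are wired consistently.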
Persia's Parameter Placement
- Emb-Cpu-PS: Emb parameters stored in multiple CPUs like a PS (parameter server)
- NN-Gpu-DP: NN parameters stored in multiple GPUs like data parallel
[Figure: Emb 0, Emb 1, … (entries 0, 1, …, N, …, 2N, …) placed on CPU 0, CPU 1, …, CPU N-1; NN placed on GPU 0, …, GPU M-1]
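The placement described here can be sketched in a few lines. The modulo sharding rule, the shard and replica counts, and all names below are assumptions for illustration; the slide only shows that embedding entries are spread over N CPUs and that the NN lives on M GPUs.

```python
N_CPU = 4   # hypothetical number of embedding (CPU) servers
M_GPU = 2   # hypothetical number of NN (GPU) workers
DIM = 2     # hypothetical embedding dimension

def owner_cpu(emb_id, n_cpu=N_CPU):
    # Assumption: IDs are assigned to CPU shards by modulo.
    return emb_id % n_cpu

# Emb-Cpu-PS: each CPU holds a disjoint shard of the embedding table.
cpu_shards = [dict() for _ in range(N_CPU)]

def put_embedding(emb_id, vector):
    cpu_shards[owner_cpu(emb_id)][emb_id] = vector

def get_embedding(emb_id):
    return cpu_shards[owner_cpu(emb_id)][emb_id]

for emb_id in range(10):
    put_embedding(emb_id, [float(emb_id)] * DIM)

# NN-Gpu-DP: every GPU keeps a full copy of the (small) NN parameters.
nn_weights = [0.1, 0.2, 0.3]
gpu_replicas = [list(nn_weights) for _ in range(M_GPU)]
```

Sharding keeps the storage-heavy part distributed, while replication keeps the compute-heavy part local to each GPU.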
Why “EMB-CPU-PS” and “NN-GPU-DP”?
Alternative: Emb-Cpu-PS + NN-Gpu-PS (CPU 0 … CPU N-1, GPU 0 … GPU M-1)
- Cannot benefit from mature communication toolkits, e.g., DDP, Horovod, Bagua, etc.
Alternative: Emb-Cpu-PS + NN-Cpu-DP (CPU 0 … CPU N-1, GPU 0 … GPU M-1)
- Cannot benefit from efficient GPU-GPU communication, such as NVLink, GDR
Naïve Implementation of Training a Recommendation Model
Embedding server: CPU 0, CPU 1, …, CPU N-1
Workers: GPU 0, …, GPU M-1
Legend:
- EO: feature collection
- FC: forward computation
- BC: backward computation
- PEG: push embedding gradient
- W-E: wait for embedding update done
- SNN: synchronize NN
The naive implementation would not be efficient
Synchronous Updating vs. Asynchronous Updating
Sync (naive): EO FC BC SNN PEG W-E EO FC BC SNN PEG W-E
Ensures model accuracy, but with "low efficiency" due to synchronization, like XDL-sync, PaddlePaddle
Sync (opt): [timeline figure: EO FC BC PEG W-E repeated, with SNN on a separate lane]
Async (ideal in runtime): [timeline figure: EO, FC BC, and PEG on separate lanes, repeated, with SNN on its own lane]
Uses out-of-date information for computation to get high efficiency, but often causes "low accuracy", like XDL-async, Mio (recall that a 0.1% accuracy drop may cause a significant loss in recommendation)
Question: can we design an algorithm/system to make a better tradeoff between efficiency and accuracy?
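A toy cost model can make the efficiency gap between these schedules easier to see. The stage durations below are invented numbers, and the overlap rules are assumptions read off the stage labels (SNN overlapping PEG and W-E in the optimized sync schedule; only FC and BC left on the critical path in the ideal async schedule). None of this is measured Persia behaviour.

```python
# Hypothetical per-iteration stage durations in milliseconds (made up).
T = {"EO": 4.0, "FC": 3.0, "BC": 5.0, "SNN": 2.0, "PEG": 1.0, "W-E": 3.0}

def sync_naive(t):
    # Every stage runs back to back on the critical path.
    return sum(t[s] for s in ("EO", "FC", "BC", "SNN", "PEG", "W-E"))

def sync_opt(t):
    # Assumption: NN synchronization runs concurrently with PEG + W-E.
    return t["EO"] + t["FC"] + t["BC"] + max(t["SNN"], t["PEG"] + t["W-E"])

def async_ideal(t):
    # Assumption: EO, PEG, and SNN are fully hidden behind computation,
    # and nobody waits for embedding updates.
    return t["FC"] + t["BC"]

times = {
    "Sync (naive)": sync_naive(T),
    "Sync (opt)": sync_opt(T),
    "Async (ideal)": async_ideal(T),
}
for name, ms in times.items():
    print(f"{name}: {ms} ms per iteration")
```

With these made-up durations the three schedules cost 18, 16, and 8 ms per iteration, so overlapping alone recovers only part of the gap to the async schedule.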
Persia: A Sync-Async-Hybrid Algorithm/System
Sync: EO FC BC SNN PEG W-E EO FC BC SNN PEG W-E
Sync-Async-Hybrid: [timeline figure: EO, FC BC, PEG, and SNN on separate lanes, repeated]
Key idea
o Synchronous update for NN
o Asynchronous update for Emb
Persia (Bagua + Sync-Async-Hybrid): [timeline figure: EO, FC BC, SNN, and PEG on separate lanes, repeated]
Key idea
o Overlap backward computation and synchronization of NN
Async (ideal): [timeline figure: EO, FC BC, and PEG on separate lanes, repeated, with SNN on its own lane]
Only a minor loss in efficiency compared to the ideal (Async) implementation
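The two update rules behind the hybrid idea (synchronous for NN, asynchronous for Emb) can be mimicked in a single process. In this sketch the NN gradient is averaged over all workers before it is applied, which is what an all-reduce would produce, while embedding gradients are queued and applied a fixed number of steps late to imitate staleness. The class name, the delay queue, and the scalar embeddings are all assumptions for illustration, not how Persia implements it.

```python
from collections import deque

class HybridTrainer:
    def __init__(self, nn_w, emb, lr=0.1, staleness=1):
        self.nn_w = list(nn_w)      # NN weights, identical on every worker
        self.emb = dict(emb)        # emb_id -> scalar embedding (for brevity)
        self.lr = lr
        self.staleness = staleness  # steps an embedding gradient may lag
        self.pending = deque()      # embedding gradients not yet applied

    def step(self, worker_grads):
        """worker_grads: one (nn_grad, {emb_id: emb_grad}) pair per worker."""
        m = len(worker_grads)
        # Synchronous NN update: average the gradient over all workers.
        avg = [sum(g[0][j] for g in worker_grads) / m
               for j in range(len(self.nn_w))]
        self.nn_w = [w - self.lr * a for w, a in zip(self.nn_w, avg)]
        # Asynchronous Emb update: enqueue now, apply once it is old enough.
        self.pending.append([g[1] for g in worker_grads])
        while len(self.pending) > self.staleness:
            for emb_grads in self.pending.popleft():
                for emb_id, g in emb_grads.items():
                    self.emb[emb_id] = self.emb.get(emb_id, 0.0) - self.lr * g

trainer = HybridTrainer(nn_w=[1.0], emb={7: 1.0}, lr=0.1, staleness=1)
grads = [([1.0], {7: 1.0}), ([3.0], {7: 1.0})]   # two hypothetical workers
trainer.step(grads)
emb_after_first = trainer.emb[7]   # still 1.0: the emb update is delayed
trainer.step(grads)
```

After the first step the NN weight has already moved while the embedding has not, which is the asymmetry the key idea describes.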
Convergence of Persia
Updating of Sync-Async-Hybrid algorithm
Theorem
Under the standard assumptions and with an appropriately chosen learning rate, Persia admits the following convergence rate
Annotations on the rate: the rate of the purely synchronous algorithm, plus a minor term (usually ϕ𝜏 ≪ 1)
ϕ ≪ 1: ID's maximal frequency (within all samples)
𝜏: maximal staleness (#GPUs in Persia)
OBS: Persia admits almost the same convergence rate as the purely synchronous algorithm!
Other System Optimization
Optimizing communication efficiency
o Persia-RPC (10+ times faster than gRPC)
o Middleware (save bandwidth for workers)
o Bagua 
o Compact sample batch representation (CSB)
              (de)serialization           compression            thread pool
gRPC          protobuf (high overhead)    gzip/deflate (slow)    no affinity
Persia-RPC    zerocopy (no overhead)      lz4/zstd (fast)        NUMA aware, thread affinity
Original sample batch
[0000000000000001, 0001000000000001, label_1]
[0011000100010001, 0001000000000001, label_2]
[0000000000000001, 0000000010000001, label_3]
…
Compact sample batch
Batch vocabulary =
[0000000000000001: 1, 3;
0001000000000001: 1, 2;
0011000100010001: 2;
0000000010000001: 3]
[label_1]
[label_2]
[label_3]
…
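The conversion from the original sample batch to the compact one can be sketched as below. It reproduces the example on this slide: each distinct ID is stored once, together with the 1-based positions of the samples that contain it. The function name and the use of strings for IDs are assumptions; the slides do not show the actual wire format.

```python
def to_compact_batch(samples):
    """samples: list of [id, id, ..., label]. Returns (vocabulary, labels)."""
    vocabulary = {}   # ID -> 1-based sample positions, in first-seen order
    labels = []
    for pos, sample in enumerate(samples, start=1):
        *ids, label = sample
        for emb_id in ids:
            vocabulary.setdefault(emb_id, []).append(pos)
        labels.append(label)
    return vocabulary, labels

batch = [
    ["0000000000000001", "0001000000000001", "label_1"],
    ["0011000100010001", "0001000000000001", "label_2"],
    ["0000000000000001", "0000000010000001", "label_3"],
]
vocab, labels = to_compact_batch(batch)
for emb_id, positions in vocab.items():
    print(emb_id, positions)
```

In this example six ID occurrences shrink to four vocabulary entries; the saving grows when long IDs repeat often within a batch.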
[Figure: Middleware data flow. Components: Hdfs / Kafka, Data loader, Middleware, Gpu_0, Gpu_1, Cpu_0 to Cpu_3. Edge labels: Raw data; CSB; ID, Emb gradient; Emb sum; Emb; Emb gradient]
Other System Optimization
Optimizing computational efficiency
o PyTorch mixed precision
o CPU SIMD instructions
o Embedding and optimizer parameter co-location
o Etc.
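One way to picture "embedding and optimizer parameter co-location" is a table where each ID maps to a single contiguous row holding both the embedding and its optimizer state, so one lookup serves the whole update. The sketch below assumes Adagrad as the optimizer and a `[embedding | accumulator]` row layout; both are illustrative choices, not details taken from the slides.

```python
import math
from array import array

DIM = 4  # hypothetical embedding dimension

class CoLocatedTable:
    """Each ID owns one row: [embedding (DIM) | Adagrad accumulator (DIM)]."""

    def __init__(self):
        self.rows = {}

    def row(self, emb_id):
        if emb_id not in self.rows:
            self.rows[emb_id] = array("f", [0.0] * (2 * DIM))
        return self.rows[emb_id]

    def apply_gradient(self, emb_id, grad, lr=0.1, eps=1e-8):
        r = self.row(emb_id)  # a single lookup fetches weights and state
        for j in range(DIM):
            r[DIM + j] += grad[j] * grad[j]                       # accumulator
            r[j] -= lr * grad[j] / (math.sqrt(r[DIM + j]) + eps)  # weight

table = CoLocatedTable()
table.apply_gradient(1, [1.0] * DIM)
row = table.row(1)
```

Keeping the two pieces adjacent avoids a second lookup per update and tends to be friendlier to caches than two separate tables.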
Empirical Study: Sync vs. Async vs. Hybrid
OBS: The efficiency of Persia is comparable to Async, and the accuracy is consistent with Sync.
Setup:
o 64 NVIDIA V100 GPUs
o 100 CPU machines (52 cores & 480 GB RAM)
o GPU bandwidth: 100 Gbps
o CPU bandwidth: 10 Gbps
Comparison to Other Public Frameworks
Scalability (“#workers” and “#parameters”)
OBS: Nearly linear speedup w.r.t. #GPUs; linear scale-up w.r.t. #parameters!
Setup: Google Cloud
o 8 a2-highgpu-8g instances (each with 8 NVIDIA A100 GPUs)
o 100 c2-standard-30 instances (each with 30 vCPUs, 120 GB RAM)
o 30 m2-ultramem-416 instances (each with 416 vCPUs, 12 TB RAM)
Summary for Persia
- Heterogeneous architecture
- Hybrid update (Sync + Async): fast and accurate
- Resource efficient
- Flexible (all components can be integrated into one machine)
- Pushes the model size to a new order of magnitude (100 trillion parameters)
- Invited to integrate into PyTorch Lightning
Thanks!
Q&A
Paper github

SF Big Analytics 2022-03-15: Persia: Scaling DL Based Recommenders up to 100 Trillion Parameters

  • 1. Persia: A Hybrid System Scaling Deep Learning-based Recommenders up to 100 Trillion Parameters. Authors: Xiangru Lian, Xuefeng Zhu, Yulong Wang, Honghuan Wu, Lei Sun, Haodong Lyu, Chengjun Liu, Xing Dong, Yiqiao Liao, Mingnan Luo, Congfei Zhang, Jingru Xie, Haonan Li, Lei Chen, Renjie Huang, Jianying Lin, Chengchun Shu, Xuezhong Qiu, Zhishan Liu, Dongying Kong, Lei Yuan, Hai Yu, Sen Yang, Ji Liu, Binhang Yuan, Yongjun He, Ce Zhang. Code: github.com/PersiaML. Contacts: Ji Liu (ji.liu.uwisc@gmail.com), Ce Zhang (ce.zhang@inf.ethz.ch). Dr. Ji Liu: Former Head, AI platform department; Former Director, Seattle AI lab; Kwai Inc.
  • 2. Why Is Recommendation So Important? Revenue = user engagement × user consumption. User engagement: e.g., content recommendation, game recommendation, etc. User consumption: e.g., ads recommendation, personalized anchor recommendation, etc.
  • 3. Challenges to recommender infrastructure: how to accommodate the ever-growing recommendation model in training (our focus today), inference, and serving. Trend: data increases exponentially, and so does the model.
  • 4. Recommendation Model Evolution: decision-tree-based models (not popular anymore) → naïve logistic regression → complex logistic regression → DL-based model.
  • 5. Recommendation Model Evolution – Math View. Naïve logistic regression: LR( Linear( Emb(U_id), Emb(I_id) ) ). Complex logistic regression (more ID features, cross ID features): LR( Linear( Emb(U_id), Emb(I_id); Emb(U_f1), …, Emb(U_fm); Emb(I_f1), …, Emb(I_fn); Emb(U_f1, I_f1), Emb(U_f2, I_f1), … ) ). DL-based model (complicated DNN): LR( Nonlinear( Emb(U_id), Emb(I_id); Emb(U_f1), …, Emb(U_fm); Emb(I_f1), …, Emb(I_fn); Emb(U_f1, I_f1), Emb(U_f2, I_f1), … ) ).
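The three formulas on slide 5 differ only in which embeddings are fed in and whether the layer on top is linear. The following is a minimal sketch of that progression in plain Python. The feature names (`U_f1`, `I_f1`), the example values, the embedding width `DIM`, the random weights, and the single ReLU layer standing in for the "complicated DNN" are all assumptions made for illustration, not details from the talk.

```python
import math
import random

random.seed(0)
DIM = 4  # embedding width; arbitrary, for illustration only

def emb(table, key):
    """Look up (or lazily create) the embedding vector of an ID or crossed ID."""
    if key not in table:
        table[key] = [random.uniform(-0.1, 0.1) for _ in range(DIM)]
    return table[key]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def rand_weights(n):
    return [random.uniform(-1.0, 1.0) for _ in range(n)]

def lr_linear(embs, weights, bias=0.0):
    """LR( Linear( Emb(...), ... ) ): logistic regression on concatenated embeddings."""
    x = [v for e in embs for v in e]
    return sigmoid(dot(x, weights) + bias)

def lr_nonlinear(embs, hidden_weights, out_weights, bias=0.0):
    """LR( Nonlinear( Emb(...), ... ) ): one ReLU layer stands in for the DNN."""
    x = [v for e in embs for v in e]
    hidden = [max(0.0, dot(x, w)) for w in hidden_weights]
    return sigmoid(dot(hidden, out_weights) + bias)

table = {}
# Naive LR: user ID and item ID only.
naive = [emb(table, ("U_id", 7)), emb(table, ("I_id", 42))]
# Complex LR: more ID features plus one crossed feature (U_f1, I_f1).
complex_feats = naive + [
    emb(table, ("U_f1", "female")),
    emb(table, ("I_f1", "sports")),
    emb(table, ("U_f1 x I_f1", ("female", "sports"))),
]

p_naive = lr_linear(naive, rand_weights(len(naive) * DIM))
p_complex = lr_linear(complex_feats, rand_weights(len(complex_feats) * DIM))
hidden_w = [rand_weights(len(complex_feats) * DIM) for _ in range(8)]
p_dl = lr_nonlinear(complex_feats, hidden_w, rand_weights(8))
print(p_naive, p_complex, p_dl)
```

In this sketch the embedding table grows with every new ID or ID cross, while the layer on top stays small, which is the asymmetry the later slides build on.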
  • 6. New Challenges for Training Super Large Models. Key challenge: how to fully utilize the ever-upgrading hardware to meet the ever-growing models. Mio (SOTA, possibly under different names, e.g., XDL): asynchronous update; homogeneous; bottlenecks: only good for LR, accuracy drop. Mio+: asynchronous update; homogeneous; bottlenecks: high cost, accuracy drop. Persia by us: hybrid update; heterogeneous; bottlenecks: ? Aibox by Baidu: synchronous update; single big machine; bottlenecks: hard to scale, training efficiency.
  • 7. Preliminary for DL Based Models: Parameters. Model parameters = Emb parameters (Emb 0, Emb 1, …, Emb K) + NN parameters. Emb examples: a user ID embedding characterizing each user’s preference; a gender ID embedding characterizing each type of gender; some cross ID embedding, e.g., user ID by gender ID. NN: models the nonlinear correlation between the provided user and item, e.g., click-through rate (CTR). # parameters: Emb 10^11 to 10^14, NN 10^6 to 10^7. # computation per sample: Emb 10^3 to 10^4, NN 10^6 to 10^7. Emb parameters: storage heavy, computation light. NN parameters: storage light, computation heavy.
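To make "storage heavy" versus "storage light" on slide 7 concrete, here is a back-of-the-envelope calculation using the slide's parameter counts. The 4 bytes per parameter (fp32, no optimizer state) is an assumption for illustration; the talk does not state the storage format.

```python
BYTES_PER_PARAM = 4  # assumption: one fp32 value per parameter, no optimizer state

def storage_bytes(num_params, bytes_per_param=BYTES_PER_PARAM):
    """Raw storage needed for num_params parameters."""
    return num_params * bytes_per_param

# Parameter-count ranges taken from the slide.
for name, low, high in [("Emb", 10**11, 10**14), ("NN", 10**6, 10**7)]:
    low_tb = storage_bytes(low) / 1e12
    high_tb = storage_bytes(high) / 1e12
    print(f"{name}: {low_tb:g} TB to {high_tb:g} TB")
```

Under that assumption the embedding tables need roughly 0.4 TB to 400 TB, while the NN needs only about 4 MB to 40 MB, so the NN fits easily in GPU memory and the embeddings do not.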
  • 8. Preliminary for DL Based Models: Training Protocol. Sample: [ID1, ID2, ID3, …, IDk, label]. Steps shown between the embeddings (Emb 0, Emb 1, …, Emb K) and the NN: feature collection, forward, backward, push back emb gradient, synchronization.
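The training protocol on slide 8 can be sketched as a single-process loop. This is only an illustration under stated assumptions: the "NN" is a single linear layer with a sigmoid, embeddings are sum-pooled, the loss is binary cross-entropy, and the names `emb_table`, `nn`, and `train_step` are hypothetical. With one worker, the synchronization step reduces to a local update.

```python
import math

DIM = 2
LEARNING_RATE = 0.1
emb_table = {}                      # stand-in for the embedding parameters
nn = {"w": [0.1] * DIM, "b": 0.0}   # stand-in NN: a single linear layer

def train_step(sample):
    """Run one step on a sample of the form [ID1, ..., IDk, label]."""
    *ids, label = sample
    # 1. Feature collection: fetch the embedding of every ID in the sample.
    embs = [emb_table.setdefault(i, [0.01] * DIM) for i in ids]
    pooled = [sum(e[d] for e in embs) for d in range(DIM)]
    # 2. Forward.
    logit = sum(w * x for w, x in zip(nn["w"], pooled)) + nn["b"]
    prob = 1.0 / (1.0 + math.exp(-logit))
    loss = -(label * math.log(prob) + (1 - label) * math.log(1 - prob))
    # 3. Backward.
    err = prob - label
    grad_w = [err * x for x in pooled]
    grad_emb = [err * w for w in nn["w"]]  # identical for every pooled embedding
    # 4. Push back emb gradient: update the rows that were looked up.
    for i in ids:
        row = emb_table[i]
        for d in range(DIM):
            row[d] -= LEARNING_RATE * grad_emb[d]
    # 5. Synchronization: with several workers the NN gradients would be
    #    averaged across workers here; with one worker it is a local update.
    nn["w"] = [w - LEARNING_RATE * g for w, g in zip(nn["w"], grad_w)]
    nn["b"] -= LEARNING_RATE * err
    return loss

sample = ["user_17", "item_99", "gender_1", 1]
losses = [train_step(sample) for _ in range(20)]
print(losses[0], losses[-1])
```

Only the rows of the embedding table touched by the sample are read and written, whereas every NN parameter takes part in each step.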
  • 9. Persia’s Parameter Placement. Emb-Cpu-PS: Emb parameters are stored across multiple CPUs (CPU 0, CPU 1, …, CPU N-1), like a parameter server (PS). NN-Gpu-DP: NN parameters are stored on multiple GPUs (GPU 0, …, GPU M-1), as in data parallelism. Diagram: Emb 0, Emb 1, … spread over the CPUs (indices 0, 1, …, N, …, 2N, …); NN on the GPUs.
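The placement on slide 9 can be illustrated with two small classes. This is a sketch under assumptions, not Persia's implementation: the modulo sharding rule is a guess suggested by the "0, 1, …, N, …, 2N" labels in the diagram, and the class names `EmbCpuPS` and `NNGpuDP` and their methods are hypothetical.

```python
class EmbCpuPS:
    """Embedding rows sharded over CPU nodes, parameter-server style."""

    def __init__(self, num_cpus, dim):
        self.dim = dim
        self.shards = [{} for _ in range(num_cpus)]  # one dict per CPU node

    def shard_of(self, emb_id):
        # Assumed rule: row i lives on CPU (i mod N).
        return emb_id % len(self.shards)

    def lookup(self, emb_id):
        shard = self.shards[self.shard_of(emb_id)]
        return shard.setdefault(emb_id, [0.0] * self.dim)


class NNGpuDP:
    """Full NN copy on every GPU, data-parallel style."""

    def __init__(self, num_gpus, nn_params):
        self.replicas = [list(nn_params) for _ in range(num_gpus)]

    def all_reduce_step(self, grads_per_gpu, lr):
        """Average gradients over GPUs and apply the same update everywhere."""
        n = len(self.replicas)
        size = len(self.replicas[0])
        avg = [sum(g[k] for g in grads_per_gpu) / n for k in range(size)]
        for replica in self.replicas:
            for k, g in enumerate(avg):
                replica[k] -= lr * g


ps = EmbCpuPS(num_cpus=4, dim=2)
print([ps.shard_of(i) for i in (0, 1, 4, 8)])  # rows 0, N, 2N share CPU 0

dp = NNGpuDP(num_gpus=2, nn_params=[1.0, -1.0])
dp.all_reduce_step([[0.2, 0.0], [0.4, 0.2]], lr=0.5)
print(dp.replicas)
```

Because every GPU applies the same averaged gradient, the replicas stay identical after each step, which is the property that lets the NN side use standard all-reduce tooling.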
  • 10. Why “EMB-CPU-PS” and “NN-GPU-DP”? Alternative 1, Emb-Cpu-PS with NN-Gpu-PS (CPU 0 … CPU N-1, GPU 0 … GPU M-1): cannot benefit from mature communication toolkits, e.g., DDP, Horovod, Bagua, etc. Alternative 2, Emb-Cpu-PS with NN-Cpu-DP: cannot benefit from efficient GPU-GPU communication, such as NVLink and GDR.
  • 11. Naïve Implementation of Training a Recommendation Model. Components: embedding server (CPU 0, CPU 1, …, CPU N-1) and workers (GPU 0, …, GPU M-1). Timeline legend: EO = feature collection; FC = forward computation; BC = backward computation; PEG = push embedding gradient; W-E = wait for embedding update done; SNN = synchronize NN. A naive implementation would not be efficient.
  • 12. Synchronous Updating vs. Asynchronous Updating. Sync (naive), timeline EO, FC, BC, SNN, PEG, W-E per step: ensures model accuracy but with "low efficiency" due to synchronization, like XDL-sync and PaddlePaddle. Sync (opt), timeline EO, FC, BC, SNN, PEG, W-E per step. Async (ideal in runtime), timeline EO, FC, BC, PEG per step, with SNN: uses out-of-date information for computation to get high efficiency, but often causes "low accuracy", like XDL-async and Mio (recall that a 0.1% accuracy drop may cause a significant loss in recommendation). Question: can we design an algorithm/system that makes a better tradeoff between efficiency and accuracy?
  • 13. Persia: A Sync-Async-Hybrid Algorithm/System. Timelines compared: Sync (EO, FC, BC, SNN, PEG, W-E per step); Async (ideal) (EO, FC, BC, PEG per step, with SNN); Sync-Async-Hybrid (EO, FC, BC, PEG, SNN per step); Persia (Bagua + Sync-Async-Hybrid) (EO, FC, BC, SNN, PEG). Key idea of Sync-Async-Hybrid: synchronous update for NN, asynchronous update for Emb. Key idea added in Persia: overlap backward computation and synchronization of NN. Result: only a minor loss in efficiency compared to the ideal (Async) implementation.
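The key idea on slide 13 (synchronous update for NN, asynchronous update for Emb) can be simulated in a few lines. This is a toy sketch under assumptions, not Persia's algorithm: the model is a scalar product `nn_w * emb` with squared loss, staleness is modelled as a bounded queue of embedding gradients that have been pushed but not yet applied, and `HybridTrainer` and `max_in_flight` are hypothetical names.

```python
from collections import deque

class HybridTrainer:
    def __init__(self, max_in_flight, lr=0.1):
        self.nn_w = 1.0                 # scalar stand-in for the NN parameters
        self.emb = {}                   # ID -> scalar embedding
        self.in_flight = deque()        # embedding gradients pushed but not applied
        self.max_in_flight = max_in_flight
        self.lr = lr

    def step(self, samples):
        """One training step; `samples` holds one (emb_id, label) pair per worker."""
        nn_grads, losses = [], []
        for emb_id, label in samples:
            e = self.emb.setdefault(emb_id, 0.0)   # may be a stale value
            err = self.nn_w * e - label
            losses.append(0.5 * err * err)
            nn_grads.append(err * e)
            # Asynchronous side: push the gradient, do not wait for the update.
            self.in_flight.append((emb_id, err * self.nn_w))
        # Synchronous side: average NN gradients over workers, update at once.
        self.nn_w -= self.lr * sum(nn_grads) / len(nn_grads)
        # The embedding server applies the oldest gradients with a delay.
        while len(self.in_flight) > self.max_in_flight:
            emb_id, grad = self.in_flight.popleft()
            self.emb[emb_id] -= self.lr * grad
        return sum(losses) / len(losses)

trainer = HybridTrainer(max_in_flight=2)
history = [trainer.step([(5, 1.0), (5, 1.0)]) for _ in range(100)]
print(history[0], history[-1])
```

In this toy setting the loss still falls even though every embedding read lags the pushed gradients by about one step, while the NN weight is never stale.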
  • 14. Convergence of Persia. Theorem: under the standard assumptions and with the learning rate set appropriately, the Sync-Async-Hybrid algorithm admits a convergence rate consisting of the rate of the purely synchronous algorithm plus a minor term. ϕ ≪ 1: an ID’s maximal frequency (within all samples). τ: maximal staleness (#GPUs in Persia). Usually ϕτ ≪ 1. OBS: Persia admits almost the same convergence rate as the purely synchronous algorithm! (Panels: theorem; updating of the Sync-Async-Hybrid algorithm.)
  • 15. Other System Optimization: optimizing communication efficiency. Techniques: Persia-RPC (10+ times faster than gRPC); middleware (saves bandwidth for workers); Bagua; compact sample batch representation (CSB). gRPC vs. persia-RPC: (de)serialization: protobuf (high overhead) vs. zerocopy (no overhead); compression: gzip/deflate (slow) vs. lz4/zstd (fast); thread pool: no affinity vs. NUMA aware, thread affinity. Original sample batch: [0000000000000001, 0001000000000001, label_1], [0011000100010001, 0001000000000001, label_2], [0000000000000001, 0000000010000001, label_3], … Compact sample batch: batch vocabulary = [0000000000000001: 1, 3; 0001000000000001: 1, 2; 0011000100010001: 2; 0000000010000001: 3] plus labels [label_1], [label_2], [label_3], … Data-flow diagram components: Hdfs / Kafka, data loader, middleware, Gpu_0, Gpu_1, Cpu_0, Cpu_1, Cpu_2, Cpu_3; edge labels: raw data; ID, Emb gradient; CSB; Emb sum; Emb; Emb gradient.
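The compact sample batch (CSB) example on slide 15 can be reproduced with a short function. This is a sketch of the transformation shown on the slide, not Persia's actual wire format; the function name `to_compact_batch` is hypothetical and IDs are kept as strings exactly as printed on the slide.

```python
def to_compact_batch(batch):
    """Split samples of the form [id_1, ..., id_k, label] into a batch
    vocabulary (ID -> 1-based indices of the samples using it) and labels."""
    vocabulary, labels = {}, []
    for index, (*ids, label) in enumerate(batch, start=1):
        for emb_id in ids:
            vocabulary.setdefault(emb_id, []).append(index)
        labels.append(label)
    return vocabulary, labels

original = [
    ["0000000000000001", "0001000000000001", "label_1"],
    ["0011000100010001", "0001000000000001", "label_2"],
    ["0000000000000001", "0000000010000001", "label_3"],
]
vocabulary, labels = to_compact_batch(original)
for emb_id, sample_indices in vocabulary.items():
    print(emb_id, sample_indices)
print(labels)
```

Each distinct ID appears once in the vocabulary, followed by small sample indices, so an ID shared by many samples in a batch does not have to be repeated for each of them.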
  • 16. Other System Optimization: optimizing computational efficiency. Techniques: PyTorch mixed precision; CPU SIMD instructions; embedding and optimizer parameter co-location; etc.
  • 17. Empirical Study: Sync vs. Async vs. Hybrid. OBS: the efficiency of Persia is comparable to Async, and the accuracy is consistent with Sync. Setup: 64 NVIDIA V100 GPUs; 100 CPU machines (52 cores & 480 GB RAM); GPU bandwidth: 100 Gbps; CPU bandwidth: 10 Gbps.
  • 18. Comparison to Other Public Frameworks
  • 19. Scalability (“#workers” and “#parameters”). OBS: nearly linear speedup w.r.t. #GPUs; linear scale-up w.r.t. #parameters! Setup (Google Cloud): 8 a2-highgpu-8g instances (each with 8 NVIDIA A100 GPUs); 100 c2-standard-30 instances (each with 30 vCPUs, 120 GB RAM); 30 m2-ultramem-416 instances (each with 416 vCPUs, 12 TB RAM).
  • 20. Summary for Persia: heterogeneous architecture; hybrid update (Sync + Async): fast and accurate; resource efficient; flexible (all components can be integrated into one machine); pushes the model size to a new order of magnitude (100 trillion parameters); invited to integrate into PyTorch Lightning.