How We Scaled BERT To Serve 1+ Billion Daily Requests on CPU

Quoc N. Le, Data Scientist, Roblox
Kip Kaehler, Engineering Manager, Roblox

Deep Learning for Text Classification
● Text classification is a key capability on the Roblox platform
● BERT is a deep learning model that has transformed the Natural Language Processing (NLP) landscape
● Performance (Precision/Recall Area Under Curve) of our text classifiers improved by 10 percentage points when fine-tuning BERT versus classical machine learning

Beyond Accuracy: Latency and Throughput
Latency: Speed of Request
Analogy -> How long it takes for a single person to cross a bridge
We required latency under 20ms
Throughput: Completed Requests Per Second
Analogy -> How many people can cross a bridge in a period of time
We required over 50k requests per sec
We want a short, wide bridge to minimize Latency and maximize Throughput in a real-time environment
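The two requirements can be related with some back-of-the-envelope arithmetic. The sketch below uses only figures from the slides (1+ billion daily requests, 20ms latency, 50k requests per sec). Little's law is brought in as an illustration and is not from the talk, and the function names are hypothetical.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def average_rate(requests_per_day: float) -> float:
    """Average requests per second implied by a daily request count."""
    return requests_per_day / SECONDS_PER_DAY


def in_flight_requests(throughput_rps: float, latency_s: float) -> float:
    """Little's law: average requests in flight = throughput * latency."""
    return throughput_rps * latency_s


# 1 billion requests per day, spread evenly over the day
print(f"average rate: {average_rate(1_000_000_000):,.0f} requests/sec")

# Requests being processed at any instant at 50k requests/sec and 20ms each
print(f"in flight: {in_flight_requests(50_000, 0.020):,.0f} requests")
```

The 50k figure is several times the daily average, which would be consistent with sizing for peak traffic, though the slides do not say so.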

GPU vs CPU (for our application)
Higher Throughput for Model Training: GPU
GPU (Tesla V100) is ~10x faster at processing training examples than CPU, due to efficiency in doing large batch matrix operations
OR
Higher Throughput for Real-Time Inference: CPU
CPU (Intel Xeon Scalable Processor) has ~5x more throughput than GPU, due to CPU-specific optimizations and spreading real-time inference requests across cores (with latency < 20ms)
(on cost equivalent hardware in 2020)

Which Comes First?
Build an Accurate Model -> Make It Fast
Choose a Known Fast Model -> Make It Accurate

Know Your Quoc Le’s
Quoc V. Le: Has over 85k citations as an AI researcher according to Google Scholar
Quoc N. Le: Once got kicked out of the Boomtown Casino in Reno for counting cards in blackjack

Our Scaling Playbook on CPU: Less Is More!
❏ Smaller Model (Distillation)
❏ Smaller Inputs (Dynamic Inputs)
❏ Smaller Weights (Quantization)
❏ Smaller Number of Requests (Caching)
❏ Smaller Number of Threads per Core (Thread Tuning)

Where We Started
[Figure: benchmarks run on Intel Xeon Scalable Processors, comparing Baseline BERT, Smaller Model, Smaller Inputs, Smaller Weights]

Smaller Model (DistilBERT)
[Figure: benchmarks run on Intel Xeon Scalable Processors, comparing Baseline BERT, Smaller Model, Smaller Inputs, Smaller Weights]
BERT Base has 110M parameters
DistilBERT has 66M parameters
Tradeoff: < 1% negative impact on “Accuracy” (PR AUC)
[Diagram: Knowledge Distillation from the Teacher Model (BERT) to the Student Model (DistilBERT); each model is drawn as Text Input -> Transformer Layers -> Predict Layer]
DistilBERT: https://arxiv.org/pdf/1910.01108.pdf
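To make the teacher/student idea concrete, here is a minimal sketch of a soft-target distillation loss: the student is penalized when its softened output distribution differs from the teacher's. This is one common ingredient of knowledge distillation, not the full DistilBERT training objective, and the temperature value and function names are assumptions for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax over a list of logits, softened by a temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions."""
    p = softmax(teacher_logits, temperature)  # teacher (target)
    q = softmax(student_logits, temperature)  # student (prediction)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# The loss is zero when the student matches the teacher, positive otherwise.
print(distillation_loss([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))
print(distillation_loss([3.0, 2.0, 1.0], [1.0, 2.0, 3.0]))
```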

Smaller Inputs (Dynamic Shapes)
[Figure: benchmarks run on Intel Xeon Scalable Processors, comparing Baseline BERT, Smaller Model, Smaller Inputs, Smaller Weights]
Fixed Shape Inputs
(Zero pad until all inputs have same shape)
Dynamic Shape Inputs
(Do not zero pad)
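The difference between the two input styles can be sketched without any model at all. The example below counts how many token positions the model would have to process in each case. The token ids, the maximum length of 128, and the function names are made up for illustration.

```python
PAD_ID = 0  # hypothetical padding token id


def fixed_shape(token_ids, max_len=128):
    """Truncate, then zero pad so every input has length max_len."""
    clipped = token_ids[:max_len]
    return clipped + [PAD_ID] * (max_len - len(clipped))


def dynamic_shape(token_ids, max_len=128):
    """Truncate only; the input keeps its own length (no zero padding)."""
    return token_ids[:max_len]


# Two short chat-style inputs, already tokenized (made-up ids)
requests = [[101, 7592, 102], [101, 2129, 2024, 2017, 102]]

fixed_tokens = sum(len(fixed_shape(r)) for r in requests)
dynamic_tokens = sum(len(dynamic_shape(r)) for r in requests)
print(fixed_tokens, dynamic_tokens)
```

For short texts, almost all of the fixed-shape positions are padding, so skipping the padding removes work the model would otherwise spend on zeros.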

Smaller Weights (Quantization)
[Figure: benchmarks run on Intel Xeon Scalable Processors, comparing Baseline BERT, Smaller Model, Smaller Inputs, Smaller Weights]
Image Credit: https://towardsdatascience.com/how-to-accelerate-and-compress-neural-networks-with-quantization-edfbbabb6af7
Tradeoff: < 1% negative impact on “Accuracy” (PR AUC)
“Dynamic Quantization” One-liner
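In PyTorch, for instance, dynamic quantization is exposed as `torch.quantization.quantize_dynamic(model, {torch.nn.Linear}, dtype=torch.qint8)`; whether that is the exact call shown on the slide is an assumption. To show what happens to the weights, the sketch below quantizes a small list of floats to 8-bit integers with a single shared scale and converts them back. It covers only the weight side (dynamic quantization also quantizes activations at run time), and the weights and function names are made up for illustration.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float weights from the integers."""
    return [v * scale for v in q]


weights = [0.5, -1.27, 0.0, 1.0]  # made-up float32-style weights
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# Each restored weight is within half a quantization step of the original.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(q, max_error)
```

Each weight now fits in one byte instead of four, at the cost of a small rounding error per weight.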

Smaller Number of Requests to Model (Caching)
Image Credit: https://peltarion.com/blog/data-science/illustration-3d-bert
[Diagram: Text Classification Service, Cache, DistilBERT Model]
1. Retrieve text classification result from cache (we’re done if it’s there)
2. Else call deep learning model for result
3. Add result to cache, then return result to service
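The three steps above can be sketched as a small wrapper around the model. This is a minimal sketch, not the production service: the class name, the toy stand-in model, and the least-recently-used eviction policy are all assumptions for illustration.

```python
from collections import OrderedDict


class CachedClassifier:
    """Wraps a model function with an in-memory cache keyed by input text."""

    def __init__(self, model_fn, max_entries=10_000):
        self.model_fn = model_fn
        self.max_entries = max_entries
        self.cache = OrderedDict()
        self.model_calls = 0  # how often the model was actually invoked

    def classify(self, text):
        # 1. Retrieve the result from the cache (done if it is there)
        if text in self.cache:
            self.cache.move_to_end(text)  # mark as recently used
            return self.cache[text]
        # 2. Else call the deep learning model for the result
        self.model_calls += 1
        result = self.model_fn(text)
        # 3. Add the result to the cache, then return it to the service
        self.cache[text] = result
        if len(self.cache) > self.max_entries:
            self.cache.popitem(last=False)  # evict least recently used
        return result


def toy_model(text):
    """Stand-in for the DistilBERT model; not a real classifier."""
    return "flagged" if "bad" in text.lower() else "ok"


clf = CachedClassifier(toy_model)
for message in ["hello", "hello", "bad word", "hello"]:
    clf.classify(message)
print(clf.model_calls)  # the model ran once per distinct text
```

Repeated inputs never reach the model, which is why caching pays off when many users send the same short messages.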

Smaller Number of Threads per Core (Thread Tuning)
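The slide names the technique without giving details. As a minimal sketch, assuming the model runs on a math library that reads the `OMP_NUM_THREADS` and `MKL_NUM_THREADS` environment variables, one way to keep threads from competing for the same core is to give each worker process a small thread budget and run about one worker per core. `plan_workers` and `thread_env` are hypothetical helpers, and the 36-core machine is made up.

```python
def plan_workers(cpu_cores, threads_per_worker=1):
    """Worker processes to start so total threads do not exceed cores."""
    if cpu_cores < 1 or threads_per_worker < 1:
        raise ValueError("cpu_cores and threads_per_worker must be >= 1")
    return max(1, cpu_cores // threads_per_worker)


def thread_env(threads_per_worker=1):
    """Environment variables to set before the math library is loaded."""
    n = str(threads_per_worker)
    return {"OMP_NUM_THREADS": n, "MKL_NUM_THREADS": n}


# Hypothetical 36-core machine: one single-threaded worker per core
print(plan_workers(36, threads_per_worker=1), thread_env(1))
```

The returned variables would need to be placed in each worker's environment before it imports its numerical libraries; the right thread count is workload-specific and worth benchmarking.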

Our Scaling Playbook on CPU: Less Is More!
✓ Smaller Model (Distillation)
✓ Smaller Inputs (Dynamic Inputs)
✓ Smaller Weights (Quantization)
✓ Smaller Number of Requests (Caching)
✓ Smaller Number of Threads per Core (Thread Tuning)

30x Improvement in Latency and Throughput on CPU
[Figure: benchmarks run on Intel Xeon Scalable Processors, comparing Baseline BERT, Smaller Model, Smaller Inputs, Smaller Weights]

Takeaways
● For certain real-time deep learning applications, it is feasible and natural to super-scale inference on CPU
● The key to scaling is making things smaller, as shown in this presentation
● Many optimizations that enabled scale are easy to implement (one-liners)
● Check out our blog for more details: https://robloxtechblog.com/how-we-scaled-bert-to-serve-1-billion-daily-requests-on-cpus-d99be090db26

Questions? Suggestions?
We are always looking to get more performance
from our models. Please reach out to
kkaehler@roblox.com
PS We are always hiring 🤓
