SlideShare a Scribd company logo
How We Scaled BERT
To Serve 1+ Billion Daily
Requests on CPU
Quoc N. Le, Data Scientist, Roblox
Kip Kaehler, Engineering Manager, Roblox
About Roblox
3
Deep Learning for Text Classification
4
● Text Classification is a key capability on
Roblox platform
● BERT is a deep learning model that
has transformed the Natural Language
Processing (NLP) landscape
● Performance (Precision/Recall Area
Under Curve) of our text classifiers
improved by 10 percentage points fine-
tuning BERT versus classical machine
learning
Beyond Accuracy: Latency and Throughput
5
Latency: Speed of Request
Analogy -> How long it takes for a
single person to cross a bridge
We required latency under 20ms
Throughput: Completed Requests
Per Second
Analogy -> How many people can
cross a bridge in a period of time
We required over 50k requests per sec
We want a short, wide bridge to
maximize both Latency and
Throughput in a realtime
environment
Our Deep Learning Tech Stack
GPU vs CPU (for our application)
Higher Throughput
for Real-Time Inference
Higher Throughput
for Model Training
~GPU (TESLA V100) 10x faster
at processing training examples
than CPU, due to efficiency in
doing large batch matrix
operations
OR
~CPU (Intel Xeon Scalable
Processor) 5x more throughput
than GPU due to CPU-specific
optimizations and spreading real-
time inference requests across
cores (with latency < 20ms)
(on cost equivalent hardware in 2020)
Which Comes First?
8
Build an Accurate
Model
Make It Fast
Choose a Known
Fast Model
Make It Accurate
Know Your Quoc Le’s
Quoc N. Le Quoc V. Le
Has over 85k citations
as an AI researcher
according to Google
Scholar
Once got kicked out of
the Boomtown Casino
in Reno for counting
cards in blackjack
Our Scaling Playbook on CPU: Less Is More!
10
❏ Smaller Model (Distillation)
❏ Smaller Inputs (Dynamic Inputs)
❏ Smaller Weights (Quantization)
❏ Smaller Number of Requests
(Caching)
❏ Smaller Number of Threads per
Core (Thread Tuning)
Where We Started
Smaller
Model
Smaller
Inputs
Smaller
Weights
Benchmarks
Run on Intel
Xeon
Scalable
Processors
Baseline
BERT
Smaller Model (DistilBERT)
Smaller
Model
Smaller
Inputs
Smaller
Weights
Benchmarks
Run on Intel
Xeon
Scalable
Processors
Baseline
BERT
Smaller Model (DistilBERT)
Bert Base has
110m parameters
DistilBert has
66m parameters
Tradeoff: < 1%
negative impact on
“Accuracy” (PR
AUC)
Text Input
..
..
..
..
..
Predict
Layer
Transformer Layer
..
..
..
..
Transformer Layer
Transformer Layer
..
..
..
..
Transformer Layer
Predict
Layer
Text Input
Student
Model
(DistilBERT)
Teacher
Model
(BERT)
DistilBERT: https://arxiv.org/pdf/1910.01108.pdf
..
Knowledge
Distillation
Smaller Inputs (Dynamic Shapes)
Smaller
Model
Smaller
Inputs
Smaller
Weights
Benchmarks
Run on Intel
Xeon
Scalable
Processors
Baseline
BERT
Smaller Inputs (Dynamic Shapes)
Fixed Shape Inputs
(Zero pad until all inputs have same shape)
Dynamic Shape Inputs
(Do not zero pad)
Smaller Weights (Quantization)
Smaller
Model
Smaller
Inputs
Smaller
Weights
Benchmarks
Run on Intel
Xeon
Scalable
Processors
Baseline
BERT
Smaller Weights (Quantization)
Image Credit:
https://towardsdatascience.com/how-to-
accelerate-and-compress-neural-networks-with-
quantization-edfbbabb6af7
Tradeoff: < 1%
negative impact on
“Accuracy” (PR
AUC)
“Dynamic Quantization” One-liner
Smaller Weights (Quantization)
Smaller Number of Requests to Model (Caching)
Image Credit:
https://peltarion.com/blog/data-
science/illustration-3d-bert
Text
Classification
Service
Cache
DistilBERT Model
1. Retrieve text
classification result
from cache (we’re done
if it’s there)
2. Else call deep
learning model for
result
3. Add result to cache,
then return result to
service
Smaller Number Threads Per Core (Thread Tuning)
Our Scaling Playbook on CPU: Less Is More!
21
✓ Smaller Model (Distillation)
✓ Smaller Inputs (Dynamic Inputs)
✓ Smaller Weights (Quantization)
✓ Smaller Number of Requests
(Caching)
✓ Smaller Number of Threads per
Core (Thread Tuning)
30x Improvement in Latency and Throughput on CPU
Smaller
Model
Smaller
Inputs
Smaller
Weights
Benchmarks
Run on Intel
Xeon
Scalable
Processors
Baseline
BERT
Takeaways
23
● For certain real-time deep learning
applications, it is feasible/natural to super-
scale inferences on CPU
● The key to scaling is making things smaller,
as shown in this presentation
● Many optimizations that enabled scale are
easy to implement (one-liners)
● Check out our blog for more details:
https://robloxtechblog.com/how-we-scaled-
bert-to-serve-1-billion-daily-requests-on-cpus-
d99be090db26
Questions? Suggestions?
24
We are always looking to get more performance
from our models. Please reach out to
kkaehler@roblox.com
PS We are always hiring 🤓
Feedback
Your feedback is important to us.
Don’t forget to rate and review the sessions.

More Related Content

What's hot

Why would you recommend me THAT?
Why would you recommend me THAT?Why would you recommend me THAT?
Why would you recommend me THAT?
Aish Fenton
 
Coinbase Seed Round Pitch Deck
Coinbase Seed Round Pitch DeckCoinbase Seed Round Pitch Deck
Coinbase Seed Round Pitch Deck
Brian Armstrong
 
gRPC in Go
gRPC in GogRPC in Go
gRPC in Go
Almog Baku
 
How Does Generative AI Actually Work? (a quick semi-technical introduction to...
How Does Generative AI Actually Work? (a quick semi-technical introduction to...How Does Generative AI Actually Work? (a quick semi-technical introduction to...
How Does Generative AI Actually Work? (a quick semi-technical introduction to...
ssuser4edc93
 
The Deck We Used to Raise $1M Seed Round
The Deck We Used to Raise $1M Seed RoundThe Deck We Used to Raise $1M Seed Round
The Deck We Used to Raise $1M Seed Round
Ben Lang
 
GENERATIVE AI, THE FUTURE OF PRODUCTIVITY
GENERATIVE AI, THE FUTURE OF PRODUCTIVITYGENERATIVE AI, THE FUTURE OF PRODUCTIVITY
GENERATIVE AI, THE FUTURE OF PRODUCTIVITY
Andre Muscat
 
Uber: $1.25M VC investment turned into $84.2B. Uber's initial pitch deck
Uber: $1.25M VC investment turned into $84.2B. Uber's initial pitch deckUber: $1.25M VC investment turned into $84.2B. Uber's initial pitch deck
Uber: $1.25M VC investment turned into $84.2B. Uber's initial pitch deck
AA BB
 
ChatGPT_Cheatsheet_Costa.pdf
ChatGPT_Cheatsheet_Costa.pdfChatGPT_Cheatsheet_Costa.pdf
ChatGPT_Cheatsheet_Costa.pdf
ssuser3e5d3a
 
Decentraland: Building the Metaverse
Decentraland: Building the MetaverseDecentraland: Building the Metaverse
Decentraland: Building the Metaverse
Travis Scher
 
Understanding GenAI/LLM and What is Google Offering - Felix Goh
Understanding GenAI/LLM and What is Google Offering - Felix GohUnderstanding GenAI/LLM and What is Google Offering - Felix Goh
Understanding GenAI/LLM and What is Google Offering - Felix Goh
NUS-ISS
 
Ansiblefest 2018 Network automation journey at roblox
Ansiblefest 2018 Network automation journey at robloxAnsiblefest 2018 Network automation journey at roblox
Ansiblefest 2018 Network automation journey at roblox
Damien Garros
 
MoP(MQTT on Pulsar) - a Powerful Tool for Apache Pulsar in IoT - Pulsar Summi...
MoP(MQTT on Pulsar) - a Powerful Tool for Apache Pulsar in IoT - Pulsar Summi...MoP(MQTT on Pulsar) - a Powerful Tool for Apache Pulsar in IoT - Pulsar Summi...
MoP(MQTT on Pulsar) - a Powerful Tool for Apache Pulsar in IoT - Pulsar Summi...
StreamNative
 
The slide deck we used to raise half a million dollars
The slide deck we used to raise half a million dollarsThe slide deck we used to raise half a million dollars
The slide deck we used to raise half a million dollars
Buffer
 
Generative AI at the edge.pdf
Generative AI at the edge.pdfGenerative AI at the edge.pdf
Generative AI at the edge.pdf
Qualcomm Research
 
BuzzFeed Pitch Deck
BuzzFeed Pitch DeckBuzzFeed Pitch Deck
BuzzFeed Pitch Deck
Tech in Asia ID
 
[DL Hacks]スパイキングニューラルネットワークを用いた経路探索アルゴリズム
[DL Hacks]スパイキングニューラルネットワークを用いた経路探索アルゴリズム[DL Hacks]スパイキングニューラルネットワークを用いた経路探索アルゴリズム
[DL Hacks]スパイキングニューラルネットワークを用いた経路探索アルゴリズム
Deep Learning JP
 
Pitch Deck Teardown: Phospholutions's $10M Series A extension AgTech deck
Pitch Deck Teardown: Phospholutions's $10M Series A extension AgTech deckPitch Deck Teardown: Phospholutions's $10M Series A extension AgTech deck
Pitch Deck Teardown: Phospholutions's $10M Series A extension AgTech deck
HajeJanKamps
 
Proptech - disruptive technologies and their impact on the real estate indust...
Proptech - disruptive technologies and their impact on the real estate indust...Proptech - disruptive technologies and their impact on the real estate indust...
Proptech - disruptive technologies and their impact on the real estate indust...
Gyula Bajkai
 
ChatGPT Evaluation for NLP
ChatGPT Evaluation for NLPChatGPT Evaluation for NLP
ChatGPT Evaluation for NLP
XiachongFeng
 
Davai Pitchdeck
Davai Pitchdeck Davai Pitchdeck
Davai Pitchdeck
davaiapp
 

What's hot (20)

Why would you recommend me THAT?
Why would you recommend me THAT?Why would you recommend me THAT?
Why would you recommend me THAT?
 
Coinbase Seed Round Pitch Deck
Coinbase Seed Round Pitch DeckCoinbase Seed Round Pitch Deck
Coinbase Seed Round Pitch Deck
 
gRPC in Go
gRPC in GogRPC in Go
gRPC in Go
 
How Does Generative AI Actually Work? (a quick semi-technical introduction to...
How Does Generative AI Actually Work? (a quick semi-technical introduction to...How Does Generative AI Actually Work? (a quick semi-technical introduction to...
How Does Generative AI Actually Work? (a quick semi-technical introduction to...
 
The Deck We Used to Raise $1M Seed Round
The Deck We Used to Raise $1M Seed RoundThe Deck We Used to Raise $1M Seed Round
The Deck We Used to Raise $1M Seed Round
 
GENERATIVE AI, THE FUTURE OF PRODUCTIVITY
GENERATIVE AI, THE FUTURE OF PRODUCTIVITYGENERATIVE AI, THE FUTURE OF PRODUCTIVITY
GENERATIVE AI, THE FUTURE OF PRODUCTIVITY
 
Uber: $1.25M VC investment turned into $84.2B. Uber's initial pitch deck
Uber: $1.25M VC investment turned into $84.2B. Uber's initial pitch deckUber: $1.25M VC investment turned into $84.2B. Uber's initial pitch deck
Uber: $1.25M VC investment turned into $84.2B. Uber's initial pitch deck
 
ChatGPT_Cheatsheet_Costa.pdf
ChatGPT_Cheatsheet_Costa.pdfChatGPT_Cheatsheet_Costa.pdf
ChatGPT_Cheatsheet_Costa.pdf
 
Decentraland: Building the Metaverse
Decentraland: Building the MetaverseDecentraland: Building the Metaverse
Decentraland: Building the Metaverse
 
Understanding GenAI/LLM and What is Google Offering - Felix Goh
Understanding GenAI/LLM and What is Google Offering - Felix GohUnderstanding GenAI/LLM and What is Google Offering - Felix Goh
Understanding GenAI/LLM and What is Google Offering - Felix Goh
 
Ansiblefest 2018 Network automation journey at roblox
Ansiblefest 2018 Network automation journey at robloxAnsiblefest 2018 Network automation journey at roblox
Ansiblefest 2018 Network automation journey at roblox
 
MoP(MQTT on Pulsar) - a Powerful Tool for Apache Pulsar in IoT - Pulsar Summi...
MoP(MQTT on Pulsar) - a Powerful Tool for Apache Pulsar in IoT - Pulsar Summi...MoP(MQTT on Pulsar) - a Powerful Tool for Apache Pulsar in IoT - Pulsar Summi...
MoP(MQTT on Pulsar) - a Powerful Tool for Apache Pulsar in IoT - Pulsar Summi...
 
The slide deck we used to raise half a million dollars
The slide deck we used to raise half a million dollarsThe slide deck we used to raise half a million dollars
The slide deck we used to raise half a million dollars
 
Generative AI at the edge.pdf
Generative AI at the edge.pdfGenerative AI at the edge.pdf
Generative AI at the edge.pdf
 
BuzzFeed Pitch Deck
BuzzFeed Pitch DeckBuzzFeed Pitch Deck
BuzzFeed Pitch Deck
 
[DL Hacks]スパイキングニューラルネットワークを用いた経路探索アルゴリズム
[DL Hacks]スパイキングニューラルネットワークを用いた経路探索アルゴリズム[DL Hacks]スパイキングニューラルネットワークを用いた経路探索アルゴリズム
[DL Hacks]スパイキングニューラルネットワークを用いた経路探索アルゴリズム
 
Pitch Deck Teardown: Phospholutions's $10M Series A extension AgTech deck
Pitch Deck Teardown: Phospholutions's $10M Series A extension AgTech deckPitch Deck Teardown: Phospholutions's $10M Series A extension AgTech deck
Pitch Deck Teardown: Phospholutions's $10M Series A extension AgTech deck
 
Proptech - disruptive technologies and their impact on the real estate indust...
Proptech - disruptive technologies and their impact on the real estate indust...Proptech - disruptive technologies and their impact on the real estate indust...
Proptech - disruptive technologies and their impact on the real estate indust...
 
ChatGPT Evaluation for NLP
ChatGPT Evaluation for NLPChatGPT Evaluation for NLP
ChatGPT Evaluation for NLP
 
Davai Pitchdeck
Davai Pitchdeck Davai Pitchdeck
Davai Pitchdeck
 

Similar to How We Scaled Bert To Serve 1+ Billion Daily Requests on CPU

Ceph Day Berlin: Ceph on All Flash Storage - Breaking Performance Barriers
Ceph Day Berlin: Ceph on All Flash Storage - Breaking Performance BarriersCeph Day Berlin: Ceph on All Flash Storage - Breaking Performance Barriers
Ceph Day Berlin: Ceph on All Flash Storage - Breaking Performance Barriers
Ceph Community
 
Ceph
CephCeph
Ceph: Open Source Storage Software Optimizations on Intel® Architecture for C...
Ceph: Open Source Storage Software Optimizations on Intel® Architecture for C...Ceph: Open Source Storage Software Optimizations on Intel® Architecture for C...
Ceph: Open Source Storage Software Optimizations on Intel® Architecture for C...
Odinot Stanislas
 
Scale Up Performance with Intel® Development
Scale Up Performance with Intel® DevelopmentScale Up Performance with Intel® Development
Scale Up Performance with Intel® Development
Intel IT Center
 
High Frequency Trading and NoSQL database
High Frequency Trading and NoSQL databaseHigh Frequency Trading and NoSQL database
High Frequency Trading and NoSQL database
Peter Lawrey
 
Performance Optimization of Deep Learning Frameworks Caffe* and Tensorflow* f...
Performance Optimization of Deep Learning Frameworks Caffe* and Tensorflow* f...Performance Optimization of Deep Learning Frameworks Caffe* and Tensorflow* f...
Performance Optimization of Deep Learning Frameworks Caffe* and Tensorflow* f...
Intel® Software
 
Ceph Day Melbourne - Ceph on All-Flash Storage - Breaking Performance Barriers
Ceph Day Melbourne - Ceph on All-Flash Storage - Breaking Performance BarriersCeph Day Melbourne - Ceph on All-Flash Storage - Breaking Performance Barriers
Ceph Day Melbourne - Ceph on All-Flash Storage - Breaking Performance Barriers
Ceph Community
 
In-Memory and TimeSeries Technology to Accelerate NoSQL Analytics
In-Memory and TimeSeries Technology to Accelerate NoSQL AnalyticsIn-Memory and TimeSeries Technology to Accelerate NoSQL Analytics
In-Memory and TimeSeries Technology to Accelerate NoSQL Analytics
sandor szabo
 
Ceph on Intel: Intel Storage Components, Benchmarks, and Contributions
Ceph on Intel: Intel Storage Components, Benchmarks, and ContributionsCeph on Intel: Intel Storage Components, Benchmarks, and Contributions
Ceph on Intel: Intel Storage Components, Benchmarks, and Contributions
Red_Hat_Storage
 
Ceph on Intel: Intel Storage Components, Benchmarks, and Contributions
Ceph on Intel: Intel Storage Components, Benchmarks, and ContributionsCeph on Intel: Intel Storage Components, Benchmarks, and Contributions
Ceph on Intel: Intel Storage Components, Benchmarks, and Contributions
Colleen Corrice
 
Using Deep Learning at Scale - Guhan Suriyanarayanan and Adi Oltean, Microsoft
Using Deep Learning at Scale - Guhan Suriyanarayanan and Adi Oltean, MicrosoftUsing Deep Learning at Scale - Guhan Suriyanarayanan and Adi Oltean, Microsoft
Using Deep Learning at Scale - Guhan Suriyanarayanan and Adi Oltean, Microsoft
Guhan Suriyanarayanan
 
The Microarchitecure Of FPGA Based Soft Processor
The Microarchitecure Of FPGA Based Soft ProcessorThe Microarchitecure Of FPGA Based Soft Processor
The Microarchitecure Of FPGA Based Soft Processor
Deepak Tomar
 
Deep Dive: Maximizing Amazon EC2 and Amazon Elastic Block Store Performance
Deep Dive: Maximizing Amazon EC2 and Amazon Elastic Block Store PerformanceDeep Dive: Maximizing Amazon EC2 and Amazon Elastic Block Store Performance
Deep Dive: Maximizing Amazon EC2 and Amazon Elastic Block Store Performance
Amazon Web Services
 
Deep Dive: Maximizing EC2 and EBS Performance
Deep Dive: Maximizing EC2 and EBS PerformanceDeep Dive: Maximizing EC2 and EBS Performance
Deep Dive: Maximizing EC2 and EBS Performance
Amazon Web Services
 
Deep Dive: Maximizing Amazon EC2 and Amazon Elastic Block Store Performance
Deep Dive: Maximizing Amazon EC2 and Amazon Elastic Block Store PerformanceDeep Dive: Maximizing Amazon EC2 and Amazon Elastic Block Store Performance
Deep Dive: Maximizing Amazon EC2 and Amazon Elastic Block Store Performance
Amazon Web Services
 
Scaling Streaming - Concepts, Research, Goals
Scaling Streaming - Concepts, Research, GoalsScaling Streaming - Concepts, Research, Goals
Scaling Streaming - Concepts, Research, Goals
kamaelian
 
Deep learning at the edge: 100x Inference improvement on edge devices
Deep learning at the edge: 100x Inference improvement on edge devicesDeep learning at the edge: 100x Inference improvement on edge devices
Deep learning at the edge: 100x Inference improvement on edge devices
Numenta
 
Performance improvements in etcd 3.5 release
Performance improvements in etcd 3.5 releasePerformance improvements in etcd 3.5 release
Performance improvements in etcd 3.5 release
LibbySchulze
 
Age of Language Models in NLP
Age of Language Models in NLPAge of Language Models in NLP
Age of Language Models in NLP
Tyrone Systems
 

Similar to How We Scaled Bert To Serve 1+ Billion Daily Requests on CPU (20)

Ceph Day Berlin: Ceph on All Flash Storage - Breaking Performance Barriers
Ceph Day Berlin: Ceph on All Flash Storage - Breaking Performance BarriersCeph Day Berlin: Ceph on All Flash Storage - Breaking Performance Barriers
Ceph Day Berlin: Ceph on All Flash Storage - Breaking Performance Barriers
 
Ceph
CephCeph
Ceph
 
Ceph: Open Source Storage Software Optimizations on Intel® Architecture for C...
Ceph: Open Source Storage Software Optimizations on Intel® Architecture for C...Ceph: Open Source Storage Software Optimizations on Intel® Architecture for C...
Ceph: Open Source Storage Software Optimizations on Intel® Architecture for C...
 
Scale Up Performance with Intel® Development
Scale Up Performance with Intel® DevelopmentScale Up Performance with Intel® Development
Scale Up Performance with Intel® Development
 
High Frequency Trading and NoSQL database
High Frequency Trading and NoSQL databaseHigh Frequency Trading and NoSQL database
High Frequency Trading and NoSQL database
 
Performance Optimization of Deep Learning Frameworks Caffe* and Tensorflow* f...
Performance Optimization of Deep Learning Frameworks Caffe* and Tensorflow* f...Performance Optimization of Deep Learning Frameworks Caffe* and Tensorflow* f...
Performance Optimization of Deep Learning Frameworks Caffe* and Tensorflow* f...
 
Ceph Day Melbourne - Ceph on All-Flash Storage - Breaking Performance Barriers
Ceph Day Melbourne - Ceph on All-Flash Storage - Breaking Performance BarriersCeph Day Melbourne - Ceph on All-Flash Storage - Breaking Performance Barriers
Ceph Day Melbourne - Ceph on All-Flash Storage - Breaking Performance Barriers
 
In-Memory and TimeSeries Technology to Accelerate NoSQL Analytics
In-Memory and TimeSeries Technology to Accelerate NoSQL AnalyticsIn-Memory and TimeSeries Technology to Accelerate NoSQL Analytics
In-Memory and TimeSeries Technology to Accelerate NoSQL Analytics
 
Ibm cell
Ibm cell Ibm cell
Ibm cell
 
Ceph on Intel: Intel Storage Components, Benchmarks, and Contributions
Ceph on Intel: Intel Storage Components, Benchmarks, and ContributionsCeph on Intel: Intel Storage Components, Benchmarks, and Contributions
Ceph on Intel: Intel Storage Components, Benchmarks, and Contributions
 
Ceph on Intel: Intel Storage Components, Benchmarks, and Contributions
Ceph on Intel: Intel Storage Components, Benchmarks, and ContributionsCeph on Intel: Intel Storage Components, Benchmarks, and Contributions
Ceph on Intel: Intel Storage Components, Benchmarks, and Contributions
 
Using Deep Learning at Scale - Guhan Suriyanarayanan and Adi Oltean, Microsoft
Using Deep Learning at Scale - Guhan Suriyanarayanan and Adi Oltean, MicrosoftUsing Deep Learning at Scale - Guhan Suriyanarayanan and Adi Oltean, Microsoft
Using Deep Learning at Scale - Guhan Suriyanarayanan and Adi Oltean, Microsoft
 
The Microarchitecure Of FPGA Based Soft Processor
The Microarchitecure Of FPGA Based Soft ProcessorThe Microarchitecure Of FPGA Based Soft Processor
The Microarchitecure Of FPGA Based Soft Processor
 
Deep Dive: Maximizing Amazon EC2 and Amazon Elastic Block Store Performance
Deep Dive: Maximizing Amazon EC2 and Amazon Elastic Block Store PerformanceDeep Dive: Maximizing Amazon EC2 and Amazon Elastic Block Store Performance
Deep Dive: Maximizing Amazon EC2 and Amazon Elastic Block Store Performance
 
Deep Dive: Maximizing EC2 and EBS Performance
Deep Dive: Maximizing EC2 and EBS PerformanceDeep Dive: Maximizing EC2 and EBS Performance
Deep Dive: Maximizing EC2 and EBS Performance
 
Deep Dive: Maximizing Amazon EC2 and Amazon Elastic Block Store Performance
Deep Dive: Maximizing Amazon EC2 and Amazon Elastic Block Store PerformanceDeep Dive: Maximizing Amazon EC2 and Amazon Elastic Block Store Performance
Deep Dive: Maximizing Amazon EC2 and Amazon Elastic Block Store Performance
 
Scaling Streaming - Concepts, Research, Goals
Scaling Streaming - Concepts, Research, GoalsScaling Streaming - Concepts, Research, Goals
Scaling Streaming - Concepts, Research, Goals
 
Deep learning at the edge: 100x Inference improvement on edge devices
Deep learning at the edge: 100x Inference improvement on edge devicesDeep learning at the edge: 100x Inference improvement on edge devices
Deep learning at the edge: 100x Inference improvement on edge devices
 
Performance improvements in etcd 3.5 release
Performance improvements in etcd 3.5 releasePerformance improvements in etcd 3.5 release
Performance improvements in etcd 3.5 release
 
Age of Language Models in NLP
Age of Language Models in NLPAge of Language Models in NLP
Age of Language Models in NLP
 

More from Databricks

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
Databricks
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
Databricks
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
Databricks
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
Databricks
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
Databricks
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
Databricks
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
Databricks
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
Databricks
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
Databricks
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
Databricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
Databricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Databricks
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
Databricks
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Databricks
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
Databricks
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Databricks
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
Databricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
Databricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
Databricks
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
Databricks
 

More from Databricks (20)

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
 

Recently uploaded

My burning issue is homelessness K.C.M.O.
My burning issue is homelessness K.C.M.O.My burning issue is homelessness K.C.M.O.
My burning issue is homelessness K.C.M.O.
rwarrenll
 
一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单
ewymefz
 
The Building Blocks of QuestDB, a Time Series Database
The Building Blocks of QuestDB, a Time Series DatabaseThe Building Blocks of QuestDB, a Time Series Database
The Building Blocks of QuestDB, a Time Series Database
javier ramirez
 
Everything you wanted to know about LIHTC
Everything you wanted to know about LIHTCEverything you wanted to know about LIHTC
Everything you wanted to know about LIHTC
Roger Valdez
 
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
mbawufebxi
 
一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理
一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理
一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理
dwreak4tg
 
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...
pchutichetpong
 
原版制作(swinburne毕业证书)斯威本科技大学毕业证毕业完成信一模一样
原版制作(swinburne毕业证书)斯威本科技大学毕业证毕业完成信一模一样原版制作(swinburne毕业证书)斯威本科技大学毕业证毕业完成信一模一样
原版制作(swinburne毕业证书)斯威本科技大学毕业证毕业完成信一模一样
u86oixdj
 
Ch03-Managing the Object-Oriented Information Systems Project a.pdf
Ch03-Managing the Object-Oriented Information Systems Project a.pdfCh03-Managing the Object-Oriented Information Systems Project a.pdf
Ch03-Managing the Object-Oriented Information Systems Project a.pdf
haila53
 
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
ahzuo
 
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Subhajit Sahu
 
Adjusting primitives for graph : SHORT REPORT / NOTES
Adjusting primitives for graph : SHORT REPORT / NOTESAdjusting primitives for graph : SHORT REPORT / NOTES
Adjusting primitives for graph : SHORT REPORT / NOTES
Subhajit Sahu
 
Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)
TravisMalana
 
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
oz8q3jxlp
 
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
g4dpvqap0
 
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
John Andrews
 
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Subhajit Sahu
 
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
ahzuo
 
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
NABLAS株式会社
 
The affect of service quality and online reviews on customer loyalty in the E...
The affect of service quality and online reviews on customer loyalty in the E...The affect of service quality and online reviews on customer loyalty in the E...
The affect of service quality and online reviews on customer loyalty in the E...
jerlynmaetalle
 

Recently uploaded (20)

My burning issue is homelessness K.C.M.O.
My burning issue is homelessness K.C.M.O.My burning issue is homelessness K.C.M.O.
My burning issue is homelessness K.C.M.O.
 
一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单
 
The Building Blocks of QuestDB, a Time Series Database
The Building Blocks of QuestDB, a Time Series DatabaseThe Building Blocks of QuestDB, a Time Series Database
The Building Blocks of QuestDB, a Time Series Database
 
Everything you wanted to know about LIHTC
Everything you wanted to know about LIHTCEverything you wanted to know about LIHTC
Everything you wanted to know about LIHTC
 
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
 
一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理
一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理
一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理
 
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...
 
原版制作(swinburne毕业证书)斯威本科技大学毕业证毕业完成信一模一样
原版制作(swinburne毕业证书)斯威本科技大学毕业证毕业完成信一模一样原版制作(swinburne毕业证书)斯威本科技大学毕业证毕业完成信一模一样
原版制作(swinburne毕业证书)斯威本科技大学毕业证毕业完成信一模一样
 
Ch03-Managing the Object-Oriented Information Systems Project a.pdf
Ch03-Managing the Object-Oriented Information Systems Project a.pdfCh03-Managing the Object-Oriented Information Systems Project a.pdf
Ch03-Managing the Object-Oriented Information Systems Project a.pdf
 
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
 
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
 
Adjusting primitives for graph : SHORT REPORT / NOTES
Adjusting primitives for graph : SHORT REPORT / NOTESAdjusting primitives for graph : SHORT REPORT / NOTES
Adjusting primitives for graph : SHORT REPORT / NOTES
 
Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)
 
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
 
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
 
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
 
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
 
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
 
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
 
The affect of service quality and online reviews on customer loyalty in the E...
The affect of service quality and online reviews on customer loyalty in the E...The affect of service quality and online reviews on customer loyalty in the E...
The affect of service quality and online reviews on customer loyalty in the E...
 

How We Scaled Bert To Serve 1+ Billion Daily Requests on CPU

  • 1. How We Scaled BERT To Serve 1+ Billion Daily Requests on CPU Quoc N. Le, Data Scientist, Roblox Kip Kaehler, Engineering Manager, Roblox
  • 3. Deep Learning for Text Classification 4 ● Text Classification is a key capability on Roblox platform ● BERT is a deep learning model that has transformed the Natural Language Processing (NLP) landscape ● Performance (Precision/Recall Area Under Curve) of our text classifiers improved by 10 percentage points fine- tuning BERT versus classical machine learning
  • 4. Beyond Accuracy: Latency and Throughput 5 Latency: Speed of Request Analogy -> How long it takes for a single person to cross a bridge We required latency under 20ms Throughput: Completed Requests Per Second Analogy -> How many people can cross a bridge in a period of time We required over 50k requests per sec We want a short, wide bridge to maximize both Latency and Throughput in a realtime environment
  • 5. Our Deep Learning Tech Stack
  • 6. GPU vs CPU (for our application) Higher Throughput for Real-Time Inference Higher Throughput for Model Training ~GPU (TESLA V100) 10x faster at processing training examples than CPU, due to efficiency in doing large batch matrix operations OR ~CPU (Intel Xeon Scalable Processor) 5x more throughput than GPU due to CPU-specific optimizations and spreading real- time inference requests across cores (with latency < 20ms) (on cost equivalent hardware in 2020)
  • 7. Which Comes First? 8 Build an Accurate Model Make It Fast Choose a Known Fast Model Make It Accurate
  • 8. Know Your Quoc Le’s Quoc N. Le Quoc V. Le Has over 85k citations as an AI researcher according to Google Scholar Once got kicked out of the Boomtown Casino in Reno for counting cards in blackjack
  • 9. Our Scaling Playbook on CPU: Less Is More! 10 ❏ Smaller Model (Distillation) ❏ Smaller Inputs (Dynamic Inputs) ❏ Smaller Weights (Quantization) ❏ Smaller Number of Requests (Caching) ❏ Smaller Number of Threads per Core (Thread Tuning)
  • 10. Where We Started Smaller Model Smaller Inputs Smaller Weights Benchmarks Run on Intel Xeon Scalable Processors Baseline BERT
  • 12. Smaller Model (DistilBERT) Bert Base has 110m parameters DistilBert has 66m parameters Tradeoff: < 1% negative impact on “Accuracy” (PR AUC) Text Input .. .. .. .. .. Predict Layer Transformer Layer .. .. .. .. Transformer Layer Transformer Layer .. .. .. .. Transformer Layer Predict Layer Text Input Student Model (DistilBERT) Teacher Model (BERT) DistilBERT: https://arxiv.org/pdf/1910.01108.pdf .. Knowledge Distillation
  • 13. Smaller Inputs (Dynamic Shapes) Smaller Model Smaller Inputs Smaller Weights Benchmarks Run on Intel Xeon Scalable Processors Baseline BERT
  • 14. Smaller Inputs (Dynamic Shapes) Fixed Shape Inputs (Zero pad until all inputs have same shape) Dynamic Shape Inputs (Do not zero pad)
  • 16. Smaller Weights (Quantization) Image Credit: https://towardsdatascience.com/how-to- accelerate-and-compress-neural-networks-with- quantization-edfbbabb6af7 Tradeoff: < 1% negative impact on “Accuracy” (PR AUC) “Dynamic Quantization” One-liner
  • 18. Smaller Number of Requests to Model (Caching) Image Credit: https://peltarion.com/blog/data- science/illustration-3d-bert Text Classification Service Cache DistilBERT Model 1. Retrieve text classification result from cache (we’re done if it’s there) 2. Else call deep learning model for result 3. Add result to cache, then return result to service
  • 19. Smaller Number Threads Per Core (Thread Tuning)
  • 20. Our Scaling Playbook on CPU: Less Is More! 21 ✓ Smaller Model (Distillation) ✓ Smaller Inputs (Dynamic Inputs) ✓ Smaller Weights (Quantization) ✓ Smaller Number of Requests (Caching) ✓ Smaller Number of Threads per Core (Thread Tuning)
  • 21. 30x Improvement in Latency and Throughput on CPU Smaller Model Smaller Inputs Smaller Weights Benchmarks Run on Intel Xeon Scalable Processors Baseline BERT
  • 22. Takeaways 23 ● For certain real-time deep learning applications, it is feasible/natural to super- scale inferences on CPU ● The key to scaling is making things smaller, as shown in this presentation ● Many optimizations that enabled scale are easy to implement (one-liners) ● Check out our blog for more details: https://robloxtechblog.com/how-we-scaled- bert-to-serve-1-billion-daily-requests-on-cpus- d99be090db26
  • 23. Questions? Suggestions? 24 We are always looking to get more performance from our models. Please reach out to kkaehler@roblox.com PS We are always hiring 🤓
  • 24. Feedback Your feedback is important to us. Don’t forget to rate and review the sessions.