Parameter Server Approach for Online Learning at Twitter

Parameter Server Approach for
Online Learning @ Twitter
Joe Xie, Yong Wang and Yue Lu
ML Infra Group, Ads Prediction Team
Oct 10, 2017

Outline
• Background
– Online learning
– Challenges
• Parameter Server Approaches
– v1.0 Decouple the training and prediction
– v2.0 Scale the training
– v3.0 Scale the model
• Future Directions

Twitter is Realtime
• Twitter is all about real-time: news, events, trends,
hashtags.
– Users interest and intent change in realtime.
– Context changes in realtime.
– New advertisers, new campaigns are added in realtime.
• ML is increasingly at the core of everything we build at
Twitter
– ML model dynamically adapts to changes spanning as short as a few
hours even minutes

Real time:
Time
Model
Data Stream
Prediction Stream
Time
Model
Data Stream
Prediction Stream
Online Learning Offline Learning
Learning Phase Training Phase Serving Phase
ReadWriteRead &
Write
Read &
Write

Real time – Online Learning
Architecture
Simple and efficient for Ads Prediction and
Moments Relevance production services

Challenges
• Network fanout
– The same traffic stream is sent many times over to each prediction
instance, wasting network bandwidth.
• Limit to training traffic size
–Online training throughput is currently limited by the capacity (CPU /
Network bandwidth) of a single mesos worker
• Limit to model size
– All model are hosted within the memory for each instance.

Model Architecture
Raw Features
Raw Features Feature Crosses Decision Tree
(e.g., XGBoost...)
Neural Network
(e.g., Torch,
TensorFlow...)
...
Distributed Large-scale Online Logistic Regression
(Parameter Server)
● Fully explore the feature interaction
w/o training latency constraint.
● The feature interactions don’t
change frequently historically.
● Flexible architecture with new model
structure & external machine
learning framework.

20X training data
- Parameter server v2.0 to scale the
training traffic
10X features+algo complexity
- Parameter server v3.0 to scale the
model size
10X prediction qps
- Parameter server v1.0 to decouple
the training and prediction requests
Parameter Server Approaches

Parameter Server v1.0
Training
Worker
Training
Traffic
Observation
Service
Observation
Service
Observation
Workers
Instance of
Prediction
Service
M
od
el
Instance of
Prediction
Service
M
od
el
Instance of
Prediction
ServicePrediction
Worker
Pull Model
Model
Model
Pull
Downsampling
Through
■ New architecture to decouple
the training / prediction services
into different clusters.
10X Prediction capacity
Higher Serving efficiency
Prediction
Requests
Updates
Downsampling

• Separated training service
–Take training traffic to generate incremental model update
• New observation service
– Consume incremental model update
– Evaluate training traffic for model quality assurance
• Separated prediction service
– Consume incremental model update
– Serve the prediction request

• Launched into ads engagement
prediction models.
– Mesos Efficiency: 40% reduction in CPU cores
required.
– Network Efficiency: 60% reduction in fan-out
messages required.

Parameter
Server
Mo
del
Instance of
Prediction
Service Mo
del
Training
Workers
Training
Traffic
Observation
Service
Observation
Service
Observation
Worker
NO downsamplingPull
Push/Pull
Instance of
Prediction
Service
M
od
el
Instance of
Prediction
Service
M
od
el
Instance of
Prediction
Service
M
od
el
Instance of
Prediction
Service
M
od
el
M
od
el
Instance of
Prediction
ServicePrediction
Workers
Pull
Model
ModelModel
Model
Through
■ New architecture to
distribute the training
20X Training data
Higher model quality
Dispatch
Workers
Dispatch
Workers
Dispatch
Workers
Downsampling
Prediction
Requests

• New dispatch service
–Take un-sampled training traffic and dispatch to training service
• Updated training service
–Take training traffic and produce updates for parameter service
–Receive model update from parameter service
• New parameter service
– Aggregate the updates from training services
– Send model update to training / observation / prediction services

• Launched into ads engagement
prediction models.
• First version using simple model-average
aggregation.
–20x training capacity
–xx% model quality gain

Mo
del
Instance of
Prediction
Service Mo
del
Training
Workers
Training
Traffic
Observation
Service
Observation
Service
Observation
Worker
NO downsamplingPull
Push/Pull
Instance of
Prediction
Service
M
od
el
Instance of
Prediction
Service
M
od
el
Instance of
Prediction
Service
M
od
el
Instance of
Prediction
Service
M
od
el
M
od
el
Instance of
Prediction
ServicePrediction
Workers
Pull
Model
ModelModel
Model
Dispatch
Workers
Dispatch
Workers
Dispatch
Workers
Downsampling
Prediction
RequestsParameter
Server
Parameter
Server
Parameter
Server
Model
Through
■ New architecture for
model / feature sharding
More complex model
Higher model quality

• Updated parameter service (In progress)
–Model sharding: Parameter instance hosts single model instead of
multiple models.
•xx% model quality gain in experimentation.
–Feature sharding: Parameter instance hosts partial of single model.

Parameter Server Approach for Online Learning at Twitter

Parameter Server Approach for Online Learning at Twitter

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Viewers also liked

Viewers also liked (6)

Similar to Parameter Server Approach for Online Learning at Twitter

Similar to Parameter Server Approach for Online Learning at Twitter (20)

Recently uploaded

Recently uploaded (20)

Parameter Server Approach for Online Learning at Twitter