ICML'16 Scaling ML System@Twitter

ICML '16
Scaling Machine Learning @Twitter
Jack Guo

ML is Important
• ~80% of DAU is attributed to teams doing ML
• ~90% of revenue comes from ads backed by ML models
• ML platform supports many teams
– ads ranking, ads targeting, timeline ranking, anti-spam, recommendation, moments ranking, trends

ML is Large Scale
• Take ads ranking as an example
– Trillions of predictions made daily
– Hundreds of millions of weights per model
– Thousands of features per example
– TB of training data

ML is Realtime
• Twitter is all about realtime: news, events, videos, trends
• Advertiser campaigns target realtime events and hashtags, and can run for as little as a few hours or even minutes
• ML needs to adapt to dynamically changing traffic

Scaling Challenges
• Organization scaling
– How to support client teams efficiently?
• System scaling
– How to train and run inference efficiently?
– How to enable fast iteration and experimentation?

Organization Scaling
• ML platform’s focus
– Define feature, transform, and model formats
– Provide framework and tooling
• data ETL, trainers, parameter search, serving runtime, workflow management
– Onboard client teams and provide support
• Client team’s focus
– Define and extract features
– Own and maintain the training pipeline and serving runtime

Standardized Feature Format
• Enable feature sharing across teams
• Make ML platform iteration easy
• Feature format
– Support 4 dense and 2 sparse feature types
– Use hashed ids instead of string names for efficient serialization, storage, and compute
– Colocate the schema (id-to-name mapping) with the data
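
A minimal sketch of the hashed-id idea above, in Python for brevity. The MD5-based `feature_id` function, the 64-bit id width, and the feature names are assumptions made for illustration; the slides do not say which hash or record layout the platform uses.

```python
import hashlib


def feature_id(name: str) -> int:
    """Map a feature name to a stable 64-bit integer id (hypothetical scheme)."""
    digest = hashlib.md5(name.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def encode_record(named_features: dict) -> tuple:
    """Return (record keyed by hashed id, schema mapping id -> name)."""
    record, schema = {}, {}
    for name, value in named_features.items():
        fid = feature_id(name)
        record[fid] = value   # compact integer key is what gets stored and served
        schema[fid] = name    # schema travels with the data for readability
    return record, schema


# Illustrative feature names, not taken from the slides.
record, schema = encode_record(
    {"user.follower_count": 1200.0, "tweet.has_media": 1.0}
)

# The colocated schema lets tooling turn ids back into names.
readable = {schema[fid]: value for fid, value in record.items()}
```

Keeping the id-to-name mapping next to the data means a consumer never needs a separate registry lookup to inspect a record.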

TrainingAPI
• Make operations on distributed data painless for ML practitioners
• Scala data ETL API
– Provide powerful abstractions for ML datasets and operations
– Fluent API, enabling imperative programming
• Ensure data and metadata consistency through operations

Example
1. Take the dataset whose path is given by “input”
2. Randomly sample 10% of it
3. Discretize with the given discretizer
4. Left join with media label on tweet id
5. Dump the result to path given by “output”
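
The five steps above can be sketched as a fluent pipeline. This is not the Scala API from the talk: the `Dataset` class, its method names, the in-memory `store` dict standing in for a file system, and the field names are all hypothetical, and the sketch is in Python and runs on a single machine.

```python
import random
from bisect import bisect_right


class Dataset:
    """Hypothetical in-memory stand-in for a distributed dataset of dict records."""

    def __init__(self, records):
        self.records = list(records)

    @classmethod
    def read(cls, store, path):
        return cls(store[path])

    def sample(self, fraction, seed=0):
        # Keep each record independently with probability `fraction`.
        rng = random.Random(seed)
        return Dataset(r for r in self.records if rng.random() < fraction)

    def discretize(self, field, boundaries):
        # Replace a continuous value with the index of its bucket.
        return Dataset(
            {**r, field: bisect_right(boundaries, r[field])} for r in self.records
        )

    def left_join(self, other, key):
        # Keep every left record; add right-side fields when the key matches.
        lookup = {r[key]: r for r in other.records}
        return Dataset({**r, **lookup.get(r[key], {})} for r in self.records)

    def write(self, store, path):
        store[path] = self.records
        return self


store = {
    "input": [{"tweet_id": i, "score": i / 100} for i in range(100)],
    "media_labels": [{"tweet_id": i, "has_media": i % 2} for i in range(0, 100, 3)],
}

media_labels = Dataset.read(store, "media_labels")

(Dataset.read(store, "input")                            # 1. read "input"
    .sample(0.1, seed=42)                                # 2. sample 10%
    .discretize("score", boundaries=[0.25, 0.5, 0.75])   # 3. discretize
    .left_join(media_labels, key="tweet_id")             # 4. left join on tweet id
    .write(store, "output"))                             # 5. write "output"
```

Each method returns a new `Dataset`, which is what makes the chained, step-by-step style possible.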

PredictionEngine
• Large-scale online SGD learning
• Architecture
– Transform: MDL, decision tree
– Feature crossing
– Logistic regression: Vowpal Wabbit or in-house JVM learner
[Diagram: boxes labeled DataRecord, Transform (×3), Cross, Logistic Regression, DataRecord]
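
A rough sketch of the transform, cross, and logistic regression stages with online SGD updates. It is a toy single-process illustration in Python, not Vowpal Wabbit or the in-house JVM learner; the fixed bucket boundaries stand in for a learned MDL or tree-based discretizer, and all names and feature values are made up.

```python
import math
from collections import defaultdict


def discretize(features, boundaries):
    """Transform: turn each continuous feature into one binary bucket feature."""
    out = {}
    for name, value in features.items():
        bucket = sum(value >= b for b in boundaries.get(name, []))
        out[f"{name}#b{bucket}"] = 1.0
    return out


def cross(features, pairs):
    """Add a conjunction feature for every requested (left, right) pair."""
    out = dict(features)
    for a, va in features.items():
        for b, vb in features.items():
            if (a.split("#")[0], b.split("#")[0]) in pairs:
                out[f"{a}&{b}"] = va * vb
    return out


class OnlineLogisticRegression:
    def __init__(self, learning_rate=0.1):
        self.weights = defaultdict(float)
        self.learning_rate = learning_rate

    def predict(self, features):
        z = sum(self.weights[k] * v for k, v in features.items())
        z = max(-30.0, min(30.0, z))  # clamp to keep exp() well behaved
        return 1.0 / (1.0 + math.exp(-z))

    def update(self, features, label):
        # One SGD step on log loss: the gradient for weight k is (p - y) * x_k.
        error = self.predict(features) - label
        for k, v in features.items():
            self.weights[k] -= self.learning_rate * error * v


boundaries = {"age_days": [7, 30], "ctr": [0.01, 0.05]}
pairs = {("age_days", "ctr")}
model = OnlineLogisticRegression()


def featurize(raw):
    return cross(discretize(raw, boundaries), pairs)


clicked = {"age_days": 3, "ctr": 0.08}
ignored = {"age_days": 90, "ctr": 0.001}
for _ in range(200):
    model.update(featurize(clicked), 1.0)
    model.update(featurize(ignored), 0.0)
```

Because every example updates the weights as it arrives, the model can follow traffic that shifts within hours.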

PredictionEngine Optimization
• Reduce serialization cost
– Model collocation
– Batch request API
• Reduce compute cost
– Feature id instead of string name
– Transform sharing across models
– Feature cross done on the fly

PredictionEngine Optimization
• Training/Serving throughput
– Sharding for model updates
– Separation of training and prediction services
– Elastic load based on latency
• Realtime feedback
– Treat each ad impression as a non-click event
• Fault tolerance
– Snapshot the model at a fixed interval
– Anomalous traffic detection
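
The realtime feedback and snapshot bullets can be read as follows; this is one interpretation, not a description of the production system. The sketch assumes an impression is trained on immediately as a negative example, a later click is trained on as a positive one, and weights are copied every `snapshot_every` updates so they can be restored after a fault. Class and method names are hypothetical.

```python
import copy
import math


class RealtimeLearner:
    def __init__(self, learning_rate=0.1, snapshot_every=100):
        self.weights = {}
        self.learning_rate = learning_rate
        self.snapshot_every = snapshot_every
        self.updates = 0
        self.snapshot = {}

    def predict(self, features):
        z = sum(self.weights.get(k, 0.0) * v for k, v in features.items())
        z = max(-30.0, min(30.0, z))  # clamp to keep exp() well behaved
        return 1.0 / (1.0 + math.exp(-z))

    def _sgd_step(self, features, label):
        error = self.predict(features) - label
        for k, v in features.items():
            self.weights[k] = self.weights.get(k, 0.0) - self.learning_rate * error * v
        self.updates += 1
        if self.updates % self.snapshot_every == 0:
            self.snapshot = copy.deepcopy(self.weights)  # periodic snapshot

    def on_impression(self, features):
        # No need to wait for a click timeout: count the impression as a non-click.
        self._sgd_step(features, 0.0)

    def on_click(self, features):
        # A click that arrives later is applied as a positive example.
        self._sgd_step(features, 1.0)

    def restore(self):
        # Roll back to the last snapshot, e.g. after anomalous traffic.
        self.weights = copy.deepcopy(self.snapshot)
```

Training on the impression right away trades a small label bias for much fresher feedback.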

Tooling
• Hyperparameter autotuning
• Insight and interpretation
– Inspect data/model in human readable format
– Compute dataset stats
– Visualize tree model
• Feature selection tool
– Forward/backward greedy search

Work in progress
• Algorithm flexibility
– Large-scale Torch-based ML
• Better tooling
– Workflow management framework
– Visualization and interactive exploration
