This document summarizes Jack Guo's talk on scaling machine learning at Twitter. It discusses how ML is critical to Twitter's business, supporting ads, recommendations, and other features. It also describes the challenges of scaling ML to Twitter's massive real-time data and traffic. The talk outlines Twitter's ML platform approach, which standardizes features, provides APIs for training and prediction, and supplies tools that help client teams build and optimize ML models at Twitter's scale.
2. ML is Important
• ~80% of DAU is attributed to teams doing ML
• ~90% of revenue comes from ads backed by ML models
• ML platform supports many teams
– ads ranking, ads targeting, timeline ranking, anti-spam, recommendation, moments ranking, trends
3. ML is Large Scale
• Take ads ranking as an example
– Trillions of predictions made daily
– Hundreds of millions of weights per model
– Thousands of features per example
– TB of training data
4. ML is Realtime
• Twitter is all about realtime: news, events, videos, trends
• Advertiser campaigns target realtime events and hashtags, and can run for as little as a few hours or even minutes
• ML needs to adapt to dynamically changing traffic
5. Scaling Challenges
• Organization scaling
– How to support client teams efficiently?
• System scaling
– How to train and run inference efficiently?
– How to enable fast iteration and experimentation?
6. Organization Scaling
• ML platform’s focus
– Define feature, transform, and model formats
– Provide framework and tooling
• data ETL, trainers, parameter search, serving runtime, workflow management
– Onboard clients and provide support
• Client team’s focus
– Define and extract features
– Own and maintain training pipeline and serving runtime
7. Standardized Feature Format
• Enable feature sharing across teams
• Make ML platform iteration easy
• Feature format
– Support 4 dense and 2 sparse feature types
– Use hashed ids instead of string names for efficient serialization, storage, and compute
– Collocate schema (id to name mapping) with the data
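The hashed-id idea with a collocated schema can be sketched as follows. This is a minimal illustration in Python, not Twitter's actual format: the hash function (BLAKE2b truncated to 64 bits), the JSON container, and the names `feature_id`, `write_dataset`, and `read_dataset` are all assumptions made for the example.

```python
import hashlib
import json


def feature_id(name: str) -> int:
    """Map a feature name to a stable 64-bit id (hash choice is an assumption)."""
    digest = hashlib.blake2b(name.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def write_dataset(records: list) -> str:
    """Serialize records keyed by hashed id, with the id-to-name schema alongside."""
    schema = {}
    rows = []
    for record in records:
        row = {}
        for name, value in record.items():
            fid = feature_id(name)
            # Refuse to write if two different names hash to the same id.
            if schema.setdefault(fid, name) != name:
                raise ValueError(f"hash collision between {schema[fid]!r} and {name!r}")
            row[fid] = value
        rows.append(row)
    # JSON turns the integer ids into string keys in both schema and rows.
    return json.dumps({"schema": schema, "rows": rows})


def read_dataset(blob: str) -> list:
    """Recover human-readable records using the schema stored with the data."""
    payload = json.loads(blob)
    schema = payload["schema"]
    return [{schema[fid]: value for fid, value in row.items()} for row in payload["rows"]]
```

Because the schema travels with the rows, a reader needs no external registry to turn ids back into names, while the rows themselves carry only fixed-width ids.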
8. TrainingAPI
• Make operations on distributed data painless for ML practitioners
• Scala data ETL API
– Provide powerful abstractions for ML datasets and operations
– Fluent API, enabling imperative programming
• Ensure data and metadata consistency through operations
9. Example
1. Take the dataset whose path is given by “input”
2. Randomly sample 10% of it
3. Discretize with the given discretizer
4. Left join with media label on tweet id
5. Write the result to the path given by “output”
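The five steps above can be sketched as a fluent pipeline. The talk describes a Scala API over distributed data, and its actual code is not part of this summary, so the following is a small in-memory Python stand-in. The `Dataset` class, its method names, the `STORE` dictionary that plays the role of the file system, and the column names are all hypothetical.

```python
import random
from bisect import bisect_right

STORE = {}  # hypothetical stand-in for distributed storage: path -> list of rows


class Dataset:
    """Rows plus a schema; every operation returns a new Dataset with both updated."""

    def __init__(self, rows, schema):
        self.rows = rows
        self.schema = schema  # list of column names

    @classmethod
    def read(cls, path):
        rows = STORE[path]
        return cls(rows, sorted({key for row in rows for key in row}))

    def sample(self, fraction, seed=0):
        rng = random.Random(seed)
        return Dataset([r for r in self.rows if rng.random() < fraction], self.schema)

    def discretize(self, column, boundaries):
        # Replace a continuous value with the index of the bucket it falls in.
        rows = [{**r, column: bisect_right(boundaries, r[column])} for r in self.rows]
        return Dataset(rows, self.schema)

    def left_join(self, other, key):
        index = {r[key]: r for r in other.rows}
        extra = [c for c in other.schema if c not in self.schema]
        rows = []
        for r in self.rows:
            match = index.get(r[key], {})
            # Unmatched left rows are kept, with None in the joined columns.
            rows.append({**r, **{c: match.get(c) for c in extra}})
        return Dataset(rows, self.schema + extra)

    def write(self, path):
        STORE[path] = self.rows
        return self


# Toy inputs so the pipeline can run end to end.
STORE["input"] = [{"tweet_id": i, "score": i / 100} for i in range(100)]
STORE["media_labels"] = [{"tweet_id": i, "media_label": "photo"} for i in range(0, 100, 2)]

(Dataset.read("input")                                        # 1. take the dataset
    .sample(0.10, seed=42)                                    # 2. sample 10%
    .discretize("score", boundaries=[0.25, 0.5, 0.75])        # 3. discretize
    .left_join(Dataset.read("media_labels"), key="tweet_id")  # 4. left join
    .write("output"))                                         # 5. write the result
```

Each method returns a new `Dataset` whose schema is updated alongside the rows, which is one simple way to keep data and metadata consistent through a chain of operations.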
11. PredictionEngine Optimization
• Reduce serialization cost
– Model collocation
– Batch request API
• Reduce compute cost
– Feature id instead of string name
– Transform sharing across models
– Feature cross done on the fly
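Several of these optimizations can be illustrated together in a short sketch: models collocated in one process, a batch request API, integer feature ids, a transform computed once and shared by all models, and crossed features derived at request time instead of being stored. The `PredictionEngine` class below is a hypothetical Python illustration, not the system from the talk. The linear-model scoring and the id-mixing scheme in `cross_id` are assumptions.

```python
import math
from typing import Dict, List, Tuple

FeatureVector = Dict[int, float]  # feature id -> value


def cross_id(id_a: int, id_b: int) -> int:
    """Derive an id for the cross of two features (mixing scheme is an assumption)."""
    return (id_a * 0x9E3779B97F4A7C15 + id_b) % (1 << 64)


class PredictionEngine:
    """Hosts several linear models in one process and scores requests in batches."""

    def __init__(self, models: Dict[str, FeatureVector], crosses: List[Tuple[int, int]]):
        self.models = models    # model name -> weights keyed by feature id
        self.crosses = crosses  # pairs of feature ids to cross at request time

    def _transform(self, example: FeatureVector) -> FeatureVector:
        features = dict(example)
        for id_a, id_b in self.crosses:
            if id_a in example and id_b in example:
                # The crossed feature is computed on the fly, never sent or stored.
                features[cross_id(id_a, id_b)] = example[id_a] * example[id_b]
        return features

    def predict_batch(self, examples: List[FeatureVector]) -> List[Dict[str, float]]:
        results = []
        for example in examples:
            features = self._transform(example)  # computed once, shared by all models
            scores = {}
            for name, weights in self.models.items():
                z = sum(weights.get(fid, 0.0) * value for fid, value in features.items())
                scores[name] = 1.0 / (1.0 + math.exp(-z))
            results.append(scores)
        return results
```

A caller sends one request carrying many examples and receives scores from every collocated model, so each example is serialized and transformed once rather than once per model.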
12. PredictionEngine Optimization
• Training/Serving throughput
– Sharding for model updates
– Separation of training and prediction services
– Elastic load based on latency
• Realtime feedback
– Treat an ad impression as a non-click event
• Fault tolerance
– Snapshot model every fixed interval
– Anomalous traffic detection
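The realtime feedback and snapshot ideas can be sketched with a small online learner. The slides do not specify the model, the update rule, or how a later click is reconciled with the earlier non-click, so everything here is an assumption: a logistic regression trained by stochastic gradient descent, a click applied as a separate positive update, and a snapshot interval counted in updates instead of wall-clock time. The class and method names are hypothetical.

```python
import math


class OnlineLogisticModel:
    """Online logistic regression with periodic weight snapshots (illustrative only)."""

    def __init__(self, learning_rate=0.1, snapshot_every=1000):
        self.weights = {}        # feature id -> weight
        self.learning_rate = learning_rate
        self.snapshot_every = snapshot_every
        self.updates = 0
        self.snapshots = []      # copies of the weights, oldest first

    def predict(self, features):
        z = sum(self.weights.get(fid, 0.0) * value for fid, value in features.items())
        z = max(-30.0, min(30.0, z))  # clamp to avoid overflow in exp
        return 1.0 / (1.0 + math.exp(-z))

    def _update(self, features, label):
        error = self.predict(features) - label
        for fid, value in features.items():
            self.weights[fid] = self.weights.get(fid, 0.0) - self.learning_rate * error * value
        self.updates += 1
        if self.updates % self.snapshot_every == 0:
            self.snapshots.append(dict(self.weights))

    def on_impression(self, features):
        # The impression is used immediately as a negative (non-click) example.
        self._update(features, 0.0)

    def on_click(self, features):
        # Assumption: a click that arrives later is applied as a positive example.
        self._update(features, 1.0)

    def restore_latest(self):
        """Roll back to the most recent snapshot, e.g. after a bad traffic burst."""
        if self.snapshots:
            self.weights = dict(self.snapshots[-1])
```

Learning from the impression right away means the model does not have to wait for a click that may never come, and the snapshots give a known-good state to fall back to if anomalous traffic corrupts the weights.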
13. Tooling
• Autotune hyperparameters
• Insight and interpretation
– Inspect data/model in human-readable format
– Compute dataset stats
– Visualize tree models
• Feature selection tool
– Forward/backward greedy search
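Forward greedy search can be sketched in a few lines. This is a generic version of the technique, not the tool from the talk. The function name, the `score` callback (which stands for training and evaluating a model on a feature subset), and the stopping parameters are assumptions.

```python
def forward_greedy_select(candidates, score, max_features=None, min_gain=0.0):
    """Grow a feature set one feature at a time, always adding the best next feature.

    `score(features)` is a caller-supplied function returning a quality metric
    (higher is better) for a model that uses the given list of features.
    """
    selected = []
    best = score(selected)
    remaining = list(candidates)
    while remaining and (max_features is None or len(selected) < max_features):
        # Evaluate every remaining feature as the next addition.
        trials = [(score(selected + [feature]), feature) for feature in remaining]
        top_score, top_feature = max(trials, key=lambda trial: trial[0])
        if top_score - best <= min_gain:
            break  # no candidate improves the metric enough
        selected.append(top_feature)
        remaining.remove(top_feature)
        best = top_score
    return selected, best
```

Backward search is the mirror image: start from the full feature set and repeatedly drop the feature whose removal hurts the metric least. Either direction costs one model evaluation per remaining candidate per round, which is why it benefits from platform-level tooling.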
14. Work in progress
• Algorithm flexibility
– Large-scale Torch-based ML
• Better tooling
– Workflow management framework
– Visualization and interactive exploration