The AdTech world was reinvented a few years ago with the advent of real-time auction technologies, known as Real-Time Bidding (RTB). These auctions make it possible to buy ad inventory impression by impression. For each user visit to a publisher's website, each advertiser can choose whether to display an ad and determine the maximum price it is willing to pay for the opportunity. Consequently, there is a growing need for automation and optimisation among the players connected to RTB exchanges, and many solutions rely on Machine Learning. The datasets involved are large (billions of lines per day) and evolve very quickly.
It is therefore challenging to train models every few hours so that production only uses up-to-date data. Furthermore, these models need to be easy to improve through feature selection and hyper-parameter tuning, which requires the ability to run both offline and online tests. In this talk, we explain in more detail why Machine Learning plays a key role in the AdTech industry and how Spark is used at Teads to train production models, evaluate them through A/B tests, and design new models according to specific offline metrics. We also cover the way we use those Machine Learning models in real-time production servers. The main takeaways for attendees are the architecture and implementation choices (custom model training framework, model serving, job scheduling and deployment, etc.) that work at scale for the addressed use cases.
9. #SAISML11
True Value for Video Branding
Assumption: the bidder is paid a fixed amount CPV if and only if the
video ad is watched entirely.
True Value?
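Under this assumption, the true value of an impression is the CPV multiplied by the probability that the video is watched to completion, which is exactly what the prediction model estimates. A minimal sketch (names are illustrative, not the production API):

```scala
// True value of an impression under the fixed-CPV assumption:
// the bidder is paid `cpv` only if the video is fully watched, so
// E[value] = cpv * P(complete view).
def trueValue(cpv: Double, pCompleteView: Double): Double =
  cpv * pCompleteView

// e.g. a 0.15 EUR CPV with a predicted 40% completion probability
val v = trueValue(0.15, 0.40) // ≈ 0.06 EUR per impression
```

The completion probability is the quantity the machine-learned model has to predict for every bid request.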
11. #SAISML11
Which constraints?
- Features: several hundred thousand modalities
- Training frequency: several times a day
- Number of predictions: one million per second
12. #SAISML11
Logistic regression vs. DNN
https://cloud.google.com/blog/big-data/2017/02/using-google-cloud-machine-learning-to-predict-clicks-at-scale
Training set: 23 days. Validation set: 1 day.
Validation metric: score = -LLH

Prediction model | Score = -LLH | Training time | Prediction complexity
Logistic regression | 0.1293 | 70 min | 40
Logistic regression with crosses | 0.1272 | 142 min | 86
Deep learning (11 layers x 1062 neurons), 1 epoch | 0.1257 | 26 h | 40x1062 + 10x1062^2 + 1062 = 11 321 982
Deep learning (11 layers x 1062 neurons), 3 epochs | 0.1250 | 78 h | 11 321 982
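The prediction-complexity column counts the multiplications needed for one inference. For the deep network, the figure reconstructs as follows (assuming a 40-dimensional input, as the 40x1062 term suggests):

```scala
val neurons = 1062
val hiddenLayers = 11
// input layer -> first hidden layer: 40 x 1062 weights
val inputOps = 40 * neurons
// 10 hidden-to-hidden transitions between the 11 hidden layers
val hiddenOps = (hiddenLayers - 1) * neurons * neurons
// last hidden layer -> single output neuron
val outputOps = neurons
val total = inputOps + hiddenOps + outputOps // 11321982
```

At one million predictions per second, this five-orders-of-magnitude gap in per-prediction cost is why logistic regression with crosses remains attractive despite its slightly worse score.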
19. #SAISML11
Why build our own stack?
No framework/library available back in 2015 satisfied all our needs.
MLlib limitations:
- GLM vector size limitations
- Model serving carries a Spark dependency
Decided to do it ourselves:
- Scala and Spark
21. #SAISML11
Model training
Different use cases in different teams:
- Not the same features
- Not the same model settings

Model training steps should be configurable by using a configuration object.
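One way to make the pipeline configurable is a single configuration object that each team instantiates with its own features and model settings. The fields below are a hypothetical sketch, not Teads' actual API:

```scala
// Hypothetical configuration object driving the training pipeline.
case class TrainingConfig(
  modelTag: String,                 // unique production identifier
  features: Seq[String],            // raw features to extract
  crosses: Seq[(String, String)],   // feature crosses to generate
  trainingDays: Int,                // size of the training window
  hyperParams: Map[String, Double]  // e.g. regularization strength
)

// A team-specific instantiation for view-through-rate prediction
val vtrConfig = TrainingConfig(
  modelTag     = "prod-team1-vtr-prediction",
  features     = Seq("country", "deviceType", "websiteId"),
  crosses      = Seq(("country", "deviceType")),
  trainingDays = 23,
  hyperParams  = Map("l2" -> 1e-6)
)
```

Each training step then reads its parameters from the config instead of hard-coding them, so the same pipeline code serves every team.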
27. #SAISML11
Model training deployment
Model name tag
- Unique identifier of a production model
Example:
prod-team1-vtr-prediction-2018-07-15
// Fetch the training configuration associated with a model tag
def getTrainingConfig(tag: String): TrainingConfig
// Fetch the latest trained predictor for a model tag
def getLatest(tag: String): Predictor
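A dated tag like the example above can be built by suffixing the base model identifier with the training date, which lets `getLatest` resolve the most recent version for a base tag. A hypothetical sketch:

```scala
import java.time.LocalDate

// Build a dated tag such as "prod-team1-vtr-prediction-2018-07-15".
// LocalDate.toString renders the ISO yyyy-MM-dd form.
def datedTag(baseTag: String, date: LocalDate): String =
  s"$baseTag-$date"

val tag = datedTag("prod-team1-vtr-prediction", LocalDate.of(2018, 7, 15))
// "prod-team1-vtr-prediction-2018-07-15"
```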
32. #SAISML11
Model serving
Implemented as a Scala library
- Dependency of the online system codebase

Fast and lightweight
- Prefer in-process code to a prediction service
- Avoid HTTP calls
- No Spark dependency
33. #SAISML11
Predictor
Configuration of model scoring.
Components:
- Feature transform
- Feature vectorization
- Trained model

Same feature generation as in training
- Vectors are the same online and offline
{
  "transform": [...],
  "vectorize": [...],
  "model": {
    "type": "Logistic",
    "intercept": -0.59580942,
    "slopes": {
      "type": "Sparse",
      "indices": [5385, 6271, ...],
      "values": [-0.002536684, 0.026973965, ...],
      "length": 2145483000
    }
  }
}
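Scoring a configuration like the one above amounts to a sparse dot product plus the intercept, passed through the sigmoid. A minimal, Spark-free sketch with hypothetical type names:

```scala
// Sparse weight vector: parallel arrays of indices and values.
final case class SparseSlopes(indices: Array[Int], values: Array[Double])

final case class LogisticModel(intercept: Double, slopes: SparseSlopes) {
  // Score a request given the indices of its active features.
  // Assumes all active features have value 1.0 (one-hot encoding).
  def predict(activeIndices: Array[Int]): Double = {
    val active = activeIndices.toSet
    var margin = intercept
    var i = 0
    while (i < slopes.indices.length) {
      if (active.contains(slopes.indices(i))) margin += slopes.values(i)
      i += 1
    }
    1.0 / (1.0 + math.exp(-margin)) // sigmoid
  }
}
```

Because only the non-zero weights are stored and touched, a prediction stays cheap even with a nominal vector length in the billions, which is what makes the "no HTTP call, no Spark" in-process design feasible.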
34. #SAISML11
Online feature transform
// Typed key holding the feature values used for one prediction
case class PredictionKey private (values: Vector[Any], schema: Type) {
  // Append a named value, converted through the Convert type class
  def add[T: Convert](name: String, t: T): PredictionKey = {...}
  // Raw access by position
  def value(idx: Int): Any = {...}
  // Typed access, by name or by position
  def get[T: Convert](name: String): T = {...}
  def get[T: Convert](idx: Int): T = {...}
}
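Usage might look like the following. This is a hypothetical sketch: it assumes a companion builder (here called `PredictionKey.empty`) and `Convert` instances for the usual primitive types, neither of which is shown in the slide:

```scala
// Build up the key feature by feature at request time...
val key = PredictionKey.empty
  .add("country", "FR")
  .add("deviceType", "mobile")
  .add("websiteId", 4217L)

// ...then read values back, typed, by name or by position.
val country: String = key.get[String]("country")
val website: Long   = key.get[Long](2)
```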
38. #SAISML11
Offline experiment
Internal tool to evaluate ML models on historical data
- Simplify the experiment process
- Formalize the testing protocol
- Keep track of the experiment results
- Wide range of metrics with confidence intervals
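For a per-example metric such as the log loss, a confidence interval can be obtained with a normal approximation over the validation set. A generic sketch, not the internal tool's actual implementation:

```scala
// 95% normal-approximation confidence interval for the mean of a
// metric computed per validation example (e.g. per-impression log loss).
// Returns (lower bound, mean, upper bound); assumes at least 2 samples.
def meanWithCI(samples: Array[Double]): (Double, Double, Double) = {
  val n = samples.length.toDouble
  val mean = samples.sum / n
  val variance = samples.map(x => (x - mean) * (x - mean)).sum / (n - 1)
  val halfWidth = 1.96 * math.sqrt(variance / n)
  (mean - halfWidth, mean, mean + halfWidth)
}
```

Reporting the interval rather than the point estimate makes it possible to tell a real model improvement from validation-set noise.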
39. #SAISML11
Offline experiment
[Diagram: rolling evaluation windows. Each iteration pairs a training set of fixed duration with the validation set that immediately follows it; the validation start date shifts forward from iteration 1 to iteration 2 to iteration 3.]
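The iteration scheme above can be generated as a list of rolling windows: each iteration trains on a fixed number of days of history and validates on the window that follows, with the validation start date shifted forward between iterations. A hypothetical sketch:

```scala
import java.time.LocalDate

case class Iteration(trainStart: LocalDate, trainEnd: LocalDate,
                     validationStart: LocalDate, validationEnd: LocalDate)

// Build `n` rolling train/validation windows, shifting the validation
// start date by `stepDays` between iterations.
def rollingWindows(firstValidationStart: LocalDate,
                   trainingDays: Int, validationDays: Int,
                   stepDays: Int, n: Int): Seq[Iteration] =
  (0 until n).map { i =>
    val vStart = firstValidationStart.plusDays((i * stepDays).toLong)
    Iteration(
      trainStart      = vStart.minusDays(trainingDays.toLong),
      trainEnd        = vStart.minusDays(1),
      validationStart = vStart,
      validationEnd   = vStart.plusDays((validationDays - 1).toLong)
    )
  }
```

Repeating the measurement over several shifted windows averages out day-of-week effects and gives the per-iteration samples needed for confidence intervals.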
40. #SAISML11
Experiment creation
Notebook as the main interface for experiments
- Jupyter notebook
- jupyter-scala kernel

Weighted metrics computation

Result projections
- Aggregate the result metrics by categories (day, business model, country, etc.)
Jupyter-scala kernel: https://github.com/jupyter-scala/jupyter-scala/