dataxu bids on ads in real time on behalf of its customers at a rate of 3 million requests per second and trains on past bids to optimize future bids. Our system trains thousands of advertiser-specific models and runs on multi-terabyte datasets. In this presentation we share the lessons learned from our transition to a fully automated Spark-based machine learning system and how this has drastically reduced the time it takes to get a research idea into production. We'll also share how we:
- continually ship models to production
- train models in an unattended fashion with auto-tuning capabilities
- tune and overbook cluster resources for maximum performance
- ported our previous ML solution into Spark
- evaluate the performance of high-rate bidding models
Speakers: Maximo Gurmendez, Javier Buquet
5. dataxu: make marketing smarter through data science!
Diagram labels: Event data (Bids, Wins, Losses, Attributions); ML System; Bidding models; $3
6. Scale?
• 2 Petabytes Processed Daily
• 3 Million Bid Decisions Per Second
• Runs 24x7 on 5 Continents
• Thousands of ML Models Trained per Day
7. Goals of dataxu’s ML System
• Highly predictive
• Fast to bid (< 1 millisecond)
• Optimal use of training resources
• No downtime
• Always fresh models
• Unattended operation
• Self tuning
• Transparent
• Easy to deploy new algorithms
8. 9 years ago
Custom Hadoop Jobs (single pass)
Diagram labels: Campaign events; training data; f(x) (repeated once per model); Models used at bid time for each campaign
9. 4 years ago: Can we use Spark?
• Thread safe?
• Is it fast enough?
• Does it use too much memory?
• Do Spark models work well with our data?
• Is it expensive to train?
• Can we use its out-of-the-box ML algorithms?
10. Problem #1: Data Partitioning
1 sample pass + 1 write pass
Beware of the fat reducers!
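One possible reading of the "sample pass + write pass" idea, sketched in plain Python rather than Spark: the sample pass estimates how many rows each campaign has, and the write pass salts heavy campaigns across several buckets so that no single reducer gets a fat key. The function names, the `sample_every` rate and the `target_rows` threshold are illustrative assumptions, not details from the talk.

```python
import random
from collections import Counter

def plan_partitions(sample_keys, sample_every, target_rows):
    """Pass 1: from a 1-in-`sample_every` sample of campaign keys, estimate
    rows per key and decide how many buckets each key should be split into."""
    plan = {}
    for key, seen in Counter(sample_keys).items():
        estimated_rows = seen * sample_every
        # Integer ceiling division: heavy keys get several buckets.
        plan[key] = max(1, -(-estimated_rows // target_rows))
    return plan

def route(key, plan, rng=random):
    """Pass 2: return (key, salt); rows of a heavy key spread over its buckets."""
    buckets = plan.get(key, 1)  # keys missing from the sample are assumed small
    return key, rng.randrange(buckets)
```

In a Spark job the `(key, salt)` pair would typically serve as the partitioning key for the write pass; that wiring is left out here to keep the sketch self-contained.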
11. Problem #2: Spark models not ready for a low latency bidding setting
Spark Model:
Input:
Feature 1  Feature 2
1          0
1          1
0          1
Output:
Feature 1  Feature 2  Prediction
1          0          0.3
1          1          0.7
0          1          0.4
At bid time things are different…
Model Needed:
Input:
Feature 1  Feature 2
1          0
Output:
Feature 1  Feature 2  Prediction
1          0          0.3
Solution: Extended Spark with RowModels
12. Problem #2: Spark models not ready for a low latency bidding setting
Solution: Extended Spark with RowModels
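The slides do not show what a RowModel looks like, so the following is only a minimal sketch of the general idea: a scorer that takes one feature row and returns one prediction, with no DataFrame involved. The class name `LogisticRowModel` and the method `predict_row` are hypothetical; the weights could, for instance, be copied from the `coefficients` and `intercept` of a fitted Spark logistic regression model.

```python
import math
from typing import Sequence

class LogisticRowModel:
    """Hypothetical row-level scorer: one feature vector in, one probability out."""

    def __init__(self, weights: Sequence[float], intercept: float):
        # Stored as an immutable tuple so one instance can be shared by many threads.
        self.weights = tuple(weights)
        self.intercept = intercept

    def predict_row(self, features: Sequence[float]) -> float:
        if len(features) != len(self.weights):
            raise ValueError("feature vector has the wrong length")
        margin = self.intercept + sum(w * x for w, x in zip(self.weights, features))
        # Numerically stable sigmoid: never exponentiates a large positive number.
        if margin >= 0:
            return 1.0 / (1.0 + math.exp(-margin))
        e = math.exp(margin)
        return e / (1.0 + e)
```

Because scoring a row touches only plain floats, the per-request cost is a dot product, which is the kind of work that can plausibly fit in a sub-millisecond bidding budget.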
13. Problem #3: Categorical Features Encoding Slow
Spark Typical:
F1  F2
A   X
A   Y
B   Y
↓ StringIndexer
F1  F2  IX1
A   X   0
A   Y   0
B   Y   1
↓ StringIndexer
F1  F2  IX1  IX2
A   X   0    1
A   Y   0    0
B   Y   1    0
Instead:
F1  F2
A   X
A   Y
B   Y
↓ MultiTopK
F1  F2  IX1  IX2
A   X   0    1
A   Y   0    0
B   Y   1    0
Reference: Metwally, Agrawal, and El Abbadi, "Efficient computation of frequent and top-k elements in data streams"
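The talk does not show MultiTopK's implementation. As a rough illustration of the cited technique, here is a plain-Python sketch of the Space-Saving counter from Metwally et al., applied to every categorical column in a single pass. The function names and the `capacity` parameter are assumptions, and evicting with a linear `min` scan is a simplification of the paper's constant-time stream-summary structure.

```python
def update(counters, item, capacity):
    """Space-Saving update: keep at most `capacity` counters per column."""
    if item in counters:
        counters[item] += 1
    elif len(counters) < capacity:
        counters[item] = 1
    else:
        # Evict the smallest counter; the newcomer inherits its count plus one.
        victim = min(counters, key=counters.get)
        counters[item] = counters.pop(victim) + 1

def multi_top_k_index(rows, columns, k, capacity=1000):
    """Build a value -> index map for every column in one pass over `rows`."""
    summaries = {col: {} for col in columns}
    for row in rows:  # the single pass over the data
        for col in columns:
            update(summaries[col], row[col], capacity)
    index = {}
    for col, counters in summaries.items():
        # Most frequent value gets index 0; ties are broken alphabetically.
        ranked = sorted(counters, key=lambda v: (-counters[v], v))[:k]
        index[col] = {value: i for i, value in enumerate(ranked)}
    return index
```

On the three-row example from the slide this gives A→0, B→1 for F1 and Y→0, X→1 for F2, matching the IX1 and IX2 columns, while reading the data once instead of once per feature.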
14. Problem #4: Expensive to train
We were running one campaign at a time…
Observations:
• Some campaigns took hours, some a few minutes
• Some parts of training were IO bound, some CPU bound
• We observed cluster idleness between jobs
Solutions:
• Launch smart batches of jobs in parallel
• Carefully overbook the cluster resources, and do not use “maxResourceAllocation”
Result: 60% cheaper than legacy 1-pass Hadoop method!
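How the batches were actually chosen is not described in the slides. One simple way to think about "smart batches" with overbooking is first-fit-decreasing packing against a budget that is deliberately larger than the physical cluster, on the assumption that IO-bound and CPU-bound phases of different jobs overlap. The function name, the job tuples and the 1.5 overbooking factor below are illustrative assumptions.

```python
def plan_batches(jobs, cluster_cores, overbook=1.5):
    """Group (name, requested_cores) jobs into batches to launch in parallel.

    The budget exceeds the real core count on purpose, because jobs rarely
    use all of their requested cores at the same moment.
    """
    budget = cluster_cores * overbook
    batches = []
    # First-fit decreasing: place the largest jobs first.
    for name, cores in sorted(jobs, key=lambda job: -job[1]):
        for batch in batches:
            if batch["cores"] + cores <= budget:
                batch["names"].append(name)
                batch["cores"] += cores
                break
        else:
            batches.append({"names": [name], "cores": cores})
    return [batch["names"] for batch in batches]
```

With `overbook=1.0` the same jobs need more batches, which corresponds to the idle time between jobs noted in the observations above.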
15. Problem #5: How to switch systems?
Stage 1: Decorated Model
Diagram labels: Decorated Spark Bidding Model; Active Bidding Model; A/B tests; Spark model pulsed on that day
16. Problem #5: How to switch systems?
Stage 2: Selected Bidding Machine
Stage 3: Full Switch
17. Problem #5: How to switch systems?
Everything went smoothly? Not exactly!
• Reached S3 request limits upon deploy!
  • Rolled back
  • Implemented retries
  • Random waits
  • Back-offs & jitter
• Latencies not exposed in simulations
  • Rolled back
  • Deeper profiling with YourKit
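The retry, random-wait and back-off fixes listed above are commonly combined as exponential backoff with jitter. The sketch below is a generic version of that pattern, not dataxu's code; the function name, the default delays and the broad `except Exception` are assumptions made to keep it self-contained.

```python
import random
import time

def retry_with_backoff(operation, max_attempts=5, base_delay=0.1, max_delay=5.0,
                       sleep=time.sleep, rng=random.random):
    """Call `operation` until it succeeds, waiting longer after each failure."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:  # real code should catch only throttling errors
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            # The cap doubles on every failure, up to max_delay.
            cap = min(max_delay, base_delay * 2 ** attempt)
            # Jitter: a random wait below the cap, so that many machines
            # deployed at once do not all retry at the same instant.
            sleep(rng() * cap)
```

The `sleep` and `rng` parameters exist only so the waiting behaviour can be replaced in tests; a caller would normally pass just the operation, for example a function that downloads a model file.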
18. What about self-tuning, unattended operations?
Diagram labels: event data; trainer; model selector & calibrator; insights builder; manifest builder; Blackboard (S3): models, calibrations, insights, bidding manifests; Bidding machines
22. Outcomes
Benefits:
• Greater flexibility to adapt to new use cases
• Better overall performance
• Better reliability and upgrade path
• 50% less code
• 60% savings
Lessons:
• Spark can be used for serious production systems
• Some tweaks are needed, but we still get the benefits of the 3rd-party ML libraries
• There’s no test like a full live test!
• Gradual switchover, pulsing and vigilance protected our business from harm.
23. Thank You!
Spark + AI Summit
mgurmendez@dataxu.com
jbuquet@dataxu.com