2. Who am I
• Lan
• Veteran hacker but new to the AD world
• someone who can make a computer do what he wants—whether the computer
wants to or not. (http://paulgraham.com/gba.html)
• ex-{Rakuten, GREE}
• Distributed Systems, Information Retrieval, ML
3. Today’s Talk
• DMP in SmartNews Ads
• #1. Prediction
• #2. Targeting
• Future Work & Summary
5. DMP in SmartNews Ads
• Private DMP (90%+ 1st-party data)
• Data Collection, Cleaning, Aggregation
• ID Mapping
• User Profiling
• User Clustering
• CTR / CVR Prediction
• Lookalike
• Custom Audience
6. DMP
[Architecture diagram. Components shown: AD delivery cluster, Video AD delivery cluster, AD tracker, Kinesis, AD log in S3, SmartNews log, DMP streaming, Hadoop, ML, analytics, models & targeting, and audience data in DynamoDB / RDB.]
Small company but not small data
• Article Meta > 200K/day
• Article x {read, share, read_related …}
• Channel x {subscribe, preview, view, …}
• Push, Live, Weather, Setting, …
• Survey result
• Audience Data > 14M (~5M MAU)
• AD Meta
• AD History
• AD Conversions
• AD Opt-out
• Managed/Compressed Data > 130TB
• Lookalike seeds
• ~1TB Data for training CTR prediction model
• > 1M unique features
• User Demographics
• Device
• Locations
• …
10. More than Ranking
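• When we run an AD auction
• eCPM (effective Cost per Mille) = 1000 x CTR (Click Through Rate) x CPC (Cost per Click), so ads are ranked by CTR x CPC
• Suppose we have
• CTR_ad1 = 0.05 > CTR_ad2 = 0.04 > CTR_ad3 = 0.03
• CPC_ad1 = 10 JPY, CPC_ad2 = 13 JPY, CPC_ad3 = 20 JPY (winner: 0.03 x 20 = 0.6 JPY per impression)
• but if we predict pCTR_ad1 = 0.2 (winner) > pCTR_ad2 = 0.1 > pCTR_ad3 = 0.03
• then ad1 wins while truly earning 0.05 x 10 = 0.5 JPY, and we lose 0.1 JPY of potential income per impression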
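The arithmetic on this slide can be replayed in a few lines. This is only a sketch: the CTR and CPC values are the ones from the slide, while the function and variable names are made up for illustration.

```python
# Values from the slide; names are illustrative only.
true_ctr = {"ad1": 0.05, "ad2": 0.04, "ad3": 0.03}
pred_ctr = {"ad1": 0.20, "ad2": 0.10, "ad3": 0.03}
cpc_jpy = {"ad1": 10.0, "ad2": 13.0, "ad3": 20.0}


def expected_income(ctr, ad):
    """Expected income per impression in JPY: CTR x CPC."""
    return ctr[ad] * cpc_jpy[ad]


def auction_winner(ctr):
    """Pick the ad with the highest CTR x CPC under the given CTR estimates."""
    return max(cpc_jpy, key=lambda ad: expected_income(ctr, ad))


best_ad = auction_winner(true_ctr)    # ranking by the true CTR
chosen_ad = auction_winner(pred_ctr)  # ranking by the (miscalibrated) predicted CTR

# Realized income always follows the true CTR, whatever was predicted.
lost = expected_income(true_ctr, best_ad) - expected_income(true_ctr, chosen_ad)
print(best_ad, chosen_ad, round(lost, 2))
```

Note that the predicted CTRs put ad1 above ad2 above ad3, just as the true CTRs do, yet the auction still picks the wrong winner. The predicted values need to be close to the true ones, not just ordered correctly.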
12. CTR Prediction v1
• Training and scoring daily
• One GBDT (Gradient Boosting Decision Tree) model per AD campaign
• using ~1 month of data
• Hundreds of small batches inside Hadoop Yarn
• Quick and simple
• developed in 1 month
• picks the best features for every campaign
• minutes to 1 hour for model training
• explainable tree models
• no need for AD features
• Same approach for CVR prediction (CPC / CVR = CPA (Cost Per Acquisition))
[Diagram: delivery results and user features are used to generate samples; on Yarn, many parallel batches each run sample → model → scoring, producing per-user predictions.]
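The "one model per campaign" idea could look roughly like the sketch below. It is not the production code. It assumes 0/1 user features and uses a tiny hand-rolled booster with depth-1 trees in place of a real GBDT library. The toy data and all names (`fit_boosted_stumps`, `train_per_campaign`, `campaign_a`) are hypothetical.

```python
import math
from collections import defaultdict


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_boosted_stumps(X, y, n_rounds=20, learning_rate=0.3, l2=1.0):
    """Gradient boosting for log loss with depth-1 trees over 0/1 features."""
    base_rate = min(max(sum(y) / len(y), 1e-6), 1 - 1e-6)
    base = math.log(base_rate / (1 - base_rate))
    scores = [base] * len(y)
    stumps = []
    for _ in range(n_rounds):
        probs = [sigmoid(s) for s in scores]
        best = None
        for j in range(len(X[0])):
            grad, hess = [0.0, 0.0], [0.0, 0.0]
            for row, label, p in zip(X, y, probs):
                grad[row[j]] += label - p    # negative gradient of log loss
                hess[row[j]] += p * (1 - p)  # second derivative of log loss
            gain = sum(g * g / (h + l2) for g, h in zip(grad, hess))
            if best is None or gain > best[0]:
                # Regularized Newton step for each of the two leaves.
                leaves = [g / (h + l2) for g, h in zip(grad, hess)]
                best = (gain, j, leaves)
        _, j, leaves = best
        stumps.append((j, leaves))
        scores = [s + learning_rate * leaves[row[j]] for s, row in zip(scores, X)]
    return base, learning_rate, stumps


def predict_ctr(model, row):
    base, learning_rate, stumps = model
    return sigmoid(base + sum(learning_rate * leaves[row[j]] for j, leaves in stumps))


def train_per_campaign(samples):
    """samples: iterable of (campaign_id, user_features, clicked)."""
    by_campaign = defaultdict(list)
    for campaign_id, features, clicked in samples:
        by_campaign[campaign_id].append((features, clicked))
    # One independent small batch per campaign.
    return {
        cid: fit_boosted_stumps([f for f, _ in rows], [c for _, c in rows])
        for cid, rows in by_campaign.items()
    }


# Toy delivery results: campaign_a is clicked by users with feature 0,
# campaign_b by users with feature 1.
samples = (
    [("campaign_a", [1, 0], 1)] * 8 + [("campaign_a", [1, 0], 0)] * 2
    + [("campaign_a", [0, 1], 1)] * 1 + [("campaign_a", [0, 1], 0)] * 9
    + [("campaign_b", [0, 1], 1)] * 8 + [("campaign_b", [0, 1], 0)] * 2
    + [("campaign_b", [1, 0], 1)] * 1 + [("campaign_b", [1, 0], 0)] * 9
)
models = train_per_campaign(samples)
```

Each campaign's model sees only user features, which matches the "no need for AD features" point. It also hints at the v1 weaknesses: nothing is shared between campaigns, and a new campaign has no model until it has delivery results.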
13. Metrics
• NE (Normalized Cross-Entropy)
• the average log loss per impression of the predicted CTR, divided by the average log loss per impression if the model always predicted the background (average) CTR
• https://facebook.com//download/321355358042503/adkdd_2014_camera_ready_junfeng.pdf
• AUC (Area under the ROC curve, AUROC)
• measure ranking quality
• others: Precision/Recall, ECS (Effective Catalog Size), CTR / CVR / Sales, etc.
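Both main metrics fit in a few lines of plain Python. This is a minimal sketch: the function names and toy data are illustrative, NE is taken as log loss divided by the log loss of always predicting the background CTR, and AUC is computed by direct pair counting, which is fine for small samples but quadratic in their size.

```python
import math


def log_loss(labels, probs):
    """Average log loss per impression."""
    eps = 1e-12
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1 - eps)  # avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(labels)


def normalized_entropy(labels, probs):
    """Log loss relative to always predicting the background CTR.

    Below 1.0 means the model beats the constant predictor.
    Assumes the labels contain both clicks and non-clicks.
    """
    background_ctr = sum(labels) / len(labels)
    return log_loss(labels, probs) / log_loss(labels, [background_ctr] * len(labels))


def auc(labels, probs):
    """Fraction of (click, non-click) pairs ranked correctly; ties count as half."""
    pos = [p for y, p in zip(labels, probs) if y == 1]
    neg = [p for y, p in zip(labels, probs) if y == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in pos for b in neg)
    return wins / (len(pos) * len(neg))


clicks = [1, 0, 1, 0]
pctr = [0.8, 0.3, 0.6, 0.7]
print(auc(clicks, pctr), normalized_entropy(clicks, pctr))
```

AUC only looks at the ordering, so multiplying every prediction by a constant leaves it unchanged, while NE moves. That is why the two are tracked together.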
14. Review of CTR Prediction v1
• Marked improvement, moderate AUC & NE
• But
• hard to tune as a whole
• hard to predict online (feature sets differ)
• latency for new campaigns
• relatively poor performance on new campaigns (cold start)
• loses the connections between campaigns, even for the same advertiser
• …
15. CTR Prediction v2
• One simple model for all campaigns
• AD features added
• Dynamic feature extraction
• All computation distributed
• GBDT + Logistic Regression
• Train once per day, score twice
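The slide does not say how GBDT and Logistic Regression are combined. One common reading, following the paper linked on the Metrics slide, is that each tree's leaf index becomes a categorical feature for the LR. The sketch below assumes that reading. The two "trees" are hand-written stand-ins for a trained GBDT, and every feature name and value is hypothetical.

```python
import math


# Hand-written stand-ins for trained GBDT trees (hypothetical):
# each maps a sample to a leaf index.
def tree_age(u):  # 3 leaves
    return 0 if u["age"] < 25 else 1 if u["age"] < 40 else 2


def tree_device_history(u):  # 4 leaves
    return 2 * (u["device"] == "ios") + (u["past_ad_clicks"] > 0)


TREES = [(tree_age, 3), (tree_device_history, 4)]


def leaf_one_hot(u):
    """One one-hot block per tree, concatenated: exactly one 1.0 per block."""
    vec = []
    for tree, n_leaves in TREES:
        block = [0.0] * n_leaves
        block[tree(u)] = 1.0
        vec.extend(block)
    return vec


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_lr(X, y, epochs=200, lr=0.5):
    """Plain SGD on log loss; returns (weights, bias)."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for x, label in zip(X, y):
            err = sigmoid(b + sum(wi * xi for wi, xi in zip(w, x))) - label
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
            b -= lr * err
    return w, b


def predict(model, u):
    w, b = model
    return sigmoid(b + sum(wi * xi for wi, xi in zip(w, leaf_one_hot(u))))


# Toy impressions: (sample, clicked).
impressions = [
    ({"age": 22, "device": "ios", "past_ad_clicks": 3}, 1),
    ({"age": 35, "device": "android", "past_ad_clicks": 1}, 1),
    ({"age": 51, "device": "ios", "past_ad_clicks": 0}, 0),
    ({"age": 29, "device": "android", "past_ad_clicks": 0}, 0),
    ({"age": 45, "device": "android", "past_ad_clicks": 2}, 1),
    ({"age": 23, "device": "ios", "past_ad_clicks": 0}, 0),
]
model = train_lr([leaf_one_hot(u) for u, _ in impressions], [c for _, c in impressions])
```

In this arrangement the trees do the feature crossing and thresholding, and the linear model on top stays cheap to retrain and score.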
16. About the Features
• >1M unique features, sparse
• GBDT provides great feature engineering
• (sometimes) feature engineering is a matter of intuition and trial and error
• demographic, device, location, reading interests…
• AD history is helpful
• Feature Hashing, Binarization & Discretization, …
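The three transforms named in the last bullet can be sketched as follows. This is an illustration rather than the actual pipeline: the bucket count, the age boundaries, and the field names (`device`, `prefecture`, `ad_clicks_30d`) are all assumptions.

```python
import bisect
import hashlib

N_BUCKETS = 2 ** 20  # assumed size, slightly above the ">1M unique features" on the slide


def hashed_index(name, value, n_buckets=N_BUCKETS):
    """Feature hashing: a stable bucket for a (name, value) pair.

    md5 is used instead of Python's built-in hash(), which is salted per process.
    """
    digest = hashlib.md5(f"{name}={value}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_buckets


def discretize(value, boundaries):
    """Discretization: index of the bin that value falls in (boundaries sorted)."""
    return bisect.bisect_right(boundaries, value)


def encode(user):
    """Sparse binary encoding: the set of active feature indices."""
    active = set()
    active.add(hashed_index("device", user["device"]))
    active.add(hashed_index("prefecture", user["prefecture"]))
    active.add(hashed_index("age_bin", discretize(user["age"], [20, 30, 40, 50])))
    # Binarization: a count becomes a yes/no feature.
    active.add(hashed_index("has_clicked_ad", int(user["ad_clicks_30d"] > 0)))
    return active


user = {"device": "ios", "prefecture": "Tokyo", "age": 34, "ad_clicks_30d": 2}
indices = encode(user)
```

Hashing keeps the feature space at a fixed size without maintaining a dictionary of every value ever seen. The cost is occasional collisions between unrelated features.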
20. Profiling Users by Statistics and ML
• Gender Prediction (precision: 0.90+), Age Prediction, …
• News Channel / Source Preference
• AD Slot Preference
• …
23. Lookalike Targeting
• Our solution
• Solve it as a classification problem
• Seed users as positive samples
• All targeting candidates as negative samples (w/ random sampling)
• based on Spark MLlib Logistic Regression
• 30%~50% higher CVR compared to normal targeting
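The formulation above (seeds as positives, randomly sampled candidates as negatives, logistic regression on top) can be sketched in plain Python. A small hand-written SGD logistic regression stands in for Spark MLlib here, and the two-dimensional features and user IDs are toy assumptions.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic_regression(X, y, epochs=300, lr=0.1):
    """Plain SGD on log loss; returns (weights, bias)."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for x, label in zip(X, y):
            err = sigmoid(b + sum(wi * xi for wi, xi in zip(w, x))) - label
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
            b -= lr * err
    return w, b


def lookalike_scores(features, seed_ids, n_negatives, rng):
    """Score every non-seed user by similarity to the seed audience.

    features: dict of user_id -> feature vector, covering seeds and candidates.
    """
    seeds = set(seed_ids)
    candidates = sorted(u for u in features if u not in seeds)
    negatives = rng.sample(candidates, n_negatives)  # random sample as negatives
    X = [features[u] for u in seed_ids] + [features[u] for u in negatives]
    y = [1] * len(seed_ids) + [0] * len(negatives)
    w, b = train_logistic_regression(X, y)
    return {
        u: sigmoid(b + sum(wi * xi for wi, xi in zip(w, features[u])))
        for u in candidates
    }


# Toy data: seeds and c1, c2 share one interest profile; c3..c10 have another.
features = {f"seed{i}": [1.0, 0.0] for i in range(4)}
features.update({"c1": [1.0, 0.0], "c2": [1.0, 0.0]})
features.update({f"c{i}": [0.0, 1.0] for i in range(3, 11)})

scores = lookalike_scores(
    features, ["seed0", "seed1", "seed2", "seed3"], n_negatives=6, rng=random.Random(0)
)
ranked = sorted(scores, key=scores.get, reverse=True)
```

Some sampled "negatives" will really be lookalikes (c1 and c2 here). Because they are a minority of the sample, users who resemble the seeds still score highest.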
25. Custom Audience
[Diagram: your service / app / site sends any custom event (S2S request, web beacon, etc.) to the SmartNews AD tracker; events feed an audience BloomFilter object that is updated every several minutes; the SmartNews AD delivery cluster uses it for AD targeting / delete targeting; the audience can also feed Lookalike targeting.]
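An audience kept as a Bloom filter could be sketched like this. The filter size, hash count, event names, and device IDs are assumptions, and "delete targeting" is read here as excluding members of the audience.

```python
import hashlib


class BloomFilter:
    """Compact set membership: no false negatives, rare false positives."""

    def __init__(self, n_bits=1 << 16, n_hashes=4):
        self.n_bits = n_bits  # must be a multiple of 8
        self.n_hashes = n_hashes
        self.bits = bytearray(n_bits // 8)

    def _positions(self, item):
        # Derive n_hashes bit positions by salting one strong hash.
        for seed in range(self.n_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.n_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))


def build_audience(events, event_name):
    """Rebuild the audience for one custom event from tracker events."""
    audience = BloomFilter()
    for device_id, name in events:
        if name == event_name:
            audience.add(device_id)
    return audience


# Hypothetical events sent by an advertiser's app.
events = [("device-1", "purchase"), ("device-2", "purchase"), ("device-3", "view_item")]
purchasers = build_audience(events, "purchase")

# Targeting delivers only to the audience; delete targeting skips it.
targeted = [d for d in ("device-1", "device-3") if d in purchasers]
excluded_from = [d for d in ("device-1", "device-3") if d not in purchasers]
```

A Bloom filter cannot drop a single member, so rebuilding the whole object on a schedule fits the "updating per several minutes" note in the diagram.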
29. Summary of My 1st SmartNews Year
• A challenging place. We're a startup, so we can move quickly and break things
• Learn from the industry leaders. Keep up the trial and error.
• Numbers don't lie. Don't trust your intuition over the numbers.
• But if you really doubt a number, look closely. There may be a bug hiding.