A General introduction to Ad ranking algorithms

Beyond Ad Selection to Automation
Jeong, Buhwan
https://brunch.co.kr/@jejugrapher

Nothing is certain but death and taxes.
“AD”
- Benjamin Franklin

Audience
SSP DSP
DMP
Publisher
Audience Tracking
(MAT/SDK/Pixel) Transaction log (train)
Audience Info. (target)
Log
Visit Ad
Inventory
Ad Selection
- Filtering
- Ranking
- Pricing
Mediation
(Auction)
Log
Traffic Req4Bid
Advertiser
Impression Bid (AD)
Data

SSP DSP
RANKr
DSP
SSP
Inventory
DSPs
ADs
Req4Bid
Abusing/HideAds
User, Inventory, RP Live, Budget
Inventory (Size, format)
Targeting: U, T, P
User, Ads
UserInfo
Top Ads by eCPM
eCPM (BA & pCTR)
pCVR, C/G & Cutoff
Frequency/Recency
Duplicate
Top 1 Ad
Auction
DSPs
DSPs
DSPs

SSP AdServer
DSP AdServer
Targeting
Candidate Gen.
Ranking
Quality Control
DSP AdServer
SSP AdServer
Ranker
Reserve price, feedback (HideAds), abusing
On-live, budget, inventory (format, size), time
Adv-set user segment → Automatic (LAL)
Historical User-Ad interaction & similarity
eCPM = BA * pCTR [ * pCVR ]
Cut-off: eCPM, pCTR, pCVR, BA
Frequency capping, implicit feedback
Auction (RP, Hard/Soft bid floor)
SSP
DSP
DSP
DSP
From millions to one

Effective Cost Per Mille (eCPM)

User: M, 30, Riding, Travel

Ad  Advertiser        Bid              Type
A   Riding academy    1,000 / mille    CPM
B   Sports wear mall  100 / click      CPC
C   Bicycle shop      10,000 / acqs.   CPA

Ad  BA      ChargeRate (CTR/CVR)  eCPM (BA * CHR * 1,000)
A   1       100%                  1,000
B   100     1.2%                  1,200
C   10,000  0.011%                1,100

Type  Impression (1,000)  Click  Conversion  CHR
CPM   1,000               -      -           100%
CPC   1,200               100    -           1.2%
CPA   1,100               -      10,000      0.011%
eCPM: an estimated revenue per 1,000 impressions
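
The eCPM arithmetic for the three example ads can be reproduced with a short sketch. The function name `ecpm` and the dictionary layout are illustrative choices, not part of the original slides; the bid amounts and charge rates are the ones listed above.

```python
def ecpm(bid_amount: float, charge_rate: float) -> float:
    """Estimated revenue per 1,000 impressions: BA * CHR * 1,000."""
    return bid_amount * charge_rate * 1000

# (bid amount per chargeable event, probability that the event happens per impression)
ads = {
    "A (CPM)": (1, 1.0),           # 1,000 per mille = 1 per impression, always charged
    "B (CPC)": (100, 0.012),       # 100 per click, 1.2% click rate
    "C (CPA)": (10_000, 0.00011),  # 10,000 per acquisition, 0.011% conversion rate
}

scores = {name: ecpm(ba, chr_) for name, (ba, chr_) in ads.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

Although the three ads are priced on different events, the single eCPM value puts B ahead of C and A.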

eCPM: Single Comparison Metric
(Estimated Traffic Value)
Ranking
(Order by eCPM desc)
Charging
(Second price / GSP)
Bidding
(SSP margin)
&
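
A minimal sketch of how eCPM can drive both ranking and charging for CPC ads. It assumes one common generalized second-price (GSP) formulation, in which the winner pays per click the smallest bid that would still keep its eCPM above the runner-up; the slides do not spell out the exact rule, and the function and tuple layout are hypothetical.

```python
def gsp_charge_per_click(ads):
    """ads: list of (name, bid_per_click, pctr).

    Returns (winner_name, price_per_click) under a GSP-style rule.
    """
    ranked = sorted(ads, key=lambda a: a[1] * a[2] * 1000, reverse=True)
    winner = ranked[0]
    if len(ranked) == 1:
        # Assumption: with no competitor the winner pays its own bid.
        return winner[0], winner[1]
    runner_up_ecpm = ranked[1][1] * ranked[1][2] * 1000
    # Price per click that makes the winner's eCPM equal to the runner-up's eCPM.
    price = runner_up_ecpm / (winner[2] * 1000)
    return winner[0], price

winner, price = gsp_charge_per_click([("X", 100, 0.02), ("Y", 150, 0.01)])
print(winner, price)
```

Here X wins with the lower bid because its pCTR is higher, and it is charged less than its own bid.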

eCPM = BA * pCHR * 1,000
(pCTR)

Why Accurate pCTR?
- Correct ChargeAmount
- Wrong Ranking (pCTR < CTR)
- Reverse Margin (pCTR > CTR)

Leave (y = 0) Click (y = 1)
X
Traffic properties (ADxUSRxPLx…)

Pr(y = 1 | x)
Aggregation of historical data
Learning from historical data
Reactive method vs Predictive method

Segment Decision Tree
Logistic
Regression
FM/FFM DNN
Counting (hCTR) Prediction (pCTR)
Few Raw Embedding (DimRed)
Interaction & Latent
Deep & Wide

Logistic Regression
Pace, interpretability, ..

Linear Regression
(Minimizing MSE loss)
Logistic Regression
(Minimizing NLL loss)
0
1

More likely to click
Logistic Regression
(Maximum entropy)
Sum of traffic properties
Less likely to click
$\Pr(y = 1 \mid x) = \frac{1}{1 + \exp(-w^\top x)}$
Softmax of binary (1/0) output

$\Pr(y = 1 \mid x) = \frac{1}{1 + \exp(-w^\top x)}$
Loss = $|y - \hat{y}|$
$y$: observed label, $\hat{y}$: predicted probability

Find w that minimizes the negative log likelihood (w/ L2 regularization)
Control model complexity
NLL for logistic regression
$\arg\min_{w} \sum_{i=1}^{n} \log\left(1 + \exp(-y_i w^\top x_i)\right) + \frac{\lambda}{2} \lVert w \rVert_2^2$
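
The objective above can be evaluated directly. This sketch assumes labels in {-1, +1}, which is the convention under which the log(1 + exp(-y w^T x)) form is the logistic negative log likelihood; the function name and the toy data are illustrative.

```python
import math

def nll_l2(w, xs, ys, lam):
    """Logistic NLL with L2 penalty.

    xs: list of feature vectors, ys: labels in {-1, +1}, lam: regularization strength.
    """
    loss = 0.0
    for x, y in zip(xs, ys):
        margin = y * sum(wj * xj for wj, xj in zip(w, x))
        loss += math.log1p(math.exp(-margin))
    return loss + 0.5 * lam * sum(wj * wj for wj in w)

xs = [[1.0, 0.0], [0.0, 1.0]]
ys = [1, -1]
loss_zero = nll_l2([0.0, 0.0], xs, ys, lam=0.0)   # uninformed model
loss_good = nll_l2([1.0, -1.0], xs, ys, lam=0.0)  # weights that agree with the labels
loss_reg = nll_l2([1.0, -1.0], xs, ys, lam=0.1)   # same weights, with the penalty added
```

Weights that agree with the labels lower the loss, while the penalty term adds a cost that grows with the size of the weights.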

Stochastic Gradient Descent (SGD)
[Figure: one SGD step on the loss surface, from $w_t$ to $w_{t+1}$ along gradient $g_t$ with learning rate $\eta_t$]
Loss/Cost function (w)
(Global) minimum
(Local) minimum
$\eta_t = \frac{\alpha}{\beta + \sum_{s=1}^{t} g_s^2}$
$w_{t+1} = w_t - \eta_t g_t$
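
A minimal sketch of SGD for logistic regression with a learning rate that decays with accumulated squared gradients. Two assumptions are made here: the rate is tracked per coordinate, and a `use_sqrt` switch is exposed because the per-coordinate rate in the "view from the trenches" paper takes the square root of the sum, whereas the formula on this slide is written with the plain sum. All names and the toy data are illustrative.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sgd_adaptive(data, dim, alpha=0.1, beta=1.0, epochs=5, use_sqrt=True):
    """data: list of (x, y) with x a dense list of floats and y in {0, 1}."""
    w = [0.0] * dim
    g2_sum = [0.0] * dim  # running sum of squared gradients per coordinate
    for _ in range(epochs):
        for x, y in data:
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
            for i, xi in enumerate(x):
                if xi == 0:
                    continue  # gradient is zero for inactive features
                g = (p - y) * xi  # gradient of the log loss for coordinate i
                g2_sum[i] += g * g
                accum = math.sqrt(g2_sum[i]) if use_sqrt else g2_sum[i]
                eta = alpha / (beta + accum)
                w[i] -= eta * g
    return w

toy = [([1.0, 0.0], 1), ([0.0, 1.0], 0)] * 20
w_sqrt = sgd_adaptive(toy, dim=2)
w_plain = sgd_adaptive(toy, dim=2, use_sqrt=False)
```

In both variants the feature that co-occurs with clicks gets a positive weight and the other a negative one; only the decay speed of the step size differs.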

FTRL-Proximal (Online)
Follow-the-leader
Proximal (stability)
Regularization (sparsity)
Reference: Ad click prediction: a view from the trenches
$w_{t+1} = \arg\min_{w} \left( g_{1:t} \cdot w + \frac{1}{2} \sum_{s=1}^{t} \sigma_s \lVert w - w_s \rVert_2^2 + \lambda_1 \lVert w \rVert_1 \right)$
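
The update above has a closed-form per-coordinate solution. The sketch below follows the per-coordinate algorithm given in the referenced paper, restricted to binary features and including that paper's additional L2 term; the class name, default hyperparameters and toy data are illustrative assumptions.

```python
import math

class FTRLProximal:
    """Per-coordinate FTRL-Proximal for logistic regression on binary features."""

    def __init__(self, dim, alpha=0.1, beta=1.0, l1=1.0, l2=1.0):
        self.alpha, self.beta, self.l1, self.l2 = alpha, beta, l1, l2
        self.z = [0.0] * dim  # accumulated (adjusted) gradients
        self.n = [0.0] * dim  # accumulated squared gradients

    def weight(self, i):
        z = self.z[i]
        if abs(z) <= self.l1:
            return 0.0  # L1 keeps the coordinate at exactly zero (sparsity)
        sign = 1.0 if z > 0 else -1.0
        denom = (self.beta + math.sqrt(self.n[i])) / self.alpha + self.l2
        return -(z - sign * self.l1) / denom

    def predict(self, active):
        """active: indices of the features that are 1 for this request."""
        score = sum(self.weight(i) for i in active)
        return 1.0 / (1.0 + math.exp(-score))

    def update(self, active, y):
        p = self.predict(active)
        g = p - y  # log-loss gradient for every active binary feature
        for i in active:
            w_i = self.weight(i)  # weight before z and n are changed
            sigma = (math.sqrt(self.n[i] + g * g) - math.sqrt(self.n[i])) / self.alpha
            self.z[i] += g - sigma * w_i
            self.n[i] += g * g
        return p

model = FTRLProximal(dim=3)
for _ in range(100):
    model.update([0], 1)  # feature 0 is always clicked
    model.update([1], 0)  # feature 1 is never clicked
```

Feature 2 is never observed, so its weight stays exactly zero, which is the sparsity effect the L1 term is meant to provide.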

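[Figure: a one-hot user vector X (USR) is scored against the weight matrix W (UxA), which has one column per ad: AD1, AD2, AD3, AD4, AD5, AD6, …]
User features (one-hot): M, F; 10, 20, 30, 40, 50+; SC1, SC2, SC3, …; PF1, PF2, PF3, …
Example for AD3, with M, 40, SC2, PF1 and PF3 active:
$w^\top x = w_{3M} + w_{340} + w_{3SC2} + w_{3PF1} + w_{3PF3}$
$\Pr(y = 1 \mid X) = \sigma(w^\top x) = \frac{1}{1 + \exp(-w^\top x)}$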
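
With one-hot features, the dot product reduces to summing the weights of the active features. A small sketch, where the weight values are made up for illustration and the feature names mirror the slide:

```python
import math

# Hypothetical weights of one ad (AD3) for each user feature value.
weights_ad3 = {"M": 0.2, "F": -0.1, "40": 0.1, "SC2": -0.3, "PF1": 0.4, "PF3": 0.05}

def predict_click(weights, active_features):
    """Sum the weights of the active one-hot features, then apply the sigmoid."""
    score = sum(weights.get(f, 0.0) for f in active_features)
    return 1.0 / (1.0 + math.exp(-score))

user = ["M", "40", "SC2", "PF1", "PF3"]
p_user = predict_click(weights_ad3, user)
p_empty = predict_click(weights_ad3, [])
```

Only the five active weights are touched, which is why sparse one-hot encodings stay cheap at serving time even with millions of columns.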

Any data (log) but not private
- Estimation
- Encapsulation/Abstraction
- k-Anonymity
confidential

Curse of Dimensionality
Millions of features and cardinality
Incapable (memory)
Speed
Sparsity
Over millions to billions of sparse encoding

User
Creative, subscription, KWD, …
PCA / AE
Clustering
Hashing trick
Random Projection
SVD / [N/B]MF
LDA (topic modeling)
W2V / Glove
Contrastive Learning
Dim. Reduction
Embedding vector
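
Of the techniques listed, the hashing trick is the simplest to show in a few lines. This sketch maps arbitrary string features into a fixed number of buckets; the use of MD5, the bucket count and the feature strings are illustrative assumptions.

```python
import hashlib

def hash_features(tokens, num_buckets=2**20):
    """Map raw string features to a sparse {bucket_index: count} vector."""
    vec = {}
    for tok in tokens:
        digest = hashlib.md5(tok.encode("utf-8")).digest()
        idx = int.from_bytes(digest[:8], "big") % num_buckets
        vec[idx] = vec.get(idx, 0) + 1  # colliding features share a bucket
    return vec

tokens = ["gender=M", "age=30", "interest=riding"]
vec = hash_features(tokens, num_buckets=1024)
```

The dimension is fixed up front regardless of feature cardinality, at the cost of occasional collisions.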

Registration #
Activity/Service Log
Gender, Age Far far ago
Naive Bayes (GA)
Ad Feedback (Click)
Mapping & Counting (Interest)
Clustering (k-means)
Topic Modeling (LDA)
FM & DNN
Subscription (Channel)

Feature Embedding with Dimensionality Reduction
• Reliability / Speed / Scalability
• Robustness (+) vs Information loss (-)
• Abstraction (anonymity) (+) vs Less interpretability (-)
Lessons learned
• 30 to 50 topics are enough
• Multiple sources in one embedding? Does not work properly
• How to retain the previous dimension structure (topic semantics)?
- Syntactic hashing (short term) and re-training (long term)

[Figure: RIG (Relative Information Gain) and LogLoss by # Topics (baseline, 10, 20, 30, 40, 50)]

Prediction Layer
Embedding Layer 2
Soft max = Logistic Regression
Deep Aggregate Embedding
(Dimensionality reduction / projection)
Embedding for each feature
(Raw data to numerical vectors)
Embedding Layer 1

𝛔
Prediction
Pr(Y = 1| X)
Deep & Cross Embedding
Primitive Embedding
Demography
AD response
Subscription
AD
Pooling & Concat.

https://paperswithcode.com/sota/click-through-rate-prediction-on-criteo
DCNv3 GDCN FinalMLP

Two-stream model (W&D, S&C)
- Feature interaction (LR → FM → DNN)
- Fusion (Ensemble, MoE)

Accuracy
Interpretability
Speed/latency
Economic feasibility
Security / Privacy
…
VS

Research / Academia Production / Industry
Maximize Accuracy Maximize f(I, S, E, …)
subject to
Accuracy > X
Reliability & Robustness

- Scale up & out
- Slim model
- Simple architecture
- Few #hidden layers & nodes
- Limited features —> incremental model
- Starport (C++) (vs deployment time)
- Candidate generation
- Hybrid (Off-Heavy + On-Light)
Training Time
-> Model update delay
-> Lack of recency
Inference Time
-> Time-out (No Ad)

Daily traffic: 1,000,000
Avg(eCPM): 2,000
Conversion/Traffic: 0.01%
Daily budget: 1,000,000
Avg(pCTR): 1%
BAcpc: 100? 200? 500?
Ryan LLC
RUN with RYAN

                                        A                         B
BidAmount (BA)                          100                       500
pCTR                                    1% (0.01)                 1% (0.01)
eCPM (1,000 * BA * pCTR)                1,000                     5,000
                                        = 1,000 * 100 * 0.01      = 1,000 * 500 * 0.01
Expected WinRate                        10%                       90%
Expected impression (Traffic * winRate) 100,000                   900,000
Spending (Budget: 1,000,000)            100,000                   4,500,000
                                        = 100,000 * 0.01 * 100    = 900,000 * 0.01 * 500
[Avg. eCPM = 2,000]
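
The comparison of the two bid amounts can be recomputed with a short sketch. The win rates are taken as given inputs, as on the slide; the function and key names are illustrative.

```python
def simulate_bid(bid, pctr, daily_traffic, win_rate, budget):
    """Expected daily outcome of a fixed CPC bid."""
    impressions = daily_traffic * win_rate
    spending = impressions * pctr * bid  # CPC: charged per click
    return {
        "ecpm": 1000 * bid * pctr,
        "impressions": impressions,
        "spending": spending,
        "exceeds_budget": spending > budget,
    }

a = simulate_bid(bid=100, pctr=0.01, daily_traffic=1_000_000, win_rate=0.10, budget=1_000_000)
b = simulate_bid(bid=500, pctr=0.01, daily_traffic=1_000_000, win_rate=0.90, budget=1_000_000)
```

Bid A leaves most of the budget unspent, while bid B would need 4.5 times the budget, so B exhausts its budget well before the day ends.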

Budget
Time
1,000,000
900,000
Impressions: 100,000
Conversions: 10
Conversions: 22
A
B
What is the optimal BA?
BA = 200?
Conversions: 50+
00:00 24:00

Landscape Forecasting
Budget Smoothing
Traffic Selection
Pacing & Control
Historical data, ARIMA, Prophet

(LinBid) BA = BAbase * Util(Response)
- pCTR(UxA) / pCTR(A)
- pCVR(UxA) / pCVR(A)
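
A minimal sketch of the linear bidding rule, using the pCTR ratio as the utility term. The cap `max_multiplier` is an added assumption to keep a single request from producing an extreme bid; it is not part of the slide.

```python
def lin_bid(base_bid, p_user_ad, p_ad, max_multiplier=3.0):
    """BA = BA_base * Util, with Util = p(UxA) / p(A), capped at max_multiplier."""
    util = p_user_ad / p_ad
    return base_bid * min(util, max_multiplier)

bid_high = lin_bid(100, p_user_ad=0.02, p_ad=0.01)    # user twice as likely to respond
bid_low = lin_bid(100, p_user_ad=0.005, p_ad=0.01)    # user half as likely to respond
bid_capped = lin_bid(100, p_user_ad=0.10, p_ad=0.01)  # ratio 10, capped at 3
```

The same function applies to the pCVR ratio by passing conversion probabilities instead.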

PID Control: Proportional (present) + Integral (past) + Derivative (future)
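
A sketch of a PID controller applied to budget pacing. The error is the gap between planned and actual spend in a time slot; the gains, the multiplicative bid adjustment and the bid bounds are all illustrative assumptions.

```python
class PIDPacer:
    def __init__(self, kp, ki, kd):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_error = None

    def step(self, target_spend, actual_spend):
        error = target_spend - actual_spend                  # proportional: present gap
        self.integral += error                               # integral: accumulated past gap
        derivative = 0.0 if self.prev_error is None else error - self.prev_error  # trend
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative

def adjust_bid(bid, signal, target_spend, min_bid, max_bid):
    """Scale the bid by the control signal relative to the target, within bounds."""
    new_bid = bid * (1.0 + signal / target_spend)
    return max(min_bid, min(max_bid, new_bid))

pacer = PIDPacer(kp=0.5, ki=0.1, kd=0.1)
signal_under = pacer.step(target_spend=1000.0, actual_spend=800.0)   # underspending
bid_after = adjust_bid(300.0, signal_under, 1000.0, min_bid=100.0, max_bid=500.0)
signal_over = pacer.step(target_spend=1000.0, actual_spend=1300.0)   # overspending
```

Underspending produces a positive signal and a higher bid; overspending in the next slot produces a negative signal.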

[Figure: Cumulative Clicks and Bid Amount, Fixed (300) vs AutoBid]

LookALike Targeting
(Conversion-driven)

Gift for YOU
Buy one get one free
Shop Now
It’s Travel Time
Refresh yourself. Booking
Congratulations!
Happy birthday~~ Purchase
Male or young
Outdoor activity
Rider
Potential customers

Inventory buying Audience buying
Static Info.
• Gender, age, region
• Interest
Context
• Placement (inventory)
• Current time & location
• Device / OS
• Wi-Fi / Cellular
Custom
• Upload customers
• Inclusive / Exclusive
Dynamic (behavior) Info.
• Site visit
• Product (Page) view
• Keyword query
• Category
• Cohort
LookALike
Effective & Coverage

AdvSet Auto (LAL)
Seemingly
Customers
Potential
Customers

Population
LookALike
(Likely to purchase)
Sorted by
total information value
Seed Audience
(Conversion Users)
Non-conversion Users
Feature #1 (IV)
Feature #2 (IV)
Feature #3 (IV)
Feature #x (IV)
Common (p)
but
Distinguishable (q)
$IV = (p - q) \log \frac{p(1 - q)}{(1 - p)q}$
Impression ➙ Click ➙ Conversion
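
The information value formula can be computed per feature as below. The sketch assumes p is the share of seed (conversion) users who have the feature and q is the share among the comparison group; the example shares are made up.

```python
import math

def information_value(p, q):
    """IV = (p - q) * log( p(1 - q) / ((1 - p) q) ), for 0 < p, q < 1."""
    return (p - q) * math.log(p * (1 - q) / ((1 - p) * q))

iv_same = information_value(0.5, 0.5)      # feature does not separate the groups
iv_distinct = information_value(0.6, 0.2)  # common in the seed, rare elsewhere
```

A feature scores high only when it is both frequent among seed users and clearly rarer in the other group, which matches the "common but distinguishable" criterion.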

Y = 1
Y = 0
Seed Audience
(Conversion Users)
Non-conversion Users
Pr(Y = 1 | X) = LR(X) or DNN(X)
Population
LookALike
(Likely to purchase)
Order by Pr(Y=1|X) desc limit #LAL

Just over 1,000 creatives accounted for 95% of impressions.

10 50 100 200 500 500+
# creatives (1w, Mobile only)
vs 1M creatives

User
Ad Creatives
= x
ui
T1 T2 T3 T4
T1
T2
T3
T4
A4 A8
click{user, creative}
Matrix Factorization

0%
25%
50%
75%
100%
10 50 100 500 1,000 2,000 3,000 5,000 10,000 50,000
2 8 32
Top-N
91~94%
+ New creatives
+ High performing creatives

Order by < hCTR * log(#Imp) > desc limit 1,000
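
The ordering rule above can be sketched as a simple candidate selector. The natural logarithm and the input layout are assumptions; the slide does not specify the log base.

```python
import math

def top_creatives(stats, limit=1000):
    """stats: {creative_id: (clicks, impressions)}. Order by hCTR * log(#Imp) desc."""
    def score(item):
        clicks, imps = item[1]
        if imps <= 0:
            return 0.0
        return (clicks / imps) * math.log(imps)
    ranked = sorted(stats.items(), key=score, reverse=True)
    return [creative_id for creative_id, _ in ranked[:limit]]

stats = {"a": (1, 10), "b": (50, 1000), "c": (0, 100)}
top2 = top_creatives(stats, limit=2)
```

Creative "a" has the higher historical CTR, but "b" ranks first because its CTR is backed by far more impressions.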

UserEmbNet AdEmbNet
Same Dimension
Rank/Similarity
U/A Embedding Net

ANN (Approximate Nearest Neighbor)
- LSH, KD-tree
- Annoy (Approximate Nearest Neighbors Oh Yeah)
- HNSW (Hierarchical Navigable Small World)
- Product Quantization (Meta’s FAISS)
- ScaNN (Scalable NN by Google)
- …
Find N nearest Ads approximately
Ads
User

User
Ad Creatives
= x
User Embedding Vector
AD Embedding Vector
ANN

Bloom Filter, Quotient Filter, etc

Impression
Click
Conversion
Branding, inventory, CPT/CPM
Traffic, audience, CPC
Purchase, right audience, CPA/CPS/AutoBid

Time
Impression
Click ( ~ 10%)
s m d w
Conversion ( ~ 1%)
h

Survival Model
Delay time
$D = D_0 e^{-\lambda t}$
Pr(Y = 1 | X) ⟼ Pr(Y = 1 | X, D) * Pr(D | X)
Reference: Modeling delayed feedback in display advertising

eCPM_CPA = BA_CPA * pCTR * pCVR (* 1,000)
pCTR = #Click / #Impression
pCVR = #Conversion / #Click

ABCDEFGHIJKLMNOPQRSTUVWXYZ
Order by { Relevance, Popularity, Quality }
Source
Times

Quality(CVR) = f( pCVR(UxA) / pCVR(A) )

eCPM = BA * pCTR * Q(CVR) (* 1,000)

It’s Travel Time
Refresh yourself. Booking
90% traffic pCTR
pCTR’ = pCTR + 𝜶
Random bucket MAB
(Multi-armed bandit)
Thompson sampling
Posterior
Observed
10% traffic
Make it unstable (explore) to make it stable
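
A minimal sketch of Thompson sampling over ads, using a Beta posterior built from observed clicks and impressions. The uniform Beta(1, 1) prior, the simulated "true" CTRs and all names are illustrative assumptions.

```python
import random

def thompson_pick(stats, rng):
    """stats: {ad: (clicks, impressions)}. Sample a CTR per ad and pick the largest."""
    best_ad, best_draw = None, -1.0
    for ad, (clicks, imps) in stats.items():
        draw = rng.betavariate(1 + clicks, 1 + imps - clicks)  # posterior sample
        if draw > best_draw:
            best_ad, best_draw = ad, draw
    return best_ad

def simulate(true_ctr, rounds, seed=0):
    rng = random.Random(seed)
    stats = {ad: (0, 0) for ad in true_ctr}
    for _ in range(rounds):
        ad = thompson_pick(stats, rng)
        clicks, imps = stats[ad]
        clicked = 1 if rng.random() < true_ctr[ad] else 0
        stats[ad] = (clicks + clicked, imps + 1)
    return stats

result = simulate({"old_ad": 0.05, "new_ad": 0.20}, rounds=2000)
```

Early on both ads are shown because their posteriors are wide; as evidence accumulates, traffic concentrates on the ad with the higher click rate.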

Cold-start and Exploration
— Random bucket
— Thompson sampling
— Stochastic feature augmentation (drop-out)
— Transfer learning (with hierarchy)
— Model initialization
— Semantic embedding (learning to hash)
— Jitter (tie-breaking)
Explore to get more training data
Proximity

Negative Feedback
• Hide (Do Not Show Ads)
• AdBlock
• DNT (Do Not Track) / LMT (Limit Ad Tracking)
• ITP / ATT
• NDNC (No Response)
• Abusing / Fraud

Inventory-buying (CPT/CPM)
Audience-buying (CPC/CPA)
Hybrid-buying (CPM + CPC)

Auction with Reserve Price
No Bid
Win
Win
2nd price
2nd price
Win
2nd price
Win
Win & 1st price
Auction with Hard Bid Floor Auction with Soft Bid Floor
No Ad
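
One possible reading of the hard and soft bid floor diagrams, as a sketch. It assumes that with a hard floor a top bid below the floor yields no ad, and that with a soft floor such a bid still wins but pays its own (first) price; bids at or above the floor pay the larger of the second price and the floor. These semantics are an interpretation and may differ from the deck's intent.

```python
def run_auction(bids, floor, soft=False):
    """Return (winner_index, price), or None when no ad is served."""
    if not bids:
        return None
    order = sorted(range(len(bids)), key=lambda i: bids[i], reverse=True)
    top = bids[order[0]]
    second = bids[order[1]] if len(bids) > 1 else 0.0
    if top >= floor:
        return order[0], max(second, floor)  # second price, never below the floor
    if soft:
        return order[0], top                 # below a soft floor: first price
    return None                              # below a hard floor: no ad

second_price = run_auction([5.0, 3.0], floor=2.0)
floor_price = run_auction([5.0, 1.0], floor=2.0)
hard_no_ad = run_auction([1.5, 1.0], floor=2.0)
soft_first = run_auction([1.5, 1.0], floor=2.0, soft=True)
```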

ReCalibration (Platt Scaling, Isotonic Regression)
Image from https://machinelearningmastery.com/calibrated-classification-model-in-scikit-learn/

Image from http://www0.cs.ucl.ac.uk/staff/w.zhang/rtb-papers/linkedin-pacing.pdf

[Figure: vCTR (0% to 100%) by Viewable Count (1 to 5)]

pCTR = f(USRxAD, Context, VCnt, …)

Quality Control: Cut off low-performing ADs
pCTR eCPM
No ad > wrong ad

Dynamic Creative Optimization (DCO)
in Perception AI Era
Sorry for nothing to talk about…

Creative Generation (& Personalization)
in Generative AI Era
Sorry for nothing to talk about…

Data Overload & Imbalance
Millions of clicks over billions of impressions
Negative downsampling ($\omega$): $q = \frac{p}{p + \frac{1 - p}{\omega}}$
Clicked
Not clicked
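
A sketch of the re-calibration step that follows negative downsampling. It assumes ω is the fraction of non-clicked impressions kept for training and p is the prediction of the model trained on the downsampled data; the numbers are illustrative.

```python
def recalibrate(p, omega):
    """Map a prediction from the downsampled space back to the original space."""
    return p / (p + (1.0 - p) / omega)

q = recalibrate(0.10, omega=0.01)          # 10% in the sampled space, 1% of negatives kept
q_no_sampling = recalibrate(0.10, omega=1.0)
```

Keeping 1% of the negatives inflates the observed click rate roughly a hundredfold, and the formula scales the prediction back to about 0.11%.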

Research → Offline Test → Online Test → Production
• Model validity
• Log-loss, RIG
• Simulation
• Validity & revenue
• CTR, calibration
• 0 Bucket
Problem & ideation Complexity & Stability

Random
A’
B
C
D
A
• 5 ~ 10%
• Exploration (i.e., cold-start), serving-unbiased, reference (worst case)
• Main bucket (control group)
• Current serving version
• Identical model to main bucket
• To check the effect of serving bias
• Do not reject null hypothesis (A = A’)
• Test bucket (treatment group)
• 10% (up-to 50%, except random bucket)
• Hours to weeks
• Buckets are randomly assigned to users or traffic.
• User-based buckets are periodically re-assigned.
• B’?

Revenue, Revenue, Revenue
- CPM / RPM
- CTR / CVR / ROAS
Model Robustness
- RIG (Relative Information Gain)
- Calibration = predicted / observed
- AUC, Classification accuracy
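
The robustness metrics can be computed offline from predictions and click labels. The sketch assumes the common definition RIG = 1 - LogLoss / H(average CTR), where H is the entropy of a constant predictor at the average CTR; the deck does not state its exact formula. Calibration follows the predicted / observed definition above.

```python
import math

def log_loss(preds, labels):
    """Average negative log likelihood; preds must lie strictly between 0 and 1."""
    total = sum(y * math.log(p) + (1 - y) * math.log(1 - p) for p, y in zip(preds, labels))
    return -total / len(labels)

def rig(preds, labels):
    """Relative information gain over always predicting the average CTR."""
    base = sum(labels) / len(labels)
    entropy = -(base * math.log(base) + (1 - base) * math.log(1 - base))
    return 1.0 - log_loss(preds, labels) / entropy

def calibration(preds, labels):
    """Sum of predicted clicks divided by observed clicks."""
    return sum(preds) / sum(labels)

labels = [1, 0, 0, 0]
rig_baseline = rig([0.25] * 4, labels)           # constant average-CTR predictor
rig_model = rig([0.7, 0.1, 0.1, 0.1], labels)    # a model that separates the click
calib_model = calibration([0.7, 0.1, 0.1, 0.1], labels)
```

The constant predictor has zero RIG by construction, so any positive value measures what the model adds beyond knowing the average CTR.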

Better Model → More Clicks → More Revenue → Incentive?
A Data Scientist’s Happiness Circuit

Revenue (B / Y: 99.01%)
Observed CTR (B / Y: 112.83%)
Predicted CTR (B / Y: 113.28%)
Calibration (103.4 vs 102.6)

Serving Latency
• Dimensionality reduction (& feature selection)
• Negative down-sampling
• Candidate generation
• Simple & slim model ⟹ proper model
- Simple structure & fewer layers/nodes
• Binary representation (vs sparsity & high dimension)
• GoLang / C++
• Scale up & out
• …

Account
Campaign
Group (Set)
Creative
Objective
(Budget)
Targeting (PTA)
BidType & BidAmount
IMG/MOV/TXT/DCO/Gen

Rank by Group/Adv
Rank by Creative
BA * pCTR | Targeting(1/0)
Group Creative
BA * pCTR(G)
MAB or Generate
CTR, RPM (5~10%p lift)
Calibration -> bucket size

Contrastive Learning
for better embedding
& more applicable

Triplet Loss
Minimize Max(Sim(A, N) - Sim(A, P) + ⍺, 0)
UserEmbNet AdEmbNet
P
N
A
Positive Negative
UAnc APos ANeg
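
A small sketch of a triplet loss on embedding vectors, written with cosine similarity. With a similarity measure, the loss is zero once the anchor is closer to the positive than to the negative by at least the margin, so the negative term comes first in the difference. The margin value and the toy vectors are illustrative.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def triplet_loss(anchor, positive, negative, margin=0.2):
    """max(Sim(A, N) - Sim(A, P) + margin, 0) with cosine similarity."""
    return max(cosine(anchor, negative) - cosine(anchor, positive) + margin, 0.0)

user = [1.0, 0.0]          # anchor: user embedding
clicked_ad = [1.0, 0.0]    # positive: ad the user responded to
ignored_ad = [0.0, 1.0]    # negative: ad the user ignored

loss_ok = triplet_loss(user, clicked_ad, ignored_ad)
loss_bad = triplet_loss(user, ignored_ad, clicked_ad)  # roles swapped
```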

Loss = Loss + 𝜆 * Diff(E_new - E_old)

Deep Embedding
Simple
Prediction
Wide
Deep

None (2N), Inner (2N + 1), Outer (2N + N^2)
PQ-Inner (2N + M^2), PQ-Outer (2N + M), Element-wise (3N)

LLM / Generative AI
• As a ranker?
• Feature Augmentation (User & Ad)
• Cold-start
• Explainability
• Creative (Message) Generation
• Simulation / Judge
• …
• Agent?
Vibe creation

Ad Automation
• User Response Prediction
• Auto-Targeting (Performance)
• AutoBid
• Creative Generation (DCO/Gen)
• Set Objectives
• Budget Setting
• (Agent?)
• Go or Stop
• Nothing to do

Revenue, Traffic, & Automation

Question!
Q’s will set you free
