Sharone Dayan, Machine Learning Engineer, and Daria Stefic, Data Scientist, both from Contentsquare, delve into evaluation strategies for dealing with partially labelled or unlabelled data.
2. About us
Sharone Dayan
Machine Learning Engineer
@ Contentsquare
sharone.dayan@contentsquare.com
Daria Stefic, PhD
Data Scientist
@ Contentsquare
daria.stefic@contentsquare.com
3. Agenda
1. Global Picture
Evaluation methods
2. Case studies
a. Purchase Intent Prediction
b. Unsupervised Segment Discovery
c. Anomaly Detection for Alerting
3. Key takeaways
Academia VS Industry
5. Do we have labels?
YES → Supervised
NO or FEW → Semi-supervised / Unsupervised
Evaluation methods:
● Can I get a manually labelled dataset?
● Can I use public datasets?
● Can I generate artificial data?
● Can I design a proxy?
Case studies: Purchase Intent Prediction, Unsupervised Segment Discovery, Anomaly Detection for Alerting.
8. Main goal
“Who could have converted?”
(Diagram: non-buyers vs. buyers, with the purchase intent segment and the missed purchase segment.)
Detect non-converting users who had converting intentions, based on their behaviour (e.g. interaction with product details, add to cart).
Design choices:
● Focus on anonymous users
● Focus on retail clients
● No difference between single-item
and multi-item purchase
● Offline prediction
9. If we had labels for purchase intent…
…we could:
1. directly train a classifier in a supervised way to recognise intent
2. evaluate our classifier with standard classification metrics (e.g. f1-score) directly on these labels
3. compare different solutions in an unbiased way
…but:
The only labels we have for purchase intent are from converting sessions.
10. 10
Positive Negative
Positive
Negative
We don’t want any
converters predicted as
‘Not intended to purchase‘
Predicted intent
Actual
conversion
Purchase Intent Evaluation
We want some
non-converters predicted
as ‘Intended to purchase’
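Since conversion only acts as a proxy for intent, this evaluation boils down to two checks. Below is a minimal sketch, assuming boolean per-session arrays; the names (`proxy_intent_report`, `predicted_intent`, `converted`) are hypothetical, not the actual implementation.

```python
import numpy as np

def proxy_intent_report(predicted_intent: np.ndarray, converted: np.ndarray) -> dict:
    """Proxy evaluation of purchase-intent predictions against conversion labels.

    Both arguments are boolean arrays with one entry per session.
    """
    converters = converted
    non_converters = ~converted

    # We don't want any converters predicted as 'Not intended to purchase',
    # so recall on converters should be close to 1.
    converter_recall = float(predicted_intent[converters].mean())

    # We want some non-converters predicted as 'Intended to purchase';
    # those form the candidate missed-purchase segment.
    missed_purchase_rate = float(predicted_intent[non_converters].mean())

    return {"converter_recall": converter_recall,
            "missed_purchase_rate": missed_purchase_rate}
```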
15. Wait, what… Evaluating unsupervised?
How to include business constraints?
How to benchmark different settings (features, distances,
clustering algorithms, etc.)?
17. We need to validate the clustering “health”
Build toy scenarios where we have clear expectations of what the clustering result should be, and run functional tests around them (see the sketch after this list).
4 types of session generators (artificial data):
- sessions focused on one specific page
- sessions having a given probability for each page group
- “cycling” sessions always coming back to the same sequence
- sessions containing a specific pattern
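One way to make these functional tests concrete, sketched below using two of the generators, a bag-of-pages featurisation, and KMeans purely for illustration; the page names, purity threshold, and featurisation are assumptions, not the actual pipeline.

```python
from collections import Counter
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer

def focused_sessions(page, n=50, length=10):
    # Generator 1: sessions focused on one specific page.
    return [[page] * length for _ in range(n)]

def cycling_sessions(cycle, n=50, repeats=5):
    # Generator 3: "cycling" sessions always coming back to the same sequence.
    return [list(cycle) * repeats for _ in range(n)]

def test_focused_and_cycling_sessions_are_separated():
    sessions = focused_sessions("product") + cycling_sessions(["home", "search", "cart"])
    expected = [0] * 50 + [1] * 50  # which generator produced each session

    # Toy featurisation: bag-of-pages counts (one choice among many).
    X = CountVectorizer(analyzer=lambda session: session).fit_transform(sessions)
    found = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

    # Health check: each generator's sessions should land in a single cluster.
    for group in (0, 1):
        preds = [f for f, e in zip(found, expected) if e == group]
        purity = Counter(preds).most_common(1)[0][1] / len(preds)
        assert purity > 0.9, f"generator {group} split across clusters (purity={purity:.2f})"
```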
18. 18
Health check results complex health check -> areas of improvement
basic health check -> mandatory
Health check
difficulty
Features A,
distance a,
clustering i
Features A,
distance b,
clustering i
… Features B,
distance c,
clustering j
…
…
…
…
…
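A hypothetical shape for that grid, assuming each health check can be expressed as a pass/fail function of a candidate setting; only the structure is implied by the slide, the code itself is an illustration.

```python
from typing import Callable, Dict, List, Tuple

# (check name, is_mandatory, pass/fail function of a candidate setting)
HealthCheck = Tuple[str, bool, Callable[[dict], bool]]

def run_benchmark(settings: Dict[str, dict], checks: List[HealthCheck]) -> Dict[str, Dict[str, bool]]:
    """Score every setting (features, distance, clustering algorithm) on every health check."""
    return {name: {check_name: check(setting) for check_name, _, check in checks}
            for name, setting in settings.items()}

def passes_mandatory(result: Dict[str, bool], checks: List[HealthCheck]) -> bool:
    """Basic health checks are mandatory; failing one disqualifies the setting.

    Complex checks that fail only point at areas of improvement."""
    return all(result[check_name] for check_name, mandatory, _ in checks if mandatory)
```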
20. Anomaly detection in time series
Main goal: alert clients (in real time) if they have issues on their platform.
Example: number of users with API errors → Alert!
23. The model = seasonality + bounds
When metric values are beyond the bounds → alert (a minimal sketch follows below).
But, how do we know if our model is:
● Raising true alerts?
● Not raising false alerts?
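A minimal sketch of what “seasonality + bounds” could look like, assuming a weekly profile at 5-minute granularity and mean ± k·std bounds per time slot; the granularity and k are illustrative assumptions, not the production model.

```python
import numpy as np

SLOTS_PER_WEEK = 7 * 24 * 12  # assumed 5-minute granularity, weekly seasonality

def fit_seasonal_bounds(history: np.ndarray, k: float = 3.0):
    """history: metric values covering several full weeks, one value per slot."""
    n_full = len(history) // SLOTS_PER_WEEK * SLOTS_PER_WEEK
    weeks = history[:n_full].reshape(-1, SLOTS_PER_WEEK)   # one row per week
    mean, std = weeks.mean(axis=0), weeks.std(axis=0)      # seasonal profile
    return mean - k * std, mean + k * std                  # lower / upper bounds

def raise_alerts(values: np.ndarray, slots: np.ndarray, lower, upper) -> np.ndarray:
    """Alert whenever the metric value is beyond the bounds for its time slot."""
    return (values < lower[slots]) | (values > upper[slots])
```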
27. Evaluation #1: forecasting error
Raised alerts of the last-value model vs. the seasonal model (plots for “Last value” and “Seasonal”, with a false alarm highlighted).
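Evaluation #1 needs no labels at all: compare the forecasting error of the naive last-value baseline with that of the seasonal model on held-out data. MAE below is just one possible error metric; the slides do not specify which one is used.

```python
import numpy as np

def mae(forecast: np.ndarray, actual: np.ndarray) -> float:
    return float(np.mean(np.abs(forecast - actual)))

def last_value_forecast(series: np.ndarray) -> np.ndarray:
    # Predict each point with the previous observation.
    return np.concatenate(([series[0]], series[:-1]))

def seasonal_forecast(series: np.ndarray, season: int) -> np.ndarray:
    # Predict each point with the value one full season earlier
    # (the first season has no earlier value, so it is reused as-is).
    return np.concatenate((series[:season], series[:-season]))

# A lower forecasting error gives tighter bounds and, typically, fewer false alarms:
#   mae(last_value_forecast(series), series)
#   mae(seasonal_forecast(series, season=SLOTS_PER_WEEK), series)
```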
28. Evaluation #2: anomalous points classification
Do we want to alert on those?
29. Evaluation #3: anomalous periods
When something happens, it lasts more than 5 min.
Example: iPhone launch.
30. Evaluation #3: anomalous periods
Annotated examples for evaluation (see the sketch below):
Positive periods
● Model is correct if it has at least one detection → TP, otherwise FN
Negative periods
● Model is correct if there are no detections → TN, otherwise FP
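A sketch of how those period-level counts could be computed, assuming an annotated list of (start, end, is_anomalous) periods and a list of alert timestamps; this data layout is an assumption made for illustration.

```python
def period_confusion(periods, alert_timestamps):
    """periods: iterable of (start, end, is_anomalous); alert_timestamps: alert times."""
    tp = fp = tn = fn = 0
    for start, end, is_anomalous in periods:
        n_alerts = sum(start <= t <= end for t in alert_timestamps)
        if is_anomalous:
            # Positive period: correct if at least one detection falls inside it.
            tp += n_alerts >= 1
            fn += n_alerts == 0
        else:
            # Negative period: correct only if there are no detections at all.
            tn += n_alerts == 0
            fp += n_alerts >= 1
    return {"TP": tp, "FP": fp, "TN": tn, "FN": fn}
```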
31. Evaluation #3: anomalous periods (annotated examples for evaluation, same criteria as slide 30)