Sharone Dayan, Machine Learning Engineer, and Daria Stefic, Data Scientist, both from Contentsquare, delve into evaluation strategies for dealing with partially labelled or unlabelled data.
2. About us
Sharone Dayan
Machine Learning Engineer
@ Contentsquare
sharone.dayan@contentsquare.com
Daria Stefic, PhD
Data Scientist
@ Contentsquare
daria.stefic@contentsquare.com
3. Agenda
1. Global Picture
Evaluation methods
2. Case studies
a. Purchase Intent Prediction
b. Unsupervised Segment Discovery
c. Anomaly Detection for Alerting
3. Key takeaways
Academia VS Industry
5. Do we have labels?
YES → Supervised
NO or FEW → Semi-supervised / Unsupervised
Evaluation methods:
● Can I get a manually labelled dataset?
● Can I use public datasets?
● Can I generate artificial data?
● Can I design a proxy?
Case studies: Purchase Intent Prediction, Unsupervised Segment Discovery, Anomaly Detection for Alerting.
8. Main goal
“Who could have converted?”
(Diagram: non-buyers vs. buyers, with the purchase intent segment and the missed purchase segment.)
Detect non-converting users who had converting intentions, based on their behaviour (e.g. interaction with product details, add to cart).
Design choices:
● Focus on anonymous users
● Focus on retail clients
● No difference between single-item
and multi-item purchase
● Offline prediction
9. If we had labels for purchase intent…
…we could:
1. directly train a classifier in a supervised way to recognise intent
2. evaluate our classifier with standard classification metrics (e.g. f1-score) directly on these labels
3. compare different solutions in an unbiased way
…but:
The only labels we have for purchase intent are from converting sessions.
10. 10
Positive Negative
Positive
Negative
We don’t want any
converters predicted as
‘Not intended to purchase‘
Predicted intent
Actual
conversion
Purchase Intent Evaluation
We want some
non-converters predicted
as ‘Intended to purchase’
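Since conversion only acts as a proxy for intent, this evaluation boils down to two checks. Below is a minimal sketch, assuming boolean per-session arrays; the names (`proxy_intent_report`, `predicted_intent`, `converted`) are hypothetical, not the actual implementation.

```python
import numpy as np

def proxy_intent_report(predicted_intent: np.ndarray, converted: np.ndarray) -> dict:
    """Proxy evaluation of purchase-intent predictions against conversion labels.

    Both arguments are boolean arrays with one entry per session.
    """
    converters = converted
    non_converters = ~converted

    # We don't want any converters predicted as 'Not intended to purchase',
    # so recall on converters should be close to 1.
    converter_recall = float(predicted_intent[converters].mean())

    # We want some non-converters predicted as 'Intended to purchase';
    # those form the candidate missed-purchase segment.
    missed_purchase_rate = float(predicted_intent[non_converters].mean())

    return {"converter_recall": converter_recall,
            "missed_purchase_rate": missed_purchase_rate}
```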
15. Wait, what… Evaluating unsupervised?
How to include business constraints?
How to benchmark different settings (features, distances,
clustering algorithms, etc.)?
17. We need to validate the clustering “health”
Build toy scenarios where we have clear expectations of what the clustering result should be, and run functional tests around them (see the sketch after this list).
4 types of session generators (artificial data):
- sessions focused on one specific page
- sessions having a given probability for each page group
- “cycling” sessions always coming back to the same sequence
- sessions containing a specific pattern
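One way to make these functional tests concrete, sketched below using two of the generators, a bag-of-pages featurisation, and KMeans purely for illustration; the page names, purity threshold, and featurisation are assumptions, not the actual pipeline.

```python
from collections import Counter
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer

def focused_sessions(page, n=50, length=10):
    # Generator 1: sessions focused on one specific page.
    return [[page] * length for _ in range(n)]

def cycling_sessions(cycle, n=50, repeats=5):
    # Generator 3: "cycling" sessions always coming back to the same sequence.
    return [list(cycle) * repeats for _ in range(n)]

def test_focused_and_cycling_sessions_are_separated():
    sessions = focused_sessions("product") + cycling_sessions(["home", "search", "cart"])
    expected = [0] * 50 + [1] * 50  # which generator produced each session

    # Toy featurisation: bag-of-pages counts (one choice among many).
    X = CountVectorizer(analyzer=lambda session: session).fit_transform(sessions)
    found = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

    # Health check: each generator's sessions should land in a single cluster.
    for group in (0, 1):
        preds = [f for f, e in zip(found, expected) if e == group]
        purity = Counter(preds).most_common(1)[0][1] / len(preds)
        assert purity > 0.9, f"generator {group} split across clusters (purity={purity:.2f})"
```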
18. 18
Health check results complex health check -> areas of improvement
basic health check -> mandatory
Health check
difficulty
Features A,
distance a,
clustering i
Features A,
distance b,
clustering i
… Features B,
distance c,
clustering j
…
…
…
…
…
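A hypothetical shape for that grid, assuming each health check can be expressed as a pass/fail function of a candidate setting; only the structure is implied by the slide, the code itself is an illustration.

```python
from typing import Callable, Dict, List, Tuple

# (check name, is_mandatory, pass/fail function of a candidate setting)
HealthCheck = Tuple[str, bool, Callable[[dict], bool]]

def run_benchmark(settings: Dict[str, dict], checks: List[HealthCheck]) -> Dict[str, Dict[str, bool]]:
    """Score every setting (features, distance, clustering algorithm) on every health check."""
    return {name: {check_name: check(setting) for check_name, _, check in checks}
            for name, setting in settings.items()}

def passes_mandatory(result: Dict[str, bool], checks: List[HealthCheck]) -> bool:
    """Basic health checks are mandatory; failing one disqualifies the setting.

    Complex checks that fail only point at areas of improvement."""
    return all(result[check_name] for check_name, mandatory, _ in checks if mandatory)
```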
20. Anomaly detection in time series
Main goal: alert clients (in real time) if they have issues on their platform.
Example: number of users with API errors → Alert!
23. The model = seasonality + bounds
When metric values are beyond the bounds → alert (a minimal sketch follows below).
But, how do we know if our model is:
● Raising true alerts?
● Not raising false alerts?
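A minimal sketch of what “seasonality + bounds” could look like, assuming a weekly profile at 5-minute granularity and mean ± k·std bounds per time slot; the granularity and k are illustrative assumptions, not the production model.

```python
import numpy as np

SLOTS_PER_WEEK = 7 * 24 * 12  # assumed 5-minute granularity, weekly seasonality

def fit_seasonal_bounds(history: np.ndarray, k: float = 3.0):
    """history: metric values covering several full weeks, one value per slot."""
    n_full = len(history) // SLOTS_PER_WEEK * SLOTS_PER_WEEK
    weeks = history[:n_full].reshape(-1, SLOTS_PER_WEEK)   # one row per week
    mean, std = weeks.mean(axis=0), weeks.std(axis=0)      # seasonal profile
    return mean - k * std, mean + k * std                  # lower / upper bounds

def raise_alerts(values: np.ndarray, slots: np.ndarray, lower, upper) -> np.ndarray:
    """Alert whenever the metric value is beyond the bounds for its time slot."""
    return (values < lower[slots]) | (values > upper[slots])
```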
27. Evaluation #1: forecasting error
Raised alerts of the last-value model vs. the seasonal model (plots for “Last value” and “Seasonal”, with a false alarm highlighted).
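Evaluation #1 needs no labels at all: compare the forecasting error of the naive last-value baseline with that of the seasonal model on held-out data. MAE below is just one possible error metric; the slides do not specify which one is used.

```python
import numpy as np

def mae(forecast: np.ndarray, actual: np.ndarray) -> float:
    return float(np.mean(np.abs(forecast - actual)))

def last_value_forecast(series: np.ndarray) -> np.ndarray:
    # Predict each point with the previous observation.
    return np.concatenate(([series[0]], series[:-1]))

def seasonal_forecast(series: np.ndarray, season: int) -> np.ndarray:
    # Predict each point with the value one full season earlier
    # (the first season has no earlier value, so it is reused as-is).
    return np.concatenate((series[:season], series[:-season]))

# A lower forecasting error gives tighter bounds and, typically, fewer false alarms:
#   mae(last_value_forecast(series), series)
#   mae(seasonal_forecast(series, season=SLOTS_PER_WEEK), series)
```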
28. Evaluation #2: anomalous points classification
Do we want to alert on those?
29. Evaluation #3: anomalous periods
When something happens, it lasts more than 5 min.
Example: iPhone launch.
30. Evaluation #3: anomalous periods
Annotated examples for evaluation (see the sketch below):
Positive periods
● Model is correct if it has at least one detection → TP, otherwise FN
Negative periods
● Model is correct if there are no detections → TN, otherwise FP
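A sketch of how those period-level counts could be computed, assuming an annotated list of (start, end, is_anomalous) periods and a list of alert timestamps; this data layout is an assumption made for illustration.

```python
def period_confusion(periods, alert_timestamps):
    """periods: iterable of (start, end, is_anomalous); alert_timestamps: alert times."""
    tp = fp = tn = fn = 0
    for start, end, is_anomalous in periods:
        n_alerts = sum(start <= t <= end for t in alert_timestamps)
        if is_anomalous:
            # Positive period: correct if at least one detection falls inside it.
            tp += n_alerts >= 1
            fn += n_alerts == 0
        else:
            # Negative period: correct only if there are no detections at all.
            tn += n_alerts == 0
            fp += n_alerts >= 1
    return {"TP": tp, "FP": fp, "TN": tn, "FN": fn}
```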
31. Evaluation #3: anomalous periods (annotated examples for evaluation, same criteria as slide 30)