Learning from Positive and Unlabeled Data

Jessa Bekker
jessa.bekker@cs.kuleuven.be
people.cs.kuleuven.be/~jessa.bekker
Machine Learning Reality

"These minions have diabetes."
"Please check the others, Dr. Nefario."
Learning with Positive and Unlabeled Data

Or... we can use the data as is, keeping in mind that the undiagnosed minions might still have diabetes.
Binary Classification

[Figure legend: healthy minion; minion with diabetes; classifier: positive; classifier: negative; positive label; negative label.]
What we want: Supervised
What we get in practice: Positive and Unlabeled (PU)
Positive and Unlabeled Data is Everywhere

Medical records (only diagnosed issues are recorded):
  Tom (age: 25, sex: male). Known issues: low vision, hot tibia.
  Jessa (age: 27, sex: female). Known issues: lumbago, mono.
  Vincent (age: 26, sex: male). Known issues: /

Bookmarks/likes

Incomplete gene/protein databases
[Figure builds also note "All have undesirable side effects" and contrast the incomplete database with the complete database.]
What do I know?
• PhD student @ Machine Learning Research Group, KU Leuven
• Fundamental research on learning with positive and unlabeled data:
  • Estimating the Class Prior in Positive and Unlabeled Data through Decision Tree Induction. AAAI, 2018.
  • Positive and Unlabeled Relational Classification through Label Frequency Estimation. ILP, 2017. (Most promising paper award)
  • Ongoing work...
Outline
1. The Easy Case
2. The Hard Case
3. The Extremely Hard Case
The Easy Case:
(Linear) Separability
(Linear) Separability
Method 1: Biased Learning
• Unlabeled = negative
• Strongly penalize wrongly classified positive examples
(A code sketch follows the figure below.)
[Figure series: candidate decision boundaries are scored with a penalty of -1 per misclassified unlabeled example and a much heavier penalty per misclassified positive example; the builds show values such as -45, -1000, -1031, and -51.]
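A minimal sketch of biased learning in Python with scikit-learn. The toy data, the linear SVM, and the weight of 1000 are my own illustrative assumptions, not choices prescribed by the talk:

# Biased learning sketch: treat unlabeled examples as negatives, but make
# errors on the known positives much more expensive.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Toy PU data: y_true is the hidden ground truth, s marks labeled positives.
X = rng.normal(size=(200, 2))
y_true = (X[:, 0] + X[:, 1] > 0).astype(int)
s = y_true * (rng.random(200) < 0.3)   # only some positives get labeled

# Unlabeled = negative, but a misclassified positive costs 1000x as much.
weights = np.where(s == 1, 1000.0, 1.0)

clf = SVC(kernel="linear")
clf.fit(X, s, sample_weight=weights)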
(Linear) Separability
Method 2: Two-Step Strategy Techniques
1. Find reliable negatives
2. Traditional semi-supervised learning
(A code sketch follows the figure below.)
[Figure: a 2-D example whose positive region is described by the positive features x ≤ 2 and y ≥ 2; unlabeled examples that violate these features can serve as reliable negatives, after which "semi-supervised learning magic" handles the rest.]
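A hedged sketch of the two-step strategy in Python. The model choices and the 20% threshold are illustrative assumptions, not a prescribed recipe:

# Two-step strategy sketch:
#   Step 1: find reliable negatives among the unlabeled examples.
#   Step 2: (semi-)supervised learning on positives vs. reliable negatives.
import numpy as np
from sklearn.linear_model import LogisticRegression

def two_step(X, s, reliable_fraction=0.2):
    # Step 1: score every example by how "labeled-like" it looks; the
    # lowest-scoring unlabeled examples become reliable negatives.
    naive = LogisticRegression().fit(X, s)
    scores = naive.predict_proba(X)[:, 1]
    unlabeled = np.flatnonzero(s == 0)
    n_reliable = max(1, int(reliable_fraction * len(unlabeled)))
    reliable_neg = unlabeled[np.argsort(scores[unlabeled])[:n_reliable]]

    # Step 2: train on labeled positives vs. reliable negatives only.
    # A fuller version would iterate, growing the negative set with
    # confidently negative predictions (traditional semi-supervised learning).
    idx = np.concatenate([np.flatnonzero(s == 1), reliable_neg])
    return LogisticRegression().fit(X[idx], (s[idx] == 1).astype(int))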
The Hard Case:
Not separable, but labels are selected completely at random
Non-Separability Demands Probabilities

[Figure: where the classes overlap, the classifier should output probabilities between 0% and 100%, e.g. 50%.]
Non-Separability is Difficult
Selected Completely At Random Assumption

Observed positive examples are selected completely at random from the positive set.
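Formally, in the deck's own notation, the assumption says the probability of being labeled is a constant $c$ (the label frequency) that does not depend on the attributes:

$\Pr(\text{labeled} \mid \text{positive}, \text{attributes}) = \Pr(\text{labeled} \mid \text{positive}) = c$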
Selected Completely At Random:
$\Pr(\text{labeled} \mid \text{positive}, \text{attributes}) = \text{constant}$

Not Selected Completely At Random:
$\Pr(\text{labeled} \mid \text{positive}, \text{left}) > \Pr(\text{labeled} \mid \text{positive}, \text{right})$
Relative Probabilities with Selected Completely at Random

[Figure: where the probability of being positive is 100%, 50%, or 0%, the probability of being labeled is 50%, 25%, or 0%: a constant fraction (here one half) of the positive probability.]
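The figure's numbers follow directly from the assumption: because only positives can be labeled, the probability of being labeled factors through the constant label frequency $c$ (one half in the example):

$\Pr(\text{labeled} \mid x) = \Pr(\text{labeled} \mid \text{positive}, x) \cdot \Pr(\text{positive} \mid x) = c \cdot \Pr(\text{positive} \mid x)$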
Learn Naive Classifier, then Scale
• The naive classifier predicts the probability of being labeled
• Scale (see the sketch below):
  • Option 1: so that the proportion of positives is correct
    => need to know the proportion of positives!
  • Option 2: so that the maximum probability is 1
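A sketch of both options in Python. The scikit-learn model and the data interface (features X, labels s with 1 = labeled) are illustrative assumptions:

# "Learn a naive classifier, then scale" under SCAR: the naive classifier
# estimates Pr(labeled | x); dividing by the label frequency c gives
# Pr(positive | x) = Pr(labeled | x) / c.
import numpy as np
from sklearn.linear_model import LogisticRegression

def scaled_probabilities(X, s, proportion_positive=None):
    naive = LogisticRegression().fit(X, s)    # labeled vs. unlabeled
    p_labeled = naive.predict_proba(X)[:, 1]  # estimates Pr(labeled | x)

    if proportion_positive is not None:
        # Option 1: pick c so the average of Pr(labeled | x) / c matches
        # the known proportion of positives.
        c = p_labeled.mean() / proportion_positive
    else:
        # Option 2: pick c so the maximum scaled probability is 1.
        c = p_labeled.max()

    return np.clip(p_labeled / c, 0.0, 1.0)   # estimates Pr(positive | x)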
The Extremely Hard Case:
Not separable, but labels are selected conditionally at random
Selected Conditionally At Random Assumption

Observed positive examples are selected conditionally at random from the positive set, conditioned on the attributes.

The probability that a positive example is selected is a function of (some of) the attributes in the data, called the propensity score.
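In symbols, writing $e(x)$ for the propensity score (the deck names the concept; the symbol is my notational choice):

$\Pr(\text{labeled} \mid \text{positive}, x) = e(x)$, and hence $\Pr(\text{labeled} \mid x) = e(x) \cdot \Pr(\text{positive} \mid x)$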
Selected Completely At Random:
$\Pr(\text{labeled} \mid \text{positive}, \text{attributes}) = \text{constant}$

Selected Conditionally At Random:
$\Pr(\text{labeled} \mid \text{positive}, \text{attributes}) = e(\text{attributes})$
For each value of the attributes, selection is random.
Learn Naive Classifier, then Scale
• The naive classifier predicts the probability of being labeled
• Scale using the propensity score function (see the sketch below)
  => need to know the propensity score function!
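A sketch of this scaling, assuming the propensity score function is given as a vectorized Python function (an interface I chose for illustration):

# Scaling under SAR: the naive classifier estimates Pr(labeled | x);
# dividing by the known propensity score e(x) recovers
# Pr(positive | x) = Pr(labeled | x) / e(x).
import numpy as np
from sklearn.linear_model import LogisticRegression

def sar_scaled_probabilities(X, s, propensity_score):
    naive = LogisticRegression().fit(X, s)    # labeled vs. unlabeled
    p_labeled = naive.predict_proba(X)[:, 1]  # estimates Pr(labeled | x)
    e = propensity_score(X)                   # e(x), assumed known here
    return np.clip(p_labeled / e, 0.0, 1.0)   # estimates Pr(positive | x)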
Learn Classifier and Propensity Score Simultaneously
Use available knowledge (one possible sketch follows this list):
• Attributes in the propensity score function
• Proportion of positives
• Domain knowledge that the classifier must adhere to
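The deck does not fix an algorithm here. As one illustration (my own sketch, not the talk's method), both models can be parameterized and fit jointly by maximizing the likelihood of the observed labels, with a known proportion of positives used as a soft constraint:

# Joint sketch: fit a logistic classifier f(x) ~ Pr(positive | x) and a
# logistic propensity model e(x) ~ Pr(labeled | positive, x) together,
# using Pr(labeled | x) = e(x) * f(x).
import numpy as np
from scipy.optimize import minimize

def fit_jointly(X, s, proportion_positive=None):
    Xb = np.hstack([X, np.ones((len(X), 1))])  # add a bias column
    d = Xb.shape[1]

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-np.clip(z, -30, 30)))

    def neg_log_likelihood(theta):
        f = sigmoid(Xb @ theta[:d])            # classifier Pr(positive | x)
        e = sigmoid(Xb @ theta[d:])            # propensity Pr(labeled | positive, x)
        p = np.clip(e * f, 1e-9, 1 - 1e-9)     # Pr(labeled | x)
        nll = -np.mean(s * np.log(p) + (1 - s) * np.log(1 - p))
        if proportion_positive is not None:    # use available knowledge
            nll += 10.0 * (f.mean() - proportion_positive) ** 2
        return nll

    theta = minimize(neg_log_likelihood, np.zeros(2 * d)).x
    return theta[:d], theta[d:]                # classifier and propensity weights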
Conclusions
• PU learning is very useful in practice
• We need assumptions to learn from PU data:
  • Linearly separable
  • Selected completely at random => scale probabilities
    • using the proportion of positives, or
    • using the maximum scale
  • Selected conditionally at random => use the propensity score
• Ongoing work
Learning from Positive and Unlabeled Data