Machine learning is everywhere
There are ML algorithms in:
– Crawler
– Indexer
– Ranker
– Data mining systems
– Frontend
Most of them are supervised:
– Require a training set
– Human judgement is expensive
– Ranker training set: 1M documents, 50K queries
Usual problems of a training set

[Figure: points of class A, points of class B, and unlabelled points]

Problems:
– Unlabelled points between the classes
– Imbalance of labelled points
– There is an unsampled cluster
Idea of Active Learning:
We fix these problems by smart construction of the training set.
We save the assessors' resources.
Uncertainty sampling
Take the instances that the model is least certain how to label.
Problem:
- Requires the posterior distribution P(Y|x)
[Figure: points of class A, points of class B, and unlabelled points; two candidate points are marked 1 and 2, and point 1 is selected]
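Below is a minimal sketch of uncertainty sampling for a binary problem, assuming a scikit-learn-style classifier whose predict_proba supplies the posterior P(Y|x); the data and names are illustrative, not from the slides.

import numpy as np
from sklearn.linear_model import LogisticRegression

def uncertainty_sampling(model, X_unlabelled, n_queries=1):
    # Posterior P(Y|x) for every unlabelled point
    proba = model.predict_proba(X_unlabelled)
    # Small margin between the two class probabilities = least certain
    margin = np.abs(proba[:, 1] - proba[:, 0])
    return np.argsort(margin)[:n_queries]    # indices to send to the assessor

# Toy usage: two labelled blobs plus unlabelled points lying between them
rng = np.random.default_rng(0)
X_lab = np.vstack([rng.normal(-2.0, 1.0, (20, 2)), rng.normal(2.0, 1.0, (20, 2))])
y_lab = np.array([0] * 20 + [1] * 20)
X_unl = rng.normal(0.0, 2.0, (100, 2))
model = LogisticRegression().fit(X_lab, y_lab)
print("ask the assessor about points:", uncertainty_sampling(model, X_unl, 5))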
QBag algorithm
Input:  T – initial labelled training set
        C – size of the committee
        A – learning algorithm
        U – set of unlabelled objects
Output: T' – extended training set

1. Uniformly resample T to obtain T1, ..., TC, where |Ti| < |T|
2. For each Ti, build a model Mi using A
3. Select x* = argmin over x ∈ U of | #{i : Mi(x) = 1} - #{i : Mi(x) = 0} |
4. Pass x* to the assessor and update T
5. Repeat from step 1 until convergence
K. Dwyer, R. Holte. Decision Tree Instability and Active Learning, 2007.
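A sketch of one QBag iteration, assuming binary labels and a scikit-learn-style base learner; the committee vote and the argmin of step 3 follow the pseudocode above, while the names and subsample sizes are illustrative.

import numpy as np
from sklearn.base import clone
from sklearn.tree import DecisionTreeClassifier

def qbag_select(X_train, y_train, X_pool, base_learner=DecisionTreeClassifier(),
                committee_size=5, rng=None):
    # Step 1: uniformly resample T into T_1..T_C with |T_i| < |T|
    rng = rng or np.random.default_rng(0)
    n = len(X_train)
    votes_for_1 = np.zeros(len(X_pool))
    for _ in range(committee_size):
        idx = rng.choice(n, size=max(2, n // 2), replace=False)
        # Step 2: build model M_i on T_i using the base learning algorithm A
        model = clone(base_learner).fit(X_train[idx], y_train[idx])
        votes_for_1 += model.predict(X_pool)
    votes_for_0 = committee_size - votes_for_1
    # Step 3: the pool point with the most even committee vote
    return int(np.argmin(np.abs(votes_for_1 - votes_for_0)))

# Steps 4-5: send x* to the assessor, append it to T, and repeat until convergence.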
Density sampling
Idea: Balance dense/sparse regions of the input space
[Figure: dense and sparse regions of points of classes A and B among unlabelled points; the sparse region is not sampled]
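One possible reading of density sampling, sketched with a k-nearest-neighbour density estimate: sampling weights are inversely proportional to local density, so sparse regions are not left out. The estimator and parameter choices are my assumptions, not from the slides.

import numpy as np
from sklearn.neighbors import NearestNeighbors

def density_balanced_sample(X_unlabelled, n_queries=10, k=10, rng=None):
    rng = rng or np.random.default_rng(0)
    # Local density ~ 1 / mean distance to the k nearest neighbours
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_unlabelled)
    dist, _ = nn.kneighbors(X_unlabelled)
    density = 1.0 / (dist[:, 1:].mean(axis=1) + 1e-12)   # skip self-distance
    # Sample with probability inversely proportional to density
    weights = 1.0 / density
    weights /= weights.sum()
    return rng.choice(len(X_unlabelled), size=n_queries, replace=False, p=weights)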
Clustering of our training set for ranking
[Figure: self-organizing map of the training set, colored by relevance grade: Navigation, Highly relevant, Medium relevant, Low relevant, Irrelevant, 404]

Self-organizing map:
- a cell is a cluster
- color is relevance
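A minimal plain-NumPy sketch of a self-organizing map, just to make "cell is cluster" concrete; grid size, learning rate, and neighbourhood width are illustrative assumptions, not the parameters used on the slides.

import numpy as np

def train_som(X, grid=(10, 10), epochs=20, lr=0.5, sigma=2.0, rng=None):
    # Each grid cell holds a weight vector; a document belongs to the cell
    # whose weight vector is closest to its feature vector.
    rng = rng or np.random.default_rng(0)
    rows, cols = grid
    W = rng.normal(size=(rows, cols, X.shape[1]))
    coords = np.stack(np.meshgrid(np.arange(rows), np.arange(cols),
                                  indexing="ij"), axis=-1)
    for epoch in range(epochs):
        decay = 1.0 - epoch / epochs
        for x in rng.permutation(X):
            # Best matching unit = the winning cell for this document
            bmu = np.unravel_index(np.argmin(np.linalg.norm(W - x, axis=-1)),
                                   (rows, cols))
            # Pull the BMU and its grid neighbours towards x
            g = np.exp(-np.linalg.norm(coords - np.array(bmu), axis=-1) ** 2
                       / (2 * (sigma * decay + 1e-3) ** 2))
            W += lr * decay * g[..., None] * (x - W)
    return W

def cluster_of(W, x):
    # Map a feature vector to its SOM cell (cluster id)
    d = np.linalg.norm(W - x, axis=-1)
    return np.unravel_index(np.argmin(d), d.shape)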
SOM-balancing algorithm
1. Build clustering C for the training set
2. Compute the average cluster density density_avg
3. For each cluster c ∈ C
4.   If density(c) > density_avg
5.     Limit the number of samples in c to N
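A sketch of the SOM-balancing step, assuming each document already carries the id of its SOM cluster (e.g. via a cluster_of mapping like the one sketched above); over-dense clusters are capped at N samples. Names and the random tie-breaking are my assumptions.

import numpy as np
from collections import defaultdict

def som_balance(doc_ids, cluster_ids, N=10, rng=None):
    rng = rng or np.random.default_rng(0)
    clusters = defaultdict(list)
    for doc, c in zip(doc_ids, cluster_ids):
        clusters[c].append(doc)
    # Step 2: average number of documents per cluster
    density_avg = np.mean([len(docs) for docs in clusters.values()])
    kept = []
    for c, docs in clusters.items():
        # Steps 3-5: clusters denser than average keep at most N random samples
        if len(docs) > density_avg:
            docs = list(rng.choice(docs, size=min(N, len(docs)), replace=False))
        kept.extend(docs)
    return kept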
SOM-balancing results
Results:
- Training set size: 350K documents
- Map: 300x300 clusters, N=10
- Compression: 18%
- Quality: DCG original 17.20, DCG compressed 17.26
Problem:
- Compression level is small
SOM+QBag for learning to rank
Clustering for initial training set construction
1. Build a clustering using random sampling of documents
2. Mark all clusters as unused
3. Select the query that covers the maximum number of unused clusters
4. For each cluster covered by documents from that query
5.   Select 1 document and send it to the assessor
6.   Mark the cluster as used
7. Repeat from step 3 until M queries are selected
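A sketch of the cluster-coverage selection above, assuming each candidate query comes with the SOM cluster id of each of its documents; the greedy choice of step 3 and the one-document-per-new-cluster rule of steps 4-6 are implemented directly, while the data structures and names are my assumptions.

def select_initial_queries(query_docs, M):
    # query_docs: {query: [(doc_id, cluster_id), ...]}
    query_docs = dict(query_docs)          # work on a copy
    used, selected, to_judge = set(), [], []
    for _ in range(M):
        if not query_docs:
            break
        # Step 3: the query covering the maximum number of unused clusters
        best = max(query_docs,
                   key=lambda q: len({c for _, c in query_docs[q]} - used))
        selected.append(best)
        # Steps 4-6: one document per newly covered cluster goes to the assessor
        for doc, c in query_docs[best]:
            if c not in used:
                to_judge.append((best, doc))
                used.add(c)
        del query_docs[best]
    return selected, to_judge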
SOM+QBag for learning to rank
Application of QBag
1. Build committee of models for QBag
2. Build clustering C for current training set
3. Mark all clusters as unused
4. For each query from a pool of new queries
5. For each selected by QBag pair (d1
, d2
)
6. c1
= cluster(d1
), c2
= cluster(d2
)
7. If c1
is unused OR c2
is unused
8. Send d1
and d2
to assessors
9. Set c1
and c2
as used
10. Set all clusters as unused
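Finally, a sketch of how the QBag pair selection and the cluster flags above could be wired together. The slide leaves the scope of the "unused" reset ambiguous; here the flags are reset per query (step 10), and qbag_pairs, cluster_of and assessors are hypothetical callbacks standing in for the committee, the SOM lookup, and the judgement interface.

def qbag_with_clusters(new_queries, qbag_pairs, cluster_of, assessors):
    # new_queries: iterable of query ids
    # qbag_pairs(q): document pairs (d1, d2) chosen by the QBag committee for q
    # cluster_of(d): SOM cluster of document d
    # assessors(d1, d2): request a relevance judgement for the pair
    for q in new_queries:
        used = set()                                  # steps 3 / 10: reset the flags
        for d1, d2 in qbag_pairs(q):                  # step 5
            c1, c2 = cluster_of(d1), cluster_of(d2)   # step 6
            if c1 not in used or c2 not in used:      # step 7
                assessors(d1, d2)                     # step 8
                used.update((c1, c2))                 # step 9
    # Step 1 (building the committee) and step 2 (clustering the current
    # training set) happen before this loop and are omitted here.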