Machine learning is everywhere
There are ML algorithms in:
– Crawler
– Indexer
– Ranker
– Data mining systems
– Frontend
Most of them are supervised:
– Require a training set
– Human judgement is expensive
– Ranker training set: 1M documents, 50K queries
Usual problems of a training set

[Figure: points of class A, points of class B, and unlabelled points]

Problems:
– Unlabelled points between the classes
– Imbalance of labelled points
– There is an unsampled cluster
Idea of Active Learning:
We fix these problems by smart construction of the training set.
We save the assessors' resources.
Uncertainty sampling
Take the instances that the model is least certain how to label.
Problem:
- Requires the posterior distribution P(Y|x)
[Figure: points of class A, points of class B, and unlabelled points; two candidate points are marked 1 and 2, and point 1 is selected]
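Below is a minimal sketch of uncertainty sampling for a binary problem, assuming a scikit-learn-style classifier whose predict_proba supplies the posterior P(Y|x); the data and names are illustrative, not from the slides.

import numpy as np
from sklearn.linear_model import LogisticRegression

def uncertainty_sampling(model, X_unlabelled, n_queries=1):
    # Posterior P(Y|x) for every unlabelled point
    proba = model.predict_proba(X_unlabelled)
    # Small margin between the two class probabilities = least certain
    margin = np.abs(proba[:, 1] - proba[:, 0])
    return np.argsort(margin)[:n_queries]    # indices to send to the assessor

# Toy usage: two labelled blobs plus unlabelled points lying between them
rng = np.random.default_rng(0)
X_lab = np.vstack([rng.normal(-2.0, 1.0, (20, 2)), rng.normal(2.0, 1.0, (20, 2))])
y_lab = np.array([0] * 20 + [1] * 20)
X_unl = rng.normal(0.0, 2.0, (100, 2))
model = LogisticRegression().fit(X_lab, y_lab)
print("ask the assessor about points:", uncertainty_sampling(model, X_unl, 5))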
QBag algorithm
Input:  T – initial labelled training set
        C – size of the committee
        A – learning algorithm
        U – set of unlabelled objects
Output: T' – extended training set

1. Uniformly resample T to obtain T1, ..., TC, where |Ti| < |T|
2. For each Ti, build a model Mi using A
3. Select x* = argmin over x ∈ U of | #{i : Mi(x) = 1} - #{i : Mi(x) = 0} |
4. Pass x* to the assessor and update T
5. Repeat from step 1 until convergence
K. Dwyer, R. Holte. Decision Tree Instability and Active Learning, 2007.
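A sketch of one QBag iteration, assuming binary labels and a scikit-learn-style base learner; the committee vote and the argmin of step 3 follow the pseudocode above, while the names and subsample sizes are illustrative.

import numpy as np
from sklearn.base import clone
from sklearn.tree import DecisionTreeClassifier

def qbag_select(X_train, y_train, X_pool, base_learner=DecisionTreeClassifier(),
                committee_size=5, rng=None):
    # Step 1: uniformly resample T into T_1..T_C with |T_i| < |T|
    rng = rng or np.random.default_rng(0)
    n = len(X_train)
    votes_for_1 = np.zeros(len(X_pool))
    for _ in range(committee_size):
        idx = rng.choice(n, size=max(2, n // 2), replace=False)
        # Step 2: build model M_i on T_i using the base learning algorithm A
        model = clone(base_learner).fit(X_train[idx], y_train[idx])
        votes_for_1 += model.predict(X_pool)
    votes_for_0 = committee_size - votes_for_1
    # Step 3: the pool point with the most even committee vote
    return int(np.argmin(np.abs(votes_for_1 - votes_for_0)))

# Steps 4-5: send x* to the assessor, append it to T, and repeat until convergence.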
Density sampling
Idea: Balance dense/sparse regions of the input space
[Figure: dense and sparse regions of points of classes A and B among unlabelled points; the sparse region is not sampled]
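One possible reading of density sampling, sketched with a k-nearest-neighbour density estimate: sampling weights are inversely proportional to local density, so sparse regions are not left out. The estimator and parameter choices are my assumptions, not from the slides.

import numpy as np
from sklearn.neighbors import NearestNeighbors

def density_balanced_sample(X_unlabelled, n_queries=10, k=10, rng=None):
    rng = rng or np.random.default_rng(0)
    # Local density ~ 1 / mean distance to the k nearest neighbours
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_unlabelled)
    dist, _ = nn.kneighbors(X_unlabelled)
    density = 1.0 / (dist[:, 1:].mean(axis=1) + 1e-12)   # skip self-distance
    # Sample with probability inversely proportional to density
    weights = 1.0 / density
    weights /= weights.sum()
    return rng.choice(len(X_unlabelled), size=n_queries, replace=False, p=weights)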
Clustering of our training set for ranking
[Figure: self-organizing map of the training set, colored by relevance grade: Navigation, Highly relevant, Medium relevant, Low relevant, Irrelevant, 404]

Self-organizing map:
- a cell is a cluster
- color is relevance
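A minimal plain-NumPy sketch of a self-organizing map, just to make "cell is cluster" concrete; grid size, learning rate, and neighbourhood width are illustrative assumptions, not the parameters used on the slides.

import numpy as np

def train_som(X, grid=(10, 10), epochs=20, lr=0.5, sigma=2.0, rng=None):
    # Each grid cell holds a weight vector; a document belongs to the cell
    # whose weight vector is closest to its feature vector.
    rng = rng or np.random.default_rng(0)
    rows, cols = grid
    W = rng.normal(size=(rows, cols, X.shape[1]))
    coords = np.stack(np.meshgrid(np.arange(rows), np.arange(cols),
                                  indexing="ij"), axis=-1)
    for epoch in range(epochs):
        decay = 1.0 - epoch / epochs
        for x in rng.permutation(X):
            # Best matching unit = the winning cell for this document
            bmu = np.unravel_index(np.argmin(np.linalg.norm(W - x, axis=-1)),
                                   (rows, cols))
            # Pull the BMU and its grid neighbours towards x
            g = np.exp(-np.linalg.norm(coords - np.array(bmu), axis=-1) ** 2
                       / (2 * (sigma * decay + 1e-3) ** 2))
            W += lr * decay * g[..., None] * (x - W)
    return W

def cluster_of(W, x):
    # Map a feature vector to its SOM cell (cluster id)
    d = np.linalg.norm(W - x, axis=-1)
    return np.unravel_index(np.argmin(d), d.shape)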
SOM-balancing algorithm
1. Build clustering C for the training set
2. Compute the average cluster density density_avg
3. For each cluster c ∈ C
4.   If density(c) > density_avg
5.     Limit the number of samples in c to N
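A sketch of the SOM-balancing step, assuming each document already carries the id of its SOM cluster (e.g. via a cluster_of mapping like the one sketched above); over-dense clusters are capped at N samples. Names and the random tie-breaking are my assumptions.

import numpy as np
from collections import defaultdict

def som_balance(doc_ids, cluster_ids, N=10, rng=None):
    rng = rng or np.random.default_rng(0)
    clusters = defaultdict(list)
    for doc, c in zip(doc_ids, cluster_ids):
        clusters[c].append(doc)
    # Step 2: average number of documents per cluster
    density_avg = np.mean([len(docs) for docs in clusters.values()])
    kept = []
    for c, docs in clusters.items():
        # Steps 3-5: clusters denser than average keep at most N random samples
        if len(docs) > density_avg:
            docs = list(rng.choice(docs, size=min(N, len(docs)), replace=False))
        kept.extend(docs)
    return kept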
SOM-balancing results
Results:
- Training set size: 350K documents
- Map: 300x300 clusters, N=10
- Compression: 18%
- Quality: DCG original 17.20, DCG compressed 17.26
Problem:
- Compression level is small
SOM+QBag for learning to rank
Clustering for initial training set construction
1. Build a clustering using random sampling of documents
2. Mark all clusters as unused
3. Select the query that covers the maximum number of unused clusters
4. For each cluster covered by documents from that query
5.   Select 1 document and send it to the assessor
6.   Mark the cluster as used
7. Repeat from step 3 until M queries are selected
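A sketch of the cluster-coverage selection above, assuming each candidate query comes with the SOM cluster id of each of its documents; the greedy choice of step 3 and the one-document-per-new-cluster rule of steps 4-6 are implemented directly, while the data structures and names are my assumptions.

def select_initial_queries(query_docs, M):
    # query_docs: {query: [(doc_id, cluster_id), ...]}
    query_docs = dict(query_docs)          # work on a copy
    used, selected, to_judge = set(), [], []
    for _ in range(M):
        if not query_docs:
            break
        # Step 3: the query covering the maximum number of unused clusters
        best = max(query_docs,
                   key=lambda q: len({c for _, c in query_docs[q]} - used))
        selected.append(best)
        # Steps 4-6: one document per newly covered cluster goes to the assessor
        for doc, c in query_docs[best]:
            if c not in used:
                to_judge.append((best, doc))
                used.add(c)
        del query_docs[best]
    return selected, to_judge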
SOM+QBag for learning to rank
Application of QBag
1. Build committee of models for QBag
2. Build clustering C for current training set
3. Mark all clusters as unused
4. For each query from a pool of new queries
5. For each selected by QBag pair (d1
, d2
)
6. c1
= cluster(d1
), c2
= cluster(d2
)
7. If c1
is unused OR c2
is unused
8. Send d1
and d2
to assessors
9. Set c1
and c2
as used
10. Set all clusters as unused
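Finally, a sketch of how the QBag pair selection and the cluster flags above could be wired together. The slide leaves the scope of the "unused" reset ambiguous; here the flags are reset per query (step 10), and qbag_pairs, cluster_of and assessors are hypothetical callbacks standing in for the committee, the SOM lookup, and the judgement interface.

def qbag_with_clusters(new_queries, qbag_pairs, cluster_of, assessors):
    # new_queries: iterable of query ids
    # qbag_pairs(q): document pairs (d1, d2) chosen by the QBag committee for q
    # cluster_of(d): SOM cluster of document d
    # assessors(d1, d2): request a relevance judgement for the pair
    for q in new_queries:
        used = set()                                  # steps 3 / 10: reset the flags
        for d1, d2 in qbag_pairs(q):                  # step 5
            c1, c2 = cluster_of(d1), cluster_of(d2)   # step 6
            if c1 not in used or c2 not in used:      # step 7
                assessors(d1, d2)                     # step 8
                used.update((c1, c2))                 # step 9
    # Step 1 (building the committee) and step 2 (clustering the current
    # training set) happen before this loop and are omitted here.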