Alexey Voropaev
Active Learning to Rank
Search Mail.Ru
– www.go.mail.ru
– Search for:
• Russian web
• Images
• Video
• etc.
– 9% of market share
Machine learning is everywhere
There are ML algorithms in:
– Crawler
– Indexer
– Ranker
– Data mining systems
– Frontend
Most of them are supervised:
– Require training set
– Judgement is expensive
– Ranker training set: 1M documents, 50K queries
Usual problems of a training set
[Figure: points of class A, points of class B, and unlabelled points]
Problems:
– Unlabelled points between classes
– Imbalance of labelled points
– There is an unsampled cluster
Idea of Active Learning:
We fix these problems by smart construction of the training set.
We save the assessors' resources.
Uncertainty sampling
Take the instances whose labels the model is least certain about.
Problem:
- Requires the posterior distribution P(Y|x)
[Figure: points of class A, points of class B, and unlabelled points, with candidate points 1 and 2; point 1 is selected]
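A minimal sketch of uncertainty sampling in its "least confident" form, assuming the model exposes P(Y|x) as a list of class probabilities. The names `least_confident` and `toy_posterior` and the toy pool are hypothetical and only serve the illustration; they do not come from the slides.

```python
from typing import Callable, List, Sequence, TypeVar

X = TypeVar("X")


def least_confident(
    pool: Sequence[X],
    posterior: Callable[[X], Sequence[float]],
) -> X:
    """Return the pool item whose most likely label has the lowest probability."""
    # posterior(x) is assumed to return P(Y|x) as a sequence of class probabilities.
    return min(pool, key=lambda x: max(posterior(x)))


def toy_posterior(x: float) -> List[float]:
    """Hypothetical binary model: P(class B | x) rises linearly with x on [0, 1]."""
    p_b = min(1.0, max(0.0, x))
    return [1.0 - p_b, p_b]


pool = [0.05, 0.45, 0.9]          # unlabelled points (toy data)
query = least_confident(pool, toy_posterior)
```

Here `query` is the point closest to the decision boundary, which is the one that would be sent to the assessor.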
Query-By-Committee
Take the instances on which the committee's votes disagree the most.
[Figure: points of class A, points of class B, and unlabelled points, with candidate points 1 and 2; point 2 is selected]
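One common way to measure committee disagreement is vote entropy. The sketch below uses it as a stand-in, since the slide does not name a specific measure; the committee votes are made up so that point 2 comes out as the most contested one.

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def vote_entropy(votes: Sequence[Hashable]) -> float:
    """Entropy of the committee's votes for one instance; 0 means full agreement."""
    total = len(votes)
    counts = Counter(votes)
    return -sum((n / total) * math.log(n / total) for n in counts.values())


# Hypothetical votes of a 4-model committee on the two candidate points.
votes_by_point = {
    "point 1": ["A", "A", "A", "B"],
    "point 2": ["A", "B", "A", "B"],
}
query = max(votes_by_point, key=lambda p: vote_entropy(votes_by_point[p]))
```

For a two-class problem, maximizing vote entropy picks the same instance as minimizing the difference between the two vote counts.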
QBag algorithm
Input: T – initial labelled training set
C – size of the committee
A – learning algorithm
U – set of unlabelled objects
Output: T' – extended training set
1. Uniformly resample T, obtain T_1 ... T_C, where |T_i| < |T|
2. For each T_i build model M_i using A
3. Select x* = argmin_{x∈U} | |{i : M_i(x) = 1}| - |{i : M_i(x) = 0}| |
4. Pass x* to the assessor and update T
5. Repeat from 1 until convergence
K. Dwyer, R. Holte, Decision Tree Instability and Active Learning, 2007
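Steps 1 to 3 of one QBag round can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the learner A is replaced by a hypothetical one-dimensional threshold model (`train_stump`), the resample size is taken as |T| - 1 because the slide only requires |T_i| < |T|, and the data are toy values.

```python
import random
from typing import Callable, List, Sequence, Tuple

Example = Tuple[float, int]   # (feature value, label in {0, 1})
Model = Callable[[float], int]


def train_stump(sample: Sequence[Example]) -> Model:
    """Hypothetical stand-in for learner A: threshold halfway between class means."""
    zeros = [x for x, y in sample if y == 0]
    ones = [x for x, y in sample if y == 1]
    if not zeros or not ones:
        # Degenerate resample with one class only: always predict that class.
        only = 1 if ones else 0
        return lambda x: only
    mean0, mean1 = sum(zeros) / len(zeros), sum(ones) / len(ones)
    threshold = (mean0 + mean1) / 2
    if mean1 >= mean0:
        return lambda x: int(x > threshold)
    return lambda x: int(x < threshold)


def build_committee(T, C, A, rng) -> List[Model]:
    """Steps 1-2: draw C resamples T_i (with replacement) and train M_i on each."""
    size = max(1, len(T) - 1)  # assumed resample size, chosen so that |T_i| < |T|
    return [A([rng.choice(T) for _ in range(size)]) for _ in range(C)]


def select_by_margin(committee, U):
    """Step 3: the x in U minimising | #votes for 1 - #votes for 0 |."""
    def margin(x):
        ones = sum(m(x) for m in committee)
        return abs(ones - (len(committee) - ones))
    return min(U, key=margin)


T = [(0.0, 0), (0.1, 0), (0.2, 0), (0.8, 1), (0.9, 1), (1.0, 1)]  # toy labelled set
U = [0.05, 0.5, 0.95]                                             # toy unlabelled pool
committee = build_committee(T, C=11, A=train_stump, rng=random.Random(0))
x_star = select_by_margin(committee, U)  # step 4 would pass x_star to the assessor
```

Step 5 would wrap these calls in a loop that appends the newly labelled `x_star` to `T` and removes it from `U` until a stopping criterion is met.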
QBag quality
[Figure not preserved; source cited on the slide: K. Dwyer, R. Holte, Decision Tree Instability and Active Learning, 2007]
Density sampling
Idea: Balance dense/sparse regions of the input space
[Figure: points of class A, points of class B, and unlabelled points, with regions marked as dense, sparse, and not sampled]
Clustering of our training set for ranking
Self-organizing map
- each cell is a cluster
- color is relevance
[Figure legend: Navigation, Highly relevant, Medium relevant, Low relevant, Irrelevant, 404]
Density of the clustering
[Figure legend: High density, Low density]
Result of sparse regions sampling
[Figure legend: Old documents, New documents]
SOM-balancing algorithm
1. Build clustering C for the training set
2. Compute the average density density_avg
3. For each cluster c ∈ C
4.     If density(c) > density_avg
5.         Limit the number of samples in c to N
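The balancing step can be sketched as below, assuming the clustering already exists and each sample carries a cluster id. Two further assumptions are made here and are not taken from the slides: density(c) is the number of samples in c, averaged over non-empty clusters, and the N samples kept in a dense cluster are chosen at random. The function name `balance_clusters` is hypothetical.

```python
import random
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence


def balance_clusters(
    cluster_of: Sequence[Hashable],
    n_max: int,
    rng: random.Random,
) -> List[int]:
    """Return sorted indices of the samples kept after capping dense clusters."""
    if not cluster_of:
        return []
    members: Dict[Hashable, List[int]] = defaultdict(list)
    for index, cluster in enumerate(cluster_of):
        members[cluster].append(index)

    # Assumption: density(c) = number of samples in c; average over non-empty clusters.
    density_avg = len(cluster_of) / len(members)

    kept: List[int] = []
    for indices in members.values():
        if len(indices) > density_avg and len(indices) > n_max:
            indices = rng.sample(indices, n_max)  # keep N random samples of a dense cluster
        kept.extend(indices)
    return sorted(kept)


# Toy assignment: cluster "a" is dense, "b" and "c" are sparse.
cluster_of = ["a"] * 8 + ["b"] * 2 + ["c"] * 2
kept = balance_clusters(cluster_of, n_max=3, rng=random.Random(0))
```

In this toy run the dense cluster is cut from 8 samples to 3 while the sparse clusters are left untouched, so 7 of the 12 samples remain.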
SOM-balancing results
Results:
- Training set size: 350K documents
- Map: 300x300 clusters, N=10
- Compression: 18%
- Quality:
DCG Original: 17.20
DCG Compressed: 17.26
Problem:
- Compression level is small
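The slides report quality as DCG without stating which variant or cutoff was used. As a reference point only, here is a sketch of one widely used formulation, with gain 2^rel - 1 and a log2 position discount; the relevance grades in the example are made up.

```python
import math
from typing import Sequence


def dcg(relevances: Sequence[float]) -> float:
    """DCG of a ranked list: gain (2^rel - 1) discounted by log2(position + 1)."""
    return sum(
        (2 ** rel - 1) / math.log2(position + 1)
        for position, rel in enumerate(relevances, start=1)
    )


ranked = [3, 2, 0, 1]   # hypothetical relevance grades in ranked order
score = dcg(ranked)
```

Placing a more relevant document higher increases the score, which is why the metric is suited to comparing a ranker trained on the original set with one trained on the compressed set.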
SOM+QBag for learning to rank
Clustering for initial training set construction
1. Build clustering using a random sample of documents
2. Mark all clusters as unused
3. Select the query that covers the maximum number of unused clusters
4. For each cluster covered by documents from the query
5.     Select 1 document and send it to the assessor
6.     Mark the cluster as used
7. Repeat from line 3 until M queries are selected
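The greedy query selection above can be sketched as follows, assuming the clustering from step 1 is already available as a mapping from each query's documents to cluster ids. The function name, the data layout, and the choice of the first document seen per cluster are assumptions made for the illustration.

```python
from typing import Dict, Hashable, List, Set, Tuple


def build_initial_set(
    query_docs: Dict[str, Dict[str, Hashable]],
    m_queries: int,
) -> List[Tuple[str, str]]:
    """Greedily pick queries and return the (query, document) pairs to assess.

    query_docs maps query -> {document id -> cluster id}.
    """
    used: Set[Hashable] = set()            # step 2: every cluster starts unused
    remaining = dict(query_docs)
    to_assess: List[Tuple[str, str]] = []
    for _ in range(min(m_queries, len(remaining))):   # step 7: stop after M queries
        # Step 3: the query whose documents fall into the most unused clusters.
        best = max(remaining, key=lambda q: len(set(remaining[q].values()) - used))
        covered: Set[Hashable] = set()
        for doc, cluster in remaining.pop(best).items():
            if cluster not in covered:     # steps 4-5: one document per covered cluster
                to_assess.append((best, doc))
                covered.add(cluster)
        used |= covered                    # step 6: mark the covered clusters as used
    return to_assess


# Toy data: three queries with documents assigned to clusters 0..3.
query_docs = {
    "q1": {"d1": 0, "d2": 0, "d3": 1},
    "q2": {"d4": 1, "d5": 2, "d6": 3},
    "q3": {"d7": 0, "d8": 1},
}
selected = build_initial_set(query_docs, m_queries=2)
```

In this run `q2` is chosen first because it covers three unused clusters; `q1` and `q3` then tie with one unused cluster each, and the tie is broken by dictionary order.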
SOM+QBag for learning to rank
Application of QBag
1. Build committee of models for QBag
2. Build clustering C for the current training set
3. Mark all clusters as unused
4. For each query from a pool of new queries
5.     For each pair (d1, d2) selected by QBag
6.         c1 = cluster(d1), c2 = cluster(d2)
7.         If c1 is unused OR c2 is unused
8.             Send d1 and d2 to assessors
9.             Set c1 and c2 as used
10.     Set all clusters as unused
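A sketch of the cluster filter applied to the pairs proposed by QBag. It assumes that the committee and the clustering (steps 1 and 2) are supplied from outside as two callables, and that step 10 resets the used clusters after every query. The names `filter_pairs_by_cluster`, `qbag_pairs`, and `cluster`, as well as the toy data, are hypothetical.

```python
from typing import Callable, Dict, Hashable, Iterable, List, Sequence, Tuple

Pair = Tuple[str, str]


def filter_pairs_by_cluster(
    queries: Sequence[str],
    qbag_pairs: Callable[[str], Iterable[Pair]],
    cluster: Callable[[str], Hashable],
) -> Dict[str, List[Pair]]:
    """Per query, keep only the QBag pairs that touch a cluster not yet used."""
    to_assess: Dict[str, List[Pair]] = {}
    for query in queries:                           # step 4
        used = set()                                # steps 3 and 10: fresh state per query
        kept: List[Pair] = []
        for d1, d2 in qbag_pairs(query):            # step 5
            c1, c2 = cluster(d1), cluster(d2)       # step 6
            if c1 not in used or c2 not in used:    # step 7
                kept.append((d1, d2))               # step 8: goes to the assessors
                used.update((c1, c2))               # step 9
        to_assess[query] = kept
    return to_assess


# Toy data: document -> cluster id, and the pairs QBag proposed per query.
clusters = {"a": 0, "b": 0, "c": 1, "d": 1, "e": 2}
pairs = {"q1": [("a", "c"), ("b", "d"), ("b", "e")], "q2": [("a", "b")]}
result = filter_pairs_by_cluster(["q1", "q2"], lambda q: pairs[q], clusters.get)
```

The pair `("b", "d")` is dropped for `q1` because both of its clusters were already covered by `("a", "c")`, which is how the filter limits redundant judgements within one query.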
SOM+QBag for learning to rank: results
All data: 300k documents
Test set: 300k docs
Our search quality vs main competitors
Thank you!
References:
- http://active-learning.net/
- Burr Settles. Active Learning. Synthesis Lectures on Artificial Intelligence and Machine Learning, June 2012
