Introduction Quantification The proposed approach Experiment Framework Conclusion
Efficient Model Selection for Regularized
Classification by Exploiting Unlabeled Data
Georgios Balikas (1), Ioannis Partalas (2), Eric Gaussier (1), Rohit Babbar (3), Massih-Reza Amini (1)
(1) University Grenoble, Alpes
(2) Viseo R&D
(3) Max Planck Institute for Intelligent Systems
Intelligent Data Analysis 2015, Saint-Étienne
1/17 Balikas et al. Efficient Model Selection by Exploiting Unlabeled Data
Outline
1 Introduction
2 Quantification
3 The proposed approach
4 Experiment Framework
5 Conclusion
Model selection for text classification
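[Figure: documents Doc_1, ..., Doc_N pass through feature extraction to give vectors $d_1, \dots, d_N \in \mathbb{R}^d$. Learning then selects $h_\theta \in \mathcal{H}$, where $\theta$ denotes the hyper-parameters and $\hat{R}(\theta) \in \mathbb{R}$ is the estimated risk. Open question: which $\theta$?]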
The task
Efficiently select the hyper-parameter value which minimizes the
generalization error (using the empirical error as a proxy).
Traditional Model Selection Methods
[Figure: 5-fold Cross Validation. The data is split into five folds; each fold serves once as the validation set while the remaining four are used for training.]
[Figure: Hold-out. A single split into a training part and a validation part.]
Extensions of the above such as Leave-one-out, etc.
M. Mohri et al.
Foundations of Machine Learning, MIT press 2012
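As a rough illustration of the two schemes above, the following sketch builds the index splits for hold-out and k-fold cross-validation using only the standard library. The function names, the default 30% validation fraction and the shuffling seed are assumptions made for this example, not part of the talk.

```python
import random


def hold_out_split(n, valid_fraction=0.3, seed=0):
    """Return (train_idx, valid_idx) for one hold-out split of n examples."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_valid = int(round(n * valid_fraction))
    return idx[n_valid:], idx[:n_valid]


def k_fold_splits(n, k=5, seed=0):
    """Yield k (train_idx, valid_idx) pairs; each example is validated once."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    # Deal the shuffled indices into k folds of (almost) equal size.
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, folds[i]
```

With k folds, a model has to be trained k times for every candidate hyper-parameter value, which is where the cost of cross-validation comes from.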
The issues
In large scale problems:
Resource intensive: on the order of $10^6$ to $10^8$ free parameters. Even an optimized k-CV can take up to several days.
Power-law distribution of examples: small classes have only a few instances, so splitting them results in loss of information.
[Figure: number of labeled documents per class.]
R. Babbar, I. Partalas, E. Gaussier, M-R. Amini
Re-ranking approach to classification in large-scale power-law distributed
category systems, SIGIR 2014
Our contribution
We propose a bound that motivates an efficient model selection method. The method:
– leverages unlabeled data for model selection,
– performs on par with (if not better than) traditional methods,
– is k times faster than k-fold cross-validation.
Quantification
Definition
In many classification scenarios, the real goal is determining the
prevalence of each class in the test set, a task called quantification.
Given a dataset:
How many people liked the new iPhone?
How many instances belong to class $y_i$?
A. Esuli and F. Sebastiani
Optimizing text quantifiers for multivariate loss functions, arXiv preprint
arXiv:1502.05491
Quantification using general purpose learners
Classify and Count (CC)
Aggregative method
– Classify each instance first
– Count instances per class
Probabilistic Classify and Count (PCC)
Non-aggregative method
– Get scores/probabilities for each instance
– Sum the probabilities per class
G. Forman
Counting positives accurately despite inaccurate classification, ECML 2005
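The two quantifiers above can be sketched in a few lines. This is a minimal illustration, assuming predictions are given as a list of labels (for CC) or as a list of per-instance class-probability dictionaries (for PCC); the function names and input layout are hypothetical.

```python
from collections import Counter


def classify_and_count(predicted_labels, classes):
    """CC: prevalence of a class = fraction of instances assigned to it."""
    counts = Counter(predicted_labels)
    n = len(predicted_labels)
    return {c: counts.get(c, 0) / n for c in classes}


def probabilistic_classify_and_count(posteriors, classes):
    """PCC: prevalence of a class = mean posterior probability of that class.

    `posteriors` holds one dict per instance, mapping class -> probability.
    """
    n = len(posteriors)
    return {c: sum(p.get(c, 0.0) for p in posteriors) / n for c in classes}


labels = ["a", "a", "b", "a"]
print(classify_and_count(labels, ["a", "b"]))  # {'a': 0.75, 'b': 0.25}
```

CC only needs hard decisions, while PCC needs a classifier that outputs scores that can be read as probabilities.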
Our setting
Mono-label, multi-class classification
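Observations $x \in \mathcal{X} \subseteq \mathbb{R}^d$, labels $y \in \mathcal{Y}$, $|\mathcal{Y}| > 2$
$(x, y)$ drawn i.i.d. according to a fixed, unknown distribution $\mathcal{D}$ over $\mathcal{X} \times \mathcal{Y}$
$S_{train} = \{(x^{(i)}, y^{(i)})\}_{i=1}^{N}$, $S = \{x^{(i)}\}_{i=N+1}^{M}$
Regularized classification: $\hat{w} = \arg\min_w R_{emp}(w) + \lambda \mathrm{Reg}(w)$
$h_\theta \in \mathcal{H}$; e.g., for SVMs, $\theta = \lambda$, taken from a set $\lambda_{values}$
$\hat{p}_y$: prior of class $y$ on $S_{train}$; $p_y^{C(S)}$: prior of class $y$ estimated using quantification on $S$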
Accuracy bound
Theorem
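Let $S = \{x^{(j)}\}_{j=1}^{M}$ be a set generated i.i.d. with respect to $\mathcal{D}_{\mathcal{X}}$, $p_y$ the true prior probability for category $y \in \mathcal{Y}$, and $\hat{p}_y = N_y / N$ its empirical estimate obtained on $S_{train}$.
We consider here a classifier $C$ trained on $S_{train}$ and we assume that the quantification method used is accurate in the sense that:
$$\exists \epsilon,\ \epsilon \ll \min\{p_y, \hat{p}_y, p_y^{C(S)}\},\ \forall y \in \mathcal{Y}: \left| p_y^{C(S)} - \frac{M_y^{C(S)}}{|S|} \right| \le \epsilon$$
Let $B_A^{C(S)}$ be defined as:
$$B_A^{C(S)} = \sum_{y \in \mathcal{Y}} \frac{\min\{\hat{p}_y \times |S|,\ p_y^{C(S)} \times |S|\}}{|S|}$$
Then for any $\delta \in ]0, 1]$, with probability at least $(1 - \delta)$:
$$A^{C(S)} \le B_A^{C(S)} + |\mathcal{Y}| \left( \sqrt{\frac{\log |\mathcal{Y}| + \log \frac{1}{\delta}}{2N}} + \epsilon \right)$$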
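To make the quantities in the theorem concrete, here is a small sketch that evaluates the data-dependent term and the deviation term. It assumes the deviation term has the Hoeffding-style form $|\mathcal{Y}|(\sqrt{(\log|\mathcal{Y}| + \log(1/\delta))/(2N)} + \epsilon)$, with $\epsilon$ the quantification tolerance; the function names and the dictionary-based inputs are hypothetical.

```python
import math


def accuracy_bound(train_priors, quantified_priors):
    """B_A^{C(S)}: sum over classes of min(p_hat_y, p_y^{C(S)}).

    The |S| factors in the numerator and the denominator cancel.
    """
    return sum(
        min(train_priors[y], quantified_priors.get(y, 0.0))
        for y in train_priors
    )


def deviation_term(num_classes, n_train, delta, epsilon):
    """|Y| * (sqrt((log|Y| + log(1/delta)) / (2N)) + epsilon) (assumed form)."""
    root = math.sqrt(
        (math.log(num_classes) + math.log(1.0 / delta)) / (2.0 * n_train)
    )
    return num_classes * (root + epsilon)


p_hat = {"a": 0.6, "b": 0.3, "c": 0.1}      # priors on S_train
p_cs = {"a": 0.7, "b": 0.25, "c": 0.05}     # quantification on S
b_a = accuracy_bound(p_hat, p_cs)
slack = deviation_term(num_classes=3, n_train=1000, delta=0.05, epsilon=0.01)
```

Under the assumption that $\epsilon$ is the same for all candidate classifiers, the deviation term does not depend on the hyper-parameter, so only $B_A^{C(S)}$ needs to be compared across candidates.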
Intuition
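$$B_A^{C(S)} = \sum_{y \in \mathcal{Y}} \frac{\min\{\hat{p}_y \times |S|,\ p_y^{C(S)} \times |S|\}}{|S|}$$
Here $\hat{p}_y$ is the prior probability of $y$ and $p_y^{C(S)}$ is the estimated probability of $y$ on $S$.
In power-law distributed category systems this is an upper bound:
– $\hat{p}_y$ will be used for large classes, whose estimated size is inflated by false positives, and
– $p_y^{C(S)}$ will be used for small classes, whose estimated size is reduced by false negatives.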
Model selection using the bound
Training Data
Estimate class priors
Quantification on unseen data
for λ in λ_values do
    Train on S_train
    Estimate p_y^{C(S)} on S
end for
Calculate the Bound
Select hyper-parameter value
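The steps above can be sketched end to end as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: quantification is done with Classify and Count, the value with the largest bound is selected (the slides do not state the selection rule explicitly), and `train_fn` and `toy_train_fn` are hypothetical stand-ins for a real regularized learner.

```python
from collections import Counter


def class_priors(labels, classes):
    """Fraction of labels falling in each class."""
    counts = Counter(labels)
    return {c: counts.get(c, 0) / len(labels) for c in classes}


def select_hyperparameter(train_x, train_y, unlabeled_x, lambda_values, train_fn):
    """Pick a hyper-parameter value using the bound, training once per value.

    train_fn(train_x, train_y, lam) must return a callable x -> label
    (hypothetical interface).
    """
    classes = sorted(set(train_y))
    p_hat = class_priors(train_y, classes)            # priors on S_train
    best_lam, best_bound = None, float("-inf")
    for lam in lambda_values:
        predict = train_fn(train_x, train_y, lam)     # train on S_train
        # Classify and Count on the unlabeled set S.
        p_cs = class_priors([predict(x) for x in unlabeled_x], classes)
        bound = sum(min(p_hat[c], p_cs[c]) for c in classes)
        # Assumption: keep the value whose bound is largest.
        if bound > best_bound:
            best_lam, best_bound = lam, bound
    return best_lam, best_bound


def toy_train_fn(train_x, train_y, lam):
    """Toy stand-in for a learner: lam simply acts as a 1-D threshold."""
    return lambda x: "pos" if x > lam else "neg"


best_lam, best_bound = select_hyperparameter(
    train_x=[0.1, 0.2, 0.8, 0.9],
    train_y=["neg", "neg", "pos", "pos"],
    unlabeled_x=[0.15, 0.3, 0.7, 0.85],
    lambda_values=[0.0, 0.5, 1.0],
    train_fn=toy_train_fn,
)
```

Each candidate value requires a single training run on the full training set, compared with k runs per value in k-fold cross-validation.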
Datasets
Dataset    #Training  #Quantification  #Test  #Features  #Parameters
dmoz250        1,542            2,401  1,023     55,610        13.9M
dmoz500        2,137            3,042  1,356     77,274        38.6M
dmoz1000       6,806           10,785  4,510    138,879       138.8M
dmoz1500       9,039           14,002  5,958    170,828       256.2M
dmoz2500      12,832           19,188  8,342    212,073       530.1M
– Similar experimental settings on Wikipedia data
– SVMs and logistic regression, $\lambda \in \{10^{-4}, \dots, 10^{4}\}$
– Compared methods: 5-CV, Hold-out (70%-30%), BoundUn, BoundTest
Results (1/2)
[Figure: MaF measure optimization for wiki1500 for SVM. x-axis: λ values ($10^{-4}$ to $10^{3}$); y-axis: Accuracy (0.0 to 0.8); curves: 5-CV, H out, MaF, CC, PCC.]
Results (2/2)
          BoundUn         BoundTest       Hold-out                        5-CV
Dataset   Acc     MaF     Acc     MaF     Acc             MaF             Acc     MaF
dmoz250   .8260   .6242   .8270   .6243   .8260 (±.0000)  .6242 (±.0000)  .8260   .6242
dmoz500   .7227   .5584   .7227   .5584   .7221 (±.0005)  .5558 (±.0022)  .7220   .5562
dmoz1000  .7302   .4883   .7302   .4892   .7301 (±.0001)  .4835 (±.0155)  .7299   .4883
dmoz1500  .7132   .4715   .7132   .4715   .6958 (±.0457)  .4065 (±.0998)  .7132   .4715
dmoz2500  .6352   .4301   .6350   .4306   .6350 (±.0001)  .3949 (±.0686)  .6352   .4301
wiki1500 for SVM on 4 cores: BoundUn (302 sec), 5-CV (1310 sec).
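As a quick check of the reported timings, the ratio between the two running times can be worked out directly:

```latex
\frac{t_{\text{5-CV}}}{t_{\text{BoundUn}}} = \frac{1310\ \text{sec}}{302\ \text{sec}} \approx 4.34
```

On this run the measured speedup is therefore a little over 4x, which is close to the factor k = 5 one would expect if training dominates the cost.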
Conclusions
The proposed method:
– performs as well as or better than traditional model selection methods,
– is k times faster than k-CV,
– requires unlabeled data from the same distribution as the training data.
Thank you
Georgios Balikas
georgios.balikas@imag.fr
Ioannis Partalas
ioannis.partalas@viseo.com
Eric Gaussier
eric.gaussier@imag.fr
Rohit Babbar
rohit.babbar@gmail.com
Massih-Reza Amini
massih-reza.amini@imag.fr
This work is partially supported by the CIFRE N 28/2015 and by
the LabEx PERSYVAL Lab ANR-11-LABX-0025.