Clustering Crowds
Hiroshi Kajino (1), Yuta Tsuboi (2), Hisashi Kashima (1)
(1) The University of Tokyo
(2) IBM Research - Tokyo
July 16th, 2013, AAAI-13
*H. Kajino and H. Kashima were supported by the FIRST program.
Outline
• Motivation and Problem Setting
Quality control problem of crowdsourcing
• Existing Method
Learning from a crowd-generated training set
• Proposed Method
Focusing on the similarity between workers
• Experimental Results
Robust estimation can be realized
• Conclusion
Crowdsourcing
• Crowdsourcing: a system for accessing a large crowd of workers
Pros: human intelligence tasks can be processed at low cost
Cons: the abilities of the workers are unknown
⇒ Quality of results is not guaranteed
[Figure: requester and worker. 1. Request tasks, 2. Return results, 3. Pay rewards]
Requesters can access a large but unknown workforce
Task in Machine Learning Community
• Task: picture = bird ?
Pros: easy to construct a large training set at low cost
Cons: quality of labels is not guaranteed
[Figure: easy and difficult pictures, each with Yes/No labels from a superior worker and an inferior worker, alongside the true labels (unobservable)]
A large but low-quality training set can be obtained easily
Overcome this difficulty
Problem Setting
• Input
– Feature vector: $x_i \in \mathbb{R}^D$ ($i = 1, \dots, I$)
– Worker: $j \in \{1, 2, \dots, J\}$
– Crowd label: $y_{ij} \in \{0, 1\}$
• Output
– Classifier $w_0 \in \mathbb{R}^D$ (logistic regression model)
Note: we do not use the ground truths
• Common Approach:
1. Model the relation between $w_0$ and $\{y_{ij}\}$
2. Infer the model to obtain $w_0$
[Figure: a "bird or not" classifier $w_0$]
Estimate a classifier from crowd-generated data
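The inputs and the model above can be pictured with a small sketch. The array names (`X`, `Y`, `observed`), the toy values, and the use of `-1` for a missing label are assumptions made here for illustration; the slide only fixes the symbols x_i, y_ij and w0.

```python
import numpy as np

def predict_proba(w, X):
    """Logistic regression: P(y = 1 | x) = sigmoid(w . x) for each row x of X."""
    return 1.0 / (1.0 + np.exp(-(X @ w)))

# I = 4 instances with D = 2 features (toy values, not from the slides).
X = np.array([[1.0, 0.5],
              [0.2, -1.0],
              [-0.7, 0.3],
              [1.5, 1.5]])

# Crowd labels y_ij from J = 3 workers.
# -1 is a hypothetical marker for "worker j did not label instance i".
Y = np.array([[1, 1, -1],
              [0, -1, 1],
              [0, 0, 0],
              [-1, 1, 1]])
observed = Y >= 0  # boolean mask of the labels that actually exist

# The classifier to be estimated; note that no ground-truth labels appear anywhere.
w0 = np.zeros(X.shape[1])
print(predict_proba(w0, X))  # an all-zero classifier gives probability 0.5 everywhere
```

Keeping a separate mask makes it explicit that workers typically label only part of the data, which is the situation the later slides are concerned with.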
Existing Method
• Personal Classifier (PC) Method [Kajino+, 12]
– Worker j = classifier $w_j$ (= $w_0$ + noise)
– Prior distribution: $\mathcal{N}(w_0 \mid 0, \eta^{-1} I)$
[Figure: the true classifier $w_0$, the personal classifiers $w_1, w_2, w_3$ (workers j = 1, 2, 3), and the crowd labels $y_{i1}, y_{i2}, y_{i3}$; a legend marks known and unknown quantities]
Aggregate “personal classifiers” to obtain $w_0$
Existing Method
• Parameter estimation = MAP estimation
– Parameters: $w_0$, $W = \{w_j\}_{j=1}^{J}$
– Solve the convex optimization problem:
$\min_{w_0, W}$ (prior) + (model-relation) + (loss for PCs: logistic loss)
Parameter estimation = optimizing a convex function
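The objective itself was an image on the slide and cannot be recovered from the text. As a rough sketch only, one form consistent with the three labelled terms assumes Gaussian noise $w_j - w_0 \sim \mathcal{N}(0, \lambda^{-1} I)$. The precision `lam`, the array layout, and the plain gradient-descent solver below are all assumptions of this sketch, not the authors' stated formulation.

```python
import numpy as np

def pc_objective(w0, W, X, Y, observed, eta=1.0, lam=1.0):
    """Assumed negative log-posterior: prior + model-relation + logistic loss.

    w0: (D,) base classifier, W: (J, D) personal classifiers,
    X: (I, D) features, Y: (I, J) labels in {0, 1},
    observed: (I, J) boolean mask of the labels that exist.
    """
    prior = 0.5 * eta * (w0 @ w0)                 # from N(w0 | 0, eta^-1 I)
    relation = 0.5 * lam * np.sum((W - w0) ** 2)  # assumed N(w_j | w0, lam^-1 I)
    Z = X @ W.T                                   # (I, J) scores w_j . x_i
    # Logistic loss log(1 + exp(z)) - y * z, summed over observed labels only.
    loss = np.sum(observed * (np.logaddexp(0.0, Z) - Y * Z))
    return prior + relation + loss

def pc_gradients(w0, W, X, Y, observed, eta=1.0, lam=1.0):
    """Gradients of pc_objective with respect to w0 and W."""
    Z = X @ W.T
    residual = observed * (1.0 / (1.0 + np.exp(-Z)) - Y)  # (I, J)
    grad_W = lam * (W - w0) + residual.T @ X               # (J, D)
    grad_w0 = eta * w0 - lam * np.sum(W - w0, axis=0)      # (D,)
    return grad_w0, grad_W

def fit_pc(X, Y, observed, eta=1.0, lam=1.0, step=0.01, iters=500):
    """Fixed-step gradient descent; adequate for a convex toy problem."""
    w0 = np.zeros(X.shape[1])
    W = np.zeros((Y.shape[1], X.shape[1]))
    for _ in range(iters):
        g0, gW = pc_gradients(w0, W, X, Y, observed, eta, lam)
        w0 -= step * g0
        W -= step * gW
    return w0, W
```

Under this assumed form every worker contributes D free parameters, so a worker with only a handful of labels has a poorly determined `W[j]`.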
Existing Method: Discussion
• Personal Classifier Method [Kajino+, 2012]
#(parameters / worker) = D
Pros: global optimum
Cons: poor performance when there are few data per worker
• Clustered Personal Classifier Method (proposed)
Pros: global optimum & moderate performance
Key: fuse similar workers to decrease the degrees of freedom
Estimation can be unstable for the PC method
Proposed Method: Idea
• Analysis of workers [Welinder+, 2010]
“Notice how the annotators’ decision planes fall roughly into three clusters”
– Clustering workers is a reasonable idea
(phrase & picture are cited from “The multidimensional wisdom of crowds” by Welinder+, NIPS 2010)
Similarity between workers can be observed in real data
Proposed Method: Formulation
• Clustered Personal Classifier (CPC) Method
– Model-relation term finds and fuses similar workers (forcing $w_j = w_k$)
→ Cuts down the degrees of freedom
($\mu$ controls the strength of clustering)
[Equations not recovered: the CPC model-relation term, shown next to the PC method's model-relation term for comparison]
Fuse similar workers to cut down the degrees of freedom
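The formula for the CPC model-relation term did not survive extraction. A minimal sketch of the idea, assuming the term is a sum of unsquared Euclidean distances between pairs of classifiers weighted by μ (a common fusion penalty; the slide does not confirm the exact form, nor whether w0 is among the paired vectors), next to an assumed squared PC-style term for comparison. All function names are hypothetical.

```python
import numpy as np
from itertools import combinations

def pc_relation(w0, W, lam):
    """PC-style term (assumed squared form): pulls every w_j toward w0."""
    return 0.5 * lam * np.sum((W - w0) ** 2)

def fusion_penalty(V, mu):
    """Assumed CPC-style term: mu * sum over pairs j < k of ||v_j - v_k||_2.

    The norm is not squared, so the penalty has a kink at v_j = v_k; with a
    large enough mu, minimizers can make classifiers exactly equal (fused).
    """
    pairs = combinations(range(len(V)), 2)
    return mu * sum(np.linalg.norm(V[j] - V[k]) for j, k in pairs)

def count_distinct(V, tol=1e-8):
    """Number of distinct rows of V up to tol, i.e. the number of clusters."""
    representatives = []
    for v in V:
        if not any(np.linalg.norm(v - r) <= tol for r in representatives):
            representatives.append(v)
    return len(representatives)

# Three workers in D = 2; workers 0 and 1 are already fused.
W = np.array([[0.0, 0.0],
              [0.0, 0.0],
              [3.0, 4.0]])
print(fusion_penalty(W, mu=0.5))  # pair distances are 0, 5 and 5
print(count_distinct(W))          # fused workers share one parameter vector
```

With fused workers the number of free parameters drops from J × D to (number of clusters) × D, which is the reduction in degrees of freedom the slide refers to.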
Experiments on Synthetic Data
• Synthetic Data (J = I = 10; workers are spammers (random workers) or experts)
(L) Dimension = 2: PC method = CPC method (comparable performance)
(R) Dimension = 10: PC method < CPC method (CPC performs better)
[Figure: two panels (L, R); x-axis: percentage of spammers; curves: Proposed and Existing; an arrow marks the “better” direction]
Robust performance on a small data set
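The slide gives only J = I = 10 and the two worker types, so the generation process below is an assumed stand-in: experts copy the true label without noise, spammers answer with fair coin flips, and features and the true classifier are drawn from a standard normal. The function name and parameters are hypothetical.

```python
import numpy as np

def make_synthetic(n_instances=10, n_workers=10, dim=2, spammer_rate=0.3, seed=0):
    """Generate toy crowd labels from experts and spammers (assumed process)."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n_instances, dim))
    w_true = rng.normal(size=dim)
    truth = (X @ w_true > 0).astype(int)        # unobservable true labels

    n_spammers = int(round(spammer_rate * n_workers))
    is_spammer = np.arange(n_workers) < n_spammers

    # Experts: every column starts as a copy of the true labels.
    Y = np.tile(truth[:, None], (1, n_workers))
    # Spammers: overwrite their columns with labels drawn uniformly from {0, 1}.
    Y[:, is_spammer] = rng.integers(0, 2, size=(n_instances, n_spammers))
    return X, Y, truth, is_spammer

X, Y, truth, is_spammer = make_synthetic(dim=10, spammer_rate=0.5)
print(Y.shape, int(is_spammer.sum()))
```

Sweeping `spammer_rate` for `dim=2` and `dim=10` would mirror the two panels of the slide's figure, given some estimator to evaluate.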
Experiments on Real Data: Performance
• Performance Test on Real Data [Finin+, 10]
– NER task (decide whether each word is a named entity or not)
– Dimension = 161,901, #(instances) = 17,747, #(workers) = 42

Method                  Precision  Recall  F-measure
CPC method (proposed)   0.647      0.716   0.680
PC method (existing)    0.637      0.721   0.677
LC method (existing)    0.625      0.732   0.675
AOC method (existing)   0.680      0.670   0.675
MV method (existing)    0.686      0.651   0.668

Proposed method achieves the highest F-measure among the compared methods
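The F-measure column can be checked against the other two columns. This sketch assumes the F-measure is the balanced F1 score (the harmonic mean of precision and recall), which the slide does not state explicitly. Because the inputs are already rounded to three decimals, the recomputed values may differ from the reported ones in the last digit.

```python
def f_measure(precision, recall):
    """Balanced F-measure (F1): harmonic mean of precision and recall."""
    return 2.0 * precision * recall / (precision + recall)

# (precision, recall, reported F-measure) copied from the table above.
reported = {
    "CPC": (0.647, 0.716, 0.680),
    "PC": (0.637, 0.721, 0.677),
    "LC": (0.625, 0.732, 0.675),
    "AOC": (0.680, 0.670, 0.675),
    "MV": (0.686, 0.651, 0.668),
}

for name, (p, r, f) in reported.items():
    print(f"{name}: recomputed {f_measure(p, r):.3f}, reported {f:.3f}")
```

The table shows different trade-offs: MV has the highest precision, LC the highest recall, and CPC the highest F-measure.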
Experiments on Real Data: Clustering
• Hierarchical clustering of workers is obtained by increasing μ
• An outlier worker can be detected without “honey pots”
[Figure: hierarchical clustering of workers; x-axis: strength of clustering (= μ); annotation on the plot: Precision: 0.454, Recall: 0.857]
Clustering result indicates the existence of an outlier worker
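Reading clusters off the estimated classifiers can be sketched as follows. The classifiers would come from a CPC solver, which is not shown; the arrays here are made-up stand-ins for solver output at a small and a larger μ, and the function names and tolerance are assumptions of this sketch.

```python
import numpy as np

def fused_clusters(W, tol=1e-6):
    """Assign the same cluster id to workers whose classifiers coincide within tol."""
    labels = [-1] * len(W)
    next_label = 0
    for j in range(len(W)):
        if labels[j] != -1:
            continue  # already absorbed into an earlier cluster
        labels[j] = next_label
        for k in range(j + 1, len(W)):
            if labels[k] == -1 and np.linalg.norm(W[j] - W[k]) <= tol:
                labels[k] = next_label
        next_label += 1
    return labels

def singleton_workers(labels):
    """Workers that sit alone in their cluster: candidate outliers."""
    return [j for j, c in enumerate(labels) if labels.count(c) == 1]

# Hypothetical solver output for 4 workers in D = 2.
W_small_mu = np.array([[1.0, 0.9], [1.1, 1.0], [-2.0, 3.0], [0.9, 1.0]])
W_large_mu = np.array([[1.0, 1.0], [1.0, 1.0], [-2.0, 3.0], [1.0, 1.0]])

for name, W in [("small mu", W_small_mu), ("large mu", W_large_mu)]:
    labels = fused_clusters(W)
    print(name, labels, "singletons:", singleton_workers(labels))
```

Repeating this over an increasing sequence of μ values and recording when workers merge gives a hierarchy of the kind shown on the slide; a worker that stays alone while the others fuse is a candidate outlier, found without any gold-standard questions.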
Conclusion
• Problem Setting
– Learning from redundant, variable-quality training data
• Problem of the PC Method
– #(parameters) is relatively large
– Unstable when there are few data per worker
• Proposed Method (CPC Method)
– Cuts the degrees of freedom by fusing similar workers
• Experimental Results
– More robust estimation on small data sets
– Valid as a method of “mining” workers
Introducing similarities between workers is beneficial