Clustering Crowds
Hiroshi Kajino (1), Yuta Tsuboi (2), Hisashi Kashima (1)
(1) The University of Tokyo
(2) IBM Research - Tokyo
July 16th, 2013, AAAI-13
*H. Kajino and H. Kashima were supported by the FIRST program.
Outline
• Motivation and Problem Setting
Quality control problem of crowdsourcing
• Existing Method
Learning from a crowd-generated training set
• Proposed Method
Focusing on the similarity between workers
• Experimental Results
Robust estimation can be realized
• Conclusion
Crowdsourcing
• Crowdsourcing: a system for accessing a large crowd of workers
Pros: human intelligence tasks can be processed at low cost
Cons: the abilities of the workers are unknown
⇒ The quality of the results is not guaranteed
Able to access large, but unknown manpower
[Figure: interaction between requester and worker: 1. Request tasks, 2. Return results, 3. Pay rewards]
Task in Machine Learning Community
• Task: is the picture a bird?
Pros: easy to construct a large training set at low cost
Cons: the quality of the labels is not guaranteed
Large, but low-quality training set can be obtained easily
[Figure: Yes/No labels given by a superior worker and an inferior worker to easy and difficult pictures, next to the true labels (unobservable)]
Overcome this difficulty
Problem Setting
• Input
– Feature vector: x_i ∈ R^D (i = 1, …, I)
– Worker: j ∈ {1, 2, …, J}
– Crowd label: y_ij ∈ {0, 1}
• Output
– Classifier w_0 ∈ R^D (logistic regression model)
Note: we do not use the ground truths
• Common Approach:
1. Model the relation between w_0 and {y_ij}
2. Infer the model to obtain w_0
Estimate a classifier from crowd-generated data
[Figure: the classifier w_0 decides "bird or not"]
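The input described above can be sketched in code. The following is a minimal illustration, not the authors' implementation: the dictionary layout `labels[(i, j)]` and the function name `vote_fraction` are assumptions made for this example. It also shows the simplest possible aggregation, the fraction of workers voting 1 per instance, which is the idea behind majority voting (presumably the MV method in the results table).

```python
from typing import Dict, List, Tuple

# Hypothetical layout: labels[(i, j)] = y_ij in {0, 1}.
# A missing key means worker j did not label instance i.
Labels = Dict[Tuple[int, int], int]


def vote_fraction(labels: Labels, num_instances: int) -> List[float]:
    """For each instance i, return the fraction of its labels that are 1."""
    ones = [0] * num_instances
    total = [0] * num_instances
    for (i, _j), y in labels.items():
        ones[i] += y
        total[i] += 1
    # Instances without any label get 0.5 (no information either way).
    return [ones[i] / total[i] if total[i] else 0.5 for i in range(num_instances)]


# Three workers label two pictures ("bird or not").
crowd_labels: Labels = {
    (0, 0): 1, (0, 1): 1, (0, 2): 0,  # picture 0: two of three workers say "bird"
    (1, 0): 0, (1, 2): 0,             # picture 1: worker 1 skipped it
}
fractions = vote_fraction(crowd_labels, num_instances=2)
print(fractions)
```

Thresholding these fractions at 0.5 gives majority-vote labels. Unlike the model-based approach on the slide, this ignores the feature vectors x_i and treats all workers as equally reliable.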
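Existing Method
• Personal Classifier (PC) Method [Kajino+, 2012]
– Worker j is modeled by a personal classifier w_j (= w_0 + (noise))
– Prior distribution of the true classifier: N(w_0 | 0, η^{-1} I)
– Known: crowd labels y_i1, y_i2, y_i3; unknown: true classifier w_0 and personal classifiers w_1, w_2, w_3
Aggregate “personal classifiers” to obtain w_0
[Figure: the true classifier w_0, drawn from the prior distribution, gives rise to the personal classifiers w_1, w_2, w_3 of workers j = 1, 2, 3; each personal classifier w_j generates the crowd labels y_ij]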
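The generative story of the PC method can be sketched as a small sampler. This is an illustration under assumptions, not the authors' code: the noise on each personal classifier is taken to be Gaussian with precision `lam` (a symbol that does not appear on the slide), labels are drawn from a logistic model, and the function name `sample_pc_model` is hypothetical.

```python
import math
import random


def sigmoid(t: float) -> float:
    """Numerically stable logistic function."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def sample_pc_model(xs, num_workers, eta, lam, rng):
    """Sample (w0, personal classifiers, crowd labels) from the assumed model."""
    dim = len(xs[0])
    # True classifier from the prior N(0, eta^-1 I).
    w0 = [rng.gauss(0.0, eta ** -0.5) for _ in range(dim)]
    # Personal classifier w_j = w0 + noise (assumed Gaussian, precision lam).
    ws = [[w0[d] + rng.gauss(0.0, lam ** -0.5) for d in range(dim)]
          for _ in range(num_workers)]
    # Crowd label y_ij ~ Bernoulli(sigmoid(w_j . x_i)).
    labels = {}
    for j, wj in enumerate(ws):
        for i, x in enumerate(xs):
            p = sigmoid(sum(a * b for a, b in zip(wj, x)))
            labels[(i, j)] = 1 if rng.random() < p else 0
    return w0, ws, labels


rng = random.Random(0)
xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w0, ws, labels = sample_pc_model(xs, num_workers=4, eta=1.0, lam=10.0, rng=rng)
print(len(ws), len(labels))
```

A large `lam` keeps every worker close to the true classifier; a small `lam` lets workers deviate strongly, which is how the model expresses unreliable workers.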
Existing Method
• Parameter estimation = MAP estimation
– Parameters: w_0, W = {w_j}_{j=1}^{J}
– Solve the convex optimization problem:
min over w_0, W of (loss for PCs: logistic loss) + (model-relation term) + (prior term)
Parameter estimation = optimizing a convex function
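The three terms of the MAP objective can be written out as follows. This is a sketch under assumptions: the formula itself did not survive in the transcript, so the code takes the negative log-posterior of a model with prior N(w_0 | 0, η^{-1} I), Gaussian noise of precision `lam` on each personal classifier (the symbol `lam` is an assumption), and a logistic likelihood. The function names are hypothetical.

```python
import math


def log1pexp(t: float) -> float:
    """Numerically stable log(1 + exp(t))."""
    return t + math.log1p(math.exp(-t)) if t > 0 else math.log1p(math.exp(t))


def logistic_loss(y: int, score: float) -> float:
    # -[y log s(score) + (1 - y) log(1 - s(score))] = log(1 + exp(score)) - y * score
    return log1pexp(score) - y * score


def pc_objective(w0, ws, xs, labels, eta, lam):
    """Loss for personal classifiers + model-relation term + prior term."""
    loss = sum(
        logistic_loss(y, sum(a * b for a, b in zip(ws[j], xs[i])))
        for (i, j), y in labels.items()
    )
    # Model-relation term: keeps each w_j close to w0 (squared distance).
    relation = 0.5 * lam * sum(
        sum((a - b) ** 2 for a, b in zip(wj, w0)) for wj in ws
    )
    # Prior term from N(w0 | 0, eta^-1 I).
    prior = 0.5 * eta * sum(v * v for v in w0)
    return loss + relation + prior


value = pc_objective(
    w0=[1.0, 0.0], ws=[[2.0, 0.0]], xs=[[0.0, 0.0]],
    labels={(0, 0): 0}, eta=4.0, lam=2.0,
)
print(value)
```

Each term is convex in (w_0, W), so their sum is convex, which matches the slide's claim that a global optimum can be found.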
Existing Method: Discussion
• Personal Classifier (PC) Method [Kajino+, 2012]
#(parameters per worker) = D
Pros: global optimum
Cons: poor performance when there are few data per worker
• Clustered Personal Classifier (CPC) Method (proposed)
Pros: global optimum & moderate performance
Key: fuse similar workers to reduce the degrees of freedom
Estimation can be unstable for the PC method
Proposed Method: Idea
• Analysis of workers [Welinder+, 2010]
“Notice how the annotators’ decision planes fall roughly
into three clusters”
– Clustering workers is a reasonable idea
(phrase & picture are cited from “The multidimensional wisdom of crowds” by Welinder+, NIPS 2010)
Similarity between workers can be observed in real data
Proposed Method: Formulation
• Clustered Personal Classifier (CPC) Method
– The model-relation term finds and fuses similar workers, forcing w_j = w_k for similar workers j and k
→ Cuts down the degrees of freedom
(μ controls the strength of clustering)
– cf. the model-relation term of the PC method
Fuse similar workers to cut down the degrees of freedom
[Equations: model-relation terms of the CPC method and of the PC method]
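A fusing penalty of this kind can be sketched as below. The exact term on the slide is not preserved in the transcript, so this example assumes a pairwise penalty μ Σ_{j<k} ||w_j − w_k|| with an unsquared Euclidean norm, in the style of convex clustering. The function name `fusion_penalty` and the choice of which vectors to include (for example, whether w_0 is among them) are assumptions left to the caller.

```python
import math


def euclidean(u, v) -> float:
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def fusion_penalty(classifiers, mu: float) -> float:
    """mu * sum over pairs j < k of ||w_j - w_k|| (unsquared norm)."""
    total = 0.0
    for j in range(len(classifiers)):
        for k in range(j + 1, len(classifiers)):
            total += euclidean(classifiers[j], classifiers[k])
    return mu * total


# Workers 1 and 2 are already fused (identical), worker 0 is apart.
ws = [[0.0, 0.0], [3.0, 4.0], [3.0, 4.0]]
penalty = fusion_penalty(ws, mu=1.0)
print(penalty)
```

The design point is the unsquared norm: it is not differentiable where w_j = w_k, so, as with group-lasso penalties, the minimizer tends to make some pairs exactly equal. A squared distance would only shrink the differences without fusing them.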
Experiments on Synthetic Data
• Synthetic data (J = I = 10; workers are spammers (random workers) or experts)
(Left) Dimension = 2: PC method ≈ CPC method
(Right) Dimension = 10: CPC method outperforms PC method
Robust performance on a small data set
[Figure: two panels (left: dimension 2, right: dimension 10); x-axis: percentage of spammers; curves: proposed (CPC) vs. existing (PC)]
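A synthetic setting of this kind could be generated along the following lines. The slide only states that workers are either spammers (random workers) or experts, so everything else here is an assumption: experts are idealized as noise-free (they label with the true classifier), spammers answer uniformly at random, and the function name `make_synthetic_labels` is hypothetical.

```python
import random


def make_synthetic_labels(xs, w_true, num_workers, spammer_ratio, rng):
    """Return labels[(i, j)] for a mix of spammers and idealized experts."""
    num_spammers = round(spammer_ratio * num_workers)
    labels = {}
    for j in range(num_workers):
        is_spammer = j < num_spammers  # the first workers are the spammers
        for i, x in enumerate(xs):
            if is_spammer:
                y = rng.randint(0, 1)  # random answer, ignores the instance
            else:
                score = sum(a * b for a, b in zip(w_true, x))
                y = 1 if score > 0 else 0  # expert follows the true classifier
            labels[(i, j)] = y
    return labels


rng = random.Random(1)
dim, num_instances, num_workers = 2, 10, 10  # J = I = 10 as on the slide
w_true = [rng.gauss(0.0, 1.0) for _ in range(dim)]
xs = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_instances)]
labels = make_synthetic_labels(xs, w_true, num_workers, spammer_ratio=0.3, rng=rng)
print(len(labels))
```

Sweeping `spammer_ratio` from 0 to 1 reproduces the x-axis of the plots. With only 10 instances per worker and dimension 10, each personal classifier has as many parameters as labels, which is the small-data regime the slide refers to.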
Experiments on Real Data: Performance
• Performance test on real data [Finin+, 2010]
– NER task (is each word a named entity or not?)
– Dimension = 161,901, #(instances) = 17,747, #(workers) = 42
Proposed method achieves the best F-measure among the compared methods

Method                  Precision  Recall  F-measure
CPC method (proposed)   0.647      0.716   0.680
PC method (existing)    0.637      0.721   0.677
LC method (existing)    0.625      0.732   0.675
AOC method (existing)   0.680      0.670   0.675
MV method (existing)    0.686      0.651   0.668
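The F-measure column can be cross-checked against the precision and recall columns. The sketch below assumes that the F-measure is the harmonic mean of precision and recall (F1), which the slide does not state explicitly; the helper name `f_measure` is hypothetical.

```python
def f_measure(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (F1)."""
    return 2.0 * precision * recall / (precision + recall)


# (precision, recall, reported F-measure) as listed in the table.
reported = {
    "CPC": (0.647, 0.716, 0.680),
    "PC": (0.637, 0.721, 0.677),
    "LC": (0.625, 0.732, 0.675),
    "AOC": (0.680, 0.670, 0.675),
    "MV": (0.686, 0.651, 0.668),
}

differences = {}
for name, (p, r, f) in reported.items():
    recomputed = f_measure(p, r)
    differences[name] = abs(recomputed - f)
    print(f"{name}: reported {f:.3f}, recomputed {recomputed:.4f}")
```

The recomputed values agree with the reported ones to within about 0.001. The small gaps are consistent with precision and recall having been rounded to three decimals before being listed.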
Experiments on Real Data: Clustering
• Hierarchical clustering of workers is obtained by increasing μ
• An outlier worker can be detected without “honey pots”
Clustering result indicates the existence of an outlier worker
[Figure: hierarchical clustering of workers; x-axis: strength of clustering (= μ), increasing to the right; annotation: Precision: 0.454, Recall: 0.857]
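Reading clusters off a set of estimated personal classifiers can be sketched as follows. This is an illustrative assumption about post-processing, not a procedure stated on the slide: workers whose classifiers coincide up to a tolerance `tol` are placed in the same cluster using a small union-find, and repeating this for solutions at increasing μ would give the hierarchy. Solving the optimization problem itself is outside this sketch, and the names are hypothetical.

```python
import math


def clusters_from_classifiers(classifiers, tol: float = 1e-6):
    """Group worker indices whose classifiers are equal up to tol."""
    n = len(classifiers)
    parent = list(range(n))

    def find(a: int) -> int:
        # Find the representative of a, halving paths along the way.
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    for j in range(n):
        for k in range(j + 1, n):
            dist = math.sqrt(
                sum((a - b) ** 2 for a, b in zip(classifiers[j], classifiers[k]))
            )
            if dist <= tol:
                parent[find(k)] = find(j)  # fuse the two groups

    groups = {}
    for j in range(n):
        groups.setdefault(find(j), []).append(j)
    return sorted(groups.values())


# Workers 0 and 2 are fused; workers 1 and 3 stand alone.
ws = [[0.0, 0.0], [1.0, 1.0], [0.0, 0.0], [5.0, 5.0]]
clusters = clusters_from_classifiers(ws)
singletons = [c[0] for c in clusters if len(c) == 1]
print(clusters, singletons)
```

A worker that remains a singleton while the others have already merged is a candidate outlier, which is the kind of detection the slide describes as working without “honey pots”.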
Conclusion
• Problem Setting
– Learning from redundant, variable-quality training data
• Problem of the PC Method
– #(parameters) is relatively large
– Unstable when each worker provides only a small amount of data
• Proposed Method (CPC Method)
– Reduces the degrees of freedom by fusing similar workers
• Experimental Results
– More robust estimation on small data sets
– Valid as a method of “mining” workers
Introducing similarities between workers is beneficial
