CUbRIK research at CIKM 2012: Map to Humans and Reduce Error


Evidence of the research described in the publication "Map to Humans and Reduce Error – Crowdsourcing for Deduplication Applied to Digital Libraries"


Map to Humans and Reduce Error - Crowdsourcing for Deduplication Applied to Digital Libraries
Mihai Georgescu, Dang Duc Pham, Claudiu S. Firan, Julien Gaugaz, Wolfgang Nejdl

  • Find duplicate entities based on metadata
  • Focus on scientific publications in the Freesearch system
  • An automatic method and human labelers work together towards improving their performance at identifying duplicate entities
  • Actively learn how to deduplicate from the crowd by optimizing the parameters of the automatic method
  • MTurk HITs to get labeled data, while tackling the quality issues of the crowdsourced work

Crowdsourcing

Example pair of publication records shown to a worker:

Publication 1
  Title: Comparing Heuristic, Evolutionary and Local Search Approaches to Scheduling
  Authors: Soraya B. Rana, Adele E. Howe, L. Darrell Whitley, Keith E. Mathias
  Venue: Proceedings of the Third International Conference on Artificial Intelligence Planning Systems, Menlo Park, CA
  Publisher: The AAAI Press
  Year: 1996
  Language: English
  Type: conference (inproceedings)
  Abstract: The choice of search algorithm can play a vital role in the success of a scheduling application. In this paper, we investigate the contribution of search algorithms in solving a real-world warehouse scheduling problem. We compare performance of three types of scheduling algorithms: heuristic, genetic algorithms and local search.

Publication 2
  Title: Comparing Heuristic, Evolutionary and Local Search Approaches to Scheduling.
  Authors: Soraya Rana, Adele E. Howe, L. Darrell, Whitley Keith Mathias
  Book: AIPS
  Pg. 174-181
  Year: 1996
  Language: English
  Type: conference

Question asked in the HIT: "After carefully reviewing the publications metadata presented to you, how would you classify the 2 publications referred:"
Judgment for publications pair:
  o Duplicates
  o Not Duplicates

  • 1 HIT = 5 pairs
  • 5ct / HIT
  • 3 -> 5 assignments

Crowd Soft Decision

  • Aggregation of all individual votes W_{i,j}(k) ∈ {-1, 1}
  • CSD ∈ [0, 1]

$$\mathit{CSD}_{i,j} = \frac{1}{2} + \frac{\sum_{k \in W_{i,j}} \mathit{weight}_{i,j}(k)\, W_{i,j}(k)}{2 \sum_{v \in W_{i,j}} c_v}, \qquad \mathit{weight}_{i,j}(k) = c_k$$

Workflow (flow diagram)

  • Initial DSParams and threshold
  • Automatic method: identify pairs with ADS = threshold ± ε; sample them and add them to P_cand
  • Get crowd labels for P_cand
  • Compute crowd decisions and worker confidences
  • Identify pairs with high confidence => P_train
  • P_cand = P_cand - P_train
  • Optimize DSParams and threshold to fit the crowd decision data in P_train => better DSParams and threshold
  • Identify duplicate pairs from P_train => P_dupl
  • P_cand = ∅ => P_dupl

Automatic Method

  • DuplicatesScorer produces an ADS
  • DSParams = {(fieldName, fieldWeight)} and threshold
  • Compare ADS to threshold => AD ∈ {1, 0}

Crowd Decision

  • The aggregated decision from all workers for a pair produces a CSD
  • A worker's contribution to the CSD is proportional to the confidence c_k we have in that worker
  • Compare CSD to 0.5 => CD ∈ {1, 0}

Worker Confidence

  • Assess how reliable the individual workers are when compared to the overall performance of the crowd
  • Simple measure: proportion of pairs that have the same label as the one assigned by the crowd
  • Use an EM algorithm to iteratively compute the worker confidence:
      o Compute CSD
      o Update c_k

Crowd Decision Strategies

  • MV: Majority Voting; all users are equal, c_k = 1
  • Iter: c_k computed using the EM algorithm
  • Boost: c_k computed using the EM algorithm, using boosted weights in the computation of CSD
  • Heur: Heuristic 3/3 or 4/5

Duplicate Detection Strategies

  • Just signatures
      o sign
  • Just the DuplicatesScorer
      o DS/m
      o DS/o
  • First compute signatures and then base the decision on the DuplicatesScorer
      o sign + DS/m
      o sign + DS/o
  • Directly use the Crowd Decision obtained via Majority Voting
      o CD-MV

       sign   sign+DS/m   sign+DS/o   DS/m   DS/o   CD-MV
  P    0.95   0.95        1.00        0.48   0.66   0.63
  R    0.20   0.20        0.20        0.67   0.56   0.97
  A    0.77   0.77        0.77        0.70   0.79   0.83

Crowd Decision and Optimization Strategies

  • Accuracy: compare CD to AD and optimize DSParams and threshold to maximize Accuracy
  • Compare ADS to CSD and optimize DSParams:
      o Sum-Err: minimize the sum of errors
      o Sum-log-err: minimize the sum of log of errors
      o Pearson: maximize the Pearson correlation
  • Compare CD to AD and optimize threshold to maximize Accuracy

                          Crowd Decision Strategies
                          3 workers   5 workers
  Optimization strategy   MV          MV      Iter    Manual   Boost   Heur
  Accuracy                79.19       80.00   79.73   80.00    78.92   79.73
  Sum-Err                 76.49       79.46   79.46   79.46    79.46   79.19
  Sum-log-err             71.89       78.11   78.38   78.92    80.27   76.76
  Pearson                 73.24       79.46   79.46   80.54    79.46   81.08

Experiment Setup

  • 3 batches:
      o 60 HITs with qualification test
      o 60 HITs without qualification test
      o 120 HITs without qualification test

Contact: Mihai Georgescu
email: dblp.kbs.uni-hannover.de
L3S Research Center / Leibniz Universität Hannover
Appelstrasse 4, 30167 Hannover, Germany
phone: +49 511 762-19715
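The crowd soft decision and the iterative worker confidence described on the poster can be illustrated with a minimal Python sketch. It rests on several assumptions that the poster does not confirm: the CSD is taken to be the confidence-weighted average of the votes rescaled to [0, 1], the confidence update uses the poster's "simple measure" (the share of pairs on which a worker agrees with the crowd), ties at 0.5 count as "not duplicate", and the function names `crowd_soft_decision` and `iterate_confidences` are hypothetical. The update rule actually used in the paper's EM algorithm may differ.

```python
from collections import defaultdict


def crowd_soft_decision(votes, conf):
    """Aggregate the votes for one pair into a soft decision in [0, 1].

    votes: dict worker -> vote, +1 (duplicate) or -1 (not duplicate)
    conf:  dict worker -> confidence c_k (assumed non-negative)
    """
    total = sum(conf[k] for k in votes)
    if total == 0:
        return 0.5  # assumption: no trusted worker means an undecided pair
    weighted = sum(conf[k] * w for k, w in votes.items())
    return 0.5 + weighted / (2.0 * total)


def iterate_confidences(pair_votes, rounds=10):
    """Alternate between computing CSD and updating c_k.

    pair_votes: dict pair -> dict worker -> vote
    Returns (conf, csd) after the given number of rounds.
    """
    workers = {k for votes in pair_votes.values() for k in votes}
    conf = {k: 1.0 for k in workers}  # start from majority voting, c_k = 1
    for _ in range(rounds):
        csd = {p: crowd_soft_decision(v, conf) for p, v in pair_votes.items()}
        agree, seen = defaultdict(int), defaultdict(int)
        for p, votes in pair_votes.items():
            # Assumption: a tie (CSD == 0.5) is treated as "not duplicate".
            crowd_label = 1 if csd[p] > 0.5 else -1
            for k, w in votes.items():
                seen[k] += 1
                agree[k] += int(w == crowd_label)
        # Simple measure: proportion of pairs labeled like the crowd.
        conf = {k: agree[k] / seen[k] for k in workers}
    # Recompute the soft decisions with the final confidences.
    csd = {p: crowd_soft_decision(v, conf) for p, v in pair_votes.items()}
    return conf, csd
```

With `rounds=0` the confidences stay at 1 and the result reduces to plain majority voting, which corresponds to the MV strategy.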
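The step "compare CD to AD and optimize threshold to maximize Accuracy" can also be sketched in a few lines. This is only an illustration under stated assumptions: ADS values are taken to lie in [0, 1], a pair counts as a duplicate when its ADS is at or above the threshold, and a plain grid search stands in for whatever optimizer the authors used. The names `automatic_decision`, `accuracy` and `best_threshold` are hypothetical.

```python
def automatic_decision(ads, threshold):
    """AD in {1, 0}: assumption that ADS >= threshold means 'duplicate'."""
    return 1 if ads >= threshold else 0


def accuracy(ads_scores, crowd_decisions, threshold):
    """Share of pairs where the automatic decision matches the crowd decision."""
    hits = sum(
        automatic_decision(s, threshold) == cd
        for s, cd in zip(ads_scores, crowd_decisions)
    )
    return hits / len(crowd_decisions)


def best_threshold(ads_scores, crowd_decisions, steps=100):
    """Grid search over [0, 1]; returns the lowest threshold with top accuracy."""
    candidates = [i / steps for i in range(steps + 1)]
    return max(candidates, key=lambda t: accuracy(ads_scores, crowd_decisions, t))
```

Optimizing DSParams as well would mean searching over the field weights that produce the ADS values, which this sketch leaves out.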