Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

CUbRIK research at CIKM 2012: Map to Humans and Reduce Error

295 views

Published on

evidence of research described in "Map to Humans and Reduce Error – Crowdsourcing for
Deduplication Applied to Digital Libraries" publication

  • Be the first to comment

  • Be the first to like this

CUbRIK research at CIKM 2012: Map to Humans and Reduce Error

  1. 1. Map to Humans and Reduce Error - Crowdsourcing for Deduplication Applied to Digital Libraries Mihai Georgescu, Dang Duc Pham, Claudiu S. Firan, Julien Gaugaz, Wolfgang Nejdl [Show Diff] [Full Text] [Show Diff] • Find duplicate entities based on metadata Crowdsourcing: Title: Comparing H euris tic, Evolutionary and Local Search Approaches to Scheduling Title: Comparing H euris tic, Evolutionary and Local Search Approaches to Scheduling. Crowd Soft Decision Authors : Soraya B. Rana, Adele E. H owe, L. Darrell W hitley, Keith E. Mathias • Focus on scientific publications in the Freesearch system Authors : Soraya Rana, Adele E. H owe, L. Darrell, W hitley Keith Mathias Venue: Proceedings of the Third International Conference on Artificial Intelligence Planning Sys tems , Menlo Park, CA Book: AIPS Pg. 174-181 [Contents ] Year: 1996 Aggregation of all individual votes Wi,j(k)ϵ{-1,1} Publis her: The AAAI Pres s Year: 1996 Language: Englis h Type: conference (inproceedings ) CSD ϵ{0,1} Language: Englis h • An automatic method and human labelers work together Type: conference After carefully reviewing the publications metadata pres ented to you, how would you 1 HIT = 5 Pairs towards improving their performance at identifying 5ct / HIT Abs tract: The choice of s earch algorithm can play a vital role in the s ucces s of a scheduling application. In this paper, we inves tigate the contribution of s earch algorithms in s olving a clas s ify the 2 publications referred: Judgment for publications pair: 1  weight i, j (k )Wi , j (k ) weight i , j ( k )  ck  real-world warehous e s cheduling problem. W e compare performance of three types of 3 ->5 Assignments kWi , j s cheduling algorithms : heuris tic, genetic algorithms and local s earch. o Duplicates CSDi , j  cv duplicate entities o Not Duplicates 2 vWi , j • Actively learn how to deduplicate from the crowd by optimizing the parameters of the automatic method Compute crowd Get Crowd decisions and worker Worker Confidence • MTurk HITs to get labeled data, while tackling the quality Labels for P cand issues of the crowdsourced work confidences • Asses how reliable are the individual workers when compared to the overall performance of the crowd • Simple measure: proportion of pairs that have the Identify pairs with High confidence same label as the one assigned by the crowd Automatic Method ADS = threshold±ε pairs => P train • Use an EM algorithm to iteratively compute the Sample and add to P cand = P cand - P train • DuplicatesScorer produces an ADS worker confidence P cand • DSParams={(fieldName, fieldWeight)} and threshold • Compute CSD • Compare ADS to threshold => ADϵ{1,0} • Update c k Identify duplicate Optimize DSParams and threshold to fit to the Crowd Decision Strategies: pairs from P train, P dupl Crowd Decision data in P train • MV: Majority Voting; All users are equal c k=1 • Iter: c k computed using the EM algorithm • Aggregated decision from all workers for a pair produces • Boost: c k computed using the EM algorithm using a CSD Initial Better boosted weights in the computation of CSD • Worker contribution to the CSDis proportional to the DSParams, DSParams, • Heur: Heuristic 3/3 or 4/5 confidence c k we have in him Threshold Threshold • Compare CDS to 0.5 => CDϵ{1,0} P cand = φ P dupl Duplicate Detection Strategies Crowd Decision and Optimization Strategies Experiment Setup 1.00 • 3 Batches : Compare CD to AD and optimize DSParams and 0.80 o 60 HITs with qualification test 0.60 o 60 HITs without qualification test threshold to maximize Accuracy •Just signatures 0.40 o 120 HITs without qualification test • Sign 0.20 Crowd Decision Strategies P •Just the DuplicatesScorer - A Compare ADS to CSD and optimize DSParams • DS/m 3 workers 5 workers • DS/o s ign s ign+DS/ m s ign+DS/ o R Optimization •minimize the sum of errors DS/ m strategies MV MV Iter Manual Boost Heur •First compute signatures and then base •minimize the sum of log of errors DS/ o CD-MV decision on DuplicatesScorer Accuracy 79.19 80.00 79.73 80.00 78.92 79.73 • sign + DS/m •maximize the Pearson correlation • sign + DS/o Sum-Err 76.49 79.46 79.46 79.46 79.46 79.19 sign sign+DS/m sign+DS/o DS/m DS/o CD-MV Compare CD to AD and optimize threshold to •Directly use Crowd Decision obtained via Majority Voting CD-MV R 0.20 0.20 0.20 0.67 0.56 0.97 Sum-log-err 71.89 78.11 78.38 78.92 80.27 76.76 maximize Accuracy A 0.77 0.77 0.77 0.70 0.79 0.83 Pearson 73.24 79.46 79.46 80.54 79.46 81.08 P 0.95 0.95 1.00 0.48 0.66 0.63Contact: Mihai Georgescuemail: georgescu@L3S.de dblp.kbs.uni-hannover.deL3S Research Center / Leibniz Universität HannoverAppelstrasse 4, 30167 Hannover, Germanyphone: +49 511 762-19715 www.cubrikproject.eu

×