iDigBio is funded by a grant from the National Science Foundation’s Advancing Digitization of Biodiversity Collections Program (Cooperative Agreement EF-1115210). Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.
Reaching Consensus in Crowdsourced Transcription of Biocollections Information
Andréa Matsunaga (ammatsun@ufl.edu), Austin Mast, and José A.B. Fortes 
10th IEEE International Conference on e-Science 
October 23, 2014 
Guarujá, SP, Brazil
2 
Background and Motivation 
•An estimated 1 billion biological specimens in the US
•iDigBio + Thematic Collections Network (TCN) + Partners to Existing Network (PEN)
3 
Background and Motivation 
•TCNs and PENs performing digitization work 
•Generating images, transcribing information about what/when/where/who 
•If digitizing each specimen took 1 second and the work were performed in sequence:
–1,000,000,000 seconds > 30 years (about 31.7 years)
•Parallelism: 
–Crowdsourcing!!
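The back-of-the-envelope estimate above can be checked with a few lines of Python. This is only an illustrative sketch: the function names and the use of a 365.25-day year are assumptions, not something stated on the slides.

```python
# Assumption: a 365.25-day year is used for the conversion.
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60


def sequential_years(specimens: int, seconds_each: float = 1.0) -> float:
    """Years needed to digitize all specimens one after another."""
    return specimens * seconds_each / SECONDS_PER_YEAR


def parallel_years(specimens: int, workers: int, seconds_each: float = 1.0) -> float:
    """Years needed if the same work is split evenly across workers."""
    return sequential_years(specimens, seconds_each) / workers


if __name__ == "__main__":
    print(f"{sequential_years(1_000_000_000):.1f} years in sequence")
    print(f"{parallel_years(1_000_000_000, 1_000) * 365.25:.1f} days with 1,000 workers")
```

The even split across workers is an idealization; it ignores skips, empty tasks, and the redundancy that consensus requires.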
4 
Crowdsourcing Transcription Projects 
•Notes from Nature (http://www.notesfromnature.org/)
–Zooniverse platform
•Once the user selects the region with the label, s/he can start transcribing and parsing information into a number of pre-defined fields
•For a requester, a pre-defined number of transcriptions is returned
5 
Crowdsourcing Transcription Projects 
•ALA (http://volunteer.ala.org.au/) 
–Platform: Grails 
•User zooms in to read the label and parses it into the custom pre-defined terms
•Single worker followed by expert approval
6 
Crowdsourcing Transcription Projects 
•Symbiota (http://lbcc1.acis.ufl.edu/volunteer)
–Platform: PHP 
•Ability to OCR and parse data 
•Single worker followed by expert approval
7 
The transcription task 
Location 
Scientific Name 
Scientific Author 
Collected by 
Habitat and description 
County 
Collector Number 
Collection Date 
State/Province
8 
Proposed Consensus Approach 
•Goal: 
–Reach consensus with minimum number of transcriptions 
•Method: 
–Control the number of workers per task 
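–Apply lossless and/or lossy algorithms per field
[Flowchart: a volunteer picks task ti from the transcription task queue and submits a transcription ti’, which is compared with past transcriptions. Lossless algorithms are applied first; if consensus is reached, the solution is output. If not, lossy algorithms are applied; if consensus is then reached, the solution is output, otherwise the task stays open for further transcriptions.]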
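The control flow of the proposed approach might be sketched as follows. This is a minimal illustration, not the authors' implementation: `run_controller`, `majority`, and the two normalizer callables are hypothetical names, and the sketch assumes that a single answer on its own is never treated as consensus.

```python
from collections import Counter
from typing import Callable, Iterable, List, Optional, Tuple


def majority(answers: List[str], key: Callable[[str], str]) -> Optional[str]:
    """Return the normalized answer held by more than half of the votes, else None."""
    groups = Counter(key(a) for a in answers)
    top, votes = groups.most_common(1)[0]
    return top if votes > len(answers) / 2 else None


def run_controller(
    transcriptions: Iterable[str],
    lossless: Callable[[str], str],
    lossy: Callable[[str], str],
) -> Tuple[Optional[str], int]:
    """Consume transcriptions one at a time and stop at the first consensus.

    Returns (normalized consensus answer or None, number of transcriptions used).
    """
    past: List[str] = []
    for t in transcriptions:
        past.append(t)
        if len(past) < 2:
            continue  # assumption: one answer alone is not a consensus
        winner = majority(past, lossless)
        if winner is None:
            # Fall back to the lossy comparison only when lossless fails.
            winner = majority(past, lambda a: lossy(lossless(a)))
        if winner is not None:
            return winner, len(past)
    return None, len(past)


if __name__ == "__main__":
    answers = ["Alachua  County", "alachua county", "Alachua", "Alachua County"]
    result = run_controller(answers, lambda s: " ".join(s.split()), str.lower)
    print(result)  # ('alachua county', 2)
```

In this toy run the controller stops after two transcriptions, so the remaining two would never need to be requested.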
9 
Lossless normalization algorithms 
Code: Lossless functionality
b: Removes all extra whitespace
n: Applies specific transformation functions on a per-field basis (e.g., to normalize section/township/range, proper names, and latitude/longitude)
t: Applies specific translation tables on a per-field basis to expand abbreviations (e.g., hy, hwy, and hiway to highway) or to shorten expansions (e.g., Florida to FL)
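As a rough illustration of codes b and t, the sketch below collapses whitespace and expands abbreviations from a translation table. The table contents beyond the slide's own highway example, the function names, and the word-matching regex are assumptions; the per-field transformations of code n are omitted because the slides do not specify them.

```python
import re

# Hypothetical translation table; the real per-field tables are not shown on the slides.
ABBREVIATIONS = {"hy": "highway", "hwy": "highway", "hiway": "highway"}


def collapse_whitespace(text: str) -> str:
    """Code b: trim the ends and reduce runs of whitespace to a single space."""
    return " ".join(text.split())


def expand_abbreviations(text: str, table=ABBREVIATIONS) -> str:
    """Code t: replace whole alphabetic words found in the translation table."""
    def replace(match: re.Match) -> str:
        word = match.group(0)
        return table.get(word.lower(), word)

    return re.sub(r"[A-Za-z]+", replace, text)


def normalize_lossless(text: str) -> str:
    """Apply b, then t."""
    return expand_abbreviations(collapse_whitespace(text))


if __name__ == "__main__":
    print(normalize_lossless("  5 mi  N of Hwy 27 "))  # 5 mi N of highway 27
```

Matching whole words only keeps strings such as "hydric" untouched even though they start with an abbreviation.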
10 
Lossy normalization algorithms
Code: Lossy functionality
w: Approximate comparison by ignoring all whitespace (e.g., “0–3” is equivalent to “0 –3”)
c: Case-insensitive approximate comparison (e.g., “Road” and “road” are considered equivalent)
s: Consider two sequences equivalent when one is a substring of the other or one sequence contains all words from the other sequence
p: Punctuation-insensitive approximate comparison (e.g., sentences separated by a comma, semicolon, or period are considered equivalent)
f: Approximate fingerprint comparison ignoring the order of words in sentences
l: Approximate equivalency when sequences have a Levenshtein distance within a configurable threshold (l2 indicates a maximum distance of 2)
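The lossy codes could be approximated as below. This is a sketch under stated assumptions: the function names are hypothetical, punctuation handling (code p) is simplified to deleting ASCII punctuation, the fingerprint (code f) is simplified to sorting words, and the Levenshtein distance is a plain dynamic-programming version rather than whatever the original code uses.

```python
import string


def key_w(s: str) -> str:
    """Code w: ignore all whitespace."""
    return "".join(s.split())


def key_c(s: str) -> str:
    """Code c: ignore letter case."""
    return s.lower()


def key_p(s: str) -> str:
    """Code p (simplified): drop ASCII punctuation."""
    return s.translate(str.maketrans("", "", string.punctuation))


def key_f(s: str) -> str:
    """Code f (simplified): fingerprint that ignores word order."""
    return " ".join(sorted(s.split()))


def match_s(a: str, b: str) -> bool:
    """Code s: substring match, or one answer contains all words of the other."""
    if not a or not b:
        return a == b  # assumption: an empty answer only matches an empty answer
    words_a, words_b = set(a.split()), set(b.split())
    return a in b or b in a or words_a <= words_b or words_b <= words_a


def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def match_l(a: str, b: str, threshold: int = 2) -> bool:
    """Code l: equivalent when the edit distance is within the threshold (l2 -> 2)."""
    return levenshtein(a, b) <= threshold
```

Keys (w, c, p, f) can be chained before grouping answers, whereas s and l are pairwise tests and so need a grouping step that compares answers against each other.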
11 
Alternative voting and consensus output 
Code: Voting and consensus
v: Consensus is reached when there is a single group (set of matching answers) that has the most votes, instead of requiring a strict majority vote among all answers
a: Outputs the best available answers when consensus is not achieved
Example with n=4:
Majority voting requires $\frac{n}{2} + 1 = 3$ matching answers →Consensus not reached
v: Blue set has most votes →Consensus reached
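The difference between strict majority and the alternative voting rule (code v) can be illustrated with a small sketch. The function names are hypothetical, integer division is assumed for the majority threshold, and the 2/1/1 vote split in the usage example is an assumed reading of the n=4 illustration, not data from the slides.

```python
from collections import Counter
from typing import Optional


def majority_threshold(n: int) -> int:
    """Votes needed for a strict majority: n/2 + 1, using integer division."""
    return n // 2 + 1


def strict_majority(groups: Counter) -> Optional[str]:
    """Return the group holding a strict majority of all votes, else None."""
    if not groups:
        return None
    label, votes = groups.most_common(1)[0]
    return label if votes >= majority_threshold(sum(groups.values())) else None


def plurality(groups: Counter) -> Optional[str]:
    """Code v: return the single group with the most votes; ties give None."""
    ranked = groups.most_common(2)
    if not ranked:
        return None
    if len(ranked) == 1 or ranked[0][1] > ranked[1][1]:
        return ranked[0][0]
    return None


def best_available(groups: Counter) -> Optional[str]:
    """Code a: output the top-ranked group even when no consensus exists."""
    return groups.most_common(1)[0][0] if groups else None


if __name__ == "__main__":
    votes = Counter({"blue": 2, "red": 1, "green": 1})  # assumed split for n=4
    print(strict_majority(votes))  # None
    print(plurality(votes))        # blue
```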
12 
Experimental Setup 
•Notes from Nature 
•Herbarium specimens from a single institution 
•Configured to require 10 workers per task, which yielded a close-to-linear distribution due to empty tasks and skips
•23,557 total transcriptions completed by at least 1,089 distinct workers
[Figure: Transcriptions Performed by Individual Workers (axes: Distinct Worker ID, Transcriptions per Worker). Total of 23,557 transcriptions completed; 9,950 anonymous transcriptions; 1,089 distinct known workers; 255 single-transcription workers; top worker: 381 transcriptions.]
[Figure: Worker Distribution Per Task (axis: Workers per Task). Share of tasks with 1 to 11 workers: 2.00%, 4.42%, 6.99%, 8.76%, 11.10%, 10.75%, 11.87%, 11.95%, 15.43%, 16.63%, 0.11%.]
Field: Number of unique values
Country: 39
State/Province: 288
County: 655
Scientific name: 5,941
Scientific author: 4,088
Location: 16,161
Habitat: 15,134
Collected by: 3,380
Collector Number: 3,665
Collection date: 2,287
13 
[Figure: Consensus Achievement Varying Algorithms (axis: Consensus Achievement (%)); algorithm combinations: nf, bnt, bntwcsp, bntwcspf, bntwcspfl2, bntwcspfl2v; data labels: 1.8%, 53.2%, 61.4%, 84.2%]
Overall Performance 
•Full consensus improvement from 1.8% to 84.2% 
•Confirms the intuition that country, state/province, county, collector number, and collection date are “easy”
•Lossless algorithms have a small impact except for scientific author and collected by
•Being insensitive to whitespace, punctuation, and letter case, as well as considering substrings, provides the greatest improvement when lossy algorithms are included for “difficult” fields
14 
Additional Verification 
•Consensus reached mainly with lossless algorithms 
•Low percentage of blank responses
15 
Consensus Accuracy 
•300 labels were transcribed by an expert 
•Expert had access to information across labels that workers did not have 
•Effect on the overall accuracy is minimal 
–0.9% drop for accepting cases that did not reach consensus 
–2.3% drop for minimizing the needed workforce 
[Figure: Consensus Accuracy (Consensus Needed vs. Accept Best); axis: Character Score (%); labels above the 90% threshold: 92.0% (Consensus Needed), 91.1% (Accept Best)]
[Figure: Consensus Accuracy (Min. Workers vs. All Workers); axis: Character Score; labels above the 90% threshold: 92.0% (All Workers), 89.7% (Min. Workers)]
Character score: $\frac{2 \cdot \mathit{matches}}{\mathit{len}(s_1) + \mathit{len}(s_2)}$
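The character score formula has the same form as the similarity ratio computed by Python's standard `difflib.SequenceMatcher`. The sketch below spells the formula out using that module; whether the original study used `difflib` is an assumption, and `character_score` is a hypothetical name.

```python
from difflib import SequenceMatcher


def character_score(s1: str, s2: str) -> float:
    """Return 2 * matches / (len(s1) + len(s2)), a value between 0 and 1."""
    if not s1 and not s2:
        return 1.0  # two empty strings are treated as identical
    matcher = SequenceMatcher(None, s1, s2, autojunk=False)
    # Sum the sizes of all matching blocks to get the matched character count.
    matches = sum(block.size for block in matcher.get_matching_blocks())
    return 2 * matches / (len(s1) + len(s2))


if __name__ == "__main__":
    print(character_score("abcd", "bcde"))  # 0.75
```

For "abcd" and "bcde" the only matching block is "bcd", so the score is 2 * 3 / (4 + 4) = 0.75; a consensus answer would count as above the 90% threshold when its score against the expert transcription exceeds 0.9.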
16 
Workforce Savings 
•Can be as high as 55.8% for the distribution in the studied dataset 
•Good for a fixed setting: 3 workers 
•Controller advantage: 3 workers is just a good average, and our results show that there are cases where up to 9 workers were needed to reach overall consensus 
[Figure: Workforce Savings; axis: Workers per Task (2 to 11); series: Original Number of Workers, Minimum Number of Workers. Minimum Number of Workers data labels for 2 to 11 workers per task: 2.00, 2.75, 2.93, 3.11, 3.16, 3.30, 3.29, 3.25, 3.20, 3.00. 55.8% of jobs saved with controller.]
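One plausible way to express workforce savings is the fraction of collected transcriptions that were not needed to reach consensus. The sketch below assumes that definition; the function name is hypothetical and the task list is toy data, not the studied dataset.

```python
from typing import List, Tuple


def workforce_savings(tasks: List[Tuple[int, int]]) -> float:
    """Fraction of transcription jobs saved.

    Each task is a pair (original_workers, minimum_workers), where
    minimum_workers is the count at which consensus was first reached.
    """
    original = sum(orig for orig, _ in tasks)
    minimum = sum(least for _, least in tasks)
    return 1 - minimum / original


if __name__ == "__main__":
    toy_tasks = [(10, 3), (8, 2), (5, 3), (9, 4)]  # toy data for illustration only
    print(f"{workforce_savings(toy_tasks):.1%} of jobs saved")  # 62.5% of jobs saved
```

With a fixed setting of 3 workers, the last toy task would end without consensus, which is the case the controller is meant to handle.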
17 
Task Design Improvement recommendations 
•Restricted user interface improves consensus and accuracy 
–Caution to not restrict valid scenarios (e.g., partial dates, range of dates) 
–Broadly defined fields could be engineered to capture more parsed data (e.g., lat/long, TRS) 
•Exploring relationships between tasks 
–Enter collector number and collection date first 
–Update related record to have the same information 
•Additional training 
–Problems are pronounced in separating scientific names from their authorship
18 
Additional Improvements to Consensus Algorithms 
•Code is modular and open; thus, opportunity for: 
–Custom dictionaries (general dictionaries led to a high number of false positives due to the number of abbreviations and names)
–Scientific name parsers
–External contributions
–https://github.com/idigbio-citsci-hackathon/CrowdConsensus
•Merge matched outputs after lossy algorithms are applied
–R. E. Perdue, Jr. and K. Blum 
–Re Perdue Jr and K. Blum 
–R. R. Perdue Jr, K. Blum 
•Additional validation across fields (consistency) 
•Apply consensus controller on a per-field basis
19 
Recommendations Beyond Crowdsourcing 
•Leveraging and improving: 
–Optical Character Recognition (OCR) 
–Natural Language Processing (NLP) 
•2-way street scenarios: 
–Use crowdsourcing to select clean text for OCR 
–Use even poor OCR to guide tasks to the right crowd by creating clusters of tasks 
–Use NLP to parse verbatim data from the crowd 
–Improve NLP and OCR training with additional data from the crowd
20 
Additional parsed data 
•PhyloJIVE: http://phylojive.acis.ufl.edu
www.idigbio.org 
facebook.com/iDigBio 
twitter.com/iDigBio 
vimeo.com/idigbio 
idigbio.org/rss-feed.xml 
webcal://www.idigbio.org/events-calendar/export.ics 
Questions? 
Thank you!

Matsunaga crowdsourcing IEEE e-science 2014

  • 1. iDigBio is funded by a grant from the National Science Foundation’s Advancing Digitization of Biodiversity Collections Program (Cooperative Agreement EF-1115210). Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation. Reaching Consensus in Crowdsourced Transcription of BiocollectionsInformation Andréa Matsunaga (ammatsun@ufl.edu), Austin Mast, and José A.B. Fortes 10th IEEE International Conference on e-Science October 23, 2014 Guarujá, SP, Brazil
  • 2. 2 Background and Motivation •Estimated 1-billion biological specimens in the US •iDigBio+ Thematic Collections Network (TCN) + Partners to Existing Network (PEN)
  • 3. 3 Background and Motivation •TCNs and PENs performing digitization work •Generating images, transcribing information about what/when/where/who •If digitization took 1 second and if we performed in sequence: –1,000,000,000 seconds > 30 years •Parallelism: –Crowdsourcing!!
  • 4. 4 Crowdsourcing Transcription Projects •NotesFromNature(http://www.notesfromnature.org/) –Zooniverseplatform •Once the user selects the region with the label, s/he can start transcribing and parsing information to a number of pre- defined fields •For a requester, a pre-defined number of transcriptions are returned
  • 5. 5 Crowdsourcing Transcription Projects •ALA (http://volunteer.ala.org.au/) –Platform: Grails •User zooms in to read the label and parse to the custom pre- defined terms •Single worker followed by expert approval
  • 6. 6 Crowdsourcing Transcription Projects •Symbiota(http://lbcc1.acis.ufl.edu/volunteer) –Platform: PHP •Ability to OCR and parse data •Single worker followed by expert approval
  • 7. 7 The transcription task Location Scientific Name Scientific Author Collected by Habitat and description County Collector Number Collection Date State/Province
  • 8. 8 Proposed Consensus Approach •Goal: –Reach consensus with minimum number of transcriptions •Method: –Control the number of workers per task –Apply lossless and/or lossyalgorithms per field Transcription task Queue Crowdsourcing Volunteer’s Transcription Pick task ti ti’ Past Transcriptions Lossless algorithms No Lossyalgorithms Consensus reached? Yes No Consensus reached? Yes Solution
  • 9. 9 Lossless normalization algorithms Code Lossless functionality b Removes all extra whitespaces n Apply specific transformation functions on a per-field basis (e.g., to normalize section/township/range, proper names, and latitude/longitude) t Apply specific translation tables on a per-field basis to expand abbreviations (e.g., hy, hwy, and hiwayto highway) or to shorten expansions (e.g., Florida to FL)
  • 10. 10 Lossynormalization algorithms Code Lossyfunctionality w Approximate comparison by ignoring all whitespace (e.g., “0–3” is equivalent to “0 –3”) c Case insensitive approximate comparison (e.g., “Road” and “road” are considered equivalent) s Consider two sequences equivalent when one is a substring of another or one sequence contains all words from the other sequence p Punctuation insensitive approximate comparison (e.g., separation of sentences with comma, semi-colon or period are considered equivalent) f Approximate fingerprint comparison ignoring the order of words in sentences l Approximate equivalency when sequences have Levenshteindistance within a configurable threshold (l2 indicates a maximum distance of 2)
  • 11. Alternative voting and consensus output
    Code: Voting and consensus
    – v: Consensus is reached when there is a single group (set of matching answers) that has the most votes, instead of requiring a strict majority vote among all answers
    – a: Outputs the best available answers when consensus is not achieved
    • Example with n = 4 answers: majority voting requires $n/2 + 1 = 3$ matching answers → consensus not reached; with v, the blue set has the most votes → consensus reached
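The difference between strict majority voting and code v can be shown with a short sketch. The function names are hypothetical, the answers are assumed to be already normalized, and integer division is assumed for the n/2 + 1 rule.

```python
from collections import Counter
from typing import List, Optional


def majority_consensus(answers: List[str]) -> Optional[str]:
    """Strict majority: the top group needs at least n // 2 + 1 votes."""
    if not answers:
        return None
    top, votes = Counter(answers).most_common(1)[0]
    return top if votes >= len(answers) // 2 + 1 else None


def plurality_consensus(answers: List[str]) -> Optional[str]:
    """Code v: a single group with the most votes wins; a tie for first place does not."""
    ranked = Counter(answers).most_common(2)
    if not ranked:
        return None
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return None
    return ranked[0][0]


if __name__ == "__main__":
    votes = ["blue", "blue", "red", "green"]  # n = 4, as on the slide
    print(majority_consensus(votes))   # needs 3 matching answers
    print(plurality_consensus(votes))  # blue is the only largest group
```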
  • 12. Experimental Setup
    • Notes from Nature
    • Herbarium specimens from a single institution
    • Configured to require 10 workers per task, which yielded a close-to-linear distribution of workers per task due to empty tasks and skips
    • 23,557 total transcriptions completed by at least 1,089 distinct workers
    [Figure: "Transcriptions Performed by Individual Workers"; x-axis: Distinct Worker ID; y-axis: Transcriptions per Worker. Total of 23,557 transcriptions completed; 9,950 anonymous transcriptions; 1,089 distinct known workers; 255 single-transcription workers; top worker: 381 transcriptions]
    [Figure: "Worker Distribution Per Task"; x-axis: Workers per Task. 1: 2.00%, 2: 4.42%, 3: 6.99%, 4: 8.76%, 5: 11.10%, 6: 10.75%, 7: 11.87%, 8: 11.95%, 9: 15.43%, 10: 16.63%, 11: 0.11%]
    Field: number of unique values
    – Country: 39
    – State/Province: 288
    – County: 655
    – Scientific name: 5,941
    – Scientific author: 4,088
    – Location: 16,161
    – Habitat: 15,134
    – Collected by: 3,380
    – Collector Number: 3,665
    – Collection date: 2,287
  • 13. Overall Performance
    [Figure: "Consensus Achievement Varying Algorithms"; y-axis: Consensus Achievement (%); algorithm combinations: nf, bnt, bntwcsp, bntwcspf, bntwcspfl2, bntwcspfl2v; labeled values: 1.8%, 53.2%, 61.4%, 84.2%]
    • Full consensus improves from 1.8% to 84.2%
    • Confirms the intuition that country, state/province, county, collector number, and collection date are “easy” fields
    • Lossless algorithms have a small impact except for scientific author and collected by
    • Among the lossy algorithms, being insensitive to whitespace, punctuation, and letter case, together with considering substrings, provides the greatest improvement in “difficult” fields
  • 14. Additional Verification
    • Consensus reached mainly with lossless algorithms
    • Low percentage of blank responses
  • 15. Consensus Accuracy
    • 300 labels were transcribed by an expert
    • Expert had access to information across labels that workers did not have
    • Effect on the overall accuracy is minimal
      – 0.9% drop for accepting cases that did not reach consensus
      – 2.3% drop for minimizing the needed workforce
    • Character score: $\frac{2 \cdot \mathit{matches}}{\mathit{len}(s_1) + \mathit{len}(s_2)}$
    [Figure: "Consensus Accuracy (Consensus Needed vs. Accept Best)"; x-axis: Character Score (%); share above the 90% threshold: 92.0% (Consensus Needed), 91.1% (Accept Best)]
    [Figure: "Consensus Accuracy (Min. Workers vs. All Workers)"; x-axis: Character Score; share above the 90% threshold: 92.0% (All Workers), 89.7% (Min. Workers)]
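The character score on the slide above, 2 · matches / (len(s1) + len(s2)), has the same form as the similarity ratio computed by Python's standard `difflib.SequenceMatcher`. The slides do not say which implementation was used, so the sketch below is only one way to compute such a score; `character_score` is a hypothetical name, and how `matches` is counted (here, the total size of matching blocks) is an assumption.

```python
from difflib import SequenceMatcher


def character_score(s1: str, s2: str) -> float:
    """Return 2 * matches / (len(s1) + len(s2)), a value between 0.0 and 1.0."""
    if not s1 and not s2:
        return 1.0  # assumption: two empty strings count as identical
    matcher = SequenceMatcher(None, s1, s2, autojunk=False)
    # Matching blocks are the aligned runs of equal characters.
    matches = sum(block.size for block in matcher.get_matching_blocks())
    return 2 * matches / (len(s1) + len(s2))


if __name__ == "__main__":
    expert = "R. E. Perdue, Jr. and K. Blum"
    crowd = "Re Perdue Jr and K. Blum"
    print(f"{character_score(expert, crowd):.3f}")
```

A consensus answer would then count as accurate when its score against the expert transcription exceeds the 90% threshold used on the slide.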
  • 16. Workforce Savings
    • Savings can be as high as 55.8% for the distribution in the studied dataset
    • A good fixed setting: 3 workers
    • Controller advantage: 3 workers is only a good average; our results show cases where up to 9 workers were needed to reach overall consensus
    [Figure: "Workforce Savings"; x-axis: Workers per Task (2 to 11); series: Original Number of Workers, Minimum Number of Workers; labeled values, in x-axis order: 2.00, 2.75, 2.93, 3.11, 3.16, 3.30, 3.29, 3.25, 3.20, 3.00; annotation: 55.8% of jobs saved with controller]
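One plausible way to express a savings figure like the one above is the fraction of transcriptions that would not have been needed had each task stopped at its minimum number of workers. The slides do not give the exact definition, so the formula, the function name, and the toy numbers below are assumptions and are not drawn from the studied dataset.

```python
from typing import Sequence


def workforce_savings(original: Sequence[int], minimum: Sequence[int]) -> float:
    """Fraction of transcriptions saved: 1 - sum(minimum) / sum(original).

    original[i] is the number of workers task i actually received;
    minimum[i] is the number it needed to reach consensus.
    """
    if len(original) != len(minimum):
        raise ValueError("both sequences must describe the same tasks")
    total = sum(original)
    if total == 0:
        raise ValueError("no transcriptions to compare against")
    return 1 - sum(minimum) / total


if __name__ == "__main__":
    # Toy data: two tasks that received 10 and 4 workers but needed 3 and 2.
    print(f"{workforce_savings([10, 4], [3, 2]):.1%}")
```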
  • 17. Task Design Improvement Recommendations
    • A restricted user interface improves consensus and accuracy
      – Take care not to restrict valid scenarios (e.g., partial dates, ranges of dates)
      – Broadly defined fields could be engineered to capture more parsed data (e.g., lat/long, TRS)
    • Exploring relationships between tasks
      – Enter collector number and collection date first
      – Update related records to have the same information
    • Additional training
      – Problems are pronounced in separating scientific names from their authorship
  • 18. Additional Improvements to Consensus Algorithms
    • Code is modular and open, creating opportunities for:
      – Custom dictionaries (general dictionaries led to a high number of false positives due to the number of abbreviations and names)
      – Scientific name parsers
      – External contributions
      – https://github.com/idigbio-citsci-hackathon/CrowdConsensus
    • Merge matched outputs after lossy algorithms are applied, e.g.:
      – R. E. Perdue, Jr. and K. Blum
      – Re Perdue Jr and K. Blum
      – R. R. Perdue Jr, K. Blum
    • Additional validation across fields (consistency)
    • Apply the consensus controller on a per-field basis
  • 19. Recommendations Beyond Crowdsourcing
    • Leveraging and improving:
      – Optical Character Recognition (OCR)
      – Natural Language Processing (NLP)
    • Two-way street scenarios:
      – Use crowdsourcing to select clean text for OCR
      – Use even poor OCR to guide tasks to the right crowd by creating clusters of tasks
      – Use NLP to parse verbatim data from the crowd
      – Improve NLP and OCR training with additional data from the crowd
  • 20. Additional parsed data
    • PhyloJIVE: http://phylojive.acis.ufl.edu
  • 21. iDigBio is funded by a grant from the National Science Foundation’s Advancing Digitization of Biodiversity Collections Program (Cooperative Agreement EF-1115210). Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation. www.idigbio.org facebook.com/iDigBio twitter.com/iDigBio vimeo.com/idigbio idigbio.org/rss-feed.xml webcal://www.idigbio.org/events-calendar/export.ics Questions? Thank you!