What Henderson Saw
EXTRACTING OBSERVATIONS FROM CENTURY-OLD FIELD NOTEBOOKS

or

From documents to datasets
MINING THE JUNIUS HENDERSON FIELD NOTES FOR SPECIES OCCURRENCE RECORDS

Andrea Thomer (UIUC), Gaurav Vaidya (CU-B), Robert Guralnick (CU-B),
David Bloom (UC-B) & Laura Russell (KU)
Field notes and Biodiversity science

• Field work is central to biodiversity work
• Field notes:
  • Are central to field work
  • Are typically stored in archives
  • But contain data
     • Data wants to be free!
Biodiversity science and “first person precision”
• We often forget that field notes store data

• Value of field notes is in the combination of
  qualitative/quantitative data (Kramer, 2011)

• Grinnell: “first person precision” (1912)

• How do we free the data, while also preserving the
  record of its context of production?
Junius Henderson

• A typical natural history “old-timer”
  • Had a mustache
  • Wore suspenders
  • Wrote snarky comments in his field notes about young
    whippersnappers and trains
  • Studied clams
Influential in small but lasting ways, but not well-known beyond Boulder
Henderson’s field notes

•   13 notebooks, 1 locality notebook
•   1672 pages of notes total
•   Prolific collector
•   Numerous photographs
•   1905: Began field work for CU Museum
•   2000-2002: Transcribed by Dr. Peter Robinson
•   2006: NSIDC scanned the Henderson notebooks
•   2011-2012: annotation and data extraction
The Henderson Field Note Project

• We were looking for a low-tech digitization project
• Rob knew of the existence of the transcribed notes
• “What can we accomplish with five hours of work
  each?”
• Goals:
 • Make notes freely available
 • Try to engage volunteers on the internet
 • Produce one “neat thing” (a visualization, a map, etc)
Challenges in making notes available

•   No time!
•   No resources!
•   No time!
•   No repository!
•   No platform!
•   No time!
Solutions to challenges (ver. 1)

•   No sleeping!
•   Use free resources!
•   Guerrilla takeover of Wikisource!
•   Profit!
Wikisource

• A Wikimedia Foundation project, like Wikipedia
• Has its own “collections” or “accessions” policies
  • All docs from before 1923
  • Post-1922: Documentary sources, peer-reviewed scientific
    research, analytical & artistic works
• Support for “adding value” via
  transcription, translation, annotation, and more
Basic Project Steps

•   Upload notebooks to Wikisource
•   Match transcriptions to scans by hand
•   Create templates to support annotation
•   Advertise project; attract volunteers
•   Write simple script to extract annotations
•   Publish those via an IPT installation as a Darwin Core Archive (DwC-A)
•   Sleep
Annotation Templates

• Anyone can annotate the transcribed text to tag
  elements
• Ex. “I saw a white-tailed jack rabbit” becomes
 “I saw a {{taxon|Lepus townsendii|white tailed jack rabbit}}.”
Annotation Templates

      {{taxon|Lepus townsendii|white tailed jack rabbit}}

• taxon: type of annotation
• Lepus townsendii: Wikipedia link
  • Note: “white tailed jack rabbit” would work here as well.
• white tailed jack rabbit: verbatim text
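
A minimal sketch of how a "simple script" might pull these templates out of a page's wikitext, assuming every annotation follows the three-part form shown above. Only the `taxon` template name appears in the slides; the `place` and `date` names, the regular expression, and the function name are assumptions made for illustration, not the project's actual code.

```python
import re

# Assumed pattern for the three-part template:
# {{type|Wikipedia link|verbatim text}}
# "place" and "date" are guessed template names; only "taxon" is shown in the slides.
TEMPLATE = re.compile(r"\{\{(taxon|place|date)\|([^|{}]*)\|([^|{}]*)\}\}")


def extract_annotations(wikitext):
    """Return a (type, link, verbatim) tuple for each template found."""
    return [match.groups() for match in TEMPLATE.finditer(wikitext)]


page = "I saw a {{taxon|Lepus townsendii|white tailed jack rabbit}}."
print(extract_annotations(page))
# [('taxon', 'Lepus townsendii', 'white tailed jack rabbit')]
```

A regular expression is enough for flat templates like this one; nested templates would need a real wikitext parser.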
Basic Project Steps

•   Upload notebooks to Wikisource
•   Match transcriptions to scans by hand
•   Create templates to support annotation
•   Advertise project; attract volunteers
•   Write simple script to extract annotations
•   Write complex scripts to extract annotations and
    compile them into occurrences
•   Extensively review occurrences
•   Taxonomic referencing
•   Publish those via IPT installation as a DwC-A
•   Sleep
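
The "compile them into occurrences" step could look roughly like the sketch below, assuming annotations have already been pulled out as (type, link, verbatim) tuples grouped by notebook entry. The pairing rule (every taxon in an entry inherits that entry's date and places), the `verbatimText` and `needsReview` keys, and the sample entry are all invented for illustration. `scientificName`, `eventDate` and `verbatimLocality` are borrowed from Darwin Core term names.

```python
def compile_occurrences(entries):
    """Pair each taxon annotation in an entry with that entry's date and places."""
    occurrences = []
    for entry in entries:
        annotations = entry["annotations"]
        dates = [link for kind, link, _ in annotations if kind == "date"]
        places = [link for kind, link, _ in annotations if kind == "place"]
        for kind, link, verbatim in annotations:
            if kind != "taxon":
                continue
            occurrences.append({
                "scientificName": link,
                "verbatimText": verbatim,  # hypothetical key, not a Darwin Core term
                "eventDate": dates[0] if dates else None,
                "verbatimLocality": "; ".join(places),
                # Hypothetical flag: anything other than exactly one date and
                # one place is ambiguous and goes to a human reviewer.
                "needsReview": len(dates) != 1 or len(places) != 1,
            })
    return occurrences


# Invented sample entry, not taken from the notebooks.
sample_entry = {
    "annotations": [
        ("date", "1905-07-04", "July 4"),
        ("place", "Boulder, Colorado", "Boulder"),
        ("taxon", "Lepus townsendii", "white tailed jack rabbit"),
    ]
}
print(compile_occurrences([sample_entry]))
```

The review flag is there because an entry can mention several places or none, which is one reason the occurrences need extensive review.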
Taxonomic Referencing

•   Remember that “Wikipedia link”?
•   We want to check if that is a valid taxonomic name
•   How?
•   Easy, right? Just check against a resolver!
•   Hard! Which resolver? How to verify?
    1) Check name against ITIS and EOL.
    2) Possible outcomes:
       a) Both concordant! YAY!
       b) No results from either. Boo!
       c) Discordant results. Need HUMANS!
    3) This was LOTS of work (thanks, Gaurav!)
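
The three outcomes above can be sketched as a small decision function. This is an illustration only: it takes the names returned by each lookup as plain strings (or `None` for no result) and makes no real calls to ITIS or EOL. Treating "one resolver answered, the other did not" as discordant is an assumption; the slides do not say how that case was handled.

```python
def normalize(name):
    """Collapse whitespace and case so trivially different spellings compare equal."""
    return " ".join(name.split()).lower()


def classify_lookup(itis_name, eol_name):
    """Classify one name check; each argument is a resolver's answer or None."""
    if itis_name is None and eol_name is None:
        return "no results"
    if (itis_name is not None and eol_name is not None
            and normalize(itis_name) == normalize(eol_name)):
        return "concordant"
    # Assumption: a result from only one resolver also goes to a human.
    return "discordant: needs human review"


print(classify_lookup("Lepus townsendii", "Lepus townsendii"))  # concordant
print(classify_lookup(None, None))                              # no results
print(classify_lookup("Lepus townsendii", None))                # discordant: needs human review
```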
Results!

   • 3 Notebooks posted and fully annotated

                        Notebook 1                Notebook 2                Notebook 3
Downloaded on           March 27, 2012            March 27, 2012            March 27, 2012
Pages processed         112 of 114                120 of 123                120 of 122
Number of entries       62 of 64                  62 of 63                  98 of 99
Number of annotations   632                       703                       1007
Taxon annotations       349 (201 unique)          224 (125 unique)          514 (248 unique)
Place annotations       219 (115 unique)          419 (154 unique)          401 (139 unique)
Date annotations        64 (63 unique)            60 (59 unique)            92 (90 unique)
Dates in range          July 1905 to April 1907   May 1907 to October 1908  January 1909 to September 1909
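
A quick consistency check on the table, as a sketch: the numbers are copied from the slide, and the variable and function names are made up here. It confirms that taxon, place and date annotations add up to each notebook's annotation total, then sums across notebooks.

```python
# Counts copied from the results table: (total, taxon, place, date).
counts = {
    "Notebook 1": (632, 349, 219, 64),
    "Notebook 2": (703, 224, 419, 60),
    "Notebook 3": (1007, 514, 401, 92),
}


def by_type_matches_total(row):
    """True when the three per-type counts sum to the stated total."""
    total, taxon, place, date = row
    return taxon + place + date == total


all_consistent = all(by_type_matches_total(row) for row in counts.values())
total_annotations = sum(row[0] for row in counts.values())
total_taxon = sum(row[1] for row in counts.values())
print(all_consistent, total_annotations, total_taxon)
# True 2342 1087
```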
Results!... With caveats

• 3 Notebooks posted and mostly annotated
• 1076 occurrences extracted
• A published Darwin Core Archive!
   • Most of our project’s Skype calls were about DwC term use
• A ZooKeys paper (hopefully)
• A lot more questions….
What challenges remain?

• How do we georeference these occurrences?

• How do we maintain ties between DwC records and
  field notes?

• How do we assign unique identifiers to wiki tags?

• Is Wikisource the best place for this data?
Why this could work for you too:

• Wikimedia projects really are community driven
• We can all be a part of this community – if we do
  the work
• Your lab, archive or library has as many or more
  potential contributors as our project
• There are many flexible transcription platforms in
  addition to Wikipedia
This entire project was only possible because people had been making
small steps towards digitization over the last 10 years
Questions?

• References:
 • Grinnell J (1912) An Afternoon’s Field Notes. The
   Condor, 14(3), 104-107. Retrieved from
   http://www.jstor.org/stable/1362226.
 • Kramer KL (2011) The spoken and the unspoken. In M. R.
   Canfield (Ed.), Field Notes on Science & Nature.
   Cambridge, Massachusetts: Harvard University Press.


• For more about Henderson, see our blog!
  http://soyouthinkyoucandigitize.wordpress.com/category/henderson-project/


Editor's Notes

  1. “First person precision” refers to the idiosyncratic, unatomizable narrative about nature, be it a drawing on a cave wall or a handwritten page in a field journal, that gives specimens and observations context that may not readily fit into a spreadsheet, and which may form the nucleus of an important new insight or discovery. Thus, field notes are the product of both qualitative and quantitative methods, in which structured and unstructured data are intertwined.
  2. A classic “neat old guy”. This is a phrase I just made up, but the point is that Henderson is like a lot of the people whose notes you likely keep. He was influential in lasting ways but is little known beyond his immediate sphere of influence (in this case, Boulder, CO and malacology). He was a dutiful scientist, and we as LIS professionals are charged with preserving his legacy.
  3. Poor man’s transcription platform