The document discusses extracting data from old field notebooks written by Junius Henderson between 1905 and 1909. Volunteers helped transcribe and annotate the notebooks on Wikisource to extract over 1000 species occurrence records. These records were published as a Darwin Core Archive and represent one of the first large-scale efforts to digitize and analyze historical field notes for scientific data. Challenges remain in georeferencing the locations and maintaining connections to the original field notes.
From documents to datasets -- mining the Junius Henderson Field Notes for species occurrence records
1. What Henderson Saw
EXTRACTING OBSERVATIONS FROM CENTURY-OLD FIELD
NOTEBOOKS
Andrea Thomer (UIUC), Gaurav Vaidya (CU-B),
Robert Guralnick (CU-B), David Bloom (UC-B) & Laura Russell (KU)
3. From documents to datasets
MINING THE JUNIUS HENDERSON FIELD NOTES FOR SPECIES
OCCURRENCE RECORDS
4. Field notes and Biodiversity science
• Field work is central to biodiversity work
• Field notes:
• Are central to field work
• Are typically stored in archives
• But contain data
• Data wants to be free!
5. Biodiversity science and “first person precision”
• We often forget that field notes store data
• Value of field notes is in the combination of
qualitative/quantitative data (Kramer, 2011)
• Grinnell: “first person precision” (1912)
• How do we free the data, while also preserving the
record of its context of production?
6. Junius Henderson
• A typical natural history “old-timer”
• Had a mustache
• Wore suspenders
• Wrote snarky comments in his field notes about young
whippersnappers and trains
• Studied clams
8. Henderson’s field notes
• 13 notebooks, 1 locality notebook
• 1672 pages of notes total
• Prolific collector
• numerous photographs
• 1905: Began field work for CU Museum
• 2000-2002: Transcribed by Dr. Peter Robinson
• 2006: NSIDC scanned the Henderson notebooks
• 2011-2012: annotation and data extraction
9. The Henderson Field Note Project
• We were looking for a low-tech digitization project
• Rob knew of the existence of the transcribed notes
• “What can we accomplish with five hours of work
each?”
• Goals:
• Make notes freely available
• Try to engage volunteers on the internet
• Produce one “neat thing” (a visualization, a map, etc)
10. Challenges in making notes available
• No time!
• No resources!
• No time!
• No repository!
• No platform!
• No time!
11. Solutions to challenges (ver. 1)
• No sleeping!
• Use free resources!
• Guerrilla takeover of Wikisource!
• Profit!
12. Wikisource
• Part of Wikimedia Foundation, as is Wikipedia
• Has its own “collections” or “accessions” policies
• All docs from before 1923
• Post-1922: Documentary sources, peer-reviewed scientific
research, analytical & artistic works
• Support for “adding value” via
transcription, translation, annotation, and more
13. Basic Project Steps
• Upload notebooks to Wikisource
• Match transcriptions to scans by hand
• Create templates to support annotation
• Advertise project; attract volunteers
• Write simple script to extract annotations
• Publish those via IPT installation as a DwC-A
• Sleep
17. Annotation Templates
• Anyone can annotate the transcribed text to tag
elements
• Ex. “I saw a white-tailed jack rabbit” becomes
“I saw a {{taxon|Lepus townsendii|white tailed jack rabbit}}.”
18. Annotation Templates
{{taxon|Lepus townsendii|white tailed jack rabbit}}
• taxon: type of annotation
• Lepus townsendii: Wikipedia link
(Note: “white tailed jack rabbit” would work here as well.)
• white tailed jack rabbit: verbatim text
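A minimal sketch of how a simple extraction script might pull these annotations out of the wikitext. It assumes every annotation has exactly three pipe-separated fields, as in the taxon example, and that templates are never nested. The function names and the regular expression are illustrative, not the project's actual script.

```python
import re

# Matches {{kind|link|verbatim}}.
# Assumption: exactly three fields, no nested templates, no pipes inside a field.
TEMPLATE_RE = re.compile(r"\{\{(\w+)\|([^|{}]*)\|([^|{}]*)\}\}")


def extract_annotations(wikitext):
    """Return a list of (kind, link, verbatim) tuples found in the wikitext."""
    return [
        (m.group(1), m.group(2).strip(), m.group(3).strip())
        for m in TEMPLATE_RE.finditer(wikitext)
    ]


def strip_annotations(wikitext):
    """Replace each template with its verbatim text to recover the plain transcription."""
    return TEMPLATE_RE.sub(lambda m: m.group(3), wikitext)


sample = "I saw a {{taxon|Lepus townsendii|white tailed jack rabbit}}."
annotations = extract_annotations(sample)
plain = strip_annotations(sample)
```

Keeping the verbatim text as the last field means the original transcription can always be recovered by stripping the markup, which is one way to keep the annotated page and the source text in step.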
22. Basic Project Steps
• Upload notebooks to Wikisource
• Match transcriptions to scans by hand
• Create templates to support annotation
• Advertise project; attract volunteers
• Write simple script to extract annotations (crossed out; replaced by the steps below)
• Write complex scripts to extract annotations and
compile them into occurrences
• Extensively review occurrences
• Taxonomic referencing
• Publish those via IPT installation as a DwC-A
• Sleep
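A rough sketch of what "compile them into occurrences" could look like. This is not the project's script: it assumes annotations arrive as (kind, link, verbatim) tuples, that the place and date templates are named "place" and "date" (only "taxon" appears in these slides), and that every taxon tagged in an entry becomes one occurrence sharing that entry's date and places. The column names are real Darwin Core terms, but the choice and mapping of terms is an assumption, and the sample entry is invented for illustration.

```python
import csv
import io

# Darwin Core terms used as column headers (the selection is an assumption).
DWC_FIELDS = [
    "occurrenceID", "basisOfRecord", "recordedBy", "eventDate",
    "verbatimLocality", "scientificName", "verbatimIdentification",
]


def compile_occurrences(entry_id, annotations):
    """Build one occurrence row per taxon annotation in a single notebook entry."""
    dates = [link for kind, link, _ in annotations if kind == "date"]
    places = [verbatim for kind, _, verbatim in annotations if kind == "place"]
    taxa = [a for a in annotations if a[0] == "taxon"]
    rows = []
    for i, (_, link, verbatim) in enumerate(taxa, start=1):
        rows.append({
            "occurrenceID": f"{entry_id}-{i}",    # hypothetical identifier scheme
            "basisOfRecord": "HumanObservation",  # assumption: treated as observations
            "recordedBy": "Junius Henderson",
            "eventDate": dates[0] if dates else "",
            "verbatimLocality": "; ".join(places),
            "scientificName": link,
            "verbatimIdentification": verbatim,
        })
    return rows


def to_csv(rows):
    """Serialize occurrence rows as CSV text with a Darwin Core header row."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=DWC_FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


# Invented entry, for illustration only.
entry = [
    ("date", "1905-07-01", "July 1"),
    ("place", "Boulder, Colorado", "Boulder"),
    ("taxon", "Lepus townsendii", "white tailed jack rabbit"),
]
rows = compile_occurrences("notebook1-entry1", entry)
csv_text = to_csv(rows)
```

This produces only the core data table. A full Darwin Core Archive also needs a meta.xml descriptor that maps the columns to terms, which a publishing tool such as the IPT generates.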
24. Taxonomic Referencing
• Remember that “Wikipedia link”?
• We want to check if that is a valid taxonomic name
• How?
• Easy, right? Just check against a resolver!
• Hard! Which resolver? How to verify?
1) Check name against ITIS and EOL.
2) Possible outcomes:
a) Both concordant! YAY!
b) No results from both. Boo!
c) Discordant results. Need
HUMANS!
3) This was LOTS of work (thanks, Gaurav!)
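The triage in step 2 can be sketched as a small function. This is an illustration, not the project's code: the two resolvers are stood in for by plain dictionaries with made-up contents (no real ITIS or EOL calls are made), names are compared case-insensitively, and a name found by only one resolver is sent to human review, a case the slide does not cover.

```python
def classify_name(itis_name, eol_name):
    """Classify a name from what two resolvers returned (None means no result)."""
    if itis_name is None and eol_name is None:
        return "no results"
    if itis_name is None or eol_name is None:
        # Assumption: a one-sided match is handled like a discordant result.
        return "needs human review"
    if itis_name.strip().lower() == eol_name.strip().lower():
        return "concordant"
    return "needs human review"


# Hypothetical stand-ins for resolver responses; the contents are invented.
itis_lookup = {
    "Lepus townsendii": "Lepus townsendii",
    "Example name B": "Accepted name X",
}
eol_lookup = {
    "Lepus townsendii": "Lepus townsendii",
    "Example name B": "Accepted name Y",
}

names = ["Lepus townsendii", "Example name B", "Example name C"]
results = {
    name: classify_name(itis_lookup.get(name), eol_lookup.get(name))
    for name in names
}
```

Only the "concordant" bucket can pass through automatically. Everything else still lands on a person's desk, which is where the "LOTS of work" comes from.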
26. Results!
• 3 Notebooks posted and fully annotated

Metric                  Notebook 1                Notebook 2                 Notebook 3
Downloaded on           March 27, 2012            March 27, 2012             March 27, 2012
Pages processed         112 of 114                120 of 123                 120 of 122
Number of entries       62 of 64                  62 of 63                   98 of 99
Number of annotations   632                       703                        1007
Taxon annotations       349 (201 unique)          224 (125 unique)           514 (248 unique)
Place annotations       219 (115 unique)          419 (154 unique)           401 (139 unique)
Date annotations        64 (63 unique)            60 (59 unique)             92 (90 unique)
Dates in range          July 1905 to April 1907   May 1907 to October 1908   January 1909 to September 1909
27. Results!... With caveats
• 3 Notebooks posted and mostly (not fully) annotated
• 1076 occurrences extracted
• A published Darwin Core Archive!
• Most of our project’s Skype calls were about DwC term use
• A ZooKeys paper (hopefully)
• A lot more questions….
28. What challenges remain?
• How do we georeference these occurrences?
• How do we maintain ties between DwC records and
field notes?
• How do we assign unique identifiers to wiki tags?
• Is Wikisource the best place for this data?
32. Why this could work for you too:
• Wikimedia projects really are community driven
• We can all be a part of this community – if we do
the work
• Your lab, archive or library has as many or more
potential contributors as our project
• There are many flexible transcription platforms in
addition to Wikipedia
33. This entire project was only possible because people had been making small steps towards digitization over the last 10 years
34. Questions?
• References:
• Grinnell J (1912) An Afternoon’s Field Notes. The
Condor, 14(3), 104-107. Retrieved from
http://www.jstor.org/stable/1362226.
• Kramer KL (2011) The spoken and the unspoken. In M. R.
Canfield (Ed.), Field Notes on Science & Nature.
Cambridge, Massachusetts: Harvard University Press.
• For more about Henderson, see our blog!
http://soyouthinkyoucandigitize.wordpress.com/category/henderson-project/
Editor's Notes
“First person precision” refers to the idiosyncratic, unatomizable narrative about nature (be it a drawing on a cave wall or a handwritten page in a field journal) that gives specimens and observations context that may not readily fit into a spreadsheet, and which may form the nucleus of an important new insight or discovery. Thus, field notes are the product of both qualitative and quantitative methods, in which structured and unstructured data are intertwined.
A classic “neat old guy”: this is a phrase I just made up, but the point is that Henderson is like a lot of the people whose notes you likely keep. He was influential in lasting ways but is little known beyond his immediate sphere of influence (in this case, Boulder, CO and malacology). He was a dutiful scientist, and we as LIS professionals are charged with preserving his legacy.