Orientation EBC 2013: Digitising Natural History

Digitising Natural History
Marieke van Erp
marieke.van.erp@vu.nl

Why Digitise?
New technology offers many new possibilities
• improves collection management
• opens up new avenues of research
• digital collection access

Digitisation at Naturalis
• goal is to have 7 million objects (out of 37 million) digitised by mid-2015, plus a robust infrastructure for continuing digitisation
• 3 million within Naturalis digitisation streets
• 4 million elsewhere
• the other 30 million objects will be digitised at a less detailed level

• Leposoma Guianense, Sipaliwini, 4 km e. of airport, near base camp, forest ground, among leaves, 28-VIII-1968, 12.45 u. reg. nr. 13879

But what you really want...
Genus      Leposoma
Species    Guianense
Region     Sipaliwini
Location   4 km e. of airport, near base camp
Biotope    forest ground, among leaves
Date       28-08-1968
Time       12:45
Reg #      13879
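Part of getting from the raw label to the structured record is normalising values such as the date and the time. A minimal sketch in Python, assuming day-month-year dates with a Roman-numeral month and times written like `12.45 u.`; the function names are hypothetical and not taken from the original project:

```python
import re

# Roman numerals for the twelve months, as written on the label.
ROMAN_MONTHS = {
    "I": 1, "II": 2, "III": 3, "IV": 4, "V": 5, "VI": 6,
    "VII": 7, "VIII": 8, "IX": 9, "X": 10, "XI": 11, "XII": 12,
}

def normalise_date(raw):
    """Turn '28-VIII-1968' into '28-08-1968'; return None if the pattern does not match."""
    match = re.fullmatch(r"(\d{1,2})-([IVX]+)-(\d{4})", raw.strip())
    if match is None or match.group(2) not in ROMAN_MONTHS:
        return None
    day = int(match.group(1))
    month = ROMAN_MONTHS[match.group(2)]
    return f"{day:02d}-{month:02d}-{match.group(3)}"

def normalise_time(raw):
    """Turn '12.45 u.' into '12:45'; return None if the pattern does not match."""
    match = re.fullmatch(r"(\d{1,2})\.(\d{2})(?:\s*u\.?)?", raw.strip())
    if match is None:
        return None
    return f"{int(match.group(1)):02d}:{match.group(2)}"
```

Values that do not fit the assumed patterns come back as `None`, so they can be left for manual inspection instead of being silently rewritten.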

• Leposoma Guianense, Sipaliwini, 4 km e. of airport, near base camp, forest ground, among leaves, 28-VIII-1968, 12.45 u. reg. nr. 13879
• ask a computer to learn to segment and classify text snippets

• Manually annotate 500 text snippets (~3h)
• 300 for training
• 200 for testing
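The slides do not say which learner was used, so the following is only an illustrative sketch: a tiny k-nearest-neighbour token classifier (the algorithm named in the further-information slide) that is trained on annotated snippets and scored on held-out ones. The features, the function names and the three toy snippets are assumptions made for this example.

```python
from collections import Counter

def features(tokens, i):
    """Describe token i by its form, its neighbours and its position in the snippet."""
    tok = tokens[i]
    return (
        tok.lower(),
        "digit" if any(c.isdigit() for c in tok) else "alpha",
        "cap" if tok[:1].isupper() else "low",
        tokens[i - 1].lower() if i > 0 else "<s>",
        tokens[i + 1].lower() if i + 1 < len(tokens) else "</s>",
        i,
    )

def overlap(a, b):
    """Count how many feature positions are identical."""
    return sum(x == y for x, y in zip(a, b))

def knn_classify(train, feats, k=1):
    """Return the majority label among the k training items most similar to feats."""
    ranked = sorted(train, key=lambda item: overlap(item[0], feats), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

def to_examples(snippets):
    """Flatten annotated snippets [(token, label), ...] into (features, label) pairs."""
    examples = []
    for snippet in snippets:
        tokens = [tok for tok, _ in snippet]
        for i, (_, label) in enumerate(snippet):
            examples.append((features(tokens, i), label))
    return examples

# Toy annotated snippets (invented for illustration only).
annotated = [
    [("Leposoma", "genus"), ("guianense", "species"), ("Sipaliwini", "region"), ("13879", "regnr")],
    [("Bufo", "genus"), ("marinus", "species"), ("Paramaribo", "region"), ("20411", "regnr")],
    [("Hyla", "genus"), ("boans", "species"), ("Nickerie", "region"), ("10552", "regnr")],
]
train = to_examples(annotated[:2])  # the talk used 300 snippets for training
test = to_examples(annotated[2:])   # and 200 for testing

correct = sum(knn_classify(train, feats) == label for feats, label in test)
accuracy = correct / len(test)
```

Keeping the test snippets strictly separate from the training snippets is what makes the reported accuracy an estimate of performance on unseen labels.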

• 49,688 new database records (547,528 database cells) at ~84.57% accuracy

The Manually Created Reptiles and Amphibians Database
• 16,870 records describing characteristics and history of animal specimens in a natural history database
• 39 columns
• Dutch, English, German and Portuguese
• numeric and textual values (both atomic and elaborate)

Column name      Value
order            Anura
genus            Megophrys
country          Indonesia
biotope          in rain near road
collection date  01.02.1888
type             holotype
determinator     A. Dubois
defined by       (Linnaeus, 1758)
special remarks  in bad condition, was eaten by Leptodactylus rugosus (3023) at night and thrown up again the next morning when killed, partly digested

• a database provides structure
• computers are good at comparing values
• statistical methods can detect inconsistencies

author          determinator    family       genus    country    preservation method
(Daudin, 1802)  ------          Bataguridae  Anolis   Cambodja   (shield, dry)
(Schlegel)      G. vd. Boog     Colubridae   Geophis  Indonesia  -----
Schneider       M. S. Hoogmoed  ------       Bufo     Suriname   -----
(Horst, 1883)   Tyler, M J      Hylidae      Litoria  ------     alcohol

The genus of one record is hidden and predicted from the remaining columns:
author          determinator    family       genus    country    preservation method
(Schlegel)      G. vd. Boog     Colubridae   ?        Indonesia  -----
actual value: Geophis
predicted value: Rhapdophis

• <100 cells to check for a column instead of 16,780
• recall (estimate): 90-100%
• one-size-fits-all
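The idea of predicting each cell from the rest of its record and checking only the disagreements can be sketched as follows. This is a minimal illustration, assuming a k-nearest-neighbour vote over the other columns; the function names and the toy rows are invented and do not reproduce the original system.

```python
from collections import Counter

def predict_cell(rows, index, target, k=3):
    """Predict rows[index][target] from the k most similar other rows."""
    row = rows[index]
    others = [r for i, r in enumerate(rows) if i != index]

    def similarity(other):
        # Number of columns (other than the target) with identical values.
        return sum(row[c] == other[c] for c in row if c != target)

    nearest = sorted(others, key=similarity, reverse=True)[:k]
    return Counter(r[target] for r in nearest).most_common(1)[0][0]

def suspicious_cells(rows, target, k=3):
    """Return (row index, stored value, predicted value) wherever the two disagree."""
    flagged = []
    for i, row in enumerate(rows):
        predicted = predict_cell(rows, i, target, k)
        if predicted != row[target]:
            flagged.append((i, row[target], predicted))
    return flagged

# Toy records (invented); the last one has a family that does not fit its genus.
rows = [
    {"family": "Colubridae", "genus": "Geophis", "country": "Indonesia"},
    {"family": "Colubridae", "genus": "Geophis", "country": "Indonesia"},
    {"family": "Colubridae", "genus": "Geophis", "country": "Suriname"},
    {"family": "Hylidae", "genus": "Litoria", "country": "Australia"},
    {"family": "Hylidae", "genus": "Litoria", "country": "Australia"},
    {"family": "Hylidae", "genus": "Litoria", "country": "Australia"},
    {"family": "Hylidae", "genus": "Geophis", "country": "Indonesia"},
]
flagged = suspicious_cells(rows, "family")
```

Only the flagged cells go to a human expert, which is how the number of cells to check per column can drop from the full column to a short list.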

• Data-driven cleaning cannot detect systematic errors
• Maybe systematics can help?

subject              relation          object
specimen collection  occurs before     entry in museum
species              has broader term  genus
city                 falls within      country

• detects inconsistencies in database usage
• small scope
• high recall and precision within scope
• needs adapting for each new domain

Challenge                         Example
Ambiguous location name           Amsterdam
Two or more location descriptors  Wakarusa, 24mi WSW of Lawrence
Topological nesting               Moccassin Creek on Hog Island
Complex description               Bupo [?Buso] River, 15 miles [24km] E of Lae
Linear feature measurement        16km (by road) N of Murtoa
Linear ambiguity                  On the road between Sydney and Bathurst
Vague localities                  Southeast Michigan
Changed political borders         Yugoslavia
Historical place names            British North Borneo

• Annotated geographical information in 200 randomly selected database records
• 50 records for development, 150 for testing

Knowledge-driven Georeferencing
• Record retrieval
• Text parsing
• Gazetteer lookup
• Offset calculation
• Disambiguation heuristics
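The offset calculation step can be illustrated with a description such as "24mi WSW of Lawrence": once the gazetteer has returned coordinates for the anchor place, the point is moved along the given compass direction. The sketch below assumes a spherical Earth and uses approximate coordinates for Lawrence, Kansas; the function names are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0
KM_PER_MILE = 1.609344

# 16-point compass rose, 22.5 degrees apart, clockwise from north.
COMPASS = ["N", "NNE", "NE", "ENE", "E", "ESE", "SE", "SSE",
           "S", "SSW", "SW", "WSW", "W", "WNW", "NW", "NNW"]

def bearing_degrees(direction):
    """Convert a compass point such as 'WSW' into a bearing in degrees."""
    return COMPASS.index(direction.upper()) * 22.5

def offset_point(lat, lon, distance_km, direction):
    """Great-circle destination after travelling distance_km from (lat, lon)."""
    theta = math.radians(bearing_degrees(direction))
    delta = distance_km / EARTH_RADIUS_KM  # angular distance in radians
    lat1, lon1 = math.radians(lat), math.radians(lon)
    lat2 = math.asin(math.sin(lat1) * math.cos(delta)
                     + math.cos(lat1) * math.sin(delta) * math.cos(theta))
    lon2 = lon1 + math.atan2(math.sin(theta) * math.sin(delta) * math.cos(lat1),
                             math.cos(delta) - math.sin(lat1) * math.sin(lat2))
    return math.degrees(lat2), math.degrees(lon2)

# Approximate coordinates for Lawrence, Kansas (assumed for illustration).
lawrence = (38.97, -95.24)
lat, lon = offset_point(*lawrence, 24 * KM_PER_MILE, "WSW")
```

The result is a point estimate only; a description like "by road" or "between Sydney and Bathurst" needs a different treatment, as the challenge table shows.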

Disambiguation Heuristics
• Spatial Minimality
  • if Amsterdam and Utrecht are mentioned in the same record, then Amsterdam, NL is more likely than Amsterdam, NY, USA
• Expedition clusters
  • It is unlikely that a collector was collecting in Europe on Monday and in the US on Tuesday
• Species occurrence data
  • GBIF can tell us where a certain species does or does not occur
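The spatial minimality heuristic can be sketched as picking, for every place name in a record, the candidate that keeps all places closest together. The mini-gazetteer below is hypothetical and its coordinates are approximate; a real system would query a full gazetteer.

```python
import math
from itertools import product

def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))

# Hypothetical mini-gazetteer with approximate coordinates.
GAZETTEER = {
    "Amsterdam": {"Amsterdam, NL": (52.37, 4.90), "Amsterdam, NY, USA": (42.94, -74.19)},
    "Utrecht": {"Utrecht, NL": (52.09, 5.12)},
}

def spatial_minimality(names):
    """Choose one candidate per name so that the summed pairwise distance is smallest."""
    options = [list(GAZETTEER[name].items()) for name in names]

    def spread(combo):
        return sum(haversine_km(p[1], q[1])
                   for i, p in enumerate(combo) for q in combo[i + 1:])

    best = min(product(*options), key=spread)
    return [label for label, _ in best]

resolved = spatial_minimality(["Amsterdam", "Utrecht"])
```

Trying every combination is fine for the handful of place names in a single record; with many ambiguous names a greedy or clustering approach would be needed instead.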

Results
                       Correct @5km  Correct @25km  Correct @100km  Mean distance off  Not found
Baseline               38.9          47.0           58.4            251.1              26.2
+ Google Maps + fuzzy  53.0          65.1           74.5            244.1              8.7
+ Spatial minimality   59.1          71.8           77.2            171.1              7.4
+ Expedition           59.1          71.8           77.2            171.1              7.4
+ GBIF                 61.7          74.5           79.9            114.5              7.4

Generating Stories
Image source: http://www.gungeralv.org/dg/images/chapter1.JPG

General Conclusions
• data cleaning is essential
• “digitising” a heritage collection is complicated
• don’t try to tame text

Thank you for your attention!

• CATCH: http://www.nwo.nl/catch
• MITCH: http://ilk.uvt.nl/mitch
• NewsReader: http://www.newsreader-project.eu

• More information about machine learning
  • Video explaining k-nearest neighbour algorithm: http://videolectures.net/aaai07_bosch_knnc/
  • Weka Toolkit: http://www.cs.waikato.ac.nz/ml/weka/
