Richness oftheworld2012
 


Guest lecture about digitising natural history in the Richness of the World module at Leiden University. October 1st 2012



Usage Rights

© All Rights Reserved


    Richness oftheworld2012 Presentation Transcript

    • Digitising Natural History Marieke van Erp marieke@cs.vu.nl 1
    • 2
    • Why Digitise? New technology offers many new possibilities:
      • improves collection management
      • opens up new avenues of research
      • digital collection access 3
    • Digitisation at Naturalis
      • goal is to have 7 million objects (out of 37 million) digitised by mid-2015, plus a robust infrastructure for continuing digitisation
      • 3 million within Naturalis digitisation streets
      • 4 million elsewhere
      • other 30 million objects will be digitised at a less detailed level 4
    • 5
    • 6
    • 7
    • • Leposoma Guianense, Sipaliwini, 4 km e. of airport, near base camp, forest ground, among leaves, 28-VIII-1968, 12.45 u. reg. nr. 13879 8
    • But what you really want...
      Genus:    Leposoma
      Species:  Guianense
      Region:   Sipaliwini
      Location: 4 km e. of airport near base camp, forest ground
      Biotope:  among leaves
      Date:     28-08-1968
      Time:     12:45
      Reg #:    13879
      9
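The step from the raw label on slide 8 to the structured record on slide 9 includes two small normalisations: the Roman-numeral date "28-VIII-1968" becomes "28-08-1968" and the Dutch-style time "12.45 u." becomes "12:45". The sketch below shows only those two conversions with hand-written rules. It is an illustration, not the method from the lecture (which learns the segmentation from annotated examples), and the function names are made up for this example.

```python
import re

# Roman numerals as used for months on specimen labels.
ROMAN_MONTHS = {"I": 1, "II": 2, "III": 3, "IV": 4, "V": 5, "VI": 6,
                "VII": 7, "VIII": 8, "IX": 9, "X": 10, "XI": 11, "XII": 12}


def normalise_date(text):
    """Turn '28-VIII-1968' into '28-08-1968'; return None if it does not match."""
    m = re.fullmatch(r"\s*(\d{1,2})-\s*([IVX]+)-(\d{4})\s*", text)
    if not m or m.group(2) not in ROMAN_MONTHS:
        return None
    day, month, year = int(m.group(1)), ROMAN_MONTHS[m.group(2)], m.group(3)
    return f"{day:02d}-{month:02d}-{year}"


def normalise_time(text):
    """Turn '12.45 u.' into '12:45'; return None if it does not match."""
    m = re.fullmatch(r"\s*(\d{1,2})\.(\d{2})\s*u\.?\s*", text)
    if not m:
        return None
    return f"{int(m.group(1)):02d}:{m.group(2)}"


print(normalise_date("28-VIII-1968"))  # 28-08-1968
print(normalise_time("12.45 u."))      # 12:45
```

Rules like these only cover the formats they were written for, which is one reason a learned segmenter is attractive for free-text labels.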
    • • Leposoma Guianense, Sipaliwini, 4 km e. of airport, near base camp, forest ground, among leaves, 28-VIII-1968, 12.45 u. reg. nr. 13879
      • ask a computer to learn to segment and classify text snippets 10
    • • Manually annotate 500 text snippets (~3h)
      • 300 for training
      • 200 for testing 11
    • • 49,688 new database records (547,528 database cells) at ~84.57% accuracy 12
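As a rough illustration of "learning to segment and classify" from a handful of annotated snippets, the sketch below labels each token by looking up its most similar annotated tokens (a nearest-neighbour approach). The slides do not say which learner or features were used, so the feature set, the function names and the toy training data are all assumptions. The toy snippets are loosely modelled on slide 8 and simplified, with species names in lower case.

```python
from collections import Counter


def shape(token):
    """Coarse token shape: contains a digit, starts with a capital, or lower case."""
    if any(c.isdigit() for c in token):
        return "digit"
    return "cap" if token[:1].isupper() else "lower"


def features(tokens, i):
    """Describe token i by its own shape, a hyphen flag, and its left neighbour's shape."""
    prev = shape(tokens[i - 1]) if i > 0 else "<start>"
    return (shape(tokens[i]), "-" in tokens[i], prev)


def train(annotated):
    """annotated: list of (tokens, labels) pairs -> memory of (features, label)."""
    return [(features(tokens, i), label)
            for tokens, labels in annotated
            for i, label in enumerate(labels)]


def classify(memory, tokens, k=1):
    """Label every token by majority vote among its k most similar stored examples."""
    result = []
    for i in range(len(tokens)):
        f = features(tokens, i)
        # Overlap similarity: the number of feature values that match.
        ranked = sorted(memory,
                        key=lambda ex: sum(a == b for a, b in zip(f, ex[0])),
                        reverse=True)
        votes = Counter(label for _, label in ranked[:k])
        result.append(votes.most_common(1)[0][0])
    return result


# Made-up training snippets, already split into tokens and annotated by hand.
annotated = [
    (["Leposoma", "guianense", "Sipaliwini", "28-VIII-1968", "13879"],
     ["genus", "species", "region", "date", "regnr"]),
    (["Bufo", "marinus", "Paramaribo", "03-II-1970", "14002"],
     ["genus", "species", "region", "date", "regnr"]),
]
memory = train(annotated)
predicted = classify(memory, ["Litoria", "caerulea", "Albina", "11-IX-1971", "15010"])
print(predicted)
```

Real labels are far messier than this toy data, so a realistic feature set would need to be richer; the point is only that a few hundred annotated snippets can drive the labelling of many thousands.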
    • The Manually Created Reptiles and Amphibians Database
      • 16,870 records describing characteristics and history of animal specimens in a natural history database
      • 39 columns
      • Dutch, English, German and Portuguese
      • numeric and textual values (both atomic and elaborate) 13
    • column name      value
      order            Anura
      genus            Megophrys
      country          Indonesia
      biotope          in rain near road
      collection date  01.02.1888
      type             holotype
      determinator     A. Dubois
      defined by       (Linnaeus, 1758)
      special remarks  in bad condition, was eaten by Leptodactylus rugosus (3023) at night and thrown up again the next morning when killed, partly digested
      14
    • 15
    • • a database provides structure
      • computers are good at comparing values
      • statistical methods can detect inconsistencies 16
    • author          determinator    family       genus    country    preservation method
      (Daudin, 1802)  ------          Bataguridae  Anolis   Cambodja   (shield, dry)
      (Schlegel)      G. vd. Boog     Colubridae   Geophis  Indonesia  -----
      Schneider       M. S. Hoogmoed  ------       Bufo     Suriname   -----
      (Horst, 1883)   Tyler, M J      Hylidae      Litoria  ------     alcohol
      17
    • actual value: Geophis
      author          determinator    family       genus    country    preservation method
      (Daudin, 1802)  ------          Bataguridae  Anolis   Cambodja   (shield, dry)
      (Schlegel)      G. vd. Boog     Colubridae   ?        Indonesia  -----
      Schneider       M. S. Hoogmoed  ------       Bufo     Suriname   -----
      (Horst, 1883)   Tyler, M J      Hylidae      Litoria  ------     alcohol
      19
    • actual value: Geophis
      predicted value: Rhabdophis
      author          determinator    family       genus    country    preservation method
      (Daudin, 1802)  ------          Bataguridae  Anolis   Cambodja   (shield, dry)
      (Schlegel)      G. vd. Boog     Colubridae   ?        Indonesia  -----
      Schneider       M. S. Hoogmoed  ------       Bufo     Suriname   -----
      (Horst, 1883)   Tyler, M J      Hylidae      Litoria  ------     alcohol
      24
    • • <100 cells to check for a column instead of 16,870
      • recall (estimate): 90-100%
      • one-size-fits-all 25
    • • Data-driven cleaning cannot detect systematic errors
      • Maybe systematics can help? 26
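The idea on the preceding slides (hide a cell, predict it from the rest of the row, and flag the cell when the prediction disagrees with the stored value) can be sketched as follows. This is a minimal leave-one-out nearest-neighbour version under stated assumptions: the slides do not specify the predictor, and the rows, column names and function names here are made up for illustration.

```python
from collections import Counter


def predict_cell(rows, target, idx, k=3):
    """Predict rows[idx][target] from the k other rows that share the most values."""
    row = rows[idx]
    others = [r for j, r in enumerate(rows) if j != idx]

    def overlap(other):
        # Count matching values in every column except the one being predicted.
        return sum(1 for col, val in row.items()
                   if col != target and val is not None and other.get(col) == val)

    nearest = sorted(others, key=overlap, reverse=True)[:k]
    votes = Counter(r[target] for r in nearest if r.get(target) is not None)
    return votes.most_common(1)[0][0] if votes else None


def suspicious_cells(rows, target, k=3):
    """Return (row index, actual, predicted) wherever prediction and data disagree."""
    flagged = []
    for i, row in enumerate(rows):
        predicted = predict_cell(rows, target, i, k)
        if predicted is not None and row[target] is not None and predicted != row[target]:
            flagged.append((i, row[target], predicted))
    return flagged


# Made-up rows; row 3 carries a genus that does not fit its family and country.
rows = [
    {"family": "Hylidae", "country": "Australia", "genus": "Litoria"},
    {"family": "Hylidae", "country": "Australia", "genus": "Litoria"},
    {"family": "Hylidae", "country": "Australia", "genus": "Litoria"},
    {"family": "Hylidae", "country": "Australia", "genus": "Bufo"},
    {"family": "Bufonidae", "country": "Suriname", "genus": "Bufo"},
    {"family": "Bufonidae", "country": "Suriname", "genus": "Bufo"},
    {"family": "Bufonidae", "country": "Suriname", "genus": "Bufo"},
]
flagged = suspicious_cells(rows, "genus")
print(flagged)  # [(3, 'Bufo', 'Litoria')]
```

A flagged cell is only a candidate for review: the prediction can be the wrong one, which is why the slides speak of a short list of cells for a curator to check rather than automatic correction.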
    • subject              relation          object
      specimen collection  occurs before     entry in museum
      species              has broader term  genus
      city                 falls within      country
      27
    • • detects inconsistencies in database usage
      • small scope
      • high recall and precision within scope
      • needs adapting for each new domain 28
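Two of the relations from slide 27 ("specimen collection occurs before entry in museum" and "city falls within country") can be written as explicit checks on a record. The sketch below is only an illustration of that knowledge-driven style: the field names, the tiny city-to-country lookup and the example record are hypothetical, and a real system would draw such relations from an ontology or gazetteer.

```python
from datetime import date

# Hypothetical lookup standing in for a gazetteer or ontology.
CITY_COUNTRY = {"Paramaribo": "Suriname", "Leiden": "Netherlands"}


def check_record(record):
    """Return a list of rule violations found in one database record."""
    problems = []

    # Rule: specimen collection occurs before entry in museum.
    collected = record.get("collection_date")
    entered = record.get("entry_date")
    if collected and entered and collected > entered:
        problems.append("collection date is later than museum entry date")

    # Rule: city falls within country.
    city, country = record.get("city"), record.get("country")
    if city in CITY_COUNTRY and country and CITY_COUNTRY[city] != country:
        problems.append(f"{city} does not fall within {country}")

    return problems


record = {
    "collection_date": date(1970, 3, 1),
    "entry_date": date(1969, 11, 15),
    "city": "Paramaribo",
    "country": "Indonesia",
}
problems = check_record(record)
print(problems)
```

Each rule is precise within its narrow scope, and each has to be written by hand, which matches the trade-off listed on slide 28.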
    • Disambiguating Locations 29
    • Challenge                         Example
      Ambiguous location name           Amsterdam
      Two or more location descriptors  Wakarusa, 24mi WSW of Lawrence
      Topological nesting               Moccassin Creek on Hog Island
      Complex description               Bupo [?Buso] River, 15 miles [24km] E of Lae
      Linear feature measurement        16km (by road) N of Murtoa
      Linear ambiguity                  On the road between Sydney and Bathurst
      Vague localities                  Southeast Michigan
      Changed political borders         Yugoslavia
      Historical Place Names            British North Borneo
      30
    • • Randomly annotated geographical information in 200 database records
      • 50 records for development, 150 for testing 31
    • Knowledge-driven Georeferencing
      • Record retrieval
      • Text parsing
      • Gazetteer lookup
      • Offset calculation
      • Disambiguation Heuristics 32
    • Offset 33
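An offset calculation turns a description such as "24mi WSW of Lawrence" into coordinates: start at the named place and move the given distance along the given compass direction. The sketch below uses the standard destination-point formula on a spherical Earth. The function names are made up, and the coordinates used for Lawrence (assumed here to be Lawrence, Kansas) are approximate and serve only as an illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0
KM_PER_MILE = 1.609344
COMPASS = ["N", "NNE", "NE", "ENE", "E", "ESE", "SE", "SSE",
           "S", "SSW", "SW", "WSW", "W", "WNW", "NW", "NNW"]


def bearing_degrees(direction):
    """Map a 16-point compass direction such as 'WSW' to degrees clockwise from north."""
    return COMPASS.index(direction.upper()) * 22.5


def offset_point(lat, lon, distance_km, bearing_deg):
    """Destination reached after travelling distance_km along bearing_deg on a sphere."""
    phi1, lam1 = math.radians(lat), math.radians(lon)
    theta = math.radians(bearing_deg)
    delta = distance_km / EARTH_RADIUS_KM  # angular distance in radians
    phi2 = math.asin(math.sin(phi1) * math.cos(delta)
                     + math.cos(phi1) * math.sin(delta) * math.cos(theta))
    lam2 = lam1 + math.atan2(math.sin(theta) * math.sin(delta) * math.cos(phi1),
                             math.cos(delta) - math.sin(phi1) * math.sin(phi2))
    return math.degrees(phi2), math.degrees(lam2)


# Approximate coordinates, assumed for illustration only.
lawrence = (38.97, -95.24)
lat, lon = offset_point(*lawrence, 24 * KM_PER_MILE, bearing_degrees("WSW"))
print(round(lat, 2), round(lon, 2))
```

The result is a point estimate; descriptions like "by road" or "between Sydney and Bathurst" from slide 30 need more than a straight-line offset.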
    • Disambiguation Heuristics
      • Spatial Minimality
        • if Amsterdam and Utrecht are mentioned in the same record, then Amsterdam, NL is more likely than Amsterdam, NY, USA
      • Expedition clusters
        • it is unlikely that a collector was collecting in Europe on Monday and in the US on Tuesday
      • Species occurrence data
        • GBIF can tell us where a certain species does or does not occur 34
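The spatial minimality heuristic can be sketched as: for every place name in a record, pick the candidate interpretation so that the chosen places lie as close together as possible. The version below tries all combinations and minimises the summed pairwise great-circle distance. It is one possible reading of the heuristic rather than the implementation from the lecture, and the candidate list with its approximate coordinates is a made-up stand-in for a gazetteer lookup.

```python
import math
from itertools import combinations, product


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    (lat1, lon1), (lat2, lon2) = a, b
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dlam = math.radians(lon2 - lon1)
    h = (math.sin((phi2 - phi1) / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def spatially_minimal(candidates):
    """candidates: {name: [(label, (lat, lon)), ...]} -> {name: chosen label}."""
    names = list(candidates)
    best, best_cost = None, float("inf")
    # Brute force over all combinations; fine for the few names in one record.
    for combo in product(*(candidates[name] for name in names)):
        cost = sum(haversine_km(p[1], q[1]) for p, q in combinations(combo, 2))
        if cost < best_cost:
            best, best_cost = combo, cost
    return {name: choice[0] for name, choice in zip(names, best)}


# Approximate coordinates, for illustration only.
candidates = {
    "Amsterdam": [("Amsterdam, NY, USA", (42.94, -74.19)),
                  ("Amsterdam, NL", (52.37, 4.90))],
    "Utrecht": [("Utrecht, NL", (52.09, 5.12))],
}
chosen = spatially_minimal(candidates)
print(chosen)
```

The number of combinations grows quickly with the number of ambiguous names, so a record with many place names would need a cheaper search than this brute-force loop.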
    • Species Occurrence Data 35
    • Results
      Method                 Correct @5km  Correct @25km  Correct @100km  Mean distance off  Not Found
      Baseline               38.9          47.0           58.4            251.1              26.2
      + Google maps + fuzzy  53.0          65.1           74.5            244.1              8.7
      + Spatial minimality   59.1          71.8           77.2            171.1              7.4
      + Expedition           59.1          71.8           77.2            171.1              7.4
      + GBIF                 61.7          74.5           79.9            114.5              7.4
      36
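Measures like those in the results table can be computed by comparing each predicted coordinate with a hand-annotated one. The sketch below is one plausible way to do this, under assumptions the slides do not confirm: the "correct" and "not found" figures are taken as percentages of all test records, and the mean distance is taken over the records for which a location was found. All names and the example points are made up.

```python
import math


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    (lat1, lon1), (lat2, lon2) = a, b
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dlam = math.radians(lon2 - lon1)
    h = (math.sin((phi2 - phi1) / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def evaluate(predictions, gold, thresholds=(5, 25, 100)):
    """predictions: list of (lat, lon) or None; gold: list of (lat, lon)."""
    total = len(gold)
    distances = [haversine_km(p, g) for p, g in zip(predictions, gold) if p is not None]
    report = {f"correct@{t}km": 100 * sum(d <= t for d in distances) / total
              for t in thresholds}
    report["mean_distance_off"] = sum(distances) / len(distances) if distances else None
    report["not_found"] = 100 * (total - len(distances)) / total
    return report


# Made-up example: an exact hit, a near miss, a record without a match, a far miss.
gold = [(52.0, 5.0)] * 4
predictions = [(52.0, 5.0), (52.1, 5.0), None, (53.0, 5.0)]
report = evaluate(predictions, gold)
print(report)
```

Reporting several distance thresholds alongside the mean is useful because a few very distant misses can dominate the mean while leaving the threshold scores unchanged.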
    • Confidence 37
    • General Conclusions
      • data cleaning is essential
      • “digitising” a heritage collection is complicated
      • don’t try to tame text 38
    • • Data-driven error correction method is being developed further in the CATCHPlus programme
        • http://www.catchplus.nl/diensten/deelprojecten/checkers/ 39
    • Thank you for your attention! 40
    • • CATCH: http://www.nwo.nl/catch
      • MITCH: http://ilk.uvt.nl/mitch
      • Agora: http://agora.cs.vu.nl/ 41
    • • More information about machine learning
        • Video explaining k-nearest neighbour algorithm: http://videolectures.net/aaai07_bosch_knnc/
        • Weka Toolkit: http://www.cs.waikato.ac.nz/ml/weka/ 42