Richness of the World 2012

Guest lecture about digitising natural history in the Richness of the World module at Leiden University. October 1st 2012

Transcript

  • 1. Digitising Natural History. Marieke van Erp, marieke@cs.vu.nl
  • 3. Why Digitise? New technology offers many new possibilities:
      • improves collection management
      • opens up new avenues of research
      • digital collection access
  • 4. Digitisation at Naturalis
      • the goal is to have 7 million objects (out of 37 million) digitised by mid-2015, plus a robust infrastructure for continuing digitisation
      • 3 million within the Naturalis digitisation streets
      • 4 million elsewhere
      • the other 30 million objects will be digitised at a less detailed level
  • 8. Leposoma Guianense, Sipaliwini, 4 km e. of airport, near base camp, forest ground, among leaves, 28-VIII-1968, 12.45 u. reg. nr. 13879
  • 9. But what you really want...
      • Genus: Leposoma
      • Species: Guianense
      • Region: Sipaliwini
      • Location: 4 km e. of airport near base camp, forest ground
      • Biotope: among leaves
      • Date: 28-08-1968
      • Time: 12:45
      • Reg #: 13879
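The date and time fields on slide 9 are normalised versions of the raw label text on slide 8. A minimal sketch of that one step, assuming the day-RomanMonth-year and Dutch "12.45 u." conventions seen in the example; the function names are hypothetical and this is a hand-written rule, not the learned approach the lecture goes on to describe:

```python
import re

# Roman numerals used for months on handwritten labels.
ROMAN_MONTHS = {"I": 1, "II": 2, "III": 3, "IV": 4, "V": 5, "VI": 6,
                "VII": 7, "VIII": 8, "IX": 9, "X": 10, "XI": 11, "XII": 12}

def normalise_date(raw):
    """Convert 'day-ROMANMONTH-year' (e.g. '28-VIII-1968') to 'DD-MM-YYYY'."""
    match = re.fullmatch(r"\s*(\d{1,2})-\s*([IVX]+)-(\d{4})\s*", raw)
    if not match or match.group(2) not in ROMAN_MONTHS:
        return None  # not in the assumed format
    day = int(match.group(1))
    month = ROMAN_MONTHS[match.group(2)]
    return f"{day:02d}-{month:02d}-{match.group(3)}"

def normalise_time(raw):
    """Convert a Dutch-style time such as '12.45 u.' to '12:45'."""
    match = re.fullmatch(r"\s*(\d{1,2})\.(\d{2})\s*u\.?\s*", raw)
    if not match:
        return None
    return f"{int(match.group(1)):02d}:{match.group(2)}"

date_field = normalise_date("28-VIII-1968")  # "28-08-1968"
time_field = normalise_time("12.45 u.")      # "12:45"
```

Anything that does not match the assumed patterns comes back as `None`, so unusual labels can be routed to a person rather than silently mangled.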
  • 10. Leposoma Guianense, Sipaliwini, 4 km e. of airport, near base camp, forest ground, among leaves, 28-VIII-1968, 12.45 u. reg. nr. 13879
      • ask a computer to learn to segment and classify text snippets
  • 11. Manually annotate 500 text snippets (~3h): 300 for training, 200 for testing
  • 12. 49,688 new database records (547,528 database cells) at ~84.57% accuracy
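To make "learn to segment and classify" concrete, here is a toy sketch. The slides do not say which learner or features were used; this example assumes comma-based segmentation, four hand-picked features and a 1-nearest-neighbour classifier (chosen only because the closing slide points to a k-nearest-neighbour explanation). The training chunks come from the label on slide 10, with labels simplified from slide 9; the test snippet is invented for illustration.

```python
import re

def features(chunk, position):
    """Describe one comma-separated chunk with a few symbolic features."""
    text = chunk.strip()
    return (
        "digits" if re.search(r"\d", text) else "no-digits",
        "roman" if re.search(r"\b[IVX]+\b", text) else "no-roman",
        "capitalised" if text[:1].isupper() else "not-capitalised",
        "first" if position == 0 else "later",
    )

# One annotated snippet as training data (the lecture used 300).
LABEL_TEXT = ("Leposoma Guianense, Sipaliwini, 4 km e. of airport, "
              "near base camp, forest ground, among leaves, 28-VIII-1968")
LABELS = ["taxon", "region", "location", "location", "location",
          "biotope", "date"]
TRAINING = [(features(chunk, i), label)
            for i, (chunk, label) in enumerate(zip(LABEL_TEXT.split(","), LABELS))]

def classify_snippet(snippet):
    """Segment on commas and label each chunk by its nearest training example."""
    result = []
    for position, chunk in enumerate(snippet.split(",")):
        target = features(chunk, position)
        # 1-nearest neighbour: the example sharing the most feature values
        # (ties go to the earliest training example).
        _, label = max(TRAINING,
                       key=lambda ex: sum(a == b for a, b in zip(target, ex[0])))
        result.append((chunk.strip(), label))
    return result

# Hypothetical unseen snippet, not taken from the Naturalis data.
labelled = classify_snippet("Bufo marinus, Paramaribo, 2 km s. of river, 03-IV-1971")
```

With so little training data the classifier cannot separate, say, a biotope from a location; the accuracy reported on slide 12 reflects that even with 300 annotated snippets some cells end up in the wrong column.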
  • 13. The Manually Created Reptiles and Amphibians Database
      • 16,870 records describing characteristics and history of animal specimens in a natural history database
      • 39 columns
      • Dutch, English, German and Portuguese
      • numeric and textual values (both atomic and elaborate)
  • 14. column name | value
      order | Anura
      genus | Megophrys
      country | Indonesia
      biotope | in rain near road
      collection date | 01.02.1888
      type | holotype
      determinator | A. Dubois
      defined by | (Linnaeus, 1758)
      special remarks | in bad condition, was eaten by Leptodactylus rugosus (3023) at night and thrown up again the next morning when killed, partly digested
  • 16. A database provides structure; computers are good at comparing values; statistical methods can detect inconsistencies
  • 17. author | determinator | family | genus | country | preservation method
      (Daudin, 1802) | ------ | Bataguridae | Anolis | Cambodja | (shield, dry)
      (Schlegel) | G. vd. Boog | Colubridae | Geophis | Indonesia | -----
      Schneider | M. S. Hoogmoed | ------ | Bufo | Suriname | -----
      (Horst, 1883) | Tyler, M J | Hylidae | Litoria | ------ | alcohol
  • 19. actual value: Geophis
      author | determinator | family | genus | country | preservation method
      (Daudin, 1802) | ------ | Bataguridae | Anolis | Cambodja | (shield, dry)
      (Schlegel) | G. vd. Boog | Colubridae | ? | Indonesia | -----
      Schneider | M. S. Hoogmoed | ------ | Bufo | Suriname | -----
      (Horst, 1883) | Tyler, M J | Hylidae | Litoria | ------ | alcohol
  • 24. actual value: Geophis; predicted value: Rhabdophis
      author | determinator | family | genus | country | preservation method
      (Daudin, 1802) | ------ | Bataguridae | Anolis | Cambodja | (shield, dry)
      (Schlegel) | G. vd. Boog | Colubridae | ? | Indonesia | -----
      Schneider | M. S. Hoogmoed | ------ | Bufo | Suriname | -----
      (Horst, 1883) | Tyler, M J | Hylidae | Litoria | ------ | alcohol
  • 25. Fewer than 100 cells to check per column instead of 16,870; recall (estimate): 90-100%; one-size-fits-all
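The idea on slides 19 to 25 (hide a cell, predict it from the rest of the row, and flag the cell when the prediction disagrees with the stored value) can be sketched as follows. The slides do not give the algorithm's details, so this assumes a k-nearest-neighbour vote with a simple "number of matching columns" similarity. The rows are made-up toy data that echo the Geophis example, not records from the database.

```python
from collections import Counter

def predict(rows, index, target, k=3):
    """Predict rows[index][target] from the k most similar other rows."""
    row = rows[index]
    others = [r for i, r in enumerate(rows) if i != index]

    def similarity(other):
        # Count the columns (other than the hidden one) with identical values.
        return sum(row[col] == other[col] for col in row if col != target)

    neighbours = sorted(others, key=similarity, reverse=True)[:k]
    votes = Counter(n[target] for n in neighbours)
    return votes.most_common(1)[0][0]

def suspicious_cells(rows, target, k=3):
    """Return (row index, actual, predicted) wherever the two disagree."""
    flagged = []
    for i, row in enumerate(rows):
        predicted = predict(rows, i, target, k)
        if predicted != row[target]:
            flagged.append((i, row[target], predicted))
    return flagged

# Toy rows, invented for illustration.
rows = (
    [{"family": "Colubridae", "country": "Indonesia", "genus": "Rhabdophis"}] * 3
    + [{"family": "Colubridae", "country": "Indonesia", "genus": "Geophis"}]
    + [{"family": "Bufonidae", "country": "Suriname", "genus": "Bufo"}] * 3
)
flagged = suspicious_cells(rows, "genus")  # [(3, "Geophis", "Rhabdophis")]
```

Only the flagged cells go to a human checker, which is where the reduction to fewer than 100 cells per column comes from. A flagged cell is a candidate error, not a confirmed one.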
  • 26. Data-driven cleaning cannot detect systematic errors. Maybe systematics can help?
  • 27. subject | relation | object
      specimen collection | occurs before | entry in museum
      species | has broader term | genus
      city | falls within | country
  • 28. Detects inconsistencies in database usage; small scope; high recall and precision within scope; needs adapting for each new domain
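The three relations on slide 27 can be read as consistency rules over a record. A minimal sketch, assuming hypothetical field names and two tiny hand-made lookup tables in place of a real gazetteer and taxonomy:

```python
from datetime import date

# Hypothetical lookup tables standing in for a gazetteer and a taxonomy.
CITY_COUNTRY = {"Paramaribo": "Suriname", "Leiden": "Netherlands"}
SPECIES_GENUS = {"Leposoma guianense": "Leposoma"}

def check_record(record):
    """Return the rule violations found in one specimen record."""
    problems = []

    # Rule: specimen collection occurs before entry in museum.
    collected = record.get("collection_date")
    entered = record.get("museum_entry_date")
    if collected and entered and collected > entered:
        problems.append("collection occurs after entry in museum")

    # Rule: species has broader term genus.
    species, genus = record.get("species"), record.get("genus")
    if species in SPECIES_GENUS and genus and SPECIES_GENUS[species] != genus:
        problems.append(f"{genus} is not the broader term of {species}")

    # Rule: city falls within country.
    city, country = record.get("city"), record.get("country")
    if city in CITY_COUNTRY and country and CITY_COUNTRY[city] != country:
        problems.append(f"{city} does not fall within {country}")

    return problems

# Invented record with two deliberate inconsistencies.
record = {
    "collection_date": date(1968, 8, 28),
    "museum_entry_date": date(1960, 1, 1),
    "species": "Leposoma guianense", "genus": "Leposoma",
    "city": "Paramaribo", "country": "Indonesia",
}
problems = check_record(record)
```

Each rule is precise but covers only the fields it names, and the lookup tables have to be rebuilt for every new collection. That matches the small scope and per-domain adaptation noted on slide 28.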
  • 29. Disambiguating Locations
  • 30. Challenge | Example
      Ambiguous location name | Amsterdam
      Two or more location descriptors | Wakarusa, 24mi WSW of Lawrence
      Topological nesting | Moccassin Creek on Hog Island
      Complex description | Bupo [?Buso] River, 15 miles [24km] E of Lae
      Linear feature measurement | 16km (by road) N of Murtoa
      Linear ambiguity | On the road between Sydney and Bathurst
      Vague localities | Southeast Michigan
      Changed political borders | Yugoslavia
      Historical Place Names | British North Borneo
  • 31. Geographical information annotated in 200 randomly selected database records: 50 for development, 150 for testing
  • 32. Knowledge-driven Georeferencing
      • Record retrieval
      • Text parsing
      • Gazetteer lookup
      • Offset calculation
      • Disambiguation Heuristics
  • 33. Offset
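For descriptions such as "24mi WSW of Lawrence", the offset step has to move a gazetteer coordinate a given distance along a compass direction. The slides do not show how this was computed; the sketch below assumes a spherical Earth and a straight great-circle path, using the standard destination-point formula. The function names are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0
KM_PER_MILE = 1.609344

# 16-point compass rose, clockwise from north, 22.5 degrees apart.
COMPASS = ["N", "NNE", "NE", "ENE", "E", "ESE", "SE", "SSE",
           "S", "SSW", "SW", "WSW", "W", "WNW", "NW", "NNW"]

def bearing_degrees(point):
    """Translate a compass point such as 'WSW' into degrees from north."""
    return COMPASS.index(point.upper()) * 22.5

def offset(lat, lon, distance_km, compass_point):
    """Return the (lat, lon) reached from a start point along a bearing."""
    bearing = math.radians(bearing_degrees(compass_point))
    lat1, lon1 = math.radians(lat), math.radians(lon)
    d = distance_km / EARTH_RADIUS_KM  # angular distance in radians
    lat2 = math.asin(math.sin(lat1) * math.cos(d)
                     + math.cos(lat1) * math.sin(d) * math.cos(bearing))
    lon2 = lon1 + math.atan2(math.sin(bearing) * math.sin(d) * math.cos(lat1),
                             math.cos(d) - math.sin(lat1) * math.sin(lat2))
    return math.degrees(lat2), math.degrees(lon2)

# About 111.195 km is one degree of arc on this sphere.
north = offset(0.0, 0.0, 111.195, "N")
east = offset(0.0, 0.0, 111.195, "E")
```

Distances in miles would first be multiplied by `KM_PER_MILE`. A description like "16km (by road)" is not a straight line, so a result computed this way is only an approximation for such records.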
  • 34. Disambiguation Heuristics
      • Spatial minimality: if Amsterdam and Utrecht are mentioned in the same record, then Amsterdam, NL is more likely than Amsterdam, NY, USA
      • Expedition clusters: it is unlikely that a collector was collecting in Europe on Monday and in the US on Tuesday
      • Species occurrence data: GBIF can tell us where a certain species does or does not occur
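The spatial minimality heuristic can be sketched as choosing, for every place name in a record, the candidate that keeps all chosen places closest together. This is one possible reading, assuming exhaustive search over candidate combinations and the haversine distance; the coordinates are approximate and the data structure is hypothetical.

```python
import math
from itertools import product

def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))

def spatial_minimality(candidates):
    """Pick one candidate per name so that total pairwise distance is minimal.

    candidates maps each place name to a list of (label, (lat, lon)) options.
    """
    names = list(candidates)
    best, best_cost = None, float("inf")
    for combo in product(*(candidates[name] for name in names)):
        cost = sum(haversine_km(p[1], q[1])
                   for i, p in enumerate(combo) for q in combo[i + 1:])
        if cost < best_cost:
            best, best_cost = combo, cost
    return {name: choice[0] for name, choice in zip(names, best)}

# Approximate coordinates for the example on the slide.
candidates = {
    "Amsterdam": [("Amsterdam, NL", (52.37, 4.90)),
                  ("Amsterdam, NY, USA", (42.94, -74.19))],
    "Utrecht": [("Utrecht, NL", (52.09, 5.12))],
}
resolved = spatial_minimality(candidates)
```

Exhaustive search is fine for the handful of place names in a single record, but the number of combinations grows quickly, so a larger system would need pruning.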
  • 35. Species Occurrence Data
  • 36. Results
      Method | Correct @5km | Correct @25km | Correct @100km | Mean distance off | Not Found
      Baseline | 38.9 | 47.0 | 58.4 | 251.1 | 26.2
      + Google maps + fuzzy | 53.0 | 65.1 | 74.5 | 244.1 | 8.7
      + Spatial minimality | 59.1 | 71.8 | 77.2 | 171.1 | 7.4
      + Expedition | 59.1 | 71.8 | 77.2 | 171.1 | 7.4
      + GBIF | 61.7 | 74.5 | 79.9 | 114.5 | 7.4
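The columns of the results table can be read as the measures computed below. The slides do not define them, so this sketch assumes that the "Correct @" and "Not Found" columns are percentages of all test records and that the mean distance is taken over the records that were found. The error values in the example are invented.

```python
def evaluate(errors_km, thresholds=(5, 25, 100)):
    """Summarise georeferencing errors.

    errors_km holds one entry per test record: the distance in km between
    the predicted and the annotated location, or None if nothing was found.
    """
    total = len(errors_km)
    found = [e for e in errors_km if e is not None]
    report = {
        f"correct@{t}km": 100.0 * sum(e <= t for e in found) / total
        for t in thresholds
    }
    report["mean_distance_off"] = sum(found) / len(found) if found else None
    report["not_found"] = 100.0 * (total - len(found)) / total
    return report

# Invented errors for four records, one of which was not found.
report = evaluate([1.0, 10.0, 50.0, None])
```

Under this reading a heuristic can lower the mean distance without changing the "Not Found" column, as in the last three rows of the table.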
  • 37. Confidence
  • 38. General Conclusions
      • data cleaning is essential
      • “digitising” a heritage collection is complicated
      • don’t try to tame text
  • 39. The data-driven error correction method is being developed further in the CATCHPlus programme: http://www.catchplus.nl/diensten/deelprojecten/checkers/
  • 40. Thank you for your attention!
  • 41. Projects
      • CATCH: http://www.nwo.nl/catch
      • MITCH: http://ilk.uvt.nl/mitch
      • Agora: http://agora.cs.vu.nl/
  • 42. More information about machine learning
      • Video explaining the k-nearest neighbour algorithm: http://videolectures.net/aaai07_bosch_knnc/
      • Weka Toolkit: http://www.cs.waikato.ac.nz/ml/weka/
