Orientation EBC 2013: Digitising Natural History

Slides of Digitising Natural History Lecture Orientation EBC 2013 course. Naturalis 3 September 2013

Published in: Technology, Education

Transcript

  • 1. Digitising Natural History. Marieke van Erp, marieke.van.erp@vu.nl
  • 3. Why Digitise? New technology offers many new possibilities
      • improves collection management
      • opens up new avenues of research
      • digital collection access
  • 4. Digitisation at Naturalis
      • goal is to have 7 million objects digitised by mid-2015 (out of 37 million), plus a robust infrastructure for continuing digitisation
          • 3 million within Naturalis digitisation streets
          • 4 million elsewhere
      • the other 30 million objects will be digitised at a less detailed level
  • 8. Leposoma Guianense, Sipaliwini, 4 km e. of airport, near base camp, forest ground, among leaves, 28-VIII-1968, 12.45 u. reg. nr. 13879
  • 9. But what you really want...
      Genus    | Species   | Region     | Location                           | Biotope                     | Date       | Time  | Reg #
      Leposoma | Guianense | Sipaliwini | 4 km e. of airport, near base camp | forest ground, among leaves | 28-08-1968 | 12:45 | 13879
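Comparing the label on slide 8 with the table on slide 9, the date and time have been normalised: the Roman-numeral month in "28-VIII-1968" becomes "28-08-1968", and "12.45 u." (presumably Dutch "uur") becomes "12:45". The following is a minimal sketch of that one step, not the method used in the lecture; the function names `normalise_date` and `normalise_time` are made up for illustration.

```python
import re

ROMAN_MONTHS = {
    "I": 1, "II": 2, "III": 3, "IV": 4, "V": 5, "VI": 6,
    "VII": 7, "VIII": 8, "IX": 9, "X": 10, "XI": 11, "XII": 12,
}

def normalise_date(text):
    """Turn '28-VIII-1968' into '28-08-1968'; return None if the text does not fit."""
    match = re.fullmatch(r"\s*(\d{1,2})-\s*([IVX]+)-(\d{4})\s*", text)
    if match is None:
        return None
    day, roman, year = match.groups()
    month = ROMAN_MONTHS.get(roman)
    if month is None:  # e.g. 'IIII' is not a valid month numeral
        return None
    return f"{int(day):02d}-{month:02d}-{year}"

def normalise_time(text):
    """Turn '12.45 u.' into '12:45'; return None if the text does not fit."""
    match = re.fullmatch(r"\s*(\d{1,2})\.(\d{2})\s*u\.?\s*", text)
    if match is None:
        return None
    hours, minutes = match.groups()
    return f"{int(hours):02d}:{minutes}"

print(normalise_date("28-VIII-1968"))  # 28-08-1968
print(normalise_time("12.45 u."))      # 12:45
```

Returning None for anything that does not match keeps unparseable cells visible for manual checking instead of silently guessing.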
  • 10.
      • Leposoma Guianense, Sipaliwini, 4 km e. of airport, near base camp, forest ground, among leaves, 28-VIII-1968, 12.45 u. reg. nr. 13879
      • ask a computer to learn to segment and classify text snippets
  • 11. Manually annotate 500 text snippets (~3h): 300 for training, 200 for testing
  • 12. 49,688 new database records (547,528 database cells) at ~84.57% accuracy
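Slides 10 to 12 describe learning to classify text segments from annotated examples, and slide 43 points to the k-nearest-neighbour algorithm. The sketch below shows what such a classifier could look like on a toy scale. It is an illustration under stated assumptions, not the system from the lecture: the features, the label names, and every item in `TRAINING` are invented here, and segments are assumed to have been split on commas already.

```python
import re
from collections import Counter

PREPOSITIONS = {"in", "on", "under", "among", "near", "at", "between"}

def features(segment):
    """Describe a text segment by a few coarse, hand-picked properties."""
    seg = segment.strip()
    words = seg.split()
    return (
        bool(re.search(r"\d", seg)),                  # contains a digit
        bool(re.search(r"\b[IVX]+\b", seg)),          # contains a Roman numeral
        bool(re.search(r"\bkm\b|\bmi\b", seg)),       # contains a distance unit
        seg[:1].isupper(),                            # starts with a capital
        bool(words) and words[0].lower() in PREPOSITIONS,
        min(len(words), 4),                           # word count, capped at 4
    )

def knn_classify(segment, training, k=3):
    """Label a segment by majority vote among its k nearest training items.

    Distance is the number of feature positions that differ.
    """
    target = features(segment)
    ranked = sorted(
        training,
        key=lambda item: sum(a != b for a, b in zip(features(item[0]), target)),
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Hypothetical, hand-made training items; NOT taken from the Naturalis data.
TRAINING = [
    ("Bufo marinus", "taxon"), ("Hyla boans", "taxon"), ("Anolis nitens", "taxon"),
    ("2 km n. of village", "location"), ("10 km w. of river", "location"),
    ("5 km s. of camp", "location"),
    ("under stones", "biotope"), ("in rain near road", "biotope"),
    ("on tree trunk", "biotope"),
    ("03-IV-1975", "date"), ("17-XI-1969", "date"), ("01-II-1970", "date"),
]

for snippet in ["Leposoma Guianense", "4 km e. of airport", "among leaves", "28-VIII-1968"]:
    print(snippet, "->", knn_classify(snippet, TRAINING))
```

With only twelve toy examples the classifier is brittle: a segment such as "forest ground" shares no distinguishing feature with the biotope examples and ends up labelled as a taxon. Slide 11 uses 300 annotated snippets for training.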
  • 13. The Manually Created Reptiles and Amphibians Database
      • 16,870 records describing characteristics and history of animal specimens in a natural history database
      • 39 columns
      • Dutch, English, German and Portuguese
      • numeric and textual values (both atomic and elaborate)
  • 14.
      column name     | value
      order           | Anura
      genus           | Megophrys
      country         | Indonesia
      biotope         | in rain near road
      collection date | 01.02.1888
      type            | holotype
      determinator    | A. Dubois
      defined by      | (Linnaeus, 1758)
      special remarks | in bad condition, was eaten by Leptodactylus rugosus (3023) at night and thrown up again the next morning when killed, partly digested
  • 16.
      • a database provides structure
      • computers are good at comparing values
      • statistical methods can detect inconsistencies
  • 17.
      author          | determinator   | family      | genus   | country   | preservation method
      (Daudin, 1802)  | ------         | Bataguridae | Anolis  | Cambodja  | (shield, dry)
      (Schlegel)      | G. vd. Boog    | Colubridae  | Geophis | Indonesia | -----
      Schneider       | M. S. Hoogmoed | ------      | Bufo    | Suriname  | -----
      (Horst, 1883)   | Tyler, M J     | Hylidae     | Litoria | ------    | alcohol
  • 19.
      author          | determinator   | family      | genus   | country   | preservation method
      (Daudin, 1802)  | ------         | Bataguridae | Anolis  | Cambodja  | (shield, dry)
      (Schlegel)      | G. vd. Boog    | Colubridae  | ?       | Indonesia | -----
      Schneider       | M. S. Hoogmoed | ------      | Bufo    | Suriname  | -----
      (Horst, 1883)   | Tyler, M J     | Hylidae     | Litoria | ------    | alcohol
      actual value: Geophis
  • 24.
      author          | determinator   | family      | genus   | country   | preservation method
      (Daudin, 1802)  | ------         | Bataguridae | Anolis  | Cambodja  | (shield, dry)
      (Schlegel)      | G. vd. Boog    | Colubridae  | ?       | Indonesia | -----
      Schneider       | M. S. Hoogmoed | ------      | Bufo    | Suriname  | -----
      (Horst, 1883)   | Tyler, M J     | Hylidae     | Litoria | ------    | alcohol
      actual value: Geophis
      predicted value: Rhapdophis
  • 25.
      • <100 cells to check for a column instead of 16,870
      • recall (estimate): 90-100%
      • one-size-fits-all
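The idea on the preceding slides is to hide one cell, predict its value from the rest of the record, and hand the cell to a human only when prediction and database disagree. A minimal sketch of that loop follows. It swaps in a simple overlap-weighted vote for whatever learner the lecture used, and the records in `RECORDS` are invented toy data, not rows from the Naturalis database.

```python
from collections import Counter

def predict_cell(records, index, target):
    """Predict records[index][target] from the other records.

    Every other record votes for its own target value, weighted by the
    number of remaining columns it shares with the inspected record.
    """
    row = records[index]
    votes = Counter()
    for i, other in enumerate(records):
        if i == index or other.get(target) is None:
            continue
        overlap = sum(
            1 for col, val in row.items()
            if col != target and val is not None and other.get(col) == val
        )
        if overlap:
            votes[other[target]] += overlap
    return votes.most_common(1)[0][0] if votes else None

def suspicious_cells(records, target):
    """Return (index, actual, predicted) wherever prediction and database disagree."""
    flagged = []
    for i, row in enumerate(records):
        predicted = predict_cell(records, i, target)
        if predicted is not None and row.get(target) != predicted:
            flagged.append((i, row.get(target), predicted))
    return flagged

# Hypothetical toy records; the last one has a genus that clashes with its family.
RECORDS = [
    {"family": "Hylidae", "genus": "Litoria", "country": "Australia"},
    {"family": "Hylidae", "genus": "Litoria", "country": "Australia"},
    {"family": "Hylidae", "genus": "Litoria", "country": "Indonesia"},
    {"family": "Bufonidae", "genus": "Bufo", "country": "Suriname"},
    {"family": "Bufonidae", "genus": "Bufo", "country": "Suriname"},
    {"family": "Bufonidae", "genus": "Bufo", "country": "Brazil"},
    {"family": "Bufonidae", "genus": "Litoria", "country": "Suriname"},
]

print(suspicious_cells(RECORDS, "genus"))
```

Only the flagged cells need a manual look, which is how the number of cells to check per column can drop from the full column to a short list.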
  • 26.
      • Data-driven cleaning cannot detect systematic errors
      • Maybe systematics can help?
  • 27.
      subject             | relation         | object
      specimen collection | occurs before    | entry in museum
      species             | has broader term | genus
      city                | falls within     | country
  • 28.
      • detects inconsistencies in database usage
      • small scope
      • high recall and precision within scope
      • needs adapting for each new domain
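The three relations on slide 27 can each be read as a rule that a record either satisfies or violates. The sketch below encodes them directly; it is an illustration, not the lecture's implementation. The field names, the `CITY_COUNTRY` lookup table, and the assumption that the species field holds a full binomial are all made up for the example.

```python
from datetime import date

# Hypothetical lookup table standing in for a gazetteer or thesaurus.
CITY_COUNTRY = {"Paramaribo": "Suriname", "Leiden": "Netherlands"}

def check_record(record):
    """Return a list of rule violations found in one record (empty if none)."""
    problems = []

    # specimen collection -- occurs before --> entry in museum
    collected = record.get("collection_date")
    entered = record.get("entry_date")
    if collected and entered and collected > entered:
        problems.append("collection date is later than museum entry date")

    # species -- has broader term --> genus
    # (assumes the species field holds a binomial such as 'Bufo marinus')
    species, genus = record.get("species"), record.get("genus")
    if species and genus and species.split()[0] != genus:
        problems.append(f"species '{species}' does not belong to genus '{genus}'")

    # city -- falls within --> country
    city, country = record.get("city"), record.get("country")
    expected = CITY_COUNTRY.get(city)
    if expected and country and expected != country:
        problems.append(f"{city} lies in {expected}, not {country}")

    return problems

bad = {
    "collection_date": date(1970, 5, 1), "entry_date": date(1969, 1, 1),
    "species": "Bufo marinus", "genus": "Hyla",
    "city": "Paramaribo", "country": "Netherlands",
}
for problem in check_record(bad):
    print(problem)
```

Each rule only fires when both sides of the relation are present, so missing cells are left to other checks rather than reported as violations.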
  • 29. Disambiguating Locations
  • 30.
      Challenge                        | Example
      Ambiguous location name          | Amsterdam
      Two or more location descriptors | Wakarusa, 24mi WSW of Lawrence
      Topological nesting              | Moccassin Creek on Hog Island
      Complex description              | Bupo [?Buso] River, 15 miles [24km] E of Lae
      Linear feature measurement       | 16km (by road) N of Murtoa
      Linear ambiguity                 | On the road between Sydney and Bathurst
      Vague localities                 | Southeast Michigan
      Changed political borders        | Yugoslavia
      Historical Place Names           | British North Borneo
  • 31.
      • Randomly annotated geographical information in 200 database records
      • 50 records for development, 150 for testing
  • 32. Knowledge-driven Georeferencing
      • Record retrieval
      • Text parsing
      • Gazetteer lookup
      • Offset calculation
      • Disambiguation Heuristics
  • 33. Offset
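A description such as "24mi WSW of Lawrence" (slide 30) gives a reference place, a distance, and a compass direction, so the offset step amounts to moving that far along that bearing from the reference coordinates. Here is a minimal sketch using the standard great-circle destination formula on a spherical Earth. The function name `offset_point` is invented, and the coordinates used for Lawrence are approximate values assumed for the example.

```python
import math

EARTH_RADIUS_KM = 6371.0
COMPASS = ["N", "NNE", "NE", "ENE", "E", "ESE", "SE", "SSE",
           "S", "SSW", "SW", "WSW", "W", "WNW", "NW", "NNW"]
KM_PER_UNIT = {"km": 1.0, "mi": 1.609344}

def offset_point(lat, lon, distance, unit, direction):
    """Move `distance` from (lat, lon) along a 16-point compass direction.

    Returns the new (lat, lon) in degrees, assuming a spherical Earth.
    """
    bearing = math.radians(COMPASS.index(direction.upper()) * 22.5)
    delta = distance * KM_PER_UNIT[unit] / EARTH_RADIUS_KM  # angular distance
    lat1, lon1 = math.radians(lat), math.radians(lon)
    lat2 = math.asin(
        math.sin(lat1) * math.cos(delta)
        + math.cos(lat1) * math.sin(delta) * math.cos(bearing)
    )
    lon2 = lon1 + math.atan2(
        math.sin(bearing) * math.sin(delta) * math.cos(lat1),
        math.cos(delta) - math.sin(lat1) * math.sin(lat2),
    )
    return math.degrees(lat2), math.degrees(lon2)

# Approximate coordinates for Lawrence, Kansas (an assumption for this example).
LAWRENCE = (38.97, -95.24)
print(offset_point(*LAWRENCE, 24, "mi", "WSW"))
```

The result is a point estimate. A compass direction like "WSW" covers a 22.5-degree sector, so a fuller treatment would carry an uncertainty radius along with the coordinates.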
  • 34. Disambiguation Heuristics
      • Spatial Minimality
          • if Amsterdam and Utrecht are mentioned in the same record, then Amsterdam, NL is more likely than Amsterdam, NY, USA
      • Expedition clusters
          • It is unlikely that a collector was collecting in Europe on Monday and in the US on Tuesday
      • Species occurrence data
          • GBIF can tell us where a certain species does or does not occur
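One possible reading of spatial minimality is: among all combinations of candidate locations for the place names in a record, prefer the combination whose points lie closest together. The sketch below uses the sum of pairwise great-circle distances as the measure, which is an assumption made here and not necessarily the measure used in the lecture. The candidate list and its coordinates are approximate, hand-entered values standing in for a gazetteer lookup.

```python
import math
from itertools import product

def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) pairs in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))

def spatially_minimal(candidates):
    """Pick one candidate per place name so that the chosen points are closest together.

    `candidates` maps a place name to a list of (label, (lat, lon)) options.
    Returns a dict from place name to the chosen label.
    """
    names = list(candidates)
    best, best_cost = None, float("inf")
    for combo in product(*(candidates[name] for name in names)):
        cost = sum(
            haversine_km(p[1], q[1])
            for i, p in enumerate(combo)
            for q in combo[i + 1:]
        )
        if cost < best_cost:
            best, best_cost = combo, cost
    if best is None:  # some place name had no candidates at all
        return {}
    return {name: choice[0] for name, choice in zip(names, best)}

# Approximate coordinates, entered by hand for illustration.
CANDIDATES = {
    "Amsterdam": [("Amsterdam, NY, USA", (42.94, -74.19)),
                  ("Amsterdam, NL", (52.37, 4.90))],
    "Utrecht": [("Utrecht, NL", (52.09, 5.12))],
}
print(spatially_minimal(CANDIDATES))
```

Trying every combination is fine for a handful of place names per record, but the number of combinations grows multiplicatively with each ambiguous name.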
  • 35. Species Occurrence Data
  • 36. Results
      Configurations: Baseline, + Google maps, + fuzzy, + Spatial minimality, + Expedition, + GBIF
      Correct @5km | Correct @25km | Correct @100km | Mean distance off | Not Found
      38.9         | 47.0          | 58.4           | 251.1             | 26.2
      53.0         | 65.1          | 74.5           | 244.1             | 8.7
      59.1         | 71.8          | 77.2           | 171.1             | 7.4
      59.1         | 71.8          | 77.2           | 171.1             | 7.4
      61.7         | 74.5          | 79.9           | 114.5             | 7.4
  • 37. Confidence
  • 38. Generating Stories (image source: http://www.gungeralv.org/dg/images/chapter1.JPG)
  • 39. Work in Progress
  • 40. General Conclusions
      • data cleaning is essential
      • “digitising” a heritage collection is complicated
      • don’t try to tame text
  • 41. Thank you for your attention!
  • 42.
      • CATCH: http://www.nwo.nl/catch
      • MITCH: http://ilk.uvt.nl/mitch
      • NewsReader: http://www.newsreader-project.eu
  • 43. More information about machine learning
      • Video explaining k-nearest neighbour algorithm: http://videolectures.net/aaai07_bosch_knnc/
      • Weka Toolkit: http://www.cs.waikato.ac.nz/ml/weka/