Richness of the World 2012

Guest lecture about digitising natural history in the Richness of the World module at Leiden University. October 1st 2012

  1. 1. Digitising Natural History. Marieke van Erp, marieke@cs.vu.nl
  2. 2. 2
  3. 3. Why Digitise? New technology offers many new possibilities
     • improves collection management
     • opens up new avenues of research
     • digital collection access
  4. 4. Digitisation at Naturalis
     • goal is to have 7 million objects digitised by mid-2015 (out of 37 million) + robust infrastructure for continuation of digitisation
     • 3 million within Naturalis digitisation streets
     • 4 million elsewhere
     • other 30 million objects will be digitised at less detailed level
  5. 5. 5
  6. 6. 6
  7. 7. 7
  8. 8. • Leposoma Guianense, Sipaliwini, 4 km e. of airport, near base camp, forest ground, among leaves, 28-VIII-1968, 12.45 u. reg. nr. 13879
  9. 9. But what you really want...
     Genus    | Leposoma
     Species  | Guianense
     Region   | Sipaliwini
     Location | 4 km e. of airport near base camp, forest ground
     Biotope  | among leaves
     Date     | 28-08-1968
     Time     | 12:45
     Reg #    | 13879
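The date and time fields on slides 8 and 9 differ only in notation, so that part of the conversion can be sketched with two regular expressions. This is a minimal illustration, not the method used in the lecture; the function names are hypothetical, and it assumes months are written as Roman numerals and times as "HH.MM u." (Dutch "uur").

```python
import re

# Roman numerals for the twelve months, as used in labels such as "28-VIII-1968".
ROMAN_MONTHS = {
    "I": 1, "II": 2, "III": 3, "IV": 4, "V": 5, "VI": 6,
    "VII": 7, "VIII": 8, "IX": 9, "X": 10, "XI": 11, "XII": 12,
}

# Optional whitespace after the first hyphen tolerates line-wrapped labels ("28- VIII-1968").
DATE_RE = re.compile(r"\b(\d{1,2})-\s*([IVX]+)-(\d{4})\b")
TIME_RE = re.compile(r"\b(\d{1,2})\.(\d{2})\s*u\.")


def normalise_date(text):
    """Return the first Roman-month date in text as DD-MM-YYYY, or None."""
    match = DATE_RE.search(text)
    if match is None or match.group(2) not in ROMAN_MONTHS:
        return None
    day, month = int(match.group(1)), ROMAN_MONTHS[match.group(2)]
    return f"{day:02d}-{month:02d}-{match.group(3)}"


def normalise_time(text):
    """Return the first "HH.MM u." time in text as HH:MM, or None."""
    match = TIME_RE.search(text)
    if match is None:
        return None
    return f"{int(match.group(1)):02d}:{match.group(2)}"


label = ("Leposoma Guianense, Sipaliwini, 4 km e. of airport, near base camp, "
         "forest ground, among leaves, 28-VIII-1968, 12.45 u. reg. nr. 13879")
print(normalise_date(label))  # 28-08-1968
print(normalise_time(label))  # 12:45
```

Fields such as region, location and biotope have no fixed notation, which is why the following slides turn to a learned approach for them.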
  10. 10. • Leposoma Guianense, Sipaliwini, 4 km e. of airport, near base camp, forest ground, among leaves, 28-VIII-1968, 12.45 u. reg. nr. 13879
      • ask a computer to learn to segment and classify text snippets
  11. 11. • Manually annotate 500 text snippets (~3h)
      • 300 for training
      • 200 for testing
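The annotate, train and test workflow can be sketched as follows. The classifier here is a deliberately simple stand-in (it memorises the coarse character shape of each training snippet), not the learner used in the lecture; all function names and the toy snippets are hypothetical.

```python
import random
from collections import Counter, defaultdict


def shape(text):
    """Collapse a snippet to a character-class pattern, e.g. '28-VIII-1968' -> '9-A-9'."""
    out = []
    for ch in text:
        cls = "9" if ch.isdigit() else "A" if ch.isupper() else "a" if ch.islower() else ch
        if not out or out[-1] != cls:  # merge runs of the same class
            out.append(cls)
    return "".join(out)


def train(examples):
    """Memorise the most frequent label per shape; unseen shapes get the majority label."""
    by_shape = defaultdict(Counter)
    overall = Counter()
    for text, label in examples:
        by_shape[shape(text)][label] += 1
        overall[label] += 1
    fallback = overall.most_common(1)[0][0]
    table = {s: counts.most_common(1)[0][0] for s, counts in by_shape.items()}
    return lambda text: table.get(shape(text), fallback)


def split(examples, n_train, seed=0):
    """Shuffle reproducibly and return (training set, test set)."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]


def accuracy(classify, examples):
    """Fraction of (text, label) pairs that classify() labels correctly."""
    return sum(classify(text) == label for text, label in examples) / len(examples)


# Invented toy annotations; a real run would use the 500 annotated snippets.
toy = [("28-VIII-1968", "date"), ("03-XI-1971", "date"),
       ("12.45 u.", "time"), ("9.30 u.", "time"),
       ("13879", "reg_nr"), ("2041", "reg_nr")]
classify = train(toy)
print(classify("01-II-1950"))  # date
```

With the real annotations, `split(annotated, 300)` would give the 300/200 division from the slide, and `accuracy(train(train_set), test_set)` the held-out score.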
  12. 12. • 49,688 new database records (547,528 database cells) at ~84.57% accuracy
  13. 13. The Manually Created Reptiles and Amphibians Database
      • 16,870 records describing characteristics and history of animal specimens in a natural history database
      • 39 columns
      • Dutch, English, German and Portuguese
      • numeric and textual values (both atomic and elaborate)
  14. 14.
      column name     | value
      order           | Anura
      genus           | Megophrys
      country         | Indonesia
      biotope         | in rain near road
      collection date | 01.02.1888
      type            | holotype
      determinator    | A. Dubois
      defined by      | (Linnaeus, 1758)
      special remarks | in bad condition, was eaten by Leptodactylus rugosus (3023) at night and thrown up again the next morning when killed, partly digested
  15. 15. 15
  16. 16. • a database provides structure
      • computers are good at comparing values
      • statistical methods can detect inconsistencies
  17. 17.
      author         | determinator   | family      | genus   | country   | preservation method
      (Daudin, 1802) | ------         | Bataguridae | Anolis  | Cambodja  | (shield, dry)
      (Schlegel)     | G. vd. Boog    | Colubridae  | Geophis | Indonesia | -----
      Schneider      | M. S. Hoogmoed | ------      | Bufo    | Suriname  | -----
      (Horst, 1883)  | Tyler, M J     | Hylidae     | Litoria | ------    | alcohol
  19. 19. actual value: Geophis
      author         | determinator   | family      | genus   | country   | preservation method
      (Daudin, 1802) | ------         | Bataguridae | Anolis  | Cambodja  | (shield, dry)
      (Schlegel)     | G. vd. Boog    | Colubridae  | ?       | Indonesia | -----
      Schneider      | M. S. Hoogmoed | ------      | Bufo    | Suriname  | -----
      (Horst, 1883)  | Tyler, M J     | Hylidae     | Litoria | ------    | alcohol
  24. 24. actual value: Geophis, predicted value: Rhapdophis
      author         | determinator   | family      | genus   | country   | preservation method
      (Daudin, 1802) | ------         | Bataguridae | Anolis  | Cambodja  | (shield, dry)
      (Schlegel)     | G. vd. Boog    | Colubridae  | ?       | Indonesia | -----
      Schneider      | M. S. Hoogmoed | ------      | Bufo    | Suriname  | -----
      (Horst, 1883)  | Tyler, M J     | Hylidae     | Litoria | ------    | alcohol
  25. 25. • <100 cells to check for a column instead of 16,780
      • recall (estimate): 90-100%
      • one-size-fits-all
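The idea on the preceding slides (hide a cell, predict it from the rest of the row, and flag the cell when prediction and actual value disagree) can be sketched with a small nearest-neighbour vote. This is an illustrative assumption about how such a check might look, not the lecture's implementation; the function names and the toy records are invented, with the genus names borrowed from the slides.

```python
from collections import Counter


def predict_cell(records, index, column, k=3):
    """Predict records[index][column] from the k rows that agree most on the other columns."""
    target = records[index]
    features = [c for c in target if c != column]
    scored = []
    for i, other in enumerate(records):
        if i == index:
            continue  # leave the row under inspection out of its own vote
        overlap = sum(1 for c in features if other.get(c) == target[c])
        scored.append((overlap, i))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))  # most similar first, ties by row order
    votes = Counter(records[i][column] for _, i in scored[:k])
    return votes.most_common(1)[0][0]


def suspicious_cells(records, column, k=3):
    """Return (row index, actual value, predicted value) wherever the two disagree."""
    flagged = []
    for i, record in enumerate(records):
        predicted = predict_cell(records, i, column, k)
        if predicted != record[column]:
            flagged.append((i, record[column], predicted))
    return flagged


# Invented toy rows; only the genus names come from the slides.
rows = [
    {"family": "Colubridae", "country": "Indonesia", "genus": "Rhapdophis"},
    {"family": "Colubridae", "country": "Indonesia", "genus": "Rhapdophis"},
    {"family": "Colubridae", "country": "Indonesia", "genus": "Rhapdophis"},
    {"family": "Colubridae", "country": "Indonesia", "genus": "Geophis"},
    {"family": "Hylidae", "country": "Suriname", "genus": "Hyla"},
    {"family": "Hylidae", "country": "Suriname", "genus": "Hyla"},
    {"family": "Hylidae", "country": "Suriname", "genus": "Hyla"},
]
print(suspicious_cells(rows, "genus"))  # [(3, 'Geophis', 'Rhapdophis')]
```

Only the flagged cells need a human check, which is how the workload drops from a whole column to a short list. A flagged cell is a candidate for inspection, not necessarily an error.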
  26. 26. • Data-driven cleaning cannot detect systematic errors
      • Maybe systematics can help?
  27. 27.
      subject             | relation         | object
      specimen collection | occurs before    | entry in museum
      species             | has broader term | genus
      city                | falls within     | country
  28. 28. • detects inconsistencies in database usage
      • small scope
      • high recall and precision within scope
      • needs adapting for each new domain
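Two of the relations on slide 27 can be turned into record checks, as sketched below. The field names (`collection_date`, `museum_entry_date`, `city`, `country`) and the tiny city lookup are assumptions made for illustration; a real system would draw them from the database schema and a gazetteer.

```python
from datetime import date

# Illustrative lookup only; a real check would consult a gazetteer.
CITY_TO_COUNTRY = {"Leiden": "Netherlands", "Paramaribo": "Suriname"}


def collection_before_entry(record):
    """Rule: specimen collection occurs before entry in museum."""
    collected = record.get("collection_date")
    entered = record.get("museum_entry_date")
    if collected is None or entered is None:
        return True  # nothing to compare, so no violation can be shown
    return collected <= entered


def city_within_country(record):
    """Rule: city falls within country."""
    expected = CITY_TO_COUNTRY.get(record.get("city"))
    country = record.get("country")
    return expected is None or country is None or expected == country


RULES = {
    "specimen collection occurs before entry in museum": collection_before_entry,
    "city falls within country": city_within_country,
}


def violations(record):
    """Return the names of all rules the record breaks."""
    return [name for name, rule in RULES.items() if not rule(record)]


record = {
    "collection_date": date(1968, 8, 28),
    "museum_entry_date": date(1967, 1, 1),  # entered before it was collected
    "city": "Paramaribo",
    "country": "Suriname",
}
print(violations(record))  # ['specimen collection occurs before entry in museum']
```

Each rule is hand-written for one relation, which reflects the trade-off on slide 28: precise within its scope, but it has to be adapted for every new domain.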
  29. 29. Disambiguating Locations
  30. 30.
      Challenge                        | Example
      Ambiguous location name          | Amsterdam
      Two or more location descriptors | Wakarusa, 24mi WSW of Lawrence
      Topological nesting              | Moccassin Creek on Hog Island
      Complex description              | Bupo [?Buso] River, 15 miles [24km] E of Lae
      Linear feature measurement       | 16km (by road) N of Murtoa
      Linear ambiguity                 | On the road between Sydney and Bathurst
      Vague localities                 | Southeast Michigan
      Changed political borders        | Yugoslavia
      Historical Place Names           | British North Borneo
  31. 31. • Randomly annotated geographical information in 200 database records
      • 50 records for development, 150 for testing
  32. 32. Knowledge-driven Georeferencing
      • Record retrieval
      • Text parsing
      • Gazetteer lookup
      • Offset calculation
      • Disambiguation Heuristics
  33. 33. Offset
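An offset such as "24mi WSW of Lawrence" can be resolved by parsing the distance and compass direction and then moving that far from the gazetteer coordinates of the named place. The sketch below uses the standard great-circle destination formula on a spherical Earth; the regular expression, the function names and the approximate coordinates for Lawrence are assumptions for illustration, not taken from the lecture.

```python
import math
import re

EARTH_RADIUS_KM = 6371.0
KM_PER_MILE = 1.609344
POINTS = ["N", "NNE", "NE", "ENE", "E", "ESE", "SE", "SSE",
          "S", "SSW", "SW", "WSW", "W", "WNW", "NW", "NNW"]
BEARING_DEG = {name: i * 22.5 for i, name in enumerate(POINTS)}  # 16-point compass

OFFSET_RE = re.compile(r"(\d+(?:\.\d+)?)\s*(km|mi)\b.*?\b([NSEW]{1,3})\s+of\s+(.+)")


def parse_offset(text):
    """Return (distance_km, bearing_deg, place) for e.g. '24mi WSW of Lawrence', or None."""
    match = OFFSET_RE.search(text)
    if match is None or match.group(3) not in BEARING_DEG:
        return None
    distance = float(match.group(1))
    if match.group(2) == "mi":
        distance *= KM_PER_MILE
    return distance, BEARING_DEG[match.group(3)], match.group(4).strip()


def destination(lat, lon, bearing_deg, distance_km):
    """Point reached from (lat, lon) after distance_km along the initial bearing."""
    phi1, lam1 = math.radians(lat), math.radians(lon)
    theta = math.radians(bearing_deg)
    delta = distance_km / EARTH_RADIUS_KM  # angular distance
    phi2 = math.asin(math.sin(phi1) * math.cos(delta)
                     + math.cos(phi1) * math.sin(delta) * math.cos(theta))
    lam2 = lam1 + math.atan2(math.sin(theta) * math.sin(delta) * math.cos(phi1),
                             math.cos(delta) - math.sin(phi1) * math.sin(phi2))
    return math.degrees(phi2), math.degrees(lam2)


LAWRENCE = (38.97, -95.24)  # approximate, assumed coordinates for Lawrence, Kansas
distance_km, bearing, place = parse_offset("Wakarusa, 24mi WSW of Lawrence")
print(place, destination(*LAWRENCE, bearing, distance_km))
```

A description like "16km (by road) N of Murtoa" parses the same way, but a road distance is longer than the straight-line distance, so the computed point is only an upper bound on how far the site lies from the town.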
  34. 34. Disambiguation Heuristics
      • Spatial Minimality
        • if Amsterdam and Utrecht are mentioned in the same record, then Amsterdam, NL is more likely than Amsterdam, NY, USA
      • Expedition clusters
        • It is unlikely that a collector was collecting in Europe on Monday and in the US on Tuesday
      • Species occurrence data
        • GBIF can tell us where a certain species does or does not occur
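The spatial minimality heuristic can be sketched as a search over gazetteer candidates for the combination with the smallest total pairwise distance. The function names are hypothetical and the coordinates are approximate values assumed for illustration; this is one plausible reading of the heuristic, not the lecture's implementation.

```python
import math
from itertools import combinations, product


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points on a spherical Earth."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def spatially_minimal(candidates):
    """Map each place name to the candidate that keeps all places in the record closest together.

    candidates: {name: [(label, lat, lon), ...]}
    """
    names = list(candidates)
    best, best_spread = None, float("inf")
    for combo in product(*(candidates[name] for name in names)):
        spread = sum(haversine_km(p[1:], q[1:]) for p, q in combinations(combo, 2))
        if spread < best_spread:
            best, best_spread = combo, spread
    return {name: place[0] for name, place in zip(names, best)}


# Approximate coordinates, assumed for illustration.
record_places = {
    "Amsterdam": [("Amsterdam, NY, USA", 42.94, -74.19), ("Amsterdam, NL", 52.37, 4.90)],
    "Utrecht": [("Utrecht, NL", 52.09, 5.12)],
}
print(spatially_minimal(record_places))
# {'Amsterdam': 'Amsterdam, NL', 'Utrecht': 'Utrecht, NL'}
```

The exhaustive search grows with the product of the candidate counts, which is acceptable for the handful of place names in a single record.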
  35. 35. Species Occurrence Data
  36. 36. Results
                             | Correct @5km | Correct @25km | Correct @100km | Mean distance off | Not Found
      Baseline               | 38.9         | 47.0          | 58.4           | 251.1             | 26.2
      + Google maps + fuzzy  | 53.0         | 65.1          | 74.5           | 244.1             | 8.7
      + Spatial minimality   | 59.1         | 71.8          | 77.2           | 171.1             | 7.4
      + Expedition           | 59.1         | 71.8          | 77.2           | 171.1             | 7.4
      + GBIF                 | 61.7         | 74.5          | 79.9           | 114.5             | 7.4
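Figures of this kind can be computed from per-record errors as sketched below. The slide does not state units or denominators, so this sketch assumes that the "Correct" and "Not Found" columns are percentages of all test records and that the mean distance is in kilometres over the records that were found; the function name and the sample errors are invented.

```python
def summarise(errors_km, thresholds=(5, 25, 100)):
    """Summarise georeferencing errors; None marks a record for which no location was found."""
    total = len(errors_km)
    found = [e for e in errors_km if e is not None]
    summary = {
        f"correct@{t}km": 100.0 * sum(e <= t for e in found) / total
        for t in thresholds
    }
    summary["mean_distance_off_km"] = sum(found) / len(found) if found else None
    summary["not_found"] = 100.0 * (total - len(found)) / total
    return summary


# Invented example errors, not data from the lecture.
print(summarise([1, 10, 50, 500, None]))
```

Because the thresholds are nested, the three "Correct" values can only grow from left to right, as they do in every row of the table.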
  37. 37. Confidence
  38. 38. General Conclusions
      • data cleaning is essential
      • “digitising” a heritage collection is complicated
      • don’t try to tame text
  39. 39. • Data-driven error correction method is being developed further in the CATCHPlus programme
        • http://www.catchplus.nl/diensten/deelprojecten/checkers/
  40. 40. Thank you for your attention!
  41. 41. • CATCH: http://www.nwo.nl/catch
      • MITCH: http://ilk.uvt.nl/mitch
      • Agora: http://agora.cs.vu.nl/
  42. 42. • More information about machine learning
        • Video explaining k-nearest neighbour algorithm: http://videolectures.net/aaai07_bosch_knnc/
        • Weka Toolkit: http://www.cs.waikato.ac.nz/ml/weka/