Digitising Natural History
Marieke van Erp
marieke.van.erp@vu.nl
1
2
Why Digitise?
New technology offers many new possibilities:
• improves collection management
• opens up new avenues of research
• enables digital access to the collection
3
Digitisation at Naturalis
• goal is to have 7 million objects (out of 37 million) digitised by mid-2015, plus a robust infrastructure for continued digitisation
• 3 million within the Naturalis digitisation streets
• 4 million elsewhere
• the other 30 million objects will be digitised at a less detailed level
4
5
6
7
• Leposoma Guianense, Sipaliwini, 4 km e. of airport, near base camp, forest ground, among leaves, 28-VIII-1968, 12.45 u. reg. nr. 13879
8
But what you really want...
Field | Value
Genus | Leposoma
Species | Guianense
Region | Sipaliwini
Location | 4 km e. of airport
Biotope | near base camp, forest ground, among leaves
Date | 28-08-1968
Time | 12:45
Reg # | 13879
9
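Part of the step from the raw label text to the structured record is normalising field values: the Roman-numeral month in 28-VIII-1968 becomes 28-08-1968, and 12.45 u. becomes 12:45. A minimal sketch of that normalisation in Python follows. The function names and regular expressions are illustrative assumptions, not taken from the talk.

```python
import re

# Roman numerals for the twelve months, as written on the specimen label.
ROMAN_MONTHS = {
    "I": 1, "II": 2, "III": 3, "IV": 4, "V": 5, "VI": 6,
    "VII": 7, "VIII": 8, "IX": 9, "X": 10, "XI": 11, "XII": 12,
}

def normalise_date(raw: str) -> str:
    """Turn 'DD-<roman month>-YYYY' into 'DD-MM-YYYY'."""
    match = re.fullmatch(r"(\d{1,2})-([IVX]+)-(\d{4})", raw.strip())
    if match is None or match.group(2) not in ROMAN_MONTHS:
        raise ValueError(f"unrecognised date: {raw!r}")
    day = int(match.group(1))
    month = ROMAN_MONTHS[match.group(2)]
    return f"{day:02d}-{month:02d}-{match.group(3)}"

def normalise_time(raw: str) -> str:
    """Turn 'HH.MM u.' (Dutch 'uur', hour) into 'HH:MM'."""
    match = re.fullmatch(r"(\d{1,2})\.(\d{2})\s*u\.?", raw.strip())
    if match is None:
        raise ValueError(f"unrecognised time: {raw!r}")
    return f"{int(match.group(1)):02d}:{match.group(2)}"

print(normalise_date("28-VIII-1968"))  # 28-08-1968
print(normalise_time("12.45 u."))      # 12:45
```

Values that do not match the expected pattern raise an error rather than being guessed, so they can be routed to manual review.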
• Leposoma Guianense, Sipaliwini, 4 km e. of airport, near base camp, forest ground, among leaves, 28-VIII-1968, 12.45 u. reg. nr. 13879
• ask a computer to learn to segment and classify text snippets
10
• Manually annotate 500 text snippets (~3h)
• 300 for training
• 200 for testing
11
• 49,688 new database records (547,528 database cells) at ~84.57% accuracy
12
The Manually Created Reptiles and Amphibians Database
• 16,870 records describing characteristics and history of animal specimens in a natural history database
• 39 columns
• Dutch, English, German and Portuguese
• numeric and textual values (both atomic and elaborate)
13
Column name | Value
order | Anura
genus | Megophrys
country | Indonesia
biotope | in rain near road
collection date | 01.02.1888
type | holotype
determinator | A. Dubois
defined by | (Linnaeus, 1758)
special remarks | in bad condition, was eaten by Leptodactylus rugosus (3023) at night and thrown up again the next morning when killed, partly digested
14
15
• a database provides structure
• computers are good at comparing values
• statistical methods can detect inconsistencies
16
17
author | determinator | family | genus | country | preservation method
(Daudin, 1802) | ------ | Bataguridae | Anolis | Cambodja | (shield, dry)
(Schlegel) | G. vd. Boog | Colubridae | Geophis | Indonesia | -----
Schneider | M. S. Hoogmoed | ------ | Bufo | Suriname | -----
(Horst, 1883) | Tyler, M J | Hylidae | Litoria | ------ | alcohol
author | determinator | family | genus | country | preservation method
(Daudin, 1802) | ------ | Bataguridae | Anolis | Cambodja | (shield, dry)
(Schlegel) | G. vd. Boog | Colubridae | ? | Indonesia | -----
Schneider | M. S. Hoogmoed | ------ | Bufo | Suriname | -----
(Horst, 1883) | Tyler, M J | Hylidae | Litoria | ------ | alcohol
actual value: Geophis
predicted value: Rhapdophis
24
• <100 cells to check for a column instead of 16,870
• recall (estimate): 90-100%
• one-size-fits-all
25
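The data-driven check shown on the preceding slides hides one cell, predicts its value from the rest of the record, and flags the cell when prediction and stored value disagree. The sketch below illustrates that idea in Python. The slides do not say which learner was used; a k-nearest-neighbour vote with a simple overlap similarity is assumed here, and the records are toy data built from values on the slide, not the real database.

```python
from collections import Counter

def predict_cell(records, target, column, k=3):
    """Predict target[column] from the k records sharing the most
    values with target in the other columns (overlap similarity)."""
    def overlap(record):
        return sum(
            1 for key, value in target.items()
            if key != column and value is not None and record.get(key) == value
        )
    neighbours = sorted(records, key=overlap, reverse=True)[:k]
    votes = Counter(r[column] for r in neighbours if r.get(column) is not None)
    return votes.most_common(1)[0][0] if votes else None

def suspicious_cells(records, column, k=3):
    """Yield (index, actual, predicted) for every record whose
    leave-one-out prediction disagrees with the stored value."""
    for i, record in enumerate(records):
        others = records[:i] + records[i + 1:]
        predicted = predict_cell(others, record, column, k)
        if predicted is not None and predicted != record.get(column):
            yield i, record.get(column), predicted

# Toy records (hypothetical): None stands for an empty cell.
records = [
    {"family": "Colubridae", "country": "Indonesia", "genus": "Rhapdophis"},
    {"family": "Colubridae", "country": "Indonesia", "genus": "Rhapdophis"},
    {"family": "Colubridae", "country": "Indonesia", "genus": "Rhapdophis"},
    {"family": "Colubridae", "country": "Indonesia", "genus": "Geophis"},
    {"family": "Hylidae", "country": None, "genus": "Litoria"},
    {"family": "Hylidae", "country": None, "genus": "Litoria"},
    {"family": "Hylidae", "country": None, "genus": "Litoria"},
]

print(list(suspicious_cells(records, "genus")))
# [(3, 'Geophis', 'Rhapdophis')]
```

A flagged cell is only a candidate for manual checking; the stored value may still be correct, as it may simply be rare in its context.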
• Data-driven cleaning cannot detect systematic errors
• Maybe systematics can help?
26
subject | relation | object
specimen collection | occurs before | entry in museum
species | has broader term | genus
city | falls within | country
27
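Relations such as those in the table above can be turned into explicit consistency rules that are run over every record. The following Python sketch shows two of them. The field names (`collection_date`, `entry_date`, `city`, `country`), the lookup table and the example record are hypothetical and only serve to illustrate the idea.

```python
from datetime import date

# Hypothetical background knowledge: the country each city falls within.
CITY_COUNTRY = {"Paramaribo": "Suriname", "Leiden": "Netherlands"}

def check_record(record):
    """Return the list of violated rules for one record (empty if consistent)."""
    violations = []
    collected = record.get("collection_date")
    entered = record.get("entry_date")
    # Rule: specimen collection occurs before entry in museum.
    if collected and entered and collected > entered:
        violations.append("collection after museum entry")
    # Rule: city falls within country.
    city, country = record.get("city"), record.get("country")
    if city in CITY_COUNTRY and country and CITY_COUNTRY[city] != country:
        violations.append("city not in country")
    return violations

record = {
    "collection_date": date(1968, 8, 28),
    "entry_date": date(1967, 1, 15),  # hypothetical: earlier than collection
    "city": "Paramaribo",
    "country": "Indonesia",
}
print(check_record(record))
# ['collection after museum entry', 'city not in country']
```

Rules of this kind only fire when the relevant cells are filled, which is one reason the approach has a small scope but can be precise within it.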
• detects inconsistencies in database usage
• small scope
• high recall and precision within scope
• needs adapting for each new domain
28
29
Disambiguating Locations
30
Challenge | Example
Ambiguous location name | Amsterdam
Two or more location descriptors | Wakarusa, 24mi WSW of Lawrence
Topological nesting | Moccassin Creek on Hog Island
Complex description | Bupo [?Buso] River, 15 miles [24km] E of Lae
Linear feature measurement | 16km (by road) N of Murtoa
Linear ambiguity | On the road between Sydney and Bathurst
Vague localities | Southeast Michigan
Changed political borders | Yugoslavia
Historical Place Names | British North Borneo
• Annotated the geographical information in 200 randomly selected database records
• 50 records for development, 150 for testing
31
Knowledge-driven Georeferencing
• Record retrieval
• Text parsing
• Gazetteer lookup
• Offset calculation
• Disambiguation Heuristics
32
Offset
33
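An offset description such as "24mi WSW of Lawrence" can be resolved by looking up the anchor place and moving the stated distance along the stated compass bearing. The sketch below uses the standard great-circle destination formula on a spherical Earth. It assumes that the gazetteer resolved "Lawrence" to Lawrence, Kansas, and uses approximate coordinates; neither detail is stated in the slides.

```python
import math

EARTH_RADIUS_KM = 6371.0
KM_PER_MILE = 1.609344
# Bearings in degrees clockwise from north for the 16 compass points.
COMPASS = {name: i * 22.5 for i, name in enumerate(
    ["N", "NNE", "NE", "ENE", "E", "ESE", "SE", "SSE",
     "S", "SSW", "SW", "WSW", "W", "WNW", "NW", "NNW"])}

def offset_point(lat, lon, distance_km, bearing_deg):
    """Great-circle destination reached from (lat, lon) after travelling
    distance_km along the initial bearing bearing_deg."""
    phi1, lam1 = math.radians(lat), math.radians(lon)
    theta = math.radians(bearing_deg)
    delta = distance_km / EARTH_RADIUS_KM  # angular distance
    phi2 = math.asin(math.sin(phi1) * math.cos(delta)
                     + math.cos(phi1) * math.sin(delta) * math.cos(theta))
    lam2 = lam1 + math.atan2(
        math.sin(theta) * math.sin(delta) * math.cos(phi1),
        math.cos(delta) - math.sin(phi1) * math.sin(phi2))
    return math.degrees(phi2), math.degrees(lam2)

# Approximate coordinates of Lawrence, Kansas (assumed gazetteer result).
lawrence = (38.97, -95.24)
lat, lon = offset_point(*lawrence, 24 * KM_PER_MILE, COMPASS["WSW"])
print(round(lat, 2), round(lon, 2))
```

The result lies south-west of the anchor point. A description such as "16km (by road)" would need a different treatment, because road distance is not straight-line distance.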
Disambiguation Heuristics
• Spatial Minimality
• if Amsterdam and Utrecht are mentioned in the same record, then Amsterdam, NL is more likely than Amsterdam, NY, USA
• Expedition clusters
• It is unlikely that a collector was collecting in Europe on Monday and in the US on Tuesday
• Species occurrence data
• GBIF can tell us where a certain species does or does not occur
34
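The spatial minimality heuristic can be sketched as choosing, for the place names in one record, the combination of candidate readings that lies closest together. The candidate table below stands in for gazetteer output and uses approximate coordinates; it and the function names are assumptions for illustration.

```python
import math
from itertools import product

def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))

# Hypothetical gazetteer output: candidate readings per place name.
CANDIDATES = {
    "Amsterdam": {
        "Amsterdam, NL": (52.37, 4.90),
        "Amsterdam, NY, USA": (42.94, -74.19),
    },
    "Utrecht": {"Utrecht, NL": (52.09, 5.12)},
}

def spatially_minimal(names):
    """Pick one candidate per name so that the summed pairwise
    distance between the chosen points is as small as possible."""
    def spread(choice):
        points = [CANDIDATES[n][label] for n, label in zip(names, choice)]
        return sum(haversine_km(p, q)
                   for i, p in enumerate(points) for q in points[i + 1:])
    return min(product(*(CANDIDATES[n] for n in names)), key=spread)

print(spatially_minimal(["Amsterdam", "Utrecht"]))
# ('Amsterdam, NL', 'Utrecht, NL')
```

Enumerating all combinations is fine for the handful of place names in a single record; a larger set of names would call for a cheaper search.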
Species Occurrence Data
35
Results
Configuration | Correct @5km | Correct @25km | Correct @100km | Mean distance off | Not Found
Baseline | 38.9 | 47.0 | 58.4 | 251.1 | 26.2
+ Google maps + fuzzy | 53.0 | 65.1 | 74.5 | 244.1 | 8.7
+ Spatial minimality | 59.1 | 71.8 | 77.2 | 171.1 | 7.4
+ Expedition | 59.1 | 71.8 | 77.2 | 171.1 | 7.4
+ GBIF | 61.7 | 74.5 | 79.9 | 114.5 | 7.4
36
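Measures of the kind reported in the results can be computed by comparing each predicted coordinate with its annotated gold coordinate. The sketch below is one plausible way to do so, not the evaluation script behind the slides: it assumes that the "correct" and "not found" figures are percentages over all test records and that the mean distance is taken over the records for which a location was found. The coordinates are toy data.

```python
import math

def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))

def evaluate(predicted, gold, thresholds=(5, 25, 100)):
    """predicted: (lat, lon) per record, or None when nothing was found.
    gold: annotated (lat, lon) per record."""
    n = len(gold)
    errors = [haversine_km(p, g) for p, g in zip(predicted, gold) if p is not None]
    scores = {f"correct@{t}km": 100 * sum(e <= t for e in errors) / n
              for t in thresholds}
    scores["mean_distance_off"] = sum(errors) / len(errors) if errors else float("nan")
    scores["not_found"] = 100 * (n - len(errors)) / n
    return scores

# Toy data (hypothetical): four records, one of them not found.
gold = [(52.37, 4.90), (52.09, 5.12), (52.16, 4.49), (51.92, 4.48)]
predicted = [(52.37, 4.90), (52.37, 4.90), None, (42.94, -74.19)]
scores = evaluate(predicted, gold)
print(scores["correct@5km"], scores["correct@100km"], scores["not_found"])
# 25.0 50.0 25.0
```

A single prediction on the wrong continent dominates the mean distance, which is why the thresholded accuracies are reported alongside it.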
Confidence
37
Generating Stories
38
Image source: http://www.gungeralv.org/dg/images/chapter1.JPG
Work in Progress
39
General Conclusions
• data cleaning is essential
• “digitising” a heritage collection is complicated
• don’t try to tame text
40
Thank you for your attention!
41
• CATCH: http://www.nwo.nl/catch
• MITCH: http://ilk.uvt.nl/mitch
• NewsReader: http://www.newsreader-project.eu
42
• More information about machine learning
• Video explaining k-nearest neighbour algorithm: http://videolectures.net/aaai07_bosch_knnc/
• Weka Toolkit: http://www.cs.waikato.ac.nz/ml/weka/
43

Orientation EBC 2013: Digitising Natural History
