BiographyNet: Linking the world of History

BiographyNet
Linking the world of History
Serge ter Braake, Antske Fokkens, Niels Ockeloen,
Susan Legêne, Guus Schreiber, Piek Vossen, et al.
The Network Institute, VU University Amsterdam
http://wm.cs.vu.nl http://www.biographynet.nl
October 2013

BiographyNet: Linking the world of history
General project info, February 2014
Overview of this presentation
• Introduction of the project
• What is E-history?
• Project goals
• Short overview of use cases
• Illustrative use case example
• Text mining using NLP
• Challenges
• Preliminary results
• Why provenance is important
• Requirements from the perspective of the Historian
• Requirements from the perspective of the Computer scientist
• The BiographyNet schema
• Extending the schema with Provenance
• Aggregated provenance information
• Detailed provenance information
• Demonstrator Interface
• First ideas and sketches
Overview

E-humanities
Investigates what can be done in humanities with modern
techniques which we could not do before, or only with a
great deal of effort
What is E-history?
E-history
Subdomain of e-humanities that aims to improve existing methods
of historical research rather than to introduce
a whole new way of doing historical research *
* Zaagsma, G.: Doing history in the digital age: history as a hybrid practice (2013)
http://gerbenzaagsma.org/blog/16-03-2013/doing-history-digital-age-history-hybrid-practice

BiographyNet: Extracting relations between people,
places and historic events
• Multidisciplinary E-History Project
What is BiographyNet?
• Funded by the Netherlands eScience Center
• Partners are the Netherlands eScience Center, the
Huygens/ING Institute of the Royal Dutch Academy of
Sciences and VU University Amsterdam
• Starting Point: The Biographical Portal of the
Netherlands - http://www.biografischportaal.nl
• 125,000 short biographical descriptions with limited meta
data from a variety of Dutch biographical dictionaries
• 76,000 individuals

Short biographical descriptions
with limited meta data
[Figure: bar chart of the share of individuals with available information (%), per field: Name, Category, Gender, Date of Death, Date of Birth, Place of Birth, Place of Death, Occupation, Religion, Father, Mother, Claim to Fame, Partner, Text]

Main project goals
• Provide a richer historic knowledge base by creating a semantic layer on
top of the data from the Biographical Portal
• Convert the available data to RDF (first conversion available)
• Enrichments (NLP) and Aggregations
• Link to other sources
• Inspire Historians in setting up new research projects by providing them
with interesting leads
• Development of a demonstrator
• Quantitative analysis, visualisation and browsing techniques
• Re-usable deliverables
• Open-source release of the platform for analyzing texts about people
• Methodology for extraction of a relation network between
people, places and events
Project Goals

Currently 12 use cases developed involving quantitative
analysis, relation discovery, thematic research, etc.
• Simple:
• Group analysis of Governors-general
of the Dutch Indies
• More complex:
• When did Dutch elites get involved
with the ‘New World’?
• Highly complex:
• What can we say about nationalism in biographical
dictionaries from the nineteenth and twentieth century?
Use Case Overview

Governors-General of the Dutch Indies
• Highest Official in the Dutch Indies (1610-1949)
• 129 Biographies describing 71 individuals
• What can we say about these men as a group?
• What properties did they need to have to be appointed?
• Personal qualities
• Relations (already
more difficult)
Illustrative use case

Focus on the following information
• Family connections
• Parents
• Partner
• Children
• Dates
• Birth
• Appointment
• Death
• Motivation
• Education
• Religion
• Reasons for appointment
• Reasons for leaving the office
Governors General: Data Mining

Manual analysis
“More than one full week to manually mine this information
from the Biography Portal.” (Serge ter Braake)
The question
“Can a historian do this with (almost) the same results in
less than an hour when using the demonstrator?”
Governors General: Time and effort

Basic System for data enrichment using text:
• Identifying meta data in text
• Linguistically naïve supervised machine learning
• Linguistic processing
• Detection of (co-referenced) named-entities
(persons, places and dates) and events
• Concept identification
Text mining using Natural Language
Processing (NLP)

Challenges for NLP within BiographyNet:
• Deal with alternative spelling
• Texts vary from 19th century Dutch to contemporary Dutch
• Variations in the naming of people and places
• OCR-ed texts contain errors
• The methods used may introduce bias:
• Example: location identification with GeoNames
• Heuristic: when a name matches multiple places, take the one in, or
closest to, the Netherlands
• Problem: ‘America’ is a place in the Netherlands, but
what about trade with the New World?
NLP: Challenges
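
The bias introduced by the "closest to the Netherlands" heuristic can be made concrete with a minimal Python sketch. It is not the project's implementation and does not call GeoNames: the candidate list, the reference point NETHERLANDS_CENTRE and all coordinates are approximate, illustrative assumptions.

```python
import math

# Approximate reference point for the Netherlands (illustrative assumption).
NETHERLANDS_CENTRE = (52.1, 5.3)

def haversine_km(a, b):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))

def pick_closest_to_netherlands(candidates):
    """Apply the heuristic: keep the candidate nearest to the Netherlands."""
    return min(candidates,
               key=lambda c: haversine_km(c["coords"], NETHERLANDS_CENTRE))

# Hypothetical candidates for the place name 'America' (approximate coordinates).
candidates = [
    {"label": "America (village in the Netherlands)", "coords": (51.44, 5.98)},
    {"label": "America (the New World)", "coords": (39.8, -98.6)},
]

chosen = pick_closest_to_netherlands(candidates)
print(chosen["label"])
```

With these inputs the heuristic always selects the Dutch village, even when a biography is about overseas trade, which is exactly the kind of systematic bias a historian needs to be told about.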

NLP: Preliminary results – Governors
[Figure: bar chart of the presence of information in text vs. meta data (% of 71 individuals); series: metadata, text]

Before development of the actual demonstrator can
commence, we first need to:
• Convert the data of the Biography Portal to RDF
• Prevent loss of information
• Devise a schema
• Structure the data
• Provide compatibility with other interesting sources
• Facilitate the recording of provenance information on the
manipulation of the data
Towards the demonstrator

Two main requirements for the demonstrator:
• A trace back to all original sources (texts and meta data) involved
in producing a certain result
• Which sources were used for the overall outcome and how often?
• What potentially relevant data was excluded from the end result?
• Which piece of data led to a specific result (e.g. the age of a specific
governor at his appointment)?
• Insight into the processes manipulating and selecting the data
• Indication of overall performance: Focus on recall or precision?
• A global description of the heuristics used should be provided
• Indication of responsibility: Who to contact when results are called
into question?
Requirements from the perspective
of the Historian
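
The historian's three trace-back questions can be illustrated with a small Python sketch. It assumes, purely for illustration, that every derived value keeps a list of the source biographies it was computed from; the identifiers and the names derivations, available_sources, trace_back, source_usage and excluded_sources are hypothetical.

```python
from collections import Counter

# Hypothetical bookkeeping: derived value -> source biographies it came from.
derivations = {
    "governor_A/age_at_appointment": ["source_1/bio_12", "source_2/bio_7"],
    "governor_A/religion": ["source_1/bio_12"],
    "governor_B/age_at_appointment": ["source_3/bio_4"],
}

# Hypothetical set of all biographies that were available for the query.
available_sources = {
    "source_1/bio_12", "source_2/bio_7", "source_3/bio_4", "source_3/bio_9",
}

def trace_back(result_id):
    """Which pieces of data led to this specific result?"""
    return list(derivations.get(result_id, []))

def source_usage():
    """Which sources were used for the overall outcome, and how often?"""
    return Counter(src for srcs in derivations.values() for src in srcs)

def excluded_sources():
    """Which potentially relevant sources did not contribute to any result?"""
    return sorted(available_sources - set(source_usage()))

print(trace_back("governor_A/age_at_appointment"))
print(source_usage().most_common())
print(excluded_sources())
```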

Reproducing results:
• Reproducing results in NLP is non-trivial
• Details in implementations or experimental setup can
influence results up to a point where they tell a different story
• Clear registration of all steps involved and storage of
intermediate system output can improve reproducibility
• Systematic testing can help to gain insight into the variation
of the outcome of our systems and hence lead to more
insight in their performance
Antske Fokkens, Marieke van Erp, Marten Postma, Ted Pedersen, Piek Vossen and Nuno
Freire (2013) Offspring from Reproduction Problems: What Replication Failure Teaches
Us. In: Proceedings of ACL 2013, Sofia, Bulgaria, August 2013.
Requirements from the perspective of the
Computer Scientist / Computational Linguist

Translation into requirements for the demonstrator:
• Facilitate Replication and Reproduction
• Recording of information on the tools used, such as creator, version
number, etc.
• Recording of any kind of pre-/post-processing done on
input/output data
• Recording of the intention behind the various steps in the NLP
pipeline, including assumptions made and possible biases
• Intermediate results need to be preserved for debugging purposes
• The schema needs to be both generic and flexible
• NLP pipeline design can change
• Future tools and their formats are not yet known
Requirements from the perspective of the
Computer Scientist / Computational Linguist
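
One way to meet these recording requirements is to wrap every pipeline step so that its tool information, intention and intermediate output are stored as a side effect. The Python sketch below is an assumption about how that could look, not the BiographyNet pipeline; StepRecord, RecordingPipeline and the two toy steps are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List

@dataclass
class StepRecord:
    """What was run, by which tool, why, and on which input/output."""
    tool: str
    version: str
    creator: str
    intention: str
    input_data: Any
    output_data: Any

@dataclass
class RecordingPipeline:
    records: List[StepRecord] = field(default_factory=list)

    def run_step(self, func: Callable[[Any], Any], data: Any, *,
                 tool: str, version: str, creator: str, intention: str) -> Any:
        """Run one step and preserve its intermediate result for debugging."""
        output = func(data)
        self.records.append(
            StepRecord(tool, version, creator, intention, data, output))
        return output

pipeline = RecordingPipeline()
text = "Thorbecke werd in 1798 geboren"

tokens = pipeline.run_step(
    str.split, text,
    tool="whitespace-tokenizer", version="0.1", creator="example author",
    intention="split text into tokens; assumes whitespace-delimited words")

years = pipeline.run_step(
    lambda toks: [t for t in toks if t.isdigit() and len(t) == 4], tokens,
    tool="year-filter", version="0.1", creator="example author",
    intention="keep four-digit tokens; biased towards years written as digits")

for record in pipeline.records:
    print(record.tool, record.version, "->", record.output_data)
```

Because each record is generic (any tool, any input/output), the same structure survives changes to the pipeline design.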

Foundations of the schema:
• Based on the structure of the original XML files
• Needs to facilitate the coupling of different biographies of the same
person, without compromising the original data
• Needs to facilitate the incorporation of several enrichments, following
from NLP, as well as aggregations
• Compatible with existing
schemas such as the
Europeana Data Model,
PROV, P-PLAN,
DC terms, etc.
The BiographyNet Schema

Purely syntactic conversion
• Preserve the original
structure of the data
• Prevent loss of information
• Allow for reinterpretation of
the original data in the future
The conversion process
Very simplified BP XML example:
<XML>
  <BioDes>
    <!-- Source meta data -->
    <FileDes>
      <Author></Author>
    </FileDes>
    <!-- Person meta data -->
    <PersonDes>
      <Name></Name>
    </PersonDes>
    <!-- Biographical text -->
    <BioPart>
      <Snippet></Snippet>
    </BioPart>
  </BioDes>
</XML>

Conversion steps:
• Retrieval of XML dump of the Biography Portal
• Initial conversion to ‘crude’ RDF
• Using ClioPatria and the XMLRDF
tool for ClioPatria
• RDF restructuring
• Correction of purely syntactic
inefficiencies in the data
• TODO: Linking to other sources
• Essential step in the
‘Linked Data’ philosophy
The conversion process
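
A purely syntactic conversion can be sketched in a few lines of Python. This is not the XMLRDF tool for ClioPatria that the project used; it only shows the idea that every XML element becomes a node and every nesting becomes a property, so that no structure is lost. The namespace BASE, the function crude_triples and the sample values inside the XML are hypothetical.

```python
import xml.etree.ElementTree as ET

BASE = "http://example.org/bp/"  # hypothetical namespace

# Simplified Biography Portal record with made-up values.
XML = """<BioDes>
  <FileDes><Author>Example Author</Author></FileDes>
  <PersonDes><Name>Example Name</Name></PersonDes>
  <BioPart><Snippet>Example text.</Snippet></BioPart>
</BioDes>"""

def crude_triples(xml_text):
    """Mirror the XML tree as (subject, predicate, object) triples."""
    triples = []
    counter = 0

    def visit(elem):
        nonlocal counter
        counter += 1
        node = f"{BASE}node{counter}"
        # The element name becomes the node's type.
        triples.append((node, "rdf:type", BASE + elem.tag))
        # Element text, if any, is kept as a literal value.
        text = (elem.text or "").strip()
        if text:
            triples.append((node, "rdf:value", text))
        # Nesting becomes a property named after the child element.
        for child in elem:
            child_node = visit(child)
            triples.append((node, BASE + child.tag, child_node))
        return node

    visit(ET.fromstring(xml_text))
    return triples

triples = crude_triples(XML)
for triple in triples:
    print(triple)
```

The restructuring step would then rewrite these 'crude' triples into the devised schema, while the original structure stays available for later reinterpretation.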

Provenance information is information on how Entities
come into existence
• What are entities?
• Documents, Articles, Pictures, etc.
• Basically anything that can be
‘produced’ by something or someone
• What kind of information?
• Who did what?
• Using which entities?
• In which processes?
• Why use PROV-DM, in its OWL form PROV-O?
• PROV-DM is now an official W3C Recommendation
Adding Provenance Information

Based on the requirements for the demonstrator,
provenance needs to be modeled:
• From several perspectives:
• Information involved: sources, but also NER input data, etc.
• Processes involved: all steps in enrichment, aggregation, etc.
• People involved: who was responsible for the pipeline, tool, etc.
• At multiple levels:
• An aggregated level, i.e. per enrichment: targeted at the historian
• A detailed level, i.e. all individual processes: targeted at the
computer scientist and computational linguist
Provenance in BiographyNet
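
The relation between the two levels can be sketched as follows: the detailed level records every activity with PROV-O properties, and the aggregated level is derived from it by collapsing the chain of activities into direct links to the original sources. The property names (prov:used, prov:wasGeneratedBy, prov:wasAssociatedWith, prov:wasDerivedFrom) are from PROV-O; the ex: identifiers and the functions original_sources and aggregate are hypothetical, and this is not the BiographyNet schema itself.

```python
PROV = "http://www.w3.org/ns/prov#"

# Detailed level: every individual process in a (hypothetical) two-step pipeline.
detailed = [
    ("ex:tokens", PROV + "wasGeneratedBy", "ex:tokenize"),
    ("ex:tokenize", PROV + "used", "ex:biography_text"),
    ("ex:tokenize", PROV + "wasAssociatedWith", "ex:tokenizer_tool"),
    ("ex:birth_event", PROV + "wasGeneratedBy", "ex:extract_events"),
    ("ex:extract_events", PROV + "used", "ex:tokens"),
    ("ex:extract_events", PROV + "wasAssociatedWith", "ex:event_tool"),
]

def original_sources(entity, triples):
    """Follow wasGeneratedBy/used links back to entities that no recorded
    activity generated, i.e. the original sources."""
    generated_by = {s: o for s, p, o in triples if p == PROV + "wasGeneratedBy"}
    used = {}
    for s, p, o in triples:
        if p == PROV + "used":
            used.setdefault(s, []).append(o)
    if entity not in generated_by:
        return {entity}
    sources = set()
    for inp in used.get(generated_by[entity], []):
        sources |= original_sources(inp, triples)
    return sources

def aggregate(entity, triples):
    """Aggregated level: one wasDerivedFrom link per original source."""
    return [(entity, PROV + "wasDerivedFrom", src)
            for src in sorted(original_sources(entity, triples))]

aggregated = aggregate("ex:birth_event", detailed)
print(aggregated)
```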

Needed to ensure credibility of the demonstrator, to
evaluate its performance and to improve the academic
status of the tool
• One needs to be able to validate results
• Replication: Retrieving the same results later using the
demonstrator
• Reproducibility: Manually by the historian
• The aggregated level – Targeted at the historian
• Which original sources were involved?
• Who to contact in case results are called into question?
• The detailed level – Targeted at the computer scientist
• Detailed information on each individual step
• Allows for debugging the internal processing pipeline
Recap: Why is provenance info
important for BiographyNet?

BiographyNet: Schema illustration
http://www.biographynet.nl/schema

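BiographyNet enrichment example: Thorbecke
Source text (Dutch, cut off on the slide): “Johan Rudolph Thorbecke werd in 1798 geboren op 14 januari in Zwolle en komt uit een half-Duits” (English: “Johan Rudolph Thorbecke was born in 1798 on 14 January in Zwolle and comes from a half-German”)
[Diagram: Biographical Description of Thorbecke with File Meta Data (NNBW), Person Meta Data (“Thorbecke”; Birth event, 1798) and Biography Parts; an Enrichment of the Biographical Description, produced by the NLP Pipeline (prov:plan), with Person Meta Data containing a Birth event: Zwolle, 1798-01-14]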
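
How the birth event in this example could be recovered from the text can be shown with a deliberately naive pattern-based sketch in Python. It is not the project's NLP pipeline: the regular expression PATTERN, the month table MONTHS and the function extract_birth are assumptions that only cover this one sentence shape.

```python
import re

# Dutch month names mapped to month numbers.
MONTHS = {
    "januari": 1, "februari": 2, "maart": 3, "april": 4, "mei": 5, "juni": 6,
    "juli": 7, "augustus": 8, "september": 9, "oktober": 10, "november": 11,
    "december": 12,
}

# Matches "werd in <year> geboren op <day> <month> in <Place>".
PATTERN = re.compile(
    r"werd in (?P<year>\d{4}) geboren op (?P<day>\d{1,2}) "
    r"(?P<month>[a-z]+) in (?P<place>[A-Z][\w-]+)"
)

def extract_birth(text):
    """Return an ISO birth date and birth place, or None if nothing matches."""
    normalized = " ".join(text.split())  # undo line breaks from the source
    match = PATTERN.search(normalized)
    if not match or match.group("month") not in MONTHS:
        return None
    date = "{}-{:02d}-{:02d}".format(
        match.group("year"), MONTHS[match.group("month")], int(match.group("day")))
    return {"date": date, "place": match.group("place")}

text = """Johan Rudolph Thorbecke werd
in 1798 geboren op 14 januari
in Zwolle en komt uit een half-Duits"""

birth = extract_birth(text)
print(birth)
```

A pattern like this breaks as soon as the wording, spelling or word order changes, which is why the project combines machine learning with linguistic processing instead.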

Provenance and Plans (P-PLAN):* Represent the plans that
guided the execution of scientific processes
• ‘Plans’ describe the original idea behind an activity
• Each ‘Plan’ can consist of one or more ‘Steps’
• Each ‘Step’ corresponds to an ‘Activity’
• ‘Variables’ describe the input/output of an activity
• Structure, format, quantity, etc.
• Each ‘Variable’ corresponds with an input/output ‘Entity’ of an
‘Activity’
• ‘Plans’ have their own provenance info
• E.g. who was responsible for the creation of a plan?
*Daniel Garijo, Yolanda Gil; http://www.opmw.org/model/p-plan
More than just Provenance:

P-PLAN is used to not only model what actually
happened, but also what was supposed to happen
• Forces the recording of what an activity and its
input/output should look like
• Provides abstract description of original idea behind activity
• As such, can provide info on heuristics and assumptions
• Allows for comparing the actual activity and its
input/output with the original plan and its variables
• Do they differ from each other and to what extent?
• Makes finding errors much easier, as more information is
available about what the input/output should look like
Why model plans besides provenance?
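
Comparing an activity's actual output with what the plan said it should look like can be sketched in plain Python. The classes below only mimic the P-PLAN notions of 'Step' and 'Variable'; they do not use the P-PLAN vocabulary itself, and the names Variable, Step, deviations and birth_step are hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

@dataclass
class Variable:
    """Planned description of one output: name, expectation, and a check."""
    name: str
    description: str
    check: Callable[[Any], bool]

@dataclass
class Step:
    """Planned step: the original idea behind an activity and its outputs."""
    name: str
    intention: str
    outputs: List[Variable]

def deviations(step: Step, actual_outputs: Dict[str, Any]) -> List[str]:
    """List where the actual outputs differ from the plan."""
    problems = []
    for var in step.outputs:
        if var.name not in actual_outputs:
            problems.append(f"{step.name}: missing output '{var.name}'")
        elif not var.check(actual_outputs[var.name]):
            problems.append(
                f"{step.name}: output '{var.name}' deviates from plan "
                f"({var.description})")
    return problems

birth_step = Step(
    name="extract_birth",
    intention="find one birth date per biography",
    outputs=[Variable(
        "birth_date", "ISO date string YYYY-MM-DD",
        lambda v: isinstance(v, str) and len(v) == 10
        and v[4] == "-" and v[7] == "-")],
)

print(deviations(birth_step, {"birth_date": "1798-01-14"}))
print(deviations(birth_step, {"birth_date": "1798"}))
```

The second call reports a deviation because the activity produced only a year where the plan expected a full date, which points directly at the step to debug.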

BiographyNet: Schema illustration

[Diagram: schema illustration relating Activity, Plan, Entity, Variable, Agent and Association; example agents: Person, NLP Tool]

• The interface should be easy to use
• The demonstrator should inspire historians to
undertake new research and give
direction, rather than being the ‘closing factor’
in their research
• The interface should allow users to fine-tune the
results returned upon an initial action
Interface: Focus

• Query composition
• Faceted browsing
• A combination
Interface: Options

• Drop down boxes
to select ‘Verbs’,
data elements
and relations
Interface: Query composition

• No explicit querying, but
convergence of the data through
browsing and selecting
• Provides better feedback to the user
• Allows for more direct and easier
adjustment of the selected data
Interface: Faceted browsing

• Query composition combined with faceted
browsing
• Create new facets by defining a query
– The result of the query is available as a subset of
the data by selecting the defined facet
– As such, combinable with other facets
• Method to integrate ‘open’ querying of the
data into a general interface and visualization
Interface: A combination
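
The combination of the two approaches can be sketched with sets in Python: a facet is a subset of record identifiers, a query defines a new facet, and facets are combined by intersection. The records and the functions facet and query_facet are hypothetical and only illustrate the idea, not the demonstrator's design.

```python
# Hypothetical records; values are made up for illustration.
records = [
    {"id": 1, "occupation": "governor-general", "birth_year": 1650},
    {"id": 2, "occupation": "governor-general", "birth_year": 1750},
    {"id": 3, "occupation": "merchant", "birth_year": 1640},
]

def facet(records, field, value):
    """Ordinary facet: all records with a given value for a field."""
    return {r["id"] for r in records if r.get(field) == value}

def query_facet(records, predicate):
    """Query-defined facet: the result of a query becomes a selectable subset."""
    return {r["id"] for r in records if predicate(r)}

governors = facet(records, "occupation", "governor-general")
born_before_1700 = query_facet(records, lambda r: r["birth_year"] < 1700)

# Selecting both facets narrows the data to their intersection.
selection = governors & born_before_1700
print(sorted(selection))
```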

[Diagram: Interface: A combination; labels: Question, Analysis, Selection, Process, Results, Data, Facets]

Time and place
are primary elements
Interface: Demonstrator
[Sketch of the demonstrator interface; the Results area is marked “?”]

Main components of the demonstrator
• Initial schema available
• Schema models enrichments and aggregations alongside original
sources
• Allows for storing various levels of provenance information
• Model will be adapted while progressing with building the
demonstrator
• Initial conversion to RDF available
• Structure according to devised schema
• Next step is linking to external sources
• Initial NLP system setup available
• Preliminary results comparable with manual use case
• Interface
• First ideas and sketches
Current Status

Thank you for your attention
www.biographynet.nl
Feel free to ask questions
