Innovative methods for data integration: Linked Data and NLP

ARIADNE is funded by the European Commission's Seventh Framework Programme
Douglas Tudhope
Hypermedia Research Group
University of South Wales (USW)

Linked Data and NLP
Linked Data (LD) + Natural Language Processing (NLP)
Two technologies that open up new possibilities for
semantic integration of archaeological datasets and
fieldwork reports.
E.g.
• cross searching
• meta research
• reinterpretation of previous work

This presentation
• Overview
• Illustrative early examples
- a flavour of progress and challenges to date
• NLP of grey literature (English – Dutch)
• Mapping between multilingual vocabularies

What is Linked Data?
“The Web enables us to link related documents. Similarly
it enables us to link related data. The term Linked Data
refers to a set of best practices for publishing and
connecting structured data on the Web.
Key technologies that support Linked Data are URIs (a generic
means to identify entities or concepts in the world), HTTP (a
simple yet universal mechanism for retrieving resources, or
descriptions of resources), and RDF (a generic graph-based
data model with which to structure and link data that describes
things in the world).”
http://linkeddata.org/faq
http://data.archaeologydataservice.ac.uk/page/10.5284/1000389

Linked Data
• Making RDF format data available via the web
• Data expressed in RDF
• Using (HTTP) URIs as identifiers for things
• When someone looks up a URI, provide useful
information (including links to other things)
• Will it work for cultural heritage...? Yes
– http://data.ordnancesurvey.co.uk/
– http://collection.britishmuseum.org/
– http://data.archaeologydataservice.ac.uk/
http://linkeddata.org/faq and
Bizer, C., Heath, T., & Berners-Lee, T. (2009). Linked Data - The Story So Far. International Journal
on Semantic Web and Information Systems, 5(3), 1-22.
Also see http://data.gov.uk/linked-data
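
The principles listed above can be illustrated with a minimal sketch. It is not taken from ARIADNE: the example.org URIs and the broader/narrower relation are invented for illustration, and only the two SKOS property URIs are real. Data is held as subject, predicate, object triples, and looking up a URI returns its description, including links to other things.

```python
# Hypothetical concept URIs under example.org; only the SKOS property URIs are real.
SKOS_PREF_LABEL = "http://www.w3.org/2004/02/skos/core#prefLabel"
SKOS_BROADER = "http://www.w3.org/2004/02/skos/core#broader"

POSTHOLE = "http://example.org/concept/posthole"  # invented identifier
PIT = "http://example.org/concept/pit"            # invented identifier

# Each statement is a (subject, predicate, object) triple, as in the RDF data model.
TRIPLES = [
    (POSTHOLE, SKOS_PREF_LABEL, "post hole"),
    (POSTHOLE, SKOS_BROADER, PIT),  # relation invented for this sketch
    (PIT, SKOS_PREF_LABEL, "pit"),
]


def describe(uri, triples=TRIPLES):
    """Return the statements about `uri`, i.e. what a lookup of that URI could serve."""
    return [(p, o) for s, p, o in triples if s == uri]


def linked_uris(uri, triples=TRIPLES):
    """Return objects that are themselves HTTP URIs: the links a client can follow next."""
    return [o for _, o in describe(uri, triples) if o.startswith("http://")]


if __name__ == "__main__":
    print(describe(POSTHOLE))
    print(linked_uris(POSTHOLE))
```

A real service would answer the HTTP request for the URI with an RDF serialisation of these statements rather than a Python list.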

Standards are key
• Linked data rests upon layers of technological
standards.
• Within archaeology, vocabulary standards are seen
as a potential solution to the current fieldwork
situation where isolated silos of data impede sharing,
cross search, comparison and reinterpretation.
• Standard representations of thesauri and other
vocabularies
• Standard ways of mapping between vocabularies
and using vocabularies within ontology frameworks

ARIADNE Linked Data aims
Support provision, management and use of LD
in the Integrated Infrastructure
• Operational LD management service (based on a
triple store) working with the ARIADNE Registry
• Supporting tools for the linking of infrastructure, such
as dictionaries, glossaries and thesauri. Tools for
semantic enrichment
• Exemplar applications exploiting the LD
• Advise ARIADNE data providers in creation and
publishing of LD

Semantic enrichment
• Semantic enrichment requires an
infrastructure for linking
- dictionaries, glossaries, thesauri, ontologies.
• Some linking by hand may be possible between
closely associated datasets, but this is not scalable
• Critical for enrichment are concepts from major
vocabularies and ontological classes that can act
as hubs in a web of archaeological data

NLP methods
a) Rule-based systems - a pipeline of cascaded
software elements using domain knowledge and
vocabularies together with domain-independent
linguistic syntax
b) Machine learning (supervised) has less domain
dependency but depends on the existence of a
training set.
ARIADNE is currently exploring complementary use of
both methods, either in combination or sequentially
in a pipeline, and is looking to make use of relevant
CLARIN and other resources.

NLP ongoing case study
Rule-based pilot system
• Archaeological vocabularies from English Heritage
(EH) and Rijksdienst Cultureel Erfgoed (RCE).
• Building upon previous work producing semantic
enrichment of ADS grey literature via archaeological
thesauri and corresponding CIDOC CRM ontology
• Next slides show examples from pilot English and
Dutch pipelines. Three entity types shown: Green
(Physical Object), Purple (Material) and Red (Time
Appellation).

NLP on English grey literature

Pilot rule-based system
Early stages but promising semantic enrichment
for a range of archaeological concepts
e.g.
• artefacts (“vessel/vaatwerk”)
• materials (“charcoal/houtskool”)
• monuments (“Castle/Kasteel”)
• contexts (“grave/graf”)
• temporal entities
both numerical (“3025 BC/3025 v.Chr”)
and time appellations (“Neolithic/Neolithicum”)
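
A rule-based annotator of this kind can be sketched roughly as follows. This is an illustration only, not the pilot system: the gazetteer holds just the term pairs quoted above, the label "Numerical Date" is an assumed name, and the date rule is a simplified regular expression.

```python
import re

# Tiny gazetteers built only from the English/Dutch term pairs quoted on the slide.
GAZETTEER = {
    "Physical Object": ["vessel", "vaatwerk"],
    "Material": ["charcoal", "houtskool"],
    "Monument": ["castle", "kasteel"],
    "Context": ["grave", "graf"],
    "Time Appellation": ["neolithic", "neolithicum"],
}

# Simplified rule for numerical dates such as "3025 BC" or "3025 v.Chr".
DATE_RULE = re.compile(r"\b\d{1,4}\s?(?:BC|v\.Chr)\b", re.IGNORECASE)


def annotate(text):
    """Return sorted (start, end, label, matched_text) spans found in `text`."""
    spans = []
    for label, terms in GAZETTEER.items():
        for term in terms:
            # Whole-word, case-insensitive lookup of each vocabulary term.
            pattern = r"\b" + re.escape(term) + r"\b"
            for m in re.finditer(pattern, text, re.IGNORECASE):
                spans.append((m.start(), m.end(), label, m.group()))
    for m in DATE_RULE.finditer(text):
        spans.append((m.start(), m.end(), "Numerical Date", m.group()))
    return sorted(spans)


if __name__ == "__main__":
    for span in annotate("A Neolithic vessel with charcoal, dated 3025 BC."):
        print(span)
```

Because the lookup matches whole words only, a Dutch compound that merely contains "graf" would not be annotated.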

NLP challenges
Generalisation of the English rule-based techniques to Dutch (in this case)
faces various challenges:
• different set of vocabularies
(archaeology has very specific terminology and it is important to use it)
• differences in language characteristics
such as compound noun forms
e.g.
• “beslagplaat” - both “beslag” and “plaat” known vocabulary
• “aardewerkmagering” - aardewerk (pottery) known
but “magering” not known
• Current work investigating gazetteers operating on part matching, in
order to overcome the ‘whole word’ restriction.
• Mapping between vocabularies essential to actually use the results!
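
Part matching on compounds might look something like the sketch below. It is an assumption about one possible approach, not the method under investigation in ARIADNE: a greedy left-to-right split that prefers the longest known vocabulary term at each position and collects everything else as an unknown fragment. The vocabulary contains only the three known parts named above.

```python
# Known vocabulary parts taken from the slide's two examples.
KNOWN_TERMS = {"beslag", "plaat", "aardewerk"}


def split_compound(word, vocabulary=KNOWN_TERMS):
    """Split `word` into (fragment, is_known) pairs using longest-match lookup."""
    word = word.lower()
    parts = []
    unknown = ""  # characters not covered by any vocabulary term so far
    i = 0
    while i < len(word):
        # Longest vocabulary term starting at position i, if any.
        match = max(
            (t for t in vocabulary if word.startswith(t, i)),
            key=len,
            default=None,
        )
        if match:
            if unknown:
                parts.append((unknown, False))
                unknown = ""
            parts.append((match, True))
            i += len(match)
        else:
            unknown += word[i]
            i += 1
    if unknown:
        parts.append((unknown, False))
    return parts


if __name__ == "__main__":
    print(split_compound("beslagplaat"))
    print(split_compound("aardewerkmagering"))
```

Plain substring matching of this kind can produce false positives when a short term happens to occur inside an unrelated word, so a practical gazetteer would need further constraints.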

You say potato, I say tomato…
• Multiple datasets, multiple organisations,
multiple languages
• Unification of data structures may be
possible, BUT…
– Incompatible terminology hinders cross search
and prevents greater interoperability
– Indexing using text is ambiguous, leading to
incorrect search results
– Applications attempting to reuse data must all
individually tackle the same problems
• E.g. Find all the iron age post holes…
• The problem here is in the use of text to
convey meaning – whereas the underlying
concepts are actually the same
• Hence the use of concept-based controlled
vocabularies and mapping between them
Feature                 Period
Post-hole               IRON AGE
Posthole                |ron age
POST HOLE               Iron age?
POSTHLOLE               EARLY IRON AGE
POST HOLE (POSSIBLE)    250 BC
POSTHOLES               C 500-200 B.C.
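
The move from text to concepts can be sketched for the feature labels listed above. This is a hedged illustration: the identifier "concept:posthole" is hypothetical (a real system would use a published thesaurus URI), and the fuzzy matching via Python's difflib is a stand-in for whatever cleaning a data provider actually applies.

```python
import difflib
import re

# Hypothetical concept identifier, keyed by a normalised label.
LABEL_TO_CONCEPT = {"posthole": "concept:posthole"}


def normalise(label):
    """Drop qualifiers such as "(POSSIBLE)", then case, spaces, hyphens and digits."""
    label = re.sub(r"\(.*?\)", "", label)
    return re.sub(r"[^a-z]", "", label.lower())


def to_concept(label, cutoff=0.85):
    """Map a free-text label to a concept identifier, or None if nothing is close."""
    key = normalise(label)
    close = difflib.get_close_matches(key, list(LABEL_TO_CONCEPT), n=1, cutoff=cutoff)
    return LABEL_TO_CONCEPT[close[0]] if close else None


if __name__ == "__main__":
    variants = ["Post-hole", "Posthole", "POST HOLE", "POSTHLOLE",
                "POST HOLE (POSSIBLE)", "POSTHOLES"]
    for v in variants:
        print(v, "->", to_concept(v))
```

Once every record carries the same concept identifier, a search for post holes no longer depends on how each organisation typed the word. Fuzzy matching can also mis-assign labels, so results of this kind would need review.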

Mapping links:
many-to-many vs. hub architecture
Number of bidirectional links when linking
between multiple thesauri
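
The difference in link counts can be worked through with a short calculation. It assumes that every pair of thesauri is mapped directly in the many-to-many case, and that the hub is a separate vocabulary to which each thesaurus is mapped once.

```python
def pairwise_links(n):
    """Bidirectional links needed to map every pair of n thesauri directly."""
    return n * (n - 1) // 2


def hub_links(n):
    """Bidirectional links needed when each of n thesauri maps to one separate hub."""
    return n


if __name__ == "__main__":
    for n in (3, 5, 10, 20):
        print(n, pairwise_links(n), hub_links(n))
```

The pairwise count grows quadratically with the number of thesauri, while the hub count grows linearly.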

Multilingual Mapping Experiment
• Explored the potential of a mediating structure
(a ‘mapping spine’) to support search in the
ARIADNE Registry across metadata expressed
via partner vocabularies in different languages.
• The mapping spine was expressed as a poly-
hierarchical structure using RDF (SKOS).
• Experimental mappings from partner
vocabulary resources (DAI, DANS/RCE, FASTI,
EH, ICCD) to the concept identifiers of the
central spine were expressed in RDF using
standard SKOS mapping relationships.

Example mappings from ICCD (Italian)
vocabulary to mapping spine
@prefix iccd: <http://www.iccd.beniculturali.it/monuments/> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix aat: <http://vocab.getty.edu/aat/> .

# NOTE: iccd URIs have been invented for this example
iccd:catacomba skos:prefLabel "catacomba"@it ;
    skos:closeMatch aat:300000367 .
# The skos:closeMatch targets of the remaining terms are not shown on the slide
iccd:cenotafio skos:prefLabel "cenotafio"@it .
iccd:cimitero skos:prefLabel "cimitero"@it .
iccd:colombario skos:prefLabel "colombario"@it .
iccd:dolmen skos:prefLabel "dolmen"@it .
iccd:mausoleo skos:prefLabel "mausoleo"@it .
iccd:menhir skos:prefLabel "menhir"@it .
iccd:necropoli skos:prefLabel "necropoli"@it .
iccd:sepolcreto-rupestre skos:prefLabel "sepolcreto rupestre"@it .
iccd:tomba skos:prefLabel "tomba"@it .
Google Translate (https://translate.google.com/) was used to determine English translations of the ICCD terminology; these terms were then
manually mapped to AAT concepts.

Multilingual Mapping Experiment
• Results from an example query using a concept
identifier for “cemetery” from a partner vocabulary
are shown, where the search is programmed to locate
vocabulary concepts from any partner vocabulary
mapped into the mapping spine at that level or
below (expanded to more specific concepts).
• The different partner vocabularies can be seen in the
prefix to each concept (e.g. iccd is the archaeological
site type vocabulary of the Italian ICCD, Istituto
Centrale per il Catalogo e la Documentazione).

Cross searching and expanding the
mapped vocabularies
• The results show that a query on a concept
from one partner (FASTI) vocabulary has
located (multilingual) concepts originating
from five different controlled vocabularies,
all related via the mapping spine (AAT)
structure.
• The query has also included semantic
expansion to more specific concepts.
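
The query logic described here can be sketched in a few lines. Every identifier below is a hypothetical stand-in: the "spine:" concepts and their hierarchy are invented rather than taken from AAT, and the partner concepts and their mappings are illustrative rather than the experiment's data.

```python
# Invented spine hierarchy: concept -> more specific (narrower) concepts.
NARROWER = {
    "spine:cemetery": ["spine:catacomb", "spine:necropolis"],
    "spine:catacomb": [],
    "spine:necropolis": [],
}

# Invented mappings: partner vocabulary concept -> spine concept.
MAPPINGS = {
    "fasti:cemetery": "spine:cemetery",
    "eh:cemetery": "spine:cemetery",
    "iccd:cimitero": "spine:cemetery",
    "iccd:catacomba": "spine:catacomb",
    "iccd:necropoli": "spine:necropolis",
    "iccd:menhir": "spine:standing-stone",
}


def expand(concept, narrower=NARROWER):
    """Return the spine concept together with all of its more specific concepts."""
    seen, stack = set(), [concept]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(narrower.get(current, []))
    return seen


def cross_search(partner_concept, mappings=MAPPINGS):
    """Find partner concepts mapped to the same spine concept or anything below it."""
    targets = expand(mappings[partner_concept])
    return sorted(p for p, s in mappings.items() if s in targets)


if __name__ == "__main__":
    print(cross_search("fasti:cemetery"))
```

In the experiment itself this kind of traversal would be expressed as a query over the RDF mappings rather than over Python dictionaries.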

Standards again
• The experiment is only possible because of
the standards based approach that has been
followed by ARIADNE and which underpins
Linked Data.
• In the next phase of the Registry
development, it would be a straightforward
query to find all collection items indexed
using any of these multilingual, multi-
vocabulary concepts.

Contact Information
Douglas Tudhope
Hypermedia Research Group
Faculty of Computing, Engineering and Science
University of South Wales
Pontypridd CF37 1DL
Wales, UK
douglas.tudhope@southwales.ac.uk
Related links
http://www.heritagedata.org
http://data.archaeologydataservice.ac.uk
http://hypermedia.research.glam.ac.uk/kos/STELLAR/
http://hypermedia.research.southwales.ac.uk/kos/

Disclaimer
ARIADNE is a project funded by the European Commission under the
Community’s Seventh Framework Programme, contract no. FP7-
INFRASTRUCTURES-2012-1-313193.
The views and opinions expressed in this presentation are the sole
responsibility of the authors and do not necessarily reflect the views
of the European Commission.
