2. LDIF
Provides an expressive mapping language for translating data from
the various vocabularies used on the Web into a
consistent, local target vocabulary [Schultz et al., 2011]
3. CHALLENGES
Vocabulary heterogeneity – a wide range of different
RDF vocabularies is used to represent data about the same
type of entity.
URI aliases – the same real-world entity is identified
with different URIs within different data sources.
4. SOLUTION
Have all data describing one class of entities
represented using the same vocabulary;
Have all triples describing the same entity
share the same subject URI.
5. TARGET
Vocabulary mapping = translate data to a single target vocabulary
Identity resolution = replace URI aliases with a single target URI on
the client's side (based on user-provided matching heuristics)
while keeping track of data provenance
(using the Named Graphs data model)
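For illustration, a minimal N-Quads sketch of what these two steps achieve; all URIs, property names and graph names below are made up for the example:

  # Before integration: two sources describe the same city with different
  # vocabularies and different subject URIs, each in its own named graph.
  <http://dbpedia.org/resource/Berlin> <http://dbpedia.org/ontology/populationTotal> "3600000" <http://example.org/graphs/sourceA> .
  <http://sws.geonames.org/2950159/> <http://example.org/vocabB/population> "3600000" <http://example.org/graphs/sourceB> .

  # After vocabulary mapping and identity resolution: one target vocabulary,
  # one subject URI, provenance still visible through the named graphs.
  <http://dbpedia.org/resource/Berlin> <http://example.org/target/population> "3600000" <http://example.org/graphs/sourceA> .
  <http://dbpedia.org/resource/Berlin> <http://example.org/target/population> "3600000" <http://example.org/graphs/sourceB> .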
6. INTEGRATION PIPELINE STEPS
1. COLLECT DATA: replicate data sets locally via file download,
crawling and SPARQL;
2. MAP TO SCHEMA: translate data from the various vocabularies
used on the Web into a consistent, local target vocabulary using
an expressive mapping language;
3. RESOLVE IDENTITIES: identity resolution component – replace URI
aliases;
4. OUTPUT: integrated data in a single file + provenance tracking
(Named Graphs data model).
9. SCHEDULER
Used for triggering pending data import jobs or integration jobs;
Configured with an XML document;
Updates the representation of external sources in the local cache;
Has the following elements (a configuration sketch follows this slide):
properties: path to a Java properties file for configuration parameters;
dataSources: directory containing the data source configurations;
importJobs: directory containing the import job configurations;
integrationJob: configuration of the integration job to run over the imported data;
dumpLocation: directory where local dumps are cached
Supports relative and absolute paths
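A minimal sketch of such a scheduler configuration, assembled only from the elements listed above; the file names and paths are illustrative, and the exact element spelling and namespace may differ in the LDIF release:

  <scheduler>
    <!-- path to a Java properties file with configuration parameters -->
    <properties>scheduler.properties</properties>
    <!-- directory containing the data source configurations -->
    <dataSources>datasources/</dataSources>
    <!-- directory containing the import job configurations -->
    <importJobs>importJobs/</importJobs>
    <!-- configuration of the integration job to run over the imported data -->
    <integrationJob>integration-job.xml</integrationJob>
    <!-- directory where local dumps are cached (relative or absolute path) -->
    <dumpLocation>dumps/</dumpLocation>
  </scheduler>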
12. DATA IMPORT
Replicate data sets locally;
Different types of import jobs generate provenance
metadata, tracked throughout the integration process;
Managed by a scheduler configured to refresh (e.g.
hourly, daily) the local cache for each source.
13. DATA IMPORT
Elements (a configuration sketch follows this slide):
internalId: unique ID used to internally track the import job and its
files (i.e. data and provenance);
dataSource: reference to a data source, stating from which source this
job imports data;
one import job element: exactly one per importJob, specifying the kind of job (see the next slide);
refreshSchedule: how often the local copy of the source is refreshed (e.g. hourly, daily).
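A sketch of an import job configuration built from the elements on this slide; the values are illustrative, and the inner job element (here a quad dump import with an assumed dumpLocation child) is an assumption about the schema:

  <importJob>
    <!-- unique ID used to internally track the job and its data/provenance files -->
    <internalId>dbpedia.films</internalId>
    <!-- reference to the data source this job imports from -->
    <dataSource>DBpedia</dataSource>
    <!-- how often the scheduler refreshes the local copy (e.g. hourly, daily) -->
    <refreshSchedule>daily</refreshSchedule>
    <!-- exactly one job element per importJob; here a quad dump import
         pointing at a remote dump (element name assumed) -->
    <quadImportJob>
      <dumpLocation>http://example.org/dumps/films.nq.gz</dumpLocation>
    </quadImportJob>
  </importJob>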
14. DATA IMPORT
Mechanisms to import external data:
Quad Import Job – import N-Quad dumps
Triple Import Job – import RDF/N-Triple dumps
Crawl Import Job – import by dereferencing URIs as RDF data, using the LDSpider
Web Crawling Framework
SPARQL Import Job – import by querying a SPARQL endpoint
15. TRIPLE/QUAD DUMP IMPORT
Download a file containing the data set;
Difference between the two: LDIF generates a
provenance graph for a triple dump import, whereas it
takes the graphs given in a quad dump import as
provenance graphs.
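A small N-Quads sketch of the difference; all URIs, including the generated provenance graph URI, are made up for the example:

  # Triple dump input carries no graph; after import, LDIF stores the
  # triple in a generated provenance graph, e.g.:
  <http://example.org/film/42> <http://xmlns.com/foaf/0.1/name> "Blade Runner" <http://example.org/provenance/importJob1> .

  # Quad dump input: the graph that already comes with the quad is kept
  # as the provenance graph:
  <http://example.org/film/42> <http://xmlns.com/foaf/0.1/name> "Blade Runner" <http://source.example.org/graphs/films> .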
16. CRAWLER IMPORT
Data sets that can be accessed only via
dereferenceable URIs are good candidates for a
crawler;
Each crawled URI is put into a separate named graph
for provenance tracking
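A hypothetical sketch of a crawl import job body; the seedUris, levels and resourceLimit elements are assumptions about the configuration schema, not confirmed element names:

  <crawlImportJob>
    <!-- URIs the crawler starts dereferencing from (assumed element names) -->
    <seedUris>
      <uri>http://dbpedia.org/resource/Berlin</uri>
    </seedUris>
    <!-- how many link levels LDSpider follows from the seeds (assumed) -->
    <levels>2</levels>
    <!-- upper bound on the number of resources to fetch (assumed) -->
    <resourceLimit>5000</resourceLimit>
  </crawlImportJob>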
17. SPARQL IMPORT
The relevant data to be queried can be further
specified in the configuration file for a SPARQL import
job;
Data from each SPARQL import job gets tracked by its
own named graph.
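A hypothetical sketch of a SPARQL import job body; the endpointLocation and sparqlPatterns/pattern elements are assumptions, used only to show how the relevant data could be restricted:

  <sparqlImportJob>
    <!-- SPARQL endpoint to query -->
    <endpointLocation>http://dbpedia.org/sparql</endpointLocation>
    <!-- restrict the import to data matching these graph patterns (assumed) -->
    <sparqlPatterns>
      <pattern>?movie a &lt;http://dbpedia.org/ontology/Film&gt;</pattern>
    </sparqlPatterns>
  </sparqlImportJob>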
19. INTEGRATION RUNTIME ENVIRONMENT
Manages the data flow between the various
stages/modules, the caching of intermediate
results and the execution of the different
modules for each stage.
Mechanisms: data input, transformation, data
output, and runtime environments.
20. INTEGRATION RUNTIME ENVIRONMENT
Mechanisms:
Data Input: input data is expected to be represented as Named Graphs and stored locally in
N-Quads format;
Transformation: LDIF provides transformation modules for vocabulary mapping and identity
resolution:
R2R Data Translation (a mapping sketch follows this slide)
Silk Identity Resolution – Silk Link Discovery Framework
Data Output: the supported writers are the N-Quads Writer and the N-Triples Writer
Runtime Environments: depending on the size of the dataset and the available computing
resources:
Single machine / In-memory – keeps all intermediate results in memory (fast, but limited scalability);
Single machine / RDF Store – uses the Jena TDB RDF store to hold intermediate results; the runtime
environment communicates with the RDF store through SPARQL queries (allows the processing of datasets
that don't fit into memory, but is slower);
Cluster / Hadoop – parallelizes the work onto multiple machines using Hadoop.
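As referenced above, a sketch of an R2R mapping of the kind used for vocabulary mapping; it assumes the r2r:Mapping class and the sourcePattern/targetPattern/prefixDefinitions properties of the R2R vocabulary, and all other URIs, prefixes and property names are illustrative:

  @prefix r2r: <http://www4.wiwiss.fu-berlin.de/bizer/r2r/> .

  <http://example.org/mappings/filmTitle>
      a r2r:Mapping ;
      r2r:prefixDefinitions "dbo: <http://dbpedia.org/ontology/> . tgt: <http://example.org/target/>" ;
      # rewrite the source property dbo:originalTitle into the target property tgt:title
      r2r:sourcePattern "?s dbo:originalTitle ?title" ;
      r2r:targetPattern "?s tgt:title ?title" .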
21. FURTHER STEPS
• Data Quality Evaluation and Data Fusion Module: should allow data to be filtered
according to different data quality assessment policies and provide for fusing Web
data according to different conflict resolution methods;
• Flexible integration workflow: make the workflow and its configuration more
flexible in order to make it easier to include additional modules to cover other
data integration aspects.
22. REFERENCES
• Andreas Schultz, Andrea Matteini, Robert Isele, Christian Bizer, Christian
Becker (2012) “LDIF – Linked Data Integration Framework”. Available
online: http://www4.wiwiss.fu-berlin.de/bizer/ldif/, retrieved 06.02.2012
(since that link is no longer active, try:
http://www.wiwiss.fu-berlin.de/en/institute/pwo/bizer/news/ldif03released.html)
• Andreas Schultz, Andrea Matteini, Robert Isele, Christian Bizer, Christian
Becker (2011) “LDIF – Linked Data Integration Framework”. 2nd
International Workshop on Consuming Linked Data (COLD 2011), Bonn, Germany,
October 2011.