UnifiedViews: Towards ETL Tool
for Simple yet Powerful RDF Data Management
T. Knap, P. Škoda, J. Klímek, M. Nečaský
http://xrg.cz | knap@ksi.mff.cuni.cz
XML and Web Engineering Research Group
Faculty of Mathematics and Physics
Charles University in Prague, Czech Republic
Dateso 2015
Agenda
 UnifiedViews
 Basic concepts, Impact
 Areas of ongoing and future work
UnifiedViews
Basic Concepts, Impact
UnifiedViews
 an Extract-Transform-Load (ETL)
framework with UI that allows users to
define, execute, monitor, debug, schedule,
and share RDF data processing tasks
 UnifiedViews differs from other ETL
frameworks by natively supporting processing
of RDF data.
A Pipeline
 Every data processing task is modelled as a pipeline in
UnifiedViews
 Every pipeline consists of one or more DPUs (data
processing units) and arrows depicting data flow
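A minimal sketch of the pipeline model described above, assuming a pipeline can be represented as named DPUs plus directed arrows. The class `Pipeline`, its methods, and the DPU names are hypothetical illustrations, not the UnifiedViews API.

```python
from dataclasses import dataclass, field


@dataclass
class Pipeline:
    """Hypothetical model: DPUs are nodes, arrows are directed data-flow edges."""
    name: str
    dpus: list[str] = field(default_factory=list)
    arrows: list[tuple[str, str]] = field(default_factory=list)

    def add_dpu(self, dpu: str) -> None:
        # Each DPU appears at most once on the pipeline in this sketch.
        if dpu in self.dpus:
            raise ValueError(f"duplicate DPU: {dpu}")
        self.dpus.append(dpu)

    def connect(self, source: str, target: str) -> None:
        # An arrow may only join DPUs that are already on the pipeline.
        for dpu in (source, target):
            if dpu not in self.dpus:
                raise ValueError(f"unknown DPU: {dpu}")
        self.arrows.append((source, target))

    def successors(self, dpu: str) -> list[str]:
        # DPUs that receive data directly from the given DPU.
        return [target for source, target in self.arrows if source == dpu]


# Illustrative pipeline; the DPU names are made up for this example.
pipeline = Pipeline("tabular-to-rdf")
for dpu_name in ("extract_csv", "to_rdf", "load_store"):
    pipeline.add_dpu(dpu_name)
pipeline.connect("extract_csv", "to_rdf")
pipeline.connect("to_rdf", "load_store")
```

Keeping arrows as explicit edges, rather than nesting DPUs inside each other, lets one DPU feed several others.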
A Data Processing Unit (DPU)
 Plugin, which encapsulates certain functionality, typically on top of
RDF data
 Users may prepare custom plugins
 Every DPU has its inputs, outputs, business logic and configuration
 E.g., a DPU may apply a SPARQL Update query to the input RDF data and
produce output RDF data
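The four parts of a DPU (inputs, outputs, business logic, configuration) can be sketched as follows. This is an illustration under stated assumptions: RDF data is modelled as a plain Python set of string triples, and `RenamePredicateConfig` and `RenamePredicateDpu` are hypothetical names, not classes from UnifiedViews. The logic mirrors what a SPARQL Update query could do; the equivalent query is shown in a comment.

```python
from dataclasses import dataclass

# Assumption: a triple is (subject, predicate, object), all plain strings.
Triple = tuple[str, str, str]


@dataclass(frozen=True)
class RenamePredicateConfig:
    """Configuration of the hypothetical DPU."""
    old_predicate: str
    new_predicate: str


class RenamePredicateDpu:
    """Hypothetical DPU: input triples in, transformed triples out."""

    def __init__(self, config: RenamePredicateConfig) -> None:
        self.config = config

    def execute(self, input_triples: set[Triple]) -> set[Triple]:
        # Business logic, roughly equivalent to the SPARQL Update:
        #   DELETE { ?s <old> ?o } INSERT { ?s <new> ?o } WHERE { ?s <old> ?o }
        output: set[Triple] = set()
        for s, p, o in input_triples:
            if p == self.config.old_predicate:
                p = self.config.new_predicate
            output.add((s, p, o))
        return output


dpu = RenamePredicateDpu(RenamePredicateConfig("ex:name", "rdfs:label"))
result = dpu.execute({
    ("ex:prague", "ex:name", "Prague"),
    ("ex:prague", "rdf:type", "ex:City"),
})
```

Separating the configuration object from the logic means the same DPU can be reused on different pipelines with different settings.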
Key Features
 Web administration interface:
 Define and manage pipelines
 Validate, execute, monitor and debug pipelines
 Possibility to schedule tasks, set up notifications about the pipeline executions
 Define and manage DPUs
 Possibility to debug inputs to/outputs from DPUs
 Possibility to share pipelines and DPUs
 Possibility to get notifications about the result of the pipeline execution
 Multi-user environment
 Engine running the tasks
 Ensures that DPUs on the pipeline are executed in the proper order
 It may send notifications about the result of the pipeline execution
 Core DPUs to work with RDF data
 Easy way to extend UnifiedViews with your own DPUs
 Every DPU is an OSGi bundle; as a result, two DPUs using two different
versions of the same library may coexist in the framework
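One way an engine could ensure that DPUs are executed in the proper order is a topological sort of the pipeline graph. The sketch below uses Python's standard `graphlib` module; the function `execution_order` and the DPU names are assumptions for illustration, and the slides do not say which algorithm the UnifiedViews engine actually uses.

```python
from graphlib import CycleError, TopologicalSorter


def execution_order(dpus: list[str], arrows: list[tuple[str, str]]) -> list[str]:
    """Return an order in which every DPU runs after all DPUs feeding it."""
    # Map each DPU to the set of DPUs it depends on (its predecessors).
    predecessors: dict[str, set[str]] = {dpu: set() for dpu in dpus}
    for source, target in arrows:
        predecessors.setdefault(source, set())
        predecessors.setdefault(target, set()).add(source)
    # static_order() raises CycleError if the arrows form a cycle.
    return list(TopologicalSorter(predecessors).static_order())


order = execution_order(
    ["load", "extract", "transform"],
    [("extract", "transform"), ("transform", "load")],
)
print(order)
```

A pipeline whose arrows form a cycle has no valid order, so `CycleError` doubles as a validation check before execution.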
Impact of UnifiedViews
 Projects
• OpenData.cz initiative
• INTLIB (2012-2014) – TaCR project
• LOD2 (2011-2014) – EU FP7 project
• UnifiedViews integrated into the LOD2 stack
• COMSODE (2013-2015) – EU FP7 project
• Open Data Node contains UnifiedViews
• YourDataStories (2015+), H2020
• TenForce, Belgium
 also commercial projects
• Semantic Web Company (Austria),
• EEA s.r.o. (SK)
UnifiedViews
Ongoing and Future Work
Automatic Schema Alignment and
Object Linkage
 Object Linkage:
 Motivation: If various datasets use the same identifiers for the same
real-world objects (cities, countries), the level of data integration
increases and the cost of ad-hoc application integration is reduced
 Goal: To automatically discover that certain columns in the processed
tabular data represent certain types of data (e.g., cities, countries) and
to automatically map the values in these columns to Linked Data URIs taken
from the preferred dataset for the given type of data
 Schema Alignment:
 Motivation: Increase understandability of the data and simplify reuse
of the data by various applications by using common vocabularies.
 Goal: To automatically suggest mappings of the RDF vocabulary terms
used in the data (e.g., predicates) to well-known RDF terms
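The object linkage goal above can be sketched as two steps: guess the type of a column by matching its values against known codelists, then map each value to a URI. Everything here is an assumption for illustration: the codelists, the `http://example.org/` URIs, the 0.8 threshold, and the function names are made up, and a plain dictionary lookup stands in for whatever linking rules a real deployment would use.

```python
from typing import Optional

# Hypothetical codelists: type name -> {normalised label: URI}.
CODELISTS: dict[str, dict[str, str]] = {
    "city": {
        "prague": "http://example.org/city/prague",
        "brno": "http://example.org/city/brno",
        "ostrava": "http://example.org/city/ostrava",
    },
    "country": {
        "czech republic": "http://example.org/country/cz",
        "slovakia": "http://example.org/country/sk",
    },
}


def detect_column_type(values: list[str], threshold: float = 0.8) -> Optional[str]:
    """Return the codelist type matching most values, or None if too few match."""
    cleaned = [v.strip().lower() for v in values if v.strip()]
    if not cleaned:
        return None
    best_type, best_score = None, 0.0
    for type_name, codelist in CODELISTS.items():
        # Fraction of non-empty values found in this codelist.
        score = sum(v in codelist for v in cleaned) / len(cleaned)
        if score > best_score:
            best_type, best_score = type_name, score
    return best_type if best_score >= threshold else None


def link_column(values: list[str], type_name: str) -> list[Optional[str]]:
    """Map each value to its URI; unmatched values become None."""
    codelist = CODELISTS[type_name]
    return [codelist.get(v.strip().lower()) for v in values]


column = ["Prague", "Brno", "Ostrava"]
column_type = detect_column_type(column)
links = link_column(column, column_type) if column_type else []
```

The threshold makes the probabilistic nature of the detection explicit: a column is only linked when enough of its values match one codelist.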
Simplicity of Use
 Hiding SPARQL Queries
 Goal: To provide a set of DPUs for executing typical SPARQL query
operations on top of RDF data
 Autocompleting Terms from Well-known Vocabularies
 Goal: To suggest and autocomplete vocabulary terms from well-known
Linked Data vocabularies
• Vocabulary autocomplete-aware controls (text boxes)
• Description of the term, formal def., recommended usage
 Wizards for Simple Definition of Data Processing Tasks
 Motivation: Defining data processing tasks typically requires
detailed knowledge of the DPUs that are available in the
deployed UnifiedViews instance
 Goal: Step-by-step guides for defining new typical types of data
processing tasks, e.g., extracting and publishing tabular data
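The vocabulary autocomplete goal above could be backed by a prefix search over a sorted list of terms. The sketch below is an assumption about one possible approach, not the UnifiedViews implementation; the term list is a tiny illustrative sample and the descriptions are paraphrased placeholders rather than the vocabularies' formal definitions.

```python
from bisect import bisect_left

# Illustrative sample: prefixed term -> short description shown to the user.
TERMS: dict[str, str] = {
    "dcterms:creator": "Entity primarily responsible for making the resource",
    "dcterms:title": "Name given to the resource",
    "foaf:Person": "Class of people",
    "foaf:knows": "Person known by this person",
    "foaf:name": "Name of a thing",
    "skos:prefLabel": "Preferred lexical label of a resource",
}
SORTED_TERMS: list[str] = sorted(TERMS)


def autocomplete(prefix: str, limit: int = 5) -> list[str]:
    """Return up to `limit` terms starting with `prefix`, in sorted order."""
    # All terms sharing a prefix are contiguous in the sorted list,
    # so binary search finds the first candidate directly.
    start = bisect_left(SORTED_TERMS, prefix)
    matches: list[str] = []
    for term in SORTED_TERMS[start:]:
        if len(matches) >= limit or not term.startswith(prefix):
            break
        matches.append(term)
    return matches


suggestions = autocomplete("foaf:")
descriptions = [TERMS[term] for term in suggestions]
```

A text box control would call `autocomplete` on each keystroke and show the description next to each suggestion.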
Sustainability and Quality
 Sustainable RDF Data Processing
 Goal: To allow the task designer to define, for each DPU, a set of
SPARQL queries that test whether the output data
produced by the given DPU satisfies certain conditions. If
possible, automate the creation of such queries.
 Assessing Quality of Produced Data, Recommendation
of Cleansing DPUs
 Motivation: The task designer should be informed about any
problems in the data, e.g., w.r.t. syntactic/semantic
accuracy of the produced Linked Data or completeness of
the published datasets
 Goal: A set of DPUs for assessing the quality of the data and
cleansing the data
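The idea of attaching output checks to a DPU can be sketched as a set of named conditions evaluated against the DPU's output. In this illustration the conditions are plain Python predicates standing in for the SPARQL queries the slide describes, RDF data is modelled as a set of string triples, and all names (`every_subject_has`, `run_checks`, the sample data) are hypothetical.

```python
from typing import Callable

# Assumption: a triple is (subject, predicate, object), all plain strings.
Triple = tuple[str, str, str]
Check = Callable[[set[Triple]], bool]


def every_subject_has(predicate: str) -> Check:
    """Build a check that passes when every subject has the given predicate."""
    def check(triples: set[Triple]) -> bool:
        subjects = {s for s, _, _ in triples}
        covered = {s for s, p, _ in triples if p == predicate}
        return subjects == covered
    return check


def not_empty(triples: set[Triple]) -> bool:
    """Check that the DPU produced at least one triple."""
    return bool(triples)


def run_checks(triples: set[Triple], checks: dict[str, Check]) -> list[str]:
    """Return the names of all failed checks, in definition order."""
    return [name for name, check in checks.items() if not check(triples)]


dpu_output = {
    ("ex:prague", "rdfs:label", "Prague"),
    ("ex:prague", "rdf:type", "ex:City"),
    ("ex:brno", "rdf:type", "ex:City"),  # label missing on purpose
}
failed = run_checks(dpu_output, {
    "not_empty": not_empty,
    "has_label": every_subject_has("rdfs:label"),
    "has_type": every_subject_has("rdf:type"),
})
```

Returning the names of failed checks, rather than a single boolean, tells the task designer which condition the output violated.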
Conclusions
Summary
 UnifiedViews – ETL tool for RDF data
processing
 Basic concepts, Impact
 Areas of ongoing and future work
Would you like to try UnifiedViews?
 UnifiedViews is available under an open-source
license
 GPLv3 + LGPLv3
 Hosted on GitHub
 Repository: https://github.com/UnifiedView
 Latest version: UnifiedViews 2.0.1
 More info:
 unifiedviews.eu
Thank You!
How to contribute?
 Guideline for contributors:
 https://grips.semantic-web.at/display/UDDOC/Guidelines+for+Contributors
Join the Unified Views Team

Editor's Notes

  • #5 Example task. It may employ custom plugins (data processing units, DPUs) created by users. General problem with RDF data processing: consumers have to write most of the logic to define, execute, monitor, schedule, and share RDF data processing tasks.
  • #10 An online platform for data exploitation focused on the financial flows that are critical for transparency, collaboration and participation.
  • #12 To realise 1) (object linkage), it is first necessary to identify that certain columns contain certain types of values. Such identification is always probabilistic and is typically based on comparing the name of the column with the list of names of the RDF classes and/or on matching sample data from the considered column against known codelists, such as a list of Czech cities. Experiments are needed to decide on the particular algorithm for identifying types among the input data. The second step to realise 1) is to apply predefined Silk~\cite{DBLP:conf/www/VolzBGK09} rules for the identified type of data within the column of the input tabular data. To realise 2) (schema alignment), various schema matching techniques have to be evaluated experimentally~\cite{Rahm:2001:SAA:767149.767154}.
  • #14 Evolution of DPUs (Done) Proper handling of version migrations