WWW.LEDS-PROJEKT.DE
LEDS
KNOWLEDGE EXTRACTION
FROM HETEROGENEOUS
SEMI-STRUCTURED DATA SOURCES
MARTIN SEIDEL, MICHAEL KRUG, FRANK BURIAN, MARTIN GAEDKE
12. September 2016
CURRENT SITUATION
• knowledge on the Web is often only available as weakly
interlinked, heterogeneous, semi-structured data
→ no semantic classification
• how to link or merge data?
• how to do semantic queries?
→ not usable in a meaningful way
GOAL
Extraction of knowledge from semi-structured data
• knowledge in terms of semantic metadata
• semantically enriched data can then utilize
the potential of Linked Data
→ provide an automatic process
THE KESEDA APPROACH
• Especially designed to work on JSON data
• Challenges when working with JSON data
→ no schema, only name-value pairs
→ any structure and depth possible
{
  "id": "krug",
  "firstName": "Michael",
  "lastName": "Krug",
  "title": "Dipl.-Inf.",
  "phone": "+49 371 531 39929",
  "email": "michael.krug@informatik.tu-chemnitz.de",
  [...]
}
{
  "id": "2015-007",
  "title": "SmartComposition: ...",
  "author": [ "Michael Krug", "Martin Gaedke" ],
  "year": "2015",
  "type": "Conference Paper",
  "event": {
    "name": "24th International World Wide Web Conference",
    "url": "http://www.www2015.it/"
  },
  [...]
}
Slide callouts: Arrays (e.g. "author"), Objects (e.g. "event")
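Since such data has no schema and may nest to any depth, its structure has to be discovered by walking the tree. The following is a minimal sketch in JavaScript (the prototype's platform is Node.js); the function name `walk` and the path notation are illustrative assumptions and are not taken from KESeDa itself.

```javascript
// Hypothetical helper: list every node of a JSON value with its path and kind.
function walk(value, path = "$", out = []) {
  if (Array.isArray(value)) {
    out.push({ path, kind: "array" });
    value.forEach((item, i) => walk(item, `${path}[${i}]`, out));
  } else if (value !== null && typeof value === "object") {
    out.push({ path, kind: "object" });
    for (const [key, child] of Object.entries(value)) {
      walk(child, `${path}.${key}`, out);
    }
  } else {
    out.push({ path, kind: "value" });
  }
  return out;
}

// Shortened version of the publication example from the slides.
const publication = {
  id: "2015-007",
  author: ["Michael Krug", "Martin Gaedke"],
  event: { name: "24th International World Wide Web Conference" },
};

const nodes = walk(publication);
console.log(nodes.map((n) => `${n.path}: ${n.kind}`).join("\n"));
```

The output lists one line per node, which makes arrays and nested objects visible regardless of how deep they sit.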
• multi-step algorithm
• work in existing JSON structure
• find and store various matches with different weights
• use additional information sources like API descriptions
• assign classes to objects with multiple properties
• link detected entities
1. Differentiation of input sources / formats
2. Preparation of data structure
3. Analysis of property labels
4. Analysis of property values
5. Mapping of classes
6. Generate JSON-LD document
7. Evaluation of results
PROTOTYPE
• prototype implemented in Node.js
• working with properties and classes from:
  • schema.org
  • FOAF
  • Dublin Core
  • GoodRelations
  • Music Ontology
• dictionaries for: first & last names, cities, streets, languages
• list of manually curated synonyms
• option to provide pre-defined mappings
• Web interface for
  • pre-configuration
    • mappings, synonyms, dictionaries
  • data upload
  • result analysis
    • statistics and browsing
PROTOTYPE: CONFIGURATION [slide image not included in transcript]
PROTOTYPE: RESULTS [slide image not included in transcript]
EVALUATION
Algorithm applied to datasets of
1) JSON array of people
2) JSON array of publications
a) Without custom pre-configuration
b) With custom pre-configuration
Initial Setup
• dictionary and structure pattern matching
• label → predicate string matching
• classes and properties: schema.org, FOAF, Dublin Core, GoodRelations
Custom Pre-Configuration
• set of label → predicate mappings (hand-picked for data context)
• list of known synonyms
• more structure patterns
1A) PEOPLE W/O CONFIG [two result slides; figures not included in transcript]
1B) PEOPLE W/ CONFIG [two result slides; figures not included in transcript]
2A) PUBLICATIONS W/O CONFIG [two result slides; figures not included in transcript]
2B) PUBLICATIONS W/ CONFIG [two result slides; figures not included in transcript]
SUMMARY
➙ Approach for extracting knowledge from semi-structured data
➙ by applying a multi-step algorithm
➙ to convert JSON data to RDF
➙ that assigns known classes to objects and maps
their properties to S-P-O (subject-predicate-object) triples
OPEN CHALLENGES
• detect and reuse JSON structure patterns
• disambiguate values
• apply quality control to results
• improve scalability for large datasets
• research application of machine learning
THANK YOU!
MICHAEL.KRUG@INFORMATIK.TU-CHEMNITZ.DE
VSR.INFORMATIK.TU-CHEMNITZ.DE
THE KESEDA APPROACH
1. Differentiation of input sources / formats
• text, file, URL, API
• check for format
• optional conversion of XML to JSON
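The format check in step 1 could look roughly like the following sketch. The function name `detectFormat`, its return values, and the XML heuristic are assumptions made for illustration; the slides do not describe how the prototype performs this check. An actual XML-to-JSON conversion would need a parser library and is left out here.

```javascript
// Hypothetical sketch of step 1: decide how a raw text input should be treated.
// Returns "json", "xml" or "unknown"; these labels are illustrative only.
function detectFormat(text) {
  const trimmed = text.trim();
  try {
    const parsed = JSON.parse(trimmed);
    // Only objects and arrays are useful as structured input.
    if (parsed !== null && typeof parsed === "object") return "json";
  } catch (err) {
    // Not valid JSON; fall through to the XML heuristic.
  }
  // Crude heuristic, not a real XML parser: the text looks like markup.
  if (trimmed.startsWith("<") && trimmed.endsWith(">")) return "xml";
  return "unknown";
}

console.log(detectFormat('{"id": "krug"}'));
console.log(detectFormat("<person><id>krug</id></person>"));
console.log(detectFormat("just some text"));
```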
2. Preparation of data structure
• pre-process JSON tree to store matches and mappings
• keep original structure to preserve hierarchy for later
relations
• detect arrays and objects for separate processing
• clean up: remove empty entries
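One way to picture step 2 is a wrapper tree that mirrors the input and gives each node a place to store matches. The sketch below is an assumption about how this could be done, not the prototype's data structure; the names `prepare` and `isEmpty` and the `note` field in the sample are invented for the example. The clean-up shown is shallow: it drops entries that are empty on input and does not prune containers that become empty afterwards.

```javascript
// Hypothetical sketch of step 2: wrap each node so matches can be stored
// next to the data while the original hierarchy is preserved.
function isEmpty(value) {
  if (value === null || value === undefined || value === "") return true;
  if (Array.isArray(value)) return value.length === 0;
  if (typeof value === "object") return Object.keys(value).length === 0;
  return false;
}

function prepare(value, label = null) {
  const node = { label, kind: "value", matches: [], children: [] };
  if (Array.isArray(value)) {
    node.kind = "array";
    // Array items inherit the label of the property that holds them.
    node.children = value.filter((v) => !isEmpty(v)).map((v) => prepare(v, label));
  } else if (typeof value === "object" && value !== null) {
    node.kind = "object";
    node.children = Object.entries(value)
      .filter(([, v]) => !isEmpty(v))
      .map(([key, v]) => prepare(v, key));
  } else {
    node.value = value;
  }
  return node;
}

const tree = prepare({
  id: "2015-007",
  note: "", // invented empty entry, removed during clean-up
  author: ["Michael Krug", "Martin Gaedke"],
  event: { name: "24th International World Wide Web Conference" },
});
console.log(tree.children.map((c) => `${c.label}: ${c.kind}`));
```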
3. Analysis of property labels
• string matching (substrings, prefixes, …)
• synonyms
• pre-defined mappings
• use metadata from API description, if available
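The label analysis in step 3 can be sketched as a function that collects weighted candidates from several sources. Everything concrete below is an assumption for illustration: the predicate list is a small hand-picked excerpt in the spirit of schema.org, and the synonyms, mappings, and weights are made-up sample values rather than the prototype's configuration.

```javascript
// Hypothetical sketch of step 3. Vocabulary terms, synonyms, mappings and
// weights below are sample data, not the prototype's configuration.
const predicates = ["givenName", "familyName", "email", "telephone", "name"];
const synonyms = new Map([
  ["phone", "telephone"],
  ["firstName", "givenName"],
  ["lastName", "familyName"],
]);
const mappings = new Map([["title", "honorificPrefix"]]);

function matchLabel(label) {
  const matches = [];
  const lower = label.toLowerCase();
  if (mappings.has(label)) {
    matches.push({ predicate: mappings.get(label), source: "mapping", weight: 1.0 });
  }
  if (synonyms.has(label)) {
    matches.push({ predicate: synonyms.get(label), source: "synonym", weight: 0.8 });
  }
  for (const predicate of predicates) {
    const p = predicate.toLowerCase();
    if (p === lower) {
      matches.push({ predicate, source: "exact", weight: 0.9 });
    } else if (p.includes(lower) || lower.includes(p)) {
      matches.push({ predicate, source: "substring", weight: 0.5 });
    }
  }
  // Strongest candidate first; all candidates are kept for later steps.
  return matches.sort((a, b) => b.weight - a.weight);
}

console.log(matchLabel("firstName"));
console.log(matchLabel("email"));
```

Keeping every candidate instead of only the best one matches the idea of storing various matches with different weights, so that a later step can still pick a weaker match if it fits the chosen class better.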
4. Analysis of property values
• dictionaries
• structure patterns (uri, date, address, color…)
• data types (date, time, number, boolean…)
• (lower weighted)
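A sketch of the value analysis in step 4, under stated assumptions: the dictionary holds only three sample first names, the regular expressions are deliberately simple, and the weights are invented. The only point taken from the slide is that the plain data type counts less than dictionary or pattern hits.

```javascript
// Hypothetical sketch of step 4. Dictionary entries, patterns and weights
// are illustrative; the regular expressions are deliberately simple.
const firstNames = new Set(["Michael", "Martin", "Frank"]);
const patterns = [
  { name: "uri", regex: /^https?:\/\/\S+$/ },
  { name: "email", regex: /^[^@\s]+@[^@\s]+\.[^@\s]+$/ },
  { name: "date", regex: /^\d{4}-\d{2}-\d{2}$/ },
];

function analyzeValue(value) {
  const matches = [];
  if (typeof value === "string") {
    if (firstNames.has(value)) {
      matches.push({ hint: "firstName", source: "dictionary", weight: 0.7 });
    }
    for (const { name, regex } of patterns) {
      if (regex.test(value)) {
        matches.push({ hint: name, source: "pattern", weight: 0.6 });
      }
    }
  }
  // The plain data type is the weakest evidence, so it gets the lowest weight.
  matches.push({ hint: typeof value, source: "datatype", weight: 0.2 });
  return matches;
}

console.log(analyzeValue("Michael"));
console.log(analyzeValue("michael.krug@informatik.tu-chemnitz.de"));
console.log(analyzeValue(2015));
```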
5. Mapping of classes
• find class by number of matched properties
• select match that is most appropriate for chosen class
• take different weights into account
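Step 5 can be illustrated with a simple scoring scheme: for each candidate class, sum the weights of the best-fitting match per property and pick the class with the highest total. This scoring rule is an assumption chosen to reflect the three bullets above, not the documented KESeDa formula, and the class table is a tiny hand-written excerpt.

```javascript
// Hypothetical sketch of step 5: score candidate classes by the summed
// weights of the object's matched properties. The class table is a small
// hand-written excerpt in the spirit of schema.org, not the full vocabulary.
const classProperties = {
  Person: ["givenName", "familyName", "email", "telephone"],
  Organization: ["name", "email", "telephone", "url"],
};

// matched: one list of weighted candidates per JSON property label.
function chooseClass(matched) {
  let best = { className: null, score: 0, assigned: {} };
  for (const [className, props] of Object.entries(classProperties)) {
    let score = 0;
    const assigned = {};
    for (const [label, candidates] of Object.entries(matched)) {
      // Per property, keep the strongest candidate that fits this class.
      const fitting = candidates
        .filter((c) => props.includes(c.predicate))
        .sort((a, b) => b.weight - a.weight)[0];
      if (fitting) {
        score += fitting.weight;
        assigned[label] = fitting.predicate;
      }
    }
    if (score > best.score) best = { className, score, assigned };
  }
  return best;
}

const result = chooseClass({
  firstName: [{ predicate: "givenName", weight: 0.8 }, { predicate: "name", weight: 0.5 }],
  lastName: [{ predicate: "familyName", weight: 0.8 }, { predicate: "name", weight: 0.5 }],
  email: [{ predicate: "email", weight: 0.9 }],
});
console.log(result.className, result.assigned);
```

In this sample the weaker "name" candidates would fit an Organization, but the stronger name-part matches give Person the higher total, so the per-property choice follows the selected class.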
6. Generate JSON-LD document
• use matches and mappings
• link entities depending on JSON tree structure
• validation of output
• optional conversion to various RDF formats
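For step 6, a flat object can be turned into a JSON-LD document roughly as sketched below. The function `toJsonLd`, the base IRI `http://example.org/people/`, and the given label-to-predicate assignments are assumptions for the example; `@context`, `@type`, and `@id` are standard JSON-LD keywords. Linking nested entities, validation, and conversion to other RDF formats are not covered by this sketch.

```javascript
// Hypothetical sketch of step 6: build a JSON-LD document from a chosen
// class and label-to-predicate assignments (both assumed as given here).
function toJsonLd(data, className, assigned, idBase) {
  const doc = { "@context": "https://schema.org/", "@type": className };
  if (data.id !== undefined) doc["@id"] = idBase + encodeURIComponent(data.id);
  for (const [label, predicate] of Object.entries(assigned)) {
    if (data[label] !== undefined) doc[predicate] = data[label];
  }
  return doc;
}

const person = { id: "krug", firstName: "Michael", lastName: "Krug" };
const doc = toJsonLd(
  person,
  "Person",
  { firstName: "givenName", lastName: "familyName" },
  "http://example.org/people/" // placeholder base IRI
);
console.log(JSON.stringify(doc, null, 2));
```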
7. Evaluation of results
• manual or automatic comparison of actual vs. desired
result to reweight matching components
• store correctly applied mappings for later reuse
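The reweighting idea in step 7 could be sketched as follows. The update rule (a fixed step up for a correct suggestion, down for a wrong one, clamped to the range 0 to 1) and the rate of 0.1 are assumptions for illustration; the slides do not state how the prototype adjusts its weights.

```javascript
// Hypothetical sketch of step 7: nudge the weight of each matching
// component up or down depending on whether its suggestions were correct.
// The update rule and the rate are assumptions for illustration.
function reweight(weights, suggestions, desired, rate = 0.1) {
  const updated = { ...weights };
  for (const { label, predicate, source } of suggestions) {
    const correct = desired[label] === predicate;
    const delta = correct ? rate : -rate;
    // Keep weights inside [0, 1].
    updated[source] = Math.min(1, Math.max(0, updated[source] + delta));
  }
  return updated;
}

const weights = { synonym: 0.8, substring: 0.5 };
const suggestions = [
  { label: "firstName", predicate: "givenName", source: "synonym" },
  { label: "firstName", predicate: "name", source: "substring" },
];
const desired = { firstName: "givenName" };
console.log(reweight(weights, suggestions, desired));
```

The input `weights` object is left unchanged, so the original and the adjusted weights can be compared or stored side by side.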
Michael Krug, Martin Seidel, Frank Burian and Martin Gaedke | KESeDa: Knowledge Extraction from Heterogeneous Semi-Structured Data Sources

  • 1. WWW.LEDS-PROJEKT.DE LEDS KNOWLEDGE EXTRACTION FROM HETEROGENEOUS SEMI-STRUCTURED DATA SOURCES MARTIN SEIDEL, MICHAEL KRUG, FRANK BURIAN, MARTIN GAEDKE 12. September 2016
  • 2. CURRENT SITUATION
      • knowledge in the Web often only available as weakly interlinked, heterogeneous, semi-structured data
        → no semantic classification
      • how to link or merge data?
      • how to do semantic queries?
        → not usable in a meaningful way
  • 3. GOAL
      Extraction of knowledge from semi-structured data
      • knowledge in terms of semantic metadata
      • semantically enriched data then can utilize the potential of Linked Data
        → provide an automatic process
  • 5. THE KESEDA APPROACH
      • Especially designed to work on JSON data
      • Challenges when working with JSON data
        → no schema, only name-value pairs
        → any structure and depth possible
  • 6. THE KESEDA APPROACH
      {
        "id": "krug",
        "firstName": "Michael",
        "lastName": "Krug",
        "title": "Dipl.-Inf.",
        "phone": "+49 371 531 39929",
        "email": "michael.krug@informatik.tu-chemnitz.de",
        [...]
      }
  • 7. THE KESEDA APPROACH
      {
        "id": "2015-007",
        "title": "SmartComposition: ...",
        "author": [ "Michael Krug", "Martin Gaedke" ],
        "year": "2015",
        "type": "Conference Paper",
        "event": {
          "name": "24th International World Wide Web Conference",
          "url": "http://www.www2015.it/"
        },
        [...]
      }
      Slide annotations: Arrays ("author"), Objects ("event")
  • 8. THE KESEDA APPROACH
      • multi-step algorithm
      • work in existing JSON structure
      • find and store various matches with different weights
      • use additional information sources like API descriptions
      • assign classes to objects with multiple properties
      • link detected entities
  • 9. THE KESEDA APPROACH
      1. Differentiation of input sources / formats
      2. Preparation of data structure
      3. Analysis of property labels
      4. Analysis of property values
      5. Mapping of classes
      6. Generate JSON-LD document
      7. Evaluation of results
  • 11. PROTOTYPE
      • prototype implemented in Node.js
      • working with properties and classes from:
          • schema.org
          • foaf
          • dublincore
          • goodrelations
          • music ontology
      • dictionaries for: first & last names, cities, streets, languages
      • list of manually curated synonyms
      • option to provide pre-defined mappings
  • 12. PROTOTYPE
      • Web interface for
          • pre-configuration
              • mappings, synonyms, dictionaries
          • data upload
          • result analysis
              • statistics and browsing
  • 16. EVALUATION
      Algorithm applied to datasets of
      1) JSON array of people
      2) JSON array of publications
      a) Without custom pre-configuration
      b) With custom pre-configuration
  • 17. EVALUATION
      Initial Setup
      • dictionary and structure pattern matching
      • label → predicate string matching
      • classes and properties: schema.org, foaf, dublincore, goodrelations
      Custom Pre-Configuration
      • set of label → predicate mappings (hand-picked for data context)
      • list of known synonyms
      • more structure patterns
  • 18. 1A) PEOPLE W/O CONFIG
  • 19. 1A) PEOPLE W/O CONFIG
  • 20. 2A) PEOPLE W/ CONFIG
  • 21. 2A) PEOPLE W/ CONFIG
  • 22. 1B) PUBLICATIONS W/O CONFIG
  • 23. 1B) PUBLICATIONS W/O CONFIG
  • 24. 2B) PUBLICATIONS W/ CONFIG
  • 25. 2B) PUBLICATIONS W/ CONFIG
  • 27. SUMMARY
      ➙ Approach for extracting knowledge from semi-structured data
      ➙ by applying a multi-step algorithm
      ➙ to convert JSON data to RDF
      ➙ that assigns known classes to objects and maps their properties to S-P-O triples
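To make the "S-P-O triples" of the summary concrete, here is an illustrative Python sketch that turns one classified object into subject-predicate-object tuples. The function name `to_triples` and the subject IRI under `example.org` are assumptions for illustration; the slides do not show how the prototype represents triples internally.

```python
# IRI of rdf:type, used to state the class of a subject.
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

def to_triples(subject, class_iri, properties):
    """Return (subject, predicate, object) tuples for one classified object.

    properties maps predicate IRIs to a single value or a list of values;
    a list yields one triple per item.
    """
    triples = [(subject, RDF_TYPE, class_iri)]
    for predicate, value in properties.items():
        values = value if isinstance(value, list) else [value]
        triples.extend((subject, predicate, item) for item in values)
    return triples

# Hypothetical subject IRI; the predicates are schema.org properties.
person_triples = to_triples(
    "http://example.org/person/krug",
    "http://schema.org/Person",
    {
        "http://schema.org/givenName": "Michael",
        "http://schema.org/familyName": "Krug",
    },
)
```

Each JSON property of the person record ends up as one triple, plus one `rdf:type` triple carrying the assigned class.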
  • 28. OPEN CHALLENGES
      • detect and reuse JSON structure pattern
      • disambiguate values
      • apply quality control to results
      • improve scalability for large datasets
      • research application of machine learning
  • 31. THE KESEDA APPROACH
      1. Differentiation of input sources / formats
      • text, file, URL, API
      • check for format
      • optional conversion of XML to JSON
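A format check with optional XML conversion could look like the following illustrative Python sketch. The function names and the conversion rules (leaf elements become strings, repeated child tags become arrays, attributes are ignored) are assumptions; the slides do not specify how the prototype converts XML.

```python
import json
import xml.etree.ElementTree as ET

def detect_format(text):
    """Return 'json', 'xml', or 'unknown' for a raw input string."""
    try:
        json.loads(text)
        return "json"
    except ValueError:
        pass
    try:
        ET.fromstring(text)
        return "xml"
    except ET.ParseError:
        return "unknown"

def element_to_json(element):
    """Naively convert an XML element into JSON-like Python data."""
    children = list(element)
    if not children:
        return (element.text or "").strip()
    result = {}
    for child in children:
        value = element_to_json(child)
        if child.tag in result:
            # A repeated tag turns the entry into an array.
            if not isinstance(result[child.tag], list):
                result[child.tag] = [result[child.tag]]
            result[child.tag].append(value)
        else:
            result[child.tag] = value
    return result

def xml_to_json(text):
    """Parse an XML string and convert its root element."""
    return element_to_json(ET.fromstring(text))
```

Inputs arriving as a file, URL, or API response would first be read into a string and then passed through `detect_format`.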
  • 32. THE KESEDA APPROACH
      2. Preparation of data structure
      • pre-process JSON tree to store matches and mappings
      • keep original structure to preserve hierarchy for later relations
      • detect arrays and objects for separate processing
      • clean up: remove empty entries
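One way to read this step is as a recursive pass that wraps every node in an annotation record while keeping the original nesting. The Python sketch below is an assumption about the shape of such a tree (the keys `kind`, `children`, `items`, and `matches` are hypothetical), not the prototype's actual data structure.

```python
def is_empty(value):
    """Treat null, empty strings, empty arrays and empty objects as empty."""
    return value is None or value == "" or value == [] or value == {}

def prepare(node):
    """Return an annotated copy of a parsed JSON value.

    Every node records its kind and an initially empty list of matches,
    so later steps can attach candidates without losing the hierarchy.
    """
    if isinstance(node, dict):
        children = {key: prepare(value) for key, value in node.items()
                    if not is_empty(value)}
        return {"kind": "object", "children": children, "matches": []}
    if isinstance(node, list):
        items = [prepare(item) for item in node if not is_empty(item)]
        return {"kind": "array", "items": items, "matches": []}
    return {"kind": "value", "value": node, "matches": []}

record = {
    "id": "2015-007",
    "note": "",
    "author": ["Michael Krug", "Martin Gaedke"],
    "event": {"name": "24th International World Wide Web Conference", "url": None},
}
tree = prepare(record)
```

Note that this sketch only drops entries that are empty in the input; an object that becomes empty after cleanup is kept.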
  • 33. THE KESEDA APPROACH
      3. Analysis of property labels
      • string matching (substrings, prefixes, …)
      • synonyms
      • pre-defined mappings
      • use metadata from API description, if available
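The label analysis can be sketched in Python as a lookup that tries pre-defined mappings first, then synonyms, then exact, prefix, and substring matches against known predicates. The predicate list, synonym table, and all weights below are made-up illustration values; only the matching techniques themselves come from the slide.

```python
# Small hand-picked subset of schema.org properties (assumption).
PREDICATES = ["schema:givenName", "schema:familyName", "schema:email",
              "schema:telephone", "schema:name"]
# Hypothetical synonym table and pre-defined mapping.
SYNONYMS = {"firstname": "givenname", "lastname": "familyname", "phone": "telephone"}
PREDEFINED = {"title": "schema:honorificPrefix"}

def match_label(label):
    """Return candidate (predicate, weight) pairs for a label, best first."""
    if label in PREDEFINED:
        return [(PREDEFINED[label], 1.0)]
    key = label.lower()
    key = SYNONYMS.get(key, key)
    candidates = []
    for predicate in PREDICATES:
        local = predicate.split(":", 1)[1].lower()
        if key == local:
            candidates.append((predicate, 0.9))   # exact match
        elif local.startswith(key) or key.startswith(local):
            candidates.append((predicate, 0.6))   # prefix match
        elif key in local or local in key:
            candidates.append((predicate, 0.4))   # substring match
    return sorted(candidates, key=lambda pair: pair[1], reverse=True)
```

All candidates are kept rather than only the best one, which matches the idea of storing various matches with different weights for the later class-mapping step.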
  • 34. THE KESEDA APPROACH
      4. Analysis of property values
      • dictionaries
      • structure patterns (uri, date, address, color…)
      • data types (date, time, number, boolean…)
      • (lower weighted)
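A value analysis along these lines might combine a dictionary lookup, a few regular expressions, and a data-type check. In the Python sketch below the dictionary contents, the patterns, and the weights are all assumptions; the weights are simply chosen to be small to reflect the "lower weighted" note.

```python
import re

# Tiny stand-in for a first-name dictionary (assumption).
FIRST_NAMES = {"Michael", "Martin", "Frank"}

# Deliberately simple structure patterns, not full validators.
PATTERNS = {
    "uri": re.compile(r"^https?://\S+$"),
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "date": re.compile(r"^\d{4}-\d{2}-\d{2}$"),
}

def analyse_value(value):
    """Return (hint, weight) pairs describing a single property value."""
    hints = []
    if isinstance(value, bool):          # bool first: it is a subclass of int
        hints.append(("boolean", 0.2))
    elif isinstance(value, (int, float)):
        hints.append(("number", 0.2))
    elif isinstance(value, str):
        if value in FIRST_NAMES:
            hints.append(("firstName", 0.5))
        for name, pattern in PATTERNS.items():
            if pattern.match(value):
                hints.append((name, 0.4))
    return hints
```

For instance, the `email` value of the person record would produce an `email` hint even if its label had been uninformative.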
  • 35. THE KESEDA APPROACH
      5. Mapping of classes
      • find class by number of matched properties
      • select match that is most appropriate for chosen class
      • take different weights into account
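Choosing a class by its matched properties can be sketched as a weighted vote: each candidate class scores the sum of the weights of matched predicates it is known to use. The class-to-property table in this Python sketch is a small hand-picked subset of schema.org and the scoring rule is an assumption; the slide only states that the number of matched properties and their weights are taken into account.

```python
# Hand-picked subset of properties per class (assumption).
CLASS_PROPERTIES = {
    "schema:Person": {"schema:givenName", "schema:familyName",
                      "schema:email", "schema:telephone"},
    "schema:Event": {"schema:name", "schema:url", "schema:startDate"},
}

def choose_class(matches):
    """Pick the class whose known properties collect the highest weight.

    matches maps predicate -> weight for one JSON object.
    Returns (class, score), or (None, 0.0) if nothing matches.
    """
    best_class, best_score = None, 0.0
    for cls, properties in CLASS_PROPERTIES.items():
        score = sum(weight for predicate, weight in matches.items()
                    if predicate in properties)
        if score > best_score:
            best_class, best_score = cls, score
    return best_class, best_score

person_matches = {"schema:givenName": 0.9, "schema:familyName": 0.9,
                  "schema:email": 0.9, "schema:name": 0.4}
best, score = choose_class(person_matches)
```

Once a class is chosen, ambiguous labels can be resolved in favour of the candidate predicate that belongs to that class.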
  • 36. THE KESEDA APPROACH
      6. Generate JSON-LD document
      • use matches and mappings
      • link entities depending on JSON tree structure
      • validation of output
      • optional conversion to various RDF formats
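For a flat object, generating the JSON-LD document amounts to adding a context and a type and renaming the mapped properties. The Python sketch below assumes a label-to-term mapping has already been produced by the earlier steps; the function name and the use of `@vocab` with schema.org are illustrative choices, and linking of nested entities, validation, and RDF conversion are not covered.

```python
import json

def to_json_ld(record, class_name, mapping, vocab="http://schema.org/"):
    """Build a flat JSON-LD document from a record and a label -> term mapping.

    Labels without a mapping are dropped from the output.
    """
    document = {"@context": {"@vocab": vocab}, "@type": class_name}
    for label, value in record.items():
        if label in mapping:
            document[mapping[label]] = value
    return document

# Person record from the slides, without the elided fields.
person = {
    "id": "krug",
    "firstName": "Michael",
    "lastName": "Krug",
    "title": "Dipl.-Inf.",
    "phone": "+49 371 531 39929",
    "email": "michael.krug@informatik.tu-chemnitz.de",
}
# Hypothetical result of the label analysis and class mapping.
mapping = {
    "firstName": "givenName",
    "lastName": "familyName",
    "title": "honorificPrefix",
    "phone": "telephone",
    "email": "email",
}
document = to_json_ld(person, "Person", mapping)
serialized = json.dumps(document, indent=2)
```

With `@vocab` set, short terms such as `givenName` expand to full schema.org IRIs when the document is processed by a JSON-LD library.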
  • 37. THE KESEDA APPROACH
      7. Evaluation of results
      • manual or automatic comparison of actual vs. desired result to reweight matching components
      • store correctly applied mappings for later reuse
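The feedback loop of this step can be sketched in Python as a comparison of produced and desired mappings that nudges the weight of each matching component up or down and collects the confirmed mappings. The fixed step size, the clamping to the range 0 to 1, and the component names are assumptions; the slides do not describe the actual reweighting rule.

```python
def evaluate(actual, desired, sources, weights, step=0.1):
    """Compare actual vs. desired mappings and adjust component weights.

    actual, desired: label -> predicate
    sources: label -> name of the component that produced the actual match
    weights: component name -> current weight (left unmodified)
    Returns (new_weights, confirmed_mappings).
    """
    new_weights = dict(weights)
    confirmed = {}
    for label, predicate in actual.items():
        component = sources[label]
        if desired.get(label) == predicate:
            confirmed[label] = predicate
            new_weights[component] = min(1.0, new_weights[component] + step)
        else:
            new_weights[component] = max(0.0, new_weights[component] - step)
    return new_weights, confirmed

actual = {"firstName": "schema:givenName", "title": "schema:name"}
desired = {"firstName": "schema:givenName", "title": "schema:honorificPrefix"}
sources = {"firstName": "synonym", "title": "substring"}
weights = {"synonym": 0.6, "substring": 0.4}
new_weights, confirmed = evaluate(actual, desired, sources, weights)
```

The `confirmed` dictionary is the part that could be stored and fed back in as pre-defined mappings on the next run.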