
Michael Krug, Martin Seidel, Frank Burian and Martin Gaedke | KESeDa: Knowledge Extraction from Heterogeneous Semi-Structured Data Sources

http://2016.semantics.cc/michael-krug

  1. 1. WWW.LEDS-PROJEKT.DE LEDS KNOWLEDGE EXTRACTION FROM HETEROGENEOUS SEMI-STRUCTURED DATA SOURCES MARTIN SEIDEL, MICHAEL KRUG, FRANK BURIAN, MARTIN GAEDKE 12. September 2016
  2. 2. CURRENT SITUATION • knowledge on the Web is often only available as weakly interlinked, heterogeneous, semi-structured data → no semantic classification • how to link or merge data? • how to run semantic queries? → not usable in a meaningful way
  3. 3. GOAL Extraction of knowledge from semi-structured data • knowledge in terms of semantic metadata • semantically enriched data can then utilize the potential of Linked Data → provide an automatic process
  4. 4. THE KESEDA APPROACH
  5. 5. THE KESEDA APPROACH • Specifically designed to work on JSON data • Challenges when working with JSON data → no schema, only name-value pairs → any structure and depth possible
  6. 6. THE KESEDA APPROACH { "id": "krug", "firstName": "Michael", "lastName": "Krug", "title": "Dipl.-Inf.", "phone": "+49 371 531 39929", "email": "michael.krug@informatik.tu-chemnitz.de", [...] }
  7. 7. THE KESEDA APPROACH { "id": "2015-007", "title": "SmartComposition: ...", "author": [ "Michael Krug", "Martin Gaedke"], "year": "2015", "type": "Conference Paper", "event": { "name": "24th International World Wide Web Conference", "url": "http://www.www2015.it/" }, [...] } (slide callouts: Arrays, Objects)
  8. 8. THE KESEDA APPROACH • multi-step algorithm • work in existing JSON structure • find and store various matches with different weights • use additional information sources like API descriptions • assign classes to objects with multiple properties • link detected entities
  9. 9. THE KESEDA APPROACH 1. Differentiation of input sources / formats 2. Preparation of data structure 3. Analysis of property labels 4. Analysis of property values 5. Mapping of classes 6. Generate JSON-LD document 7. Evaluation of results
  10. 10. PROTOTYPE
  11. 11. PROTOTYPE • prototype implemented in Node.js • working with properties and classes from: schema.org, FOAF, Dublin Core, GoodRelations, Music Ontology • dictionaries for: first & last names, cities, streets, languages • list of manually curated synonyms • option to provide pre-defined mappings
  12. 12. PROTOTYPE • Web interface for: pre-configuration (mappings, synonyms, dictionaries); data upload; result analysis (statistics and browsing)
  13. 13. PROTOTYPE: CONFIGURATION [figure]
  14. 14. PROTOTYPE: RESULTS [figure]
  15. 15. EVALUATION
  16. 16. EVALUATION Algorithm applied to datasets of 1) JSON array of people 2) JSON array of publications a) Without custom pre-configuration b) With custom pre-configuration
  17. 17. EVALUATION Initial Setup • dictionary and structure pattern matching • label → predicate string matching • classes and properties: schema.org, FOAF, Dublin Core, GoodRelations Custom Pre-Configuration • set of label → predicate mappings (hand-picked for data context) • list of known synonyms • more structure patterns
  18. 18. 1A) PEOPLE W/O CONFIG [figure]
  19. 19. 1A) PEOPLE W/O CONFIG [figure]
  20. 20. 2A) PEOPLE W/ CONFIG [figure]
  21. 21. 2A) PEOPLE W/ CONFIG [figure]
  22. 22. 1B) PUBLICATIONS W/O CONFIG [figure]
  23. 23. 1B) PUBLICATIONS W/O CONFIG [figure]
  24. 24. 2B) PUBLICATIONS W/ CONFIG [figure]
  25. 25. 2B) PUBLICATIONS W/ CONFIG [figure]
  26. 26. SUMMARY
  27. 27. SUMMARY ➙ Approach for extracting knowledge from semi-structured data ➙ by applying a multi-step algorithm ➙ to convert JSON data to RDF ➙ that assigns known classes to objects and maps their properties to S-P-O triples
  28. 28. OPEN CHALLENGES • detect and reuse JSON structure patterns • disambiguate values • apply quality control to results • improve scalability for large datasets • research the application of machine learning
  29. 29. THANK YOU! MICHAEL.KRUG@INFORMATIK.TU-CHEMNITZ.DE VSR.INFORMATIK.TU-CHEMNITZ.DE WWW.LEDS-PROJEKT.DE
  31. 31. THE KESEDA APPROACH 1. Differentiation of input sources / formats • text, file, URL, API • check for format • optional conversion of XML to JSON
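
The format check and optional XML conversion of step 1 could look roughly like the sketch below. It is not the prototype's code (which is written in Node.js); the function names `load_input` and `xml_to_dict` are hypothetical, and the XML conversion is deliberately naive (attributes are ignored, repeated tags become arrays).

```python
import json
import xml.etree.ElementTree as ET


def xml_to_dict(element):
    """Naively convert an XML element into nested dicts/lists (attributes ignored)."""
    children = list(element)
    if not children:
        return (element.text or "").strip()
    result = {}
    for child in children:
        value = xml_to_dict(child)
        if child.tag in result:
            # Repeated tags are collected in a list, mirroring a JSON array.
            if not isinstance(result[child.tag], list):
                result[child.tag] = [result[child.tag]]
            result[child.tag].append(value)
        else:
            result[child.tag] = value
    return result


def load_input(text):
    """Return (format, data) for a raw text payload; raise ValueError if unknown."""
    try:
        return "json", json.loads(text)
    except json.JSONDecodeError:
        pass
    try:
        root = ET.fromstring(text)
    except ET.ParseError:
        raise ValueError("input is neither JSON nor XML")
    return "xml", {root.tag: xml_to_dict(root)}
```

Reading from a file, URL or API would only change how `text` is obtained; the check itself stays the same.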
  32. 32. THE KESEDA APPROACH 2. Preparation of data structure • pre-process JSON tree to store matches and mappings • keep original structure to preserve hierarchy for later relations • detect arrays and objects for separate processing • clean up: remove empty entries
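
A minimal sketch of step 2, assuming the input is already parsed JSON. The helper names `clean` and `annotate` and the wrapper fields `kind`, `matches`, `children` are assumptions made for illustration; the slides do not describe the prototype's internal data structure.

```python
EMPTY = (None, "", [], {})


def clean(node):
    """Recursively drop empty entries while keeping the original hierarchy."""
    if isinstance(node, dict):
        cleaned = {key: clean(value) for key, value in node.items()}
        return {key: value for key, value in cleaned.items() if value not in EMPTY}
    if isinstance(node, list):
        cleaned = [clean(item) for item in node]
        return [item for item in cleaned if item not in EMPTY]
    return node


def annotate(node):
    """Wrap each node so later steps can attach matches without altering the data."""
    if isinstance(node, dict):
        return {"kind": "object", "matches": [],
                "children": {key: annotate(value) for key, value in node.items()}}
    if isinstance(node, list):
        return {"kind": "array", "matches": [],
                "children": [annotate(item) for item in node]}
    return {"kind": "value", "matches": [], "value": node}
```

Values such as `0` or `False` are kept, because only `None`, empty strings, empty arrays and empty objects count as empty here.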
  33. 33. THE KESEDA APPROACH 3. Analysis of property labels • string matching (substrings, prefixes, …) • synonyms • pre-defined mappings • use metadata from API description, if available
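
The label analysis of step 3 might be sketched as follows. The predicate list, the synonym table, the pre-defined mapping and all weights are illustrative assumptions, not values from the prototype; only the matching kinds (pre-defined, synonym, exact, prefix, substring) follow the slide.

```python
import re

# Illustrative assumptions: a few schema.org property names and hand-made tables.
PREDICATES = ["givenName", "familyName", "email", "telephone", "name"]
SYNONYMS = {"firstname": "givenName", "lastname": "familyName", "phone": "telephone"}
PREDEFINED = {"title": "honorificPrefix"}


def normalize(label):
    """Lower-case a label and strip everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", label.lower())


def match_label(label):
    """Return (predicate, weight, method) candidates for a label, best first."""
    key = normalize(label)
    if not key:
        return []
    matches = []
    if label in PREDEFINED:
        matches.append((PREDEFINED[label], 1.0, "predefined"))
    if key in SYNONYMS:
        matches.append((SYNONYMS[key], 0.8, "synonym"))
    for predicate in PREDICATES:
        target = normalize(predicate)
        if key == target:
            matches.append((predicate, 0.9, "exact"))
        elif target.startswith(key) or key.startswith(target):
            matches.append((predicate, 0.6, "prefix"))
        elif key in target or target in key:
            matches.append((predicate, 0.4, "substring"))
    return sorted(matches, key=lambda match: match[1], reverse=True)
```

All candidates are kept with their weights rather than picking one immediately, so that the later class-mapping step can still choose among them.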
  34. 34. THE KESEDA APPROACH 4. Analysis of property values • dictionaries • structure patterns (uri, date, address, color…) • data types (date, time, number, boolean…) • (lower weighted)
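
A rough sketch of the value analysis in step 4. The tiny first-name dictionary, the regular expressions and the weights are assumptions chosen for illustration; data-type hints get the lowest weight here, which is one possible reading of the "lower weighted" remark.

```python
import re

FIRST_NAMES = {"michael", "martin", "frank"}  # tiny illustrative dictionary

# Simplified structure patterns; real patterns would need to be more robust.
PATTERNS = [
    ("uri", re.compile(r"^https?://\S+$")),
    ("date", re.compile(r"^\d{4}-\d{2}-\d{2}$")),
    ("color", re.compile(r"^#[0-9a-fA-F]{6}$")),
]


def analyze_value(value):
    """Return (hint, weight) pairs for a single JSON value."""
    hints = []
    if isinstance(value, bool):  # checked before int, since bool is a subclass of int
        hints.append(("boolean", 0.2))
    elif isinstance(value, (int, float)):
        hints.append(("number", 0.2))
    elif isinstance(value, str):
        text = value.strip()
        if text.lower() in FIRST_NAMES:
            hints.append(("firstName", 0.7))
        for name, pattern in PATTERNS:
            if pattern.match(text):
                hints.append((name, 0.5))
        if re.fullmatch(r"-?\d+(\.\d+)?", text):
            hints.append(("number", 0.2))
    return hints
```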
  35. 35. THE KESEDA APPROACH 5. Mapping of classes • find class by number of matched properties • select match that is most appropriate for chosen class • take different weights into account
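
Step 5 could be approximated by scoring each candidate class with the weights of the properties it accepts. The `CLASS_PROPERTIES` table is a hand-picked, incomplete subset of schema.org used only for illustration, and the scoring rule (sum of the best fitting weight per label) is an assumption.

```python
# Illustrative, incomplete subset of schema.org classes and properties.
CLASS_PROPERTIES = {
    "Person": {"givenName", "familyName", "email", "telephone", "honorificPrefix"},
    "ScholarlyArticle": {"name", "author", "datePublished"},
}


def choose_class(property_matches):
    """Pick the class whose properties fit best.

    property_matches: {label: [(predicate, weight), ...]}
    Returns (class_name, {label: predicate}) or (None, {}) if nothing fits.
    """
    best_class, best_score, best_mapping = None, 0.0, {}
    for class_name, allowed in CLASS_PROPERTIES.items():
        score, mapping = 0.0, {}
        for label, candidates in property_matches.items():
            fitting = [c for c in candidates if c[0] in allowed]
            if fitting:
                # Select the candidate most appropriate for this class.
                predicate, weight = max(fitting, key=lambda c: c[1])
                mapping[label] = predicate
                score += weight
        if score > best_score:
            best_class, best_score, best_mapping = class_name, score, mapping
    return best_class, best_mapping
```

With the people example, `firstName` and `lastName` also weakly match `name`, but the stronger matches for `givenName`, `familyName` and `email` make `Person` win.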
  36. 36. THE KESEDA APPROACH 6. Generate JSON-LD document • use matches and mappings • link entities depending on JSON tree structure • validation of output • optional conversion to various RDF formats
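
A minimal sketch of step 6, assuming a class and a label → predicate mapping have already been chosen. The function names and the `nested` parameter are hypothetical; validation and conversion to other RDF formats are left out. Nested objects become nested JSON-LD nodes, which is how this sketch links entities along the JSON tree.

```python
def build_node(obj, class_name, mapping, nested=None):
    """Build one JSON-LD node from a plain object and a label -> predicate mapping.

    nested: optional {label: (class_name, mapping)} for child objects.
    """
    node = {"@type": class_name}
    for label, value in obj.items():
        if label not in mapping:
            continue  # unmatched properties are skipped in this sketch
        if nested and label in nested and isinstance(value, dict):
            child_class, child_mapping = nested[label]
            value = build_node(value, child_class, child_mapping)
        node[mapping[label]] = value
    return node


def to_json_ld(obj, class_name, mapping, nested=None):
    """Wrap the node in a document that uses schema.org as its context."""
    document = {"@context": "https://schema.org"}
    document.update(build_node(obj, class_name, mapping, nested))
    return document
```

For the publication example, a mapping such as `{"title": "name", "year": "datePublished", "event": "recordedAt"}` with `nested={"event": ("Event", {"name": "name"})}` would yield a `ScholarlyArticle` node that contains an `Event` node; the choice of `recordedAt` is only one plausible schema.org property and is an assumption.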
  37. 37. THE KESEDA APPROACH 7. Evaluation of results • manual or automatic comparison of actual vs. desired result to reweight matching components • store correctly applied mappings for later reuse
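
The feedback loop of step 7 could be sketched as below. The slides do not say how reweighting is done, so the rule used here (raise or lower the weight of the matching method by a fixed step) is purely an assumption, as are the function name `evaluate` and its parameters.

```python
def evaluate(actual, desired, methods, weights, step=0.05):
    """Compare actual vs. desired mappings and adjust per-method weights.

    actual, desired: {label: predicate}
    methods: {label: name of the matching method that produced the actual match}
    weights: {method name: current weight}
    Returns (accuracy, new_weights, confirmed_mappings).
    """
    new_weights = dict(weights)
    confirmed = {}
    for label, predicate in actual.items():
        method = methods[label]
        if desired.get(label) == predicate:
            confirmed[label] = predicate  # stored for later reuse
            new_weights[method] = round(min(1.0, new_weights[method] + step), 2)
        else:
            new_weights[method] = round(max(0.0, new_weights[method] - step), 2)
    accuracy = len(confirmed) / len(desired) if desired else 0.0
    return accuracy, new_weights, confirmed
```

The confirmed mappings could then be fed back as pre-defined mappings for the next run on data from the same context.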
