Semantically Enhanced Quality Assurance in the JURION Business Use Case

Presentation of our paper at ESWC 2016, Crete
http://svn.aksw.org/papers/2016/ESWC_Jurion/public.pdf

  1. Semantically Enhanced Quality Assurance in the JURION Business Use Case. Dimitris Kontokostas, Christian Mader, Christian Dirschl, Katja Eck, Michael Leuthold, Jens Lehmann, Sebastian Hellmann
  2. ESWC 2016 Overview ● Wolters Kluwer overview ● Use Case Tools ● Challenges ● Solutions ● Evaluation ● Future Work
  3. Wolters Kluwer. Wolters Kluwer provides solutions to customers in over 170 countries and content in at least a dozen languages, focusing on the legal, tax, finance and health industries.
  4. Wolters Kluwer Transformation
  5. Wolters Kluwer Transformation Quality
  6. WKD in the LOD2 project
  8. WKD in the ALIGNED Project
  9. RDF in the publishing industry
  10. Use Case Tools
  11. ● TDDD: Test Driven (Data) Development ○ Methodology, definitions & tools ● SPARQL ● Reusable unit tests for ○ vocabularies ○ datasets ○ applications ● Test auto-generators ○ OWL ○ IBM Shapes ○ DSP (Dublin Core Description Set Profiles) ○ W3C Shapes (in progress) ● Open source (Apache license) ● Stable tool, used in many research & industrial settings http://rdfunit.aksw.org
  12. https://www.poolparty.biz ● Commercial product developed by Semantic Web Company ● Collaborative thesaurus development ○ From scratch or by extracting terms from a document corpus ● Compliance with the 5-star Open Data principles (RDF & SKOS) ● Automatically retrieves potential additional concepts for inclusion in the thesauri by querying SPARQL endpoints (e.g. DBpedia) ● Identifies and links to related resources from local / remote projects ● Simple ontology editing (rdf:type, rdfs:subClassOf, rdfs:domain/range,...) ● Automated quality assurance mechanisms ○ Conformance to SKOS or a custom schema ○ The enforcement level of some quality metrics can be configured by the user, e.g., to raise an alert if a circular hierarchical relation is detected ○ Check a taxonomy “as a whole” against a set of potential quality violations
  13. Challenges
  14. Metadata RDF Conversion Verification: Existing Infrastructure ● Platform Content Interface (PCI) ontology ○ Proprietary schema that describes legal documents and metadata in OWL ● PCI revisions => verify that the data conforms to PCI ● Proprietary SOAP-based validation service ○ Package-based validation => error detection is hard ○ Asynchronous & complex web service => hard to use ○ Network dependency => potentially unstable
  15. Metadata RDF Conversion Verification. Continuous, high-quality triplification of semi-structured data is a common problem in the information industry. Schema changes and enhancements are routine tasks, but ensuring data quality is still very often a purely manual effort, so any automation will support many real-life use cases in different domains. Goal: Based on the schema, test cases should be created automatically and run on a regular basis against the data that needs to be transformed. The errors detected will lead to refinements and changes of the XSLT scripts and sometimes also to schema changes, which in turn lead to new automatically created test cases.
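
The goal on this slide (derive test cases from the schema, then run them regularly against the converted data) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not RDFUnit's actual API: the tuple-based triple representation, the `generate_tests` and `run_tests` helpers, and all `ex:` names are hypothetical. The sketch handles a single axiom type, `rdfs:range`.

```python
# Hypothetical sketch: derive test cases from schema axioms, then run them on data.
# Triples are plain (subject, predicate, object) tuples; no RDF library is used.

RDF_TYPE = "rdf:type"
RDFS_RANGE = "rdfs:range"


def generate_tests(schema):
    """Create one test per rdfs:range axiom found in the schema."""
    tests = []
    for prop, pred, cls in schema:
        if pred == RDFS_RANGE:
            tests.append({"property": prop, "expected_class": cls})
    return tests


def run_tests(tests, data):
    """Return the triples whose object lacks the class required by a test."""
    # Index the declared types of every resource once.
    types = {}
    for s, p, o in data:
        if p == RDF_TYPE:
            types.setdefault(s, set()).add(o)

    violations = []
    for test in tests:
        for s, p, o in data:
            if p == test["property"] and test["expected_class"] not in types.get(o, set()):
                violations.append((s, p, o))
    return violations


# Made-up schema and data, for illustration only.
schema = [("ex:cites", RDFS_RANGE, "ex:Document")]
data = [
    ("ex:doc1", RDF_TYPE, "ex:Document"),
    ("ex:doc2", RDF_TYPE, "ex:Document"),
    ("ex:doc1", "ex:cites", "ex:doc2"),    # valid: ex:doc2 is an ex:Document
    ("ex:doc2", "ex:cites", "ex:court9"),  # violation: ex:court9 has no declared type
]

tests = generate_tests(schema)
violations = run_tests(tests, data)
print(violations)
```

Because the tests are regenerated from the schema, a schema revision changes the test suite without any hand-written test having to be updated.
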
  17. RDFUnit / JUnit Integration
  18. Quality Control in Thesaurus Management ● WKD develops multiple controlled vocabularies for annotating documents (e.g., court decisions, labour law,...) using PoolParty ● The vocabularies are interconnected ● Consistency and quality must be ensured across all vocabularies ● Various quality issues, e.g., ○ Duplicates ○ Links to deprecated (deleted) concepts ○ Unresolvable links ● Up to now curated manually in the deployed system, with regular errors in production versions
  19. Quality Control in Thesaurus Management. The creation and maintenance of knowledge models is gaining importance in the Web of Data. These tasks are increasingly being executed by SMEs in the domain, not in knowledge modelling and IT as such. Therefore, better automatic support for these processes will directly help achieve quality and efficiency gains. ● Automated quality checks over multiple vocabularies ● Improved notifications: email on changes performed by users ● Additional statistics on, e.g., vocabulary dependencies, changes, etc.
  20. Quality Control in Thesaurus Management: Vocabulary link validation (PoolParty) ● Uses project metadata to identify linked vocabularies ● A link is invalid if the target concept is either deprecated or deleted ● Creates a report for human curators ● Vocabulary repair is still a manual process
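
The validity rule on this slide (a link is invalid if its target concept is deprecated or deleted) can be illustrated with a minimal Python sketch. It does not reflect PoolParty's implementation: the `validate_links` helper, the status strings, the `voc-a:` / `voc-b:` identifiers, and the choice to treat an unknown target as deleted are all assumptions made for the example.

```python
# Hypothetical sketch of cross-vocabulary link validation.
# A link is reported as invalid when its target concept is deprecated or deleted.


def validate_links(links, concepts):
    """links: list of (source, target) pairs; concepts: dict of URI -> status."""
    report = []
    for source, target in links:
        # Assumption: a target missing from the vocabulary counts as deleted.
        status = concepts.get(target, "deleted")
        if status in ("deprecated", "deleted"):
            report.append({"source": source, "target": target, "reason": status})
    return report


# Made-up target vocabulary and links, for illustration only.
concepts = {
    "voc-b:contract": "active",
    "voc-b:tenancy": "deprecated",
}
links = [
    ("voc-a:lease", "voc-b:contract"),
    ("voc-a:rent", "voc-b:tenancy"),
    ("voc-a:tort", "voc-b:liability"),  # target no longer exists
]

report = validate_links(links, concepts)
for entry in report:
    print(f"{entry['source']} -> {entry['target']}: target is {entry['reason']}")
```

The report only lists the broken links; as the slide notes, repairing the vocabulary remains a task for human curators.
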
  21. Results & Evaluation. The analysis is based on measured metrics and the qualitative feedback of experts and users. Participants of the evaluation study were selected from WKD staff in the fields of software development and data development. There were seven participants in total: four involved in the expert evaluation and three content experts involved in the usability/interview evaluation. ● Productivity ● Quality ● Agility
  22. Productivity (RDFUnit) ● Total time for quality checks and error detection ● The time needed for manual interaction. What we measured: ● 1 ms to 50 ms per single test (depending on the document / ontology size) ○ as close to real-time as possible, currently a couple of minutes ● Quality checks can be triggered by manual execution, but they are always verified automatically by the CI build system ● A total of 44,000 tests with a total duration of 11 minutes ○ may scale up easily when parallelized or clustered
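
As a quick consistency check on the figures above, the reported totals imply an average of about 15 ms per test, which falls inside the reported 1 ms to 50 ms range. The short Python calculation below shows the arithmetic. The parallel estimate is an idealized assumption (perfectly even distribution, no overhead), and the worker count of 4 is a hypothetical value that does not come from the slides.

```python
# Figures taken from the slide.
total_tests = 44_000
total_seconds = 11 * 60  # 11 minutes

# Average time per test in milliseconds.
avg_ms = total_seconds / total_tests * 1000
print(f"average per test: {avg_ms:.0f} ms")

# Idealized estimate only: assumes tests split evenly across workers
# with no coordination overhead. The worker count is hypothetical.
workers = 4
ideal_parallel_minutes = total_seconds / workers / 60
print(f"ideal runtime with {workers} workers: {ideal_parallel_minutes:.2f} min")
```
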
  23. Quality (RDFUnit) What kind of errors can be detected and is categorization possible? ● Experts concluded that it is helpful for spotting errors introduced by changes, since issues spotted in this way can be assumed to point to actual errors whose causes can be identified and addressed ● Successful tests are less significant, as we are not yet able to evaluate whether and how the measurements taken correspond to target measures, and these tests do not point to concrete errors. ○ Coverage & other metrics needed
  24. Agility (RDFUnit) … time to include new requirements ● Including new constraints or adapting existing constraints works by adding new reference documents to the input dataset to make the test environment as representative as possible. ● The process of generating tests and testing is fully automated and adapts very easily to changed parameters. ● Adding more documents to the input dataset increases the total runtime
  25. Productivity (PoolParty) ● The number of checked links ● The number of violations ● The total time. What we measured: The presentation of the results was well understood. In general, the tool was well received by the experts, which was reflected in their feedback in the interviews.
  26. Quality (PoolParty) ● No false detections of broken links ● The prototype still lacks some usability.
  27. Agility (PoolParty) … integration, configuration time and extension ● Very useful for getting an overview ● In some cases it is desired to limit the link lookups and adapt the way links to external datasets are detected ○ Use a custom base URI or regular expression-based techniques ● Re-configuration is possible, but recompiling the application might be needed ○ Plans to delegate this process to UnifiedViews
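
The two configuration options mentioned on this slide (a custom base URI and regular expression-based detection of links to external datasets) might look roughly like the Python sketch below. Nothing here comes from the tool itself: `LOCAL_BASE_URI`, `EXTERNAL_PATTERN`, and `classify_link` are hypothetical names, and the example URIs are made up.

```python
import re

# Hypothetical configuration: neither value comes from the slides.
LOCAL_BASE_URI = "http://example.org/thesaurus/"
EXTERNAL_PATTERN = re.compile(r"^https?://(?:[\w-]+\.)?dbpedia\.org/resource/")


def classify_link(uri):
    """Decide whether a link target should be looked up, and where."""
    if uri.startswith(LOCAL_BASE_URI):
        return "local"      # resolved against the local project
    if EXTERNAL_PATTERN.match(uri):
        return "external"   # looked up in the external dataset
    return "ignored"        # excluded to limit the number of lookups


for uri in (
    "http://example.org/thesaurus/1234",
    "http://de.dbpedia.org/resource/Vertrag",
    "http://other.example/concept/5",
):
    print(uri, "->", classify_link(uri))
```

Keeping the base URI and the pattern in configuration rather than in code is one way to avoid the recompilation step the slide mentions.
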
  28. Future Work ● Error analysis (statistics, time to fix an issue, regressions) ● Test coverage and better metrics ● Improve the UI of the Link Validation tool ● Provide more advanced settings ● Inter-repository Link Validation
  29. Thank You! Questions? (You might want to) take a look at… RDF and XML Interoperability W3C Community Group https://www.w3.org/community/rax/