
Public PhD Defense - Ben De Meester

Our daily lives are strongly influenced by decision-making processes based on large amounts of data. Both the data values and the meaningful (semantic) relationships between them can be captured in knowledge graphs.
Because knowledge graphs are processed automatically, they must be of high quality on both fronts.
This thesis focuses on both improving the data quality and assessing the semantic quality of knowledge graphs.

On the one hand, it describes a framework for generating knowledge graphs with extensible data transformations that can clean data ("RML + FnO"), extended to perform data transformations automatically and independently of the implementation ("FnO.io").
On the other hand, it describes a validation approach that builds on rule-based reasoning ("Validatrr"). This approach takes the semantics in use into account and enables targeted improvements to knowledge graphs thanks to detailed root-cause explanations of quality problems.
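
To make the idea of combining inference with validation concrete, here is a minimal sketch in plain Python. It is not Validatrr's rule language or implementation. The triple representation, the type names, and the two rules are assumptions made for illustration, loosely based on the guitar example in the slides (a 4-string bass guitar OR a 6-string electric guitar).

```python
# Triples are (subject, predicate, object) tuples. All names are hypothetical.
graph = {
    ("myBass", "type", "BassGuitar"),
    ("myBass", "hasGuitarStrings", 4),
    ("myGuitar", "type", "ElectricGuitar"),
    ("myGuitar", "hasGuitarStrings", 4),
}


def infer(triples):
    """Inference rule: every BassGuitar or ElectricGuitar is also a Guitar."""
    inferred = set(triples)
    for s, p, o in triples:
        if p == "type" and o in ("BassGuitar", "ElectricGuitar"):
            inferred.add((s, "type", "Guitar"))
    return inferred


def validate(triples):
    """Validation rule with a disjunction: a Guitar must be a 4-string
    bass guitar OR a 6-string electric guitar. Each violation carries the
    facts that caused it, as a simple root-cause explanation."""
    violations = []
    guitars = sorted(s for s, p, o in triples if p == "type" and o == "Guitar")
    for g in guitars:
        types = {o for s, p, o in triples if s == g and p == "type"}
        strings = {o for s, p, o in triples if s == g and p == "hasGuitarStrings"}
        bass_ok = "BassGuitar" in types and strings == {4}
        electric_ok = "ElectricGuitar" in types and strings == {6}
        if not (bass_ok or electric_ok):
            violations.append({
                "node": g,
                "rule": "4-string bass guitar OR 6-string electric guitar",
                "facts": {"types": sorted(types), "strings": sorted(strings)},
            })
    return violations


# Validating the inferred graph finds the problem; validating the raw graph
# does not, because nothing is typed as Guitar before inference.
violations = validate(infer(graph))
```

In this sketch, validating the raw graph reports nothing, while validating after inference flags `myGuitar` together with the facts that triggered the violation.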

Thanks to these contributions, data values in knowledge graphs are cleaned up while the graphs are generated, and existing knowledge graphs can be completed using automatic data transformations. Our validation approach makes it possible to accurately assess the quality of semantic relationships in knowledge graphs.
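
As an illustration of cleaning a data value during generation, the sketch below normalizes the date example used in the slides ("Thursday, May 28th 2020") to "2020-05-28" before emitting a statement. It is plain Python, not RML or FnO syntax. The function names, the accepted formats, and the example.org IRIs are assumptions.

```python
import re
from datetime import datetime


def parse_date(text):
    """Hypothetical transformation: free-text date -> ISO 8601 date string.

    Returns None when the text matches none of the known formats.
    Month and weekday names are matched using the current locale
    (English is assumed here).
    """
    # Drop ordinal suffixes such as "28th" -> "28".
    cleaned = re.sub(r"(\d+)(st|nd|rd|th)\b", r"\1", text.strip())
    for fmt in ("%A, %B %d %Y", "%B %d %Y"):
        try:
            return datetime.strptime(cleaned, fmt).date().isoformat()
        except ValueError:
            continue
    return None


def to_statement(subject, predicate, raw_value):
    """Emit one statement with a cleaned literal, or None if cleaning fails."""
    value = parse_date(raw_value)
    if value is None:
        return None
    return f'<{subject}> <{predicate}> "{value}" .'


statement = to_statement(
    "http://example.org/defense",  # hypothetical IRI
    "http://example.org/date",     # hypothetical IRI
    "Thursday, May 28th 2020",
)
```

Values such as "28/05" or "Today" carry too little information for this sketch, so `parse_date` returns `None` for them rather than guessing.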

The combined work makes it easier to improve the data quality and assess the semantic quality of knowledge graphs, which helps ensure that knowledge graphs can be used correctly in decision-making processes.

  1. 1. Improving and Assessing Data Quality of Knowledge Graphs Ben De Meester Public defense presentation May 28th, 2020 Promotors Dr. Anastasia Dimou, Prof. Ruben Verborgh Jury Prof. Filip De Turck, Prof. Steven Verstockt, Prof. Erik Mannens, Dr. Juan Sequeda, Prof. Heiko Paulheim, Prof. Jose Emilio Labra Gayo
  2. 2. Short story
  3. 3. Improving and Assessing Data Quality of Knowledge Graphs
  4. 4. Better Data Integration
  5. 5. Why/When do you need to integrate data?
  6. 6. Why data integration is needed Reuse already existing data Needed on all scales People, companies, governments, …
  7. 7. Better data integration: how? By providing methods and tools for Cleaning data Checking data before, during, and after data integration.
  8. 8. Better data integration: how? By providing methods and tools for Cleaning data Checking data before, during, and after integration. Simple enough!
  9. 9. Better data integration: how? By providing methods and tools for Cleaning data Checking data before, during, and after integration. Challenge: automatic & reusable
  10. 10. Automatic?
  11. 11. There’s a lotta data Automatic?
  12. 12. (Re-)building systems per use case is $$$. Reusable?
  13. 13. I worked on reusable systems to automatically improve or check the quality when integrating data Short story
  14. 14. What we’ll discuss Intro Integrating Data: Knowledge Graphs Quality Problems Cleaning: Data Quality Checking: Schema Quality What now?
  15. 15. What we’ll discuss Intro Integrating Data: Knowledge Graphs Quality Problems Cleaning: Data Quality Checking: Schema Quality What now?
  16. 16. Data Integration Problems Hard to exchange and reuse existing data Different formats Different meanings
  17. 17. Knowledge graphs Framework to make structure and meaning of data explicit by adding meaningful (semantic) annotations Why semantics? How do I add these annotations?
  18. 18. Why semantics?
  19. 19. Semantics? Bass
  20. 20. Semantics?
  21. 21. Semantics?
  22. 22. Also works the other way around Name Lastname Family name Achternaam lname Name of his father
  23. 23. Semantics! By making the meaning explicit, it’s easier to integrate AND …
  24. 24. You know more than you know! = Inferencing! My bass has 4…
  25. 25. You know more than you know! = Inferencing! My bass has 4…
  26. 26. You know more than you know! = Inferencing! My bass has 4 strings!
  27. 27. How to add these annotations?
  28. 28. Can we use a global information system? The Web! Let’s steal/reuse from there. We use links to look up further information <http://ben.be/myBass> <http://dict.com/hasGuitarStrings> "4"
  29. 29. Knowledge Graph "Data values" <Semantic annotations>+
  30. 30. What we’ll discuss Intro Data Integration: Knowledge Graphs Quality Problems Cleaning: Data Quality Checking: Schema Quality What now?
  31. 31. Garbage in = Garbage out Data integration is great… If quality is good Data Schema (= combined use of semantic annotations)
  32. 32. Bad data quality <myBass> <hasGuitarStrings> "bunny" <myBass> <hasGuitarStrings> "four" <myBass> <hasGuitarStrings> "4.0"
  33. 33. Bad schema quality <myBass> <hasUnderwearStrings> "4" <Bass> <hasGuitarStrings> "4" <Bass> <likesToEat> <Worm> ?
  34. 34. Bad schema quality <myBass> <hasUnderwearStrings> "4" <Bass> <hasGuitarStrings> "4" <Bass> <likesToEat> <Worm>
  35. 35. Yeah, but how bad can it be?
  36. 36. Yeah, but how bad can it be? Mars Climate Orbiter Imperial vs metric system 320 million $
  37. 37. Yeah, but how bad can it be? Mars Climate Orbiter Imperial vs metric system 320 million $
  38. 38. BEN’S PHD STRATEGY
  39. 39. BEN’S PHD STRATEGY
  40. 40. BEN’S PHD STRATEGY
  41. 41. BEN’S PHD STRATEGY
  42. 42. BEN’S PHD STRATEGY DATA VALUES • Messy data when generating a knowledge graph DATA VALUES • Clean data values using data transformations Phase 1a Today
  43. 43. BEN’S PHD STRATEGY DATA VALUES • Incomplete data in a knowledge graph DATA VALUES • Automatically derived data values Phase 1b Today
  44. 44. BEN’S PHD STRATEGY SCHEMA • Mismatch when inferring new things and checking/validating the knowledge graph SCHEMA • Combine inferring and validating Phase 2 Today
  45. 45. Knowledge Graph "Data values" <Semantic annotations>+
  46. 46. RML + FnO Knowledge Graph Improve quality of literals during generation "Data values" <Semantic annotations>+ 1a
  47. 47. FnO.io RML + FnO Knowledge Graph Generalizes Improve quality of literals during generation Can complete "Data values" <Semantic annotations>+ 1b 1a
  48. 48. FnO.io RML + FnO Validatrr Knowledge Graph Generalizes Improve quality of literals during generation Can complete Validates "Data values" <Semantic annotations>+ 2 1b 1a
  49. 49. What we’ll discuss Intro Data Integration: Knowledge Graphs Quality Problems Cleaning: Data Quality Data transformations Checking: Schema Quality What now? 1a
  50. 50. What we’ll discuss Cleaning: Data Quality When generating knowledge graphs: data transformations Knowledge graph generation: RML.io Data transformations: FnO Combining: RML+FnO 1a
  51. 51. Knowledge graph generation
  52. 52. Adding a new piece of data...
  53. 53. Process it...
  54. 54. To make it fit
  55. 55. To make it fit
  56. 56. Costs a lot!
  57. 57. RML.io Set of technologies and tools to stitch relevant pieces of data together, by configuration
  58. 58. Take relevant data...
  59. 59. Take relevant data...
  60. 60. Take relevant data...
  61. 61. Take relevant data...
  62. 62. ... And configure how to link it
  63. 63. Integrate multiple sources
  64. 64. What if the values themselves don’t fit? “Thursday, May 28th 2020”
  65. 65. What if the values themselves don’t fit? “Thursday, May 28th 2020” “May 28 2020” “28/05” “Today”
  66. 66. ROCKBORINGEM WWW.ROCK-BORINGEM.BE
  67. 67. How to describe any reusable data transformation?
  68. 68. Configure how to process source data “Thursday, May 28th 2020” "2020-05-28" FnO.io
  69. 69. Configure how to process source data “Thursday, May 28th 2020” "2020-05-28" FnO.io
  70. 70. FnO: declarative functions Function: from inputs to outputs Function Input Output
  71. 71. FnO: declarative functions Function: from inputs to outputs Date Parsing Function Text Date
  72. 72. Music Instrument Two hands Music
  73. 73. Music Instrument Music Two hands
  74. 74. Music Instrument Music Two hands
  75. 75. Music Instrument Music Two hands
  76. 76. Function Input Output
  77. 77. RML + FnO
  78. 78. RML + FnO
  79. 79. Use aligned RML and FnO configurations. Filled gap: data transformations during Knowledge Graph generation. Because it is a high-level configuration model: no restriction on complexity, use case, or implementation; not dependent on RML.io
  80. 80. Evaluation? See whether we can handle real-world use case with complex data transformations
  81. 81. Evaluation? DBpedia
  82. 82. We generated roughly the same DBpedia data, and: The system is reusable You don’t depend on the implementation or use case DBpedia data transformations can be reused elsewhere Alternative solutions would require re-implementation
  83. 83. What we’ll discuss Cleaning: Data Quality When generating knowledge graphs: data transformations When completing knowledge graphs: automatic calculations FnO.io 1b
  84. 84. How can we automatically calculate derived data?
  85. 85. Automatic calculation? Three-step plan 1. Discover relevant functions 2. Get implementations 3. Execute them automatically
  86. 86. FnO.io Extended the FnO model Implementation-independent Describe links to actual implementations Function Hub Function and implementation discovery Function Handler Automatic function/implementation execution
  87. 87. Complements automatic calculations! Discover relevant functions Function Hub, using semantic descriptions Find implementations Carefully described using the extended FnO model Execute them automatically The Function Handler knows how to interpret these descriptions to linked implementations
  88. 88. What we’ll discuss Intro Data Integration: Knowledge Graphs Quality Problems Cleaning: Data Quality Checking: Schema Quality What now? 2
  89. 89. Knowledge Graph "Data values" <Semantic annotations>+
  90. 90. From individual values to sets of statements <myBass> <hasGuitarStrings> "4" <myBass> <model> <Ibanez> <myBass> <hasColor> <purple>
  91. 91. Shapes SHACL ShEx
  92. 92. The All Acoustic music store Stradivarius violin OK Nord piano Not OK Current languages/tools don’t allow the combination of inferring and validating
  93. 93. How can we create a system that can check shapes, whilst inferring new things?
  94. 94. Contribution: Validatrr, using a rule-based reasoner You can combine rules to Check the shapes Infer new things Support existing languages Backtrace the used rules: know exactly why something was wrong Better than existing validators e.g., SHACL has problems with disjunction: 4 string bass guitar OR 6 string electric guitar
  95. 95. Evaluation Compared to existing validators (RDFUnit) Can check the same kind of shapes Combining inferencing and checking More detailed root cause description Up to an order of magnitude faster for smaller datasets (<100 000 statements) Good alternative if you want more control, and data isn’t too big
  96. 96. FnO.io RML + FnO Validatrr Knowledge Graph Generalizes Improve quality of literals during generation Can complete Validates "Data values" <Semantic annotations>+
  97. 97. What we’ll discuss Intro Data Integration: Knowledge Graphs Quality Problems Cleaning: Data Quality Checking: Schema Quality What now?
  98. 98. Conclusion RML + FnO: complete generation toolkit Uptake in multiple projects and companies Need for standardization: community group
  99. 99. Conclusion RML + FnO: complete generation toolkit FnO.io: automatic function execution, not only for knowledge graphs Further exploitation of semantic descriptions to be investigated Need for more links to existing functions / implementations
  100. 100. Conclusion RML + FnO: complete generation toolkit FnO.io: automatic function execution, not only for knowledge graphs Validatrr: expressive data checking Get it standards-compliant Exploit the combination of inferring and checking Improve performance
  101. 101. What’s in it for you? Right now? Nothing. Companies with complex data integrations are already looking to us though. So hopefully, in the not so distant future…
  102. 102. Improving and Assessing Data Quality of Knowledge Graphs Ben De Meester Public defense presentation May 28th, 2020 https://ben.de-meester.org/phd (it has the afterword!)
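
The three-step plan from slide 85 (discover relevant functions, get implementations, execute them automatically) can be sketched as follows. This is an illustrative sketch in plain Python, not the FnO vocabulary, the Function Hub API, or the Function Handler. The description fields, the `ex:` identifiers, and the helper names are assumptions.

```python
from datetime import datetime

# Function descriptions are kept separate from implementations, so that a
# description stays valid regardless of which implementation is linked to it.
FUNCTION_DESCRIPTIONS = [
    {"id": "ex:parseDate", "expects": "text", "returns": "date"},
    {"id": "ex:toUpperCase", "expects": "text", "returns": "text"},
]

# Hypothetical links from function identifiers to concrete implementations.
IMPLEMENTATIONS = {
    "ex:parseDate": lambda text: datetime.strptime(text, "%B %d %Y").date(),
    "ex:toUpperCase": str.upper,
}


def discover(expects, returns):
    """Step 1: find functions whose description matches the request."""
    return [
        d for d in FUNCTION_DESCRIPTIONS
        if d["expects"] == expects and d["returns"] == returns
    ]


def get_implementation(function_id):
    """Step 2: resolve a described function to a linked implementation."""
    return IMPLEMENTATIONS[function_id]


def execute(expects, returns, value):
    """Step 3: run the first matching function on the given value."""
    candidates = discover(expects, returns)
    if not candidates:
        raise LookupError(f"no function maps {expects} to {returns}")
    implementation = get_implementation(candidates[0]["id"])
    return implementation(value)


derived = execute("text", "date", "May 28 2020")
```

Because callers only name the input and output they need, the implementation behind `ex:parseDate` could be swapped without changing any calling code.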
