Towards a Vocabulary for DQM (Data Quality Management) in Semantic Web Architectures
(Research in Progress)

Christian Fürber and Martin Hepp
christian@fuerber.com, mhepp@computer.org

Presentation @ 1st International Workshop on Linked Web Data Management,
March 25th, 2011, Uppsala, Sweden

Part 1:
                      What's the Problem?



C. Fürber, M. Hepp:                         2
Towards a Vocabulary for DQM
In SemWeb Architectures
Various Data Quality Problems

(Figure: data quality problem types shown over the Linking Open Data cloud diagram)

• Inconsistent duplicates
• Invalid characters
• Missing classification
• Incorrect reference
• Approximate duplicates
• Character alignment violation
• Word transpositions
• Invalid substrings
• Mistyping / Misspelling errors
• Cardinality violation
• Missing values
• Referential integrity violation
• Misfielded values
• Unique value violation
• False values
• Functional Dependency Violation
• Out of range values
• Imprecise values
• Existence of Homonyms
• Meaningless values
• Incorrect classification
• Existence of Synonyms
• Contradictory relationships
• Outdated conceptual elements
• Untyped literals
• Outdated values

Reference: Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/


The Problem

(Figure: query results annotated with three problems)

• Negative Population
• Weird Population Values
• Invalid URLs

Data retrieved on 2011-03-12 from http://loc.openlinksw.com/sparql



Part 2:
        What are high quality data?



What is Data Quality?
• Data's "fitness for use by data consumers" (Wang, Strong 1996)
• "Conformance to specification" (Kahn et al. 2002)
• "Data are of high quality if they are fit for their intended
  uses in operations, decision making, and planning. Data
  are fit for use if they are free of defects and possess
  desired features." (Redman 2001)

• Requirements as "Benchmark"

Perspective-Neutral Data Quality


              Data quality is the degree to which
               data fulfill quality requirements

        …no matter who makes the quality requirements.
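
The definition above can be read as a ratio: the share of data values that fulfill a given set of requirements. The following Python sketch illustrates that reading under stated assumptions. The function name `quality_score`, the two requirement sets, and the sample values are hypothetical and do not come from the slides.

```python
from typing import Callable, Iterable, List

# A requirement is modelled as a predicate over a single value (assumption).
Requirement = Callable[[object], bool]


def quality_score(values: Iterable[object], requirements: List[Requirement]) -> float:
    """Return the share of values that satisfy every requirement."""
    values = list(values)
    if not values:
        return 1.0  # assumption: an empty data set violates nothing
    passing = sum(1 for v in values if all(req(v) for req in requirements))
    return passing / len(values)


# Two hypothetical parties state different requirements for the same data.
requirements_party_a = [lambda v: v >= 0]  # population must not be negative
requirements_party_b = [lambda v: v > 0]   # population must be strictly positive

populations = [1200, -5, 830000, 0]  # made-up sample values

score_a = quality_score(populations, requirements_party_a)
score_b = quality_score(populations, requirements_party_b)
print(score_a, score_b)
```

The same data receive different scores under different requirement sets, which is why the definition stays neutral about who states the requirements.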



Quality Requirements

Quality requirement                                  The Problem
Population cannot be negative                        Negative Population
Population is indicated by numeric values            Weird Population Values
URLs usually start with http://, https://, etc.      Invalid URLs

Data retrieved on 2011-03-12 from http://loc.openlinksw.com/sparql
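
The three requirements on this slide can be turned into simple checks. The sketch below is a minimal Python illustration, not the authors' implementation. The record layout, the field names `population` and `url`, and the sample records are assumptions, and the URL check accepts only `http://` and `https://` although the slide also says "etc.".

```python
def check_record(record: dict) -> list:
    """Return the problems found in one record (hypothetical record layout)."""
    problems = []

    # Requirement: population is indicated by numeric values.
    try:
        population = float(record.get("population"))
    except (TypeError, ValueError):
        problems.append("weird population value")
    else:
        # Requirement: population cannot be negative.
        if population < 0:
            problems.append("negative population")

    # Requirement: URLs usually start with http://, https://, etc.
    # Assumption: only these two prefixes are treated as valid here.
    if not str(record.get("url", "")).startswith(("http://", "https://")):
        problems.append("invalid URL")

    return problems


# Made-up sample records for illustration only.
records = [
    {"population": "3500", "url": "http://example.org/a"},
    {"population": "-120", "url": "https://example.org/b"},
    {"population": "approx. 5000", "url": "www.example.org"},
]

results = [check_record(r) for r in records]
for record, problems in zip(records, results):
    print(record, "->", problems)
```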



Satisfying Quality Requirements

(Diagram: Status Quo = Desired State, with desired states coming from individuals, groups, standards, etc.)

• Problem 1: Expressing Quality Requirements
• Problem 2: Harmonizing Requirements
• Problem 3: Satisfying Requirements

Part 3:
                               Research Goal



Major Research Goal
 • Represent Quality-Relevant information for
   automated…
                       – Data Quality Monitoring
                       – Data Quality Assessment
                       – Data Cleansing
                       – Filtering of High Quality Data

                                 …in a standardized vocabulary.


Motives for DQM-Vocabulary
• Help people explicitly express data quality
  requirements in the "same language" on Web scale
• Support the creation of consensual agreements
  on quality requirements
• Reduce effort for DQM activities
• Raise transparency about assumed quality
  requirements
• Enable consistency checks among quality
  requirements

Part 4:
                               Our Approach



Basic Architecture

(Layered architecture diagram, listed from top to bottom)

• Problem Classification, Assessment Scores, HQ Data Retrieval, Cleansed Data
• SPARQL-Query-Engine, DQM-Vocabulary
• Knowledgebase
• RDB A, RDB B, Data Acquisition

Main Concepts of DQM-Vocabulary

• Classify Quality Problems
• Express Requirements
• Annotate Quality Scores
• Express Cleansing Tasks
• Account for Task-Dependent Requirements

Data Quality Problem Types:
          Source for Potential Requirements

• Inconsistent duplicates
• Invalid characters
• Missing classification
• Incorrect reference
• Character alignment violation
• Approximate duplicates
• Word transpositions
• Invalid substrings
• Mistyping / Misspelling errors
• Cardinality violation
• Missing values
• Referential integrity violation
• Misfielded values
• Unique value violation
• False values
• Functional Dependency Violation
• Out of range values
• Imprecise values
• Existence of Homonyms
• Meaningless values
• Incorrect classification
• Existence of Synonyms
• Contradictory relationships
• Outdated conceptual elements
• Outdated values

Data Quality Requirements
                                      Syntactical Rules
                                      Semantic Rules
                                     Redundancy Rules
                                    Completeness Rules
                                      Timeliness Rules
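
To make some of these rule categories more concrete, the following Python sketch shows one possible check each for completeness, redundancy, and timeliness. It is an illustration under assumptions only: the slides do not define how the rules are evaluated, and the function names, record fields, cutoff date, and sample records are all hypothetical.

```python
from collections import Counter
from datetime import date


def completeness(records, field):
    """Completeness check: share of records whose field is present and not None."""
    if not records:
        return 1.0  # assumption: nothing is missing in an empty data set
    filled = sum(1 for r in records if r.get(field) is not None)
    return filled / len(records)


def duplicate_values(records, field):
    """Redundancy check: values of the field that occur more than once."""
    counts = Counter(r[field] for r in records if field in r)
    return sorted(value for value, n in counts.items() if n > 1)


def outdated(records, field, cutoff):
    """Timeliness check: records whose date field lies before the cutoff."""
    return [r for r in records if r[field] < cutoff]


# Made-up sample records for illustration only.
records = [
    {"name": "Alpha", "population": 100, "updated": date(2011, 1, 10)},
    {"name": "Beta", "population": None, "updated": date(2009, 6, 1)},
    {"name": "Alpha", "population": 100, "updated": date(2011, 2, 1)},
    {"name": "Gamma", "population": 50, "updated": date(2010, 3, 5)},
]

population_completeness = completeness(records, "population")
duplicate_names = duplicate_values(records, "name")
outdated_names = [r["name"] for r in outdated(records, "updated", date(2010, 1, 1))]
print(population_completeness, duplicate_names, outdated_names)
```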




Quality-Influencing Artifacts

(Diagram of quality-influencing artifacts; the current focus of the DQM-Vocabulary is marked as: Data)




Design Alternatives:
   Statements about Classes & Properties


(1) Using classes and properties as subjects

(2) Using datatype properties with xsd:anyURI

(3) Mapping class and property URIs to new URIs


Part 5:
                    Application Examples



Example 1: Legal Value Rule (1/3)


               Which instances have illegal values
                 for property foo:country?




Example 1: Legal Value Rule (2/3)

(Diagram; legend: Class, Instance, Literal value)

• Class: dqm:LegalValueRule
• Instance: foo:LegalValueRule_1
• Literal values: "tref:Countries", "tref:countryName", "foo:Countries", "foo:countryName"
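
The rule instance on this slide names a tested class and property (foo:Countries, foo:countryName) and a trusted reference class and property (tref:Countries, tref:countryName). The Python sketch below shows one way such a rule could be evaluated in memory. It is an assumption-laden illustration, not the authors' SPARQL-based approach: the attribute names of `LegalValueRule`, the dictionary data layout, the instance identifiers, and the sample values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LegalValueRule:
    # Attribute names are hypothetical; the slide shows only the four literal values.
    tested_class: str
    tested_property: str
    trusted_class: str
    trusted_property: str


def illegal_instances(rule, data):
    """Return ids of tested instances whose value is not in the trusted reference."""
    legal_values = {
        inst[rule.trusted_property]
        for inst in data.get(rule.trusted_class, [])
        if rule.trusted_property in inst
    }
    return [
        inst["id"]
        for inst in data.get(rule.tested_class, [])
        if rule.tested_property in inst
        and inst[rule.tested_property] not in legal_values
    ]


rule_1 = LegalValueRule(
    tested_class="foo:Countries",
    tested_property="foo:countryName",
    trusted_class="tref:Countries",
    trusted_property="tref:countryName",
)

# Made-up sample data: class name -> list of instances.
data = {
    "tref:Countries": [
        {"id": "tref:c1", "tref:countryName": "Sweden"},
        {"id": "tref:c2", "tref:countryName": "Germany"},
    ],
    "foo:Countries": [
        {"id": "foo:a", "foo:countryName": "Sweden"},
        {"id": "foo:b", "foo:countryName": "Sweeden"},
        {"id": "foo:c", "foo:countryName": "Germany"},
    ],
}

violations = illegal_instances(rule_1, data)
print(violations)
```

Keeping the rule as plain data, separate from the function that evaluates it, mirrors the idea of storing requirements in a vocabulary and letting a generic query do the checking.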



Example 1: Legal Value Rule (3/3)




Example 2: DQ-Assessment (1/2)


               How syntactically accurate are all
                 properties that are subject to
                      LegalValueRules?
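
One plausible reading of this question is a per-property ratio of legal values to all values. The Python sketch below computes such a ratio. The metric definition is an assumption, since the slides do not state the formula, and the property `foo:currency`, the legal value sets, and the observed values are hypothetical.

```python
def syntactic_accuracy(values, legal_values):
    """Share of observed values that appear in the set of legal values."""
    values = list(values)
    if not values:
        return None  # assumption: no score when there is nothing to assess
    legal = set(legal_values)
    return sum(1 for v in values if v in legal) / len(values)


# Hypothetical legal value rules: property -> set of legal values.
legal_value_rules = {
    "foo:countryName": {"Sweden", "Germany"},
    "foo:currency": {"SEK", "EUR"},
}

# Made-up observed values per property.
observed = {
    "foo:countryName": ["Sweden", "Sweeden", "Germany", "Germany"],
    "foo:currency": ["SEK", "EUR"],
}

# Assess every property that is subject to a legal value rule.
scores = {
    prop: syntactic_accuracy(observed.get(prop, []), legal)
    for prop, legal in legal_value_rules.items()
}
print(scores)
```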




Example 2: DQ-Assessment (2/2)




Part 6:
                               Conclusions &
                               Planned Work


Advantages of DQM-Vocabulary

• Minimizes human effort for DQM
• Web-Scale sharing/reuse of data quality
  requirements
• Consistency checks among data quality
  requirements
• Transparency about applied data quality
  rules
Limitations
• Representation of complex functional
  dependency rules and derivation rules
• Limited experience with real-world data sets
• Currently no dedicated concepts for classes and
  properties
• Research still in progress


Future Work
• Evaluation of design alternatives
• Development of processing framework
• Representation of more complex
  functional dependency rules / derivation
  rules
• Extension of DQM-Vocabulary
• Evaluation on real-world data sets
• Publication at http://semwebquality.org
Christian Fürber
   Researcher
   E-Business & Web Science Research Group

                 Werner-Heisenberg-Weg 39
                 85577 Neubiberg
                 Germany

                 skype            c.fuerber
                 email            christian@fuerber.com
                 web              http://www.unibw.de/ebusiness
                 homepage         http://www.fuerber.com
                 twitter          http://www.twitter.com/cfuerber




Paper available at http://bit.ly/gYEDdQ

Towards a Vocabulary for Data Quality Management in Semantic Web Architectures

  • 1. Towards a Vocabulary for DQM in Semantic Web Architectures (Research in Progress) Christian Fürber and Martin Hepp christian@fuerber.com, mhepp@computer.org Presentation @ 1st International Workshop on Linked Web Data Management, March 25th, 2011, Uppsala, Sweden
  • 2. Part 1: What‘s the Problem? C. Fürber, M. Hepp: 2 Towards a Vocabulary for DQM In SemWeb Architectures
  • 3. Various Data Quality Problems Inconsistent duplicates Invalid characters Missing classification Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/ Incorrect reference Approximate duplicates Reference: Linking Open Data cloud diagram, by Character alignment violation Word transpositions Invalid substrings Mistyping / Misspelling errors Cardinality violation Missing values Referential integrity violation Misfielded values Unique value violation False values Functional Dependency Out of range values Violation Imprecise values Existence of Homonyms Meaningless values Incorrect classification Existence of Synonyms Contradictory relationships Outdated conceptual elements Untyped literals Outdated values C. Fürber, M. Hepp: 3 Towards a Vocabulary for DQM in SemWeb Architectures
  • 4. The Problem Negative Population Weird Population Values Invalid URL‘s Data retrieved on 2011-03-12 from http://loc.openlinksw.com/sparql C. Fürber, M. Hepp: 4 Towards a Vocabulary for DQM in SemWeb Architectures
  • 5. Part 2: What are high quality data? C. Fürber, M. Hepp: 5 Towards a Vocabulary for DQM In SemWeb Architectures
  • 6. What is Data Quality? • Data‘s „fitness for use by data consumers“ (Wang, Strong 1996) • „Conformance to specification“ (Kahn et al. 2002) • „Data are of high quality if they are fit for their intended uses in operations, decision making, and planning. Data are fit for use if they are free of defects and possess desired features.“ (Redman 2001) • Requirements as „Benchmark“ C. Fürber, M. Hepp: 6 Towards a Vocabulary for DQM in SemWeb Architectures
  • 7. Perspective-Neutral Data Quality Data quality is the degree to which data fulfills quality requirements …no matter who makes the quality requirements. C. Fürber, M. Hepp: 7 Towards a Vocabulary for DQM In SemWeb Architectures
  • 8. Quality- Requirements The Problem Population cannot be Negative negative Population Population is indicated by numeric values Weird Population Values URL‘s usually start with http://, https://, etc. Invalid URL‘s Data retrieved on 2011-03-12 from http://loc.openlinksw.com/sparql C. Fürber, M. Hepp: 8 Towards a Vocabulary for DQM in SemWeb Architectures
  • 9. Satisfying Quality Requirements Problem 3: Satisfying Requirements Desired State Individuals Status Quo = Desired State Groups Desired State Standards, etc. Problem 2: Harmonizing Requirements Problem 1: Expressing Quality Requirements C. Fürber, M. Hepp: 9 Towards a Vocabulary for DQM In SemWeb Architectures
  • 10. Part 3: Research Goal C. Fürber, M. Hepp: 10 Towards a Vocabulary for DQM In SemWeb Architectures
  • 11. Major Research Goal • Represent Quality-Relevant information for automated… – Data Quality Monitoring – Data Quality Assessment – Data Cleansing – Filtering of High Quality Data …in a standardized vocabulary. C. Fürber, M. Hepp: 11 Towards a Vocabulary for DQM in SemWeb Architectures
  • 12. Motives for DQM-Vocabulary • Support people to explicitly express data quality requirements in „same language“ on Web-Scale • Support the creation of consensual agreements upon quality requirements • Reduce effort for DQM-Activities • Raise transparency about assumed quality requirements • Enable consistency checks among quality requirements C. Fürber, M. Hepp: 12 Towards a Vocabulary for DQM In SemWeb Architectures
  • 13. Part 4: Our Approach C. Fürber, M. Hepp: 13 Towards a Vocabulary for DQM In SemWeb Architectures
  • 14. Basic Architecture Assessment HQ Data Problem Scores Retrieval Cleansed Classification Data SPARQL-Query-Engine DQM-Vocabulary Knowledgebase RDB A RDB B Data Acquisition C. Fürber, M. Hepp: 14 Towards a Vocabulary for DQM in SemWeb Architectures
  • 15. Main Concepts of DQM-Vocabulary: Classify Quality Problems; Express Requirements; Annotate Quality Scores; Express Cleansing Tasks; Account for Task-Dependent Requirements
  • 16. Data Quality Problem Types: Source for Potential Requirements: Inconsistent duplicates; Invalid characters; Missing classification; Incorrect reference; Character alignment violation; Approximate duplicates; Word transpositions; Invalid substrings; Mistyping / Misspelling errors; Cardinality violation; Missing values; Referential integrity violation; Misfielded values; Unique value violation; False values; Functional dependency violation; Out of range values; Imprecise values; Existence of homonyms; Meaningless values; Incorrect classification; Existence of synonyms; Contradictory relationships; Outdated conceptual elements; Outdated values
  • 17. Data Quality Requirements: Syntactical Rules; Semantic Rules; Redundancy Rules; Completeness Rules; Timeliness Rules
  • 18. Quality-Influencing Artifacts (diagram): Current focus of the DQM-Vocabulary: Data
  • 19. Design Alternatives: Statements about Classes & Properties: (1) Using classes and properties as subjects; (2) Using datatype properties with xsd:anyURI; (3) Mapping class and property URIs to new URIs
  • 20. Part 5: Application Examples
  • 21. Example 1: Legal Value Rule (1/3): Which instances have illegal values for the property foo:country?
  • 22. Example 1: Legal Value Rule (2/3) (diagram; legend: class, instance, literal value): Class dqm:LegalValueRule; instance foo:LegalValueRule_1; literal values "tref:Countries", "foo:Countries", "tref:countryName", "foo:countryName"
  • 23. Example 1: Legal Value Rule (3/3)
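The idea behind Example 1 can be sketched in plain Python over a list of triples. This is an illustration under assumptions, not the authors' SPARQL query: the field names of `LegalValueRule` are hypothetical, and the reading that the "tref:" literals name a trusted reference class and property while the "foo:" literals name the tested class and property is an interpretation of the slide. The sample data is invented.

```python
from dataclasses import dataclass

RDF_TYPE = "rdf:type"


@dataclass(frozen=True)
class LegalValueRule:
    # Hypothetical field names; the slide only shows four literal values.
    tested_class: str
    tested_property: str
    reference_class: str
    reference_property: str


def values_of(triples, cls, prop):
    """Return (instance, value) pairs of `prop` for instances typed as `cls`."""
    members = {s for s, p, o in triples if p == RDF_TYPE and o == cls}
    return [(s, o) for s, p, o in triples if p == prop and s in members]


def illegal_instances(triples, rule):
    """Return tested instances whose value is missing from the reference values."""
    legal = {v for _, v in values_of(triples, rule.reference_class, rule.reference_property)}
    tested = values_of(triples, rule.tested_class, rule.tested_property)
    return sorted(s for s, v in tested if v not in legal)


# Invented sample data: two trusted country names and two tested instances.
triples = [
    ("tref:c1", RDF_TYPE, "tref:Countries"),
    ("tref:c1", "tref:countryName", "Germany"),
    ("tref:c2", RDF_TYPE, "tref:Countries"),
    ("tref:c2", "tref:countryName", "Sweden"),
    ("foo:a", RDF_TYPE, "foo:Countries"),
    ("foo:a", "foo:countryName", "Germany"),
    ("foo:b", RDF_TYPE, "foo:Countries"),
    ("foo:b", "foo:countryName", "Germny"),
]

rule = LegalValueRule("foo:Countries", "foo:countryName", "tref:Countries", "tref:countryName")

if __name__ == "__main__":
    print(illegal_instances(triples, rule))
```

Because the rule is itself data, the same `illegal_instances` function serves any number of legal value rules without code changes.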
  • 24. Example 2: DQ-Assessment (1/2) How syntactically accurate are all properties that are subject to LegalValueRules?
  • 25. Example 2: DQ-Assessment (2/2)
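One plausible reading of Example 2 is a score per property: the share of its values that appear among the legal values. The sketch below assumes that definition, which the slides do not spell out; the function name `accuracy_scores` and the sample values are hypothetical.

```python
def accuracy_scores(values_by_property, legal_values_by_property):
    """Return, per property, the fraction of values found in its legal value set.

    Assumption: syntactic accuracy = legal values / all values of the property.
    Properties without any values are skipped rather than scored.
    """
    scores = {}
    for prop, legal in legal_values_by_property.items():
        values = values_by_property.get(prop, [])
        if not values:
            continue
        scores[prop] = sum(v in legal for v in values) / len(values)
    return scores


# Invented sample data for one property that is subject to a legal value rule.
observed = {"foo:countryName": ["Germany", "Sweden", "Germny", "Sweden"]}
legal = {"foo:countryName": {"Germany", "Sweden"}}

if __name__ == "__main__":
    print(accuracy_scores(observed, legal))
```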
  • 26. Part 6: Conclusions & Planned Work
  • 27. Advantages of DQM-Vocabulary • Minimizes human effort for DQM • Web-scale sharing/reuse of data quality requirements • Consistency checks among data quality requirements • Transparency about applied data quality rules
  • 28. Limitations • Representation of complex functional dependency rules and derivation rules • Limited experience with real-world data sets • Currently no dedicated concepts for classes and properties • Research still in progress
  • 29. Future Work • Evaluation of design alternatives • Development of processing framework • Representation of more complex functional dependency rules / derivation rules • Extension of DQM-Vocabulary • Evaluation on real-world data sets • Publication at http://semwebquality.org
  • 30. Christian Fürber, Researcher, E-Business & Web Science Research Group, Werner-Heisenberg-Weg 39, 85577 Neubiberg, Germany. Skype: c.fuerber; email: christian@fuerber.com; web: http://www.unibw.de/ebusiness; homepage: http://www.fuerber.com; Twitter: http://www.twitter.com/cfuerber. Paper available at http://bit.ly/gYEDdQ