How to clean data less through Linked (Open Data) approach?

1. Data Quality: data, metadata, linked data
2. The case of 840,000 cc licensed data
3. How LOD approach can improve data quality?

Published in: Technology

  1. 1. How to Clean data Less through Linked (Open Data) Approach Andrea Wei-Ching Huang Institute of Information Science, Academia Sinica, Taipei, Taiwan Dec. 7 2015 @ IIS R101 1. Data Quality: data, metadata, linked data 2. The case of 840,000 cc licensed data 3. How LOD approach can improve data quality?
  2. 2. 1. Data Quality: data, metadata, linked data
  3. 3. Quality dimensions in five frameworks. Information Quality: Stvilia et al. (2007), 22 dimensions. Data Quality: Batini et al. (2009), 28 dimensions. Metadata Quality: Tani et al. (2013), 10 parameters. Linked Data Quality: Zaveri et al. (2016), 18 dimensions. Data Quality Vocabulary: W3C (2015), 10 dimensions. Dimensions listed: Naturalness (I) Interoperability (RP) Statistics Accessibility (R) Accessibility Accessibility Availability (A) Availability Accuracy (R) Accuracy Accuracy (S) Semantic Accuracy (I) Accuracy Accuracy/Validity (I) Applicability Pertinence Syntactic Validity (I) Appropriate amount of data Complexity (R) Clarity Precision/Completeness (R) Completeness Completeness (S) Completeness (I) Completeness Informativeness/Redundancy (R) Comprehensiveness Understandability (C) Informativeness/Redundancy (I) Conciseness Conciseness (I) Structural Consistency (I) Consistency Similarity Consistency (I) Consistency Convenience Structural Consistency (R) Correctness Verifiability (R) Credibility Trustworthiness (C) Credibility Currency (I) Currency Semantic Consistency (I) Derivation Integrity Ease of operation Processability Naturalness (R) Interactivity Conformance (S) Interlinking (A) Conformance Semantic Consistency (R) Interpretability Interpretability (RP) Precision/Completeness (I) Maintainability Preservability Complexity (I) Objectivity Relevance/Aboutness (R) Relevancy Relevance Relevancy (C) Relevance Authority (Reputational) Reputation Security (R) Security Security (A) Speed Performance (A) Timeliness Timeliness Timeliness (C) Timeliness Traceability RP Conciseness (RP) Cohesiveness (I) Uniqueness Significance Usability Licensing (A) Volatility (R) Volatility Versatility (RP). Legend: (I): Intrinsic; (R): Relational; (S): Metadata Spec.; (RP): Representational; (A): Accessibility; (C): Contextual
  4. 4. Seven dimensions/parameters are common ground: 1. Accessibility/Availability 2. Accuracy 3. Completeness 4. Consistency 5. Credibility/Trustworthiness 6. Relevance 7. Timeliness. Quantitative and qualitative methodologies are used together.
  5. 5. Metadata Quality: Problems & Solutions (1) Record Problems Yasser, Chuttur M. "An analysis of problems in metadata records." Journal of Library Metadata 11.2 (2011): 51-62
  6. 6. Metadata Quality: Problems & Solutions (2) Dublin Core Semantic Problems Park, Jung-ran, and Eric Childress. "Dublin Core metadata semantics: An analysis of the perspectives of information professionals." Journal of Information Science 35.6 (2009): 727-739. • Type is a subjective value. • Source is a confusing field; it is difficult to apply consistently. • Creator can vary widely, and it can be tricky to determine exactly who the creator is. • The information from the publisher is vague. • The different roles of a contributor cannot be defined. • There is often great ambiguity: in terms of Type and Relation; between Format and Type; between Creator, Publisher, and Contributor; between Source and Relation. • The Relation field engendered a high degree of difficulty (55.3%): discerning the dynamic and interrelated nature of information objects presents challenges in using the Relation element.
  7. 7. Metadata Quality: Problems & Solutions (3) Current Solutions Tani, Alice, Leonardo Candela, and Donatella Castelli. "Dealing with metadata quality: The legacy of digital library efforts." Information Processing & Management 49.6 (2013): 1194-1205. Tani et al. (2013): Summary of metadata quality approaches.
     • Metadata guidelines, standards and Application Profiles. Pros: potentially effective; if shared among organizations, they promote cross-organization interoperability. Cons: challenging to agree between different organizations; often end up being complex combinations of features reflecting the interests of many disparate parties; they infringe on the autonomy of the entities adopting them.
     • Metadata evaluation approaches (analytic-oriented and crowdsourcing-oriented). Pros: helpful to identify specific problems. Cons: based on community-specific criteria.
     • Semi-automatic metadata generation approaches. Pros: helpful to deal with the data deluge. Cons: human assessment.
     • Metadata cleaning, enhancement, augmentation approaches. Pros: fundamental to enable cross-community exploitation of metadata. Cons: information loss; information inconsistency.
  8. 8. 2. The case of 840,000 cc licensed data in the Union Catalogue of Digital Archives Taiwan
  9. 9. “Fitness for Use” is the Key: Data Quality (DQ) Definition for Digital Data • Nicholas R. Chrisman (1986): “Digital data can adapt to a broader range of uses with a broader range of special demand, … The root of data abuse is not in the quality of the data, but in the awareness and understanding of the quality of the data. By converting to the fitness for use approach, the problem of data abuse is moved from producer to consumer (data user).” • W3C Data Quality Vocabulary (2015): “...quality lies in the eye of the beholder; that there is no objective, ideal definition of it. Some datasets will be judged as low-quality resources by some data consumers, while they will perfectly fit others' needs.” • Quality from the perspectives of the supply and demand sides, e.g., Data Publishers, Certification Agencies, Data Aggregators and Data Consumers. • Pragmatic • User-specific • Context-dependent
  10. 10. “Fitness for Use” in different contexts (diagram). Stages: physical object → digital object → digital collection → digital aggregation & publication → reusing & semantic representation. Steps between stages: Creation, Conversion 1, Conversion 2, Conversion 3, Conversion 4, Clean & Enrich, Conversion 5. Actors: Local Curation (90 projects), with locally developed schemes; Digital Archive Curation (1 portal), with DC 15 elements as the requirement for the Union Catalog; Linked Open Data (globally linked & semantically represented). Formats: XML, HTML, TEXT/Image, XLSX/Table/HTML, CSV, RDF/Turtle. The diagram marks three contexts: CONTEXT I, CONTEXT II, CONTEXT III. Globally linked, machine-accessible semantics & domain knowledge vocabularies are needed for LOD.
  11. 11. Data quality along the pipeline (diagram). Stages: physical object → digital object → digital collection → digital aggregation & publication → reusing & semantic representation. Actors: Local Curation (90 projects); Digital Archive Curation (1 portal); Linked Open Data (globally linked & semantically represented). Formats: XML, HTML, TEXT/Image, XLSX/Table/HTML, CSV, RDF/Turtle. Data quality is marked at three points: Metadata Generation, where metadata guidelines & standard (DC 15) are provided; the intermediate stages, annotated with metadata evaluation approaches, semi-automatic metadata generation approaches, and metadata cleaning, enhancement, augmentation approaches; and Linked Data Generation. Open questions: Information loss? Interpretation problems? Time & resource cost?
  12. 12. Problems identified in the case of 840,000 cc data 1. Confusion over Dublin Core (DC) element definitions 2. Name Ambiguity 3. Inconsistent Encoding 4. Semantic Overlaps (semantic overloading) 5. Duplicate Records 6. Insufficient Element Usage (missing semantics) 7. Errors / Mistakes / Others
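Several of the problem types listed on this slide can be surfaced mechanically before any cleaning is attempted. The following is a minimal sketch, not the project's actual tooling: the sample rows and the helper names are hypothetical, and it covers only duplicate records, insufficient element usage, and inconsistent date encoding.

```python
import re
from collections import Counter

# The 15 Dublin Core elements (DC 15).
DC_ELEMENTS = [
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
]

# Hypothetical sample records; real rows would come from the published CSV.
ROWS = [
    {"identifier": "a1", "title": "Fish specimen", "date": "2012-03-01"},
    {"identifier": "a2", "title": "Fish specimen", "date": "2012/03/01"},
    {"identifier": "a1", "title": "Fish specimen", "date": "2012-03-01"},
]

ISO_DATE = re.compile(r"^\d{4}-\d{2}-\d{2}$")


def duplicate_identifiers(rows):
    """Return identifiers that occur in more than one record."""
    counts = Counter(row.get("identifier", "") for row in rows)
    return sorted(i for i, n in counts.items() if i and n > 1)


def element_usage(row):
    """Return how many of the 15 DC elements carry a non-empty value."""
    return sum(1 for name in DC_ELEMENTS if row.get(name, "").strip())


def non_iso_dates(rows):
    """Return date values that do not follow the YYYY-MM-DD pattern."""
    return [r["date"] for r in rows
            if r.get("date") and not ISO_DATE.match(r["date"])]


if __name__ == "__main__":
    print("duplicates:", duplicate_identifiers(ROWS))
    print("elements used per record:", [element_usage(r) for r in ROWS])
    print("non-ISO dates:", non_iso_dates(ROWS))
```

Checks like these only flag candidates. Whether a flagged value is really an error still depends on the context of use, which is the point of the "fitness for use" view.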
  13. 13. Considerations in the case of 840,000 cc data for LOD 1. We are not the data creators. Can we clean/revise the data “correctly”? • Keep the original CSV data open. • Publish revised/cleaned data as diff/mapping files. 2. How can we prevent “information loss”? • Mapping activities often result in information loss. • Reconsider the value of broken links. 3. Limited resources & time to handle the cleaning tasks.
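The idea of keeping the raw CSV untouched and publishing corrections as a separate diff/mapping file can be sketched as follows. This is an illustration under stated assumptions: the column layout of the diff file (`identifier`, `field`, `old`, `new`), the sample data, and the function names are all hypothetical, not the format the project used.

```python
import csv
import io

# Hypothetical raw CSV as published; it is never edited in place.
RAW_CSV = """identifier,title,date
a1,Fish specimen,2012/03/01
a2,Coral specimen,2013-07-15
"""

# Hypothetical diff/mapping file: one proposed correction per row.
DIFF_CSV = """identifier,field,old,new
a1,date,2012/03/01,2012-03-01
"""


def read_rows(text):
    """Parse CSV text into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def apply_diff(raw_rows, diff_rows):
    """Return cleaned copies of raw_rows.

    A correction is applied only when the current value still equals
    the recorded 'old' value, so a stale diff never overwrites data.
    """
    cleaned = [dict(row) for row in raw_rows]  # copies; raw_rows stays untouched
    by_id = {row["identifier"]: row for row in cleaned}
    for change in diff_rows:
        target = by_id.get(change["identifier"])
        if target is not None and target.get(change["field"]) == change["old"]:
            target[change["field"]] = change["new"]
    return cleaned


if __name__ == "__main__":
    raw = read_rows(RAW_CSV)
    for row in apply_diff(raw, read_rows(DIFF_CSV)):
        print(row)
```

Because the corrections live in their own file, anyone who disagrees with a change can ignore or replace that row of the diff and still start from the same raw data.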
  14. 14. 3. How Linked (Open Data) approach can improve data quality?
  15. 15. 1. Raw data and new data (cleaned data, semantically refined data) can benefit from the Open Data approach: • Creation of new data based on combining data. • External quality checks of data (validation). • Sustainability of data (no data loss). • The ability to merge, integrate and mesh public and private data. Janssen, Marijn, Yannis Charalabidis, and Anneke Zuiderwijk. "Benefits, adoption barriers and myths of open data and open government." Information Systems Management 29.4 (2012): 258-268.
  16. 16. 2. Using SPARQL queries to identify problems: Identify DQ problems before RDF is generated: • Use the W3C mapping language R2RML and the RDF validation framework RDFUnit on the mapping definitions, allowing publishers to catch & correct violations before they even happen. (Dimou et al., 2015) Identify DQ problems after RDF is generated: • Use SPARQL and publicly shared LOD resources (e.g., DBpedia, GeoNames) as references to identify problems. (Fürber and Hepp, 2010) Fürber, Christian, and Martin Hepp. "Using semantic web resources for data quality management." Knowledge Engineering and Management by the Masses. Springer Berlin Heidelberg, 2010. 211-225. Dimou, Anastasia, et al. "Assessing and Refining Mappings to RDF to Improve Dataset Quality." The Semantic Web-ISWC 2015. Springer International Publishing, 2015. 133-149.
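The "after RDF is generated" check can be pictured without a SPARQL endpoint. The sketch below is plain Python rather than SPARQL: triples are tuples, and a small hard-coded set stands in for a trusted reference source such as GeoNames. The `ex:` identifiers, the place names, and the function name are all hypothetical.

```python
# Triples as (subject, predicate, object) tuples; all values are hypothetical.
TRIPLES = [
    ("ex:item1", "dc:coverage", "Kenting"),
    ("ex:item2", "dc:coverage", "Kentin"),  # candidate misspelling
    ("ex:item3", "dc:title", "Untitled"),
]

# Stand-in for place names obtained from a trusted reference dataset.
REFERENCE_PLACES = {"Kenting", "Taipei"}


def unknown_values(triples, predicate, reference):
    """Return (subject, value) pairs whose value for `predicate`
    does not appear in the `reference` set."""
    return [(s, o) for s, p, o in triples
            if p == predicate and o not in reference]


if __name__ == "__main__":
    for subject, value in unknown_values(TRIPLES, "dc:coverage", REFERENCE_PLACES):
        print(f"{subject}: '{value}' not found in reference data")
```

With real RDF the same comparison would presumably be written as a SPARQL query that selects values with no match in the reference graph, for example with `FILTER NOT EXISTS`. A value missing from the reference is only a candidate problem, since the reference itself may be incomplete.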
  17. 17. 3. Use Vocabularies, Ontologies & LOD Knowledge Bases: To improve data quality at every step of a dataset's lifecycle (e.g., W3C Data Quality Vocabulary). To enrich data semantics and increase the value of reused and refined data. • Collaborative representation and use of quality-relevant knowledge • Automatic identification of conflicting data requirements • Semantic definition of data • Use of Semantic Web data as trusted reference data • Content integration with ontologies. The five points above are summarized from Fürber and Hepp (2013): "Using Semantic Web Technologies for Data Quality Management." Handbook of Data Quality. Springer Berlin Heidelberg, 2013. 141-161.
  18. 18. http://www.w3.org/TR/vocab-dqv/ The importance of provenance and metadata quality. (Carata, Lucian, et al. 2014)
  19. 19. The Story of A Fish http://catalog.digitalarchives.tw/item/00/5f/ca/d5.html Parapercis kentingensis
  20. 20. http://URI of this Fish/6277845 (diagram). Timeline: 2012, 2015, 2016. Formats: TEXT/Image, XLSX/Table/HTML, XML/HTML, CSV (raw data published as open data). (1) Metadata (DC 15): 12/15 triples (statements) for item 6277845. (2) The same 12/15 triples (statements) plus Provenance. (3) Mapping, replacing and cleaning: plus one “diff” triple (err → new, e.g., wikidata) and one “time mapping” triple (time err → new). Place information is not described in Coverage but in Description at this stage. It should be cleaned & mapped to external resources like GeoNames and TaiwanPlaceName by us, or by others, when time and resources are available. (4) Refined Version: semantically enriched by using domain vocabularies like Darwin Core Terms. Vocabularies shown: prov, r4r, schema, cc, odw. (5) Once the raw CSV and the DC 15 triples (DC 15 Version) are published, it is easy for others to detect errors and to reuse and enrich the data according to their own fitness for use and interpretations. Even if there are errors from the beginning, more statements about this Fish (6277845) can thus be generated by the interests of the community.
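Stages (1) and (2) of the fish example, where only the filled DC elements become statements and one provenance statement is added, can be sketched as below. This is a rough illustration, not the project's converter: the base URI, the sample record values, and the source URL are hypothetical, while the Dublin Core and PROV namespace URIs are the standard ones.

```python
DC_NS = "http://purl.org/dc/elements/1.1/"
PROV_DERIVED = "http://www.w3.org/ns/prov#wasDerivedFrom"
BASE = "http://example.org/item/"  # hypothetical base URI, not the project's


def literal(value):
    """Serialize a string as an N-Triples literal with minimal escaping."""
    escaped = (value.replace("\\", "\\\\")
                    .replace('"', '\\"')
                    .replace("\n", "\\n"))
    return '"' + escaped + '"'


def record_to_ntriples(item_id, record, source_url):
    """Emit one triple per non-empty DC element plus one provenance triple."""
    subject = "<" + BASE + item_id + ">"
    lines = []
    for element, value in record.items():
        if value.strip():  # empty elements yield no statement
            lines.append(f"{subject} <{DC_NS}{element}> {literal(value)} .")
    # Provenance: point back to the raw file the statements came from.
    lines.append(f"{subject} <{PROV_DERIVED}> <{source_url}> .")
    return lines


if __name__ == "__main__":
    # Hypothetical record: 'coverage' is empty, so it produces no triple.
    sample = {"title": "A fish", "coverage": "", "identifier": "6277845"}
    for line in record_to_ntriples("6277845", sample, "http://example.org/raw.csv"):
        print(line)
```

An item with 12 of its 15 elements filled would yield 12 element triples under this scheme, which matches the "12/15 triples" count on the slide. Later corrections can then be added as further statements instead of overwriting these.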
  21. 21. How we clean data less through the Linked (Open Data) Approach 1. Keep the original CSV data open. 2. Clean less, map more: publish revised/cleaned data as diff/mapping files. 3. Publish the original DC 15 statements as 15 triples and provide provenance information. 4. Assign each item resource a URI. 5. Use domain vocabularies to enrich the resource (e.g., dwc). 6. Map and link to external databases to enrich statements (GeoNames, TaiwanPlaceNames, Encyclopedia of Life). 7. More errors or meanings will be stated by third parties and crowdsourcing for their own interests.
  22. 22. References
     1. Batini, Carlo, et al. "Methodologies for data quality assessment and improvement." ACM Computing Surveys (CSUR) 41.3 (2009): 16.
     2. Chrisman, Nicholas R. "Obtaining information on quality of digital data." Proc. AutoCarto London. Vol. 1. 1986.
     3. Carata, Lucian, et al. "A primer on provenance." Communications of the ACM 57.5 (2014): 52-60.
     4. Dimou, Anastasia, et al. "Assessing and Refining Mappings to RDF to Improve Dataset Quality." The Semantic Web-ISWC 2015. Springer International Publishing, 2015. 133-149.
     5. Fürber, Christian, and Martin Hepp. "Using semantic web resources for data quality management." Knowledge Engineering and Management by the Masses. Springer Berlin Heidelberg, 2010. 211-225.
     6. Fürber, Christian, and Martin Hepp. "Using Semantic Web Technologies for Data Quality Management." Handbook of Data Quality. Springer Berlin Heidelberg, 2013. 141-161.
     7. Hooland, Seth van, and Ruben Verborgh. Linked data for libraries, archives and museums. (2014).
     8. Janssen, Marijn, Yannis Charalabidis, and Anneke Zuiderwijk. "Benefits, adoption barriers and myths of open data and open government." Information Systems Management 29.4 (2012): 258-268.
     9. Manus, Susan. The Value of a Broken Link (2012): http://blogs.loc.gov/digitalpreservation/2012/03/the-value-of-a-broken-link/
     10. Park, Jung-ran, and Eric Childress. "Dublin Core metadata semantics: An analysis of the perspectives of information professionals." Journal of Information Science 35.6 (2009): 727-739.
     11. Stvilia, Besiki, et al. "A framework for information quality assessment." Journal of the American Society for Information Science and Technology 58.12 (2007): 1720-1733.
     12. Tani, Alice, Leonardo Candela, and Donatella Castelli. "Dealing with metadata quality: The legacy of digital library efforts." Information Processing & Management 49.6 (2013): 1194-1205.
     13. W3C, Data Quality Vocabulary (2015), http://www.w3.org/TR/vocab-dqv/
     14. Yasser, Chuttur M. "An analysis of problems in metadata records." Journal of Library Metadata 11.2 (2011): 51-62.
     15. Zaveri, Amrapali, et al. "Quality assessment for linked open data: A survey." Semantic Web 7.1 (2016).
  23. 23. Merry Christmas and Happy New Year. We will release the DC 15 Version and the Refined Version (Biology) shortly.