Data Quality and Data Curation - a personal view

Presentation from OAI Geneva 2013-06-20

Published in: Technology


  1. Data Quality and Data Curation – a personal view
     Kevin Ashley
     Director, Digital Curation Centre
     www.dcc.ac.uk
     Kevin.ashley@ed.ac.uk
     Reusable with attribution: CC-BY
     The DCC is supported by Jisc
  2. (An aside – terminology)
     • Data archive
     • Data bank
     • Data library
     • Data center
     • Data repository
     I will use these terms interchangeably
     2013-06-20 Kevin Ashley – CC-BY
  7. From the ‘content validation’ section…
     “Attempts to compare the dataset with summary figures from the published reports were not helpful. It is not clear how the data in the reports was derived from the original dataset, although the numbers of available accounts for each year are of a similar order of magnitude to those in this dataset.”
  8. The Company Accounts Data
     • More than one version in the wild
     • Ours – direct from government
       – Authentic; well-documented; inaccurate
     • Others – altered by economists, social scientists
       – Poor provenance; less documentation; more accurate
  9. Observations
     • Message – quality is in the eye of the beholder
     • Improving one aspect of quality can damage another
     • Some markets provide many versions of the same data
       – Room for more of this approach with research data?
  10. The conjugation of quality
     • I want data that is accurate
     • You want data that is up to date
     • She wants data that is comprehensive
     • They want data that is free
     We probably cannot all be happy at the same time
  11. Terminology - OAIS
     • Consumer – anyone able to access data in the archive
     • Designated Community – the set of Consumers that the archive aims to serve
     • The Designated Community need not be a single community with a single set of interests
  12. The problem
     • We all want high-quality data
     • Every data repository believes that its curation processes add quality
     • But – when we talk about quality we are talking about different things
     • Some aspects of quality conflict with others
     • What does this mean for curation processes?
     • How do we maximise re-use potential?
  13. Engineer’s mantra
     FAST, GOOD, CHEAP – pick any two!
  14. Current curation practice
     • Only one consumer group catered for per repository
     • One workflow applies one set of quality controls and produces one dataset out for each dataset in
     • Quality measures are often not explicit and rarely generic
     • Disciplines differ greatly from the generic – contrast (e.g.) WDS with Zenodo
  15. Clarification?
     • Documented fully in work by Wang & Strong
       – Beyond Accuracy – what data quality means to data consumers, Journal of Management Information Systems 12(4); 1996
     • Analysis goes further than ‘research data’
     • Some researchers interested in data from outside the academy
     • Some data in universities has other, non-academic uses
     • Data moves between government, academic, commercial and public sectors
  16. Wang & Strong’s quality dimensions
     Category           Dimensions
     Intrinsic          Believability; Accuracy; Objectivity; Reputation
     Contextual         Value-added; Relevancy; Timeliness; Completeness; Appropriate amount
     Representational   Interpretability; Ease of understanding; Representational consistency; Concise representation
     Accessibility      Accessibility; Access security
     • Some of these are to do with the systems rather than the data
     • Research repositories focus on some dimensions and neglect others
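The four categories and their dimensions map naturally onto a small lookup structure. The sketch below is a minimal Python illustration using only the names listed on the slide; the identifiers `WANG_STRONG_DIMENSIONS` and `category_of` are invented for this example and do not come from the talk or from Wang & Strong.

```python
# Categories and dimensions exactly as listed on the slide.
# The variable and function names are hypothetical, for illustration only.
WANG_STRONG_DIMENSIONS = {
    "Intrinsic": ["Believability", "Accuracy", "Objectivity", "Reputation"],
    "Contextual": [
        "Value-added", "Relevancy", "Timeliness",
        "Completeness", "Appropriate amount",
    ],
    "Representational": [
        "Interpretability", "Ease of understanding",
        "Representational consistency", "Concise representation",
    ],
    "Accessibility": ["Accessibility", "Access security"],
}


def category_of(dimension: str) -> str:
    """Return the category a quality dimension belongs to (case-insensitive)."""
    wanted = dimension.strip().lower()
    for category, dimensions in WANG_STRONG_DIMENSIONS.items():
        if wanted in (d.lower() for d in dimensions):
            return category
    raise KeyError(f"unknown quality dimension: {dimension!r}")
```

A repository could use a structure like this to state which dimensions its curation workflow addresses and which it leaves alone.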
  17. Caveats
     • Some important factors (e.g. cost) are hidden in these dimensions
     • Some others are conflated (e.g. accuracy and precision)
     • Not the full, or only, story – but any analysis is better than none
  18. What can we gain?
     • Greater mobility for data curation professionals – remove domain-specificity
     • Increased number of generic data quality tools
     • Training that emphasises transferable skills
     • Ease of integration of data from disparate sources without error
  19. Future curation, greater reuse
     • Be explicit about quality metrics and curation processes in domain-independent ways
     • Allow greater choice by Consumers
     • Look harder at cost/benefit of quality processes – adjust where necessary
     • Express quality in machine-readable interoperable ways
  20. “I need that data now!!! I don’t care how messy it is – I can fix it!”
     “I’ve wasted too much of my life fixing other people’s bad data. I’m not interested until you’ve cleaned it up and documented it. Besides, I have other things to think about.”
  21. County    Cereal Acres   Livestock Acres   Pop
     Herts     23456          65345             .
     Beds      12963          .                 .
     Bucks     42331          .                 .
     Essex     78113          .                 .
     Sussex    5428           .                 .
     Norfolk   99237          .                 .
     • This value is highly accurate – to within .01%
     • This variable is highly precise – to 5SF
     • This table is incomplete – 10% of records are missing
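The three callouts on this slide make quality statements at three different scopes: a single value, a whole variable, and the table. A minimal Python sketch of how such statements might be recorded in machine-readable form follows. The `QualityAssertion` class, its field names, and the JSON layout are assumptions made for this example, not a format proposed in the talk. The transcript does not preserve which cell or column each callout pointed at, so the `target` strings are placeholders.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class QualityAssertion:
    """One quality statement about a value, a variable, or a whole table.

    Hypothetical structure for illustration; not a published standard.
    """
    scope: str        # "value", "variable", or "table"
    target: str       # placeholder identifier (the slide's arrows are lost)
    dimension: str    # e.g. "accuracy", "precision", "completeness"
    measure: str      # what the number means
    amount: float


# The three callouts from the slide, with placeholder targets.
ASSERTIONS = [
    QualityAssertion("value", "some-cell", "accuracy",
                     "relative_error_percent", 0.01),
    QualityAssertion("variable", "some-column", "precision",
                     "significant_figures", 5),
    QualityAssertion("table", "this-table", "completeness",
                     "missing_records_percent", 10),
]


def to_json(assertions):
    """Serialise the assertions so other tools can read them."""
    return json.dumps([asdict(a) for a in assertions], indent=2)
```

Calling `to_json(ASSERTIONS)` yields a JSON array of three objects, one per callout, which a consumer could parse without knowing anything about the discipline the table came from.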
  22. Future curation & reuse
     • Assertions automated, machine-readable
     • Facilitates automated aggregation & analysis
     • Big data emerges from the long tail
     • Data released from sub-discipline silos
     • Non-disciplinary repositories play a greater role
     • Disciplinary archives can use expertise in wider domains
     • Money is spent to best effect for research and reuse
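Once quality assertions are machine-readable, a Consumer can select among versions of a dataset automatically. The Python sketch below illustrates this with the two versions of the company accounts data described earlier in the talk. The version labels, the `VERSIONS` mapping, and the `versions_for` function are hypothetical names chosen for this example, and the qualities are reduced to simple tags rather than measured values.

```python
# Qualities asserted for each version, taken from the
# "Company Accounts Data" slide. Labels are illustrative only.
VERSIONS = {
    "direct-from-government": {"authentic", "well-documented"},
    "altered-by-researchers": {"accurate"},
}


def versions_for(required, versions=VERSIONS):
    """Return the names of versions asserting every required quality."""
    required = set(required)
    return sorted(
        name for name, qualities in versions.items()
        if required <= qualities
    )
```

A Consumer who needs accuracy gets the altered version, one who needs authenticity and documentation gets the government version, and one who asks for both authenticity and accuracy gets an empty list, which matches the talk's point that not everyone can be satisfied by a single version.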
