Data Communities
Reusable data in and outside your organization
Prof. Paul Groth | @pgroth | pgroth.com | indelab.org
Thanks to Dr. Kathleen Gregory, Dr. Laura Koesten, Prof. Elena
Simperl, Dr. Pavlos Vougiouklis, Dr. Andrea Scharnhorst, Prof. Sally Wyatt
ConTech Forum 2021
June 15, 2021
Prof. Elena Simperl
King’s College London
Dr. Laura Koesten
King’s College London /
University of Vienna
Dr. Kathleen Gregory
KNAW DANS
Prof. Sally Wyatt
Maastricht University
Dr. Andrea Scharnhorst
KNAW DANS
Dr. Pavlos Vougiouklis
Huawei
We investigate intelligent systems that support people in
their work with data and information from diverse sources.
In this area, we perform applied and fundamental research
informed by empirical insights into data science practice.
Current topics:
• Automated Knowledge Base Construction
• Data Search + Data Provenance
• Data Management for Machine Learning
• Causality for machine learning on messy data
indelab.org
Thanks to my
collaborators on this work in
HCI, social science, humanities
Data is everywhere in your organization
Supervision Sources / Signals
• Knowledge or entity graphs: e.g. databases of facts about the target
domain.
• Aggregate statistics: e.g. tracked metrics about the target domain.
• Heuristics and rules: e.g. existing human-authored rules about the target
domain.
• Topic models, taggers, and classifiers: e.g. machine learning models about
the target domain or a related domain.
https://ai.googleblog.com/2019/03/harnessing-organizational-knowledge-for.html
What should we do as data providers to enable data reuse?
Lots of good advice
• Maybe a bit too much….
• 140 policies on fairsharing.org as
of April 5, 2021
• We reviewed 40 papers
• Cataloged 39 different features of datasets
that enable data reuse
Where should a data provider start?
• Lots of good advice!
• It would be great to do all these things
• But it’s all a bit overwhelming
• Can we help prioritize?
Getting some data
• Used GitHub as a case study
• ~1.4 million datasets (e.g. CSV, Excel) from
~65K repos
• Use engagement metrics as proxies for data
reuse
• Map literature features to both dataset and
repository features
• Train a predictive model to see which
features are good predictors
Dataset Features
Missing values
Size
Columns + Rows
Readme features
Issue features
Age
Description
Parsable
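As a rough illustration of the dataset features listed above, the following minimal sketch computes a few of them (size, columns, rows, missing values, parsable) for CSV text using only the Python standard library. The function name `dataset_features`, the sample data, and the choice to treat empty cells as missing are assumptions made for this example; they are not taken from the study.

```python
import csv
import io


def dataset_features(csv_text: str) -> dict:
    """Compute a few simple dataset-level features from CSV text (illustrative only)."""
    features = {"size_bytes": len(csv_text.encode("utf-8"))}
    try:
        rows = list(csv.reader(io.StringIO(csv_text)))
    except csv.Error:
        # The stdlib reader is lenient, so this is only a weak parsability check.
        features["parsable"] = False
        return features
    header, body = (rows[0], rows[1:]) if rows else ([], [])
    cells = [cell for row in body for cell in row]
    # Assumption: an empty or whitespace-only cell counts as a missing value.
    missing = sum(1 for cell in cells if cell.strip() == "")
    features.update(
        parsable=True,
        columns=len(header),
        rows=len(body),
        missing_ratio=missing / len(cells) if cells else 0.0,
    )
    return features


# Made-up sample data, used only to exercise the function.
sample = "country,cases,deaths\nA,10,1\nB,,0\nC,5,\n"
print(dataset_features(sample))
```

Repository-level features such as README, issue, and age information would need to come from the hosting platform and are not covered by this sketch.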
Where to start?
• Some ideas from this study if you’re publishing data with
GitHub
• provide an informative short textual summary of the
dataset
• provide a comprehensive README file in a
structured form and links to further information
• datasets should not exceed standard processable file
sizes
• datasets should be possible to open with a standard
configuration of a common library (such as Pandas)
Trained a recurrent neural network. There might be better models, but it is useful for
handling text. Not the greatest predictor (good for classifying, not reuse),
but still useful for helping us tease out features
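Some of the suggestions on this slide can be checked automatically before publishing. The sketch below is one possible pre-publication check, assuming a local repository directory containing a CSV file. The function name `publication_checks`, the 50 MB limit, and the use of the standard-library `csv` reader as a stand-in for "a standard configuration of a common library" are all assumptions for illustration; the slides give no concrete threshold. Whether a summary is informative or a README is comprehensive still needs human judgment.

```python
import csv
import tempfile
from pathlib import Path

# Hypothetical threshold: the slides say "standard processable file sizes"
# without giving a number, so this value is only a placeholder.
MAX_BYTES = 50 * 1024 * 1024


def publication_checks(repo_dir: Path, data_file: str) -> dict:
    """Return pass/fail flags for one dataset file in a repository directory."""
    path = repo_dir / data_file
    readmes = [p for p in repo_dir.iterdir() if p.name.lower().startswith("readme")]
    try:
        with path.open(newline="", encoding="utf-8") as handle:
            rows = list(csv.reader(handle))  # default dialect, no custom options
        # Passes only if there is a header plus data and every row has the same width.
        opens_cleanly = len(rows) > 1 and len({len(row) for row in rows}) == 1
    except (OSError, UnicodeDecodeError, csv.Error):
        opens_cleanly = False
    return {
        "has_nonempty_readme": any(p.stat().st_size > 0 for p in readmes),
        "within_size_limit": path.is_file() and path.stat().st_size <= MAX_BYTES,
        "opens_with_defaults": opens_cleanly,
    }


# Made-up repository contents, created in a temporary directory for the demo.
with tempfile.TemporaryDirectory() as tmp:
    repo = Path(tmp)
    (repo / "README.md").write_text("Daily case counts per country.\n", encoding="utf-8")
    (repo / "cases.csv").write_text("country,cases\nA,10\nB,12\n", encoding="utf-8")
    report = publication_checks(repo, "cases.csv")
print(report)
```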
Understand your target users
How would you make sense of this data?
Koesten, L., Gregory, K., Groth, P., & Simperl, E. (2021). Talking datasets –
Understanding data sensemaking behaviours. International Journal of Human-
Computer Studies, 146, 102562. https://doi.org/10.1016/j.ijhcs.2020.102562
Patterns of data-centric sense making
• 31 research “data people”
• Brought their own data
• Presented with unknown data
• Think-aloud
• Talk about both their own data and then the given data
• Interview transcripts + screen captures
Inspecting unknown data
Engaging with data
Known Unknown
Acronyms
and
abbreviations
“That is a classic abbreviation in the field of hepatic surgery. AFP is
alpha feto protein. It is a marker. It’s very well known by everybody...the
AFP score is a criterion for liver transplantation. (P22)”
“I’m not sure what ‘long’ means. I wonder if it’s not
something to do with longevity. On the other hand, no, it’s
got negative numbers. I can’t make sense of this. (P7)”
Identifying
strange
things
“Although we’ve tried really hard, because we’ve put in a coding frame
and how we manipulate all the data, I’m sure that there are things in
there which we haven’t recorded in terms of, well, what exactly does
this mean? I hope we’ve covered it all but I’m sure we haven’t. (P10)”
“Now that sounds quite high for the Falklands. I wouldn’t have
thought the population was all that great...and yet it’s only one
confirmed case. Okay [laughs]. So yes...one might need to
actually examine that a little bit more carefully, because the
population of the Falklands doesn’t reach a million, so
therefore you end up with this huge number of deaths per
million population [laughs], but only one case and one death.
(P23)”
Placing data
• P2: It’s listing the countries for which data are available, not sure if
this is truly all countries we know of...
• P8: It includes essentially every country in the world
• P29: Global data
• P30: I would like to know whether it’s complete...it says 212 rows
representing countries, whether I have data from all countries or
only from 25% or something because then it’s not really
representative.
• P7: If it was the whole country that was affected or not, affecting the
northern part, the western, eastern, southern parts
• P24: Was it sampled and then estimated for the whole country? Or
is it the exact number of deaths that were got from hospitals and
health agencies, for example? So is it a census or is it an estimate?
Activity patterns during data sense making
Recommendations
✅ for data providers
• Help users understand shape
• Provide information at the dataset level (e.g. summaries) ✅
• Column level summaries
• Make it easier to pan and zoom
• Use strange things as an entry point
• Flag and highlight strange things ✅
• Provide explanations of abbreviations and missing values ✅
• Provide metrics or links to other information structures necessary for
understanding the column’s content ✅
• Include links to basic concepts ✅
• Highlight relationships between columns or entities ✅
• Identify anchor variables that are considered most important ✅
• Help users place data
• Embrace different levels of expertise and enable drill down
• Link to standardized definitions ✅
• Connect to broader forms of documentation ✅
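Two of the recommendations above, column-level summaries and flagging strange things, lend themselves to simple tooling. The following is a minimal sketch using only the Python standard library. The function names, the column names, the `min_count` threshold, and the sample values are all made up for illustration; the second function is loosely motivated by the "deaths per million" remark quoted earlier, where a rate rests on a single case.

```python
import csv
import io
import statistics


def column_summaries(csv_text: str) -> dict:
    """Summarise each column: missing count, distinct values, numeric range."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    summaries = {}
    for name in reader.fieldnames or []:
        values = [row[name] for row in rows]
        present = [v for v in values if v not in ("", None)]
        numeric = []
        for value in present:
            try:
                numeric.append(float(value))
            except ValueError:
                pass
        summary = {"missing": len(values) - len(present), "distinct": len(set(present))}
        # Only report a numeric range when every present value is numeric.
        if numeric and len(numeric) == len(present):
            summary.update(
                min=min(numeric), max=max(numeric), median=statistics.median(numeric)
            )
        summaries[name] = summary
    return summaries


def flag_small_denominators(csv_text, rate_col, count_col, min_count=5):
    """Flag rows whose rate is derived from fewer than `min_count` events."""
    flagged = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        if row[rate_col] and row[count_col] and float(row[count_col]) < min_count:
            flagged.append(row)
    return flagged


# Hypothetical values, not taken from any real dataset.
sample = (
    "region,confirmed,deaths,deaths_per_million\n"
    "X,1,1,290.0\n"
    "Y,52000,1300,19.5\n"
    "Z,,,\n"
)
summaries = column_summaries(sample)
flagged = flag_small_denominators(sample, "deaths_per_million", "deaths")
print(summaries["confirmed"])
print([row["region"] for row in flagged])
```

A provider could surface such flags next to the data so that a surprising value becomes an entry point for exploration rather than a reason to distrust the dataset.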
Data is Social
Do you want a data community?
Gregory, K., Groth, P., Scharnhorst, A., & Wyatt, S. (2020). Lost
or found? Discovering data needed for research. Harvard Data
Science Review. https://doi.org/10.1162/99608f92.e38165eb
Conclusion
• For data platforms
• Think about ways of measuring data reuse
• Tooling for summaries and overviews of data
• Automated linking to information for sense making
• For data providers
• Simple steps
• Focus on making it easy to “get to know” your data.
• Easy to load and explore (e.g. in pandas, Excel, community tool)
• Links to more information
• Are you trying to be part of, or build, a data community?
• We still need a lot more work on data practices and methods informed by
practices
Paul Groth | @pgroth | pgroth.com | indelab.org
Backup
Enable access
Feature | Description | References
Access
License | (1) available, (2) allows reuse | W3C 3,22,45–47
Format/machine readability | (1) consistent format, (2) single value type per column, (3) human as well as machine readable and non-proprietary format, (4) different formats available | W3C2,22,48–50
Code available | for cleaning, analysis, visualizations | 51–53
Unique identifier | PID for the dataset/ID's within the dataset | W3C2,53
Download link/API | (1) available, (2) functioning | W3C47,50
Document
Documentation: Methodological Choices
Methodology | description of experimental setup (sampling, tools, etc.), link to publication or project | 3,13,54,60,63,66
Units and reference systems | (1) defined, (2) consistently used | 54,67
Representativeness/Population | in relation to a total population | 21,60
Caveats | changes: classification/seasonal or special event/sample size/coverage/rounding | 48,54
Cleaning/pre-processing | (1) cleaning choices described, (2) are the raw data available? | 3,13,21,68
Biases/limitations | different types of bias (i.e., sampling bias) | 21,49,69
Data management | (1) mode of storage, (2) duration of storage | 3,70,71
Documentation: Quality
Missing values/null values | (1) defined what they mean, (2) ratio of empty cells | W3C22,48,49,59,60
Margin of error/reliability/quality control procedures | (1) confidence intervals, (2) estimates versus actual measurements | 54,65
Formatting | (1) consistent data type per column, (2) consistent date format | W3C41,65
Outliers | are there data points that differ significantly from the rest | 22
Possible options/constraints on a variable | (1) value type, (2) if data contains an “other” category | W3C72
Last update | information about data maintenance if applicable | 21,62
Documentation: Summary Representations and Understandability
Description/README file | meaningful textual description (can also include text, code, images) | 22,54,55
Purpose | purpose of data collection, context of creation | 3,21,49,56,57
Summarizing statistics | (1) on dataset level, (2) on column level | 22,49
Visual representations | statistical properties of the dataset | 22,58
Headers understandable | (1) column-level documentation (e.g., abbreviations explained), (2) variable types, (3) how derived (e.g., categorization, such as labels or codes) | 22,59,60
Geographical scope | (1) defined, (2) level of granularity | 45,54,61,62
Temporal scope | (1) defined, (2) level of granularity | 45,54,61,62
Time of data collection | (1) when collected, (2) what time span | 63–65
Situate
Connections
Relationships between variables defined | (1) explained in documentation, (2) formulae | 21,22
Cite sources | (1) links or citation, (2) indication of link quality | 21
Links to dataset being used elsewhere | i.e., in publications, community-led projects | 21,59
Contact | person or organization, mode of contact specified | W3C41,73
Provenance and Versioning
Publisher/producer/repository | (1) authoritativeness of source, (2) funding mechanisms/other interests that influenced data collection specified | 21,49,54,59,74,75
Version indicator | version or modification of dataset documented | W3C50,66,76
Version history | workflow provenance | W3C50,76
Prior reuse/advice on data reuse | (1) example projects, (2) access to discussions | 3,27,59,60
Ethics
Ethical considerations, personal data | (1) data related to individually identifiable people, (2) if applicable, was consent given | 21,57,71,75
Semantics
Schema/Syntax/Data Model | defined | W3C47,67
Use of existing taxonomies/vocabularies | (1) documented, (2) link | W3C2

More Related Content

What's hot

Thoughts on Knowledge Graphs & Deeper Provenance
Thoughts on Knowledge Graphs  & Deeper ProvenanceThoughts on Knowledge Graphs  & Deeper Provenance
Thoughts on Knowledge Graphs & Deeper Provenance
Paul Groth
 
The need for a transparent data supply chain
The need for a transparent data supply chainThe need for a transparent data supply chain
The need for a transparent data supply chain
Paul Groth
 
The Challenge of Deeper Knowledge Graphs for Science
The Challenge of Deeper Knowledge Graphs for ScienceThe Challenge of Deeper Knowledge Graphs for Science
The Challenge of Deeper Knowledge Graphs for Science
Paul Groth
 
From Text to Data to the World: The Future of Knowledge Graphs
From Text to Data to the World: The Future of Knowledge GraphsFrom Text to Data to the World: The Future of Knowledge Graphs
From Text to Data to the World: The Future of Knowledge Graphs
Paul Groth
 
Knowledge Graph Futures
Knowledge Graph FuturesKnowledge Graph Futures
Knowledge Graph Futures
Paul Groth
 
Combining Explicit and Latent Web Semantics for Maintaining Knowledge Graphs
Combining Explicit and Latent Web Semantics for Maintaining Knowledge GraphsCombining Explicit and Latent Web Semantics for Maintaining Knowledge Graphs
Combining Explicit and Latent Web Semantics for Maintaining Knowledge Graphs
Paul Groth
 
Sources of Change in Modern Knowledge Organization Systems
Sources of Change in Modern Knowledge Organization SystemsSources of Change in Modern Knowledge Organization Systems
Sources of Change in Modern Knowledge Organization Systems
Paul Groth
 
An Ecosystem for Linked Humanities Data
An Ecosystem for Linked Humanities DataAn Ecosystem for Linked Humanities Data
An Ecosystem for Linked Humanities Data
Rinke Hoekstra
 
Knowledge Representation on the Web
Knowledge Representation on the WebKnowledge Representation on the Web
Knowledge Representation on the Web
Rinke Hoekstra
 
Machines are people too
Machines are people tooMachines are people too
Machines are people too
Paul Groth
 
Prov-O-Viz: Interactive Provenance Visualization
Prov-O-Viz: Interactive Provenance VisualizationProv-O-Viz: Interactive Provenance Visualization
Prov-O-Viz: Interactive Provenance Visualization
Rinke Hoekstra
 
Provenance and Reuse of Open Data (PILOD 2.0 June 2014)
Provenance and Reuse of Open Data (PILOD 2.0 June 2014)Provenance and Reuse of Open Data (PILOD 2.0 June 2014)
Provenance and Reuse of Open Data (PILOD 2.0 June 2014)Rinke Hoekstra
 
The Roots: Linked data and the foundations of successful Agriculture Data
The Roots: Linked data and the foundations of successful Agriculture DataThe Roots: Linked data and the foundations of successful Agriculture Data
The Roots: Linked data and the foundations of successful Agriculture Data
Paul Groth
 
Knowledge graph construction for research & medicine
Knowledge graph construction for research & medicineKnowledge graph construction for research & medicine
Knowledge graph construction for research & medicine
Paul Groth
 
Managing Metadata for Science and Technology Studies: the RISIS case
Managing Metadata for Science and Technology Studies: the RISIS caseManaging Metadata for Science and Technology Studies: the RISIS case
Managing Metadata for Science and Technology Studies: the RISIS case
Rinke Hoekstra
 
Cognitive data
Cognitive dataCognitive data
Cognitive data
Sören Auer
 
Towards Knowledge Graph based Representation, Augmentation and Exploration of...
Towards Knowledge Graph based Representation, Augmentation and Exploration of...Towards Knowledge Graph based Representation, Augmentation and Exploration of...
Towards Knowledge Graph based Representation, Augmentation and Exploration of...
Sören Auer
 
Describing Scholarly Contributions semantically with the Open Research Knowle...
Describing Scholarly Contributions semantically with the Open Research Knowle...Describing Scholarly Contributions semantically with the Open Research Knowle...
Describing Scholarly Contributions semantically with the Open Research Knowle...
Sören Auer
 
Towards an Open Research Knowledge Graph
Towards an Open Research Knowledge GraphTowards an Open Research Knowledge Graph
Towards an Open Research Knowledge Graph
Sören Auer
 
The State of Linked Government Data
The State of Linked Government DataThe State of Linked Government Data
The State of Linked Government Data
Richard Cyganiak
 

What's hot (20)

Thoughts on Knowledge Graphs & Deeper Provenance
Thoughts on Knowledge Graphs  & Deeper ProvenanceThoughts on Knowledge Graphs  & Deeper Provenance
Thoughts on Knowledge Graphs & Deeper Provenance
 
The need for a transparent data supply chain
The need for a transparent data supply chainThe need for a transparent data supply chain
The need for a transparent data supply chain
 
The Challenge of Deeper Knowledge Graphs for Science
The Challenge of Deeper Knowledge Graphs for ScienceThe Challenge of Deeper Knowledge Graphs for Science
The Challenge of Deeper Knowledge Graphs for Science
 
From Text to Data to the World: The Future of Knowledge Graphs
From Text to Data to the World: The Future of Knowledge GraphsFrom Text to Data to the World: The Future of Knowledge Graphs
From Text to Data to the World: The Future of Knowledge Graphs
 
Knowledge Graph Futures
Knowledge Graph FuturesKnowledge Graph Futures
Knowledge Graph Futures
 
Combining Explicit and Latent Web Semantics for Maintaining Knowledge Graphs
Combining Explicit and Latent Web Semantics for Maintaining Knowledge GraphsCombining Explicit and Latent Web Semantics for Maintaining Knowledge Graphs
Combining Explicit and Latent Web Semantics for Maintaining Knowledge Graphs
 
Sources of Change in Modern Knowledge Organization Systems
Sources of Change in Modern Knowledge Organization SystemsSources of Change in Modern Knowledge Organization Systems
Sources of Change in Modern Knowledge Organization Systems
 
An Ecosystem for Linked Humanities Data
An Ecosystem for Linked Humanities DataAn Ecosystem for Linked Humanities Data
An Ecosystem for Linked Humanities Data
 
Knowledge Representation on the Web
Knowledge Representation on the WebKnowledge Representation on the Web
Knowledge Representation on the Web
 
Machines are people too
Machines are people tooMachines are people too
Machines are people too
 
Prov-O-Viz: Interactive Provenance Visualization
Prov-O-Viz: Interactive Provenance VisualizationProv-O-Viz: Interactive Provenance Visualization
Prov-O-Viz: Interactive Provenance Visualization
 
Provenance and Reuse of Open Data (PILOD 2.0 June 2014)
Provenance and Reuse of Open Data (PILOD 2.0 June 2014)Provenance and Reuse of Open Data (PILOD 2.0 June 2014)
Provenance and Reuse of Open Data (PILOD 2.0 June 2014)
 
The Roots: Linked data and the foundations of successful Agriculture Data
The Roots: Linked data and the foundations of successful Agriculture DataThe Roots: Linked data and the foundations of successful Agriculture Data
The Roots: Linked data and the foundations of successful Agriculture Data
 
Knowledge graph construction for research & medicine
Knowledge graph construction for research & medicineKnowledge graph construction for research & medicine
Knowledge graph construction for research & medicine
 
Managing Metadata for Science and Technology Studies: the RISIS case
Managing Metadata for Science and Technology Studies: the RISIS caseManaging Metadata for Science and Technology Studies: the RISIS case
Managing Metadata for Science and Technology Studies: the RISIS case
 
Cognitive data
Cognitive dataCognitive data
Cognitive data
 
Towards Knowledge Graph based Representation, Augmentation and Exploration of...
Towards Knowledge Graph based Representation, Augmentation and Exploration of...Towards Knowledge Graph based Representation, Augmentation and Exploration of...
Towards Knowledge Graph based Representation, Augmentation and Exploration of...
 
Describing Scholarly Contributions semantically with the Open Research Knowle...
Describing Scholarly Contributions semantically with the Open Research Knowle...Describing Scholarly Contributions semantically with the Open Research Knowle...
Describing Scholarly Contributions semantically with the Open Research Knowle...
 
Towards an Open Research Knowledge Graph
Towards an Open Research Knowledge GraphTowards an Open Research Knowledge Graph
Towards an Open Research Knowledge Graph
 
The State of Linked Government Data
The State of Linked Government DataThe State of Linked Government Data
The State of Linked Government Data
 

Similar to Data Communities - reusable data in and outside your organization.

Managing 'Big Data' in the social sciences: the contribution of an analytico-...
Managing 'Big Data' in the social sciences: the contribution of an analytico-...Managing 'Big Data' in the social sciences: the contribution of an analytico-...
Managing 'Big Data' in the social sciences: the contribution of an analytico-...
CILIP MDG
 
NC3Rs Publication Bias workshop - Sansone - Better Data = Better Science
NC3Rs Publication Bias workshop - Sansone - Better Data = Better ScienceNC3Rs Publication Bias workshop - Sansone - Better Data = Better Science
NC3Rs Publication Bias workshop - Sansone - Better Data = Better Science
Susanna-Assunta Sansone
 
Full Erdmann Ruttenberg Community Approaches to Open Data at Scale
Full Erdmann Ruttenberg Community Approaches to Open Data at ScaleFull Erdmann Ruttenberg Community Approaches to Open Data at Scale
Full Erdmann Ruttenberg Community Approaches to Open Data at Scale
National Information Standards Organization (NISO)
 
Data curation issues for repositories
Data curation issues for repositoriesData curation issues for repositories
Data curation issues for repositories
Chris Rusbridge
 
Emerging Data Citation Infrastructure
Emerging Data Citation InfrastructureEmerging Data Citation Infrastructure
Emerging Data Citation Infrastructure
Micah Altman
 
NPG Scientific Data - Metabolomics Society meeting, Tsuruola, Japan, 2014
NPG Scientific Data - Metabolomics Society meeting, Tsuruola, Japan, 2014NPG Scientific Data - Metabolomics Society meeting, Tsuruola, Japan, 2014
NPG Scientific Data - Metabolomics Society meeting, Tsuruola, Japan, 2014Susanna-Assunta Sansone
 
Share and analyze geonomic data at scale by Andy Petrella and Xavier Tordoir
Share and analyze geonomic data at scale by Andy Petrella and Xavier TordoirShare and analyze geonomic data at scale by Andy Petrella and Xavier Tordoir
Share and analyze geonomic data at scale by Andy Petrella and Xavier Tordoir
Spark Summit
 
Oxford DTP - Sansone - Data publications and Scientific Data - Dec 2014
Oxford DTP - Sansone - Data publications and Scientific Data - Dec 2014Oxford DTP - Sansone - Data publications and Scientific Data - Dec 2014
Oxford DTP - Sansone - Data publications and Scientific Data - Dec 2014
Susanna-Assunta Sansone
 
Dats nih-dccpc-kc7-april2018-prs-uoxf
Dats  nih-dccpc-kc7-april2018-prs-uoxfDats  nih-dccpc-kc7-april2018-prs-uoxf
Dats nih-dccpc-kc7-april2018-prs-uoxf
Philippe Rocca-Serra
 
Research Objects for FAIRer Science
Research Objects for FAIRer Science Research Objects for FAIRer Science
Research Objects for FAIRer Science
Carole Goble
 
Introducing the Whole Tale Project: Merging Science and Cyberinfrastructure P...
Introducing the Whole Tale Project: Merging Science and Cyberinfrastructure P...Introducing the Whole Tale Project: Merging Science and Cyberinfrastructure P...
Introducing the Whole Tale Project: Merging Science and Cyberinfrastructure P...
Bertram Ludäscher
 
The web of data: how are we doing so far
The web of data: how are we doing so farThe web of data: how are we doing so far
The web of data: how are we doing so far
Elena Simperl
 
Big Data for Library Services (2017)
Big Data for Library Services (2017)Big Data for Library Services (2017)
Big Data for Library Services (2017)
Albert Anthony Gavino, MBA
 
Introduction to research data management
Introduction to research data managementIntroduction to research data management
Introduction to research data management
dri_ireland
 
Guy avoiding-dat apocalypse
Guy avoiding-dat apocalypseGuy avoiding-dat apocalypse
Guy avoiding-dat apocalypse
ENUG
 
Converged IT and Data Commons
Converged IT and Data CommonsConverged IT and Data Commons
Converged IT and Data Commons
Simon Twigger
 
A Clean Slate?
A Clean Slate?A Clean Slate?
A Clean Slate?
Herbert Van de Sompel
 
Data discovery and sharing at UCLH
Data discovery and sharing at UCLHData discovery and sharing at UCLH
Data discovery and sharing at UCLH
Jisc
 
SciDataCon 2014 Data Papers and their applications workshop - NPG Scientific ...
SciDataCon 2014 Data Papers and their applications workshop - NPG Scientific ...SciDataCon 2014 Data Papers and their applications workshop - NPG Scientific ...
SciDataCon 2014 Data Papers and their applications workshop - NPG Scientific ...
Susanna-Assunta Sansone
 
ODIN Final Event - The Care and Feeding of Scientific Data
ODIN Final Event - The Care and Feeding of Scientific DataODIN Final Event - The Care and Feeding of Scientific Data
ODIN Final Event - The Care and Feeding of Scientific Data
datacite
 

Similar to Data Communities - reusable data in and outside your organization. (20)

Managing 'Big Data' in the social sciences: the contribution of an analytico-...
Managing 'Big Data' in the social sciences: the contribution of an analytico-...Managing 'Big Data' in the social sciences: the contribution of an analytico-...
Managing 'Big Data' in the social sciences: the contribution of an analytico-...
 
NC3Rs Publication Bias workshop - Sansone - Better Data = Better Science
NC3Rs Publication Bias workshop - Sansone - Better Data = Better ScienceNC3Rs Publication Bias workshop - Sansone - Better Data = Better Science
NC3Rs Publication Bias workshop - Sansone - Better Data = Better Science
 
Full Erdmann Ruttenberg Community Approaches to Open Data at Scale
Full Erdmann Ruttenberg Community Approaches to Open Data at ScaleFull Erdmann Ruttenberg Community Approaches to Open Data at Scale
Full Erdmann Ruttenberg Community Approaches to Open Data at Scale
 
Data curation issues for repositories
Data curation issues for repositoriesData curation issues for repositories
Data curation issues for repositories
 
Emerging Data Citation Infrastructure
Emerging Data Citation InfrastructureEmerging Data Citation Infrastructure
Emerging Data Citation Infrastructure
 
NPG Scientific Data - Metabolomics Society meeting, Tsuruola, Japan, 2014
NPG Scientific Data - Metabolomics Society meeting, Tsuruola, Japan, 2014NPG Scientific Data - Metabolomics Society meeting, Tsuruola, Japan, 2014
NPG Scientific Data - Metabolomics Society meeting, Tsuruola, Japan, 2014
 
Share and analyze geonomic data at scale by Andy Petrella and Xavier Tordoir
Share and analyze geonomic data at scale by Andy Petrella and Xavier TordoirShare and analyze geonomic data at scale by Andy Petrella and Xavier Tordoir
Share and analyze geonomic data at scale by Andy Petrella and Xavier Tordoir
 
Oxford DTP - Sansone - Data publications and Scientific Data - Dec 2014
Oxford DTP - Sansone - Data publications and Scientific Data - Dec 2014Oxford DTP - Sansone - Data publications and Scientific Data - Dec 2014
Oxford DTP - Sansone - Data publications and Scientific Data - Dec 2014
 
Dats nih-dccpc-kc7-april2018-prs-uoxf
Dats  nih-dccpc-kc7-april2018-prs-uoxfDats  nih-dccpc-kc7-april2018-prs-uoxf
Dats nih-dccpc-kc7-april2018-prs-uoxf
 
Research Objects for FAIRer Science
Research Objects for FAIRer Science Research Objects for FAIRer Science
Research Objects for FAIRer Science
 
Introducing the Whole Tale Project: Merging Science and Cyberinfrastructure P...
Introducing the Whole Tale Project: Merging Science and Cyberinfrastructure P...Introducing the Whole Tale Project: Merging Science and Cyberinfrastructure P...
Introducing the Whole Tale Project: Merging Science and Cyberinfrastructure P...
 
The web of data: how are we doing so far
The web of data: how are we doing so farThe web of data: how are we doing so far
The web of data: how are we doing so far
 
Big Data for Library Services (2017)
Big Data for Library Services (2017)Big Data for Library Services (2017)
Big Data for Library Services (2017)
 
Introduction to research data management
Introduction to research data managementIntroduction to research data management
Introduction to research data management
 
Guy avoiding-dat apocalypse
Guy avoiding-dat apocalypseGuy avoiding-dat apocalypse
Guy avoiding-dat apocalypse
 
Converged IT and Data Commons
Converged IT and Data CommonsConverged IT and Data Commons
Converged IT and Data Commons
 
A Clean Slate?
A Clean Slate?A Clean Slate?
A Clean Slate?
 
Data discovery and sharing at UCLH
Data discovery and sharing at UCLHData discovery and sharing at UCLH
Data discovery and sharing at UCLH
 
SciDataCon 2014 Data Papers and their applications workshop - NPG Scientific ...
SciDataCon 2014 Data Papers and their applications workshop - NPG Scientific ...SciDataCon 2014 Data Papers and their applications workshop - NPG Scientific ...
SciDataCon 2014 Data Papers and their applications workshop - NPG Scientific ...
 
ODIN Final Event - The Care and Feeding of Scientific Data
ODIN Final Event - The Care and Feeding of Scientific DataODIN Final Event - The Care and Feeding of Scientific Data
ODIN Final Event - The Care and Feeding of Scientific Data
 

More from Paul Groth

To Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMsTo Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
Paul Groth
 
Data Curation and Debugging for Data Centric AI
Data Curation and Debugging for Data Centric AIData Curation and Debugging for Data Centric AI
Data Curation and Debugging for Data Centric AI
Paul Groth
 
Elsevier’s Healthcare Knowledge Graph
Elsevier’s Healthcare Knowledge GraphElsevier’s Healthcare Knowledge Graph
Elsevier’s Healthcare Knowledge Graph
Paul Groth
 
Diversity and Depth: Implementing AI across many long tail domains
Diversity and Depth: Implementing AI across many long tail domainsDiversity and Depth: Implementing AI across many long tail domains
Diversity and Depth: Implementing AI across many long tail domains
Paul Groth
 
Progressive Provenance Capture Through Re-computation
Progressive Provenance Capture Through Re-computationProgressive Provenance Capture Through Re-computation
Progressive Provenance Capture Through Re-computation
Paul Groth
 
Are we finally ready for transclusion?*
Are we finally ready for transclusion?*Are we finally ready for transclusion?*
Are we finally ready for transclusion?*
Paul Groth
 
Structured Data & the Future of Educational Material
Structured Data & the Future of Educational MaterialStructured Data & the Future of Educational Material
Structured Data & the Future of Educational Material
Paul Groth
 
Research Data Sharing: A Basic Framework
Research Data Sharing: A Basic FrameworkResearch Data Sharing: A Basic Framework
Research Data Sharing: A Basic Framework
Paul Groth
 
Data for Science: How Elsevier is using data science to empower researchers
Data for Science: How Elsevier is using data science to empower researchersData for Science: How Elsevier is using data science to empower researchers
Data for Science: How Elsevier is using data science to empower researchers
Paul Groth
 
Tradeoffs in Automatic Provenance Capture
Tradeoffs in Automatic Provenance CaptureTradeoffs in Automatic Provenance Capture
Tradeoffs in Automatic Provenance Capture
Paul Groth
 
Knowledge Graph Construction and the Role of DBPedia
Knowledge Graph Construction and the Role of DBPediaKnowledge Graph Construction and the Role of DBPedia
Knowledge Graph Construction and the Role of DBPedia
Paul Groth
 
Information architecture at Elsevier
Information architecture at ElsevierInformation architecture at Elsevier
Information architecture at Elsevier
Paul Groth
 


Data Communities - reusable data in and outside your organization.

  • 1. Data Communities Reusable data in and outside your organization Prof. Paul Groth | @pgroth | pgroth.com | indelab.org Thanks to Dr. Kathleen Gregory, Dr. Laura Koesten, Prof. Elena Simperl, Dr. Pavlos Vougiouklis, Dr. Andrea Scharnhorst, Prof. Sally Wyatt ConTech Forum 2021 June 15, 2021
  • 2. Prof. Elena Simperl (King’s College London); Dr. Laura Koesten (King’s College London / University of Vienna); Dr. Kathleen Gregory (KNAW DANS); Prof. Sally Wyatt (Maastricht University); Dr. Andrea Scharnhorst (KNAW DANS); Dr. Pavlos Vougiouklis (Huawei). We investigate intelligent systems that support people in their work with data and information from diverse sources. In this area, we perform applied and fundamental research informed by empirical insights into data science practice. Current topics: • Automated Knowledge Base Construction • Data Search + Data Provenance • Data Management for Machine Learning • Causality for machine learning on messy data indelab.org Thanks to my collaborators on this work in HCI, social science, humanities
  • 3. Data is everywhere in your organization Supervision Sources / Signals • Knowledge or entity graphs: e.g. databases of facts about the target domain. • Aggregate statistics: e.g. tracked metrics about the target domain. • Heuristics and rules: e.g. existing human-authored rules about the target domain. • Topic models, taggers, and classifiers: e.g. machine learning models about the target domain or a related domain. https://ai.googleblog.com/2019/03/harnessing-organizational-knowledge-for.html
  • 4. What should we do as data providers to enable data reuse?
  • 5. Lots of good advice
  • 6. Lots of good advice • Maybe a bit too much…. • Currently, 140 policies on fairsharing.org as of April 5, 2021 • We reviewed 40 papers • Cataloged 39 different features of datasets that enable data reuse
  • 7. Where should a data provider start? • Lots of good advice! • It would be great to do all these things • But it’s all a bit overwhelming • Can we help prioritize?
  • 8. Getting some data • Used GitHub as a case study • ~1.4 million datasets (e.g. CSV, Excel) from ~65K repos • Use engagement metrics as proxies for data reuse • Map literature features to both dataset and repository features • Train a predictive model to see which features are good predictors
  • 9. Dataset Features Missing values Size Columns + Rows Readme features Issue features Age Description Parsable
  • 10. Where to start? • Some ideas from this study if you’re publishing data with GitHub • provide an informative short textual summary of the dataset • provide a comprehensive README file in a structured form and links to further information • datasets should not exceed standard processable file sizes • datasets should be possible to open with a standard configuration of a common library (such as Pandas) Trained a Recurrent Neural Network. There might be better models, but it is useful for handling text. Not the greatest predictor (good for classifying not reuse) but still useful for helping us tease out features
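The dataset features listed on slide 9 (size, columns and rows, missing values, parsability) can be approximated with a few lines of code. The following is a minimal sketch, not the feature extraction used in the study: the function name `dataset_features`, the list of missing-value tokens, and the rule that "parsable" means every row matches the header width are all assumptions made for illustration. It uses Python's standard `csv` module as a stand-in for the "standard configuration of a common library" that the slide mentions.

```python
import csv
import io

# Assumed spellings of "missing"; real datasets may use other conventions.
MISSING_TOKENS = {"", "na", "n/a", "nan", "null", "none"}


def dataset_features(text: str) -> dict:
    """Compute a few simple, reuse-related features for CSV text."""
    features = {"size_bytes": len(text.encode("utf-8"))}
    try:
        rows = list(csv.reader(io.StringIO(text)))
    except csv.Error:
        features["parsable"] = False
        return features
    if not rows:
        features["parsable"] = False
        return features

    header, body = rows[0], rows[1:]
    # Assumption: a file counts as parsable only if every row has the
    # same number of fields as the header.
    features["parsable"] = all(len(row) == len(header) for row in body)
    features["columns"] = len(header)
    features["rows"] = len(body)

    cells = [cell for row in body for cell in row]
    missing = sum(1 for cell in cells if cell.strip().lower() in MISSING_TOKENS)
    features["missing_ratio"] = missing / len(cells) if cells else 0.0
    return features


sample = "country,cases,deaths\nA,10,1\nB,,0\nC,5,NA\n"
print(dataset_features(sample))
```

A data provider could run a check like this before publishing, to confirm that the file loads with default settings and to report the share of empty cells in the README.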
  • 13. How would you make sense of this data? Koesten, L., Gregory, K., Groth, P., & Simperl, E. (2021). Talking datasets – Understanding data sensemaking behaviours. International Journal of Human-Computer Studies, 146, 102562. https://doi.org/10.1016/j.ijhcs.2020.102562
  • 14. Patterns of data-centric sense making • 31 research “data people” • Brought their own data • Presented with unknown data • Think-aloud protocol • Talked about their own data and then the given data • Interview transcripts + screen captures
  • 16. Engaging with data (known vs. unknown data) • Acronyms and abbreviations. Known: “That is a classic abbreviation in the field of hepatic surgery. AFP is alpha feto protein. It is a marker. It’s very well known by everybody...the AFP score is a criterion for liver transplantation. (P22)” Unknown: “I’m not sure what ‘long’ means. I wonder if it’s not something to do with longevity. On the other hand, no, it’s got negative numbers. I can’t make sense of this. (P7)” • Identifying strange things. Known: “Although we’ve tried really hard, because we’ve put in a coding frame and how we manipulate all the data, I’m sure that there are things in there which we haven’t recorded in terms of, well, what exactly does this mean? I hope we’ve covered it all but I’m sure we haven’t. (P10)” Unknown: “Now that sounds quite high for the Falklands. I wouldn’t have thought the population was all that great...and yet it’s only one confirmed case. Okay [laughs]. So yes...one might need to actually examine that a little bit more carefully, because the population of the Falklands doesn’t reach a million, so therefore you end up with this huge number of deaths per million population [laughs], but only one case and one death. (P23)”
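The "strange thing" that P23 notices is a scaling effect: when a population is below one million, a single death is scaled up to a large per-million rate. A small worked sketch of that arithmetic follows. The population figure is a hypothetical placeholder rather than a value from the dataset shown to participants, and the helper names and the one-million threshold are assumptions.

```python
def deaths_per_million(deaths: int, population: int) -> float:
    """Scale a raw death count to a rate per one million inhabitants."""
    return deaths / population * 1_000_000


def is_fragile_rate(population: int, threshold: int = 1_000_000) -> bool:
    """Flag rates whose denominator is smaller than the scaling unit."""
    return population < threshold


small_population = 3_500  # hypothetical figure, for illustration only
rate = deaths_per_million(1, small_population)
print(round(rate, 1), is_fragile_rate(small_population))
```

A provider could publish a flag like `is_fragile_rate` alongside the rate column, so that users see why one death yields a rate in the hundreds.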
  • 17. Placing data • P2: It’s listing the countries for which data are available, not sure if this is truly all countries we know of... • P8: It includes essentially every country in the world • P29: Global data • P30: I would like to know whether it’s complete...it says 212 rows representing countries, whether I have data from all countries or only from 25% or something because then it’s not really representative. • P7: If it was the whole country that was affected or not, affecting the northern part, the western, eastern, southern parts • P24: Was it sampled and then estimated for the whole country? Or is it the exact number of deaths that were got from hospitals and health agencies, for example? So is it a census or is it an estimate?
  • 18. Activity patterns during data sense making
  • 19. Recommendations ✅ for data providers • Help users understand shape • Provide information at the dataset level (e.g. summaries) ✅ • Column level summaries • Make it easier to pan and zoom • Use strange things as an entry point • Flag and highlight strange things ✅ • Provide explanations of abbreviations and missing values ✅ • Provide metrics or links to other information structures necessary for understanding the column’s content ✅ • Include links to basic concepts ✅ • Highlight relationships between columns or entities ✅ • Identify anchor variables that are considered most important ✅ • Help users place data • Embrace different levels of expertise and enable drill down • Link to standardized definitions ✅ • Connect to broader forms of documentation ✅
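Two of these recommendations, column-level summaries and flagging strange things, lend themselves to a short illustration. The sketch below is one possible approach under stated assumptions, not tooling from the talk: the function name `column_summaries`, the sample data, and the choice of "a numeric column contains negative values" as the flagged oddity are all hypothetical. The last is loosely inspired by P7's puzzlement over negative numbers in the `long` column.

```python
import csv
import io
from statistics import mean


def _to_float(value: str):
    """Return the value as a float, or None if it is not numeric."""
    try:
        return float(value)
    except ValueError:
        return None


def column_summaries(text: str) -> dict:
    """Summarise each column of CSV text: counts, missing cells, basic stats."""
    reader = csv.DictReader(io.StringIO(text))
    columns = {name: [] for name in reader.fieldnames or []}
    for row in reader:
        for name in columns:
            columns[name].append((row.get(name) or "").strip())

    summaries = {}
    for name, values in columns.items():
        present = [v for v in values if v != ""]
        numbers = [_to_float(v) for v in present]
        summary = {
            "count": len(values),
            "missing": len(values) - len(present),
            "distinct": len(set(present)),
        }
        # Only report statistics when every non-empty cell is numeric.
        if present and all(n is not None for n in numbers):
            summary.update(min=min(numbers), max=max(numbers), mean=mean(numbers))
            # Assumed heuristic: negative values are worth explaining to users.
            summary["has_negative"] = min(numbers) < 0
        summaries[name] = summary
    return summaries


sample = "country,long,deaths\nA,-59.5,1\nB,4.9,\nC,16.4,3\n"
for column, summary in column_summaries(sample).items():
    print(column, summary)
```

Publishing such a summary next to the data gives users the shape of each column, and a flag like `has_negative` gives them a place to start asking what an abbreviated header means.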
  • 20. Data is Social Do you want a data community? Gregory, K., Groth, P. Scharnhorst, A., Wyatt, S. (2020). Lost or found? Discovering data needed for research. Harvard Data Science Review. https://doi.org/10.1162/99608f92.e38165eb
  • 21. Conclusion • For data platforms • Think about ways of measuring data reuse • Tooling for summaries and overviews of data • Automated linking to information for sense making • For data providers • Simple steps • Focus on making it easy to “get to know” your data. • Easy to load and explore (e.g. in pandas, Excel, community tool) • Links to more information • Are you trying to be part of, or to build, a data community? • We still need a lot more work on data practices and methods informed by practices Paul Groth | @pgroth | pgroth.com | indelab.org
  • 23. Enable access (Feature: Description [References]) Access • License: (1) available, (2) allows reuse [W3C 3,22,45–47] • Format/machine readability: (1) consistent format, (2) single value type per column, (3) human as well as machine readable and non-proprietary format, (4) different formats available [W3C2,22,48–50] • Code available: for cleaning, analysis, visualizations [51–53] • Unique identifier: PID for the dataset/IDs within the dataset [W3C2,53] • Download link/API: (1) available, (2) functioning [W3C47,50]
  • 24. Document (Feature: Description [References]) Documentation: Methodological Choices • Methodology: description of experimental setup (sampling, tools, etc.), link to publication or project [3,13,54,60,63,66] • Units and reference systems: (1) defined, (2) consistently used [54,67] • Representativeness/Population: in relation to a total population [21,60] • Caveats: changes: classification/seasonal or special event/sample size/coverage/rounding [48,54] • Cleaning/pre-processing: (1) cleaning choices described, (2) are the raw data available? [3,13,21,68] • Biases/limitations: different types of bias (i.e., sampling bias) [21,49,69] • Data management: (1) mode of storage, (2) duration of storage [3,70,71] Documentation: Quality • Missing values/null values: (1) defined what they mean, (2) ratio of empty cells [W3C22,48,49,59,60] • Margin of error/reliability/quality control procedures: (1) confidence intervals, (2) estimates versus actual measurements [54,65] • Formatting: (1) consistent data type per column, (2) consistent date format [W3C41,65] • Outliers: are there data points that differ significantly from the rest [22] • Possible options/constraints on a variable: (1) value type, (2) if data contains an “other” category [W3C72] • Last update: information about data maintenance if applicable [21,62] Documentation: Summary Representations and Understandability • Description/README file: meaningful textual description (can also include text, code, images) [22,54,55] • Purpose: purpose of data collection, context of creation [3,21,49,56,57] • Summarizing statistics: (1) on dataset level, (2) on column level [22,49] • Visual representations: statistical properties of the dataset [22,58] • Headers understandable: (1) column-level documentation (e.g., abbreviations explained), (2) variable types, (3) how derived (e.g., categorization, such as labels or codes) [22,59,60] • Geographical scope: (1) defined, (2) level of granularity [45,54,61,62] • Temporal scope: (1) defined, (2) level of granularity [45,54,61,62] • Time of data collection: (1) when collected, (2) what time span [63–65]
  • 25. Situate (Feature: Description [References]) Connections • Relationships between variables defined: (1) explained in documentation, (2) formulae [21,22] • Cite sources: (1) links or citation, (2) indication of link quality [21] • Links to dataset being used elsewhere: i.e., in publications, community-led projects [21,59] • Contact: person or organization, mode of contact specified [W3C41,73] Provenance and Versioning • Publisher/producer/repository: (1) authoritativeness of source, (2) funding mechanisms/other interests that influenced data collection specified [21,49,54,59,74,75] • Version indicator: version or modification of dataset documented [W3C50,66,76] • Version history: workflow provenance [W3C50,76] • Prior reuse/advice on data reuse: (1) example projects, (2) access to discussions [3,27,59,60] Ethics • Ethical considerations, personal data: (1) data related to individually identifiable people, (2) if applicable, was consent given [21,57,71,75] Semantics • Schema/Syntax/Data Model: defined [W3C47,67] • Use of existing taxonomies/vocabularies: (1) documented, (2) link [W3C2]

Editor's Notes

  1. The majority of participants mentioned the overall topic or title as one of the first two attributes (n = 24); roughly half of participants mentioned the format or shape of the data (e.g. the number of columns, rows or observations) either first or second (n = 15).