Data Communities - reusable data in and outside your organization.

Data Communities
Reusable data in and outside your organization
Prof. Paul Groth | @pgroth | pgroth.com | indelab.org
Thanks to Dr. Kathleen Gregory, Dr. Laura Koesten, Prof. Elena
Simperl, Dr. Pavlos Vougiouklis, Dr. Andrea Scharnhorst, Prof. Sally Wyatt
ConTech Forum 2021
June 15, 2021

Prof. Elena Simperl
King’s College London
Dr. Laura Koesten
King’s College London /
University of Vienna
Dr. Kathleen Gregory
KNAW DANS
Prof. Sally Wyatt
Maastricht University
Dr. Andrea Scharnhorst
KNAW DANS
Dr. Pavlos Vougiouklis
Huawei
We investigate intelligent systems that support people in
their work with data and information from diverse sources.
In this area, we perform applied and fundamental research
informed by empirical insights into data science practice.
Current topics:
• Automated Knowledge Base Construction
• Data Search + Data Provenance
• Data Management for Machine Learning
• Causality for machine learning on messy data
indelab.org
Thanks to my
collaborators on this work in
HCI, social science, humanities

Data is everywhere in your organization
Supervision Sources / Signals
• Knowledge or entity graphs: e.g. databases of facts about the target
domain.
• Aggregate statistics: e.g. tracked metrics about the target domain.
• Heuristics and rules: e.g. existing human-authored rules about the target
domain.
• Topic models, taggers, and classifiers: e.g. machine learning models about
the target domain or a related domain.
https://ai.googleblog.com/2019/03/harnessing-organizational-knowledge-for.html

What should we do as data providers to enable data reuse?

Lots of good advice
• Maybe a bit too much…
• Currently, 140 policies on fairsharing.org as
of April 5, 2021
• We reviewed 40 papers
• Cataloged 39 different features of datasets
that enable data reuse

Where should a data provider start?
• Lots of good advice!
• It would be great to do all these things
• But it’s all a bit overwhelming
• Can we help prioritize?

Getting some data
• Used GitHub as a case study
• ~1.4 million datasets (e.g. CSV, Excel) from
~65K repos
• Use engagement metrics as proxies for data
reuse
• Map literature features to both dataset and
repository features
• Train a predictive model to see which
features are good predictors

Dataset Features
Missing values
Size
Columns + Rows
Readme features
Issue features
Age
Description
Parsable
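A few of the dataset features listed above can be computed directly from a CSV file. The following is a minimal sketch using only the Python standard library, not the feature extraction used in the study. The function name `dataset_features`, the set of tokens treated as missing, and the definition of "parsable" (the file reads without error and every row matches the header width) are all assumptions made for illustration.

```python
import csv
import io

# Assumption: tokens treated as missing values; adjust per dataset.
MISSING_TOKENS = {"", "na", "n/a", "null", "nan"}


def dataset_features(csv_text):
    """Compute size, shape, missing-value ratio and a parsability flag."""
    features = {"size_bytes": len(csv_text.encode("utf-8")), "parsable": False}
    try:
        # strict=True makes the reader raise csv.Error on malformed input.
        rows = [r for r in csv.reader(io.StringIO(csv_text), strict=True) if r]
    except csv.Error:
        return features
    if not rows:
        return features

    header, data = rows[0], rows[1:]
    cells = [cell for row in data for cell in row]
    missing = sum(1 for cell in cells if cell.strip().lower() in MISSING_TOKENS)
    features.update(
        # Hypothetical definition: every data row has as many fields as the header.
        parsable=all(len(row) == len(header) for row in data),
        columns=len(header),
        rows=len(data),
        missing_ratio=missing / len(cells) if cells else 0.0,
    )
    return features


sample = "country,cases,deaths\nA,10,1\nB,,0\nC,5,NA\n"
print(dataset_features(sample))
```

Readme, issue, age and description features would come from the repository rather than the file, so they are left out of this sketch.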

Where to start?
• Some ideas from this study if you’re publishing data with
GitHub
• provide an informative short textual summary of the
dataset
• provide a comprehensive README file in a
structured form and links to further information
• datasets should not exceed standard processable file
sizes
• datasets should be possible to open with a standard
configuration of a common library (such as Pandas)
Trained a Recurrent Neural Network. There might be better models, but it is
useful for handling text. Not the greatest predictor (good for classifying
non-reuse) but still useful for helping us tease out features
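The publishing ideas above could be turned into a simple pre-publication check. This is a rough sketch under stated assumptions: the function `publication_warnings`, the 100 MB size limit, the five-word minimum for a description, and the use of headings and links as signs of a "structured" README are all hypothetical choices, since the talk does not give concrete thresholds.

```python
# Assumption: illustrative thresholds only; the talk does not specify values.
MAX_BYTES = 100 * 1024 * 1024
MIN_DESCRIPTION_WORDS = 5


def publication_warnings(description, readme_text, data_size_bytes):
    """Return a list of warnings for a dataset about to be published."""
    warnings = []
    if len(description.split()) < MIN_DESCRIPTION_WORDS:
        warnings.append("description is missing or very short")
    # Hypothetical proxy for "structured form": at least one heading line.
    if not any(line.startswith("#") for line in readme_text.splitlines()):
        warnings.append("README has no section headings")
    if "http://" not in readme_text and "https://" not in readme_text:
        warnings.append("README has no links to further information")
    if data_size_bytes > MAX_BYTES:
        warnings.append("data file exceeds the size threshold")
    return warnings


readme = "# Survey data\n\n## Columns\nSee https://example.org/codebook\n"
print(publication_warnings("Annual survey of library usage by region",
                           readme, 2_000_000))
```

Whether the file opens with a standard configuration of a common library is best checked by actually loading it, which this sketch does not attempt.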

How would you make sense of this data?
Koesten, L., Gregory, K., Groth, P., & Simperl, E. (2021). Talking datasets –
Understanding data sensemaking behaviours. International Journal of
Human-Computer Studies, 146, 102562. https://doi.org/10.1016/j.ijhcs.2020.102562

Patterns of data-centric sense making
• 31 research “data people”
• Brought their own data
• Presented with unknown data
• Think-aloud
• Talked about their own data and then the given data
• Interview transcripts + screen captures

Engaging with data
Acronyms and abbreviations
• Known data: “That is a classic abbreviation in the field of hepatic
surgery. AFP is alpha feto protein. It is a marker. It’s very well known by
everybody...the AFP score is a criterion for liver transplantation.” (P22)
• Unknown data: “I’m not sure what ‘long’ means. I wonder if it’s not
something to do with longevity. On the other hand, no, it’s got negative
numbers. I can’t make sense of this.” (P7)
Identifying strange things
• Known data: “Although we’ve tried really hard, because we’ve put in a
coding frame and how we manipulate all the data, I’m sure that there are
things in there which we haven’t recorded in terms of, well, what exactly
does this mean? I hope we’ve covered it all but I’m sure we haven’t.” (P10)
• Unknown data: “Now that sounds quite high for the Falklands. I wouldn’t
have thought the population was all that great...and yet it’s only one
confirmed case. Okay [laughs]. So yes...one might need to actually examine
that a little bit more carefully, because the population of the Falklands
doesn’t reach a million, so therefore you end up with this huge number of
deaths per million population [laughs], but only one case and one death.”
(P23)

Placing data
• P2: It’s listing the countries for which data are available, not sure if
this is truly all countries we know of...
• P8: It includes essentially every country in the world
• P29: Global data
• P30: I would like to know whether it’s complete...it says 212 rows
representing countries, whether I have data from all countries or
only from 25% or something because then it’s not really
representative.
• P7: If it was the whole country that was affected or not, affecting the
northern part, the western, eastern, southern parts
• P24: Was it sampled and then estimated for the whole country? Or
is it the exact number of deaths that were got from hospitals and
health agencies, for example? So is it a census or is it an estimate?

Activity patterns during data sense making

Recommendations
✅ for data providers
• Help users understand shape
• Provide information at the dataset level (e.g. summaries) ✅
• Column level summaries
• Make it easier to pan and zoom
• Use strange things as an entry point
• Flag and highlight strange things ✅
• Provide explanations of abbreviations and missing values ✅
• Provide metrics or links to other information structures necessary for
understanding the column’s content ✅
• Include links to basic concepts ✅
• Highlight relationships between columns or entities ✅
• Identify anchor variables that are considered most important ✅
• Help users place data
• Embrace different levels of expertise and enable drill down
• Link to standardized definitions ✅
• Connect to broader forms of documentation ✅
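Two of these recommendations, column-level summaries and flagging strange things, lend themselves to a small sketch. The code below is one possible approach, not the tooling used in the study: `summarize_column` and `flag_strange` are hypothetical helpers, the missing-value tokens are assumptions, and the outlier rule is a modified z-score based on the median absolute deviation with the commonly used cutoff of 3.5. The example values are made up.

```python
import statistics


def summarize_column(name, values, missing_tokens=("", "NA")):
    """Summarize one column given its raw string values."""
    present = [v for v in values if v.strip() not in missing_tokens]
    summary = {
        "column": name,
        "count": len(values),
        "missing": len(values) - len(present),
        "distinct": len(set(present)),
    }
    try:
        numbers = [float(v) for v in present]
    except ValueError:
        return summary  # non-numeric column: no numeric statistics
    if numbers:
        summary.update(min=min(numbers), max=max(numbers),
                       median=statistics.median(numbers))
    return summary


def flag_strange(values, threshold=3.5):
    """Flag values whose MAD-based modified z-score exceeds the threshold."""
    median = statistics.median(values)
    mad = statistics.median([abs(v - median) for v in values])
    if mad == 0:
        return []  # no spread to compare against
    return [v for v in values if 0.6745 * abs(v - median) / mad > threshold]


# Made-up deaths-per-million values; one is far above the rest.
rates = [12.0, 30.0, 25.0, 18.0, 287.0, 22.0]
print(summarize_column("cases", ["10", "", "5", "NA"]))
print(flag_strange(rates))
```

A flagged value is only an entry point for a closer look. As the Falklands quote shows, a very high rate can simply reflect a small population.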

Data is Social
Do you want a data community?
Gregory, K., Groth, P., Scharnhorst, A., & Wyatt, S. (2020). Lost
or found? Discovering data needed for research. Harvard Data
Science Review. https://doi.org/10.1162/99608f92.e38165eb

Conclusion
• For data platforms
• Think about ways of measuring data reuse
• Tooling for summaries and overviews of data
• Automated linking to information for sense making
• For data providers
• Simple steps
• Focus on making it easy to “get to know” your data.
• Easy to load and explore (e.g. in pandas, Excel, community tool)
• Links to more information
• Are you trying to be part of, or to build, a data community?
• We still need a lot more work on data practices and methods informed by
practices
Paul Groth | @pgroth | pgroth.com | indelab.org

Enable access
Feature | Description | References
Access
License | (1) available, (2) allows reuse | W3C 3,22,45–47
Format/machine readability | (1) consistent format, (2) single value type per column, (3) human as well as machine readable and non-proprietary format, (4) different formats available | W3C2,22,48–50
Code available | for cleaning, analysis, visualizations | 51–53
Unique identifier | PID for the dataset/ID's within the dataset | W3C2,53
Download link/API | (1) available, (2) functioning | W3C47,50

Document
Documentation: Methodological Choices
Methodology | description of experimental setup (sampling, tools, etc.), link to publication or project | 3,13,54,60,63,66
Units and reference systems | (1) defined, (2) consistently used | 54,67
Representativeness/Population | in relation to a total population | 21,60
Caveats | changes: classification/seasonal or special event/sample size/coverage/rounding | 48,54
Cleaning/pre-processing | (1) cleaning choices described, (2) are the raw data available? | 3,13,21,68
Biases/limitations | different types of bias (i.e., sampling bias) | 21,49,69
Data management | (1) mode of storage, (2) duration of storage | 3,70,71
Documentation: Quality
Missing values/null values | (1) defined what they mean, (2) ratio of empty cells | W3C22,48,49,59,60
Margin of error/reliability/quality control procedures | (1) confidence intervals, (2) estimates versus actual measurements | 54,65
Formatting | (1) consistent data type per column, (2) consistent date format | W3C41,65
Outliers | are there data points that differ significantly from the rest | 22
Possible options/constraints on a variable | (1) value type, (2) if data contains an “other” category | W3C72
Last update | information about data maintenance if applicable | 21,62
Documentation: Summary Representations and Understandability
Description/README file | meaningful textual description (can also include text, code, images) | 22,54,55
Purpose | purpose of data collection, context of creation | 3,21,49,56,57
Summarizing statistics | (1) on dataset level, (2) on column level | 22,49
Visual representations | statistical properties of the dataset | 22,58
Headers understandable | (1) column-level documentation (e.g., abbreviations explained), (2) variable types, (3) how derived (e.g., categorization, such as labels or codes) | 22,59,60
Geographical scope | (1) defined, (2) level of granularity | 45,54,61,62
Temporal scope | (1) defined, (2) level of granularity | 45,54,61,62
Time of data collection | (1) when collected, (2) what time span | 63–65

Situate
Connections
Relationships between variables defined | (1) explained in documentation, (2) formulae | 21,22
Cite sources | (1) links or citation, (2) indication of link quality | 21
Links to dataset being used elsewhere | i.e., in publications, community-led projects | 21,59
Contact | person or organization, mode of contact specified | W3C41,73
Provenance and Versioning
Publisher/producer/repository | (1) authoritativeness of source, (2) funding mechanisms/other interests that influenced data collection specified | 21,49,54,59,74,75
Version indicator | version or modification of dataset documented | W3C50,66,76
Version history | workflow provenance | W3C50,76
Prior reuse/advice on data reuse | (1) example projects, (2) access to discussions | 3,27,59,60
Ethics
Ethical considerations, personal data | (1) data related to individually identifiable people, (2) if applicable, was consent given | 21,57,71,75
Semantics
Schema/Syntax/Data Model | defined | W3C47,67
Use of existing taxonomies/vocabularies | (1) documented, (2) link | W3C2
