Minimal viable-datareuse-czi

The literature contains a myriad of recommendations, advice, and strictures about what data providers should do to facilitate data reuse. It can be overwhelming. Based on recent empirical work (analyzing data reuse proxies at scale, understanding data sensemaking and looking at how researchers search for data), I talk about what practices are a good place to start for helping others to reuse your data.

  1. Minimal Viable Data Reuse
     Prof. Paul Groth | @pgroth | pgroth.com | indelab.org
     Thanks to Dr. Kathleen Gregory, Dr. Laura Koesten, Prof. Elena Simperl, Dr. Pavlos Vougiouklis, Dr. Andrea Scharnhorst, Prof. Sally Wyatt
     CZI Seed Networks Computational Biology, April 6, 2021
  2. • Prof. Elena Simperl, King’s College London
     • Dr. Laura Koesten, King’s College London / University of Vienna
     • Dr. Kathleen Gregory, KNAW DANS
     • Prof. Sally Wyatt, Maastricht University
     • Dr. Andrea Scharnhorst, KNAW DANS
     • Dr. Pavlos Vougiouklis, Huawei
     We investigate intelligent systems that support people in their work with data and information from diverse sources. In this area, we perform applied and fundamental research informed by empirical insights into data science practice.
     Current topics:
     • Automated Knowledge Base Construction
     • Data Search + Data Provenance
     • Data Management for Machine Learning
     • Causality for machine learning on messy data
     indelab.org
     Thanks to my collaborators on this work in HCI, social science, humanities
  3. What should we do as data providers to enable data reuse?
  4. Lots of good advice
  5. Lots of good advice
     • Maybe a bit too much...
     • 140 policies on fairsharing.org as of April 5, 2021
     • We reviewed 40 papers
     • Cataloged 39 different features of datasets that enable data reuse
  6. Enable access
     Feature | Description | References
     Access
     License | (1) available, (2) allows reuse | W3C 3,22,45–47
     Format/machine readability | (1) consistent format, (2) single value type per column, (3) human as well as machine readable and non-proprietary format, (4) different formats available | W3C 2,22,48–50
     Code available | for cleaning, analysis, visualizations | 51–53
     Unique identifier | PID for the dataset/IDs within the dataset | W3C 2,53
     Download link/API | (1) available, (2) functioning | W3C 47,50
  7. Document
     Feature | Description | References
     Documentation: Methodological Choices
     Methodology | description of experimental setup (sampling, tools, etc.), link to publication or project | 3,13,54,60,63,66
     Units and reference systems | (1) defined, (2) consistently used | 54,67
     Representativeness/Population | in relation to a total population | 21,60
     Caveats | changes: classification/seasonal or special event/sample size/coverage/rounding | 48,54
     Cleaning/pre-processing | (1) cleaning choices described, (2) are the raw data available? | 3,13,21,68
     Biases/limitations | different types of bias (i.e., sampling bias) | 21,49,69
     Data management | (1) mode of storage, (2) duration of storage | 3,70,71
     Documentation: Quality
     Missing values/null values | (1) defined what they mean, (2) ratio of empty cells | W3C 22,48,49,59,60
     Margin of error/reliability/quality control procedures | (1) confidence intervals, (2) estimates versus actual measurements | 54,65
     Formatting | (1) consistent data type per column, (2) consistent date format | W3C 41,65
     Outliers | are there data points that differ significantly from the rest | 22
     Possible options/constraints on a variable | (1) value type, (2) if data contains an “other” category | W3C 72
     Last update | information about data maintenance if applicable | 21,62
     Documentation: Summary Representations and Understandability
     Description/README file | meaningful textual description (can also include text, code, images) | 22,54,55
     Purpose | purpose of data collection, context of creation | 3,21,49,56,57
     Summarizing statistics | (1) on dataset level, (2) on column level | 22,49
     Visual representations | statistical properties of the dataset | 22,58
     Headers understandable | (1) column-level documentation (e.g., abbreviations explained), (2) variable types, (3) how derived (e.g., categorization, such as labels or codes) | 22,59,60
     Geographical scope | (1) defined, (2) level of granularity | 45,54,61,62
     Temporal scope | (1) defined, (2) level of granularity | 45,54,61,62
     Time of data collection | (1) when collected, (2) what time span | 63–65
  8. Situate
     Feature | Description | References
     Connections
     Relationships between variables defined | (1) explained in documentation, (2) formulae | 21,22
     Cite sources | (1) links or citation, (2) indication of link quality | 21
     Links to dataset being used elsewhere | i.e., in publications, community-led projects | 21,59
     Contact | person or organization, mode of contact specified | W3C 41,73
     Provenance and Versioning
     Publisher/producer/repository | (1) authoritativeness of source, (2) funding mechanisms/other interests that influenced data collection specified | 21,49,54,59,74,75
     Version indicator | version or modification of dataset documented | W3C 50,66,76
     Version history | workflow provenance | W3C 50,76
     Prior reuse/advice on data reuse | (1) example projects, (2) access to discussions | 3,27,59,60
     Ethics
     Ethical considerations, personal data | (1) data related to individually identifiable people, (2) if applicable, was consent given | 21,57,71,75
     Semantics
     Schema/Syntax/Data Model | defined | W3C 47,67
     Use of existing taxonomies/vocabularies | (1) documented, (2) link | W3C 2
  9. Where should a data provider start?
     • Lots of good advice!
     • It would be great to do all these things
     • But it’s all a bit overwhelming
     • Can we help prioritize?
  10. Getting some data
      • Used GitHub as a case study
      • ~1.4 million datasets (e.g. CSV, Excel) from ~65K repos
      • Use engagement metrics as proxies for data reuse
      • Map literature features to both dataset and repository features
      • Train a predictive model to see which features are good predictors
  11. Dataset Features
      • Missing values
      • Size
      • Columns + Rows
      • Readme features
      • Issue features
      • Age
      • Description
      • Parsable
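Several of the dataset features listed on slide 11 (missing values, size, columns and rows, parsable) can be computed directly from a file. The following is a minimal sketch for a single CSV file using only the Python standard library. It is not the feature extraction used in the study: the function name `dataset_features` and the set of tokens treated as missing are assumptions made for illustration.

```python
import csv
import os

# Assumed placeholders for missing values; real datasets should document theirs.
MISSING_TOKENS = {"", "na", "n/a", "nan", "null", "none"}


def dataset_features(path):
    """Return simple dataset-level features for one CSV file (hypothetical helper)."""
    features = {"size_bytes": None, "parsable": False,
                "rows": 0, "columns": 0, "missing_ratio": None}
    try:
        features["size_bytes"] = os.path.getsize(path)
        with open(path, newline="", encoding="utf-8") as handle:
            rows = list(csv.reader(handle))
    except (OSError, UnicodeDecodeError, csv.Error):
        return features  # unreadable file: leave "parsable" as False
    if not rows:
        return features  # empty file: nothing to count
    header, body = rows[0], rows[1:]
    features["parsable"] = True
    features["columns"] = len(header)
    features["rows"] = len(body)
    cells = [cell for row in body for cell in row]
    if cells:
        missing = sum(1 for cell in cells
                      if cell.strip().lower() in MISSING_TOKENS)
        features["missing_ratio"] = missing / len(cells)
    return features
```

Calling `dataset_features("data.csv")` returns a dictionary that could sit alongside repository-level features (README, issues, age, description) when comparing datasets.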
  12. Where to start?
      • Some ideas from this study if you’re publishing data with GitHub:
      • provide an informative short textual summary of the dataset
      • provide a comprehensive README file in a structured form and links to further information
      • datasets should not exceed standard processable file sizes
      • datasets should be possible to open with a standard configuration of a common library (such as Pandas)
      Trained a recurrent neural network. There might be better models, but it is useful for handling text. Not the greatest predictor (good at classifying datasets that are not reused), but still useful for helping us tease out features.
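The four suggestions on slide 12 can be turned into a quick self-check before publishing. The sketch below is one possible reading, not a tool from the study. The size limit and the minimum description length are hypothetical thresholds, the name `publication_checks` is invented here, and the standard-library `csv` reader stands in for "a common library (such as Pandas)" so the example has no dependencies.

```python
import csv
from pathlib import Path

MAX_BYTES = 50 * 1024 * 1024   # hypothetical "standard processable" size limit
MIN_DESCRIPTION_WORDS = 10     # hypothetical bar for an "informative" summary


def _opens_with_defaults(path):
    """True if the file parses with default csv settings and has regular rows."""
    try:
        with open(path, newline="", encoding="utf-8") as handle:
            rows = list(csv.reader(handle))
    except (OSError, UnicodeDecodeError, csv.Error):
        return False
    # Every row should have as many fields as the header row.
    return bool(rows) and all(len(row) == len(rows[0]) for row in rows)


def publication_checks(repo_dir, data_file, description):
    """Run the four slide-12 checks for one data file in a repository folder."""
    repo = Path(repo_dir)
    data_path = repo / data_file
    return {
        "has_description": len(description.split()) >= MIN_DESCRIPTION_WORDS,
        "has_readme": any(p.is_file() and p.name.lower().startswith("readme")
                          for p in repo.iterdir()),
        "size_ok": data_path.is_file()
                   and data_path.stat().st_size <= MAX_BYTES,
        "opens_with_defaults": _opens_with_defaults(data_path),
    }
```

A result with any `False` entry points at the first thing to fix. The thresholds are placeholders and should be replaced with whatever is sensible for the target community.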
  13. Understand your target users
  14. How would you make sense of this data?
      Koesten, L., Gregory, K., Groth, P., & Simperl, E. (2021). Talking datasets – Understanding data sensemaking behaviours. International Journal of Human-Computer Studies, 146, 102562. https://doi.org/10.1016/j.ijhcs.2020.102562
  15. Patterns of data-centric sense making
      • 31 research “data people”
      • Brought their own data
      • Presented with unknown data
      • Think-aloud
      • Talked about their own data and then the given data
      • Interview transcripts + screen captures
  16. Inspecting unknown data
  17. Engaging with data
      Acronyms and abbreviations
      • Known: “That is a classic abbreviation in the field of hepatic surgery. AFP is alpha feto protein. It is a marker. It’s very well known by everybody...the AFP score is a criterion for liver transplantation. (P22)”
      • Unknown: “I’m not sure what ‘long’ means. I wonder if it’s not something to do with longevity. On the other hand, no, it’s got negative numbers. I can’t make sense of this. (P7)”
      Identifying strange things
      • Known: “Although we’ve tried really hard, because we’ve put in a coding frame and how we manipulate all the data, I’m sure that there are things in there which we haven’t recorded in terms of, well, what exactly does this mean? I hope we’ve covered it all but I’m sure we haven’t. (P10)”
      • Unknown: “Now that sounds quite high for the Falklands. I wouldn’t have thought the population was all that great...and yet it’s only one confirmed case. Okay [laughs]. So yes...one might need to actually examine that a little bit more carefully, because the population of the Falklands doesn’t reach a million, so therefore you end up with this huge number of deaths per million population [laughs], but only one case and one death. (P23)”
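The arithmetic behind P23's observation on slide 17 is worth spelling out: a per-million rate divides by the population, so a single event in a very small population produces a very large rate. The sketch below uses a hypothetical population of 3,000 purely for illustration. It is not a figure from the talk or from the dataset the participants saw.

```python
def per_million(count, population):
    """Scale a raw count to a rate per one million people."""
    if population <= 0:
        raise ValueError("population must be positive")
    return count * 1_000_000 / population


# Hypothetical populations, chosen only to show the effect of the denominator.
small_territory = per_million(1, 3_000)        # about 333 per million
large_country = per_million(1, 60_000_000)     # about 0.017 per million
```

A column of per-million rates is therefore easier to interpret when the raw counts and the population used as the denominator are documented next to it.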
  18. Placing data
      • P2: It’s listing the countries for which data are available, not sure if this is truly all countries we know of...
      • P8: It includes essentially every country in the world
      • P29: Global data
      • P30: I would like to know whether it’s complete...it says 212 rows representing countries, whether I have data from all countries or only from 25% or something because then it’s not really representative.
      • P7: If it was the whole country that was affected or not, affecting the northern part, the western, eastern, southern parts
      • P24: Was it sampled and then estimated for the whole country? Or is it the exact number of deaths that were got from hospitals and health agencies, for example? So is it a census or is it an estimate?
  19. Activity patterns during data sense making
  20. Recommendations ✅ for data providers
      • Help users understand shape
      • Provide information at the dataset level (e.g. summaries) ✅
      • Column level summaries
      • Make it easier to pan and zoom
      • Use strange things as an entry point
      • Flag and highlight strange things ✅
      • Provide explanations of abbreviations and missing values ✅
      • Provide metrics or links to other information structures necessary for understanding the column’s content ✅
      • Include links to basic concepts ✅
      • Highlight relationships between columns or entities ✅
      • Identify anchor variables that are considered most important ✅
      • Help users place data
      • Embrace different levels of expertise and enable drill down
      • Link to standardized definitions ✅
      • Connect to broader forms of documentation ✅
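Two of the recommendations on slide 20, column-level summaries and flagging strange things, lend themselves to a small script that a provider could run before writing documentation. The sketch below is one possible approach under stated assumptions: the missing-value tokens are placeholders, the name `column_summaries` is invented here, and "strange" is narrowed to numeric values outside 1.5 times the interquartile range, which is a common rule of thumb and not a method taken from the talk.

```python
import csv
import io
import statistics

MISSING_TOKENS = {"", "NA"}  # assumed placeholders; document the real ones


def column_summaries(csv_text):
    """Summarise each column and flag numeric values far from the rest."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    summaries = {}
    for name in reader.fieldnames or []:
        raw = [(row.get(name) or "").strip() for row in rows]
        present = [value for value in raw if value not in MISSING_TOKENS]
        summary = {"missing": len(raw) - len(present),
                   "distinct": len(set(present)),
                   "flagged": []}
        try:
            numbers = [float(value) for value in present]
        except ValueError:
            numbers = []  # at least one non-numeric value: treat as text
        if len(numbers) >= 4:  # too few values make quartiles meaningless
            q1, _, q3 = statistics.quantiles(numbers, n=4)
            spread = 1.5 * (q3 - q1)
            summary["median"] = statistics.median(numbers)
            summary["flagged"] = [n for n in numbers
                                  if n < q1 - spread or n > q3 + spread]
        summaries[name] = summary
    return summaries
```

The flagged values are candidates for an explanatory note in the README, not errors to delete: as the sensemaking study suggests, users tend to use such values as an entry point into the data.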
  21. Data is Social
      Do you want a data community?
      Gregory, K., Groth, P., Scharnhorst, A., & Wyatt, S. (2020). Lost or found? Discovering data needed for research. Harvard Data Science Review. https://doi.org/10.1162/99608f92.e38165eb
  22. Conclusion
      • For data platforms
        • Think about ways of measuring data reuse
        • Tooling for summaries and overviews of data
        • Automated linking to information for sense making
      • For data providers
        • Simple steps
        • Focus on making it easy to “get to know” your data
        • Easy to load and explore (e.g. in pandas, Excel, community tool)
        • Links to more information
        • Are you trying to be part of, or to build, a data community?
      • We still need a lot more work on data practices and methods informed by practices
