
Data Curation and Debugging for Data Centric AI


It is increasingly recognized that data is a central challenge for AI systems, whether training an entirely new model, discovering data for a model, or applying an existing model to new data. Given this centrality of data, there is a need for new tools that help data teams create, curate, and debug datasets in the context of complex machine learning pipelines. In this talk, I outline the underlying challenges for data debugging and curation in these environments. I then discuss our recent research that both takes advantage of ML to improve datasets and uses core database techniques for debugging in such complex ML pipelines.

Presented at DBML 2022 at ICDE - https://www.wis.ewi.tudelft.nl/dbml2022


Data Curation and Debugging for Data Centric AI

  1. 1. Data Curation and Debugging for Data Centric AI Prof. Paul Groth | @pgroth | pgroth.com | indelab.org Thanks to Stefan Grafberger, Dr. Julia Stoyanovich, Dr. Sebastian Schelter, Dr. Laura Koesten, Prof. Elena Simperl, Dr. Pavlos Vougiouklis, Madelon Hulsebos, Dr. Çağatay Demiralp, Dr. Juan Sequeda, Prof. George Fletcher DBML - May 8, 2022
  2. 2. The making of data is important
  3. 3. Finding digital truth—that is, identifying and combining data that accurately represent reality—is becoming more difficult and more important. More difficult because data and their sources are multiplying. And more important because firms need to get their data house in order to benefit from AI, which they must to stay competitive. -- The Economist, February 2020
  4. 4. Data interoperability and quality, as well as their structure, authenticity and integrity are key for the exploitation of the data value, especially in the context of AI deployment -- European Commission, “A European strategy for data”, February 2020 (andrio/Shutterstock)
  5. 5. Source: http://veekaybee.github.io/2019/02/13/data-science-is-different/
  6. 6. Source: https://www.youtube.com/watch?v=06-AZXmwHjo
  7. 7. Bottlenecks • Manual • Difficulty in creating flexible reusable workflows • Lack of transparency. Paul Groth, "The Knowledge-Remixing Bottleneck," Intelligent Systems, IEEE, vol. 28, no. 5, pp. 44-48, Sept.-Oct. 2013, doi: 10.1109/MIS.2013.138. Paul Groth, "Transparency and Reliability in the Data Supply Chain," Internet Computing, IEEE, vol. 17, no. 2, pp. 69-71, March-April 2013, doi: 10.1109/MIC.2013.41.
  8. 8. Debugging. Credits: Sebastian Schelter (UvA), Julia Stoyanovich (NYU), Stefan Grafberger (UvA)
  9. 9. ML Pipelines in the Real World: the "last mile" of end-to-end ML runs from (1) integration and cleaning of data from heterogeneous datasources (relational operations: ⋈ σ π ⋈), through (2) feature encoding pipelines and data augmentation, to (3) model training and evaluation. Example code on the slide: make_pipeline([('encoding', ColumnTransformer([('num', StandardScaler, …), ('cat', OneHotEncoder, …)])), ('learner', KerasClassifier(…))])
  10. 10. ML Pipelines in the Real World (same pipeline diagram and code as slide 9), highlighting a first class of issues: Data Representation Bugs.
  11. 11. ML Pipelines in the Real World (same diagram), adding a second class of issues: Schema Violations & Missing Data.
  12. 12. ML Pipelines in the Real World (same diagram), adding a third class of issues: Unsound Experimentation.
  13. 13. The Way Forward • First approach: invent new programming languages + runtime systems to regain control (e.g. SystemDS) -> would require rewriting all existing code • Second approach: manually annotate and instrument existing code (mlflow) -> does not happen in practice • Our approach: retrofit inspection techniques into the existing DS landscape • Observation: declarative specification of preprocessing operations is present in some popular ML libraries: • Pandas mostly applies relational operations • Estimator / Transformer pipelines (scikit-learn / SparkML / Tensorflow Transform) offer a nestable and composable way to declaratively specify feature transformations
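To make the "declarative specification" point above concrete, here is a small self-contained sketch of an estimator/transformer pipeline in scikit-learn, in the spirit of the make_pipeline snippet on slides 9-12. The data, column names, and the LogisticRegression learner (standing in for the KerasClassifier on the slides) are illustrative assumptions, not taken from the talk.

```python
# Minimal declarative preprocessing + learning pipeline (sketch with made-up data).
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

data = pd.DataFrame({
    "age": [60, 60, 20, 60, 20, 20],
    "county": ["CountyA", "CountyA", "CountyA", "CountyB", "CountyB", "CountyB"],
    "smoke": ["Y", "N", "Y", "N", "Y", "N"],
})
labels = [1, 0, 1, 0, 1, 0]

pipeline = Pipeline([
    ("encoding", ColumnTransformer([
        ("num", StandardScaler(), ["age"]),            # numeric features
        ("cat", OneHotEncoder(), ["county", "smoke"]),  # categorical features
    ])),
    ("learner", LogisticRegression()),  # stand-in for the KerasClassifier on the slide
])

pipeline.fit(data, labels)
print(pipeline.predict(data))
```

Because every preprocessing step is declared as a named operator rather than hidden in ad-hoc loops, a tool can walk the pipeline structure and reason about what each step does, which is exactly what the retrofitted inspection approach relies on.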
  14. 14. Example: Can we find ways to automatically point data scientists to potentially problematic operations in the preprocessing code of their ML pipelines? Inspiration from software engineering, e.g. code inspection in modern IDEs.
  15. 15. Example 15
  16. 16. mlinspect • Library to instrument ML preprocessing code with custom inspections • Available on GitHub: https://github.com/stefan-grafberger/mlinspect • Works with "native" preprocessing pipelines (no annotation / manual instrumentation required) in pandas / sklearn • Representation of preprocessing operations based on a dataflow graph • Allows users to implement inspections as user-defined functions which are automatically applied to the inputs and outputs of certain operations • Allows for the propagation of annotations per record through the program. Grafberger, S., Groth, P., Stoyanovich, J., & Schelter, S. (2022). Data distribution debugging in machine learning pipelines. The VLDB Journal, 1-24.
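As a sketch of how such instrumentation is invoked, the snippet below follows the usage pattern shown in the repository's README. The class and method names (PipelineInspector, NoBiasIntroducedFor, RowLineage) are assumptions that may not match every release, and the pipeline file name is hypothetical.

```python
# Sketch of instrumenting an existing pipeline script with mlinspect.
# Names follow the project's README at the time of writing and may differ
# across versions -- treat them as assumptions, not the definitive API.
from mlinspect import PipelineInspector
from mlinspect.checks import NoBiasIntroducedFor
from mlinspect.inspections import RowLineage

inspector_result = (
    PipelineInspector
    .on_pipeline_from_py_file("healthcare_pipeline.py")      # hypothetical pipeline script
    .add_check(NoBiasIntroducedFor(["age_group", "race"]))    # histogram change detection
    .add_required_inspection(RowLineage(5))                   # lineage for the first rows
    .execute()
)

# The result exposes the extracted dataflow graph, per-operator inspection
# outputs, and the outcome of each check.
dag = inspector_result.dag
inspection_results = inspector_result.dag_node_to_inspection_results
check_results = inspector_result.check_to_check_results
```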
  17. 17. Example Inspections • Change detection for the proportions of protected groups: compute histograms of operator outputs. In the example, filtering with data = data[data.county == "CountyA"] shifts the age_group proportions from 50% vs 50% to 66% vs 33%. • Lineage tracking: generate identifier annotations for records and propagate them through operators. In the example, data = pd.merge(patient, cost, on="ssn") followed by data = data[["smoke", "cost"]] propagates the record identifiers [p1, c1] and [p3, c2] to the output rows.
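The first inspection above can be mimicked by hand with plain pandas to see the idea: compute group proportions before and after a selection and flag large shifts. This is a toy illustration of what mlinspect automates per operator; the data and the 10% threshold are made up.

```python
# Hand-rolled version of the "histogram change detection" inspection idea:
# compare group proportions before and after a filter (toy data).
import pandas as pd

data = pd.DataFrame({
    "age_group": [60, 60, 20, 60, 20, 20],
    "county": ["CountyA", "CountyA", "CountyA", "CountyB", "CountyB", "CountyB"],
})

before = data["age_group"].value_counts(normalize=True)
filtered = data[data["county"] == "CountyA"]   # the selection under inspection
after = filtered["age_group"].value_counts(normalize=True)

shift = (after - before).abs().max()
if shift > 0.1:  # arbitrary threshold, for illustration only
    print(f"Warning: group proportions shifted by up to {shift:.0%} after the filter")
print(before.to_dict(), "->", after.to_dict())
```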
  18. 18. Summary • mlinspect is a general runtime for ML pipeline analysis, available on GitHub: https://github.com/stefan-grafberger/mlinspect • Limitation: our approach relies on "declaratively" written ML pipelines, where we can identify the semantics of the operations • Enables many use cases like ArgusEyes, a CI tool: https://github.com/schelterlabs/arguseyes
  19. 19. Curation. Credits: Prof. Elena Simperl (King's College London), Dr. Laura Koesten (King's College London / University of Vienna), Dr. Pavlos Vougiouklis (Huawei), Madelon Hulsebos (UvA / Sigma Computing), Dr. Çağatay Demiralp (Sigma Computing)
  20. 20. What curation should data providers prioritise to facilitate reuse?
  21. 21. Lots of good advice. [Slide shows the first page of: Goodman A, Pepe A, Blocker AW, Borgman CL, Cranmer K, et al. (2014) Ten Simple Rules for the Care and Feeding of Scientific Data. PLoS Comput Biol 10(4): e1003542. doi:10.1371/journal.pcbi.1003542]
  22. 22. Lots of good advice • Maybe a bit too much… • Currently, 140 policies on fairsharing.org as of April 5, 2021 • We reviewed 40 papers • Cataloged 39 different features of datasets that enable data reuse. [Slide shows the first page of: Koesten L, Vougiouklis P, Simperl E, Groth P (2020) Dataset Reuse: Toward Translating Principles to Practice. Patterns 1, 100136. doi:10.1016/j.patter.2020.100136]
  23. 23. Where should a data provider start? • Lots of good advice! • It would be great to do all these things • But it’s all a bit overwhelming • Can we help prioritize?
  24. 24. Getting some data • Used GitHub as a case study • ~1.4 million datasets (e.g. CSV, Excel) from ~65K repos • Use engagement metrics as proxies for data reuse • Map literature features to both dataset and repository features • Train a predictive model to see which features are good predictors
  25. 25. Dataset Features: Missing values, Size, Columns + Rows, Readme features, Issue features, Age, Description, Parsable
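Several of these features can be computed mechanically for a single CSV file. The sketch below is illustrative only: the file path, the chosen features, and their exact definitions are assumptions, not the feature extraction used in the study.

```python
# Rough sketch of computing a handful of the dataset features listed above
# for one CSV file (hypothetical path and feature definitions).
import os
import pandas as pd

path = "data/example.csv"  # hypothetical file

features = {"size_bytes": os.path.getsize(path)}
try:
    # "parsable with a standard configuration of a common library"
    df = pd.read_csv(path)
    features.update({
        "parsable": True,
        "n_rows": len(df),
        "n_columns": df.shape[1],
        "missing_value_ratio": float(df.isna().mean().mean()),
    })
except Exception:
    features["parsable"] = False

print(features)
```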
  26. 26. Where to start? • Some ideas from this study if you're publishing data with GitHub: • provide an informative short textual summary of the dataset • provide a comprehensive README file in a structured form and links to further information • datasets should not exceed standard processable file sizes • datasets should be possible to open with a standard configuration of a common library (such as Pandas). We trained a recurrent neural network; there might be better models, but RNNs are useful for handling text. It is not the greatest predictor (good at classifying, not at predicting reuse), but still useful for helping us tease out features.
  27. 27. Can we help automate curation?
  28. 28. Madelon Hulsebos https://madelonhulsebos.github.io
  29. 29. Example: semantic column type detection Sherlock [Hulsebos et al., KDD, 2019] DL method for semantic data type detection of table columns https://github.com/mitmedialab/sherlock-project
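Sherlock itself is a deep model trained on roughly 1.5K features extracted from each column's values; the sketch below is a deliberately simplified stand-in (character n-grams plus logistic regression) meant only to illustrate the task of mapping a column of values to a semantic type, not the Sherlock implementation. All data and type labels are made up.

```python
# Toy stand-in for semantic column type detection: serialize each column's
# values as one string and classify it into a semantic type.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy training data: each "document" is one table column, serialized as text.
columns = [
    "amsterdam berlin paris london",      # city
    "utrecht rome madrid vienna",         # city
    "1990-01-02 1985-06-13 2001-11-30",   # date
    "2020-05-01 1999-12-31 2010-07-04",   # date
]
types = ["city", "city", "date", "date"]

model = make_pipeline(TfidfVectorizer(analyzer="char", ngram_range=(2, 3)),
                      LogisticRegression())
model.fit(columns, types)

new_column = ["oslo lisbon prague athens"]
print(model.predict(new_column))  # expected: ['city']
```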
  30. 30. Need for new corpora • Database-like table content and structure (semantics, data types, size) • Large-scale to facilitate table representation models • Broad coverage to generalize to a diversity of domains • Table semantics (e.g. column types)
  31. 31. CSVs from Github https://gittables.github.io
  32. 32. Tools to improve data supply chains. Groth, Paul, "Transparency and Reliability in the Data Supply Chain," Internet Computing, IEEE, vol. 17, no. 2, pp. 69-71, March-April 2013, doi: 10.1109/MIC.2013.41.
  33. 33. Conclusion • AI is data centric • Need tools that help users debug and curate their data for ML • Way forward: Conversation between ML, DB, and HCI research • We are hiring :-) Paul Groth | @pgroth | pgroth.com | indelab.org
