
The need for a transparent data supply chain


Illustrating data supply chains and motivating the need for a more transparent data supply chain in the context of responsible data science. Presented at the 2018 KNAW-Royal Society bilateral meeting on responsible data science.

Published in: Technology

  1. THE NEED FOR A TRANSPARENT DATA SUPPLY CHAIN Paul Groth (@pgroth), pgroth.com, Disruptive Technology Director, Elsevier Labs (@elsevierlabs). KNAW – Royal Society Bilateral Meeting on Responsible Data Science, Feb 20-22, 2018. Contributions: Brad Allen, Sujit Pal, Craig Stanley, Ron Daniel, Alex de Jong, Corey Harper
  2. ILLUSTRATING THE DATA SUPPLY CHAIN • Data through models to applications • Data “are” • Raw Data is an Oxymoron
  3. REUSING MODELS
  4. REUSING DATA From: Abu-El-Haija, S., Kothari, N., Lee, J., Natsev, P., Toderici, G., Varadarajan, B. and Vijayanarasimhan, S. YouTube-8M: a large-scale video classification benchmark. arXiv:1609.08675.
  5. BIAS IN DATA → BIAS IN MODELS
  6. PERFORMANCE TOO Paul Groth, Michael Lauruhn, Antony Scerri: “Open Information Extraction on Scientific Text: An Evaluation”, 2018; arXiv:1802.05574, http://arxiv.org/abs/1802.05574
  7. CrowdFlower 2016 Data Science Report
  8. INTEGRATION OF LARGE NUMBERS OF DATA SOURCES Groth, Paul, "The Knowledge-Remixing Bottleneck," Intelligent Systems, IEEE, vol. 28, no. 5, pp. 44-48, Sept.-Oct. 2013, doi: 10.1109/MIS.2013.138 • 10 different extractors • E.g., the mapping-based infobox extractor • Infobox extractor uses a hand-built ontology based on the 350 commonly used English-language infoboxes • Integrates with Yago • Yago relies on Wikipedia + Wordnet • Upper ontology from Wordnet and then a mapping to Wikipedia categories based on frequencies • Wordnet is built by psycholinguists
  9. Enrichment through integration using linked data
  10. “The goal of auditability is to clearly document when decisions are made and, if necessary, backtrack to an earlier dataset and address the issue at the root” 1. Acknowledge that data are people and can do harm 2. Recognize that privacy is more than a binary value 3. Guard against the reidentification of your data 4. Practice ethical data sharing 5. Consider the strengths and limitations of your data; big does not automatically mean better 6. Debate the tough, ethical choices 7. Develop a code of conduct for your organization, research community, or industry 8. Design your data and systems for auditability 9. Engage with the broader consequences of data and analysis practices 10. Know when to break these rules. Zook M, Barocas S, boyd d, Crawford K, Keller E, et al. (2017) Ten simple rules for responsible big data research. PLOS Computational Biology 13(3): e1005399. https://doi.org/10.1371/journal.pcbi.1005399
  11. THE RIGHT TO AN EXPLANATION “The data subject shall have the right to obtain … the existence of automated decision-making, including profiling … meaningful information about the logic involved, as well as the significance and the envisaged consequences of such processing for the data subject.” EU General Data Protection Regulation, Chapter 3, Article 15
  12. WIKIDATA VOCABULARY
  13. EMMeT Ontology • Total concepts = 540,632 • 100+ person-years of clinical expert knowledge
  14. http://www.publicbooks.org/justice-for-data-janitors/
  15. CHANGE PROPAGATES 1. dealing with changing cultural and societal norms, specifically to address or correct bias; 2. political influence; 3. new concepts and terminology arising from discoveries or change in perspective within a technical/scientific community; 4. gardening; 5. incremental contributorship; 6. progressive formalization; 7. software and automation; 8. integration of large numbers of data sources; 9. variance in algorithm training data. [Diagram labels: Concept1, Concept2, Concept3, KOS; Professional Curators, Literature, Software, Non-professional contributors, Data, Society & Politics; source groupings (4, 5, 6), (7, 8, 9), (3), (1, 2)] Lauruhn, Michael, and Paul Groth. "Sources of Change for Modern Knowledge Organization Systems." Knowledge Organization 43, no. 8 (2016).
  16. A MORE TRANSPARENT DATA SUPPLY CHAIN Groth, Paul, "Transparency and Reliability in the Data Supply Chain," Internet Computing, IEEE, vol. 17, no. 2, pp. 69-71, March-April 2013, doi: 10.1109/MIC.2013.41
  17. TRANSPARENCY ACKNOWLEDGES MESSINESS M. C. Elish & danah boyd (2018) Situating methods in the magic of Big Data and AI, Communication Monographs, 85:1, 57-80, DOI: 10.1080/03637751.2017.1375130
