1. The document discusses the need for transparency in data supply chains. It notes that data goes through multiple steps as it is collected, modeled, and applied in applications.
2. It illustrates the complexity of data supply chains using examples of how data is reused and integrated from multiple sources to build models and how bias can propagate.
3. The document argues that transparency is important for understanding where data comes from and how it has been processed, and for addressing issues such as bias and privacy at their source in the data supply chain.
1. THE NEED FOR A TRANSPARENT DATA SUPPLY CHAIN
Paul Groth (@pgroth)
pgroth.com
Disruptive Technology Director
Elsevier Labs (@elsevierlabs)
KNAW – Royal Society Bilateral Meeting on Responsible Data Science
Feb 20-22, 2018
Contributions: Brad Allen, Sujit Pal, Craig Stanley, Ron Daniel, Alex de Jong, Corey Harper
2. ILLUSTRATING THE DATA SUPPLY CHAIN
• Data through models to applications
• Data “are”
• Raw Data is an Oxymoron
5. REUSING DATA
From: Abu-El-Haija, S., Kothari, N., Lee, J., Natsev, P., Toderici, G., Varadarajan, B.
and Vijayanarasimhan, S. YouTube-8M: a large-scale video classification benchmark.
arXiv:1609.08675.
7. PERFORMANCE TOO
Paul Groth, Michael Lauruhn, Antony Scerri: “Open Information Extraction on
Scientific Text: An Evaluation”, 2018; arXiv:1802.05574,
http://arxiv.org/abs/1802.05574
9. INTEGRATION OF LARGE NUMBERS OF DATA SOURCES
Groth, Paul, "The Knowledge-Remixing Bottleneck," Intelligent Systems, IEEE,
vol.28, no.5, pp.44-48, Sept.-Oct. 2013, doi: 10.1109/MIS.2013.138
• 10 different extractors
• E.g., the mapping-based infobox extractor
• The infobox extractor uses a hand-built ontology based on the 350
commonly used English-language infoboxes
• Integrates with Yago
• Yago relies on Wikipedia + Wordnet
• Upper ontology from Wordnet and then a mapping to
Wikipedia categories based on frequencies
• Wordnet is built by psycholinguists
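The chain of dependencies on this slide can be written down as a small graph and walked back to its roots. The following is a minimal sketch, not the method of the cited article: the source names come from the bullets above, while the top-level node "integrated dataset", the `UPSTREAM` dictionary, and the `trace_upstream` helper are hypothetical names introduced only for illustration.

```python
from collections import deque

# Hypothetical dependency map: each artifact -> the sources it is built from.
# The edges follow the bullets above; "integrated dataset" is an assumed
# stand-in for the final product that combines the extractor output with Yago.
UPSTREAM = {
    "integrated dataset": ["mapping-based infobox extractor", "Yago"],
    "mapping-based infobox extractor": ["hand-built infobox ontology"],
    "hand-built infobox ontology": ["English-language infoboxes"],
    "Yago": ["Wikipedia", "Wordnet"],
    "Wordnet": ["psycholinguists"],
}


def trace_upstream(artifact, upstream=UPSTREAM):
    """Return every source the artifact depends on, breadth-first, no duplicates."""
    seen = []
    queue = deque(upstream.get(artifact, []))
    while queue:
        node = queue.popleft()
        if node in seen:
            continue
        seen.append(node)
        # Follow the chain further back, if this source has sources of its own.
        queue.extend(upstream.get(node, []))
    return seen


sources = trace_upstream("integrated dataset")
for name in sources:
    print(name)
```

Even in this toy graph, an application built on the integrated dataset inherits choices made several steps upstream, down to the people who built Wordnet.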
12. “The goal of auditability is to
clearly document when decisions
are made and, if necessary,
backtrack to an earlier dataset and
address the issue at the root”
1. Acknowledge that data are people and can do harm
2. Recognize that privacy is more than a binary value
3. Guard against the reidentification of your data
4. Practice ethical data sharing
5. Consider the strengths and limitations of your data; big does not
automatically mean better
6. Debate the tough, ethical choices
7. Develop a code of conduct for your organization, research community,
or industry
8. Design your data and systems for auditability
9. Engage with the broader consequences of data and analysis practices
10. Know when to break these rules
Zook M, Barocas S, boyd d, Crawford K, Keller E, et al. (2017) Ten simple rules for
responsible big data research. PLOS Computational Biology 13(3): e1005399.
https://doi.org/10.1371/journal.pcbi.1005399
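Rule 8 and the quotation on this slide describe auditability as documenting each decision and being able to backtrack to an earlier dataset. One way this could look in code is an append-only log in which every processing step records a fingerprint of its input and output. This is a minimal sketch under that assumption; `fingerprint`, `AuditLog`, `record`, and `backtrack` are hypothetical names, not taken from the cited paper or from any library, and the sample rows are invented.

```python
import hashlib
import json


def fingerprint(dataset):
    """Stable SHA-256 fingerprint of a JSON-serializable dataset."""
    payload = json.dumps(dataset, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


class AuditLog:
    """Append-only record of processing decisions (illustrative only)."""

    def __init__(self):
        self.entries = []
        self.snapshots = {}  # fingerprint -> copy of the dataset

    def record(self, decision, before, after):
        """Log one decision with fingerprints of its input and output."""
        for data in (before, after):
            # Store a deep copy so later mutation cannot alter the history.
            self.snapshots[fingerprint(data)] = json.loads(json.dumps(data))
        self.entries.append({
            "decision": decision,
            "input": fingerprint(before),
            "output": fingerprint(after),
        })

    def backtrack(self, decision):
        """Return the dataset as it was just before the named decision."""
        for entry in self.entries:
            if entry["decision"] == decision:
                return self.snapshots[entry["input"]]
        raise KeyError(decision)


# Invented example data: two rows, one with a missing value.
raw = [{"age": 34, "zip": "1012"}, {"age": None, "zip": "1017"}]
cleaned = [row for row in raw if row["age"] is not None]
coarse = [{"age": row["age"], "zip": row["zip"][:2]} for row in cleaned]

log = AuditLog()
log.record("drop rows with missing age", raw, cleaned)
log.record("truncate zip to 2 digits", cleaned, coarse)

restored = log.backtrack("truncate zip to 2 digits")
print(restored)
```

Keeping full snapshots is only workable for small data; a real pipeline would more likely store references to versioned datasets and keep just the fingerprints in the log.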
13. THE RIGHT TO AN EXPLANATION
“The data subject shall have the right to obtain … the
existence of automated decision-making, including profiling
… meaningful information about the logic involved, as
well as the significance and the envisaged consequences
of such processing for the data subject.”
EU General Data Protection Regulation, Chapter 3, Article 15
18. CHANGE PROPAGATES
[Figure: a knowledge organization system (KOS) containing Concept1, Concept2, and Concept3, surrounded by its sources of change: Professional Curators, Non-professional contributors, Literature, Software, Data, and Society & Politics. The figure groups the numbered sources below as (1, 2), (3), (4, 5, 6), and (7, 8, 9).]
1. dealing with changing cultural and societal
norms, specifically to address or correct bias;
2. political influence
3. new concepts and terminology arising from
discoveries or change in perspective within a
technical/scientific community
4. gardening
5. incremental contributorship
6. progressive formalization
7. software and automation
8. integration of large numbers of data sources
9. variance in algorithm training data
Lauruhn, Michael, and Paul Groth. "Sources of Change for Modern Knowledge
Organization Systems." Knowledge Organization 43, no. 8 (2016).
19. A MORE TRANSPARENT DATA SUPPLY CHAIN
Groth, Paul, "Transparency and Reliability in the Data Supply
Chain," Internet Computing, IEEE, vol.17, no.2, pp.69-71,
March-April 2013, doi: 10.1109/MIC.2013.41
20. TRANSPARENCY ACKNOWLEDGES MESSINESS
M. C. Elish & danah boyd (2018) Situating methods in the magic of
Big Data and AI, Communication Monographs, 85:1, 57-80, DOI:
10.1080/03637751.2017.1375130