Sabina Leonelli
Exeter Centre for the Study of Life Sciences (Egenis)
& Department of Sociology, Philosophy and Anthropology
University of Exeter
@sabinaleonelli
• New technologies for producing and storing lots of data, fast, about anything and everything
• New institutions and communication platforms for disseminating data
• New forms of analysis, computing and automation
= Gateway to new social behaviors, services, self-understanding
= Novel status of data in research
• Potential to improve
  • Pathways to and quality of discoveries: data mining helps spot gaps and opportunities
  • Collaboration across sites, disciplines and countries
  • Uptake of new technologies
  • Research evaluation, debate and transparency
  • Significance of research components beyond papers and patents
  • Fight against fraud, low quality and duplication of efforts
  • Legitimacy of science, public trust and engagement
Tracking data journeys
To understand how data move from sites of production to sites of
dissemination and interpretation/use, and with which consequences
• Focus:
1. Databases as windows on material/conceptual/institutional
labor required to make data widely accessible and useable
• labels & software to classify, model, visualize, retrieve data
• management of infrastructure and communications
2. Data re-use cases to investigate
• conditions under which data can be interpreted
• implications for discovery & what counts as good research
• role of Open Science movement in knowledge generation
• Empirical sources:
  • Archives, scientific literature
  • Interviews and participant observation to document research practices and attitudes to data openness, curation, re-use
  • Collaboration & direct involvement in data journeys
• Comparative analysis across research areas and countries:
  • Varieties of data types, research goals, methods and instruments
  • Area-specific requirements, political economy and ethos
  • Regulatory frameworks, research environments & infrastructure
  • Data sharing and re-use across high- and low-income countries
[Figure: Interoperable Data Infrastructures, connecting data sources and many other DBs]
• Health and environmental data
• Sensitive biomedical data
• Cross-species (from yeast to human)
• Plant data across lab and field, North and South
1. Favoring conservatism over innovation
2. Making bias invisible
3. Building on unreliable data
4. Prioritising commercial interests
5. Encouraging research that is irrelevant for or damaging to society
• Nested, inter-dependent databases collecting different data types and approaches
• Not easily interoperable! Critical role of
  • language used to order and retrieve data (e.g. ontologies)
  • meta-data and experimental protocols (e.g. Minimal Information and ISA-Tab standard)
  • standards for what counts as reliable data and evidential significance
• Real challenges in developing and updating database content (formats, software, knowledge base)
• Common standards help... when complemented by trained human judgement on how to apply them
• No sustainability for databases and related curation
  • Lack of long-term funds and willingness to invest
  • Vision of ‘data curation’ as technical service rather than research
  • Researchers rarely receiving expert support on data management
• In the absence of intelligent human curation, many databases disappear or (worse!) stagnate → major hurdles in re-using badly kept old data
• General focus on re-using instead of creating: what does it mean for creativity and innovation?
  • Esp. where objects of inquiry keep changing and evolving
1. Favoring conservatism over innovation
2. Building on unreliable data
3. Making bias invisible
4. Prioritising commercial interests
5. Encouraging research that is irrelevant for or
damaging to society
• Difficulties in locating error and evaluating data provenance and quality, esp. when data travel beyond specific communities of practice
• Re-contextualising information (meta-data) is crucial to data interpretation, but often insufficient or badly selected/annotated
• Data quality assessment
  • varies depending on specific use (e.g. microarrays)
  • often depends on access to original materials or instruments, yet
    ▪ sample collections are unsystematic, underfunded, and not interlinked (which makes samples hard to locate and relate to data)
    ▪ old instruments are not kept, unless for historical purposes
• Data sharing and re-use needs data curation that is
  • Intelligent (Royal Society, 2012)
  • Informed by familiarity with research objects and target systems
  • Trustworthy (able to weed out error and unreliable data)
  • Consistent across time and space (e.g. longitudinal data collection in oceanography, epidemiology, environmental science)
• Databases are rarely regarded as trustworthy beyond relatively small epistemic communities with strong social bonds and user participation
• Big data collections tend to be extremely selective:
  • Databases display the outputs of rich, English-speaking labs within visible and popular research traditions, which deal with ‘tractable’ data formats
  • Involvement of poor/unfashionable labs, developing countries & non-scientists is low and at the ‘receiving’ end
  • Increasing digital divide in research too: inequalities of visibility, power and location can be reinforced, rather than mitigated, by big data dissemination
• Huge disparity in data sources and types of data that can be curated/disseminated/reused
• Inequality in representation, e.g. sampling across populations for health research, diverse applications of data protection laws
• Sampling based on convenience and institutional/financial factors: not a novelty per se, yet often the resulting bias remains unaccounted for in research methods and analysis
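One way to keep sampling bias from staying invisible is to report, alongside any analysis, how the groups in a dataset compare with a reference population. The sketch below is only an illustration of that idea: the function name `representation_gap`, the `region` field and the toy values are all hypothetical and do not come from the talk.

```python
from collections import Counter


def representation_gap(records, reference_shares, key="region"):
    """Return {group: observed share - reference share} for each reference group.

    `records` is a list of dicts; `reference_shares` maps each group to the
    share it would have if sampling matched the reference population.
    All names here are illustrative assumptions.
    """
    counts = Counter(r[key] for r in records)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no records to assess")
    return {
        group: counts.get(group, 0) / total - share
        for group, share in reference_shares.items()
    }


if __name__ == "__main__":
    # Toy data: three records from one region, one from another.
    toy = [{"region": "north"}] * 3 + [{"region": "south"}]
    print(representation_gap(toy, {"north": 0.5, "south": 0.5}))
```

A positive value means the group is over-represented relative to the chosen reference; which reference is appropriate remains a case-by-case judgment.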
• Triumph of commercial and opportunistic concerns over scientific reasoning and investigative decisions
  • Data choice, processing and dissemination mechanisms are governed by non-epistemic factors
  • More ‘tractable’ data in digital formats are more easily shared, accumulated and exchanged as commodities...
  • ... while complex and expensive datasets often become or remain private
• In US, lack of appropriate regulation over data dissemination and commercialization
• Lack of clarity over legal regimes, esp. for research data
• BOD presented as opportunity to shake up the research system and make it participatory and socially responsive
• E.g. increased citizen engagement in data collection, sharing and analysis:
  • Citizen Science movement
  • the ‘right to science’ in medicine
  • DIY Biology and Fab Labs
• Data linkage across data sources involves serious risks to individuals and communities (e.g. privacy, medical assessment, representation in government and social services)
• Research outputs may damage society instead of leading to human flourishing (e.g. Royal Society Data Governance Report 2017, Luciano Floridi’s work)
• Importance of preserving human rights, not corrupting them (e.g. ‘right to science’ movement, Effy Vayena’s work)
• Data sharing and re-use needs to be ethically sound
• AI-enabled mining of Big Data: alternative to extensive human interventions & decision-making currently required for data analysis
• Big Data + AI = automation of inquiry
• Promise of faster, better, more reliable, more sustainable knowledge production
• Big data in and of themselves do not provide a reliable evidence base for computational analysis. Assumptions to the contrary, with work disputing each:
  • All data are FAIR (vs: Lovell, Leonelli et al under review)
  • Big Data are comprehensive, and thus
    ▪ Big data counters bias in data collection and interpretation (vs: Leonelli 2018)
    ▪ Big data makes debate over sampling redundant, as we have data about everything (vs: Leonelli 2014 in Big Data & Society)
• Forging tools for unregulated mass surveillance of human behavior at individual as well as community levels
• Producing unreliable knowledge that does not help to tackle urgent social challenges
• Expanding existing divides and silencing scientific traditions from low-resourced environments and ‘unfashionable’ topics
• Eroding scientific expertise and centuries-old methodological wisdom: ‘anything online goes’
• Eroding trust and credibility of science: exponential growth of opportunities for marketing “alternative facts”
• Recognise local and situated nature of data selection, sampling methods and data quality assessment → informing decisions about the scope of inferential claims (what they apply to)
• Promote effective, context-specific, sustainable data curation, and explicit criteria for data and meta-data inclusion and formatting
• Build accountability for choice and sources of data into data infrastructures and analytic tools
  • Track data histories and journeys through meta-data
  • Avoid irreversible data linkage
  • Strengthen links between digital data and research materials
• Build safeguards for social/ethical concerns to improve research methods: ethical data handling enables data linkage, supports information security, facilitates data mining
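The recommendations on tracking data journeys through meta-data, avoiding irreversible linkage and keeping a link to research materials can be sketched in code. This is a minimal illustration under stated assumptions, not a description of any existing infrastructure: the `Record` and `Step` classes, the `sample_id` field and the `link_records`/`unlink` helpers are hypothetical names introduced here.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass(frozen=True)
class Step:
    """One stage in a record's journey (hypothetical schema)."""
    actor: str
    action: str
    note: str = ""


@dataclass
class Record:
    source: str                       # who or what produced the datum
    values: Dict[str, Any]
    sample_id: Optional[str] = None   # pointer to physical material, if any
    journey: List[Step] = field(default_factory=list)

    def log(self, actor: str, action: str, note: str = "") -> None:
        """Append a step so the record carries its own history."""
        self.journey.append(Step(actor, action, note))


def link_records(records: List[Record], key: str) -> Dict[Any, List[Record]]:
    """Group records by a shared key without merging them.

    Each original record survives intact inside its group, so the
    linkage can be undone (disaggregated) later.
    """
    linked: Dict[Any, List[Record]] = {}
    for rec in records:
        linked.setdefault(rec.values[key], []).append(rec)
        rec.log("linker", "linked", f"on {key}")
    return linked


def unlink(linked: Dict[Any, List[Record]]) -> List[Record]:
    """Recover the individual records from a linked view."""
    records = [rec for group in linked.values() for rec in group]
    for rec in records:
        rec.log("linker", "unlinked")
    return records


if __name__ == "__main__":
    a = Record("lab_A", {"species": "yeast"}, sample_id="S1")
    b = Record("field_B", {"species": "yeast"})
    for rec in unlink(link_records([a, b], "species")):
        print(rec.source, [s.action for s in rec.journey])
```

Grouping rather than merging is the design choice that matters here: the linked view is derived from the records, so nothing is lost when the link is dropped.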
Special thanks to the hundreds of
scientists who participated in this
research, Michel Durinx for the figures,
the Data Studies group in Egenis and
the many colleagues in philosophy,
history and social studies of science
who provided crucial and generous
support and insights.
This research was supported by funding
from:
• European Research Council (grant
335925)
• Leverhulme Trust
• Economic and Social Research
Council
• Max Planck Institute for the History
of Science
• British Academy
• Australian Research Council
• University of Ghent
1. Context-specific curation is key to data re-use
2. Long-term maintenance is key to trustworthiness
3. Which data and why?
4. Data and materials
5. Role of ethics, humanities and social sciences in data management
• Data curation is essential to BOD re-use and interpretation, yet badly underestimated and not rewarded
  • Value of long-standing research traditions and reviewing methods
  • Crucial for data infrastructures to be user-friendly and receptive to user feedback
• Need case-by-case judgments on data quality and fruitful modes of data sharing
• Pluralism in methods and standards contributes robustness to data analysis, and reduces risk of losing system-specific knowledge
  • Standards and formats are key
  • Yet reliance on overly rigid standards creates exclusions and obliterates system-specific knowledge
• Data linkage methods are best when it is possible to disaggregate
  • Interoperability is preferable to integration
• Regular updates across nested infrastructures
• Business plan for long-term sustainability
• For BOD, this means:
  • Clear relation between international field-specific databases, international clouds, national clouds, institutional repositories
  • Make sure each node is resilient and system is not crippled by individual node failure (now all independently funded, typically in the short term)
• Particularly important since hard to guarantee data quality
• Re-use often linked to participation in developing data infrastructures
  • rarely the case for busy practitioners, considering also gap in skills
• Role of confidence assessments on data quality and reliability (again: expert curation is key)
• Indiscriminate calls for open data can lead to serendipity in what data are circulated and when
  • Need explicit rationale around priority given to specific data types and sources (e.g. ‘omics’ in biology)
• Substantive disagreements over data management:
  • Methods, terminologies, standards involved in data production and interpretation
  • To be useable, data are handled by several individuals with diverse expertise: the evidential value and representational power attributed to a given dataset can vary
• Criteria for what counts as good data, or even as data altogether, vary dramatically even within the same field
• Data producers, curators, users make choices about what constitutes data at each stage of their journeys
• Data as relational: what counts as data varies in relation to research situations [Leonelli 2015, 2016, 2019]
  • Any object can be considered as a datum as long as (1) it is treated as potential evidence for one or more claims about phenomena, and (2) it is possible to circulate it among individuals/groups
Inference as a process of situating data in relation to elements of relevance to interpretation (materials, instruments, interests, norms)
• developing a context for inquiry that aligns one’s purpose with existing theoretical commitments and selected properties of data and target system
  • E.g. Databases as enabling comparison among ways to organise data
• “All inductive inference is local” (Norton 2003)
• What makes for ‘good’ inference? Triangulation of different data sources is often cited by BOD advocates
• My view: triangulation is necessary, but not sufficient
  • Difficulties in accounting for partiality in data sources (Leonelli 2014, 2016, 2018)
  • Efforts to maintain continuity and commensurability when re-assessing same dataset with different methods across time (Wylie 2017, Leonelli 2018)
• Building on Alison Wylie’s conditions for robust evidential reasoning...
  (1) security, (2) causal anchoring and causal independence, (3) conceptual independence, (4) grounds for calibration and (5) addressing divergence [Chapman and Wylie 2016]
• ... and adding:
  (6) diversity of sources and subsequent handling methods (make data journeys trackable)
  (7) explicit valuing criteria for data production, dissemination and reuse (debate which data get to travel; ethics and social concerns are key to data re-usability and security)
  (8) material anchoring where possible (link digital databases and sample collections)
  (9) critical use of standards (balance standardization with local solutions to preserve system-specific methods)
• Choice of meta-data and link to research materials
• Reliable stock centers and collections: rarely available & coordinated with databases
  • E.g. model organism stock centres, biobanks
• Ethical, social and security concerns increase quality and re-usability of data/infrastructures
• Data re-use requires well-informed, sustainable, inclusive, participative development and use of data infrastructures
  • Related skills are as central to big data use as computational skills
• Data management training requires input from all fields, esp. social science and humanities
[Figure: Data, Objects, Models representing the world, Knowledge, Interactions with the world]

Scientific research in the era of Big Data (La ricerca scientifica nell'era dei Big Data) - Sabina Leonelli

  • 1.
    Sabina Leonelli Exeter Centrefor the Study of Life Sciences (Egenis) & Department of Sociology, Philosophy and Anthropology University of Exeter @sabinaleonelli
  • 2.
     New technologiesfor producing and storing lots of data, fast, about anything and everything  New institutions and communication platforms for disseminating data  New forms of analysis, computing and automation = Gateway to new social behaviors, services, self- understanding = Novel status of data in research
  • 3.
     Potential toimprove  Pathways to and quality of discoveries: data mining helps spot gaps and opportunities  Collaboration across sites, disciplines and countries  Uptake of new technologies  Research evaluation, debate and transparency  Significance of research components beyond papers and patents  Fight against fraud, low quality and duplication of efforts  Legitimacy of science, public trust and engagement
  • 4.
     Potential toimprove  Pathways to and quality of discoveries: data mining helps spot gaps and opportunities  Collaboration across sites, disciplines and countries  Uptake of new technologies  Research evaluation, debate and transparency  Significance of research components beyond papers and patents  Fight against fraud, low quality and duplication of efforts  Legitimacy of science, public trust and engagement
  • 5.
    Tracking data journeys Tounderstand how data move from sites of production to sites of dissemination and interpretation/use, and with which consequences • Focus: 1. Databases as windows on material/conceptual/institutional labor required to make data widely accessible and useable • labels & software to classify, model, visualize, retrieve data • management of infrastructure and communications 2. Data re-use cases to investigate • conditions under which data can be interpreted • implications for discovery & what counts as good research • role of Open Science movement in knowledge generation
  • 6.
     Empirical sources: Archives, scientific literature  Interviews and participant observation to document research practices and attitudes to data openness, curation, re-use  Collaboration & direct involvement in data journeys  Comparative analysis across research areas and countries: • Varieties of data types, research goals, methods and instruments • Area-specific requirements, political economy and ethos • Regulatory frameworks, research environments & infrastructure • Data sharing and re-use across high- and low-income countries
  • 7.
  • 8.
  • 9.
     Health and environmental data Sensitive biomedical data  Cross-species (from yeast to human)  Plant data across lab and field, North and South
  • 10.
    1. Favoring conservatismover innovation 2. Making bias invisible 3. Building on unreliable data 4. Prioritising commercial interests 5. Encouraging research that is irrelevant for or damaging to society
  • 11.
    1. Favoring conservatismover innovation 2. Making bias invisible 3. Building on unreliable data 4. Prioritising commercial interests 5. Encouraging research that is irrelevant for or damaging to society
  • 12.
     Nested, inter-dependentdatabases collecting different data types and approaches  Not easily interoperable! Critical role of  language used to order and retrieve data (e.g. ontologies)  meta-data and experimental protocols (e.g. Minimal Information and ISA-Tab standard)  standards for what counts as reliable data and evidential significance  Real challenges in developing and updating database content (formats, software, knowledge base)  Common standards help.. when complemented by trained human judgement on how to apply them
  • 14.
     No sustainabilityfor databases and related curation  Lack of long-term funds and willingness to invest  Vision of ‘data curation’ as technical service rather than research  Researchers rarely receiving expert support on data management  In the absence of intelligent human curation, many databases disappear or (worse!) stagnate  major hurdles in re-using badly kept old data  General focus on re-using instead of creating: what does it mean for creativity and innovation?  Esp. where objects of inquiry keep changing and evolving
  • 15.
    1. Favoring conservatismover innovation 2. Building on unreliable data 3. Making bias invisible 4. Prioritising commercial interests 5. Encouraging research that is irrelevant for or damaging to society
  • 16.
     Difficulties inlocating error and evaluating data provenance and quality, esp. when data travel beyond specific communities of practice  Re-contextualising information (meta-data) is crucial to data interpretation, but often insufficient or badly selected/annotated  Data quality assessment  varies depending on specific use (e.g. microarrays)  often depends on access to original materials or instruments, yet ▪ sample collections are unsystematic, underfunded, and not interlinked (which makes samples hard to locate and relate to data) ▪ old instruments are not kept, unless for historical purposes
  • 17.
     Data sharingand re-use needs data curation that is  Intelligent (Royal Society, 2012)  Informed by familiarity with research objects and target systems  Trustworthy (able to weed out error and unreliable data)  Consistent across time and space (e.g. longitudinal data collection in oceanography, epidemiology, environmental science)  Databases are rarely regarded as trustworthy beyond relatively small epistemic communities with strong social bounds and user participation
  • 18.
    1. Favoring conservatismover innovation 2. Building on unreliable data 3. Making bias invisible 4. Prioritising commercial interests 5. Encouraging research that is irrelevant for or damaging to society
  • 19.
     Big datacollections tend to be extremely selective:  Databases display the outputs of rich, English-speaking labs within visible and popular research traditions, which deal with ‘tractable’ data formats  Involvement of poor/unfashionable labs, developing countries & non-scientists is low and at the ‘receiving’ end  Increasing digital divide in research too: Inequalities of visibility, power and location can be reinforced, rather than mitigated, by big data dissemination  Huge disparity in data sources and types of data that can be curated/disseminated/reused  Inequality in representation - e.g. sampling across populations for health research, diverse applications of data protection laws  Sampling based on convenience and institutional/financial factors: not a novelty per se, yet often the resulting bias remains unaccounted for in research methods and analysis
  • 20.
    1. Favoring conservatismover innovation 2. Building on unreliable data 3. Making bias invisible 4. Prioritising commercial interests 5. Encouraging research that is irrelevant for or damaging to society
  • 21.
     Triumph ofcommercial and opportunistic concerns over scientific reasoning and investigative decisions  Data choice, processing and dissemination mechanisms are governed by non-epistemic factors  More ‘tractable’ data in digital formats are more easily shared, accumulated and exchanged as commodities..  .. while complex and expensive datasets often become or remain private  In US, lack of appropriate regulation over data dissemination and commercialization  Lack of clarity over legal regimes, esp. for research data
  • 22.
    1. Favoring conservatismover innovation 2. Building on unreliable data 3. Making bias invisible 4. Prioritising commercial interests 5. Encouraging research that is irrelevant for or damaging to society
  • 23.
     BOD presentedas opportunity to shake up the research system and make it participatory and socially responsive  E.g. increased citizen engagement in data collection, sharing and analysis:  Citizen Science movement  the ‘right to science’ in medicine  DIY Biology and Fab Labs
  • 25.
     Data linkageacross data sources involves serious risks to individuals and communities (e.g. privacy, medical assessment, representation in government and social services)  Research outputs may damage society instead of leading to human flourishing (e.g. Royal Society Data Governance Report 2017, Luciano Floridi’s work)  Importance of preserving human rights, not corrupting them (e.g. ‘right to science’ movement, EffyVayena’s work)  Data sharing and re-use needs to be ethically sound
  • 26.
     AI-enabled miningof Big Data: alternative to extensive human interventions & decision- making currently required for data analysis  Big Data + AI = automation of inquiry  Promise of faster, better, more reliable, more sustainable knowledge production
  • 27.
     AI-enabled miningof Big Data: alternative to extensive human interventions & decision- making currently required for data analysis  Big Data + AI = automation of inquiry  Promise of faster, better, more reliable, more sustainable knowledge production
  • 28.
     Big datain and of themselves do not provide a reliable evidence base for computational analysis:  All data are FAIR (vs: Lovell, Leonelli et al under review)  Big Data are comprehensive, and thus ▪ Big data counters bias in data collection and interpretation (vs: Leonelli 2018) ▪ Big data makes debate over sampling redundant, as we have data about everything (vs: Leonelli 2014 in Big Data & Society)
  • 29.
     Big datain and of themselves do not provide a reliable evidence base for computational analysis:  All data are FAIR (vs: Lovell, Leonelli et al under review)  Big Data are comprehensive, and thus ▪ Big data counters bias in data collection and interpretation (vs: Leonelli 2018) ▪ Big data makes debate over sampling redundant, as we have data about everything (vs: Leonelli 2014 in Big Data & Society)
  • 30.
     Big datain and of themselves do not provide a reliable evidence base for computational analysis:  All data are FAIR (vs: Lovell, Leonelli et al under review)  Big Data are comprehensive, and thus ▪ Big data counters bias in data collection and interpretation (vs: Leonelli 2018) ▪ Big data makes debate over sampling redundant, as we have data about everything (vs: Leonelli 2014 in Big Data & Society)
  • 31.
     Big datain and of themselves do not provide a reliable evidence base for computational analysis:  All data are FAIR (vs: Lovell, Leonelli et al under review)  Big Data are comprehensive, and thus ▪ Big data counters bias in data collection and interpretation (vs: Leonelli 2018) ▪ Big data makes debate over sampling redundant, as we have data about everything (vs: Leonelli 2014 in Big Data & Society)
  • 32.
     Forging toolsfor unregulated mass surveillance of human behavior at individual as well as community levels  Producing unreliable knowledge that does not help to tackle urgent social challenges  Expanding existing divides and silencing scientific traditions from low-resourced environments and ‘unfashionable’ topics  Eroding scientific expertise and centuries-old methodological wisdom: ‘anything online goes’  Eroding trust and credibility of science: exponential growth of opportunities for marketing “alternative facts”
  • 33.
     Recognise localand situated nature of data selection, sampling methods and data quality assessment  informing decisions about the scope of inferential claims (what they apply to)  Promote effective, context-specific, sustainable data curation, and explicit criteria for data and meta-data inclusion and formatting  Build accountability for choice and sources of data into data infrastructures and analytic tools  Track data histories and journeys through meta-data  Avoid irreversible data linkage  Strengthen links between digital data and research materials  Build safeguards for social/ethical concerns to improve research methods: ethical data handling enables data linkage, supports information security, facilitates data mining
  • 34.
    Special thanks tothe hundreds of scientists who participated in this research, Michel Durinx for the figures, the Data Studies group in Egenis and the many colleagues in philosophy, history and social studies of science who provided crucial and generous support and insights. This research was supported by funding from: • European Research Council (grant 335925) • Leverhulme Trust • Economic and Social Research Council • Max Planck Institute for the History of Science • British Academy • Australian Research Council • University of Ghent
  • 35.
    1. Context-specific curationis key to data re- use 2. Long-term maintenance is key to trustworthiness 3. Which data and why? 4. Data and materials 5. Role of ethics, humanities and social sciences in data management
  • 36.
     Data curationis essential to BOD re-use and interpretation, yet badly underestimated and not rewarded  Value of long-standing research traditions and reviewing methods  Crucial for data infrastructures to be user-friendly and receptive to user feedback  Need case-by-case judgments on data quality and fruitful modes of data sharing
  • 37.
     Pluralism inmethods and standards contributes robustness to data analysis, and reduces risk of losing system-specific knowledge  Standards and formats are key  Yet reliance on overly rigid standards creates exclusions and obliterates system-specific knowledge  Data linkage methods are best when it is possible to disaggregate  Interoperability is preferable to integration
  • 38.
     Regular updatesacross nested infrastructures  Business plan for long-term sustainability  For BOD, this means:  Clear relation between international field-specific databases, international clouds, national clouds, institutional repositories  Make sure each node is resilient and system is not crippled by individual node failure (now all independently funded, typically in the short term)
  • 39.
    • Particularly important since it is hard to guarantee data quality
    • Re-use often linked to participation in developing data infrastructures
      → rarely the case for busy practitioners, considering also the gap in skills
    • Role of confidence assessments on data quality and reliability (again: expert curation is key)
  • 41.
    • Indiscriminate calls for open data can lead to serendipity in what data are circulated and when
    • Need explicit rationale around priority given to specific data types and sources (e.g. ‘omics’ in biology)
    • Substantive disagreements over data management:
    • Methods, terminologies, standards involved in data production and interpretation
    • To be useable, data are handled by several individuals with diverse expertise: the evidential value and representational power attributed to a given dataset can vary
  • 42.
    • Criteria for what counts as good data – even as data altogether – vary dramatically even within the same field
    • Data producers, curators, users make choices about what constitutes data at each stage of their journeys
    • Data as relational: what counts as data varies in relation to research situations [Leonelli 2015, 2016, 2019]
    • Any object can be considered as a datum as long as (1) it is treated as potential evidence for one or more claims about phenomena, and (2) it is possible to circulate it among individuals/groups
  • 43.
    Inference as a process of situating data in relation to elements of relevance to interpretation (materials, instruments, interests, norms)
      → developing a context for inquiry that aligns one’s purpose with existing theoretical commitments and selected properties of data and target system
    • E.g. databases as enabling comparison among ways to organise data
    • “All inductive inference is local” (Norton 2003)
  • 44.
    • What makes for ‘good’ inference? Triangulation of different data sources is often cited by BOD advocates
    • My view: triangulation is necessary, but not sufficient
    • Difficulties in accounting for partiality in data sources (Leonelli 2014, 2016, 2018)
    • Efforts to maintain continuity and commensurability when re-assessing the same dataset with different methods across time (Wylie 2017, Leonelli 2018)
  • 45.
    • Building on Alison Wylie’s conditions for robust evidential reasoning... [Chapman and Wylie 2016]
      (1) security
      (2) causal anchoring and causal independence
      (3) conceptual independence
      (4) grounds for calibration
      (5) addressing divergence
    • ... and adding:
      (6) diversity of sources and subsequent handling methods (make data journeys trackable)
      (7) explicit valuing criteria for data production, dissemination and reuse (debate which data get to travel; ethics and social concerns are key to data re-usability and security)
      (8) material anchoring where possible (link digital databases and sample collections)
      (9) critical use of standards (balance standardization with local solutions to preserve system-specific methods)
  • 46.
    • Choice of meta-data and link to research materials
    • Reliable stock centres and collections: rarely available and coordinated with databases
    • E.g. model organism stock centres, biobanks
  • 48.
    • Ethical, social and security concerns increase quality and re-usability of data/infrastructures
    • Data re-use requires well-informed, sustainable, inclusive, participative development and use of data infrastructures
    • Related skills are as central to big data use as computational skills
    • Data management training requires input from all fields, esp. social science and humanities

Editor's Notes

  • #10 I got worried.
  • #11 Responses to Mael: Differences between disciplines (e.g. physics, where there is a well-established theoretical framework, vs biology, which is epistemologically vulnerable..?) BUT data cleaning… point to MIT volume. Big data techniques do not sidestep contextuality of data. Bernard Stiegler: techniques are pharmakon (remedy) ..
  • #44 move away from universal and a priori models of inductive inference