Scientific Research in the Era of Big Data (La ricerca scientifica nell'era dei Big Data) - Sabina Leonelli

Sabina Leonelli
Exeter Centre for the Study of Life Sciences (Egenis)
& Department of Sociology, Philosophy and Anthropology
University of Exeter
@sabinaleonelli

• New technologies for producing and storing lots of data, fast, about anything and everything
• New institutions and communication platforms for disseminating data
• New forms of analysis, computing and automation
= Gateway to new social behaviors, services, self-understanding
= Novel status of data in research

• Potential to improve
  • Pathways to and quality of discoveries: data mining helps spot gaps and opportunities
  • Collaboration across sites, disciplines and countries
  • Uptake of new technologies
  • Research evaluation, debate and transparency
  • Significance of research components beyond papers and patents
  • Fight against fraud, low quality and duplication of efforts
  • Legitimacy of science, public trust and engagement

Tracking data journeys
To understand how data move from sites of production to sites of
dissemination and interpretation/use, and with which consequences
• Focus:
1. Databases as windows on material/conceptual/institutional
labor required to make data widely accessible and useable
• labels & software to classify, model, visualize, retrieve data
• management of infrastructure and communications
2. Data re-use cases to investigate
• conditions under which data can be interpreted
• implications for discovery & what counts as good research
• role of Open Science movement in knowledge generation

• Empirical sources:
  • Archives, scientific literature
  • Interviews and participant observation to document research practices and attitudes to data openness, curation, re-use
  • Collaboration & direct involvement in data journeys
• Comparative analysis across research areas and countries:
  • Varieties of data types, research goals, methods and instruments
  • Area-specific requirements, political economy and ethos
  • Regulatory frameworks, research environments & infrastructure
  • Data sharing and re-use across high- and low-income countries

Interoperable Data Infrastructures
Data sources

• Health and environmental data
• Sensitive biomedical data
• Cross-species (from yeast to human)
• Plant data across lab and field, North and South

1. Favoring conservatism over innovation
2. Making bias invisible
3. Building on unreliable data
4. Prioritising commercial interests
5. Encouraging research that is irrelevant for or damaging to society

1. Favoring conservatism over innovation
5. Encouraging research that is irrelevant for or
damaging to society

• Nested, inter-dependent databases collecting different data types and approaches
• Not easily interoperable! Critical role of
  • language used to order and retrieve data (e.g. ontologies)
  • meta-data and experimental protocols (e.g. Minimal Information and ISA-Tab standard)
  • standards for what counts as reliable data and evidential significance
• Real challenges in developing and updating database content (formats, software, knowledge base)
• Common standards help.. when complemented by trained human judgement on how to apply them

• No sustainability for databases and related curation
  • Lack of long-term funds and willingness to invest
  • Vision of ‘data curation’ as technical service rather than research
  • Researchers rarely receiving expert support on data management
• In the absence of intelligent human curation, many databases disappear or (worse!) stagnate → major hurdles in re-using badly kept old data
• General focus on re-using instead of creating: what does it mean for creativity and innovation?
  • Esp. where objects of inquiry keep changing and evolving

1. Favoring conservatism over innovation
5. Encouraging research that is irrelevant for or
damaging to society

• Difficulties in locating error and evaluating data provenance and quality, esp. when data travel beyond specific communities of practice
• Re-contextualising information (meta-data) is crucial to data interpretation, but often insufficient or badly selected/annotated
• Data quality assessment
  • varies depending on specific use (e.g. microarrays)
  • often depends on access to original materials or instruments, yet
    ▪ sample collections are unsystematic, underfunded, and not interlinked (which makes samples hard to locate and relate to data)
    ▪ old instruments are not kept, unless for historical purposes

• Data sharing and re-use needs data curation that is
  • Intelligent (Royal Society, 2012)
  • Informed by familiarity with research objects and target systems
  • Trustworthy (able to weed out error and unreliable data)
  • Consistent across time and space (e.g. longitudinal data collection in oceanography, epidemiology, environmental science)
• Databases are rarely regarded as trustworthy beyond relatively small epistemic communities with strong social bonds and user participation

• Big data collections tend to be extremely selective:
  • Databases display the outputs of rich, English-speaking labs within visible and popular research traditions, which deal with ‘tractable’ data formats
  • Involvement of poor/unfashionable labs, developing countries & non-scientists is low and at the ‘receiving’ end
  • Increasing digital divide in research too: inequalities of visibility, power and location can be reinforced, rather than mitigated, by big data dissemination
• Huge disparity in data sources and types of data that can be curated/disseminated/reused
  • Inequality in representation, e.g. sampling across populations for health research, diverse applications of data protection laws
  • Sampling based on convenience and institutional/financial factors: not a novelty per se, yet often the resulting bias remains unaccounted for in research methods and analysis

• Triumph of commercial and opportunistic concerns over scientific reasoning and investigative decisions
  • Data choice, processing and dissemination mechanisms are governed by non-epistemic factors
  • More ‘tractable’ data in digital formats are more easily shared, accumulated and exchanged as commodities..
  • .. while complex and expensive datasets often become or remain private
• In US, lack of appropriate regulation over data dissemination and commercialization
• Lack of clarity over legal regimes, esp. for research data

• BOD presented as opportunity to shake up the research system and make it participatory and socially responsive
• E.g. increased citizen engagement in data collection, sharing and analysis:
  • Citizen Science movement
  • the ‘right to science’ in medicine
  • DIY Biology and Fab Labs

• Data linkage across data sources involves serious risks to individuals and communities (e.g. privacy, medical assessment, representation in government and social services)
• Research outputs may damage society instead of leading to human flourishing (e.g. Royal Society Data Governance Report 2017, Luciano Floridi’s work)
• Importance of preserving human rights, not corrupting them (e.g. ‘right to science’ movement, Effy Vayena’s work)
• Data sharing and re-use needs to be ethically sound

• AI-enabled mining of Big Data: alternative to extensive human interventions & decision-making currently required for data analysis
• Big Data + AI = automation of inquiry
• Promise of faster, better, more reliable, more sustainable knowledge production

• Big data in and of themselves do not provide a reliable evidence base for computational analysis:
  • All data are FAIR (vs: Lovell, Leonelli et al under review)
  • Big Data are comprehensive, and thus
    ▪ Big data counters bias in data collection and interpretation (vs: Leonelli 2018)
    ▪ Big data makes debate over sampling redundant, as we have data about everything (vs: Leonelli 2014 in Big Data & Society)

• Forging tools for unregulated mass surveillance of human behavior at individual as well as community levels
• Producing unreliable knowledge that does not help to tackle urgent social challenges
• Expanding existing divides and silencing scientific traditions from low-resourced environments and ‘unfashionable’ topics
• Eroding scientific expertise and centuries-old methodological wisdom: ‘anything online goes’
• Eroding trust and credibility of science: exponential growth of opportunities for marketing “alternative facts”

• Recognise local and situated nature of data selection, sampling methods and data quality assessment → informing decisions about the scope of inferential claims (what they apply to)
• Promote effective, context-specific, sustainable data curation, and explicit criteria for data and meta-data inclusion and formatting
• Build accountability for choice and sources of data into data infrastructures and analytic tools
  • Track data histories and journeys through meta-data
  • Avoid irreversible data linkage
  • Strengthen links between digital data and research materials
• Build safeguards for social/ethical concerns to improve research methods: ethical data handling enables data linkage, supports information security, facilitates data mining
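
One way to picture "tracking data histories and journeys through meta-data" is a dataset record that carries its own handling history. The sketch below is only an illustration under stated assumptions: the class names, field names and example values are hypothetical and are not taken from the talk or from any existing metadata standard.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class JourneyStep:
    """One stop in a dataset's journey (all field names are illustrative)."""
    actor: str      # who handled the data
    action: str     # e.g. "collected", "annotated", "re-used"
    context: str    # site, instrument or purpose of this step
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


@dataclass
class Dataset:
    """A dataset that carries its own history as meta-data."""
    name: str
    source_material: str                      # pointer to the physical sample or collection
    journey: list = field(default_factory=list)

    def record(self, actor, action, context):
        """Append a step; earlier steps are never overwritten."""
        self.journey.append(JourneyStep(actor, action, context))
        return self

    def history(self):
        """Return the journey as readable lines, oldest first."""
        return [f"{s.action} by {s.actor} ({s.context})" for s in self.journey]


# Hypothetical example values, for illustration only.
ds = Dataset("leaf-phenotypes", source_material="seed stock A-17 (hypothetical)")
ds.record("field station", "collected", "field trial")
ds.record("database curator", "annotated", "ontology terms added")
print(ds.history())
```

Keeping `source_material` next to the journey reflects the point about strengthening links between digital data and research materials; a real infrastructure would need far richer, community-agreed meta-data.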

Special thanks to the hundreds of
scientists who participated in this
research, Michel Durinx for the figures,
the Data Studies group in Egenis and
the many colleagues in philosophy,
history and social studies of science
who provided crucial and generous
support and insights.
This research was supported by funding
from:
• European Research Council (grant
335925)
• Leverhulme Trust
• Economic and Social Research
Council
• Max Planck Institute for the History
of Science
• British Academy
• Australian Research Council
• University of Ghent

1. Context-specific curation is key to data re-use
2. Long-term maintenance is key to trustworthiness
3. Which data and why?
4. Data and materials
5. Role of ethics, humanities and social sciences in data management

• Data curation is essential to BOD re-use and interpretation, yet badly underestimated and not rewarded
• Value of long-standing research traditions and reviewing methods
• Crucial for data infrastructures to be user-friendly and receptive to user feedback
• Need case-by-case judgments on data quality and fruitful modes of data sharing

• Pluralism in methods and standards contributes robustness to data analysis, and reduces risk of losing system-specific knowledge
• Standards and formats are key
• Yet reliance on overly rigid standards creates exclusions and obliterates system-specific knowledge
• Data linkage methods are best when it is possible to disaggregate
• Interoperability is preferable to integration
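
The preference for linkage that can be disaggregated can be illustrated with a small sketch. It assumes two hypothetical sources that share an `id` key; the function names and sample values are invented for illustration and do not describe any method from the talk.

```python
def link(records_a, records_b, key):
    """Link two sources on a shared key without merging their fields.

    Each linked entry keeps the contributing records under their own
    source label, so the result can always be split back into its parts.
    """
    index_b = {r[key]: r for r in records_b}
    return [
        {"key": r[key], "source_a": r, "source_b": index_b[r[key]]}
        for r in records_a
        if r[key] in index_b
    ]


def disaggregate(linked):
    """Recover the original per-source records from linked entries."""
    return [e["source_a"] for e in linked], [e["source_b"] for e in linked]


# Hypothetical health and environmental records sharing an "id" key.
health = [{"id": 1, "bp": 120}, {"id": 2, "bp": 135}]
environment = [{"id": 1, "no2": 40}, {"id": 3, "no2": 22}]

linked = link(health, environment, "id")
a, b = disaggregate(linked)
```

Because the sources stay side by side rather than being flattened into one record, each can still be inspected, corrected or withdrawn on its own terms, which is roughly what distinguishes interoperability from integration here.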

• Regular updates across nested infrastructures
• Business plan for long-term sustainability
• For BOD, this means:
  • Clear relation between international field-specific databases, international clouds, national clouds, institutional repositories
  • Make sure each node is resilient and system is not crippled by individual node failure (now all independently funded, typically in the short term)

• Particularly important since hard to guarantee data quality
• Re-use often linked to participation in developing data infrastructures
  • rarely the case for busy practitioners, considering also gap in skills
• Role of confidence assessments on data quality and reliability (again: expert curation is key)

• Indiscriminate calls for open data can lead to serendipity in what data are circulated and when
  • Need explicit rationale around priority given to specific data types and sources (e.g. ‘omics’ in biology)
• Substantive disagreements over data management:
  • Methods, terminologies, standards involved in data production and interpretation
  • To be useable, data are handled by several individuals with diverse expertise: the evidential value and representational power attributed to a given dataset can vary

• Criteria for what counts as good data – even as data altogether – vary dramatically even within the same field
• Data producers, curators, users make choices about what constitutes data at each stage of their journeys
• Data as relational: what counts as data varies in relation to research situations [Leonelli 2015, 2016, 2019]
  • Any object can be considered as a datum as long as (1) it is treated as potential evidence for one or more claims about phenomena, and (2) it is possible to circulate it among individuals/groups

Inference as a process of situating data in relation to elements of relevance to interpretation (materials, instruments, interests, norms)
• developing a context for inquiry that aligns one’s purpose with existing theoretical commitments and selected properties of data and target system
• E.g. Databases as enabling comparison among ways to organise data
• “All inductive inference is local” (Norton 2003)

• What makes for ‘good’ inference? Triangulation of different data sources often cited by BOD advocates
• My view: triangulation is necessary, but not sufficient
  • Difficulties in accounting for partiality in data sources (Leonelli 2014, 2016, 2018)
  • Efforts to maintain continuity and commensurability when re-assessing same dataset with different methods across time (Wylie 2017, Leonelli 2018)

• Building on Alison Wylie’s conditions for robust evidential reasoning..
(1) security, (2) causal anchoring and causal independence, (3) conceptual independence, (4) grounds for calibration and (5) addressing divergence [Chapman and Wylie 2016]
• .. And adding:
(6) diversity of sources and subsequent handling methods (make data journeys trackable)
(7) explicit valuing criteria for data production, dissemination and reuse (debate which data get to travel; ethics and social concerns are key to data re-usability and security)
(8) material anchoring where possible (link digital databases and sample collections)
(9) critical use of standards (balance standardization with local solutions to preserve system-specific methods)

• Choice of meta-data and link to research materials
  • Reliable stock centers and collections: rarely available & coordinated with databases
  • E.g. model organism stock centres, biobanks

• Ethical, social and security concerns increase quality and re-usability of data/infrastructures
• Data re-use requires well-informed, sustainable, inclusive, participative development and use of data infrastructures
• Related skills are as central to big data use as computational skills
• Data management training requires input from all fields, esp. social science and humanities

[Figure labels: Interactions with the world; Knowledge; Models representing the world; Data; Objects]
