Call Girl Number in Khar Mumbai📲 9892124323 💞 Full Night Enjoy
Big data and the question of objectivity
1. Big Data and
the Question of Objectivity
Federica Russoa & Jean-Christophe Plantinb
aUniversity of Amsterdam | @federicarusso
bLondon School of Economics and Political Science | @JCPlantin
2. Overview
The social sciences go big
Quantitative social science
A big question: what conception(s) of objectivity in big-data social science practices?
What practices?
What objectivity?
Why is this relevant?
Conceptual issues
Practical implications
2
4. Big and quantitative
The ‘first’ big data revolution in social science:
Positivism and the birth of quantitative social science
Possibility of analysing more data, using the tools of statistics
Going quantitative helped the social science reach the ‘realm of the sciences’
And yet questions related to the objectivity of the social sciences didn’t settle
4
5. Big and problematic
Current debates around big data and scholarship
Borgman (2015):
Don’t conflate “ease of acquisition [of data] for ease of analysis”
Need theoretical as well as methodological framework
A choir of old and new data-philosophers (e.g. Sellars, Floridi, Leonelli …)
Data are not given
Data, information, knowledge are not all the same
Data are relational 5
6. Lots of questions already asked
How big / fast is ‘big’?
How much theory in big automated algorithms?
What kind of reasoning? Inductive?
What implications does the ‘big’ have at social / technical / scholarly level?
...
6
7. Our investigation
The question:
What exactly do data curators want to achieve with big-data practices?
A two-step answer:
1. Analysis of big-data practices in social science
2. Problematisation of 2 aspects of big-data practices:
a. Making the data curator visible / invisible [(in)visibility]
b. Standardisation of processes for data curation [standardisation]
In a nutshell: [a-b] force us to re-think the notion of objectivity
7
10. “Taylorism” in the data archive
“We're more of an assembly line, and so it's
production type of work” Paul, Archive Manager
Employment conditions that characterize “invisible technicians” in science
(Shapin, 1989; Barley, Bechky, 1994; Star, Strauss, 1999)
–Strict division of roles
–Rhythm of work
–No skills development
–Short term employment and turn over
–Highly standardized work, routine
–Invisible contribution
11. Making data ‘pristine’
“We want [the datasets] to be right, and everything
to read properly […] Trying to get that, so that the
future users when they get [the datasets], they get
everything in a pristine manner.” Paul, Archive Manager
12. Data processing and invisible labor
• Complete invisibility outside the archive
– No critique allowed of the datasets: “Don’t get carried away”
– Contacting the PI only as last resort
– Strict formatting for standardized output
• Complete visibility inside the archive
– Making all processing techniques explicit
– Processing history file + Quality check
– Homogenization of practices
13. Interrogating ‘pristineness’
• Cleaning data twice: traces of original context + traces of
cleaning
• Reproduces erroneous conception of ‘raw data’ (Gitelman,
2013)
• Conceals contributions of data processors: protocol work
(Downey, 2014) data packaging (Leonelli, 2016)
16. The data curator must be invisible from the outside
Data users don’t know / need to know about the process
Focus on ‘end product’ (rather than process)
➔Data are objectively clean, ready to (re)use
● No (interfering) curator behind data curation
● Objectivity is a property of data, not of the process
➔An old ideal of objectivity: objects’ objectivity
◆ Kitcher: the Legend View of Neopositivism
16
17. The data curator must be visible from the inside
At any time in the data curation process, who the data curator is and what s/he does must
be visible, traceable, transparent
➔The process is objective as long as procedures are respected
The curator is present at all times
Objectivity lies in the procedure
➔A more recent idea of procedural objectivity
◆ Montuschi, Little, Cardano, … :
● the social sciences can attain objectivity;
● objectivity is in the process, not in the object of science 17
18. Procedural objectivity: pulling in opposite directions?
A good tool to have in the kit
• Liberate social sciences from inferiority
complex
• Can value role of data curators
• Helps understand where the process can
go wrong
• Increases objectivity of the ‘end product’
by self-reflectively work on process
...
A ‘procedural drift’ towards obsessive
standardisation?
• Can / should we be flexible about
procedure
• If so, do we lose or gain on objectivity?
• Is objectivity just a matter of procedure?
• What role is left to the data curator then?
...
What else does ‘obsessive procedural
objectivity’ presuppose? 18
19. Strong procedures and data pristineness
Much of [(in)visibility] and of [standardisation] rest on
the myth of raw data and of clean data
Pristineness: data are cleaned twice (original PI and of traces of cleaning)
Here we sing with the choir of data-philosophers
No, data is not raw or clean
No, you can’t just assume their cleanness abstracting from curation procedures
No, maybe they shouldn’t be cleaned up so much after all
Yes, perhaps the social science need somewhat dirty data
19
21. Social science practices go big
The social sciences grew big already since Positivism
Introduction and development of quantitative methods
Demography and sociology; understanding and acting on social phenomena
In the era of big data, they grow even bigger
More data: social media provide tons
More practices: data curation and automated data analyses
21
22. Big-data practices strive for objectivity
Two relevant aspects of these practices:
[(in)visbility] and [standardisation]
Two notions of objectivity at play:
[invisible] curators: the objects are objective
[visible] curators: the procedures are objective
[standardisation] of procedures: procedures are objectives … too objective?
22
23. Relevance of the discussion
An interesting ‘philosophy of science in practice’ question
From the practice, bottom up crucial philosophical issues
Objectivity, an evergreen of phil sci. But what new is at stake with big data?
Beyond scholarly questions
Open data and open science
Can we abstract from the alleged objectivity of these practices?
When are data objective enough to be safely re-used?
If [standardisation] doesn’t ensure it, what does?
Should we strive for that kind of objectivity? 23
Editor's Notes
Briefly on our previous collaboration
Why this one. Also a question of philosophical methodology: phil sci *in practice* >> JC provides a lot of the ‘in practice’
Contextualisation: birth of quantitative social science. Going quantitative was a way of going big. And of making soc sci a *science*
Focus on question of objectivity in big data practices in soc sci
Describe the practices
Problematise notion(s) of objectivity
Why bother
Certainly intriguing and interesting conceptual issues
But also very important practical implications
A bod claim, sure. But helpful to contextualise. Big social science data in the context of longer term development of the discipline
Look at birth of quantitative social science: why going quantitative
To be objective, a science like physics (e.g. Quetelet); reach the ‘positive’ stage (Comte)
To be able to analyse more data and compare (e.g. Quetelet, Durkheim)
To provide a solid (objective!) basis for decision (see e.g. historical reconstructions of Porter)
A long standing debate on status of soc science. Does it reach the realm of the sciences? Can they establish laws? Does the mathematization ensure reliable predictions? …
Many of these questions aren’t settled in phil soc sci debates, especially objectivity [will come back to that]
There is already a flourishing literature on big data. Scholar in science studies got there first, sure. Philosophers are on the way.
Two points that are of relevance to our discussion -- form the background / lend support to later claims about objectivity
Comment on Borgman’s point made in ‘Big data little data no data’
Mention that some of our later points are part of a large choir of data philosophers
Comment briefly on all these questions. Acknowledge slightly different interests of scholars from various fields
Spell out the question. Some elements:
Talk about ‘practices’. So important to describe them. In fact, kind of dissect the process of data curation
Emphasise: not the end product, but the process is of interest to us
In the process, we also pay attention to the curators, so the subjects
Anticipate briefly the two points that will matter for us. Caveat about terminology. As for suggestion about ‘labelling’.
How I encounter this notion of objectivity in practice -- what elements can fieldwork bring to the understanding of such concept
Major US Data archive
Was created due to rise of SS survey data – data archive in the UK
Forster archive and secondary analysis of existing datasets
Since inception, problem of unequal quality and idiosyncrasies of data
Having a group of dedicated workers to clean the data!
Particiatory ethnography of 6 month there
-look at the work processes (pipeline) to prepare this data -- with the idea that through this fieldwork we have access to values and practices of what “good data” is for reuse and data sharing -- entrance point to how the notion of objectivity exists in this social world.
-called the pipeline of datasets –
This is in a nutshell the data processing pipeline, the way the ICPSR ‘takes care’ of the data that are deposited in their archive, mostly survey data:
-each dedicated processors inside the archive, goes through all this pipeline, from deposit of the dataset by a PI (1) to the publication on the ICPSR website for dowload by future reusers (7):
-what is esp important here are the steps 3 and 5:
-repair: the data processors uses a series of software and manual labor to review the datasets, identify the missing values, wild codes, problems wth the codebook, and fix them. The ICPSE uses the comparison with « restoring an old book or a painting »
-prepare: make sure the dataset fits with the standards of the ICPSR, in terms of how the dataset is presented, and if all the documents are present (, codebook, etc), all the metadata following the ICPSR catalogue
-Factory metaphor: the pileline – a work described to me by the head of the unit at « assembly line »
QUOTE
-no control over rhythm of submission // tight schedule: they have to process it very quickly
-BA social science required, but no time to develop skills
-short term employment, high turn over, no career prospect
-highly standardized, no agency, routine, boring
-contribution is made invisible
What is the goal of this division of labor == making data “pristine”
Goal is to get data pristine --
How is this pristine-ness achieved? Through a process of invisibilisation of the data processors, that is double:
Work procedures aim to make the work of data processing completely invisible outside // completely visible inside.
-pristineness: inaccurate (the agency of processors cannot be completely erased), but also it reproduces idea of “raw data” -- that there is such a thing that data that could be made raw to be reused
-it negates the very specificity and richness of data archives, that is, to have dedicated employees that are paid to enhance the data that they receive before reuse -- protocol work, data packaging
As long as the institution commit to such conception of data, it will continue to conceal all the care provided by its processors, and will hereby keep them invisible instead of recognizing their essential contribution to the circulation of data in social science.
Preliminary impressions, to be further substantiate
Kitcher’s book: The Advancements of Science
Mention especially Montuschi:
Discussion of objectivity of:
The ‘objects’ (out there vs constructed),
The descriptions thereof (neutral vs value-laden)
The methods (quantitative / experimental vs qualitative)
Here problematise the fact that having clean data is an objective to pursue.
Instill the idea that perhaps there a degree of dirtiness that has to remain in social science. How much important is the context??
So in the end try to ask the question whether it *desirable* to have so much cleaned data?
So far, discussion at conceptual level, based on analyses of the practices