1) The document discusses the challenges and opportunities of analyzing large datasets known as "Big Data" from a social science perspective.
2) It defines Big Data and explores how the approach could undermine traditional research methods but also presents new opportunities.
3) The key to effectively studying Big Data is developing a strong understanding of the data, collaborating across disciplines, and using mixed quantitative and qualitative methods to provide context and identify meaningful relationships for further study.
1. Drinking from the fire hose?
The pitfalls & potential of Big Data
Josh Cowls, Oxford Internet Institute
with contributions from Eric Meyer, Ralph Schroeder and
Linnet Taylor
t2i Lab, Chalmers, 27th March 2014
3. The Oxford Internet Institute
• Department of the University of Oxford
• MO: ‘Understanding life online’
• Multi-disciplinary mix (social sciences plus physical and medical sciences,
and humanities)
• 45 researchers (and growing)
• 50 students (MSc Social Science of Internet; PhD programme)
• Generating big data on social, political and economic behaviour from
social media
www.oii.ox.ac.uk
4. Accessing and Using Big Data to Advance Social
Science Knowledge
• Funded by the Alfred P. Sloan Foundation
• 2012 – 2014
• Data sources:
• 120 interviews, mainly with social scientists but some
interviewees from business, government
• Reports, workshops, publications
• No representative sample, but some patterns of
disciplinary and skills background and career trajectory
NB where unattributed, quotes used in this presentation are excerpted from
interviews conducted as part of this project.
5. Big Data: our definition
Big data are data that are
unprecedented in scale and scope in
relation to a given phenomenon.
They are often streams of data (rather than fixed
datasets), accumulating large volumes, often at high
velocity.
6. Big Data: other definitions
• ‘Transactional’ (Margetts et al)
• ‘Things that one can do at a large scale that
cannot be done at a smaller one’ (Mayer-Schönberger
and Cukier)
• The ‘3 Vs’: volume, velocity, variety – but also
veracity, visualisability, viscosity? (Gartner)
7. ... what Big Data isn’t
• A generalisable, quantifiable ‘amount’ of data
• A race to the top (Mutually Assured Distraction)
• The same for every discipline, field or sector
8. A ‘working’ definition
• The Big Data phenomenon might be less about
what the dataset is and more about how we
work with it
• (Even if this is indistinguishable in practice)
9. Shifts in mindset
From Mayer-Schönberger and Cukier:
• “The ability to analyse vast amounts of data
about a topic rather than be forced to settle for
smaller sets”
• “A willingness to embrace data’s real-world
messiness rather than privilege exactitude”
• “A growing respect for correlations rather than a
continuing quest for elusive causality”
11. Implications for research
Whither the sample?
“sampling is like an analog photographic print. It looks good
from a distance, but as you stare closer, zooming in on a
particular detail, it gets blurry ... Often, the really interesting
things in life are found in places that samples fail to fully
catch”
Mayer-Schönberger and Cukier 2012
12. Implications for research
More or mess?
“social media is really, really fascinating, and the reason is
because it ... falls into this category of there’s something
there but we don’t know what it is. So you can measure
public opinion on Twitter and clearly that’s indicative of
something, but we don’t quite know what it’s representative
of”
Brandon Stewart, Harvard University Department of
Government
13. Implications for research
More or mess?
“the problem with the hashtag stuff [is that] we have
wonderful case studies but we don’t know what they sit in
essentially, what the framework is, if that’s 1% or 10% or
100% of the current conversation in Australia or whatever”
Axel Bruns, Queensland University of Technology
14. Implications for research
More or mess?
“the big problem that we haven’t cracked is that if
someone tweets a sentiment it’s not necessarily what
they’re feeling, it can be for a variety of reasons, so it doesn’t
really reflect what they feel necessarily”
Mike Thelwall, University of Wolverhampton
15. Implications for research
Do we care about causes?
“Big Data is all about correlation; it’s not about causation,
which means that you don’t need to have a theory
beforehand. You just start looking for correlation … so you
don’t have any idea about the structure of the data, you just
find a funny correlation.”
Sara Esposti, Open University Business School
16. Implications for research
Do we care about causes?
“a central concern of social science is, we don’t just want to
find statistical associations, we actually want to uncover the
underlying causal processes by which social systems work ...
The data themselves don’t tell you about cause and effect,
there’s actually a very complex often, complex inferential
process you have to go through in order to extract from the
data the things that you really care about”
David Jensen, University of Massachusetts
17. Implications for research
Do we care about causes?
“I’ve been talking to some computer scientists who are
rising stars, they’re really doing well, and they acknowledge
that the way in which the field works, novelty is the key
issue. And so there’s always an incentive or a pressure to
keep on doing new stuff with new data, even though they
might have wanted to go into more depth into something.”
Sandra Gonzalez-Bailon, Annenberg School of
Communication, University of Pennsylvania
18. The challenge
How can we extract meaning from Big Data – learn
to drink from the fire hose?
19. Drinking from the fire hose
• Understanding the data
• Collaborating
• Mixing methods
20. Drinking from the fire hose: understanding the data
The rise of the information society has given us
myriad new forms of data and accompanying ways
of analysing it.
The challenging part is abstracting meaning about
society in general from data created and harvested
online.
21. Drinking from the fire hose: understanding the data
Example: it’s hard to predict elections using Twitter
“[Of] 14 different attempts to predict elections
based on Twitter data ... Only half of them were
successful ... All of this looks close to mere chance”
Gayo-Avello 2012
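To put "close to mere chance" into numbers, here is a rough sketch that treats each prediction attempt as an independent coin flip. Both the coin-flip model (success probability 0.5) and the reading of "half" as 7 of 14 are simplifying assumptions made for illustration, not figures taken from Gayo-Avello; `prob_at_least` is a hypothetical helper name.

```python
from math import comb


def prob_at_least(successes: int, trials: int, p: float = 0.5) -> float:
    """Probability of at least `successes` hits in `trials` independent tries."""
    return sum(
        comb(trials, k) * p**k * (1 - p) ** (trials - k)
        for k in range(successes, trials + 1)
    )


# Assumption: "half of 14" means 7 correct calls, each no better than a 50/50 guess.
chance = prob_at_least(7, 14)
print(f"P(at least 7 of 14 correct by guessing) = {chance:.2f}")
```

Under these assumptions, pure guessing would do at least this well about 60% of the time, so a 7-in-14 record gives little evidence of real predictive power.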
22. Drinking from the fire hose: understanding the data
Example: Facebook isn’t going anywhere, and
neither is Princeton
Cannarella and Spechler 2014; Develin 2014
23. Drinking from the fire hose: understanding the data
But it’s much simpler, conceptually speaking, to
analyse online phenomena on their own terms
Yasseri, Hale & Margetts 2013
24. Drinking from the fire hose: understanding the data
But it’s much simpler, conceptually speaking, to
analyse online phenomena on their own terms
Hale, Yasseri, Cowls, Meyer,
Schroeder & Margetts (submitted)
25. Drinking from the fire hose: understanding the data
Of course, online data can still provide insights into
offline life, but these must be well-grounded.
e.g. Seth Stephens-Davidowitz, ‘The Cost of Racial
Animus on a Black Candidate: Evidence Using
Google Data’
• Google accounts for >50% of search engine market (less
concern over representativeness)
• Google searches are private and anonymous (less
concern over reliability)
• This method uncovers a social phenomenon, racism,
which would be harder to detect in pre-Internet
approaches e.g. interviews or surveys
26. Drinking from the fire hose: understanding the data
Beware false prophets
XKCD
27. Drinking from the fire hose: understanding the data
Beware false prophets: analyses using thousands of
variables can generate millions or billions of
possible relationships – not all (or most) will be valid
or meaningful
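The arithmetic behind "millions or billions" can be sketched by counting how many pairs and triples of variables a dataset offers. The figure of 2,000 variables and the 5% significance threshold are illustrative assumptions, and the function names are hypothetical; the false-positive estimate also assumes the tests behave as if no real relationship exists.

```python
from math import comb


def candidate_relationships(n_variables: int, order: int = 2) -> int:
    """Number of distinct groups of `order` variables that could be compared."""
    return comb(n_variables, order)


def expected_false_positives(n_tests: int, alpha: float = 0.05) -> float:
    """Expected number of 'significant' results if no real relationship exists."""
    return n_tests * alpha


n = 2000  # assumption: an illustrative dataset with 2,000 variables
pairs = candidate_relationships(n)
triples = candidate_relationships(n, order=3)

print(f"pairwise relationships:  {pairs:,}")
print(f"three-way relationships: {triples:,}")
print(f"false positives expected at alpha=0.05: {expected_false_positives(pairs):,.0f}")
```

With these inputs there are roughly two million pairs and over a billion triples, and testing every pair at the 5% level would be expected to flag about 100,000 relationships by chance alone.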
28. Drinking from the fire hose: understanding the data
Beware false prophets
“if you look at the data long enough you’ll find predictive
signals that are in fact completely spurious...for about, I think
a 20 or 25 year period, the US stock market was perfectly
correlated with the level of butter production in Bangladesh
… if you look at hundreds and hundreds of these indicators,
whether it’s the level of Bangladesh butter production or the
number of cars in New York City or whatever it is, eventually
you'll find something that just by pure chance matches what
you're looking for.”
Mike Cafarella, University of Michigan
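The effect described in the quote can be reproduced with purely random data. The sketch below is a toy simulation, not an analysis of any real stock-market or production series: it builds one random "target" series and then scans a growing pool of unrelated random series for the best match. The 25-point series length, the random-walk model and all function names are assumptions made for illustration.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def random_walk(rng, n_points):
    """A series whose steps are independent Gaussian noise."""
    level, walk = 0.0, []
    for _ in range(n_points):
        level += rng.gauss(0, 1)
        walk.append(level)
    return walk


def best_chance_correlation(n_candidates, n_points=25, seed=0):
    """Strongest |correlation| between a random target and unrelated candidates."""
    rng = random.Random(seed)
    target = random_walk(rng, n_points)
    return max(
        abs(pearson(target, random_walk(rng, n_points)))
        for _ in range(n_candidates)
    )


# The more unrelated indicators we scan, the better the best accidental match.
for n_candidates in (1, 10, 100, 1000):
    print(n_candidates, round(best_chance_correlation(n_candidates), 3))
```

Because the seed is fixed, each larger pool contains the smaller one, so the best accidental match can only stay the same or improve as more indicators are examined, even though none of the series has anything to do with the target.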
29. Drinking from the fire hose: collaborating
Big data research often necessitates a wide variety
of skills and perspectives. Team-based academic
research has been growing for decades:
30. Drinking from the fire hose: collaborating
This trend is likely to persist as big data research
becomes more common
“the best research will often merge in collaboration
between computer scientists who will have access to the
tools and the background to further develop and apply
those, and with social scientists who will have, sort of, good
pressing social questions that we can get insight into with
the data that is now available.”
Scott Hale, Oxford Internet Institute
31. Drinking from the fire hose: collaborating
This trend is likely to persist as big data research
becomes more common
“I can find someone to optimise an algorithm, I can pay
someone to build a website but what I want is someone that
is going to be thinking the human side through every step of
the way, and when you build an algorithm and when you
write a line of code you ask, does this make sense in terms of
the phenomena that I am trying to model or trying to
interpret.”
Josh Introne, Michigan State University
32. Drinking from the fire hose: mixing methods
While Big Data is necessarily quantitative, it can be
used in conjunction with other methods.
“For me, I think if I only look at the numbers I don’t get the
whole picture … if we look at, for example, Twitter data, you
can see some tendencies, but if you want to answer the right
question then I think it’s necessary to do more qualitative
studies … So I’m doing interviews with political parties, I’m
also doing interviews with journalists, in order to talk about
how they are using social media as journalistic tools.”
Bente Kalsnes, University of Oslo
33. Drinking from the fire hose: mixing methods
This means correlations can point the way for
deeper causal explanatory research.
“So you start off with the patterns and then what you
should be doing is saying ‘Well, here’s some possible
reasons’, and then when you’ve found some relationships
which really deserve more study then you would go off and
do a more detailed qualitative assessment as to whether this
was true or not.”
Richard Webber, King’s College London
34. Conclusion: learning to drink from the fire hose
The major question around Big Data is less about what the
data looks like and more about what we do with it.
The Big Data approach seems to challenge basic tenets
of academic research, undermining precision, validity
and explanatory power.
However, with a greater understanding of the nature
of data, a collaborative approach and a willingness to
employ multiple methods, we’ll be better equipped to
drink from the Big Data fire hose.