‘“Big data” is high-volume, -velocity and -variety
information assets that demand cost-effective,
innovative forms of information processing for
enhanced insight and decision making’ (Gartner in
A huge industry is now built around ‘social data’ and
‘listening platforms’ feeding on this data. (Many
tools are black boxes and not suitable for academic use.)
• Technology: maximizing computation power and
algorithmic accuracy to gather, analyze, link, and
compare large data sets.
• Analysis: drawing on large data sets to identify
patterns in order to make economic, social,
technical, and legal claims.
• Mythology: the widespread belief that large data
sets offer a higher form of intelligence and
knowledge that can generate insights that were
previously impossible, with the aura of truth,
objectivity, and accuracy.
(boyd and Crawford, 2012, p. 663).
Critiques of Big Data
• Important to make visible inherent claims about
• Problematic focus on quantitative methods
• How can data answer questions it was not
designed to answer?
• How can the right questions be asked?
• Inherent biases in large linked error prone
• Focus on text and numbers that can be mined
• Data fundamentalism
The notion that correlation always indicates causation,
and that massive data sets and predictive analytics
always reflect ‘objective truth’. The idea of, and belief in, the
existence of an objective ‘truth’, and that something can be
fully understood from a single perspective, again brings
to light tensions about how the social world can be made
How do we ground online data?
In the offline: assessing findings against what we
know about an offline population (census data) in
order to better understand online data. Problems
with over/under representation in online data?
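One way to make the over/under representation question concrete is to compare each group's share of the online sample with its share of the offline population. The sketch below is illustrative only: the function name, the age bands and all of the numbers are assumptions made up for this example, not real census figures.

```python
def representation_ratios(online_counts, census_shares):
    """Return online share divided by census share for each census group.

    A ratio above 1 suggests the group is over-represented online,
    below 1 under-represented. Groups that appear only in the online
    sample (and not in the census table) are ignored.
    """
    total = sum(online_counts.values())
    if total == 0:
        raise ValueError("online sample is empty")
    ratios = {}
    for group, census_share in census_shares.items():
        if census_share <= 0:
            continue  # cannot compare against an empty census group
        online_share = online_counts.get(group, 0) / total
        ratios[group] = online_share / census_share
    return ratios


# Made-up numbers, purely for illustration.
online_counts = {"18-34": 600, "35-54": 300, "55+": 100}
census_shares = {"18-34": 0.30, "35-54": 0.35, "55+": 0.35}
ratios = representation_ratios(online_counts, census_shares)
```

A ratio only says how far the sample departs from the census on one variable; it does not say why, and it cannot correct for groups that are missing from the online data altogether.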
In the online: premised on the idea that data
derived from social media should be grounded in
other online data in order to understand it. So
comparing Facebook use to what we know about
Facebook use, rather than connecting it to offline
measurements about citizens (Richard Rogers).
1. Asking the right question – research should
be question driven rather than data driven.
2. Accept poor data quality & users gaming
metrics – once online metrics have value users
will try to game them.
3. Limitations of tools (often built in
4. Transparency – researchers should be upfront
about limitations of research and research
design. Can the data answer the questions?
A critical reflection on Big Data:
considering APIs, researchers and
tools as data makers
Rather than assuming data already exists ‘out there’, waiting to
simply be recovered and turned into findings, the article
examines how data is co-produced through dynamic research
intersections. A particular focus is the intersections between the
Application Programming Interface (API), the researcher
collecting the data as well as the tools used to process it. In light
of this, the article offers three new ways to define and think
about Big Data and proposes a series of practical suggestions for
(First Monday, October 2013, http://firstmonday.org/)
Standard API sampling problems
Sampling from the FIREHOSE
1% random sample of the firehose
If not rate limited – all data collected?
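The question of whether all data was collected can sometimes be checked from the stream itself. The sketch below assumes a collection saved as one JSON message per line, that tweets carry a "text" key, and that rate-limit notices have the shape {"limit": {"track": N}} with N a running total of undelivered tweets since the connection opened. That message shape is an assumption here and should be checked against the API documentation for the collection in question.

```python
import json


def estimate_completeness(raw_lines):
    """Estimate what share of matching tweets a stream delivered.

    Assumes limit notices report a running total of missed tweets,
    so the largest value seen is used rather than a sum.
    """
    delivered = 0
    max_missed = 0
    for line in raw_lines:
        line = line.strip()
        if not line:
            continue  # skip blank keep-alive lines
        message = json.loads(line)
        if "limit" in message:
            max_missed = max(max_missed, message["limit"].get("track", 0))
        elif "text" in message:
            delivered += 1
    total = delivered + max_missed
    share = delivered / total if total else None
    return {"delivered": delivered, "missed": max_missed, "share": share}


# Tiny made-up stream: three tweets and one limit notice.
sample_stream = [
    '{"text": "first tweet"}',
    '',
    '{"limit": {"track": 1}}',
    '{"text": "second tweet"}',
    '{"text": "third tweet"}',
]
report = estimate_completeness(sample_stream)
```

If no limit notices were stored with the data, this check cannot be run after the fact, which is itself a limitation worth recording.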
New API sampling problems
New business models: enriched metadata
Social media vs social data
Datasift, GNIP and Topsy
Social media vs social data
• Social Media: User-generated content where one user
communicates and expresses themselves and that content
is delivered to other users. Examples of this are platforms
such as Twitter, Facebook, YouTube, Tumblr and Disqus.
Social media is delivered in a great user experience, and is
focused on sharing and content discovery. Social media
also offers both public and private experiences with the
ability to share messages privately.
• Social Data: Expresses social media in a computer-readable
format (e.g. JSON) and shares metadata about the content
to help provide not only content, but context. Metadata
often includes information about location, engagement and
links shared. Unlike social media, social data is focused
strictly on publicly shared experiences. (Cairns, 2013)
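To illustrate the distinction, the sketch below takes one record in a computer-readable format (JSON) and separates the content from the metadata that provides context. The record and all of its field names are hypothetical and do not reproduce any provider's actual schema.

```python
import json

# Hypothetical record; field names are invented for illustration.
record_json = """
{
  "text": "Flooding on the high street",
  "user": {"screen_name": "example_user", "location": "Sheffield"},
  "geo": null,
  "retweet_count": 3,
  "links": ["http://example.org/report"]
}
"""


def split_content_and_context(raw):
    """Return (content, context): the message text and everything else."""
    record = json.loads(raw)
    content = record.get("text", "")
    context = {key: value for key, value in record.items() if key != "text"}
    return content, context


content, context = split_content_and_context(record_json)
```

Everything in `context` was added or structured by a platform or reseller rather than written by the user, which is why the way metadata is constructed matters for research.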
Gold standard geo data
Problem: only 1% of users
-> Only 2% of firehose tweets
Early adopters, highly skewed
Where in the world are you?
No Lat/long coordinates
Text field – enter anything
Advantage: more than half of all
tweets contain profile location
Much more evenly distributed
Profile Geo Enrichment
‘our customers can now hear from the whole world of Twitter users
and not just 1%’ (Cairns, 2013, on the Gnip company blog)
• Activity Location – 1% that provide lat/long
• Profile Location – Place provided in their profile. May or may not
be posting from there.
• Mentioned Location – Places a user talks about
‘Both the tweet text and Profile fields contain geographic information,
but not in substantial quantities and have poor accuracy’ (Leetaru et
al, First Monday, May 2013)
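The three kinds of location can be kept apart in code so that they are never silently merged. The sketch below is a minimal illustration: the record fields ("coordinates", "profile_location", "text") and the small gazetteer are hypothetical, and the place matching is a naive substring check rather than real geocoding.

```python
def location_sources(record, gazetteer):
    """Report which of the three location types a record offers.

    activity  - lat/long attached to the post itself
    profile   - free-text place from the user's profile
    mentioned - gazetteer places named in the post text
    """
    sources = {}
    coords = record.get("coordinates")
    if coords:
        sources["activity"] = tuple(coords)
    profile = (record.get("profile_location") or "").strip()
    if profile:
        sources["profile"] = profile
    text = record.get("text", "").lower()
    mentioned = sorted(place for place in gazetteer if place.lower() in text)
    if mentioned:
        sources["mentioned"] = mentioned
    return sources


# Hypothetical record with no lat/long, a profile place and a mentioned place.
example = {
    "coordinates": None,
    "profile_location": "  London ",
    "text": "Heading to Manchester tomorrow",
}
found = location_sources(example, {"Manchester", "Leeds"})
```

Here the user's profile says London while the text mentions Manchester, and neither tells us where the post was actually sent from.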
Problem with deleted tweets
‘A deleted tweet effectively disappears from the results of searching
Twitter, although a short delay sometimes occurs between deletion
and disappearance. A status deletion notice is distributed via the
Twitter streaming API to relevant users’ clients so that they, in turn,
remove deleted tweets from their records.’
‘Twitter does not provide a bulk-deletion of user’s tweets. It provides,
however, a one-click bulk-deletion of all location data that were
attached to user’s tweets, without deleting the tweets. By clicking on
the “Delete all location information” button on user’s account settings
page, all locations attached to all previous tweets are deleted.’
(Almuhimedi et al, 2013)
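For a locally stored collection, honouring deletion notices might look like the sketch below. It assumes the archive is a dictionary keyed by tweet ID string and that a deletion notice has the shape {"delete": {"status": {"id_str": ...}}}; both are assumptions for illustration and should be checked against the streaming API documentation.

```python
def apply_deletions(archive, messages):
    """Remove tweets named in deletion notices from a local archive.

    archive:  dict mapping tweet id string -> stored tweet
    messages: iterable of decoded stream messages (dicts)
    Returns the list of ids that were actually removed.
    """
    removed = []
    for message in messages:
        status = message.get("delete", {}).get("status")
        if not status:
            continue  # not a deletion notice
        tweet_id = status.get("id_str")
        if tweet_id in archive:
            del archive[tweet_id]
            removed.append(tweet_id)
    return removed


# Made-up archive and messages.
archive = {"101": {"text": "regret this"}, "102": {"text": "still here"}}
messages = [
    {"delete": {"status": {"id_str": "101", "user_id_str": "7"}}},
    {"text": "an ordinary tweet"},
    {"delete": {"status": {"id_str": "999", "user_id_str": "8"}}},
]
removed = apply_deletions(archive, messages)
```

Notices only reach a collector while it is connected, so tweets deleted after collection ends stay in the dataset unless they are checked again later.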
Profile Geo Enrichment
‘Profile location data can be used to unlock demographic data
and other information that is not otherwise possible with
activity location. For instance, US Census Bureau statistics are
aggregated at the locality level and can provide basic stats like
household income. Profile location is also a strong indicator of
activity location when one isn’t provided.’ (Cairns, 2013)
Fake followers: Mitt Romney’s 100,000 extra followers in one day
As many as 20 million fake follower accounts (200 million active users)
This doesn’t take into account the issue of spoof accounts (clearly in
evidence in riot tweets) (Perlroth, 2013)
Ability to describe the limitations of our data:
- APIs as data makers. Once data is linked it is very
hard to untangle how metadata is constructed
and where problems might be, including in terms
of deleted content.
- Researchers and tools as data makers
- When creating a dataset it is important to
describe how it was made and what the
limitations are. What are the sampling limitations
(both in terms of the API and in relation to the
offline ‘population’)? What other limitations re:
enriched metadata need to be described?
- When creating a dataset how complete is it?
- Limitations need to be known in order to
describe them. This is a real problem.
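One practical way to act on these points is to store a short, structured description alongside every dataset. The sketch below is only a suggestion: the class name, the field names and the example values are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetDescription:
    """A minimal record of how a dataset was made and what limits it."""
    source_api: str
    collected_with: str
    query: str
    sampling_notes: str
    known_limitations: list = field(default_factory=list)
    unverified: list = field(default_factory=list)

    def summary(self):
        """Return a plain-text description to publish with the data."""
        lines = [
            f"Source API: {self.source_api}",
            f"Collected with: {self.collected_with}",
            f"Query: {self.query}",
            f"Sampling: {self.sampling_notes}",
        ]
        lines += [f"Known limitation: {item}" for item in self.known_limitations]
        lines += [f"Not yet verified: {item}" for item in self.unverified]
        return "\n".join(lines)


# Hypothetical example values.
description = DatasetDescription(
    source_api="streaming API",
    collected_with="in-house collection script",
    query="keyword filter on one hashtag",
    sampling_notes="possibly capped at a 1% sample during peaks",
    known_limitations=["deleted tweets not removed after collection"],
    unverified=["how enriched location metadata was derived"],
)
text = description.summary()
```

The `unverified` list is there because some limitations cannot yet be described; recording that they are unknown is still more transparent than omitting them.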
Tools as data makers
In answering complex questions about social media
data, we need to:
1. Know the questions! And know how they might be
2. Problem with tools: not question driven. Often
developed around available (poor quality) data, often
by non social media experts, but those with data
3. Tools therefore become data-makers in that they limit
the scope of possibility in the questions researchers
imagine. This is a huge problem!
Need better understanding of complex
ever changing dynamics between
Organic data / data in the wild
SOCIAL MEDIA SIMPLY AS (BIG)
SOCIAL MEDIA AS A RESEARCH
US 65% smartphone penetration
Smartphones have overtaken desktops for accessing the internet
Mobile internet accounts for the majority of internet use in the US (57%)
Users typically access the internet via apps on mobile devices
All figures from comScore, US Digital Future in Focus, 2014
UK: The over-55s will experience the fastest year-on-year rises in
Smartphone ownership should increase to about 50% by year-end, a
25% increase from 2013, but trailing 70% penetration among 18-54s.
The difference in smartphone penetration by age will disappear, but
differences in usage of smartphones remain substantial. Many
over-55s use smartphones like feature phones.
All figures from Deloitte, predictions for 2014
Rise of platforms and apps focused on visual content
‘Mobile first… and only’ | simple, easy, user-friendly design
Facebook daily image uploads: 350 million (November 2013)
Instagram daily image uploads: 60 million (March 2014)
Twitter: 500 million tweets daily (March 2014)
Snapchat daily snaps: 400 million (November 2013)
Images largely ignored in
social media research
Not easy to ‘mine’
Hard to figure out meaning
Huge interest in industry
WHAT DOES THE
FUTURE OF SOCIAL
(TOOL AWARE) + CRITICAL
MORE CROSS SECTOR?
• Hazim Almuhimedi, Shomir Wilson, Bin Liu, Norman Sadeh, Alessandro Acquisti, 2013. ‘Tweets Are
Forever: A Large-Scale Quantitative Analysis of Deleted Tweets’, CSCW’13, February 23–27, 2013, San
Antonio, Texas, USA, http://www.cs.cmu.edu/~shomir/cscw2013_tweets_are_forever.pdf, accessed 18
• Ian Cairns, 2013. ‘Get More Geodata From Gnip With Our New Profile Geo Enrichment’, Gnip Company
Blog, 22 August, at http://blog.gnip.com/tag/geolocation/, accessed 13 September 2013.
• Grcommunication, 2012. ‘I will help raise your Klout score by sending you 10Ks and will tweet it out to my
50K+ followers from my 80+ Klout score for $5’, http://fiverr.com/grcommunication/help-raise-your-klout-
19 September 2013.
• Anthony Ha, 2013. ‘Gnip Expands Its Partnership With Klout, Becoming The Exclusive Provider Of
Klout Topics’, TechCrunch, 8 August, http://techcrunch.com/2013/08/08/gnip-klout/, accessed 19
• Martin Hawksey, 2013. ‘Twitter throws a bone: Increased hits and metadata in Twitter Search API 1.1’,
28 March, at http://mashe.hawksey.info/2013/03/twitterthrows-a-bone-increased-hits-and-metadata-in-twitter-search-api-1-1/,
accessed 10 September 2013.
• Kalev H. Leetaru, Shaowen Wang, Guofeng Cao, Anand Padmanabhan, and Eric Shook, 2013. ‘Mapping the
global Twitter heartbeat: The geography of Twitter’, First Monday, Volume 18, Number 5-6, May,
• Nicole Perlroth, 2013. ‘Fake Twitter Followers Become Multimillion-Dollar Business’, New York Times, Bits
blog, 5 April, http://bits.blogs.nytimes.com/2013/04/05/fake-twitter-followers-becomes-multimillion-dollar-business/,
accessed 19 September 2013.
• Farida Vis, 2013. ‘A critical reflection on Big Data: considering APIs, researchers and tools as data makers’,
First Monday, 7 October, http://firstmonday.org