Slicing Big Data: Gambling, Twitter & Time Sensitive Information
Oct. 26, 2013
Presented at the Internet Researchers conference (IR14) in Denver, CO, 26 October 2013. Discusses gambling, reality TV, and world events in the context of Twitter data, and how to select usable data from big data.
IR14 - Denver, CO
dp.woodford@qut.edu.au
@dpwoodford
Wednesday, 23 October 13
FORMAT
• Not going to simply repeat the paper.
• I will get to the gambling (& fantasy sports) examples, but want
to discuss our wider work with large datasets.
• Happy to answer more specific questions about the use in the
gambling industry.
• Examples from Sport, TV, Gambling & Fantasy Sports. A tour de force of current research projects.
DEALING WITH THE TITLE: TWITTER
• Twitter => Large Data Sets, but specific research
questions often require a small data set:
– Australian users
– Users registering on the platform during natural disasters
– ‘Experts’ on Fantasy Sports
– Sporting Participants: Golf, Tennis, NFL, College Football, etc.
– Reality TV ‘fanatics’
– Almost infinite examples
• Goal is to get from “Big Data” to what I’ve been calling
“useful data”
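As a minimal sketch of moving from "big data" to "useful data", a slice can be expressed as a predicate applied lazily to a stream of tweets. The tweet layout (a dict with a `user` object carrying a free-text `location`), the place-name list, and the function names below are assumptions for illustration, not the pipeline used in this research.

```python
# Hypothetical place-name hints; a real study would need a far better geolocation method.
AUSTRALIAN_HINTS = ("australia", "sydney", "melbourne", "brisbane", "perth", "adelaide")


def is_probably_australian(tweet):
    """Crude heuristic: look for a known place name in the free-text profile location."""
    user = tweet.get("user") or {}
    location = (user.get("location") or "").lower()
    return any(hint in location for hint in AUSTRALIAN_HINTS)


def slice_tweets(tweets, predicate):
    """Lazily yield only the tweets that satisfy the predicate."""
    return (tweet for tweet in tweets if predicate(tweet))


# Made-up sample records, only to show the shape of the data.
sample = [
    {"id": 1, "user": {"screen_name": "a", "location": "Brisbane, QLD"}},
    {"id": 2, "user": {"screen_name": "b", "location": "Denver, CO"}},
    {"id": 3, "user": {"screen_name": "c", "location": None}},
]
useful = list(slice_tweets(sample, is_probably_australian))
```

Because the slice is a generator, the same pattern works on a file or stream that does not fit in memory; other slices (fantasy sports 'experts', reality TV 'fanatics') would only swap the predicate.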
DEALING WITH THE TITLE: GAMBLING
• Long-term interest in the gambling industry (one case
study in my prior work on games).
• Many parallels between Gambling and Fantasy Sports
(another current research project).
• When I was an ‘active participant’, Twitter was just
becoming popular (2006-2010).
• It quickly became a crucial source of information, and
websites started aggregating it.
DEALING WITH THE TITLE: TIME SENSITIVE
INFORMATION
• Betting lines move incredibly fast: this is just as much a
market as day trading on the stock exchange.
WHY IS DATA SLICED?
• Streaming API is limited to ~1% of total tweets per second
& Firehose access is expensive.
• Large data sets are not easily manipulated or visually
analyzed (e.g. with Tableau):
– Our database of Twitter users is ~3.7TB, and growing.
– A week's worth of selected TV data (current US shows) is 750MB
in JSON format and 600MB in TSV (selected fields), with
millions of rows.
• Analyzing large data sets is slow, if it’s even possible =>
“Usable Data”
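The JSON-to-TSV reduction mentioned above can be sketched as follows. This assumes one JSON tweet object per line and an illustrative choice of fields (`id_str`, `created_at`, `text`, plus the nested `user.screen_name`); the actual fields selected for the TV data are not stated on the slide.

```python
import csv
import io
import json

# Hypothetical field selection, not the one used for the TV data set.
FIELDS = ("id_str", "created_at", "text")


def json_lines_to_tsv(lines, out, fields=FIELDS):
    """Write selected fields of line-delimited JSON tweets as TSV; return rows written."""
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(list(fields) + ["screen_name"])
    rows = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank keep-alive lines
        tweet = json.loads(line)
        row = [tweet.get(field, "") for field in fields]
        row.append((tweet.get("user") or {}).get("screen_name", ""))
        # Tabs and newlines inside tweet text would break naive TSV readers.
        writer.writerow([str(v).replace("\t", " ").replace("\n", " ") for v in row])
        rows += 1
    return rows


# Made-up record, only to exercise the function.
raw_lines = [json.dumps({
    "id_str": "1",
    "created_at": "Wed Oct 23 18:00:00 +0000 2013",
    "text": "Lions\tWR out?\nSource soon",
    "user": {"screen_name": "example_user"},
    "lang": "en",
})]
buffer = io.StringIO()
count = json_lines_to_tsv(raw_lines, buffer)
```

Dropping unused fields is what shrinks the file; the original JSON should still be archived, since a later research question may need a field that was left out.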
HOW IS DATA SLICED: COMPULSORY
HOW IS DATA SLICED: SELECTING FOR
AUTHENTICITY -- WTA
HOW IS DATA SLICED: SELECTING FOR
AUTHENTICITY -- FANTASY SPORTS
[Clip from Yahoo Fantasy Football re: Calvin Johnson injury & Twitter reports]
BUT YOU STILL NEED A SANITY CHECK
HOW IS DATA SLICED: RANDOM SAMPLING
Source: Tony Hirst (Open University UK)
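The slide does not say how the random sample was drawn. One common technique for sampling from a stream of unknown length is reservoir sampling, sketched below as an illustration rather than as the method behind the figure; the function name and the `seed` parameter are assumptions.

```python
import random


def reservoir_sample(stream, k, seed=None):
    """Return k items chosen uniformly at random from an iterable of unknown length."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Item i replaces a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # randint is inclusive at both ends
            if j < k:
                reservoir[j] = item
    return reservoir


# Stand-in for a stream of tweet IDs.
sample = reservoir_sample(range(100000), k=100, seed=42)
```

Only k items are held in memory at any time, so the stream can be far larger than RAM, and fixing the seed makes the slice reproducible.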
BUT SOMETIMES YOU NEED THE FULL
SAMPLE & REPEATED CAPTURE
Source: Bruns / Woodford [Mapping Online Publics]
HOW IS DATA SLICED: ONLY A SMALL
SAMPLE MATTERS
[Figure labels: Floods, Earthquake, Tsunami; Media Coverage]
HOW IS DATA SLICED: TV -- SEASONAL DATA
VS EPISODIC
[Figure annotations: Impact of Live Feed; Delayed TV sucks]
HOW IS DATA SLICED: MOST ACTIVE ≠
REPRESENTATIVE
• Most active (#BB15, #BBLF) users often defend a HM to
the death (akin to sporting tribalism), but most users are
attackers (forthcoming paper w/ Katie Prowd)
Disclaimer: Scale changed to fit on slide
Source: Woodford / Prowd [Fan Cultures and Hatred in Big Brother 15: Race Rows, Elitism & Sporting Tribalism -- Forthcoming]
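One quick way to check how far the most active accounts dominate a data set is to compute the share of tweets written by the top N authors. This is a sketch with made-up screen names, not the analysis from the forthcoming paper.

```python
from collections import Counter


def top_user_share(screen_names, top_n):
    """Fraction of all tweets written by the top_n most active authors."""
    counts = Counter(screen_names)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    top_total = sum(count for _, count in counts.most_common(top_n))
    return top_total / total


# One entry per tweet, naming its author (hypothetical accounts).
authors = ["fan_a"] * 6 + ["fan_b"] * 2 + ["viewer_1", "viewer_2"]
share = top_user_share(authors, top_n=1)
```

A high share for a small N is a warning that conclusions drawn from raw tweet counts describe a handful of heavy users rather than the wider audience.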
TIME SLICES OF TWEET CONTENT ARE
ENLIGHTENING
Source: Woodford / Prowd [Fan Cultures and Hatred in Big Brother 15: Race Rows, Elitism & Sporting Tribalism -- Forthcoming]
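A time slice of tweet content can be sketched as grouping tweets into fixed-width windows and counting the terms in each window. The 5-minute window, the timestamps, and the tweet texts below are hypothetical; the figures in this talk were not necessarily produced this way.

```python
from collections import Counter, defaultdict
from datetime import datetime, timedelta


def slice_by_time(tweets, start, window):
    """Group (timestamp, text) pairs into fixed-width windows counted from start.

    Returns a mapping of window index -> Counter of lowercased terms.
    """
    buckets = defaultdict(Counter)
    for timestamp, text in tweets:
        if timestamp < start:
            continue  # ignore anything before the broadcast window
        index = (timestamp - start) // window  # timedelta // timedelta gives an int
        buckets[index].update(text.lower().split())
    return buckets


# Made-up tweets around a hypothetical episode start time.
start = datetime(2013, 8, 1, 20, 0)
tweets = [
    (datetime(2013, 8, 1, 20, 1), "great episode #BB15"),
    (datetime(2013, 8, 1, 20, 4), "worst vote ever #BB15"),
    (datetime(2013, 8, 1, 20, 12), "live feeds are back #BBLF"),
]
slices = slice_by_time(tweets, start, timedelta(minutes=5))
```

Comparing the per-window term counts shows how the conversation shifts during an episode, which a single season-level total would hide.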
HOW IS DATA SLICED: MOST ACTIVE ≠
REPRESENTATIVE
• Twitter closed these quickly, yet the BB15 accounts
remained active for much of the season...
AND A QUICK NOTE ON NON-TWITTER
ANALYTICS
• There's lots of data out there, but it needs to be sliced to be
usable.
• You can work with large, original data sets, but this often adds
extra complexity that isn't necessary to answer your research
questions.
• But don't delete the data you don't need!
ACKNOWLEDGEMENTS
• ARC Centre of Excellence for Creative Industries and
Innovation (CCI) - http://www.cci.edu.au &
http://www.mappingonlinepublics.net
• Social Media Research Group -- http://socialmedia.qut.edu.au
• Queensland University of Technology