Mining the Social Web - Lecture 3 - T61.6020
1. Mining the Social Web
Aristides Gionis, Michael Mathioudakis
firstname.lastname@aalto.fi
Aalto University, Spring 2015
2. Mining the Social Web - Aalto - 2015
T-61.6020: Mining the social web — lecture #2
structure of social networks
social networks and social-media data can be
represented as graphs (or networks)
what do these graphs look like?
what is their structure?
data contain additional information
(actions, interactions, dynamics, attributes,…)
mining this additional information as part of
the network structure
community structure in social networks
Community structure
dolphins network and its NCP
(source [Leskovec et al., 2009])
Frieze, Gionis, Tsourakakis: Algorithmic Techniques for Modeling and Mining Large Graphs
Previously on T-61.6020
structure and dynamics of social networks
what does a social network look like?
how do social networks evolve over time?
how does information spread?
do users influence each other?
Figure 4: Top 50 threads in the news cycle with highest volume for the period Aug. 1 – Oct. 31, 2008. Each thread consists of all news
articles and blog posts containing a textual variant of a particular quoted phrases. (Phrase variants for the two largest threads in
each week are shown as labels pointing to the corresponding thread.) The data is drawn as a stacked plot in which the thickness of the
strand corresponding to each thread indicates its volume over time. Interactive visualization is available at http://memetracker.org.
Figure 5: Temporal dynamics of top threads as generated by our model. Only two ingredients, namely imitation and a preference to
recent threads, are enough to qualitatively reproduce the observed dynamics of the news cycle.
3. GLOBAL ANALYSIS: TEMPORAL VARIATION AND A PROBABILISTIC MODEL
periods when the upper envelope of the curve are high correspond to times when there is a greater degree of convergence on key stories, while the low periods indicate that attention is more diffuse,
threads dynamics
3. Today
politics
does network structure reflect political divisions?
can we infer political affiliation?
financial sentiment
can Twitter predict the stock market?
urban computing
what does online activity say about how we live in cities?
4. Politics Online
www & ‘democratization’ of information
citizen journalism (e.g. Haiti earthquake, Arab spring)
politicians and traditional media participate, too
5. US Politics & the Web
Websites - 1996
(Email) - 1998
Online Fundraising - 2000
Blogs - 2004
Twitter & FB - 2008
Jesse Ventura - MN governor
6. social media mining vs traditional political polls
new content & biases
7. International Workshop on Link Discovery, 2005
The Political Blogosphere and the 2004 U.S. Election:
Divided They Blog
Lada Adamic
HP Labs
1501 Page Mill Road
Palo Alto, CA 94304
lada.adamic@hp.com
Natalie Glance
Intelliseek Applied Research Center
5001 Baum Blvd.
Pittsburgh, PA 15217
nglance@intelliseek.com
4 March 2005
Abstract
In this paper, we study the linking patterns and discussion topics of political bloggers.
Our aim is to measure the degree of interaction between liberal and conservative blogs, and
to uncover any differences in the structure of the two communities. Specifically, we analyze
the posts of 40 “A-list” blogs over the period of two months preceding the U.S. Presidential
Election of 2004, to study how often they referred to one another and to quantify the overlap in
the topics they discussed, both within the liberal and conservative communities, and also across
communities. We also study a single day snapshot of over 1,000 political blogs. This snapshot
captures blogrolls (the list of links to other blogs frequently found in sidebars), and presents
a more static picture of a broader blogosphere. Most significantly, we find differences in the
behavior of liberal and conservative blogs, with conservative blogs linking to each other more
frequently and in a denser pattern.
1 Introduction
The 2004 U.S. Presidential Election was the first Presidential Election in the United States in which
blogging played an important role. Although the term weblog was coined in 1997, it was not until
after 9/11 that blogs gained readership and influence in the U.S. The next major trend in political
blogging was “warblogging”: blogs centered around discussion of the invasion of Iraq by the U.S.
The year 2004 saw a rapid rise in the popularity and proliferation of blogs. According to a
report from the Pew Internet & American Life Project published in January 2005, 32 million U.S.
8. Some History
before Facebook and Twitter, there were blogs
rise after 9/11
‘war-blogging’
2004
32 million US citizens read blogs
62% of US citizens do not know them
9. Jan 25, 2004
everyone talks with everyone?... or ‘echo chambers’?
10. In This Work...
task: extract network structure of political interactions
question: one or separate communities?
...previous evidence on two blogs (Instapundit and Atrios) shows ‘neighborhoods’ have no overlap in cited urls
...same with book purchases on amazon.com
on the other hand...
... it is now easier to interact
11. Data
used blog directories for lists of political blogs
parsed front page for links to discover more blogs
labeled them manually
only liberal & conservative
759 liberal & 735 conservative blogs
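The front-page parsing step can be sketched as follows. This is a minimal illustration using only the Python standard library, not the authors' crawler; the page URL, the sample HTML and the helper names (`LinkCollector`, `outgoing_hosts`) are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag on a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page URL.
                self.links.append(urljoin(self.base_url, value))


def outgoing_hosts(page_url, html):
    """Return the set of other hosts that the page links to."""
    collector = LinkCollector(page_url)
    collector.feed(html)
    own_host = urlparse(page_url).netloc
    hosts = {urlparse(link).netloc for link in collector.links}
    hosts.discard(own_host)  # ignore links back to the same blog
    hosts.discard("")        # ignore links without a host (e.g. mailto:)
    return hosts


# Hypothetical front page, used only for illustration.
sample_html = """
<html><body>
<a href="http://blog-b.example/post/1">B</a>
<a href="/about">About</a>
<a href="http://blog-c.example/">C</a>
</body></html>
"""
hosts = outgoing_hosts("http://blog-a.example/", sample_html)
```

Each newly discovered host would then be a candidate blog to label manually, as on the slide.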
12. Findings
Figure 1: Community structure of political blogs (expanded set), shown using a GEM
layout [11] in the GUESS [3] visualization and analysis tool. The colors reflect political orientation,
red for conservative, and blue for liberal. Orange links go from liberal to conservative, and purple
13. Findings
[Figure: cumulative distribution of incoming links for political blogs, separated by category (conservative, liberal). x-axis: incoming links (k), log scale from 10^0 to 10^2; y-axis: fraction of blogs with at least k links, log scale from 10^-2 to 10^0. A lognormal fit, shown as a dashed line, is a fairly good fit. A power law with an exponential cutoff, k^{-0.36} e^{-k/57}, shown as a solid line, is an even better fit.]
This is on par with the 277 links received by the most linked to conservative
[Figure: link network among the top 40 blogs; nodes are numbered 1 to 40 as listed below]
1 Digbys Blog
2 James Walcott
3 Pandagon
4 blog.johnkerry.com
5 Oliver Willis
6 America Blog
7 Crooked Timber
8 Daily Kos
9 American Prospect
10 Eschaton
11 Wonkette
12 Talk Left
13 Political Wire
14 Talking Points Memo
15 Matthew Yglesias
16 Washington Monthly
17 MyDD
18 Juan Cole
19 Left Coaster
20 Bradford DeLong
21 JawaReport
22 Voka Pundit
23 Roger L Simon
24 Tim Blair
25 Andrew Sullivan
26 Instapundit
27 Blogs for Bush
28 Little Green Footballs
29 Belmont Club
30 Captain’s Quarters
31 Powerline
32 Hugh Hewitt
33 INDC Journal
34 Real Clear Politics
35 Winds of Change
36 Allahpundit
37 Michelle Malkin
38 WizBang
39 Dean’s World
40 Volokh
Panels: (A), (B), (C)
similar picture for top 20 blogs
small number of blogs attract most links
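The quantity plotted on this slide, the fraction of blogs with at least k incoming links, can be computed directly from a link list. The sketch below is only an illustration; the toy edge list and the function names are hypothetical.

```python
def in_degrees(nodes, edges):
    """Count incoming links per node; nodes without links get 0."""
    degree = {node: 0 for node in nodes}
    for src, dst in edges:
        if dst in degree:
            degree[dst] += 1
    return degree


def fraction_with_at_least(degree, k):
    """Fraction of nodes that receive at least k incoming links."""
    values = list(degree.values())
    return sum(1 for d in values if d >= k) / len(values)


# Toy example: blog "a" attracts most of the links.
nodes = ["a", "b", "c", "d"]
edges = [("b", "a"), ("c", "a"), ("d", "a"), ("a", "b")]
degree = in_degrees(nodes, edges)
curve = {k: fraction_with_at_least(degree, k) for k in (1, 2, 3)}
```

Plotting `curve` on log-log axes gives the kind of picture shown in the figure; a heavy tail means that a small number of blogs attract most links.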
14. Findings
top 20 blogs, news links
[Figure 4: Most linked to news sources by the top 20 conservative and top 20 liberal blogs during 8/29/2004 - 11/15/2004. x-axis: # citations from weblog posts (0 to 1600); bars split into Left and Right. Sources listed: nytimes.com, washingtonpost.com, news.yahoo.com, msnbc.msn.com, nationalreview.com, cnn.com, latimes.com, boston.com, usatoday.com, washingtontimes.com, apnews.myway.com, guardian.co.uk, foxnews.com, cbsnews.com, slate.msn.com, nypost.com, news.bbc.co.uk, tnr.com, opinionjournal.com, online.wsj.com, salon.com]
1. CBS News poll of uncommitted voters shows Kerry winning 43% to 28%
2. Sun Times article: Bob Novak predicts that George Bush will retreat from Iraq if reelected
3. CBS News article on forged memos
4. New York Daily News article on Osama Bin Laden videotape, “gift” for the President
5. Time Magazine poll: Bush opens double-digit lead on post convention bounce
In contrast, the top news articles cited by right leaning bloggers are:
1. CBS News article on forged memos
2. Time Magazine poll: Bush opens double-digit lead on post convention bounce
3. National Review article refuting the case about missing explosives
4. ABC News article refuting the case about missing explosives
5. Washington Post article reporting on Kerry’s proposal to allow Iran to keep its nuclear power plants in exchange for giving up the right to retain the nuclear fuel that could be used for bomb-making
A time series chart further shows how quickly and strongly conservative bloggers responded to forged CBS documents (Figure 5). The conservative bloggers saw Dan Rather’s report as an attempt by the left to discredit President Bush. They acted quickly to debunk the report, with the charge led by PowerLine and seconded by Wizbangblog and others. In contrast, the pick-up among liberal bloggers occurred later, with lower volume. The most vocal left leaning bloggers on the subject were TalkLeft and AMERICAblog.
15. Findings
all blogs, links to other websites
[Figure 7: Most linked to news sources (online and off), showing proportionally how many liberal and conservative blogs link to them. x-axis: number of blogs linking (0 to 140); bars split into Left and Right. Sources listed: buzzflash.com, cursor.org, mediamatters.org, commondreams.org, alternet.org, airamericaradio.co, salon.com, thenation.com, theonion.com, guardian.co.uk, nytimes.com, news.google.com, washingtonpost.com, cnn.com, foxnews.com, weeklystandard.com, command-post.org, townhall.com, opinionjournal.com, nationalreview.com]
16. Summary
question: one or separate communities?
data: 1.5k blogs, manually labeled
methodology: simple link analysis
impact: showed divide in ‘online’ world
reproducibility: ?
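As an illustration of what "simple link analysis" could look like on manually labeled blogs, the sketch below counts links within and across the two communities. It is an assumption about the general approach, not the paper's code; the blog names, labels and function names are hypothetical.

```python
from collections import Counter


def link_mix(edges, label):
    """Count links by (label of source blog, label of target blog)."""
    counts = Counter()
    for src, dst in edges:
        if src in label and dst in label:  # skip unlabeled blogs
            counts[(label[src], label[dst])] += 1
    return counts


def within_fraction(counts):
    """Fraction of links that stay inside one community."""
    total = sum(counts.values())
    within = sum(n for (a, b), n in counts.items() if a == b)
    return within / total if total else 0.0


# Toy data: four blogs, five links, one of them crossing the divide.
label = {"a1": "liberal", "a2": "liberal",
         "b1": "conservative", "b2": "conservative"}
edges = [("a1", "a2"), ("a2", "a1"), ("b1", "b2"), ("b2", "b1"), ("a1", "b1")]
counts = link_mix(edges, label)
frac = within_fraction(counts)
```

A within-community fraction close to 1 would point to separate communities, while a value close to what random linking would give would point to one mixed community.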
17. Political Polarization on Twitter
M. D. Conover, J. Ratkiewicz, M. Francisco, B. Gonçalves, A. Flammini, F. Menczer
Center for Complex Networks and Systems Research
School of Informatics and Computing
Indiana University, Bloomington, IN, USA
Abstract
In this study we investigate how social media shape the networked public sphere and facilitate communication between communities with different political orientations. We examine two networks of political communication on Twitter, comprised of more than 250,000 tweets from the six weeks leading up to the 2010 U.S. congressional midterm elections. Using a combination of network clustering algorithms and manually-annotated data we demonstrate that the network of political retweets exhibits a highly segregated partisan structure, with extremely limited connectivity between left- and right-leaning users. Surprisingly this is not the case for the user-to-user mention network, which is dominated by a single politically heterogeneous cluster of users in which ideologically-opposed individuals interact at a much higher rate compared to the network of retweets. To explain the distinct topologies of the retweet and mention networks we conjecture that politically motivated individuals provoke interaction by injecting partisan content into information streams whose primary audience consists of ideologically-opposed users. We conclude with statistical evidence in support of this hypothesis.
1 Introduction
Social media play an important role in shaping political discourse in the U.S. and around the world (Bennett 2003; Benkler 2006; Sunstein 2007; Farrell and Drezner 2008; Aday et al. 2010; Tumasjan et al. 2010; O’Connor et al. 2010). According to the Pew Internet and American Life
Gallo, and Kane (2007). Consumers of online political information tend to behave similarly, choosing to read blogs that share their political beliefs, with 26% more users doing so in 2008 than 2004 (Pew Internet and American Life Project 2008).
In its own right, the formation of online communities is not necessarily a serious problem. The concern is that when politically active individuals can avoid people and information they would not have chosen in advance, their opinions are likely to become increasingly extreme as a result of being exposed to more homogeneous viewpoints and fewer credible opposing opinions. The implications for the political process in this case are clear. A deliberative democracy relies on a broadly informed public and a healthy ecosystem of competing ideas. If individuals are exposed exclusively to people or facts that reinforce their pre-existing beliefs, democracy suffers (Sunstein 2002; 2007).
In this study we examine networks of political communication on the Twitter microblogging service during the six weeks prior to the 2010 U.S. midterm elections. Sampling data from the Twitter ‘gardenhose’ API, we identified 250,000 politically relevant messages (tweets) produced by more than 45,000 users. From these tweets we isolated two networks of political communication: the retweet network, in which users are connected if one has rebroadcast content produced by another, and the mention network, where users are connected if one has mentioned another in a post, including the case of tweet replies.
We demonstrate that the retweet network exhibits a highly
Proceedings of the Fifth International AAAI Conference on Weblogs and Social Media
International Conference on Weblogs and Social Media (ICWSM) 2011
18. In This Work...
question
check if same pattern holds in Twitter
however
retweets and mentions instead of links
19. Data
6 weeks of tweets before US 2010 elections
to distinguish political tweets
1. parse all tweets (355 million)
2. select subset with hashtags #p2 and #tcot
3. find frequently co-occurring hashtags (66)
4. keep tweets that contain those hashtags
>> 250k tweets
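The four steps above can be sketched on a handful of tweets. The slide does not say how "frequently co-occurring" is measured; the sketch assumes a Jaccard coefficient between the set of tweets carrying a hashtag and the set of tweets carrying a seed hashtag, with a hypothetical threshold `min_jaccard`. The sample tweets and function names are also made up for illustration.

```python
def hashtags(text):
    """Lower-cased hashtags in a tweet (naive whitespace tokenization)."""
    return {w.lower() for w in text.split() if w.startswith("#") and len(w) > 1}


def cooccurring_tags(tweets, seeds, min_jaccard):
    """Seed hashtags plus every hashtag that co-occurs with them often enough."""
    tag_sets = [hashtags(t) for t in tweets]
    seed_ids = {i for i, tags in enumerate(tag_sets) if tags & seeds}
    by_tag = {}
    for i, tags in enumerate(tag_sets):
        for tag in tags:
            by_tag.setdefault(tag, set()).add(i)
    selected = set(seeds)
    for tag, ids in by_tag.items():
        # Jaccard overlap between tweets with this tag and tweets with a seed.
        jaccard = len(ids & seed_ids) / len(ids | seed_ids)
        if jaccard >= min_jaccard:
            selected.add(tag)
    return selected


tweets = [
    "Go vote #p2 #vote2010",
    "Budget talk #tcot #teaparty",
    "Lunch was great #food",
    "Debate tonight #teaparty",
]
seeds = {"#p2", "#tcot"}
selected = cooccurring_tags(tweets, seeds, min_jaccard=0.3)
# Step 4: keep the tweets that contain any selected hashtag.
kept = [i for i, t in enumerate(tweets) if hashtags(t) & selected]
```

Note that the last tweet is kept even though it contains neither seed hashtag, which is the point of expanding the hashtag set.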
The major contributions of this work are:
• Creation and release of a network and text dataset derived from more than 250,000 politically-related Twitter posts authored in the weeks preceding the 2010 U.S. midterm elections (§ 2).
• Cluster analysis of networks derived from this corpus showing that the network of retweets exhibits clear segregation, while the mention network is dominated by a single large community (§ 3.1).
• Manual classification of Twitter users by political alignment, demonstrating that the retweet network clusters correspond to the political left and right. These data also show the mention network to be politically heterogeneous, with users of opposing political views interacting at a much higher rate than in the retweet network (§ 3.3).
• An interpretation of the observed community structures
Table 1: Hashtags related to #p2, #tcot, or both. Tweets containing any of these were included in our sample.
Just #p2:   #casen #dadt #dc10210 #democrats #du1 #fem2 #gotv #kysen #lgf #ofa #onenation #p2b #pledge #rebelleft #truthout #vote #vote2010 #whyimvotingdemocrat #youcut
Both:       #cspj #dem #dems #desen #gop #hcr #nvsen #obama #ocra #p2 #p21 #phnm #politics #sgp #tcot #teaparty #tlot #topprog #tpp #twisters #votedem
Just #tcot: #912 #ampat #ftrs #glennbeck #hhrs #iamthemob #ma04 #mapoli #palin #palin12 #spwbt #tsot #tweetcongress #ucot #wethepeople
Table 2: Hashtags excluded from the analysis due to ambigu-
20. Extract Networks
node u, node v
v retweets u
u mentions v
two networks
clustering
1. max modularity for two clusters, to assign initial labels to nodes (a and b)
2. label propagation: re-assign nodes to label of most neighbors until no change
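The label propagation step can be sketched as below. The sketch assumes that initial labels are already available (on the slide they come from modularity maximization, which is not implemented here), treats the network as undirected, and breaks ties by keeping the current label; these choices, the toy network and the function names are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def build_undirected(edges):
    """Neighbor sets from (u, v) pairs such as 'v retweets u'."""
    nbrs = defaultdict(set)
    for u, v in edges:
        if u != v:
            nbrs[u].add(v)
            nbrs[v].add(u)
    return nbrs


def propagate_labels(nbrs, initial, max_rounds=100):
    """Re-assign each node to the label held by most of its neighbors."""
    labels = dict(initial)
    for _ in range(max_rounds):
        changed = False
        for node in sorted(nbrs):  # fixed order keeps the sketch deterministic
            votes = Counter(labels[n] for n in nbrs[node])
            best = max(votes.values())
            winners = {lab for lab, c in votes.items() if c == best}
            if labels[node] not in winners:  # on a tie, keep the current label
                labels[node] = min(winners)
                changed = True
        if not changed:
            break
    return labels


# Two triangles joined by one edge; node "a3" starts with the wrong label.
edges = [("a1", "a2"), ("a2", "a3"), ("a1", "a3"),
         ("b1", "b2"), ("b2", "b3"), ("b1", "b3"), ("a3", "b1")]
initial = {"a1": "a", "a2": "a", "a3": "b", "b1": "b", "b2": "b", "b3": "b"}
final = propagate_labels(build_undirected(edges), initial)
```

The `max_rounds` cap is a safety measure for the sketch, since label propagation is not guaranteed to settle quickly on every graph.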
21. Findings
Figure 1: The political retweet (left) and mention (right) networks, laid out using a force-directed algorithm. Node colors reflect
cluster assignments (see § 3.1). Community structure is evident in the retweet network, but less so in the mention network. We
show in § 3.3 that in the retweet network, the red cluster A is made of 93% right-leaning users, while the blue cluster B is made
of 80% left-leaning users.
tive Twitter users. This structural difference is of particular importance with respect to political communication, as we now have statistical evidence to suggest that mentions and
        Retweet  Mention
A↔A     0.31     0.31
[Figure: log-scale axis (10^0 to 10^1); legend: Cluster A, Cluster B, Different clusters; panels: retweets, mentions]
separate clusters for retweets
one big cluster for mentions
22. Summary
question: does network structure reflect political divisions
data: 250k political tweets before US 2010 elections
methodology: clustering & label propagation
impact: showed different modes of interaction between individuals in same & different sides
reproducibility: data are online -- http://truthy.indiana.edu/projects/data-and-software.html
23. This network graph details the landscape of Twitter handles responding to the UNRWA school bombing.
Source: https://medium.com/i-data/israel-gaza-war-data-a54969aeb23e
24. Instagram co-tag graph, highlighting three distinct topical communities: 1) pro-Israeli (Orange), 2) pro-Palestinian (Yellow), and 3) Religious / Muslim (Purple)
Source: https://medium.com/i-data/israel-gaza-war-data-a54969aeb23e
25. International Conference on Weblogs and Social Media (ICWSM) 2013
Classifying Political Orientation on Twitter: It’s Not Easy!
Raviv Cohen and Derek Ruths
School of Computer Science
McGill University
raviv.cohen@mail.mcgill.ca, derek.ruths@mcgill.ca
Abstract
Numerous papers have reported great success at inferring the political orientation of Twitter users. This paper has some unfortunate news to deliver: while past work has been sound and often methodologically novel, we have discovered that reported accuracies have been systemically overoptimistic due to the way in which validation datasets have been collected, reporting accuracy levels nearly 30% higher than can be expected in populations of general Twitter users.
Using careful and novel data collection and annotation techniques, we collected three different sets of Twitter users, each characterizing a different degree of political engagement on Twitter, from politicians (highly politically vocal) to “normal” users (those who rarely discuss politics). Applying standard techniques for inferring political orientation, we show that methods which previously reported greater than 90% inference accuracy, actually achieve barely 65% accuracy on normal users. We also show that classifiers cannot be used to classify users outside the narrow range of political orientation on which they were trained.
While a sobering finding, our results quantify and call attention to overlooked problems in the latent attribute inference literature that, no doubt, extend beyond political orientation inference: the way in which datasets are assembled and the transferability of classifiers.
Introduction
Much of the promise of online social media studies, analytics, and commerce depends on knowing various attributes of individual and groups of users. For a variety of reasons, few intrinsic attributes of individuals are explicitly revealed in their user account profiles. As a result, latent attribute inference, the computational discovery of “hidden” attributes, has become a topic of significant interest among social me-
including gender, age, education, political orientation, and even coffee preferences (Zamal, Liu, and Ruths 2012; Conover et al. 2011b; 2011a; Rao and Yarowsky 2010; Pennacchiotti and Popescu 2011; Wong et al. 2013; Liu and Ruths 2013; Golbeck and Hansen 2011; Burger, Henderson, and Zarrella 2011). In general, inference algorithms have achieved accuracy rates in the range of 85%, but have struggled to improve beyond this point. To date, the great success story of this area is political orientation inference for which a number of papers have boasted inference accuracy reaching and even surpassing 90% (Conover et al. 2011b; Zamal, Liu, and Ruths 2012).
By any reasonable measure, the existing work on political orientation is sound and represents a sincere and successful effort to advance the technology of latent attribute inference. Furthermore, a number of the works have yielded notable insights into the nature of political orientation in online environments (Conover et al. 2011b; 2011a). In this paper, we examine the question of whether existing political orientation inference systems actually perform as well as reported on the general Twitter population. Our findings indicate that, without exception, they do not, even when the general population considered is restricted only to those who discuss politics (since inferring the political orientation of a user who never speaks about politics is, certainly, very hard if not impossible).
We consider this an important question and finding for two reasons. Foremost, nearly all applications of latent attribute inference involve its use on large populations of unknown users. As a result, quantifying its performance on the general Twitter population is arguably the best way of evaluating its practical utility. Second, the existing literature on this topic reports its accuracy in inferring political orientation without qualification or caveats (author’s note: includ-
26. In This Work...
task: classify political orientation on Twitter
goal: investigate claims of previous work
previous work
high classification accuracy
but was the task too easy?
heavily political user accounts
what about ‘modestly political’ users?
27. Data
3 datasets
Political Figures
Politically Active
Politically Modest
Table 1: Basic statistics on the different datasets used. Total size of the Figures dataset was limited by the number of federal level politicians; size of the Modest dataset was limited by the number of users that satisfied our stringent conditions - these were culled from a dataset of 10,000 random individuals.
Dataset   Republicans   Democrats   Total
Figures   203           194         397
Active    860           977         1837
Modest    105           157         262
Conover   107           89          196
28. Data
Political Figures
Twitter accounts for US governors & congressmen
latest 1000 tweets
also...
partitioned used hashtags into politically discriminative & neutral
29. Data
Politically Active
self-declared, according to profile
democrats/liberals, conservatives/republicans
manual inspection
only US residents
political topics not dominating their tweets
30. Data
Politically Modest
>> 10000 random users
filter ones that use politically neutral hashtag
>> 1500 users
use Amazon Mechanical Turk to classify based on 10 tweets with hashtags
democrats/liberals vs republicans/conservatives
>> 327 classified users
31. Classification
SVM with 10-fold cross validation
Features (from previous work):
– tweet/rt/hashtag/link/mention frequencies,
– number of friends, followers
– usage of top-k Rep/Dem 1-2-3-grams, hashtags
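The evaluation protocol on this slide, 10-fold cross-validation, can be sketched in plain Python. To keep the example self-contained, a nearest-centroid classifier stands in for the SVM, so this is not the classifier used in the paper; the two-dimensional toy features, the labels and all function names are hypothetical.

```python
import random


def k_fold_indices(n, k, seed=0):
    """Shuffle 0..n-1 and split the indices into k disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def train(X, y):
    """Stand-in classifier: one centroid (mean feature vector) per label."""
    by_label = {}
    for row, lab in zip(X, y):
        by_label.setdefault(lab, []).append(row)
    return {lab: [sum(col) / len(rows) for col in zip(*rows)]
            for lab, rows in by_label.items()}


def predict(model, row):
    """Label of the closest centroid (squared Euclidean distance)."""
    def sq_dist(center):
        return sum((a - b) ** 2 for a, b in zip(center, row))
    return min(model, key=lambda lab: sq_dist(model[lab]))


def cross_validate(X, y, k=10, seed=0):
    """Average accuracy over k folds, each held out once for testing."""
    correct = 0
    for held_out in k_fold_indices(len(X), k, seed):
        held = set(held_out)
        train_idx = [i for i in range(len(X)) if i not in held]
        model = train([X[i] for i in train_idx], [y[i] for i in train_idx])
        correct += sum(predict(model, X[i]) == y[i] for i in held_out)
    return correct / len(X)


# Toy users with two made-up features; the two groups are well separated.
X = [[i % 3, (2 * i) % 3] for i in range(10)] + \
    [[10 + i % 3, 10 + (2 * i) % 3] for i in range(10)]
y = ["D"] * 10 + ["R"] * 10
accuracy = cross_validate(X, y, k=10)
```

On real data the feature vectors would hold the frequencies and counts listed above, and the accuracy would be far from perfect for modestly political users.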
32. Findings
The average of ten 10-fold cross-validation SVM itera-
est was performed on each one of our datasets respec-
Dataset   SVM Accuracy
Figures   91%
Active    84%
Modest    68%
Conover   87%
Table 6: Performance results of training our SVM on one dataset and inferring on another; italicized are the averaged 10-fold cross-validation results
Dataset   Figures   Active   Modest
Figures   91%       72%      66%
Active    62%       84%      69%
Modest    54%       57%      68%
cross-dataset classification
33. Summary
goal: study classification accuracy for political leaning
data: targeted Twitter sample
methodology: partitioned subsets of data, AMT, SVM
impact: showed difficulty of task
reproducibility: data available per request - http://www.icwsm.org/2015/datasets/datasets/
34. Twitter mood predicts the stock market.
Johan Bollen (1,*), Huina Mao (1,*), Xiao-Jun Zeng (2).
*: authors made equal contributions.
Abstract—Behavioral economics tells us that emotions can
profoundly affect individual behavior and decision-making. Does
this also apply to societies at large, i.e. can societies experience
mood states that affect their collective decision making? By
extension is the public mood correlated or even predictive of
economic indicators? Here we investigate whether measurements
of collective mood states derived from large-scale Twitter feeds
are correlated to the value of the Dow Jones Industrial Average
(DJIA) over time. We analyze the text content of daily Twitter
feeds by two mood tracking tools, namely OpinionFinder that
measures positive vs. negative mood and Google-Profile of Mood
States (GPOMS) that measures mood in terms of 6 dimensions
(Calm, Alert, Sure, Vital, Kind, and Happy). We cross-validate
the resulting mood time series by comparing their ability to
detect the public’s response to the presidential election and
Thanksgiving day in 2008. A Granger causality analysis and
a Self-Organizing Fuzzy Neural Network are then used to
investigate the hypothesis that public mood states, as measured by
the OpinionFinder and GPOMS mood time series, are predictive
of changes in DJIA closing values. Our results indicate that the
accuracy of DJIA predictions can be significantly improved by
the inclusion of specific public mood dimensions but not others.
We find an accuracy of 87.6% in predicting the daily up and
down changes in the closing values of the DJIA and a reduction
of the Mean Average Percentage Error by more than 6%.
Index Terms—stock market prediction — twitter — mood
analysis.
I. INTRODUCTION
STOCK market prediction has attracted much attention
from academia as well as business. But can the stock
market really be predicted? Early research on stock market
prediction [1], [2], [3] was based on random walk theory
and the Efficient Market Hypothesis (EMH) [4]. According
to the EMH stock market prices are largely driven by new
sentiment from blogs. In addition, Google search queries have
been shown to provide early indicators of disease infection
rates and consumer spending [14]. [9] investigates the relations
between breaking financial news and stock price changes.
Most recently [13] provide a ground-breaking demonstration
of how public sentiment related to movies, as expressed on
Twitter, can actually predict box office receipts.
Although news most certainly influences stock market
prices, public mood states or sentiment may play an equally
important role. We know from psychological research that
emotions, in addition to information, play a significant role
in human decision-making [16], [18], [39]. Behavioral finance
has provided further proof that financial decisions are significantly
driven by emotion and mood [19]. It is therefore
reasonable to assume that the public mood and sentiment can
drive stock market values as much as news. This is supported
by recent research by [10] who extract an indicator of public
anxiety from LiveJournal posts and investigate whether its
variations can predict S&P500 values.
However, if it is our goal to study how public mood
influences the stock markets, we need reliable, scalable and
early assessments of the public mood at a time-scale and
resolution appropriate for practical stock market prediction.
Large surveys of public mood over representative samples of
the population are generally expensive and time-consuming
to conduct, cf. Gallup’s opinion polls and various consumer
and well-being indices. Some have therefore proposed indirect
assessment of public mood or sentiment from the results of
soccer games [20] and from weather conditions [21]. The
accuracy of these methods is however limited by the low
degree to which the chosen indicators are expected to be
correlated with public mood.
arXiv:1010.3003v1 [cs.CE] 14 Oct 2010
Journal of Computational Science - 2011
35. In This Work...
Efficient Market Hypothesis
‘you can’t beat the market’
all information is already taken into account
But maybe Twitter can?
measure the people’s mood as reflected on Twitter
use for prediction
36. Data
about 10M tweets, February to December 2008
time-series of sentiment scores
OpinionFinder: positive - negative scale
POMS: lexicon-based mood score
> calm, alert, sure, vital, kind, happy
time-series of DJIA from Yahoo! Finance
Dow Jones Industrial Average
37. [Figure: DJIA daily closing value (March 2008 - December 2008); x-axis: months, Mar to Dec]
38.
each tweet term
ets and methods
MS time series we
local mean and
of k days before
e z-score of time
(1)
mean and stan-
riod [t−k, t+k].
uctuate around a
andard deviation.
ies against large
OMS to capture
we apply them
October 5, 2008
osen specifically
ts that may have
on public mood
ber 4, 2008) and
OF and GPOMS
or after Thanksgiving.
[Figure: z-score time series, Oct 22 to Nov 26, for OpinionFinder and the six GPOMS dimensions CALM, ALERT, SURE, VITAL, KIND and HAPPY; annotations: day after election, Thanksgiving, pre-election anxiety, election results, pre-election energy, Thanksgiving happiness]
Fig. 2. Tracking public mood states from tweets posted between October
2008 to December 2008 shows public responses to presidential election and
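The windowed z-score normalization mentioned in the excerpt can be sketched as below. The equation itself is cropped in the slide, so this is a reading of the legible fragments (local mean and standard deviation over [t−k, t+k]); the function name, the clipping of the window at the ends of the series, and the use of the population standard deviation are assumptions.

```python
from statistics import mean, pstdev

def local_zscores(series, k):
    """z-score of each value against the window [t-k, t+k], clipped at the ends."""
    out = []
    for t, value in enumerate(series):
        window = series[max(0, t - k): t + k + 1]
        sigma = pstdev(window)
        # A constant window has no spread; report 0.0 instead of dividing by zero.
        out.append((value - mean(window)) / sigma if sigma > 0 else 0.0)
    return out

z = local_zscores([1, 2, 3], k=1)
```

Because each value is compared only to its own neighborhood, a slow drift in the series does not show up as a large z-score; only local departures do.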
39. (slide shows an excerpt of the paper with the left part of each line cut off)
Legible fragments: the excerpt mentions correlating mood with DJIA closing values, an econometric technique applied to the daily time series vs. the DJIA, lagged correlation, and a DJIA series that reflects daily changes (a delta). [...]
more detail, we plot both time series in Fig. 3. To maintain
the same scale, we convert the DJIA delta values Dt and mood
index value Xt to z-scores as shown in Eq. 1.
[Figure: three panels of z-scores (range −2 to 2), Aug 09 to Oct 28: DJIA z-score overlaid with Calm z-score, DJIA z-score, Calm z-score. Annotation: bank bail-out.]
Fig. 3. A panel of three graphs. The top graph shows the overlap of the
day-to-day difference of DJIA values (blue: ZDt ) with the GPOMS’ Calm
time series (red: ZXt ) that has been lagged by 3 days. Where the two graphs
overlap the Calm time series predict changes in the DJIA closing values that
40. Linear Correlation
[Cropped excerpt of the paper: fragment of a table of p-values involving the 6 GPOMS mood dimensions, with accompanying text that mentions YOF, X4 (Vital), Alert and X5. The values cannot be reliably attributed to rows in this extraction.]
We perform the Granger causality analysis according to
models L1 and L2 shown in Eq. 3 and 4 for the period
from February 28 to November 3, 2008, to exclude
the exceptional public mood response to the Presidential
Election and Thanksgiving from the comparison. GPOMS and
OpinionFinder time series were produced for 342,255 tweets
in that period, and the daily Dow Jones Industrial Average
(DJIA) was retrieved from Yahoo! Finance for each day.
L1: $D_t = \alpha + \sum_{i=1}^{n} \beta_i D_{t-i} + \epsilon_t$ (3)
L2: $D_t = \alpha + \sum_{i=1}^{n} \beta_i D_{t-i} + \sum_{i=1}^{n} \gamma_i X_{t-i} + \epsilon_t$ (4)
Based on the results of our Granger causality analysis (shown in
Table II), we can reject the null hypothesis that the mood time
series do not predict DJIA values, i.e. the coefficients of the lagged
mood terms X_{t-i} in Eq. 4 are nonzero for some i in {1, 2, ..., n}, with a
high level of confidence. However, this result only applies to
one GPOMS mood dimension. We observe that X1 (i.e. Calm)
has the highest Granger causality relation with DJIA for lags [...]
Basic Model
DJIA daily change
Mood Variables
Enhanced Model
two models, with and without sentiment scores
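The comparison of the two models, with and without lagged mood values, can be sketched as a nested-regression F statistic. This is a minimal illustration under stated assumptions, not the authors' implementation: the helper names `ols_rss` and `granger_f` are hypothetical, the fit uses plain normal equations, and the data below are synthetic. Turning the statistic into a p-value would additionally need the F distribution, which is left out here.

```python
import random

def ols_rss(rows, targets):
    """Least-squares fit via the normal equations; returns the residual sum of squares."""
    k = len(rows[0])
    # Augmented matrix [X^T X | X^T y].
    a = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * y for r, y in zip(rows, targets))] for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda row: abs(a[row][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for row in range(k):
            if row != col:
                f = a[row][col] / a[col][col]
                a[row] = [v - f * p for v, p in zip(a[row], a[col])]
    beta = [a[i][k] / a[i][i] for i in range(k)]
    return sum((y - sum(b * v for b, v in zip(beta, r))) ** 2
               for r, y in zip(rows, targets))

def granger_f(d, x, n):
    """F statistic comparing the basic model (n lags of d) with the enhanced model (n lags of d and x)."""
    ts = range(n, len(d))
    y = [d[t] for t in ts]
    basic = [[1.0] + [d[t - i] for i in range(1, n + 1)] for t in ts]
    enhanced = [row + [x[t - i] for i in range(1, n + 1)]
                for row, t in zip(basic, ts)]
    rss1, rss2 = ols_rss(basic, y), ols_rss(enhanced, y)
    dof = len(y) - (2 * n + 1)  # observations minus parameters of the enhanced model
    return ((rss1 - rss2) / n) / (rss2 / dof)

# Synthetic example: d depends on the previous day's x plus noise.
rng = random.Random(0)
x = [rng.gauss(0, 1) for _ in range(200)]
d = [0.0] + [0.8 * x[t - 1] + rng.gauss(0, 0.5) for t in range(1, 200)]
f_causal = granger_f(d, x, n=1)
```

A large statistic means the lagged x values reduce the prediction error of d well beyond what the lags of d achieve alone, which is the sense of "Granger causality" used here; it does not establish actual causation.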
41. Linear Correlation
TABLE II
[...] SIGNIFICANCE (P-VALUES) OF BIVARIATE GRANGER-CAUSALITY CORRELATION BETWEEN MOODS AND DJIA IN PERIOD FEBRUARY 28, 2008 TO NOVEMBER 3, 2008.
Lag      OF      Calm     Alert  Sure   Vital  Kind   Happy
1 day    0.085*  0.272    0.952  0.648  0.120  0.848  0.388
2 days   0.268   0.013**  0.973  0.811  0.369  0.991  0.7061
3 days   0.436   0.022**  0.981  0.349  0.418  0.991  0.723
4 days   0.218   0.030**  0.998  0.415  0.475  0.989  0.750
5 days   0.300   0.036**  0.989  0.544  0.553  0.996  0.173
6 days   0.446   0.065*   0.996  0.691  0.682  0.994  0.081*
7 days   0.620   0.157    0.999  0.381  0.713  0.999  0.150
(p-value < 0.05: **, p-value < 0.1: *)
[Cropped excerpt of the paper, legible fragments only:] [...] dimension thus has predictive value with [...] In fact the p-value for this shorter period, [...] to October 30 2008, is significantly lower ([...] 0.009) than that listed in Table II for the [...] mood values of the past n days. [...] the results shown in Table II indicate [...] Granger causal relation between Calm [...] significantly. All historical load values are [...]
42. Neural Network
TABLE III
DJIA DAILY PREDICTION USING SOFNN
Evaluation     IOF   I0    I1     I1,2  I1,3  I1,4  I1,5  I1,6
MAPE (%)       1.95  1.94  1.83   2.03  2.13  2.05  1.85  1.79*
Direction (%)  73.3  73.3  86.7*  60.0  46.7  60.0  73.3  80.0
I1: calm
I1,6: calm + happy
inputs: values of 3 previous days
MAPE: Mean Absolute Percentage Error (less is better)
Direction: prediction of DJIA direction
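The two evaluation measures in Table III can be computed as in the sketch below. The function names and the toy numbers are illustrative assumptions; in particular, "direction" is taken here to mean whether the predicted close moves the same way (up or not) as the actual close relative to the previous day's close.

```python
def mape(actual, predicted):
    """Mean absolute percentage error, in percent."""
    return 100.0 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)

def direction_accuracy(previous, actual, predicted):
    """Percentage of days on which the predicted move matches the actual move."""
    hits = sum(
        (a > prev) == (p > prev)
        for prev, a, p in zip(previous, actual, predicted)
    )
    return 100.0 * hits / len(actual)

# Toy values, not data from the paper.
error = mape([100.0, 200.0], [110.0, 190.0])
accuracy = direction_accuracy([100.0, 100.0], [105.0, 95.0], [101.0, 102.0])
```

The two measures can disagree: a model can have a low percentage error while still guessing the sign of the daily change wrongly, which is why the table reports both.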
43. Summary
question: does twitter mood predict the stock market?
data: 8 months of tweets
methodology: lexicon-based sentiment scores, LR, NN
impact: showed a case that twitter mood can predict the stock market
reproducibility: data on website, website not working
https://terramood.soic.indiana.edu/data
44. Discussion
what do you think?
what would you do differently?
45. Baseline Scenarios
invest 10k $ on DJIA on Jan 1st 1976
sell everything at end of day if index is down
buy everything back at end of first day that index is up
cash in on Dec 31st 1985
>> 25k $ before transaction costs
>> 1.1k $ after transaction costs (0.25% commission)
invest 10k $ on Jan 1st 1976, cash in on Dec 31st 1985
>> 18k $
repeat during the 2000s
>> 4k $ before commissions
source: "The Signal and the Noise", N. Silver, 2012, page 344
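The two baseline scenarios can be simulated on any series of daily closing values with a sketch like the one below. The function names are hypothetical, the commission is assumed to be charged as a fraction of portfolio value on every sale and every re-purchase, and the four closing values are toy numbers, so the output does not reproduce the figures quoted on the slide.

```python
def buy_and_hold(closes, cash=10_000.0):
    """Invest at the first close and cash in at the last one."""
    return cash * closes[-1] / closes[0]

def sell_on_down_days(closes, cash=10_000.0, commission=0.0):
    """Sell at the end of a down day; buy back at the end of the first up day."""
    value, invested = cash, True
    for prev, cur in zip(closes, closes[1:]):
        if invested:
            value *= cur / prev          # portfolio follows the index while invested
        if invested and cur < prev:
            value *= 1 - commission      # pay commission on the sale
            invested = False
        elif not invested and cur > prev:
            value *= 1 - commission      # pay commission on the re-purchase
            invested = True
    return value

toy_closes = [100.0, 110.0, 99.0, 108.0]
hold = buy_and_hold(toy_closes)
free = sell_on_down_days(toy_closes)
costly = sell_on_down_days(toy_closes, commission=0.0025)
```

Since the strategy trades on a large share of days, even a small per-trade commission compounds quickly, which is the point the slide makes with the before and after transaction cost figures.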
46. Do Investors Mine Twitter?
47. The Livehoods Project:
Utilizing Social Media to Understand the Dynamics of a City
Justin Cranshaw Raz Schwartz Jason I. Hong Norman Sadeh
jcransh@cs.cmu.edu razs@andrew.cmu.edu jasonh@cs.cmu.edu sadeh@cs.cmu.edu
School of Computer Science
Carnegie Mellon University
Pittsburgh, PA 15213
Abstract
Studying the social dynamics of a city on a large scale has tra-
ditionally been a challenging endeavor, often requiring long
hours of observation and interviews, usually resulting in only
a partial depiction of reality. To address this difficulty, we
introduce a clustering model and research methodology for
studying the structure and composition of a city on a large
scale based on the social media its residents generate. We ap-
ply this new methodology to data from approximately 18 mil-
lion check-ins collected from users of a location-based online
social network. Unlike the boundaries of traditional munici-
pal organizational units such as neighborhoods, which do not
always reflect the character of life in these areas, our clusters,
which we call Livehoods, are representations of the dynamic
areas that comprise the city. We take a qualitative approach to
validating these clusters, interviewing 27 residents of Pitts-
burgh, PA, to see how their perceptions of the city project
onto our findings there. Our results provide strong support
for the discovered clusters, showing how Livehoods reveal
the distinctly characterized areas of the city and the forces
that shape them.
Introduction
The forces that shape the dynamics of a city are multifarious
and complex. Cultural perceptions, economic factors, mu-
nicipal borders, demography, geography, and resources—all
shape and constrain the texture and character of local urban
life. It can be extremely difficult to convey these intricacies
to an outsider; one may call them well-kept secrets, some-
times only even partially known to the locals. When out-
siders, such as researchers, journalists, or city planners, do
want to learn about a city, it often requires hundreds of hours
[...]
activity patterns of its people. Contrary to traditional organi-
zational units such as neighborhoods that are often stagnant
and may portray old realities, our clusters reflect current col-
lective activity patterns of people in the city, thus revealing
the dynamic nature of local urban areas, exposing their indi-
vidual characters, and highlighting various forces that form
the urban habitat.
Our work is made possible by the rapid proliferation of
smartphones in recent years and the subsequent emergence
of location-based services and applications. Location-based
social networks such as foursquare have created new means
for online interactions based on the physical location of their
users. In these systems, users can “check-in” to a location by
selecting it from a list of named nearby venues. Their check-
in is then broadcast to other users of the system.
To algorithmically explore the dynamics of cities, we use
data from millions of check-ins gathered from foursquare.
Using well studied techniques in spectral clustering, we in-
troduce a model for the structure of local urban areas that
groups nearby foursquare venues into clusters. Our model
takes into account both the spatial proximity between venues
as given by their geographic coordinates, as well as the so-
cial proximity which we derive from the distribution of peo-
ple that check-in to them. The underlying hypothesis of our
model is that the “character” of an urban area is defined not
just by the types of places found there, but also by the peo-
ple that choose to make that area part of their daily life. We
call these clusters Livehoods, reflecting the dynamic nature
of activity patterns in the lives of city inhabitants.
We take a qualitative approach to evaluating this hypoth-
esis. In a true urban studies tradition, we conducted in-
terviews with 27 residents of different areas of Pittsburgh,
International Conference on Weblogs and Social Media (ICWSM) 2012
48. In This Work...
goal: discover city structure using online activity
use 4sq data to uncover 'livehoods' in Pittsburgh
how do they differ from official boundaries?
evaluate findings with interviews
49. Data
43k check-ins in Pittsburgh
4k users, 5k venues (restaurants, cafeterias, etc)
newly collected & from previous work
who (user id) visited what venue (venue-id, geo-location)
50. Clustering
For each pair of venues
– geographic distance
– social similarity s (in terms of visiting users)
For each venue
– maintain m nearest geographic neighbors
– connect them with an edge with weight s
Apply spectral clustering
– number of clusters at largest eigenvalue gap
Clusters → 'Livehoods'
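The graph construction and the eigenvalue-gap rule from this slide can be sketched as follows. Several choices here are assumptions made for illustration: cosine similarity over per-user check-in counts as the social similarity s, plain Euclidean distance between coordinates as the geographic distance, dropping edges with zero similarity, and the names `cosine`, `build_graph` and `eigengap_k`. The eigen-decomposition itself is left out; `eigengap_k` expects eigenvalues of the graph Laplacian computed elsewhere.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse {user: check-in count} dicts."""
    dot = sum(count * v.get(user, 0) for user, count in u.items())
    norm = (math.sqrt(sum(c * c for c in u.values()))
            * math.sqrt(sum(c * c for c in v.values())))
    return dot / norm if norm else 0.0

def build_graph(coords, checkins, m):
    """Link each venue to its m nearest venues, weighted by social similarity."""
    edges = {}
    for a in coords:
        nearest = sorted((b for b in coords if b != a),
                         key=lambda b: math.dist(coords[a], coords[b]))[:m]
        for b in nearest:
            s = cosine(checkins[a], checkins[b])
            if s > 0:  # assumed: venues with no shared visitors stay unconnected
                edges[tuple(sorted((a, b)))] = s
    return edges

def eigengap_k(eigenvalues):
    """Number of clusters at the largest gap between consecutive sorted eigenvalues."""
    lam = sorted(eigenvalues)
    gaps = [lam[i + 1] - lam[i] for i in range(len(lam) - 1)]
    return gaps.index(max(gaps)) + 1

# Toy venues, not data from the paper.
coords = {"cafe": (0.0, 0.0), "bar": (0.0, 1.0), "gym": (5.0, 5.0)}
checkins = {"cafe": {"u1": 2, "u2": 1}, "bar": {"u1": 1}, "gym": {"u3": 4}}
edges = build_graph(coords, checkins, m=1)
k = eigengap_k([0.0, 0.01, 0.02, 0.9, 1.0])
```

Restricting edges to geographic neighbors keeps clusters spatially contiguous, while the edge weights let venues with shared visitors pull together, which matches the combined geographic and social clustering described on the slide.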
51. Findings
Figure 1: The municipal borders (black) and Livehoods for Shadyside/East Liberty (Left) and Lawrenceville/Polish Hill (Right).
52. Evaluation
Found livehoods that split, spilled, or corresponded with municipal areas
Interviews with 27 residents
– Identified and drew their neighborhood
– Shown a map with municipal boundaries, asked if they could identify borders that were not accurate ('in flux')
– Shown algorithm's results, asked for feedback
53. Evaluation Results
• Not very rigorous, but interesting...
• Most discovered livehoods matched boundaries
• When not, the interviewees usually agreed & offered explanations
• One controversial case
• Poor area missing (no smartphones?)
55. Summary
goal: discover livehoods in cities
data: 43k Foursquare check-ins
methodology: combined geographic & social clustering
impact: showed a case where we can use online data to understand offline activity
reproducibility: some data are online, results are online
56. Discussion
What do you think?
What would you do differently?
57. Schedule
• Today: overview
• February 2nd: discuss literature (Aris)
• February 9th: discuss literature (Michael)
• February 23rd: students present project proposals
• March 30th: students submit progress report
• March 30th & April 6th: intermediate presentations
• May 4th & May 11th: final presentations
• May 15th: final report due
58. Proposals
• 5 min presentations
• what do you intend to do?
• why?
• what data will you use?
• what techniques?
• how will you evaluate?
• do you plan to publish your code and data?