Mining the Social Web

Aristides Gionis
Michael Mathioudakis
firstname.lastname@aalto.fi

Aalto University
Spring 2015
Mining the Social Web - Aalto - 2015
T-61.6020: Mining the social web — lecture #2
structure of social networks
social networks and social-media data can be
represented as graphs (or networks)
what do these graphs look like?
what is their structure?
data contain additional information
(actions, interactions, dynamics, attributes,…)
mining this additional information as part of
the network structure
community structure in social networks
Community structure
dolphins network and its NCP
(source [Leskovec et al., 2009])
Frieze, Gionis, Tsourakakis: Algorithmic Techniques for Modeling and Mining Large Graphs
Previously on T-61.6020

structure and dynamics of social networks

what does a social network look like?
how do social networks evolve over time?
how does information spread?
do users influence each other?
Figure 4: Top 50 threads in the news cycle with highest volume for the period Aug. 1 – Oct. 31, 2008. Each thread consists of all news
articles and blog posts containing a textual variant of a particular quoted phrase. (Phrase variants for the two largest threads in
each week are shown as labels pointing to the corresponding thread.) The data is drawn as a stacked plot in which the thickness of the
strand corresponding to each thread indicates its volume over time. Interactive visualization is available at http://memetracker.org.
Figure 5: Temporal dynamics of top threads as generated by our model. Only two ingredients, namely imitation and a preference to
recent threads, are enough to qualitatively reproduce the observed dynamics of the news cycle.
3. GLOBAL ANALYSIS: TEMPORAL VARIATION AND A PROBABILISTIC MODEL
periods when the upper envelope of the curve are high correspond
to times when there is a greater degree of convergence on key stories,
while the low periods indicate that attention is more diffuse,
threads dynamics
Today

politics
does network structure reflect political divisions?
can we infer political affiliation?

financial sentiment
can twitter predict the stock market?

urban computing
what does online activity say about how we live in cities?
Politics Online

www & ‘democratization’ of information

citizen journalism
(e.g. Haiti earthquake, Arab spring)

politicians and traditional media
participate, too
US Politics & the Web

Websites - 1996
(Email) - 1998
Online Fund raising - 2000
Blogs - 2004
Twitter & FB - 2008
Jesse Ventura - MN governor
social media mining
vs
traditional political polls

new content
& biases
International Workshop on Link Discovery, 2005
The Political Blogosphere and the 2004 U.S. Election:
Divided They Blog
Lada Adamic
HP Labs
1501 Page Mill Road
Palo Alto, CA 94304
lada.adamic@hp.com
Natalie Glance
Intelliseek Applied Research Center
5001 Baum Blvd.
Pittsburgh, PA 15217
nglance@intelliseek.com
4 March 2005
Abstract
In this paper, we study the linking patterns and discussion topics of political bloggers.
Our aim is to measure the degree of interaction between liberal and conservative blogs, and
to uncover any differences in the structure of the two communities. Specifically, we analyze
the posts of 40 “A-list” blogs over the period of two months preceding the U.S. Presidential
Election of 2004, to study how often they referred to one another and to quantify the overlap in
the topics they discussed, both within the liberal and conservative communities, and also across
communities. We also study a single day snapshot of over 1,000 political blogs. This snapshot
captures blogrolls (the list of links to other blogs frequently found in sidebars), and presents
a more static picture of a broader blogosphere. Most significantly, we find differences in the
behavior of liberal and conservative blogs, with conservative blogs linking to each other more
frequently and in a denser pattern.
1 Introduction
The 2004 U.S. Presidential Election was the first Presidential Election in the United States in which
blogging played an important role. Although the term weblog was coined in 1997, it was not until
after 9/11 that blogs gained readership and influence in the U.S. The next major trend in political
blogging was “warblogging”: blogs centered around discussion of the invasion of Iraq by the U.S.
The year 2004 saw a rapid rise in the popularity and proliferation of blogs. According to a
report from the Pew Internet & American Life Project published in January 2005, 32 million U.S.
Some History

before facebook and twitter,
there were blogs

rise after 9/11
‘war-blogging’

2004
32 million US citizens read blogs
62% of US citizens do not know them
Jan 25, 2004
everyone talks with everyone?...
or ‘echo chambers’?

In This Work...

task: extract network structure of political interactions
question: one or separate communities?

...previous evidence on two blogs (Instapundit and Atrios) shows
‘neighborhoods’ have no overlap in cited urls
...same with book purchases on amazon.com

on the other hand...
... it is now easier to interact
Data

used blog directories for lists of political blogs
parsed front page for links to discover more blogs
labeled them manually

only liberal & conservative

759 liberal & 735 conservative blogs
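The discovery step (parse a blog's front page for links to find more blogs) can be sketched roughly as follows. This is a minimal illustration using only Python's standard library; the class and function names and the sample HTML are hypothetical, and the crawler actually used in the study is not described in these slides.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def outgoing_hosts(page_url, html):
    """Return the set of external hostnames linked from one front page."""
    parser = LinkCollector()
    parser.feed(html)
    own_host = urlparse(page_url).netloc
    hosts = set()
    for href in parser.hrefs:
        # Resolve relative links against the page URL before comparing hosts.
        host = urlparse(urljoin(page_url, href)).netloc
        if host and host != own_host:
            hosts.add(host)
    return hosts


# Hypothetical front page, for illustration only.
sample = '<a href="/about">About</a> <a href="http://blog-b.example/post/1">B</a>'
print(outgoing_hosts("http://blog-a.example/", sample))
```

Each newly found host would still need the manual labeling step (liberal or conservative) before it enters the network.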
  
Findings	
  
Figure 1: Community structure of political blogs (expanded set), shown using a GEM
layout [11] in the GUESS [3] visualization and analysis tool. The colors reflect political orientation,
red for conservative, and blue for liberal. Orange links go from liberal to conservative, and purple
Findings	
  
[Figure: cumulative distribution of incoming links for political blogs, separated by category (conservative, liberal), on log-log axes. x-axis: incoming links (k); y-axis: fraction of blogs with at least k links. A lognormal, shown as a dashed line, is a fairly good fit. A power law with an exponential cutoff, k^(-0.36) e^(-k/57), shown as a solid line, is an even better fit.]
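The quantity on the vertical axis of this plot, the fraction of blogs with at least k incoming links, can be computed from a list of in-degrees as sketched below. The function name and the in-degree values are made up for illustration; they are not the study's data.

```python
from collections import Counter


def fraction_with_at_least(in_degrees):
    """Map each observed k to the fraction of blogs with at least k incoming links."""
    n = len(in_degrees)
    counts = Counter(in_degrees)
    result = {}
    remaining = n  # number of blogs whose in-degree is >= the current k
    for k in sorted(counts):
        result[k] = remaining / n
        remaining -= counts[k]
    return result


# Made-up in-degrees for five hypothetical blogs.
ccdf = fraction_with_at_least([1, 1, 2, 5, 40])
for k, frac in ccdf.items():
    print(k, frac)
```

Plotting these pairs on log-log axes gives the kind of curve the figure shows, where a few blogs sit far out in the tail.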
This is on par with the 277 links received by the most linked to conservative
[Figure: link network among the 40 blogs listed below; nodes are numbered as in the list.]
1 Digbys Blog
2 James Walcott
3 Pandagon
4 blog.johnkerry.com
5 Oliver Willis
6 America Blog
7 Crooked Timber
8 Daily Kos
9 American Prospect
10 Eschaton
11 Wonkette
12 Talk Left
13 Political Wire
14 Talking Points Memo
15 Matthew Yglesias
16 Washington Monthly
17 MyDD
18 Juan Cole
19 Left Coaster
20 Bradford DeLong
21 JawaReport
22 Voka Pundit
23 Roger L Simon
24 Tim Blair
25 Andrew Sullivan
26 Instapundit
27 Blogs for Bush
28 Little Green Footballs
29 Belmont Club
30 Captain’s Quarters
31 Powerline
32 Hugh Hewitt
33 INDC Journal
34 Real Clear Politics
35 Winds of Change
36 Allahpundit
37 Michelle Malkin
38 WizBang
39 Dean’s World
40 Volokh
[Figure panels: (A), (B), (C)]
similar picture for top 20 blogs

small number of blogs
attract most links
[Figure 4, bar chart. x-axis: # citations from weblog posts (0 to 1600), split into Left and Right. News sources shown: nytimes.com, washingtonpost.com, news.yahoo.com, msnbc.msn.com, nationalreview.com, cnn.com, latimes.com, boston.com, usatoday.com, washingtontimes.com, apnews.myway.com, guardian.co.uk, foxnews.com, cbsnews.com, slate.msn.com, nypost.com, news.bbc.co.uk, tnr.com, opinionjournal.com, online.wsj.com, salon.com.]
Findings	
  
Figure 4: Most linked to news sources by the top 20 conservative and top 20 liberal blogs during
8/29/2004 - 11/15/2004.
1. CBS News poll of uncommitted voters shows Kerry winning 43% to 28%
2. Sun Times article: Bob Novak predicts that George Bush will retreat from Iraq if reelected
3. CBS News article on forged memos
4. New York Daily News article on Osama Bin Laden videotape, “gift” for the President
5. Time Magazine poll: Bush opens double-digit lead on post convention bounce
In contrast, the top news articles cited by right leaning bloggers are:
1. CBS News article on forged memos
2. Time Magazine poll: Bush opens double-digit lead on post convention bounce
3. National Review article refuting the case about missing explosives
4. ABC News article refuting the case about missing explosives
5. Washington Post article reporting on Kerry’s proposal to allow Iran to keep its nuclear power
plants in exchange for giving up the right to retain the nuclear fuel that could be used for
bomb-making
A time series chart further shows how quickly and strongly conservative bloggers responded to
forged CBS documents (Figure 5). The conservative bloggers saw Dan Rather’s report as an attempt
by the left to discredit President Bush. They acted quickly to debunk the report, with the charge
led by PowerLine and seconded by Wizbangblog and others. In contrast, the pick-up among liberal
bloggers occurred later, with lower volume. The most vocal left leaning bloggers on the subject were
TalkLeft and AMERICAblog.
top 20 blogs
news links
[Figure 7, bar chart. x-axis: number of blogs linking (0 to 140), split into Left and Right. Sources shown: buzzflash.com, cursor.org, mediamatters.org, commondreams.org, alternet.org, airamericaradio.co, salon.com, thenation.com, theonion.com, guardian.co.uk, nytimes.com, news.google.com, washingtonpost.com, cnn.com, foxnews.com, weeklystandard.com, command-post.org, townhall.com, opinionjournal.com, nationalreview.com.]
Figure 7: Most linked to news sources (online and off), showing proportionally how many liberal
and conservative blogs link to them.
Findings

all blogs
links to other websites
Summary

question: one or separate communities?
data: 1.5k blogs, manually labeled
methodology: simple link analysis
impact: showed divide in ‘online’ world
reproducibility: ?
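One simple form of the link analysis summarized above is to count how many blog-to-blog links stay inside a community and how many cross between them. The sketch below assumes links are available as (source, target) pairs and labels as a dictionary; the function name and the toy data are invented for illustration.

```python
def link_mix(edges, label):
    """Count links that stay within a community versus links that cross.

    edges: iterable of (source_blog, target_blog) pairs.
    label: dict mapping blog -> "liberal" or "conservative".
    Returns (within, across, share of labeled links that stay within).
    """
    within, across = 0, 0
    for src, dst in edges:
        if src not in label or dst not in label:
            continue  # skip links to unlabeled sites
        if label[src] == label[dst]:
            within += 1
        else:
            across += 1
    total = within + across
    return within, across, (within / total if total else 0.0)


# Toy data, invented for illustration.
label = {"a": "liberal", "b": "liberal", "c": "conservative", "d": "conservative"}
edges = [("a", "b"), ("b", "a"), ("c", "d"), ("a", "c"), ("d", "news-site")]
print(link_mix(edges, label))
```

A within-community share close to 1 would point to separate communities; a share close to what random linking would give points to a single one.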
  
Political Polarization on Twitter
M. D. Conover, J. Ratkiewicz, M. Francisco, B. Gonçalves, A. Flammini, F. Menczer
Center for Complex Networks and Systems Research
School of Informatics and Computing
Indiana University, Bloomington, IN, USA
Abstract
In this study we investigate how social media shape the
networked public sphere and facilitate communication be-
tween communities with different political orientations. We
examine two networks of political communication on Twit-
ter, comprised of more than 250,000 tweets from the six
weeks leading up to the 2010 U.S. congressional midterm
elections. Using a combination of network clustering algo-
rithms and manually-annotated data we demonstrate that the
network of political retweets exhibits a highly segregated par-
tisan structure, with extremely limited connectivity between
left- and right-leaning users. Surprisingly this is not the case
for the user-to-user mention network, which is dominated by
a single politically heterogeneous cluster of users in which
ideologically-opposed individuals interact at a much higher
rate compared to the network of retweets. To explain the dis-
tinct topologies of the retweet and mention networks we con-
jecture that politically motivated individuals provoke inter-
action by injecting partisan content into information streams
whose primary audience consists of ideologically-opposed
users. We conclude with statistical evidence in support of this
hypothesis.
1 Introduction
Social media play an important role in shaping political dis-
course in the U.S. and around the world (Bennett 2003;
Benkler 2006; Sunstein 2007; Farrell and Drezner 2008;
Aday et al. 2010; Tumasjan et al. 2010; O’Connor et al.
2010). According to the Pew Internet and American Life
Gallo, and Kane (2007). Consumers of online political in-
formation tend to behave similarly, choosing to read blogs
that share their political beliefs, with 26% more users do-
ing so in 2008 than 2004 (Pew Internet and American Life
Project 2008).
In its own right, the formation of online communities is
not necessarily a serious problem. The concern is that when
politically active individuals can avoid people and informa-
tion they would not have chosen in advance, their opinions
are likely to become increasingly extreme as a result of being
exposed to more homogeneous viewpoints and fewer credi-
ble opposing opinions. The implications for the political pro-
cess in this case are clear. A deliberative democracy relies on
a broadly informed public and a healthy ecosystem of com-
peting ideas. If individuals are exposed exclusively to people
or facts that reinforce their pre-existing beliefs, democracy
suffers (Sunstein 2002; 2007).
In this study we examine networks of political commu-
nication on the Twitter microblogging service during the
six weeks prior to the 2010 U.S. midterm elections. Sam-
pling data from the Twitter ‘gardenhose’ API, we identi-
fied 250,000 politically relevant messages (tweets) produced
by more than 45,000 users. From these tweets we isolated
two networks of political communication — the retweet
network, in which users are connected if one has rebroad-
cast content produced by another, and the mention network,
where users are connected if one has mentioned another in a
post, including the case of tweet replies.
We demonstrate that the retweet network exhibits a highly
Proceedings of the Fifth International AAAI Conference on Weblogs and Social Media
International Conference on Weblogs and Social Media (ICWSM) 2011
In This Work...

question
check if same pattern holds in twitter

however
retweets and mentions instead of links
Data	
  
6 weeks of tweets before US 2010 elections

to distinguish political tweets
1. parse all tweets (355 million)
2. select subset with hashtags #p2 and #tcot
3. find frequently co-occurring hashtags (66)
4. keep tweets that contain those hashtags
>> 250k tweets
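Steps 2 to 4 can be sketched as below. The slides only say "frequently co-occurring"; using the Jaccard overlap between the tweet sets of a hashtag and a seed, and the particular threshold value, are assumptions of this sketch, as are the function names and the sample tweets.

```python
import re
from collections import defaultdict

HASHTAG = re.compile(r"#\w+")


def expand_hashtags(tweets, seeds, threshold):
    """Return the seeds plus every hashtag whose Jaccard overlap with a seed exceeds threshold.

    Jaccard(s, h) = |tweets with both s and h| / |tweets with s or h|.
    """
    tweets_with = defaultdict(set)  # hashtag -> ids of tweets containing it
    for i, text in enumerate(tweets):
        for tag in HASHTAG.findall(text.lower()):
            tweets_with[tag].add(i)

    selected = set(seeds)
    for tag, ids in tweets_with.items():
        for seed in seeds:
            seed_ids = tweets_with.get(seed, set())
            union = ids | seed_ids
            if union and len(ids & seed_ids) / len(union) > threshold:
                selected.add(tag)
    return selected


def political_tweets(tweets, hashtags):
    """Keep tweets that contain at least one selected hashtag."""
    return [t for t in tweets if set(HASHTAG.findall(t.lower())) & hashtags]


# Invented tweets, for illustration only.
tweets = [
    "Vote tomorrow #p2 #dems",
    "Town hall tonight #tcot #teaparty",
    "Early voting is open #dems",
    "Nice weather #sunny",
]
tags = expand_hashtags(tweets, seeds={"#p2", "#tcot"}, threshold=0.3)
print(sorted(tags))
print(len(political_tweets(tweets, tags)))
```

Starting from one left-leaning and one right-leaning seed keeps the expansion from favoring one side, which seems to be the reason for choosing #p2 and #tcot together.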
  
The major contributions of this work are:
• Creation and release of a network and text dataset derived
from more than 250,000 politically-related Twitter posts
authored in the weeks preceding the 2010 U.S. midterm
elections (§ 2).
• Cluster analysis of networks derived from this corpus
showing that the network of retweet exhibits clear seg-
regation, while the mention network is dominated by a
single large community (§ 3.1).
• Manual classification of Twitter users by political align-
ment, demonstrating that the retweet network clusters cor-
respond to the political left and right. These data also
show the mention network to be politically heteroge-
neous, with users of opposing political views interacting
at a much higher rate than in the retweet network (§ 3.3).
• An interpretation of the observed community structures
Table 1: Hashtags related to #p2, #tcot, or both. Tweets containing any of these were included in our sample.
Just #p2:   #casen #dadt #dc10210 #democrats #du1 #fem2 #gotv #kysen #lgf #ofa #onenation #p2b #pledge #rebelleft #truthout #vote #vote2010 #whyimvotingdemocrat #youcut
Both:       #cspj #dem #dems #desen #gop #hcr #nvsen #obama #ocra #p2 #p21 #phnm #politics #sgp #tcot #teaparty #tlot #topprog #tpp #twisters #votedem
Just #tcot: #912 #ampat #ftrs #glennbeck #hhrs #iamthemob #ma04 #mapoli #palin #palin12 #spwbt #tsot #tweetcongress #ucot #wethepeople
Table 2: Hashtags excluded from the analysis due to ambigu-
Extract Networks
  
node u, node v
v retweets u
u mentions v

two networks
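The two edge lists can be built from raw tweets roughly as follows. The (author, text) input format, the "RT @user" convention for spotting retweets, and the choice not to count a retweet as a mention as well are assumptions of this sketch, not details taken from the slides.

```python
import re

RETWEET = re.compile(r"^RT @(\w+)")
MENTION = re.compile(r"@(\w+)")


def extract_edges(tweets):
    """Build directed edge lists (u, v) for the retweet and mention networks.

    tweets: iterable of (author, text) pairs.
    Retweet edge (u, v): v retweeted content produced by u.
    Mention edge (u, v): u mentioned v in a tweet.
    """
    retweet_edges, mention_edges = [], []
    for author, text in tweets:
        rt = RETWEET.match(text)
        if rt:
            retweet_edges.append((rt.group(1).lower(), author.lower()))
            continue  # assumption: a retweet is not also counted as a mention
        for name in MENTION.findall(text):
            mention_edges.append((author.lower(), name.lower()))
    return retweet_edges, mention_edges


# Invented tweets, for illustration only.
tweets = [
    ("bob", "RT @alice: polls open at 7am #p2"),
    ("carol", "@alice do you have a source for that? #tcot"),
]
print(extract_edges(tweets))
```

Keeping the two lists separate matters here, because the analysis compares the structure of the retweet network with that of the mention network.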
  
clustering
1. max modularity for two clusters
   to assign initial labels to nodes (a and b)
2. label propagation
   re-assign nodes to label of most neighbors
   until no change
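Step 2 can be sketched as below. Step 1 (the modularity-maximizing split into two clusters) is not implemented; the starting labels are simply passed in. The update order, the tie rule, and the cap on the number of rounds are choices made for this sketch and may differ from the original procedure.

```python
from collections import Counter


def propagate_labels(neighbors, initial, max_rounds=100):
    """Re-assign each node to the label held by most of its neighbors until stable.

    neighbors: dict node -> list of adjacent nodes (undirected graph).
    initial: dict node -> starting label ("a" or "b"), e.g. from a modularity split.
    Ties keep the node's current label; updates are applied in place, node by node.
    """
    labels = dict(initial)
    for _ in range(max_rounds):
        changed = False
        for node in sorted(neighbors):
            counts = Counter(labels[n] for n in neighbors[node])
            if not counts:
                continue  # isolated node keeps its label
            top = max(counts.values())
            best = {lab for lab, c in counts.items() if c == top}
            if labels[node] not in best:
                labels[node] = min(best)  # deterministic choice among tied labels
                changed = True
        if not changed:
            break
    return labels


# Toy graph: two triangles joined by one edge; node 3 starts with the "wrong" label.
neighbors = {
    1: [2, 3], 2: [1, 3], 3: [1, 2, 4],
    4: [3, 5, 6], 5: [4, 6], 6: [4, 5],
}
initial = {1: "a", 2: "a", 3: "b", 4: "b", 5: "b", 6: "b"}
print(propagate_labels(neighbors, initial))
```

In the toy graph node 3 moves to label "a" because two of its three neighbors carry it, after which no node changes and the loop stops.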
  
Findings	
  
Figure 1: The political retweet (left) and mention (right) networks, laid out using a force-directed algorithm. Node colors reflect
cluster assignments (see § 3.1). Community structure is evident in the retweet network, but less so in the mention network. We
show in § 3.3 that in the retweet network, the red cluster A is made of 93% right-leaning users, while the blue cluster B is made
of 80% left-leaning users.
tive Twitter users. This structural difference is of particular
importance with respect to political communication, as we
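now have statistical evidence to suggest that mentions and
[Table residue, columns Retweet and Mention: A↔A 0.31 0.31]
[Plot residue: log-scale axis (10^0 to 10^1), axis label ending in "s(a,b))", legend: Cluster A, Cluster B, Different clusters]
retweets    mentions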
separate clusters for retweets
one big cluster for mentions
Summary

question: does network structure reflect political divisions?
data: 250k political tweets before US 2010 elections
methodology: clustering & label propagation
impact: showed different modes of interaction between
individuals in same & different sides
reproducibility: data are online:
http://truthy.indiana.edu/projects/data-and-software.html
This network graph details the landscape of Twitter handles
responding to the UNRWA school bombing.
Source: https://medium.com/i-data/israel-gaza-war-data-a54969aeb23e
Instagram co-tag graph, highlighting three distinct topical
communities: 1) pro-Israeli (Orange), 2) pro-Palestinian
(Yellow), and 3) Religious / Muslim (Purple)
Source: https://medium.com/i-data/israel-gaza-war-data-a54969aeb23e
International Conference on Weblogs and Social Media (ICWSM) 2013
Classifying Political Orientation on Twitter: It’s Not Easy!
Raviv Cohen and Derek Ruths
School of Computer Science
McGill University
raviv.cohen@mail.mcgill.ca, derek.ruths@mcgill.ca
Abstract
Numerous papers have reported great success at infer-
ring the political orientation of Twitter users. This paper
has some unfortunate news to deliver: while past work
has been sound and often methodologically novel, we
have discovered that reported accuracies have been sys-
temically overoptimistic due to the way in which vali-
dation datasets have been collected, reporting accuracy
levels nearly 30% higher than can be expected in popu-
lations of general Twitter users.
Using careful and novel data collection and annotation
techniques, we collected three different sets of Twitter
users, each characterizing a different degree of political
engagement on Twitter — from politicians (highly po-
litically vocal) to “normal” users (those who rarely dis-
cuss politics). Applying standard techniques for infer-
ring political orientation, we show that methods which
previously reported greater than 90% inference accu-
racy, actually achieve barely 65% accuracy on normal
users. We also show that classifiers cannot be used to
classify users outside the narrow range of political ori-
entation on which they were trained.
While a sobering finding, our results quantify and call
attention to overlooked problems in the latent attribute
inference literature that, no doubt, extend beyond polit-
ical orientation inference: the way in which datasets are
assembled and the transferability of classifiers.
Introduction
Much of the promise of online social media studies, analyt-
ics, and commerce depends on knowing various attributes
of individual and groups of users. For a variety of reasons,
few intrinsic attributes of individuals are explicitly revealed
in their user account profiles. As a result, latent attribute in-
ference, the computational discovery of “hidden” attributes,
has become a topic of significant interest among social me-
including gender, age, education, political orientation, and
even coffee preferences (Zamal, Liu, and Ruths 2012;
Conover et al. 2011b; 2011a; Rao and Yarowsky 2010;
Pennacchiotti and Popescu 2011; Wong et al. 2013; Liu and
Ruths 2013; Golbeck and Hansen 2011; Burger, Henderson,
and Zarrella 2011). In general, inference algorithms have
achieved accuracy rates in the range of 85%, but have strug-
gled to improve beyond this point. To date, the great suc-
cess story of this area is political orientation inference for
which a number of papers have boasted inference accuracy
reaching and even surpassing 90% (Conover et al. 2011b;
Zamal, Liu, and Ruths 2012).
By any reasonable measure, the existing work on political
orientation is sound and represents a sincere and successful
effort to advance the technology of latent attribute inference.
Furthermore, a number of the works have yielded notable
insights into the nature of political orientation in online
environments (Conover et al. 2011b; 2011a). In this paper, we
examine the question of whether existing political orientation
inference systems actually perform as well as reported on
the general Twitter population. Our findings indicate that,
without exception, they do not, even when the general
population considered is restricted only to those who discuss
politics (since inferring the political orientation of a user who
never speaks about politics is, certainly, very hard if not
impossible).
We consider this an important question and finding for
two reasons. Foremost, nearly all applications of latent at-
tribute inference involve its use on large populations of un-
known users. As a result, quantifying its performance on the
general Twitter population is arguably the best way of eval-
uating its practical utility. Second, the existing literature on
this topic reports its accuracy in inferring political orienta-
tion without qualification or caveats (author’s note: includ-
In This Work...

task: classify political orientation on twitter
goal: investigate claims of previous work

previous work
high classification accuracy
but was the task too easy?
heavily political user accounts
what about ‘modestly political’ users?
Data

3 datasets
Political Figures
Politically Active
Politically Modest
Table 1: Basic statistics on the different datasets used. Total size
of the Figures dataset was limited by the number of federal level
politicians; size of the Modest dataset was limited by the number
of users that satisfied our stringent conditions - these were culled
from a dataset of 10,000 random individuals.
Dataset   Republicans   Democrats   Total
Figures   203           194         397
Active    860           977         1837
Modest    105           157         262
Conover   107           89          196
Data

Political Figures
twitter accounts for US governors & congressmen
latest 1000 tweets

also...
partitioned used hashtags
into politically discriminative & neutral
Data

Politically Active
self-declared, according to profile
democrats/liberals, conservatives/republicans
manual inspection
only US residents

political topics not dominating their tweets
Data

Politically Modest
>> 10000 random users

filter ones that use politically neutral hashtag
>> 1500 users

use amazon turk to classify
based on 10 tweets with hashtags
democrats/liberals vs republicans/conservatives
>> 327 classified users
Classification

SVM with 10-fold cross validation

Features (from previous work):
– tweet/rt/hashtag/link/mention frequencies,
– number of friends, followers
– usage of top-k Rep/Dem 1-2-3-grams, hashtags
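The evaluation protocol (10-fold cross validation) can be sketched independently of the classifier. In the sketch below the classifier is a deliberately trivial stand-in, a threshold on one numeric feature, and not an SVM; the function names, the single feature, and the data are all invented for illustration.

```python
import random


def k_fold_accuracy(examples, train, predict, k=10, seed=0):
    """Average accuracy over k folds.

    examples: list of (features, label) pairs.
    train: function(list of examples) -> model.
    predict: function(model, features) -> label.
    """
    data = list(examples)
    random.Random(seed).shuffle(data)
    folds = [data[i::k] for i in range(k)]  # k disjoint, near-equal folds
    accuracies = []
    for i, held_out in enumerate(folds):
        training = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
        model = train(training)
        correct = sum(predict(model, x) == y for x, y in held_out)
        accuracies.append(correct / len(held_out))
    return sum(accuracies) / k


# Stand-in classifier (NOT an SVM): threshold on a single numeric feature,
# placed midway between the two class means.
def train_threshold(training):
    rep = [x for x, y in training if y == "R"]
    dem = [x for x, y in training if y == "D"]
    return (sum(rep) / len(rep) + sum(dem) / len(dem)) / 2


def predict_threshold(threshold, x):
    return "R" if x > threshold else "D"


# Invented, perfectly separable data: one feature per user.
examples = [(i, "D") for i in range(20)] + [(i, "R") for i in range(100, 120)]
print(k_fold_accuracy(examples, train_threshold, predict_threshold))
```

Because `train` and `predict` are passed in, the same loop could wrap an SVM from a machine-learning library, and training on one dataset while predicting on another only requires replacing the fold split with two separate lists.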
  	
  
Findings	
  
The average of ten 10-fold cross-validation SVM itera-
est was performed on each one of our datasets respec-
Dataset   SVM Accuracy
Figures   91%
Active    84%
Modest    68%
Conover   87%
Table 6: Performance results of training our SVM on one dataset
and inferring on another; italicized are the averaged 10-fold
cross-validation results (the diagonal entries; italics are lost in this text version)
Dataset   Figures   Active   Modest
Figures   91%       72%      66%
Active    62%       84%      69%
Modest    54%       57%      68%
cross-dataset classification
  
Summary

goal: study classification accuracy for political leaning
data: targeted twitter sample
methodology: partitioned subsets of data, AMT, SVM
impact: showed difficulty of task
reproducibility: data available per request -
http://www.icwsm.org/2015/datasets/datasets/
Twitter mood predicts the stock market.
Johan Bollen (1,*), Huina Mao (1,*), Xiao-Jun Zeng (2).
*: authors made equal contributions.
Abstract—Behavioral economics tells us that emotions can
profoundly affect individual behavior and decision-making. Does
this also apply to societies at large, i.e. can societies experience
mood states that affect their collective decision making? By
extension is the public mood correlated or even predictive of
economic indicators? Here we investigate whether measurements
of collective mood states derived from large-scale Twitter feeds
are correlated to the value of the Dow Jones Industrial Average
(DJIA) over time. We analyze the text content of daily Twitter
feeds by two mood tracking tools, namely OpinionFinder that
measures positive vs. negative mood and Google-Profile of Mood
States (GPOMS) that measures mood in terms of 6 dimensions
(Calm, Alert, Sure, Vital, Kind, and Happy). We cross-validate
the resulting mood time series by comparing their ability to
detect the public’s response to the presidential election and
Thanksgiving day in 2008. A Granger causality analysis and
a Self-Organizing Fuzzy Neural Network are then used to
investigate the hypothesis that public mood states, as measured by
the OpinionFinder and GPOMS mood time series, are predictive
of changes in DJIA closing values. Our results indicate that the
accuracy of DJIA predictions can be significantly improved by
the inclusion of specific public mood dimensions but not others.
We find an accuracy of 87.6% in predicting the daily up and
down changes in the closing values of the DJIA and a reduction
of the Mean Average Percentage Error by more than 6%.
Index Terms—stock market prediction — twitter — mood
analysis.
I. INTRODUCTION
STOCK market prediction has attracted much attention
from academia as well as business. But can the stock
market really be predicted? Early research on stock market
prediction [1], [2], [3] was based on random walk theory
and the Efficient Market Hypothesis (EMH) [4]. According
to the EMH stock market prices are largely driven by new [...]
[...] sentiment from blogs. In addition, Google search queries have
been shown to provide early indicators of disease infection
rates and consumer spending [14]. [9] investigates the relations
between breaking financial news and stock price changes.
Most recently [13] provide a ground-breaking demonstration
of how public sentiment related to movies, as expressed on
Twitter, can actually predict box office receipts.
Although news most certainly influences stock market
prices, public mood states or sentiment may play an equally
important role. We know from psychological research that
emotions, in addition to information, play a significant role
in human decision-making [16], [18], [39]. Behavioral finance
has provided further proof that financial decisions are sig-
nificantly driven by emotion and mood [19]. It is therefore
reasonable to assume that the public mood and sentiment can
drive stock market values as much as news. This is supported
by recent research by [10] who extract an indicator of public
anxiety from LiveJournal posts and investigate whether its
variations can predict S&P500 values.
However, if it is our goal to study how public mood
influences the stock markets, we need reliable, scalable and
early assessments of the public mood at a time-scale and
resolution appropriate for practical stock market prediction.
Large surveys of public mood over representative samples of
the population are generally expensive and time-consuming
to conduct, cf. Gallup’s opinion polls and various consumer
and well-being indices. Some have therefore proposed indirect
assessment of public mood or sentiment from the results of
soccer games [20] and from weather conditions [21]. The
accuracy of these methods is however limited by the low
degree to which the chosen indicators are expected to be
correlated with public mood.
arXiv:1010.3003v1 [cs.CE] 14 Oct 2010
Journal of Computational Science - 2011
In This Work...
Efficient Market Hypothesis
‘you can’t beat the market’
all information is already taken into account

But maybe Twitter can?
measure the people’s mood as reflected on Twitter
use for prediction

Data
about 10M tweets, February to December 2008

time-series of sentiment scores
OpinionFinder: positive - negative scale
POMS: lexicon-based mood score
 > calm, alert, sure, vital, kind, happy

time-series of DJIA from Yahoo! Finance
Dow Jones Industrial Average

[Figure: DJIA daily closing value (March 2008−December 2008), monthly axis from Mar to Dec.]

[Cropped excerpt from the paper; only fragments are legible. The OpinionFinder (OF) and GPOMS time series are normalized to z-scores (Eq. 1) using the local mean and standard deviation within a sliding window of k days before and after each date, i.e. the period [t−k, t+k], so that each series fluctuates around a zero mean on a scale of one standard deviation. The normalized series are then checked against events that may have had an effect on public mood: the presidential election (November 4, 2008) and Thanksgiving.]
[Figure: z-scores of the OpinionFinder, CALM, ALERT, SURE, VITAL, KIND and HAPPY time series, Oct 22 to Nov 26, annotated with "pre-election anxiety", "election results", "day after election", "pre-election energy" and "Thanksgiving happiness".]
Fig. 2. Tracking public mood states from tweets posted between October
2008 to December 2008 shows public responses to presidential election and Thanksgiving.
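The local z-score normalization referred to as Eq. 1 can be sketched as follows. This is a minimal illustration rather than the authors' code: the function name `local_zscores`, the use of the population standard deviation, the truncation of the window at the ends of the series, and the sample values are all assumptions.

```python
from statistics import mean, pstdev

def local_zscores(series, k):
    """Z-score each value against the mean and standard deviation
    of the window [t-k, t+k] (truncated at the ends of the series)."""
    scores = []
    for t, value in enumerate(series):
        window = series[max(0, t - k): t + k + 1]
        mu = mean(window)
        sigma = pstdev(window)  # assumption: population standard deviation
        # A flat window has zero spread; report 0.0 instead of dividing by zero.
        scores.append((value - mu) / sigma if sigma > 0 else 0.0)
    return scores

daily_mood = [0.2, 0.4, 0.1, 0.9, 0.3, 0.2, 0.5]  # made-up daily mood scores
z = local_zscores(daily_mood, k=2)
small = local_zscores([1, 2, 3], k=1)
```

Because each value is compared only with its own neighborhood, series measured on different scales (a mood index and DJIA changes, for instance) become comparable on one axis.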
[Cropped excerpt from the paper; only fragments are legible. It applies the econometric technique of Granger causality analysis to the daily mood time series vs. the DJIA, testing whether lagged values of one series carry predictive information about the other. The DJIA series D_t is defined to reflect daily changes, i.e. the delta between day t and day t−1.]
[...] more detail, we plot both time series in Fig. 3. To maintain
the same scale, we convert the DJIA delta values D_t and mood
index value X_t to z-scores as shown in Eq. 1.
[Figure: DJIA z-score and Calm z-score, Aug 09 to Oct 28, annotated with "bank bail-out".]
Fig. 3. A panel of three graphs. The top graph shows the overlap of the
day-to-day difference of DJIA values (blue: Z_{D_t}) with the GPOMS’ Calm
time series (red: Z_{X_t}) that has been lagged by 3 days. Where the two graphs
overlap the Calm time series predict changes in the DJIA closing values that [...]

Linear Correlation

[Cropped table and text from the paper; only fragments are legible. They report p-values for the relation between the OpinionFinder series (Y_OF) and the 6 GPOMS mood dimensions, with significance marked as in "p-value < 0.1: *".]
We perform the Granger causality analysis according to
models L1 and L2 shown in Eq. 3 and 4 for the period of
time between February 28 to November 3, 2008 to exclude
the exceptional public mood response to the Presidential
Election and Thanksgiving from the comparison. GPOMS and
OpinionFinder time series were produced for 342,255 tweets
in that period, and the daily Dow Jones Industrial Average
(DJIA) was retrieved from Yahoo! Finance for each day.
$$L_1: D_t = \alpha + \sum_{i=1}^{n} \beta_i D_{t-i} + \epsilon_t \quad (3)$$
$$L_2: D_t = \alpha + \sum_{i=1}^{n} \beta_i D_{t-i} + \sum_{i=1}^{n} \gamma_i X_{t-i} + \epsilon_t \quad (4)$$
Based on the results of our Granger causality (shown in
Table II), we can reject the null hypothesis that the mood time
series do not predict DJIA values, i.e. $\gamma_{\{1,2,\cdots,n\}} \neq 0$ with a
high level of confidence. However, this result only applies to
1 GPOMS mood dimension. We observe that X1 (i.e. Calm)
has the highest Granger causality relation with DJIA for lags [...]
Linear Correlation
two models, with and without sentiment scores
Basic Model: past values of the DJIA daily change
Enhanced Model: past values of the DJIA daily change and of the Mood Variables

TABLE II
STATISTICAL SIGNIFICANCE (P-VALUES) OF BIVARIATE GRANGER-CAUSALITY CORRELATION BETWEEN MOODS AND DJIA IN PERIOD FEBRUARY 28, 2008 TO NOVEMBER 3, 2008.
Lag     OF      Calm     Alert  Sure   Vital  Kind   Happy
1 day   0.085*  0.272    0.952  0.648  0.120  0.848  0.388
2 days  0.268   0.013**  0.973  0.811  0.369  0.991  0.7061
3 days  0.436   0.022**  0.981  0.349  0.418  0.991  0.723
4 days  0.218   0.030**  0.998  0.415  0.475  0.989  0.750
5 days  0.300   0.036**  0.989  0.544  0.553  0.996  0.173
6 days  0.446   0.065*   0.996  0.691  0.682  0.994  0.081*
7 days  0.620   0.157    0.999  0.381  0.713  0.999  0.150
(p-value < 0.05: **, p-value < 0.1: *)
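The comparison behind such a table can be sketched in a few lines: fit one regression on lagged DJIA changes only, another that also includes lagged mood values, and compare their residual sums of squares with an F statistic. The sketch below is an illustration under stated assumptions, not the authors' implementation: the helper names `ols_rss` and `granger_f` are hypothetical, the data are synthetic, and only the F statistic is computed (turning it into a p-value needs the F distribution, for example from a statistics library).

```python
import random

def ols_rss(rows, targets):
    """Residual sum of squares of a least-squares fit (normal equations)."""
    p = len(rows[0])
    xtx = [[sum(r[a] * r[b] for r in rows) for b in range(p)] for a in range(p)]
    xty = [sum(r[a] * y for r, y in zip(rows, targets)) for a in range(p)]
    aug = [xtx[a] + [xty[a]] for a in range(p)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(p):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    coef = [aug[a][p] / aug[a][a] for a in range(p)]
    return sum((y - sum(c * x for c, x in zip(coef, r))) ** 2
               for r, y in zip(rows, targets))

def granger_f(d, x, n):
    """F statistic comparing the model with n lags of d only
    against the model with n lags of d and n lags of x."""
    rows_basic, rows_full, targets = [], [], []
    for t in range(n, len(d)):
        d_lags = [d[t - i] for i in range(1, n + 1)]
        x_lags = [x[t - i] for i in range(1, n + 1)]
        rows_basic.append([1.0] + d_lags)
        rows_full.append([1.0] + d_lags + x_lags)
        targets.append(d[t])
    rss_basic = ols_rss(rows_basic, targets)
    rss_full = ols_rss(rows_full, targets)
    dof = len(targets) - (2 * n + 1)  # observations minus fitted parameters
    return ((rss_basic - rss_full) / n) / (rss_full / dof)

# Synthetic data in which x leads d by one day (an assumption for the demo).
random.seed(0)
x = [random.gauss(0, 1) for _ in range(200)]
d = [0.0] + [0.8 * x[t - 1] + random.gauss(0, 0.1) for t in range(1, 200)]
f_stat = granger_f(d, x, n=1)
```

A large F statistic means the lagged mood series reduces the prediction error well beyond what lagged DJIA values achieve alone; as the excerpt stresses, this indicates predictive information, not causation.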
[Cropped excerpt from the paper; only fragments are legible: "[...] dimension thus has predictive value with [...] In fact the p-value for this shorter period, [...] to October 30 2008, is significantly lower ([...] 0.009) than that listed in Table II for the [...] mood values of the past n days. [...] the results shown in Table II indicate [...] Granger causal relation between Calm [...]"]
Neural Network
[Cropped excerpt from the paper; only fragments are legible: "[...] of relevant economic indicators. These [...] for existing sentiment tracking tools [...] self-reported subjective well-being” in [...] the extent to which they experience [...]"]
[2] Fama, E. F. (1991) Journal of Finance 46, [...]
[3] H.Cootner, P. (1964) The random chara[...] (MIT).
[4] Fama, E. F. (1965) The Journal of Busines[...]
[5] Qian, Bo, Rasheed, & Khaled. (2007) App[...]
TABLE III
DJIA DAILY PREDICTION USING SOFNN
Evaluation     I_OF  I_0   I_1    I_1,2  I_1,3  I_1,4  I_1,5  I_1,6
MAPE (%)       1.95  1.94  1.83   2.03   2.13   2.05   1.85   1.79*
Direction (%)  73.3  73.3  86.7*  60.0   46.7   60.0   73.3   80.0
I_1: calm; I_1,6: calm + happy
inputs: values of 3 previous days
MAPE: Mean Average Percentage Error (less is better)
Direction: prediction of DJIA direction
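The two evaluation measures in the table can be sketched as below. The excerpt does not spell out their exact definitions, so this is one plausible reading: MAPE as the mean of absolute percentage errors, and direction accuracy as the share of days on which the predicted close moves the same way (up or down from the previous close) as the actual close. The function names and the sample prices are hypothetical.

```python
def mape(actual, predicted):
    """Mean of absolute percentage errors, in percent."""
    errors = [abs((a - p) / a) for a, p in zip(actual, predicted)]
    return 100.0 * sum(errors) / len(errors)

def direction_accuracy(previous, actual, predicted):
    """Percentage of days where prediction and reality agree on
    whether the close is above the previous day's close."""
    hits = sum(((a - prev) > 0) == ((p - prev) > 0)
               for prev, a, p in zip(previous, actual, predicted))
    return 100.0 * hits / len(actual)

previous_close = [100.0, 102.0, 101.0]   # made-up closing values
actual_close = [102.0, 101.0, 103.0]
predicted_close = [101.0, 103.0, 104.0]

mape_value = mape(actual_close, predicted_close)
direction_value = direction_accuracy(previous_close, actual_close, predicted_close)
```

The two measures can disagree: a model may be close in value yet wrong about the sign of the change, which is why the table reports both.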
  
Summary
question: does Twitter mood predict the stock market?
data: 8 months of tweets
methodology: lexicon-based sentiment scores, LR, NN
impact: showed a case that Twitter mood can predict the stock market
reproducibility: data on website, website not working
https://terramood.soic.indiana.edu/data

Discussion

what do you think?
what would you do differently?

Baseline Scenarios
invest 10k $ on DJIA on Jan 1st 1976
sell everything at end of day if index is down
buy everything back at end of first day that index is up
cash in on Dec 31st 1985
>> 25k $ before transaction costs
>> 1.1k $ after transaction costs (0.25% commission)

invest 10k $ on Jan 1st 1976, cash in on Dec 31st 1985
>> 18k $

repeat during the 2000s
>> 4k $ before commissions

source: “The Signal and the Noise”, N. Silver, 2012, page 344
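The two baseline scenarios are easy to simulate on any series of daily closing values. The sketch below is only an illustration: the function names, the choice to charge the commission on the initial purchase and the final sale as well, and the five made-up closing values are assumptions, so it does not reproduce the dollar figures quoted on the slide.

```python
def momentum_strategy(closes, cash=10_000.0, commission=0.0025):
    """Sell at the close of any down day, buy back at the close
    of the first up day, cash in at the last close."""
    units = cash * (1 - commission) / closes[0]   # initial purchase
    cash = 0.0
    for prev, today in zip(closes, closes[1:]):
        if today < prev and units > 0:            # down day: sell everything
            cash = units * today * (1 - commission)
            units = 0.0
        elif today > prev and units == 0:         # up day: buy everything back
            units = cash * (1 - commission) / today
            cash = 0.0
    # If still invested at the end, sell at the last close.
    return cash if units == 0 else units * closes[-1] * (1 - commission)

def buy_and_hold(closes, cash=10_000.0, commission=0.0025):
    """Buy once at the first close and sell once at the last close."""
    return cash * (1 - commission) / closes[0] * closes[-1] * (1 - commission)

closes = [100.0, 110.0, 105.0, 108.0, 120.0]      # made-up index values
active_free = momentum_strategy(closes, commission=0.0)
active_paid = momentum_strategy(closes)
passive_free = buy_and_hold(closes, commission=0.0)
```

Every round trip in the active strategy pays the commission twice, so frequent trading can erase its gains while buy-and-hold pays it only on entry and exit.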
  
Do Investors Mine Twitter?

The Livehoods Project:
Utilizing Social Media to Understand the Dynamics of a City
Justin Cranshaw Raz Schwartz Jason I. Hong Norman Sadeh
jcransh@cs.cmu.edu razs@andrew.cmu.edu jasonh@cs.cmu.edu sadeh@cs.cmu.edu
School of Computer Science
Carnegie Mellon University
Pittsburgh, PA 15213
Abstract
Studying the social dynamics of a city on a large scale has tra-
ditionally been a challenging endeavor, often requiring long
hours of observation and interviews, usually resulting in only
a partial depiction of reality. To address this difficulty, we
introduce a clustering model and research methodology for
studying the structure and composition of a city on a large
scale based on the social media its residents generate. We ap-
ply this new methodology to data from approximately 18 mil-
lion check-ins collected from users of a location-based online
social network. Unlike the boundaries of traditional munici-
pal organizational units such as neighborhoods, which do not
always reflect the character of life in these areas, our clusters,
which we call Livehoods, are representations of the dynamic
areas that comprise the city. We take a qualitative approach to
validating these clusters, interviewing 27 residents of Pitts-
burgh, PA, to see how their perceptions of the city project
onto our findings there. Our results provide strong support
for the discovered clusters, showing how Livehoods reveal
the distinctly characterized areas of the city and the forces
that shape them.
Introduction
The forces that shape the dynamics of a city are multifarious
and complex. Cultural perceptions, economic factors, mu-
nicipal borders, demography, geography, and resources—all
shape and constrain the texture and character of local urban
life. It can be extremely difficult to convey these intricacies
to an outsider; one may call them well-kept secrets, some-
times only even partially known to the locals. When out-
siders, such as researchers, journalists, or city planners, do
want to learn about a city, it often requires hundreds of hours [...]
[...] activity patterns of its people. Contrary to traditional organizational
units such as neighborhoods that are often stagnant
and may portray old realities, our clusters reflect current col-
lective activity patterns of people in the city, thus revealing
the dynamic nature of local urban areas, exposing their indi-
vidual characters, and highlighting various forces that form
the urban habitat.
Our work is made possible by the rapid proliferation of
smartphones in recent years and the subsequent emergence
of location-based services and applications. Location-based
social networks such as foursquare have created new means
for online interactions based on the physical location of their
users. In these systems, users can “check-in” to a location by
selecting it from a list of named nearby venues. Their check-
in is then broadcast to other users of the system.
To algorithmically explore the dynamics of cities, we use
data from millions of check-ins gathered from foursquare.
Using well studied techniques in spectral clustering, we in-
troduce a model for the structure of local urban areas that
groups nearby foursquare venues into clusters. Our model
takes into account both the spatial proximity between venues
as given by their geographic coordinates, as well as the so-
cial proximity which we derive from the distribution of peo-
ple that check-in to them. The underlying hypothesis of our
model is that the “character” of an urban area is defined not
just by the types of places found there, but also by the peo-
ple that choose to make that area part of their daily life. We
call these clusters Livehoods, reflecting the dynamic nature
of activity patterns in the lives of city inhabitants.
We take a qualitative approach to evaluating this hypothesis.
In a true urban studies tradition, we conducted interviews
with 27 residents of different areas of Pittsburgh, [...]
International Conference on Weblogs and Social Media (ICWSM) 2012
In This Work...
goal: discover city structure using online activity
use 4sq data to uncover ‘livehoods’ in Pittsburgh
how do they differ from official boundaries?
 evaluate findings with interviews

Data
43k check-ins in Pittsburgh
4k users, 5k venues (restaurants, cafeterias, etc)
newly collected & from previous work

who (user id) visited what venue (venue-id, geo-location)

Clustering
For each pair of venues
–  geographic distance
–  social similarity s (in terms of visiting users)
For each venue
–  maintain m nearest geographic neighbors
–  connect them with an edge with weight s
Apply spectral clustering
–  number of clusters at largest eigenvalue gap
Clusters → ‘Livehoods’
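The graph-building steps above can be sketched as follows. This is a minimal illustration under assumptions, not the Livehoods implementation: cosine similarity over per-user check-in counts stands in for the social similarity s, straight-line distance on planar coordinates stands in for geographic distance, and the venue names, coordinates and counts are made up. The spectral clustering step itself is left out; the resulting weighted graph is the input such a step would take.

```python
from math import hypot, sqrt

def cosine(u, v):
    """Cosine similarity of two sparse user -> check-in count vectors."""
    dot = sum(count * v[user] for user, count in u.items() if user in v)
    norm = sqrt(sum(c * c for c in u.values())) * sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0

def venue_graph(coords, checkins, m):
    """Connect each venue to its m nearest geographic neighbors and
    weight every edge by the social similarity of the two venues."""
    edges = {}
    for a, (ax, ay) in coords.items():
        nearest = sorted((b for b in coords if b != a),
                         key=lambda b: hypot(coords[b][0] - ax, coords[b][1] - ay))[:m]
        for b in nearest:
            edges[frozenset((a, b))] = cosine(checkins[a], checkins[b])
    return edges

# Hypothetical venues: planar coordinates and check-in counts per user.
coords = {"cafe": (0, 0), "bar": (0, 1), "gym": (5, 5), "park": (5, 6)}
checkins = {
    "cafe": {"u1": 3, "u2": 1},
    "bar": {"u1": 2, "u2": 2},
    "gym": {"u3": 4},
    "park": {"u3": 1, "u1": 1},
}
edges = venue_graph(coords, checkins, m=1)
```

Restricting edges to geographic neighbors keeps clusters spatially contiguous, while the social weight lets two adjacent venues fall into different clusters when different people visit them.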
  
Findings
Figure 1: The municipal borders (black) and Livehoods for Shadyside/East Liberty (Left) and Lawrenceville/Polish Hill (Right).

Evaluation
Found livehoods that split, spilled, or corresponded with municipal areas

Interviews with 27 residents
–  Identified and drew their neighborhood
–  Shown a map with municipal boundaries, asked if they could identify borders that were not accurate (‘in flux’)
–  Shown algorithm’s results, asked for feedback

Evaluation Results
•  Not very rigorous, but interesting...
•  Most discovered livehoods matched boundaries
•  When not, the interviewees usually agreed & offered explanations
•  One controversial case
•  Poor area missing (no smartphones?)

livehoods.org

Summary
goal: discover livehoods in cities
data: 43k Foursquare check-ins
methodology: combined geographic & social clustering
impact: showed a case where we can use online data to understand offline activity
reproducibility: some data are online, results are online

Discussion

What do you think?
What would you do differently?

Schedule
•  Today: overview
•  February 2nd: discuss literature (Aris)
•  February 9th: discuss literature (Michael)
•  February 23rd: students present project proposals
•  March 30th: students submit progress report
•  March 30th & April 6th: intermediate presentations
•  May 4th & May 11th: final presentations
•  May 15th: final report due

Proposals
•  5 min presentations

•  what do you intend to do?
•  why?
•  what data will you use?
•  what techniques?
•  how will you evaluate?
•  do you plan to publish your code and data?

The End
Mining	
  the	
  Social	
  Web	
  -­‐	
  Aalto	
  -­‐	
  2015	
   59	
  

More Related Content

What's hot

Burson-Marsteller DC Advocacy Groups Social Media Study Final
Burson-Marsteller DC Advocacy Groups Social Media Study FinalBurson-Marsteller DC Advocacy Groups Social Media Study Final
Burson-Marsteller DC Advocacy Groups Social Media Study FinalBurson-Marsteller
 
Block doc
Block docBlock doc
Block docqaq ss
 
Block doc
Block docBlock doc
Block docqaq ss
 
Social network analysis and audience segmentation, presented by Jason Baldridge
Social network analysis and audience segmentation, presented by Jason BaldridgeSocial network analysis and audience segmentation, presented by Jason Baldridge
Social network analysis and audience segmentation, presented by Jason BaldridgeSocialMedia.org
 
Shareworthiness and Motivated Reasoning in Hyper-Partisan News Sharing Behavi...
Shareworthiness and Motivated Reasoning in Hyper-Partisan News Sharing Behavi...Shareworthiness and Motivated Reasoning in Hyper-Partisan News Sharing Behavi...
Shareworthiness and Motivated Reasoning in Hyper-Partisan News Sharing Behavi...Axel Bruns
 
5003 presentation (revised)_2
5003 presentation (revised)_25003 presentation (revised)_2
5003 presentation (revised)_2JiayiWang7
 
Fan Identification, Twitter Use, & Social Identity Theory in Sport
Fan Identification, Twitter Use, & Social Identity Theory in SportFan Identification, Twitter Use, & Social Identity Theory in Sport
Fan Identification, Twitter Use, & Social Identity Theory in Sportdaronvaught
 
Doing Social and Political Research in a Digital Age: An Introduction to Digi...
Doing Social and Political Research in a Digital Age: An Introduction to Digi...Doing Social and Political Research in a Digital Age: An Introduction to Digi...
Doing Social and Political Research in a Digital Age: An Introduction to Digi...Liliana Bounegru
 
Long Tail&Power Law
Long Tail&Power LawLong Tail&Power Law
Long Tail&Power Lawkarlrlang
 
Online Petitioning Through Data Exploration and What We Found There: A Datase...
Online Petitioning Through Data Exploration and What We Found There: A Datase...Online Petitioning Through Data Exploration and What We Found There: A Datase...
Online Petitioning Through Data Exploration and What We Found There: A Datase...Pablo Aragón
 
A cross-national comparison of Twitter user interactions with leading politic...
A cross-national comparison of Twitter user interactions with leading politic...A cross-national comparison of Twitter user interactions with leading politic...
A cross-national comparison of Twitter user interactions with leading politic...Christian Nuernbergk
 
Democracy: The Least Bad Form of Government
Democracy: The Least Bad Form of GovernmentDemocracy: The Least Bad Form of Government
Democracy: The Least Bad Form of GovernmentVYTIS MALECKAS
 
Redistributing journalism: Journalism as a data public and the politics of qu...
Redistributing journalism: Journalism as a data public and the politics of qu...Redistributing journalism: Journalism as a data public and the politics of qu...
Redistributing journalism: Journalism as a data public and the politics of qu...Liliana Bounegru
 
Social media negativity fys 100
Social media negativity fys 100Social media negativity fys 100
Social media negativity fys 100DevinAHankins
 
Beyond the Bubble: A Critical Review of the Evidence for Echo Chambers and Fi...
Beyond the Bubble: A Critical Review of the Evidence for Echo Chambers and Fi...Beyond the Bubble: A Critical Review of the Evidence for Echo Chambers and Fi...
Beyond the Bubble: A Critical Review of the Evidence for Echo Chambers and Fi...Axel Bruns
 
Doing Digital Methods: Some Recent Highlights from Winter and Summer Schools
Doing Digital Methods: Some Recent Highlights from Winter and Summer SchoolsDoing Digital Methods: Some Recent Highlights from Winter and Summer Schools
Doing Digital Methods: Some Recent Highlights from Winter and Summer SchoolsLiliana Bounegru
 

What's hot (16)

Burson-Marsteller DC Advocacy Groups Social Media Study Final
Burson-Marsteller DC Advocacy Groups Social Media Study FinalBurson-Marsteller DC Advocacy Groups Social Media Study Final
Burson-Marsteller DC Advocacy Groups Social Media Study Final
 
Block doc
Block docBlock doc
Block doc
 
Block doc
Block docBlock doc
Block doc
 
Social network analysis and audience segmentation, presented by Jason Baldridge
Social network analysis and audience segmentation, presented by Jason BaldridgeSocial network analysis and audience segmentation, presented by Jason Baldridge
Social network analysis and audience segmentation, presented by Jason Baldridge
 
Shareworthiness and Motivated Reasoning in Hyper-Partisan News Sharing Behavi...
Shareworthiness and Motivated Reasoning in Hyper-Partisan News Sharing Behavi...Shareworthiness and Motivated Reasoning in Hyper-Partisan News Sharing Behavi...
Shareworthiness and Motivated Reasoning in Hyper-Partisan News Sharing Behavi...
 
5003 presentation (revised)_2
5003 presentation (revised)_25003 presentation (revised)_2
5003 presentation (revised)_2
 
Fan Identification, Twitter Use, & Social Identity Theory in Sport
Fan Identification, Twitter Use, & Social Identity Theory in SportFan Identification, Twitter Use, & Social Identity Theory in Sport
Fan Identification, Twitter Use, & Social Identity Theory in Sport
 
Doing Social and Political Research in a Digital Age: An Introduction to Digi...
Doing Social and Political Research in a Digital Age: An Introduction to Digi...Doing Social and Political Research in a Digital Age: An Introduction to Digi...
Doing Social and Political Research in a Digital Age: An Introduction to Digi...
 
Long Tail&Power Law
Long Tail&Power LawLong Tail&Power Law
Long Tail&Power Law
 
Online Petitioning Through Data Exploration and What We Found There: A Datase...
Online Petitioning Through Data Exploration and What We Found There: A Datase...Online Petitioning Through Data Exploration and What We Found There: A Datase...
Online Petitioning Through Data Exploration and What We Found There: A Datase...
 
A cross-national comparison of Twitter user interactions with leading politic...
A cross-national comparison of Twitter user interactions with leading politic...A cross-national comparison of Twitter user interactions with leading politic...
A cross-national comparison of Twitter user interactions with leading politic...
 
Democracy: The Least Bad Form of Government
Democracy: The Least Bad Form of GovernmentDemocracy: The Least Bad Form of Government
Democracy: The Least Bad Form of Government
 
Redistributing journalism: Journalism as a data public and the politics of qu...
Redistributing journalism: Journalism as a data public and the politics of qu...Redistributing journalism: Journalism as a data public and the politics of qu...
Redistributing journalism: Journalism as a data public and the politics of qu...
 
Social media negativity fys 100
Social media negativity fys 100Social media negativity fys 100
Social media negativity fys 100
 
Beyond the Bubble: A Critical Review of the Evidence for Echo Chambers and Fi...
Beyond the Bubble: A Critical Review of the Evidence for Echo Chambers and Fi...Beyond the Bubble: A Critical Review of the Evidence for Echo Chambers and Fi...
Beyond the Bubble: A Critical Review of the Evidence for Echo Chambers and Fi...
 
Doing Digital Methods: Some Recent Highlights from Winter and Summer Schools
Doing Digital Methods: Some Recent Highlights from Winter and Summer SchoolsDoing Digital Methods: Some Recent Highlights from Winter and Summer Schools
Doing Digital Methods: Some Recent Highlights from Winter and Summer Schools
 

Viewers also liked

What counts in social media? - Politics of Big Data conference
What counts in social media? - Politics of Big Data conferenceWhat counts in social media? - Politics of Big Data conference
What counts in social media? - Politics of Big Data conferencecgrltz
 
Measuring the effect of social connections on political activity on Facebook
Measuring the effect of social connections on political activity on FacebookMeasuring the effect of social connections on political activity on Facebook
Measuring the effect of social connections on political activity on FacebookPetro Poutanen
 
How i learned to stop worrying and love big data machines
How i learned to stop worrying and love big data machinesHow i learned to stop worrying and love big data machines
How i learned to stop worrying and love big data machinesAnthony Behan
 
La scienza dei non scienziati La citizen science come modello culturale
La scienza dei non scienziati La citizen science come modello culturaleLa scienza dei non scienziati La citizen science come modello culturale
La scienza dei non scienziati La citizen science come modello culturaleDavide Bennato
 
Ethics and Politics of Big Data
Ethics and Politics of Big DataEthics and Politics of Big Data
Ethics and Politics of Big Datarobkitchin
 
A brief history of "big data"
A brief history of "big data"A brief history of "big data"
A brief history of "big data"Nicola Ferraro
 
The Day After (The 2016 Election)
The Day After (The 2016 Election)The Day After (The 2016 Election)
The Day After (The 2016 Election)Giovanni Rodriguez
 
Big Data - 25 Amazing Facts Everyone Should Know
Big Data - 25 Amazing Facts Everyone Should KnowBig Data - 25 Amazing Facts Everyone Should Know
Big Data - 25 Amazing Facts Everyone Should KnowBernard Marr
 
Lightning Talk #9: How UX and Data Storytelling Can Shape Policy by Mika Aldaba
Lightning Talk #9: How UX and Data Storytelling Can Shape Policy by Mika AldabaLightning Talk #9: How UX and Data Storytelling Can Shape Policy by Mika Aldaba
Lightning Talk #9: How UX and Data Storytelling Can Shape Policy by Mika Aldabaux singapore
 

Viewers also liked (10)

What counts in social media? - Politics of Big Data conference
What counts in social media? - Politics of Big Data conferenceWhat counts in social media? - Politics of Big Data conference
What counts in social media? - Politics of Big Data conference
 
Measuring the effect of social connections on political activity on Facebook
Measuring the effect of social connections on political activity on FacebookMeasuring the effect of social connections on political activity on Facebook
Measuring the effect of social connections on political activity on Facebook
 
How i learned to stop worrying and love big data machines
How i learned to stop worrying and love big data machinesHow i learned to stop worrying and love big data machines
How i learned to stop worrying and love big data machines
 
La scienza dei non scienziati La citizen science come modello culturale
La scienza dei non scienziati La citizen science come modello culturaleLa scienza dei non scienziati La citizen science come modello culturale
La scienza dei non scienziati La citizen science come modello culturale
 
Ethics and Politics of Big Data
Ethics and Politics of Big DataEthics and Politics of Big Data
Ethics and Politics of Big Data
 
A brief history of "big data"
A brief history of "big data"A brief history of "big data"
A brief history of "big data"
 
The Day After (The 2016 Election)
The Day After (The 2016 Election)The Day After (The 2016 Election)
The Day After (The 2016 Election)
 
Big Data - 25 Amazing Facts Everyone Should Know
Big Data - 25 Amazing Facts Everyone Should KnowBig Data - 25 Amazing Facts Everyone Should Know
Big Data - 25 Amazing Facts Everyone Should Know
 
What is Big Data?
What is Big Data?What is Big Data?
What is Big Data?
 
Lightning Talk #9: How UX and Data Storytelling Can Shape Policy by Mika Aldaba
Lightning Talk #9: How UX and Data Storytelling Can Shape Policy by Mika AldabaLightning Talk #9: How UX and Data Storytelling Can Shape Policy by Mika Aldaba
Lightning Talk #9: How UX and Data Storytelling Can Shape Policy by Mika Aldaba
 

Similar to Mining the Social Web - Lecture 3 - T61.6020

Ponencia Congreso Andaluz Sociología, Almeria 25.11.2016 Social media el quin...
Ponencia Congreso Andaluz Sociología, Almeria 25.11.2016 Social media el quin...Ponencia Congreso Andaluz Sociología, Almeria 25.11.2016 Social media el quin...
Ponencia Congreso Andaluz Sociología, Almeria 25.11.2016 Social media el quin...BO TRUE ACTIVITIES SL
 
Visualizing Co-Retweeting Behavior for Recommending Relevant Real-Time Content






Mining the Social Web - Lecture 3 - T61.6020

  • 1. Mining the Social Web
    Aristides Gionis, Michael Mathioudakis
    firstname.lastname@aalto.fi
    Aalto University, Spring 2015
  • 2. Mining the Social Web - Aalto - 2015
    Previously on T-61.6020: structure and dynamics of social networks
      what does a social network look like?
      how do social networks evolve over time?
      how does information spread?
      do users influence each other?
    [embedded slide, lecture #2] structure of social networks: social networks and social-media data can be represented as graphs (or networks). What do these graphs look like? What is their structure? The data contain additional information (actions, interactions, dynamics, attributes, ...); mining this additional information as part of the network structure.
    [embedded slide, lecture #2] community structure in social networks: dolphins network and its NCP (source [Leskovec et al., 2009]); from Frieze, Gionis, Tsourakakis, Algorithmic Techniques for Modeling and Mining Large Graphs.
    [embedded figure] Figure 4: Top 50 threads in the news cycle with highest volume for the period Aug. 1 – Oct. 31, 2008. Each thread consists of all news articles and blog posts containing a textual variant of a particular quoted phrase. (Phrase variants for the two largest threads in each week are shown as labels pointing to the corresponding thread.) The data is drawn as a stacked plot in which the thickness of the strand corresponding to each thread indicates its volume over time. Interactive visualization is available at http://memetracker.org.
    [embedded figure] Figure 5: Temporal dynamics of top threads as generated by our model. Only two ingredients, namely imitation and a preference to recent threads, are enough to qualitatively reproduce the observed dynamics of the news cycle.
    [embedded paper excerpt, threads dynamics] 3. GLOBAL ANALYSIS: TEMPORAL VARIATION AND A PROBABILISTIC MODEL. periods when the upper envelope of the curve are high correspond to times when there is a greater degree of convergence on key stories, while the low periods indicate that attention is more diffuse,
  • 3. Today
    politics: does network structure reflect political divisions? can we infer political affiliation?
    financial sentiment: can twitter predict the stock market?
    urban computing: what does online activity say about how we live in cities?
  • 4. Politics Online
    www & 'democratization' of information
    citizen journalism (e.g. Haiti earthquake, Arab spring)
    politicians and traditional media participate, too
  • 5. US Politics & the Web
    Websites - 1996
    (Email) - 1998
    Online Fund raising - 2000
    Blogs - 2004
    Twitter & FB - 2008
    [image caption] Jesse Ventura - MN governor
  • 6. social media mining vs traditional political polls
    new content & biases
  • 7. [paper, first page] International Workshop on Link Discovery, 2005
    The Political Blogosphere and the 2004 U.S. Election: Divided They Blog
    Lada Adamic, HP Labs, 1501 Page Mill Road, Palo Alto, CA 94304, lada.adamic@hp.com
    Natalie Glance, Intelliseek Applied Research Center, 5001 Baum Blvd., Pittsburgh, PA 15217, nglance@intelliseek.com
    4 March 2005
    Abstract: In this paper, we study the linking patterns and discussion topics of political bloggers. Our aim is to measure the degree of interaction between liberal and conservative blogs, and to uncover any differences in the structure of the two communities. Specifically, we analyze the posts of 40 "A-list" blogs over the period of two months preceding the U.S. Presidential Election of 2004, to study how often they referred to one another and to quantify the overlap in the topics they discussed, both within the liberal and conservative communities, and also across communities. We also study a single day snapshot of over 1,000 political blogs. This snapshot captures blogrolls (the list of links to other blogs frequently found in sidebars), and presents a more static picture of a broader blogosphere. Most significantly, we find differences in the behavior of liberal and conservative blogs, with conservative blogs linking to each other more frequently and in a denser pattern.
    1 Introduction: The 2004 U.S. Presidential Election was the first Presidential Election in the United States in which blogging played an important role. Although the term weblog was coined in 1997, it was not until after 9/11 that blogs gained readership and influence in the U.S. The next major trend in political blogging was "warblogging": blogs centered around discussion of the invasion of Iraq by the U.S. The year 2004 saw a rapid rise in the popularity and proliferation of blogs. According to a report from the Pew Internet & American Life Project published in January 2005, 32 million U.S.
  • 8. Some History
    before facebook and twitter, there were blogs
    rise after 9/11: 'war-blogging'
    2004: 32 million US citizens read blogs
    62% of US citizens do not know what blogs are
  • 9. [image caption] Jan 25, 2004
    everyone talks with everyone?...
    or 'echo chambers'?
  • 10. In This Work...
    task: extract network structure of political interactions
    question: one or separate communities?
    ...previous evidence on two blogs (Instapundit and Atrios) shows that their 'neighborhoods' have no overlap in cited urls
    ...same with book purchases on amazon.com
    on the other hand...
    ...it is now easier to interact
  • 11. Data
    used blog directories for lists of political blogs
    parsed front page for links to discover more blogs
    labeled them manually
    only liberal & conservative
    759 liberal & 735 conservative blogs
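The "parse the front page for links" step can be illustrated with a minimal sketch. This is not the authors' crawler: the class and function names (`LinkCollector`, `outgoing_hosts`) and the example URLs are hypothetical, and the sketch only uses the Python standard library to list the external hosts a page links to.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects host names of links that point outside the page's own site."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.own_host = urlparse(base_url).netloc.lower()
        self.hosts = set()

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        # Resolve relative links against the page URL, then keep the host part.
        host = urlparse(urljoin(self.base_url, href)).netloc.lower()
        if host and host != self.own_host:
            self.hosts.add(host)


def outgoing_hosts(html, base_url):
    """Return the set of external hosts linked from an HTML front page."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    return parser.hosts
```

In such a setup, each newly discovered host would be a candidate blog that still has to be labeled by hand, as the slide describes.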
  • 12. Findings
    Figure 1: Community structure of political blogs (expanded set), shown using a GEM layout [11] in the GUESS [3] visualization and analysis tool. The colors reflect political orientation, red for conservative, and blue for liberal. Orange links go from liberal to conservative, and purple
  • 13. Findings
    small number of blogs attract most links
    similar picture for top 20 blogs
    [figure] Log-log plot; horizontal axis: incoming links (k); vertical axis: fraction of blogs with at least k links; series: conservative, liberal; fits: lognormal fit, k^(-0.36) e^(-k/57).
    [caption, cropped in the slide] cumulative distribution of incoming links for political blogs, separated by category. [...] lognormal, shown as a dashed line, to be a fairly good fit. A power-law with an exponential cutoff, shown as a solid line, is an even better fit.
    [paper excerpt, cropped] This is on par with the 277 links received by the most linked to conservative
    [figure] Network of the top 40 blogs, panels (A), (B), (C). Legend: 1 Digbys Blog, 2 James Walcott, 3 Pandagon, 4 blog.johnkerry.com, 5 Oliver Willis, 6 America Blog, 7 Crooked Timber, 8 Daily Kos, 9 American Prospect, 10 Eschaton, 11 Wonkette, 12 Talk Left, 13 Political Wire, 14 Talking Points Memo, 15 Matthew Yglesias, 16 Washington Monthly, 17 MyDD, 18 Juan Cole, 19 Left Coaster, 20 Bradford DeLong, 21 JawaReport, 22 Voka Pundit, 23 Roger L Simon, 24 Tim Blair, 25 Andrew Sullivan, 26 Instapundit, 27 Blogs for Bush, 28 Little Green Footballs, 29 Belmont Club, 30 Captain's Quarters, 31 Powerline, 32 Hugh Hewitt, 33 INDC Journal, 34 Real Clear Politics, 35 Winds of Change, 36 Allahpundit, 37 Michelle Malkin, 38 WizBang, 39 Dean's World, 40 Volokh
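The quantity plotted on this slide, the fraction of blogs with at least k incoming links, can be computed from a list of in-degrees as in the sketch below. The function names are illustrative. The default parameters of `cutoff_power_law` (exponent 0.36, cutoff 57) are read from the garbled figure legend and should be treated as an assumption, and the normalising constant `c` is a placeholder rather than a fitted value.

```python
import math
from collections import Counter


def ccdf(in_degrees):
    """Map each observed degree k to the fraction of blogs with at least k links."""
    n = len(in_degrees)
    counts = Counter(in_degrees)
    remaining = n  # number of blogs with degree >= current k
    result = {}
    for k in sorted(counts):
        result[k] = remaining / n
        remaining -= counts[k]
    return result


def cutoff_power_law(k, alpha=0.36, k_cut=57.0, c=1.0):
    """Power law with exponential cutoff: c * k^(-alpha) * exp(-k / k_cut)."""
    return c * k ** (-alpha) * math.exp(-k / k_cut)
```

Plotting `ccdf` for liberal and conservative blogs separately on log-log axes, together with `cutoff_power_law`, would give a picture of the same kind as the figure; the heavy tail is what "small number of blogs attract most links" refers to.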
  • 14. Findings: top 20 blogs, news links
    [figure] Figure 4: Most linked to news sources by the top 20 conservative and top 20 liberal blogs during 8/29/2004 - 11/15/2004. Horizontal axis: # citations from weblog posts (0 to 1600); bars split into Left and Right. Sources shown: nytimes.com, washingtonpost.com, news.yahoo.com, msnbc.msn.com, nationalreview.com, cnn.com, latimes.com, boston.com, usatoday.com, washingtontimes.com, apnews.myway.com, guardian.co.uk, foxnews.com, cbsnews.com, slate.msn.com, nypost.com, news.bbc.co.uk, tnr.com, opinionjournal.com, online.wsj.com, salon.com.
    [paper excerpt; the heading of the first list is cropped in the slide]
      1. CBS News poll of uncommitted voters shows Kerry winning 43% to 28%
      2. Sun Times article: Bob Novak predicts that George Bush will retreat from Iraq if reelected
      3. CBS News article on forged memos
      4. New York Daily News article on Osama Bin Laden videotape, "gift" for the President
      5. Time Magazine poll: Bush opens double-digit lead on post convention bounce
    In contrast, the top news articles cited by right leaning bloggers are:
      1. CBS News article on forged memos
      2. Time Magazine poll: Bush opens double-digit lead on post convention bounce
      3. National Review article refuting the case about missing explosives
      4. ABC News article refuting the case about missing explosives
      5. Washington Post article reporting on Kerry's proposal to allow Iran to keep its nuclear power plants in exchange for giving up the right to retain the nuclear fuel that could be used for bomb-making
    A time series chart further shows how quickly and strongly conservative bloggers responded to forged CBS documents (Figure 5). The conservative bloggers saw Dan Rather's report as an attempt by the left to discredit President Bush. They acted quickly to debunk the report, with the charge led by PowerLine and seconded by Wizbangblog and others. In contrast, the pick-up among liberal bloggers occurred later, with lower volume. The most vocal left leaning bloggers on the subject were TalkLeft and AMERICAblog.
  • 15. Findings: all blogs, links to other websites
    [figure] Figure 7: Most linked to news sources (online and off), showing proportionally how many liberal and conservative blogs link to them. Horizontal axis: number of blogs linking (0 to 140); bars split into Left and Right. Sources shown: buzzflash.com, cursor.org, mediamatters.org, commondreams.org, alternet.org, airamericaradio.co, salon.com, thenation.com, theonion.com, guardian.co.uk, nytimes.com, news.google.com, washingtonpost.com, cnn.com, foxnews.com, weeklystandard.com, command-post.org, townhall.com, opinionjournal.com, nationalreview.com.
  • 16. Summary
    question: one or separate communities?
    data: 1.5k blogs, manually labeled
    methodology: simple link analysis
    impact: showed divide in 'online' world
    reproducibility: ?
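As an illustration of what "simple link analysis" can mean here, the sketch below counts links by the labels of their endpoints and reports the share that stays inside one community. It is a toy version under stated assumptions, not the paper's procedure: the input format (a list of `(source, target)` pairs plus a dictionary of manual labels) and the function names are hypothetical.

```python
from collections import Counter


def link_mix(edges, label):
    """Count directed links by (source label, target label); skip unlabeled blogs."""
    counts = Counter()
    for src, dst in edges:
        if src in label and dst in label:
            counts[(label[src], label[dst])] += 1
    return counts


def within_fraction(counts):
    """Fraction of counted links whose endpoints carry the same label."""
    total = sum(counts.values())
    same = sum(c for (a, b), c in counts.items() if a == b)
    return same / total if total else 0.0
```

A value of `within_fraction` close to 1 would indicate two largely separate communities, while a value near the level expected from the group sizes alone would indicate that everyone links to everyone.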
  • 17. Political Polarization on Twitter M. D. Conover, J. Ratkiewicz, M. Francisco, B. Gonc¸alves, A. Flammini, F. Menczer Center for Complex Networks and Systems Research School of Informatics and Computing Indiana University, Bloomington, IN, USA Abstract In this study we investigate how social media shape the networked public sphere and facilitate communication be- tween communities with different political orientations. We examine two networks of political communication on Twit- ter, comprised of more than 250,000 tweets from the six weeks leading up to the 2010 U.S. congressional midterm elections. Using a combination of network clustering algo- rithms and manually-annotated data we demonstrate that the network of political retweets exhibits a highly segregated par- tisan structure, with extremely limited connectivity between left- and right-leaning users. Surprisingly this is not the case for the user-to-user mention network, which is dominated by a single politically heterogeneous cluster of users in which ideologically-opposed individuals interact at a much higher rate compared to the network of retweets. To explain the dis- tinct topologies of the retweet and mention networks we con- jecture that politically motivated individuals provoke inter- action by injecting partisan content into information streams whose primary audience consists of ideologically-opposed users. We conclude with statistical evidence in support of this hypothesis. 1 Introduction Social media play an important role in shaping political dis- course in the U.S. and around the world (Bennett 2003; Benkler 2006; Sunstein 2007; Farrell and Drezner 2008; Aday et al. 2010; Tumasjan et al. 2010; O’Connor et al. 2010). According to the Pew Internet and American Life Gallo, and Kane (2007). 
Consumers of online political information tend to behave similarly, choosing to read blogs that share their political beliefs, with 26% more users doing so in 2008 than 2004 (Pew Internet and American Life Project 2008). In its own right, the formation of online communities is not necessarily a serious problem. The concern is that when politically active individuals can avoid people and information they would not have chosen in advance, their opinions are likely to become increasingly extreme as a result of being exposed to more homogeneous viewpoints and fewer credible opposing opinions. The implications for the political process in this case are clear. A deliberative democracy relies on a broadly informed public and a healthy ecosystem of competing ideas. If individuals are exposed exclusively to people or facts that reinforce their pre-existing beliefs, democracy suffers (Sunstein 2002; 2007).
In this study we examine networks of political communication on the Twitter microblogging service during the six weeks prior to the 2010 U.S. midterm elections. Sampling data from the Twitter ‘gardenhose’ API, we identified 250,000 politically relevant messages (tweets) produced by more than 45,000 users. From these tweets we isolated two networks of political communication: the retweet network, in which users are connected if one has rebroadcast content produced by another, and the mention network, where users are connected if one has mentioned another in a post, including the case of tweet replies. We demonstrate that the retweet network exhibits a highly …
Proceedings of the Fifth International AAAI Conference on Weblogs and Social Media; International Conference on Weblogs and Social Media (ICWSM) 2011
  • 18. In This Work...   question: check if same pattern holds in Twitter   however: retweets and mentions instead of links
  • 19. Data   6 weeks of tweets before US 2010 elections   to distinguish political tweets:   1. parse all tweets (355 million)   2. select subset with hashtags #p2 and #tcot   3. find frequently co-occurring hashtags (66)   4. keep tweets that contain those hashtags   >> 250k tweets
The major contributions of this work are:
• Creation and release of a network and text dataset derived from more than 250,000 politically-related Twitter posts authored in the weeks preceding the 2010 U.S. midterm elections (§ 2).
• Cluster analysis of networks derived from this corpus showing that the network of retweets exhibits clear segregation, while the mention network is dominated by a single large community (§ 3.1).
• Manual classification of Twitter users by political alignment, demonstrating that the retweet network clusters correspond to the political left and right. These data also show the mention network to be politically heterogeneous, with users of opposing political views interacting at a much higher rate than in the retweet network (§ 3.3).
• An interpretation of the observed community structures …
Table 1: Hashtags related to #p2, #tcot, or both. Tweets containing any of these were included in our sample.
Just #p2: #casen #dadt #dc10210 #democrats #du1 #fem2 #gotv #kysen #lgf #ofa #onenation #p2b #pledge #rebelleft #truthout #vote #vote2010 #whyimvotingdemocrat #youcut
Both: #cspj #dem #dems #desen #gop #hcr #nvsen #obama #ocra #p2 #p21 #phnm #politics #sgp #tcot #teaparty #tlot #topprog #tpp #twisters #votedem
Just #tcot: #912 #ampat #ftrs #glennbeck #hhrs #iamthemob #ma04 #mapoli #palin #palin12 #spwbt #tsot #tweetcongress #ucot #wethepeople
Table 2: Hashtags excluded from the analysis due to ambigu-
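The four selection steps on this slide can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the function names (`hashtags`, `expand_seeds`, `keep_political`), the raw co-occurrence count threshold `min_count`, and the toy tweets are all assumptions made here; the paper may well use a different co-occurrence measure.

```python
from collections import Counter

def hashtags(text):
    """Lower-cased hashtags in a tweet, with trailing punctuation stripped."""
    return {w.lower().rstrip(".,!?:;") for w in text.split() if w.startswith("#")}

def expand_seeds(tweets, seeds, min_count):
    """Step 3: hashtags that co-occur with a seed hashtag in at least min_count tweets.

    min_count is a hypothetical threshold chosen for illustration.
    """
    seeds = {s.lower() for s in seeds}
    counts = Counter()
    for text in tweets:
        tags = hashtags(text)
        if tags & seeds:                 # step 2: tweet carries a seed hashtag
            counts.update(tags - seeds)  # count every other hashtag once per tweet
    return {tag for tag, n in counts.items() if n >= min_count}

def keep_political(tweets, tags):
    """Step 4: keep tweets that contain at least one of the selected hashtags."""
    tags = {t.lower() for t in tags}
    return [text for text in tweets if hashtags(text) & tags]

# Toy data, invented for this example.
sample = [
    "Vote tomorrow #p2 #dems",
    "Lower taxes now #tcot #teaparty",
    "Rally tonight #tcot #teaparty",
    "Healthcare matters #p2 #dems",
    "Lunch was great #food",
    "Town hall tonight #teaparty",
]
seeds = {"#p2", "#tcot"}
related = expand_seeds(sample, seeds, min_count=2)
political = keep_political(sample, seeds | related)
print(sorted(related), len(political))
```

Note that the last toy tweet carries no seed hashtag but is still kept in step 4, which is the point of expanding the seed set.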
  • 20. Extract Networks   two networks on user nodes u and v:   retweet network (edge when v retweets u)   mention network (edge when u mentions v)   clustering:   1. max modularity for two clusters, to assign initial labels to nodes (a and b)   2. label propagation: re-assign nodes to label of most neighbors, until no change
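Step 2 of the clustering (label propagation) can be sketched as below. This is an illustrative version under stated assumptions: the initial labels are simply passed in (the modularity step is not implemented), nodes are visited in sorted order, ties keep the current label, and `max_rounds` is a safety cap added here. None of these details are confirmed by the slide.

```python
from collections import Counter

def propagate_labels(adjacency, initial_labels, max_rounds=100):
    """Re-assign each node to the label held by most of its neighbors until stable.

    adjacency: dict node -> list of neighbor nodes (undirected graph).
    initial_labels: dict node -> label, e.g. from a two-cluster modularity split.
    """
    labels = dict(initial_labels)
    for _ in range(max_rounds):
        changed = False
        for node in sorted(adjacency):
            votes = Counter(labels[n] for n in adjacency[node])
            if not votes:
                continue  # isolated node keeps its label
            top = max(votes.values())
            best = sorted(label for label, c in votes.items() if c == top)
            if labels[node] in best:
                continue  # already in the majority (or tied): keep current label
            labels[node] = best[0]
            changed = True
        if not changed:
            break
    return labels

# Toy graph: two triangles (1,2,3) and (4,5,6) joined by the edge 3-4.
graph = {1: [2, 3], 2: [1, 3], 3: [1, 2, 4], 4: [3, 5, 6], 5: [4, 6], 6: [4, 5]}
# Node 3 starts with the "wrong" label and is corrected by its neighbors.
start = {1: "a", 2: "a", 3: "b", 4: "b", 5: "b", 6: "b"}
print(propagate_labels(graph, start))
```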
  • 21. Findings   retweets: separate clusters   mentions: one big cluster
Figure 1: The political retweet (left) and mention (right) networks, laid out using a force-directed algorithm. Node colors reflect cluster assignments (see § 3.1). Community structure is evident in the retweet network, but less so in the mention network. We show in § 3.3 that in the retweet network, the red cluster A is made of 93% right-leaning users, while the blue cluster B is made of 80% left-leaning users.
… This structural difference is of particular importance with respect to political communication, as we now have statistical evidence to suggest that mentions and …
  • 22. Summary   question: does network structure reflect political divisions?   data: 250k political tweets before US 2010 elections   methodology: clustering & label propagation   impact: showed different modes of interaction between individuals in same & different sides   reproducibility: data are online: http://truthy.indiana.edu/projects/data-and-software.html
  • 23. This network graph details the landscape of Twitter handles responding to the UNRWA school bombing.   Source: https://medium.com/i-data/israel-gaza-war-data-a54969aeb23e
  • 24. Instagram co-tag graph, highlighting three distinct topical communities: 1) pro-Israeli (Orange), 2) pro-Palestinian (Yellow), and 3) Religious / Muslim (Purple)   Source: https://medium.com/i-data/israel-gaza-war-data-a54969aeb23e
  • 25. Classifying Political Orientation on Twitter: It’s Not Easy! Raviv Cohen and Derek Ruths, School of Computer Science, McGill University (raviv.cohen@mail.mcgill.ca, derek.ruths@mcgill.ca). International Conference on Weblogs and Social Media (ICWSM) 2013.
Abstract: Numerous papers have reported great success at inferring the political orientation of Twitter users. This paper has some unfortunate news to deliver: while past work has been sound and often methodologically novel, we have discovered that reported accuracies have been systemically overoptimistic due to the way in which validation datasets have been collected, reporting accuracy levels nearly 30% higher than can be expected in populations of general Twitter users. Using careful and novel data collection and annotation techniques, we collected three different sets of Twitter users, each characterizing a different degree of political engagement on Twitter, from politicians (highly politically vocal) to “normal” users (those who rarely discuss politics). Applying standard techniques for inferring political orientation, we show that methods which previously reported greater than 90% inference accuracy, actually achieve barely 65% accuracy on normal users. We also show that classifiers cannot be used to classify users outside the narrow range of political orientation on which they were trained. While a sobering finding, our results quantify and call attention to overlooked problems in the latent attribute inference literature that, no doubt, extend beyond political orientation inference: the way in which datasets are assembled and the transferability of classifiers.
Introduction: Much of the promise of online social media studies, analytics, and commerce depends on knowing various attributes of individual and groups of users.
For a variety of reasons, few intrinsic attributes of individuals are explicitly revealed in their user account profiles. As a result, latent attribute inference, the computational discovery of “hidden” attributes, has become a topic of significant interest among social me- … including gender, age, education, political orientation, and even coffee preferences (Zamal, Liu, and Ruths 2012; Conover et al. 2011b; 2011a; Rao and Yarowsky 2010; Pennacchiotti and Popescu 2011; Wong et al. 2013; Liu and Ruths 2013; Golbeck and Hansen 2011; Burger, Henderson, and Zarrella 2011). In general, inference algorithms have achieved accuracy rates in the range of 85%, but have struggled to improve beyond this point. To date, the great success story of this area is political orientation inference for which a number of papers have boasted inference accuracy reaching and even surpassing 90% (Conover et al. 2011b; Zamal, Liu, and Ruths 2012).
By any reasonable measure, the existing work on political orientation is sound and represents a sincere and successful effort to advance the technology of latent attribute inference. Furthermore, a number of the works have yielded notable insights into the nature of political orientation in online environments (Conover et al. 2011b; 2011a). In this paper, we examine the question of whether existing political orientation inference systems actually perform as well as reported on the general Twitter population. Our findings indicate that, without exception, they do not, even when the general population considered is restricted only to those who discuss politics (since inferring the political orientation of a user who never speaks about politics is, certainly, very hard if not impossible).
We consider this an important question and finding for two reasons. Foremost, nearly all applications of latent attribute inference involve its use on large populations of unknown users.
As a result, quantifying its performance on the general Twitter population is arguably the best way of evaluating its practical utility. Second, the existing literature on this topic reports its accuracy in inferring political orientation without qualification or caveats (author’s note: includ-
  • 26. In This Work...   task: classify political orientation on Twitter   goal: investigate claims of previous work   previous work: high classification accuracy, but was the task too easy?   heavily political user accounts   what about ‘modestly political’ users?
  • 27. Data   3 datasets: Political Figures, Politically Active, Politically Modest
Table 1: Basic statistics on the different datasets used. Total size of the Figures dataset was limited by the number of federal level politicians; size of the Modest dataset was limited by the number of users that satisfied our stringent conditions - these were culled from a dataset of 10,000 random individuals.
Dataset | Republicans | Democrats | Total
Figures | 203 | 194 | 397
Active | 860 | 977 | 1837
Modest | 105 | 157 | 262
Conover | 107 | 89 | 196
  • 28. Data: Political Figures   Twitter accounts for US governors & congressmen   latest 1000 tweets   also... partitioned used hashtags into politically discriminative & neutral
  • 29. Data: Politically Active   self-declared, according to profile   democrats/liberals, conservatives/republicans   manual inspection   only US residents   political topics not dominating their tweets
  • 30. Data: Politically Modest   >> 10000 random users   filter ones that use politically neutral hashtag   >> 1500 users   use Amazon Mechanical Turk to classify, based on 10 tweets with hashtags   democrats/liberals vs republicans/conservatives   >> 327 classified users
  • 31. Classification   SVM with 10-fold cross validation   Features (from previous work):   – tweet/rt/hashtag/link/mention frequencies   – number of friends, followers   – usage of top-k Rep/Dem 1-2-3-grams, hashtags
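The evaluation protocol named on this slide, 10-fold cross validation, can be sketched in plain Python as below. To keep the example dependency-free, a nearest-centroid classifier stands in for the SVM; that swap, the helper names, the contiguous (unshuffled) folds, and the toy two-feature data are all assumptions made here, not the paper's setup.

```python
def k_fold_indices(n, k):
    """Split range(n) into k contiguous folds whose sizes differ by at most one."""
    folds, start = [], 0
    for i in range(k):
        size = n // k + (1 if i < n % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def train_nearest_centroid(X, y):
    """Stand-in classifier (NOT an SVM): predict the label of the closest class mean."""
    sums, counts = {}, {}
    for x, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(x))
        for j, v in enumerate(x):
            acc[j] += v
        counts[label] = counts.get(label, 0) + 1
    centroids = {l: [v / counts[l] for v in s] for l, s in sums.items()}

    def predict(x):
        return min(centroids,
                   key=lambda l: sum((a - b) ** 2 for a, b in zip(x, centroids[l])))
    return predict

def cross_validate(X, y, train_fn, k=10):
    """Mean held-out accuracy over k folds; assumes len(X) >= k."""
    accuracies = []
    for fold in k_fold_indices(len(X), k):
        held_out = set(fold)
        train_X = [x for i, x in enumerate(X) if i not in held_out]
        train_y = [l for i, l in enumerate(y) if i not in held_out]
        predict = train_fn(train_X, train_y)
        correct = sum(predict(X[i]) == y[i] for i in fold)
        accuracies.append(correct / len(fold))
    return sum(accuracies) / k

# Toy, perfectly separable data: one hypothetical feature per party-specific hashtag rate.
X = [[1.0, 0.0] if i % 2 == 0 else [0.0, 1.0] for i in range(20)]
y = ["R" if i % 2 == 0 else "D" for i in range(20)]
print(cross_validate(X, y, train_nearest_centroid, k=10))
```

In practice the rows would be shuffled (or stratified) before splitting; the contiguous folds here only work because the toy labels alternate.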
  • 32. Findings   cross-dataset classification
The average of ten 10-fold cross-validation SVM itera… was performed on each one of our datasets respec…
Dataset | SVM Accuracy
Figures | 91%
Active | 84%
Modest | 68%
Conover | 87%
Table 6: Performance results of training our SVM on one dataset and inferring on another; italicized are the averaged 10-fold cross-validation results.
Dataset | Figures | Active | Modest
Figures | 91% | 72% | 66%
Active | 62% | 84% | 69%
Modest | 54% | 57% | 68%
  • 33. Summary   goal: study classification accuracy for political leaning   data: targeted Twitter sample   methodology: partitioned subsets of data, AMT, SVM   impact: showed difficulty of task   reproducibility: data available on request: http://www.icwsm.org/2015/datasets/datasets/
  • 34. Twitter mood predicts the stock market. Johan Bollen1,*, Huina Mao1,*, Xiao-Jun Zeng2. *: authors made equal contributions.
Abstract: Behavioral economics tells us that emotions can profoundly affect individual behavior and decision-making. Does this also apply to societies at large, i.e. can societies experience mood states that affect their collective decision making? By extension is the public mood correlated or even predictive of economic indicators? Here we investigate whether measurements of collective mood states derived from large-scale Twitter feeds are correlated to the value of the Dow Jones Industrial Average (DJIA) over time. We analyze the text content of daily Twitter feeds by two mood tracking tools, namely OpinionFinder that measures positive vs. negative mood and Google-Profile of Mood States (GPOMS) that measures mood in terms of 6 dimensions (Calm, Alert, Sure, Vital, Kind, and Happy). We cross-validate the resulting mood time series by comparing their ability to detect the public’s response to the presidential election and Thanksgiving day in 2008. A Granger causality analysis and a Self-Organizing Fuzzy Neural Network are then used to investigate the hypothesis that public mood states, as measured by the OpinionFinder and GPOMS mood time series, are predictive of changes in DJIA closing values. Our results indicate that the accuracy of DJIA predictions can be significantly improved by the inclusion of specific public mood dimensions but not others. We find an accuracy of 87.6% in predicting the daily up and down changes in the closing values of the DJIA and a reduction of the Mean Average Percentage Error by more than 6%.
Index Terms: stock market prediction, twitter, mood analysis.
I. INTRODUCTION: Stock market prediction has attracted much attention from academia as well as business. But can the stock market really be predicted?
Early research on stock market prediction [1], [2], [3] was based on random walk theory and the Efficient Market Hypothesis (EMH) [4]. According to the EMH stock market prices are largely driven by new … sentiment from blogs. In addition, Google search queries have been shown to provide early indicators of disease infection rates and consumer spending [14]. [9] investigates the relations between breaking financial news and stock price changes. Most recently [13] provide a ground-breaking demonstration of how public sentiment related to movies, as expressed on Twitter, can actually predict box office receipts.
Although news most certainly influences stock market prices, public mood states or sentiment may play an equally important role. We know from psychological research that emotions, in addition to information, play a significant role in human decision-making [16], [18], [39]. Behavioral finance has provided further proof that financial decisions are significantly driven by emotion and mood [19]. It is therefore reasonable to assume that the public mood and sentiment can drive stock market values as much as news. This is supported by recent research by [10] who extract an indicator of public anxiety from LiveJournal posts and investigate whether its variations can predict S&P500 values.
However, if it is our goal to study how public mood influences the stock markets, we need reliable, scalable and early assessments of the public mood at a time-scale and resolution appropriate for practical stock market prediction. Large surveys of public mood over representative samples of the population are generally expensive and time-consuming to conduct, cf. Gallup’s opinion polls and various consumer and well-being indices. Some have therefore proposed indirect assessment of public mood or sentiment from the results of soccer games [20] and from weather conditions [21].
The accuracy of these methods is however limited by the low degree to which the chosen indicators are expected to be correlated with public mood.
arXiv:1010.3003v1 [cs.CE] 14 Oct 2010; Journal of Computational Science, 2011
  • 35. In This Work...   Efficient Market Hypothesis: ‘you can’t beat the market’, all information is already taken into account   But maybe Twitter can?   measure the people’s mood as reflected on Twitter   use for prediction
  • 36. Data   about 10M tweets, February to December 2008   time-series of sentiment scores:   OpinionFinder: positive - negative scale   POMS: lexicon-based mood score > calm, alert, sure, vital, kind, happy   time-series of DJIA (Dow Jones Industrial Average) from Yahoo! Finance
  • 37. [Figure: DJIA daily closing value, March 2008 to December 2008]
  • 38. [Figure: z-score time series, Oct 22 to Nov 26, one panel each for OpinionFinder, CALM, ALERT, SURE, VITAL, KIND, HAPPY; annotations: day after election, Thanksgiving, pre-election anxiety, election results, pre-election energy, Thanksgiving happiness]
Fig. 2. Tracking public mood states from tweets posted between October 2008 to December 2008 shows public responses to presidential election and …
The surrounding paper text (truncated on the slide) describes normalizing each mood time series to z-scores (Eq. 1), using the local mean and standard deviation within a sliding window of k days before and after, i.e. the period [t−k, t+k].
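The z-score normalization referred to on this slide (each value compared against the mean and standard deviation of the window [t−k, t+k]) can be sketched as follows. The edge handling (windows truncated at the ends of the series), the use of the population standard deviation, and returning 0.0 for a constant window are assumptions made for this illustration.

```python
from statistics import mean, pstdev

def local_z_scores(series, k):
    """z-score of each point relative to the window [t-k, t+k] around it.

    Windows are truncated at the ends of the series (an assumption of this sketch).
    """
    scores = []
    for t, x in enumerate(series):
        window = series[max(0, t - k): t + k + 1]
        mu = mean(window)
        sigma = pstdev(window)  # population standard deviation of the window
        # A constant window has sigma == 0; report 0.0 instead of dividing by zero.
        scores.append((x - mu) / sigma if sigma > 0 else 0.0)
    return scores

# Toy mood series with a single spike in the middle.
print(local_z_scores([0.0, 0.0, 10.0, 0.0, 0.0], k=1))
```

Because the mean and deviation are local, the score highlights short-term departures from the recent level rather than long-term drift.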
  • 39. To maintain the same scale, we convert the DJIA delta values $D_t$ and mood index value $X_t$ to z-scores as shown in Eq. 1.
[Figure: three panels of DJIA z-score and Calm z-score, Aug 09 to Oct 28; annotation: bank bail-out]
Fig. 3. A panel of three graphs. The top graph shows the overlap of the day-to-day difference of DJIA values (blue: $Z_{D_t}$) with the GPOMS’ Calm time series (red: $Z_{X_t}$) that has been lagged by 3 days. Where the two graphs overlap the Calm time series predict changes in the DJIA closing values that …
  • 40. Linear Correlation   two models, with and without sentiment scores   Basic Model: DJIA daily change   Enhanced Model: adds Mood Variables
We perform the Granger causality analysis according to model L1 and L2 shown in Eq. 3 and 4 for the period of time between February 28 to November 3, 2008 to exclude the exceptional public mood response to the Presidential Election and Thanksgiving from the comparison. GPOMS and OpinionFinder time series were produced for 342,255 tweets in that period, and the daily Dow Jones Industrial Average (DJIA) was retrieved from Yahoo! Finance for each day.
$L_1: D_t = \alpha + \sum_{i=1}^{n} \beta_i D_{t-i} + \epsilon_t$   (3)
$L_2: D_t = \alpha + \sum_{i=1}^{n} \beta_i D_{t-i} + \sum_{i=1}^{n} \gamma_i X_{t-i} + \epsilon_t$   (4)
Based on the results of our Granger causality (shown in Table II), we can reject the null hypothesis that the mood time series do not predict DJIA values, i.e. $\gamma_{\{1,2,\dots,n\}} \neq 0$ with a high level of confidence. However, this result only applies to 1 GPOMS mood dimension. We observe that X1 (i.e. Calm) has the highest Granger causality relation with DJIA for lags …
  • 41. Linear Correlation
TABLE II: … SIGNIFICANCE (P-VALUES) OF BIVARIATE GRANGER-CAUSALITY CORRELATION BETWEEN MOODS AND DJIA IN PERIOD … 2008 TO NOVEMBER 3, 2008.
Lag | OF | Calm | Alert | Sure | Vital | Kind | Happy
1 day | 0.085* | 0.272 | 0.952 | 0.648 | 0.120 | 0.848 | 0.388
2 days | 0.268 | 0.013** | 0.973 | 0.811 | 0.369 | 0.991 | 0.7061
3 days | 0.436 | 0.022** | 0.981 | 0.349 | 0.418 | 0.991 | 0.723
4 days | 0.218 | 0.030** | 0.998 | 0.415 | 0.475 | 0.989 | 0.750
5 days | 0.300 | 0.036** | 0.989 | 0.544 | 0.553 | 0.996 | 0.173
6 days | 0.446 | 0.065* | 0.996 | 0.691 | 0.682 | 0.994 | 0.081*
7 days | 0.620 | 0.157 | 0.999 | 0.381 | 0.713 | 0.999 | 0.150
(p-value < 0.05: **, p-value < 0.1: *)
… dimension thus has predictive value with … In fact the p-value for this shorter period, … to October 30 2008, is significantly lower (… 0.009) than that listed in Table II for the …
  • 42. Neural Network   inputs: values of 3 previous days   calm   calm + happy   Mean Absolute Percentage Error (MAPE): less is better   Direction: prediction of DJIA direction
TABLE III: DJIA DAILY PREDICTION USING SOFNN
Evaluation | IOF | I0 | I1 | I1,2 | I1,3 | I1,4 | I1,5 | I1,6
MAPE (%) | 1.95 | 1.94 | 1.83 | 2.03 | 2.13 | 2.05 | 1.85 | 1.79*
Direction (%) | 73.3 | 73.3 | 86.7* | 60.0 | 46.7 | 60.0 | 73.3 | 80.0
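The two evaluation measures in Table III can be computed as in the sketch below. The definitions used (MAPE as the mean absolute percentage error, and direction accuracy as the share of days where the predicted and actual moves from the previous close have the same sign) are the standard ones and are assumed here; the numbers are toy values, not DJIA data.

```python
def mape(actual, predicted):
    """Mean absolute percentage error, in percent (lower is better)."""
    errors = [abs((a - p) / a) for a, p in zip(actual, predicted)]
    return 100.0 * sum(errors) / len(errors)

def direction_accuracy(previous, actual, predicted):
    """Percent of days where the predicted up/down move matches the actual one.

    previous[i] is the closing value of the day before actual[i].
    A zero change is counted as 'up' (an arbitrary choice for this sketch).
    """
    hits = sum(
        ((a - prev) >= 0) == ((p - prev) >= 0)
        for prev, a, p in zip(previous, actual, predicted)
    )
    return 100.0 * hits / len(actual)

# Toy closing values, invented for illustration.
previous = [100.0, 100.0, 100.0, 100.0]
actual = [110.0, 90.0, 105.0, 95.0]
predicted = [105.0, 95.0, 98.0, 97.0]
print(round(mape(actual, predicted), 2), direction_accuracy(previous, actual, predicted))
```

A model can do well on one measure and poorly on the other, which is why the table reports both.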
  • 43. Summary   question: does Twitter mood predict the stock market?   data: 8 months of tweets   methodology: lexicon-based sentiment scores, LR, NN   impact: showed a case that Twitter mood can predict the stock market   reproducibility: data on website, website not working: https://terramood.soic.indiana.edu/data
  • 44. Discussion   what do you think?   what would you do differently?
  • 45. Baseline Scenarios   invest 10k $ on DJIA on Jan 1st 1976   sell everything at end of day if index is down   buy everything back at end of first day that index is up   cash in on Dec 31st 1985   >> 25k $ before transaction costs   >> 1.1k $ after transaction costs (0.25% commission)   invest 10k $ on Jan 1st 1976, cash in on Dec 31st 1985   >> 18k $   repeat during the 2000s   >> 4k $ before commissions   source: “The Signal and the Noise”, N. Silver, 2012, page 344
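The two baselines on this slide (buy-and-hold versus selling after every down day and buying back after the first up day) can be simulated with a short sketch. The price series below is invented, and several details are assumptions of this illustration: trades execute at the day's close, commission is charged as a fraction of each trade's value, and neither the initial purchase nor the final cash-in is charged.

```python
def buy_and_hold(prices, capital):
    """Invest everything at the first close and cash in at the last close."""
    return capital * prices[-1] / prices[0]

def sell_on_down_days(prices, capital, commission=0.0):
    """Sell at the close of any down day, buy back at the close of the first up day."""
    cash, units = 0.0, capital / prices[0]  # start fully invested
    for yesterday, today in zip(prices, prices[1:]):
        if today < yesterday and units > 0:
            cash = units * today * (1 - commission)   # sell everything
            units = 0.0
        elif today > yesterday and units == 0:
            units = cash * (1 - commission) / today   # buy everything back
            cash = 0.0
    return cash + units * prices[-1]

# Hypothetical index closes.
prices = [100.0, 90.0, 80.0, 88.0, 96.0]
print(buy_and_hold(prices, 10_000))
print(sell_on_down_days(prices, 10_000))
print(sell_on_down_days(prices, 10_000, commission=0.0025))
```

Each round trip pays the commission twice, so a strategy that trades on most days sees its edge eroded quickly, which is the effect the slide's before/after figures illustrate.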
  • 46. Do Investors Mine Twitter?
  • 47. The Livehoods Project: Utilizing Social Media to Understand the Dynamics of a City. Justin Cranshaw (jcransh@cs.cmu.edu), Raz Schwartz (razs@andrew.cmu.edu), Jason I. Hong (jasonh@cs.cmu.edu), Norman Sadeh (sadeh@cs.cmu.edu). School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213.
Abstract: Studying the social dynamics of a city on a large scale has traditionally been a challenging endeavor, often requiring long hours of observation and interviews, usually resulting in only a partial depiction of reality. To address this difficulty, we introduce a clustering model and research methodology for studying the structure and composition of a city on a large scale based on the social media its residents generate. We apply this new methodology to data from approximately 18 million check-ins collected from users of a location-based online social network. Unlike the boundaries of traditional municipal organizational units such as neighborhoods, which do not always reflect the character of life in these areas, our clusters, which we call Livehoods, are representations of the dynamic areas that comprise the city. We take a qualitative approach to validating these clusters, interviewing 27 residents of Pittsburgh, PA, to see how their perceptions of the city project onto our findings there. Our results provide strong support for the discovered clusters, showing how Livehoods reveal the distinctly characterized areas of the city and the forces that shape them.
Introduction: The forces that shape the dynamics of a city are multifarious and complex. Cultural perceptions, economic factors, municipal borders, demography, geography, and resources all shape and constrain the texture and character of local urban life. It can be extremely difficult to convey these intricacies to an outsider; one may call them well-kept secrets, sometimes only even partially known to the locals.
When outsiders, such as researchers, journalists, or city planners, do want to learn about a city, it often requires hundreds of hours … activity patterns of its people. Contrary to traditional organizational units such as neighborhoods that are often stagnant and may portray old realities, our clusters reflect current collective activity patterns of people in the city, thus revealing the dynamic nature of local urban areas, exposing their individual characters, and highlighting various forces that form the urban habitat.
Our work is made possible by the rapid proliferation of smartphones in recent years and the subsequent emergence of location-based services and applications. Location-based social networks such as foursquare have created new means for online interactions based on the physical location of their users. In these systems, users can “check-in” to a location by selecting it from a list of named nearby venues. Their check-in is then broadcast to other users of the system.
To algorithmically explore the dynamics of cities, we use data from millions of check-ins gathered from foursquare. Using well studied techniques in spectral clustering, we introduce a model for the structure of local urban areas that groups nearby foursquare venues into clusters. Our model takes into account both the spatial proximity between venues as given by their geographic coordinates, as well as the social proximity which we derive from the distribution of people that check-in to them. The underlying hypothesis of our model is that the “character” of an urban area is defined not just by the types of places found there, but also by the people that choose to make that area part of their daily life. We call these clusters Livehoods, reflecting the dynamic nature of activity patterns in the lives of city inhabitants.
We take a qualitative approach to evaluating this hypothesis. In a true urban studies tradition, we conducted interviews with 27 residents of different areas of Pittsburgh, …
International Conference on Weblogs and Social Media (ICWSM) 2012
  • 48. In This Work...   goal: discover city structure using online activity   use 4sq data to uncover ‘livehoods’ in Pittsburgh   how do they differ from official boundaries?   evaluate findings with interviews
  • 49. Data   43k check-ins in Pittsburgh   4k users, 5k venues (restaurants, cafeterias, etc)   newly collected & from previous work   who (user id) visited what venue (venue-id, geo-location)
  • 50. Clustering   For each pair of venues   – geographic distance   – social similarity s (in terms of visiting users)   For each venue   – maintain m nearest geographic neighbors   – connect them with an edge with weight s   Apply spectral clustering   – number of clusters at largest eigenvalue gap   Clusters → ‘Livehoods’
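The graph-construction part of this slide (m nearest geographic neighbors, edges weighted by social similarity s) can be sketched as below; the spectral clustering step itself is not shown. Several choices here are assumptions made for the illustration: cosine similarity over per-user check-in counts as the social similarity, planar Euclidean distance on the coordinates, and dropping edges whose similarity is zero.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two venues' check-in vectors (dict: user -> count)."""
    dot = sum(a[u] * b[u] for u in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def build_venue_graph(coords, checkins, m):
    """Connect each venue to its m nearest venues, weighting edges by social similarity.

    coords: venue -> (x, y), treated as planar points (a simplification).
    checkins: venue -> {user: number of check-ins}.
    Returns a dict mapping frozenset({u, v}) -> weight for an undirected graph.
    """
    edges = {}
    for v in coords:
        nearest = sorted(
            (u for u in coords if u != v),
            key=lambda u: math.dist(coords[v], coords[u]),
        )[:m]
        for u in nearest:
            s = cosine_similarity(checkins[v], checkins[u])
            if s > 0:  # skip venue pairs with no shared visitors
                edges[frozenset((u, v))] = s
    return edges

# Toy venues: A and B are close and share visitors, as are C and D.
coords = {"A": (0.0, 0.0), "B": (1.0, 0.0), "C": (10.0, 0.0), "D": (11.0, 0.0)}
checkins = {"A": {"u1": 2, "u2": 1}, "B": {"u1": 1, "u2": 2},
            "C": {"u3": 1}, "D": {"u3": 3}}
for pair, weight in build_venue_graph(coords, checkins, m=1).items():
    print(sorted(pair), round(weight, 3))
```

Restricting edges to geographic neighbors keeps clusters spatially contiguous, while the weights let venues with a shared clientele pull together.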
  • 51. Findings   Figure 1: The municipal borders (black) and Livehoods for Shadyside/East Liberty (Left) and Lawrenceville/Polish Hill (Right).
  • 52. Evaluation   Found livehoods that split, spilled, or corresponded with municipal areas   Interviews with 27 residents   – Identified and drew their neighborhood   – Shown a map with municipal boundaries, asked if they could identify borders that were not accurate (‘in flux’)   – Shown algorithm’s results, asked for feedback
  • 53. Evaluation Results   • Not very rigorous, but interesting...   • Most discovered livehoods matched boundaries   • When not, the interviewees usually agreed & offered explanations   • One controversial case   • Poor area missing (no smartphones?)
  • 54. livehoods.org
  • 55. Summary   goal: discover livehoods in cities   data: 43k Foursquare check-ins   methodology: combined geographic & social clustering   impact: showed a case where we can use online data to understand offline activity   reproducibility: some data are online, results are online
  • 56. Discussion   What do you think?   What would you do differently?
  • 57. Schedule   • Today: overview   • February 2nd: discuss literature (Aris)   • February 9th: discuss literature (Michael)   • February 23rd: students present project proposals   • March 30th: students submit progress report   • March 30th & April 6th: intermediate presentations   • May 4th & May 11th: final presentations   • May 15th: final report due
  • 58. Proposals   • 5 min presentations   • what do you intend to do?   • why?   • what data will you use?   • what techniques?   • how will you evaluate?   • do you plan to publish your code and data?
  • 59. The End