BD-ACA week7a

Big Data and Automated Content Analysis
Week 7 – Monday
»Word co-occurrences, Gephi, and some suggestions«
Damian Trilling
d.c.trilling@uva.nl
@damian0604
www.damiantrilling.net
Department of Communication Science
University of Amsterdam
11 May 2015

Today
1 Integrating word counts and network analysis: Word co-occurrences
    The idea
    A real-life example
2 Some suggestions on where to look further
    Useful packages
    Some more tips
3 Next meetings & final project

Integrating word counts and network analysis:
Word co-occurrences

The idea
Simple word count
We already know this.
from collections import Counter

tekst = ("this is a test where many test words occur several times this is "
         "because it is a test yes indeed it is")
c = Counter(tekst.split())  # count how often each word occurs
print("The top 5 are: ")
for woord, aantal in c.most_common(5):
    print(aantal, woord)

The idea
Simple word count
The output:
The top 5 are:
4 is
3 test
2 this
2 a
2 it

The idea
What if we could ...
... count the frequency of combinations of words?
As in: Which words typically occur together in the same
tweet (or paragraph, or sentence, ...)?

The idea
We can, with the combinations() function
>>> from itertools import combinations
>>> words = "Hoi this is a test test test a test it is".split()
>>> print([e for e in combinations(words, 2)])
[('Hoi', 'this'), ('Hoi', 'is'), ('Hoi', 'a'), ('Hoi', 'test'),
 ('Hoi', 'test'), ('Hoi', 'test'), ('Hoi', 'a'), ('Hoi', 'test'),
 ('Hoi', 'it'), ('Hoi', 'is'), ('this', 'is'), ('this', 'a'),
 ('this', 'test'), ('this', 'test'), ('this', 'test'), ('this', 'a'),
 ('this', 'test'), ('this', 'it'), ('this', 'is'), ('is', 'a'),
 ('is', 'test'), ('is', 'test'), ('is', 'test'), ('is', 'a'),
 ('is', 'test'), ('is', 'it'), ('is', 'is'), ('a', 'test'),
 ('a', 'test'), ('a', 'test'), ('a', 'a'), ('a', 'test'), ('a', 'it'),
 ('a', 'is'), ('test', 'test'), ('test', 'test'), ('test', 'a'),
 ('test', 'test'), ('test', 'it'), ('test', 'is'), ('test', 'test'),
 ('test', 'a'), ('test', 'test'), ('test', 'it'), ('test', 'is'),
 ('test', 'a'), ('test', 'test'), ('test', 'it'), ('test', 'is'),
 ('a', 'test'), ('a', 'it'), ('a', 'is'), ('test', 'it'),
 ('test', 'is'), ('it', 'is')]

The idea
Count co-occurrences
from collections import defaultdict
from itertools import combinations

tweets = ["i am having coffee with my friend", "i like coffee",
          "i like coffee and beer", "beer i like"]
cooc = defaultdict(int)

for tweet in tweets:
    words = tweet.split()
    # set() makes sure each pair is counted at most once per tweet
    for a, b in set(combinations(words, 2)):
        # treat (a, b) and (b, a) as the same pair
        if (b, a) in cooc:
            a, b = b, a
        if a != b:
            cooc[(a, b)] += 1

for combi in sorted(cooc, key=cooc.get, reverse=True):
    print(cooc[combi], "\t", combi)

The idea
Count co-occurrences
The output:
3    ('i', 'coffee')
3    ('i', 'like')
2    ('i', 'beer')
2    ('like', 'beer')
2    ('like', 'coffee')
1    ('coffee', 'beer')
1    ('and', 'beer')
...

The idea
From a list of co-occurrences to a network
Let’s conceptualize each word as a node and each
co-occurrence as an edge
• node weight = word frequency
• edge weight = number of co-occurrences
A GDF file offers all of this and looks like this:

nodedef>name VARCHAR, width DOUBLE
coffee,3
beer,2
i,4
and,1
with,1
friend,1
having,1
like,3
am,1
my,1
edgedef>node1 VARCHAR,node2 VARCHAR, weight DOUBLE
coffee,beer,1
i,beer,2
and,beer,1
with,friend,1
coffee,with,1
i,and,1
having,friend,1
like,beer,2
am,friend,1
i,am,1
i,coffee,3
i,with,1
am,having,1
i,having,1
coffee,and,1
like,coffee,2

The idea
How to represent the co-occurrences graphically?
A two-step approach
1 Save as a GDF file (the format seems easy to understand, so
we could write a function for this in Python)
2 Open the GDF file in Gephi for visualization and/or network
analysis
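
Step 1 could be sketched roughly as follows. This is only one possible way to do it, not the function used in the course: the names count_cooccurrences, write_gdf and the output file name are made up for this illustration. Pairs are built from the sorted set of words per tweet, which is a small variation on the swap logic shown earlier but yields the same counts. Note that words containing commas would break this simple writer.

```python
from collections import Counter, defaultdict
from itertools import combinations
import os
import tempfile


def count_cooccurrences(texts):
    """Return (word frequencies, co-occurrence counts) for a list of strings."""
    freq = Counter()
    cooc = defaultdict(int)
    for text in texts:
        words = text.split()
        freq.update(words)
        # sorted(set(...)) yields every unordered pair of distinct words
        # exactly once per text, always in alphabetical order
        for a, b in combinations(sorted(set(words)), 2):
            cooc[(a, b)] += 1
    return freq, cooc


def write_gdf(freq, cooc, filename):
    """Write nodes (word, frequency) and edges (word, word, count) as GDF."""
    with open(filename, "w", encoding="utf-8") as f:
        f.write("nodedef>name VARCHAR, width DOUBLE\n")
        for word, n in freq.items():
            f.write("{},{}\n".format(word, n))
        f.write("edgedef>node1 VARCHAR,node2 VARCHAR, weight DOUBLE\n")
        for (a, b), n in cooc.items():
            f.write("{},{},{}\n".format(a, b, n))


tweets = ["i am having coffee with my friend", "i like coffee",
          "i like coffee and beer", "beer i like"]
freq, cooc = count_cooccurrences(tweets)
# hypothetical file name, written to the system's temporary directory
outfile = os.path.join(tempfile.gettempdir(), "cooc_example.gdf")
write_gdf(freq, cooc, outfile)
```

The resulting file can then be opened in Gephi as described in step 2.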

The idea
Gephi
• Install (NOT in the VM) from https://gephi.org
• In case of problems on macOS, see what I wrote about Gephi here:
http://www.damiantrilling.net/setting-up-my-new-macbook/
• I made a screencast on how to visualize the GDF file in Gephi:
https://streamingmedia.uva.nl/asset/detail/t2KWKVZtQWZIe2Cj8qXcW5KF
• Further: see the materials I mailed to you

A real-life example
A real-life example
Trilling, D. (2014). Two different debates? Investigating the
relationship between a political debate on TV and simultaneous
comments on Twitter. Social Science Computer Review, Advance
online publication. doi: 10.1177/0894439314537886

A real-life example
Commenting on the TV debate on Twitter
The debating politicians
• issues largely set by the interviewers
• but candidates actively try to highlight the issues (⇒ agenda
setting) and aspects of the issues (⇒ framing).

A real-life example
Commenting on the TV debate on Twitter
The viewers
• Commenting on television programs on social networks has
become a regular pattern of behavior (Courtois & d’Heer, 2012)
• User comments have been shown to reflect the structure of the
debate (Shamma, Churchill, & Kennedy, 2010; Shamma, Kennedy, & Churchill, 2009)
• Topic and speaker effects are more influential than, e.g., rhetorical
skills (Nagel, Maurer, & Reinemann, 2012; De Mooy & Maier, 2014)

A real-life example
Research Questions
To what extent are the statements politicians make during a
TV debate reflected in online live discussions of the debate?
RQ1 Which topics are emphasized by the candidates?
RQ2 Which topics are emphasized by the Twitter users?
RQ3 With which topics are the two candidates associated
on Twitter?

A real-life example
Method
The data
• debate transcript
• tweets containing #tvduell
• N = 120,557 tweets by N = 24,796 users
• 22-9-2013, 20.30-22.00
The analysis
• Series of self-written Python scripts:
  1 preprocessing (stemming, stopword removal)
  2 word counts
  3 word log likelihood (corpus comparison)
• Stata: regression analysis
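
The first two steps of such a pipeline might look roughly like the sketch below. It is not the script used in the study: the stopword list, the example sentences and the function names preprocess and word_counts are invented for illustration, and stemming is left out because it would need an additional package.

```python
from collections import Counter
import re

# Hypothetical, deliberately tiny stopword list; a real project needs a fuller one
STOPWORDS = {"the", "a", "is", "and", "of", "to", "in"}


def preprocess(text, stopwords=STOPWORDS):
    """Lowercase, split into word tokens, and drop stopwords (no stemming)."""
    tokens = re.findall(r"\w+", text.lower())
    return [t for t in tokens if t not in stopwords]


def word_counts(texts):
    """Count the preprocessed tokens over all texts."""
    c = Counter()
    for t in texts:
        c.update(preprocess(t))
    return c


# Made-up example documents
docs = ["The debate is about jobs and the economy",
        "Jobs, jobs, jobs: the economy in a debate"]
counts = word_counts(docs)
print(counts.most_common(3))
```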

[Figure: time series, y-axis from 0 to 8000, x-axis in minutes from −60 to 150 relative to the start of the debate; the start (0) and end (90) of the debate are marked]

A real-life example
Relationship between words on TV and on Twitter
[Figure: plot of ln(word on Twitter + 1) (y-axis, 0 to 10) against ln(word on TV + 1) (x-axis, 0 to 3)]

A real-life example
Word frequency TV ⇒ word frequency Twitter
                    Model 1                Model 2                Model 3
                    ln(Twitter +1)         ln(Twitter +1)         ln(Twitter +1)
                                           together w/ M.         together w/ S.
                    b (SE)                 b (SE)                 b (SE)
                    beta                   beta                   beta
ln (TV M. +1)       1.59 (.052) ***        1.54 (.041) ***        .77 (.037) ***
                    .21                    .26                    .14
ln (TV S. +1)       1.29 (.051) ***        .88 (.041) ***         1.25 (.037) ***
                    .17                    .15                    .24
intercept           1.64 (.008) ***        .87 (.007) ***         .60 (.006) ***
R2                  .100                   .115                   .100
b M. & S. differ?   F(1, 21408) = 12.29    F(1, 21408) = 96.69    F(1, 21408) = 63.38
                    p < .001               p < .001               p < .001
M = Merkel; S = Steinbrück
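
To make the two kinds of coefficients in the table concrete, here is a small sketch of how unstandardized coefficients (b) and standardized coefficients (beta) relate for log-transformed frequencies. The numbers are made-up toy values, not data from the study, and the helper name ols is chosen only for this example; the actual analysis was run in Stata.

```python
import numpy as np

# Made-up toy word frequencies (NOT the data from the study)
tv_m = np.array([0, 1, 2, 5, 10, 20])          # word frequency, candidate M. on TV
tv_s = np.array([3, 0, 1, 8, 2, 15])           # word frequency, candidate S. on TV
twitter = np.array([3, 10, 40, 150, 900, 4000])

# ln(x + 1) keeps words with a frequency of 0 in the analysis
x = np.column_stack([np.log(tv_m + 1), np.log(tv_s + 1)])
y = np.log(twitter + 1)


def ols(x, y):
    """Least-squares fit of y on the columns of x plus an intercept.

    Returns (intercept, slopes).
    """
    design = np.column_stack([np.ones(len(y)), x])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef[0], coef[1:]


intercept, b = ols(x, y)

# Standardized coefficient: b rescaled by the standard deviations of x and y
beta = b * x.std(axis=0) / y.std()
print("b:", b.round(3), "beta:", beta.round(3))
```

The same beta values result from fitting the model on z-standardized variables, which is why they can be compared across predictors measured on different scales.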

A real-life example
Most distinctive words on TV
LL       word                        Frequency Merkel   Frequency Steinbrück
27.73    merkel                      0                  20
19.41    arbeitsplatz [job]          14                 0
15.25    steinbruck                  11                 0
9.70     koalition [coalition]       7                  0
9.70     international               7                  0
9.70     gemeinsam [together]        7                  0
8.55     griechenland [Greece]       10                 1
8.32     investi [investment]        6                  0
6.93     uberzeug [belief]           5                  0
6.93     okonom [economic]           0                  5
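
A common formulation of the log-likelihood (G2) statistic for comparing word frequencies between two corpora can be sketched as below. Whether this is exactly the variant used in the study is an assumption, and the corpus sizes are not given on the slides. The function name log_likelihood and the corpus size of 1000 are chosen for illustration only. With equal corpus sizes, the function happens to reproduce the LL values in the first rows of the table above.

```python
import math


def log_likelihood(a, b, size_a, size_b):
    """G2 for a word occurring a times in corpus A and b times in corpus B.

    size_a and size_b are the total numbers of words in each corpus.
    """
    # Expected frequencies if the word were equally likely in both corpora
    expected_a = size_a * (a + b) / (size_a + size_b)
    expected_b = size_b * (a + b) / (size_a + size_b)
    ll = 0.0
    for observed, expected in ((a, expected_a), (b, expected_b)):
        if observed > 0:  # a term with observed == 0 contributes nothing
            ll += observed * math.log(observed / expected)
    return 2 * ll


# Assumption: both speakers' corpora have the same size (here 1000 words)
print(round(log_likelihood(0, 20, 1000, 1000), 2))   # 'merkel' row
print(round(log_likelihood(10, 1, 1000, 1000), 2))   # 'griechenland' row
```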

A real-life example
Most distinctive words on Twitter
LL          word                        Frequency Merkel   Frequency Steinbrück
32443.39    merkel                      29672              0
30751.65    steinbrueck                 0                  17780
1507.08     kett [necklace]             1628               34
1241.14     vertrau [trust]             1240               12
863.84      fdp [a coalition partner]   985                29
775.93      nsa                         1809               298
626.49      wikipedia                   40                 502
574.65      twittert [tweets]           40                 469
544.87      koalition [coalition]       864                77
517.99      gold                        669                34

A real-life example
Putting the pieces together
Merkel
• necklace
• trust (sarcastic)
• NSA affair
• coalition partners
Steinbrück
• suggestion to look something up on Wikipedia
• tweets from his account during the debate

Useful packages
Some suggestions on where to look further

Useful packages
Further analysis
Ways to further analyze the data
• Write the data in a specific format to link to a special external
program (GDF example)
• Export to CSV files and analyze using R, Stata, SPSS, Excel, ...
• Do it in Python, using ...

Useful packages
Packages for statistics and graphics
Already installed with Anaconda:
• numpy
• scipy
• pandas
• matplotlib
We won’t cover these packages in detail, but you are very much
encouraged to have a look at these packages yourself if you feel
they are useful.

Useful packages
numpy
>>> import numpy as np
>>> x = [1,2,3,4,3,2]
>>> y = [2,2,4,3,4,2]
>>> np.mean(x)
2.5
>>> np.std(x)
0.9574271077563381
>>> np.corrcoef(x,y)
array([[ 1.        ,  0.67883359],
       [ 0.67883359,  1.        ]])

Useful packages
pandas
import pandas as pd
# Note: pandas.stats was removed in newer pandas releases,
# so this import only works with an old pandas version
from pandas.stats.api import ols

df = pd.DataFrame({"A": [10,20,30,40,50], "B": [20, 30, 10, 40, 50],
                   "C": [32, 234, 23, 23, 42523]})
result = ols(y=df['A'], x=df[['B','C']])
print(result)
prints a regression table like you would expect from any statistics
program:

Useful packages
-------------------------Summary of Regression Analysis-------------------------
Formula: Y ~ <B> + <C> + <intercept>
Number of Observations: 5
Number of Degrees of Freedom: 3
R-squared: 0.5789
Adj R-squared: 0.1577
Rmse: 14.5108
F-stat (2, 2): 1.3746, p-value: 0.4211
Degrees of Freedom: model 2, resid 2
-----------------------Summary of Estimated Coefficients------------------------
      Variable       Coef    Std Err     t-stat    p-value    CI 2.5%   CI 97.5%
--------------------------------------------------------------------------------
             B     0.4012     0.6497       0.62     0.5999    -0.8723     1.6746
             C     0.0004     0.0005       0.65     0.5826    -0.0007     0.0014
     intercept    14.9525    17.7643       0.84     0.4886   -19.8655    49.7705
---------------------------------End of Summary---------------------------------
... but you can get much more, like a list of predicted values
(result.y_predict), ...
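
Because pandas.stats is no longer part of newer pandas releases, the same coefficients can also be obtained with numpy alone. The following is a minimal sketch that swaps in a plain least-squares fit; it is not the pandas routine itself and gives no standard errors or p-values (a package such as statsmodels would be the usual choice for a full regression table). The variable names are chosen for this illustration.

```python
import numpy as np

# Same toy data as in the pandas example
A = np.array([10, 20, 30, 40, 50], dtype=float)
B = np.array([20, 30, 10, 40, 50], dtype=float)
C = np.array([32, 234, 23, 23, 42523], dtype=float)

# Design matrix: the two predictors plus a column of ones for the intercept
X = np.column_stack([B, C, np.ones_like(A)])
coef, *_ = np.linalg.lstsq(X, A, rcond=None)

predicted = X @ coef  # counterpart of the predicted values mentioned above
r_squared = 1 - np.sum((A - predicted) ** 2) / np.sum((A - A.mean()) ** 2)

for name, value in zip(["B", "C", "intercept"], coef):
    print(name, round(value, 4))
print("R-squared", round(r_squared, 4))
```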

Useful packages
matplotlib
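import matplotlib.pyplot as plt

x = [1,2,3,4,3,2]
y = [2,2,4,3,4,2]
plt.hist(x)      # histogram of x
plt.plot(x,y)    # line plot of y against x
plt.show()       # display the figure (needed outside an interactive console)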

Some more tips
Some tips
• Make use of IPython features in Spyder (tab completion,
object inspector)
• Try things out in the IPython console (think of RStudio or
Stata!)
• Watch this video on "Python for data analysis" with pandas:
https://vimeo.com/59324550

Final project Next meetings

Final project
On 29–5, you have to hand in your final project
• Details and rules: ⇒ course manual
• Similar to take-home exam
• But: Much more advanced, and now, the result counts as well
• And: Be creative! You can use code from class, but you need
to extend it
• Start working on it!

Next meeting
Wednesday, 13–5
Lab session, focus on INDIVIDUAL PROJECTS! Prepare!
(No common exercise)
