Slides for the course Big Data and Automated Content Analysis, in which students of the social sciences (communication science) learn how to conduct analyses using Python.
BD-ACA week7a
1. Word co-occurrences Some suggestions on where to look further Next meetings
Big Data and Automated Content Analysis
Week 7 – Monday
»Word co-occurrences, Gephi
— and some suggestions«
Damian Trilling
d.c.trilling@uva.nl
@damian0604
www.damiantrilling.net
Afdeling Communicatiewetenschap
Universiteit van Amsterdam
11 May 2015
Big Data and Automated Content Analysis Damian Trilling
Today
1 Integrating word counts and network analysis: Word co-occurrences
    The idea
    A real-life example
2 Some suggestions on where to look further
    Useful packages
    Some more tips
3 Next meetings & final project
Integrating word counts and network analysis:
Word co-occurrences
The idea
Simple word count
We already know this.
from collections import Counter

tekst = ("this is a test where many test words occur several times this is "
         "because it is a test yes indeed it is")
# Split on whitespace and count how often each word occurs
c = Counter(tekst.split())
print("The top 5 are: ")
for woord, aantal in c.most_common(5):
    print(aantal, woord)
The idea
Simple word count
The output:
The top 5 are:
4 is
3 test
2 a
2 this
2 it
(Words with the same count may come out in a different order.)
The idea
What if we could. . .
. . . count the frequency of combinations of words?
As in: Which words typically occur together in the same
tweet (or paragraph, or sentence, . . . )?
The idea
We can — with the combinations() function
>>> from itertools import combinations
>>> words = "Hoi this is a test test test a test it is".split()
>>> print(list(combinations(words, 2)))
[('Hoi', 'this'), ('Hoi', 'is'), ('Hoi', 'a'), ('Hoi', 'test'), ('Hoi', 'test'),
 ('Hoi', 'test'), ('Hoi', 'a'), ('Hoi', 'test'), ('Hoi', 'it'), ('Hoi', 'is'),
 ('this', 'is'), ('this', 'a'), ('this', 'test'), ('this', 'test'), ('this', 'test'),
 ('this', 'a'), ('this', 'test'), ('this', 'it'), ('this', 'is'),
 ('is', 'a'), ('is', 'test'), ('is', 'test'), ('is', 'test'), ('is', 'a'),
 ('is', 'test'), ('is', 'it'), ('is', 'is'),
 ('a', 'test'), ('a', 'test'), ('a', 'test'), ('a', 'a'), ('a', 'test'),
 ('a', 'it'), ('a', 'is'),
 ('test', 'test'), ('test', 'test'), ('test', 'a'), ('test', 'test'),
 ('test', 'it'), ('test', 'is'),
 ('test', 'test'), ('test', 'a'), ('test', 'test'), ('test', 'it'), ('test', 'is'),
 ('test', 'a'), ('test', 'test'), ('test', 'it'), ('test', 'is'),
 ('a', 'test'), ('a', 'it'), ('a', 'is'),
 ('test', 'it'), ('test', 'is'),
 ('it', 'is')]
The idea
Count co-occurrences
from collections import defaultdict
from itertools import combinations

tweets = ["i am having coffee with my friend", "i like coffee",
          "i like coffee and beer", "beer i like"]
cooc = defaultdict(int)

for tweet in tweets:
    words = tweet.split()
    # set() drops duplicate pairs within one tweet
    for a, b in set(combinations(words, 2)):
        # treat (a, b) and (b, a) as the same pair
        if (b, a) in cooc:
            a, b = b, a
        # skip pairs of a word with itself
        if a != b:
            cooc[(a, b)] += 1

# print the pairs, most frequent first
for combi in sorted(cooc, key=cooc.get, reverse=True):
    print(cooc[combi], "\t", combi)
The idea
Count co-occurrences
The output:
3    ('i', 'coffee')
3    ('i', 'like')
2    ('i', 'beer')
2    ('like', 'beer')
2    ('like', 'coffee')
1    ('coffee', 'beer')
1    ('and', 'beer')
...
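A more compact variant of the same idea, as a sketch: if the order of the two words within a pair does not matter, sorting the unique words of each tweet first guarantees that every pair is always stored in the same (alphabetical) order, so no swapping is needed. The function name `count_cooccurrences` is made up for this illustration.

```python
from collections import Counter
from itertools import combinations

def count_cooccurrences(docs):
    """Count in how many documents each unordered word pair occurs."""
    cooc = Counter()
    for doc in docs:
        # unique words, sorted, so each pair is stored alphabetically
        unique_words = sorted(set(doc.split()))
        cooc.update(combinations(unique_words, 2))
    return cooc

tweets = ["i am having coffee with my friend", "i like coffee",
          "i like coffee and beer", "beer i like"]
cooc = count_cooccurrences(tweets)
for pair, n in cooc.most_common(5):
    print(n, "\t", pair)
```

The counts are the same as above, only the keys differ: `('coffee', 'i')` instead of `('i', 'coffee')`.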
The idea
From a list of co-occurrences to a network
Let’s conceptualize each word as a node and each
co-occurrence as an edge
• node weight = word frequency
• edge weight = number of co-occurrences
A GDF file can store all of this.
The idea
How to represent the co-occurrences graphically?
A two-step approach
1 Save as a GDF file (the format seems easy to understand, so
we could write a function for this in Python)
2 Open the GDF file in Gephi for visualization and/or network
analysis
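Step 1 could be sketched roughly as follows. This is an illustration, not the script used in the course: the function name `build_gdf` and the node column `frequency INTEGER` are assumptions, while the `nodedef>` and `edgedef>` header lines follow the GDF layout that Gephi reads. The sketch also assumes that words contain no commas.

```python
from collections import Counter
from itertools import combinations

def build_gdf(docs):
    """Return a GDF-formatted string: words as nodes, co-occurrences as edges."""
    word_freq = Counter()
    cooc = Counter()
    for doc in docs:
        words = doc.split()
        word_freq.update(words)                            # node weight
        cooc.update(combinations(sorted(set(words)), 2))   # edge weight

    # 'frequency' is a column name chosen for this sketch
    lines = ["nodedef>name VARCHAR,frequency INTEGER"]
    lines += [f"{word},{freq}" for word, freq in word_freq.items()]
    lines.append("edgedef>node1 VARCHAR,node2 VARCHAR,weight DOUBLE")
    lines += [f"{a},{b},{n}" for (a, b), n in cooc.items()]
    return "\n".join(lines) + "\n"

tweets = ["i am having coffee with my friend", "i like coffee",
          "i like coffee and beer", "beer i like"]
gdf_text = build_gdf(tweets)
print(gdf_text)
```

Writing `gdf_text` to a file whose name ends in `.gdf` should give something that can be opened in Gephi for step 2.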
The idea
Gephi
• Install (NOT in the VM) from https://gephi.org
• If you run into problems on macOS, see what I wrote about Gephi here:
http://www.damiantrilling.net/setting-up-my-new-macbook/
• I made a screencast on how to visualize the GDF file in Gephi:
https://streamingmedia.uva.nl/asset/detail/t2KWKVZtQWZIe2Cj8qXcW5KF
• Further: see the materials I mailed to you
A real-life example
A real-life example
Trilling, D. (2014). Two different debates? Investigating the
relationship between a political debate on TV and simultaneous
comments on Twitter. Social Science Computer Review, Advance
online publication. doi: 10.1177/0894439314537886
A real-life example
Commenting on the TV debate on Twitter
The debating politicians
• issues largely set by the interviewers
• but candidates actively try to highlight the issues (⇒ agenda
setting) and aspects of the issues (⇒ framing).
A real-life example
Commenting on the TV debate on Twitter
The viewers
• Commenting on television programs on social networks has
become a regular pattern of behavior (Courtois & d’Heer, 2012)
• User comments have been shown to reflect the structure of the
debate (Shamma, Churchill, & Kennedy, 2010; Shamma, Kennedy, & Churchill, 2009)
• Topic and speaker effects are more influential than, e.g., rhetorical
skills (Nagel, Maurer, & Reinemann, 2012; De Mooy & Maier, 2014)
A real-life example
Research Questions
To what extent are the statements politicians make during a
TV debate reflected in online live discussions of the debate?
RQ1 Which topics are emphasized by the candidates?
RQ2 Which topics are emphasized by the Twitter users?
RQ3 With which topics are the two candidates associated
on Twitter?
A real-life example
Method
The data
• debate transcript
• tweets containing #tvduell
• N = 120,557 tweets by N = 24,796 users
• 22-9-2013, 20.30-22.00
The analysis
• Series of self-written Python scripts:
  1 preprocessing (stemming, stopword removal)
  2 word counts
  3 word log likelihood (corpus comparison)
• Stata: regression analysis
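The first two steps could look roughly like the sketch below. It is not the script used in the study: the stopword list is a tiny made-up example, the function name `preprocess` is hypothetical, and stemming is left out entirely.

```python
from collections import Counter

# Hypothetical, tiny stopword list for illustration only
STOPWORDS = {"i", "am", "with", "my", "and"}

def preprocess(text, stopwords=STOPWORDS):
    """Lowercase, split on whitespace, and drop stopwords."""
    return [word for word in text.lower().split() if word not in stopwords]

tweets = ["i am having coffee with my friend", "i like coffee",
          "i like coffee and beer", "beer i like"]

# Word counts over all tweets after preprocessing
counts = Counter(word for tweet in tweets for word in preprocess(tweet))
print(counts.most_common(3))
```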
A real-life example
Relationship between words on TV and on Twitter
[Figure: ln(word on Twitter + 1) on the vertical axis (0 to 10) plotted against ln(word on TV + 1) on the horizontal axis (0 to 3)]
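The ln(x + 1) transformation on both axes keeps words with a frequency of zero in the analysis, since ln(0 + 1) = 0. A minimal sketch of how such pairs of values could be computed; the word counts below are invented for illustration and the function name `log_pairs` is hypothetical:

```python
import math
from collections import Counter

# Invented example counts, not the data from the study
tv_counts = Counter({"koalition": 7, "griechenland": 11})
twitter_counts = Counter({"koalition": 941, "nsa": 2107})

def log_pairs(tv, twitter):
    """Map each word to (ln(TV count + 1), ln(Twitter count + 1))."""
    words = sorted(set(tv) | set(twitter))
    # A Counter returns 0 for words it has not seen
    return {w: (math.log1p(tv[w]), math.log1p(twitter[w])) for w in words}

pairs = log_pairs(tv_counts, twitter_counts)
for word, (ln_tv, ln_twitter) in pairs.items():
    print(word, round(ln_tv, 2), round(ln_twitter, 2))
```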
A real-life example
Word frequency TV ⇒ word frequency Twitter
                    Model 1                 Model 2                 Model 3
                    ln(Twitter +1)          ln(Twitter +1)          ln(Twitter +1)
                                            together w/ M.          together w/ S.
                    b (SE)                  b (SE)                  b (SE)
                    beta                    beta                    beta
ln (TV M. +1)       1.59 (.052) ***         1.54 (.041) ***         .77 (.037) ***
                    .21                     .26                     .14
ln (TV S. +1)       1.29 (.051) ***         .88 (.041) ***          1.25 (.037) ***
                    .17                     .15                     .24
intercept           1.64 (.008) ***         .87 (.007) ***          .60 (.006) ***
R2                  .100                    .115                    .100
b M. & S. differ?   F(1, 21408) = 12.29     F(1, 21408) = 96.69     F(1, 21408) = 63.38
                    p < .001                p < .001                p < .001
M = Merkel; S = Steinbrück
A real-life example
Most distinctive words on TV
LL      word                      Frequency Merkel   Frequency Steinbrück
27.73   merkel                    0                  20
19.41   arbeitsplatz [job]        14                 0
15.25   steinbruck                11                 0
 9.70   koalition [coalition]     7                  0
 9.70   international             7                  0
 9.70   gemeinsam [together]      7                  0
 8.55   griechenland [Greece]     10                 1
 8.32   investi [investment]      6                  0
 6.93   uberzeug [belief]         5                  0
 6.93   okonom [economic]         0                  5
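The slides do not show how the LL column was computed. A common log-likelihood statistic for comparing a word's frequency in two corpora is sketched below; treating it as the measure used here is an assumption. Both corpus sizes are set to the same arbitrary value of 1000, which is also an assumption, though with equal sizes the sketch reproduces several values in the table for the TV debate.

```python
import math

def log_likelihood(freq_a, freq_b, total_a, total_b):
    """Log-likelihood of a word occurring freq_a times in corpus A
    (total_a words) and freq_b times in corpus B (total_b words)."""
    combined = freq_a + freq_b
    # Expected frequencies if the word were equally common in both corpora
    expected_a = total_a * combined / (total_a + total_b)
    expected_b = total_b * combined / (total_a + total_b)
    ll = 0.0
    for observed, expected in ((freq_a, expected_a), (freq_b, expected_b)):
        if observed > 0:  # a frequency of zero contributes nothing
            ll += observed * math.log(observed / expected)
    return 2 * ll

# Corpus sizes of 1000 words each are an assumption for this sketch
print(round(log_likelihood(0, 20, 1000, 1000), 2))   # 27.73 ("merkel")
print(round(log_likelihood(14, 0, 1000, 1000), 2))   # 19.41 ("arbeitsplatz")
print(round(log_likelihood(10, 1, 1000, 1000), 2))   # 8.55 ("griechenland")
```

The higher the value, the more strongly a word is tied to one of the two speakers rather than the other.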
A real-life example
Most distinctive words on Twitter
LL         word                        Frequency Merkel   Frequency Steinbrück
32443.39   merkel                      29672              0
30751.65   steinbrueck                 0                  17780
 1507.08   kett [necklace]             1628               34
 1241.14   vertrau [trust]             1240               12
  863.84   fdp [a coalition partner]   985                29
  775.93   nsa                         1809               298
  626.49   wikipedia                   40                 502
  574.65   twittert [tweets]           40                 469
  544.87   koalition [coalition]       864                77
  517.99   gold                        669                34
A real-life example
Putting the pieces together
Merkel
• necklace
• trust (sarcastic)
• NSA affair
• coalition partners
Steinbrück
• suggestion to look something up on Wikipedia
• tweets from his account during the debate
Useful packages
Some suggestions on where to look further
Useful packages
Further analysis
Ways to further analyze the data
• Write the data in a specific format to feed a specialized external
program (as in the GDF example)
• Export to CSV files and analyze using R, Stata, SPSS, Excel,
. . .
• Do it in Python, using . . .
Useful packages
Packages for statistics and graphics
Already installed with Anaconda:
• numpy
• scipy
• pandas
• matplotlib
We won’t cover these packages in detail, but you are very much
encouraged to have a look at these packages yourself if you feel
they are useful.
Useful packages
numpy
>>> import numpy as np
>>> x = [1,2,3,4,3,2]
>>> y = [2,2,4,3,4,2]
>>> np.mean(x)
2.5
>>> np.std(x)
0.9574271077563381
>>> np.corrcoef(x,y)
array([[ 1.        ,  0.67883359],
       [ 0.67883359,  1.        ]])
Useful packages
pandas
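# Note: pandas.stats.api was removed in pandas 0.20,
# so this example only runs on older pandas versions
import pandas as pd
from pandas.stats.api import ols

df = pd.DataFrame({"A": [10, 20, 30, 40, 50],
                   "B": [20, 30, 10, 40, 50],
                   "C": [32, 234, 23, 23, 42523]})
# regress A on B and C (an intercept is added automatically)
result = ols(y=df['A'], x=df[['B', 'C']])
print(result)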
prints a regression table like you would expect from any statistics
program:
Useful packages
-------------------------Summary of Regression Analysis-------------------------
Formula: Y ~ <B> + <C> + <intercept>
Number of Observations: 5
Number of Degrees of Freedom: 3
R-squared: 0.5789
Adj R-squared: 0.1577
Rmse: 14.5108
F-stat (2, 2): 1.3746, p-value: 0.4211
Degrees of Freedom: model 2, resid 2
-----------------------Summary of Estimated Coefficients------------------------
Variable Coef Std Err t-stat p-value CI 2.5% CI 97.5%
--------------------------------------------------------------------------------
B 0.4012 0.6497 0.62 0.5999 -0.8723 1.6746
C 0.0004 0.0005 0.65 0.5826 -0.0007 0.0014
intercept 14.9525 17.7643 0.84 0.4886 -19.8655 49.7705
---------------------------------End of Summary---------------------------------
... but you can get much more, like a list of predicted values
(result.y_predict), . . .
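On pandas versions where `pandas.stats.api` is no longer available, the same coefficients can be obtained in other ways. The following is a minimal sketch using ordinary least squares from numpy instead of the `ols` function shown above; it returns only the coefficients, the predicted values, and R-squared, not the full regression table. The variable names are chosen for this illustration.

```python
import numpy as np

# Same data as in the pandas example
A = np.array([10, 20, 30, 40, 50], dtype=float)
B = np.array([20, 30, 10, 40, 50], dtype=float)
C = np.array([32, 234, 23, 23, 42523], dtype=float)

# Design matrix: B, C, and a column of ones for the intercept
X = np.column_stack([B, C, np.ones(len(A))])
coef, _, _, _ = np.linalg.lstsq(X, A, rcond=None)

predicted = X @ coef
ss_res = ((A - predicted) ** 2).sum()
ss_tot = ((A - A.mean()) ** 2).sum()
r_squared = 1 - ss_res / ss_tot

print("B:", round(coef[0], 4))
print("C:", round(coef[1], 4))
print("intercept:", round(coef[2], 4))
print("R-squared:", round(r_squared, 4))
```

The coefficients and R-squared should agree with the regression table above up to rounding.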
Useful packages
matplotlib
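import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 3, 2]
y = [2, 2, 4, 3, 4, 2]
plt.hist(x)       # histogram of x
plt.plot(x, y)    # line plot of y against x, drawn on the same axes
plt.show()        # display the figure (needed outside interactive sessions)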
Some more tips
Some tips
• Make use of IPython features in Spyder (tab completion,
object inspector)
• Try things out in the IPython console (think of RStudio or
Stata!)
• Watch this video on “Python for data analysis” with pandas:
https://vimeo.com/59324550
Final project Next meetings
Final project
On 29–5, you have to hand in your final project
• Details and rules: ⇒ course manual
• Similar to take-home exam
• But: Much more advanced, and now, the result counts as well
• And: Be creative! You can use code from class, but you need
to extend it
• Start working on it!
Next meeting
Wednesday, 13–5
Lab session, focus on INDIVIDUAL PROJECTS! Prepare!
(No common exercise)