This document summarizes a presentation on analyzing word co-occurrences in text data using network analysis techniques. It discusses counting the frequency of word combinations, representing the co-occurrence data as a network with nodes for words and edges for co-occurrences, and visualizing the network in Gephi. It also provides an example analysis of tweets about a political debate, examining which topics were emphasized by each candidate based on word associations on Twitter.
Co-occurrences Networks · Other co-occurrence based methods · Next meetings
Big Data and Automated Content Analysis
Week 7 – Monday
»Co-occurring words«
Damian Trilling
d.c.trilling@uva.nl
@damian0604
www.damiantrilling.net
Afdeling Communicatiewetenschap
Universiteit van Amsterdam
9 May 2016
Big Data and Automated Content Analysis Damian Trilling
Today
1 Integrating word counts and network analysis: word co-occurrences
  • The idea
  • A real-life example
2 Other co-occurrence based methods
  • PCA
  • LDA
3 Next meetings & final project
Integrating word counts and network analysis: word co-occurrences
The idea
Simple word count
We already know this.
from collections import Counter

tekst = "this is a test where many test words occur several times this is because it is a test yes indeed it is"
c = Counter(tekst.split())
print("The top 5 are:")
for woord, aantal in c.most_common(5):
    print(aantal, woord)
The output:
The top 5 are:
4 is
3 test
2 a
2 this
2 it
What if we could count the frequency of combinations of words?
As in: which words typically occur together in the same tweet (or paragraph, or sentence, . . . )?
We can — with the combinations() function
>>> from itertools import combinations
>>> words = "Hoi this is a test test test a test it is".split()
>>> print([e for e in combinations(words, 2)])
[('Hoi', 'this'), ('Hoi', 'is'), ('Hoi', 'a'), ('Hoi', 'test'), ('Hoi', 'test'),
 ('Hoi', 'test'), ('Hoi', 'a'), ('Hoi', 'test'), ('Hoi', 'it'), ('Hoi', 'is'),
 ('this', 'is'), ('this', 'a'), ('this', 'test'), ('this', 'test'), ('this', 'test'),
 ('this', 'a'), ('this', 'test'), ('this', 'it'), ('this', 'is'), ('is', 'a'),
 ('is', 'test'), ('is', 'test'), ('is', 'test'), ('is', 'a'), ('is', 'test'),
 ('is', 'it'), ('is', 'is'), ('a', 'test'), ('a', 'test'), ('a', 'test'),
 ('a', 'a'), ('a', 'test'), ('a', 'it'), ('a', 'is'), ('test', 'test'),
 ('test', 'test'), ('test', 'a'), ('test', 'test'), ('test', 'it'), ('test', 'is'),
 ('test', 'test'), ('test', 'a'), ('test', 'test'), ('test', 'it'), ('test', 'is'),
 ('test', 'a'), ('test', 'test'), ('test', 'it'), ('test', 'is'), ('a', 'test'),
 ('a', 'it'), ('a', 'is'), ('test', 'it'), ('test', 'is'), ('it', 'is')]
Count co-occurrences
from collections import defaultdict
from itertools import combinations

tweets = ["i am having coffee with my friend", "i like coffee",
          "i like coffee and beer", "beer i like"]
cooc = defaultdict(int)

for tweet in tweets:
    words = tweet.split()
    for a, b in set(combinations(words, 2)):
        if (b, a) in cooc:
            a, b = b, a
        if a != b:
            cooc[(a, b)] += 1

for combi in sorted(cooc, key=cooc.get, reverse=True):
    print(cooc[combi], "\t", combi)
The output:
3 ('i', 'coffee')
3 ('i', 'like')
2 ('i', 'beer')
2 ('like', 'beer')
2 ('like', 'coffee')
1 ('coffee', 'beer')
1 ('and', 'beer')
...
From a list of co-occurrences to a network
Let's conceptualize each word as a node and each co-occurrence as an edge:
• node weight = word frequency
• edge weight = number of co-occurrences
A GDF file can store all of this.
How to represent the co-occurrences graphically?
A two-step approach:
1 Save the data as a GDF file (the format is easy to understand, so we can write a function for this in Python)
2 Open the GDF file in Gephi for visualization and/or network analysis
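As a sketch of what step 1 could look like: the function below builds on the `cooc` dictionary from the counting example and writes a minimal GDF file. The exact column set is a choice; the bare `nodedef`/`edgedef` layout used here is one that Gephi accepts, but treat it as an illustration rather than the definitive format.

```python
from collections import defaultdict, Counter
from itertools import combinations

tweets = ["i am having coffee with my friend", "i like coffee",
          "i like coffee and beer", "beer i like"]

# node weights: word frequencies across all tweets
wordcounts = Counter(word for tweet in tweets for word in tweet.split())

# edge weights: co-occurrence counts, as in the earlier example
cooc = defaultdict(int)
for tweet in tweets:
    for a, b in set(combinations(tweet.split(), 2)):
        if (b, a) in cooc:
            a, b = b, a
        if a != b:
            cooc[(a, b)] += 1

def save_gdf(wordcounts, cooc, filename):
    # Write nodes and edges in Gephi's GDF format:
    # first a node list, then an edge list, both as CSV-like rows.
    with open(filename, "w", encoding="utf-8") as f:
        f.write("nodedef>name VARCHAR,weight DOUBLE\n")
        for word, n in wordcounts.items():
            f.write("{},{}\n".format(word, n))
        f.write("edgedef>node1 VARCHAR,node2 VARCHAR,weight DOUBLE\n")
        for (a, b), n in cooc.items():
            f.write("{},{},{}\n".format(a, b, n))

save_gdf(wordcounts, cooc, "cooccurrences.gdf")
```

The resulting file can be opened directly in Gephi via File > Open.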
Gephi
• Install it (NOT in the VM) from https://gephi.org
• For problems on macOS, see what I wrote about Gephi here: http://www.damiantrilling.net/setting-up-my-new-macbook/
• I made a screencast on how to visualize the GDF file in Gephi: https://streamingmedia.uva.nl/asset/detail/t2KWKVZtQWZIe2Cj8qXcW5KF
• Further: see the materials I mailed to you
A real-life example
Trilling, D. (2015). Two different debates? Investigating the relationship between a political debate on TV and simultaneous comments on Twitter. Social Science Computer Review, 33, 259–276. doi:10.1177/0894439314537886
Commenting on the TV debate on Twitter
The viewers
• Commenting on television programs on social networks has become a regular pattern of behavior (Courtois & d'Heer, 2012)
• User comments have been shown to reflect the structure of the debate (Shamma, Churchill, & Kennedy, 2010; Shamma, Kennedy, & Churchill, 2009)
• Topic and speaker have a stronger effect than, e.g., rhetorical skills (Nagel, Maurer, & Reinemann, 2012; De Mooy & Maier, 2014)
Research Questions
To what extent are the statements politicians make during a TV debate reflected in online live discussions of the debate?
RQ1 Which topics are emphasized by the candidates?
RQ2 Which topics are emphasized by the Twitter users?
RQ3 With which topics are the two candidates associated
on Twitter?
Method
The data
• debate transcript
• tweets containing #tvduell
• N = 120,557 tweets by N = 24,796 users
• 22-9-2013, 20.30–22.00

The analysis
• Series of self-written Python scripts:
  1 preprocessing (stemming, stopword removal)
  2 word counts
  3 word log likelihood (corpus comparison)
• Stata: regression analysis
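The log likelihood in step 3 can be sketched as follows. This uses the standard corpus-comparison log-likelihood formulation (Rayson & Garside style); the word counts below are invented for illustration and are not taken from the actual study data.

```python
import math
from collections import Counter

def log_likelihood(word, corpus_a, corpus_b):
    # Corpus-comparison log likelihood: compare the observed frequency
    # of a word in two corpora with the frequency we would expect if
    # the word were distributed proportionally to corpus size.
    o1, o2 = corpus_a[word], corpus_b[word]
    n1, n2 = sum(corpus_a.values()), sum(corpus_b.values())
    e1 = n1 * (o1 + o2) / (n1 + n2)
    e2 = n2 * (o1 + o2) / (n1 + n2)
    ll = 0.0
    if o1 > 0:
        ll += o1 * math.log(o1 / e1)
    if o2 > 0:
        ll += o2 * math.log(o2 / e2)
    return 2 * ll

# hypothetical stemmed-word counts for the two speakers
merkel = Counter({"koalition": 7, "gemeinsam": 7, "europa": 3})
steinbrueck = Counter({"okonom": 5, "europa": 3})

print(log_likelihood("koalition", merkel, steinbrueck))
```

Words used disproportionately often by one speaker get a high log-likelihood value; words used in similar proportions by both score close to zero.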
Relationship between words on TV and on Twitter
[Scatter plot: ln(word on Twitter + 1) against ln(word on TV + 1)]
Word frequency TV ⇒ word frequency Twitter

                          Model 1            Model 2             Model 3
DV                        ln(Twitter +1)     ln(Twitter +1)      ln(Twitter +1)
                                             together w/ M.      together w/ S.
ln(TV M. +1)   b (SE)     1.59 (.052) ***    1.54 (.041) ***     .77 (.037) ***
               beta       .21                .26                 .14
ln(TV S. +1)   b (SE)     1.29 (.051) ***    .88 (.041) ***      1.25 (.037) ***
               beta       .17                .15                 .24
intercept                 1.64 (.008) ***    .87 (.007) ***      .60 (.006) ***
R2                        .100               .115                .100
b M. & S. differ?         F(1, 21408)=12.29  F(1, 21408)=96.69   F(1, 21408)=63.38
                          p < .001           p < .001            p < .001

M = Merkel; S = Steinbrück
Most distinctive words on TV
LL      word                        Frequency Merkel   Frequency Steinbrück
27.73   merkel                      0                  20
19.41   arbeitsplatz [job]          14                 0
15.25   steinbruck                  11                 0
9.70    koalition [coalition]       7                  0
9.70    international               7                  0
9.70    gemeinsam [together]        7                  0
8.55    griechenland [Greece]       10                 1
8.32    investi [investment]        6                  0
6.93    uberzeug [belief]           5                  0
6.93    okonom [economic]           0                  5
Most distinctive words on Twitter
LL         word                        Frequency Merkel   Frequency Steinbrück
32443.39   merkel                      29672              0
30751.65   steinbrueck                 0                  17780
1507.08    kett [necklace]             1628               34
1241.14    vertrau [trust]             1240               12
863.84     fdp [a coalition partner]   985                29
775.93     nsa                         1809               298
626.49     wikipedia                   40                 502
574.65     twittert [tweets]           40                 469
544.87     koalition [coalition]       864                77
517.99     gold                        669                34
Putting the pieces together
Merkel
• necklace
• trust (sarcastic)
• NSA affair
• coalition partners

Steinbrück
• suggestion to look sth. up on Wikipedia
• tweets from his account during the debate
Some terminology

Supervised machine learning
You have a dataset with both predictor and outcome (independent and dependent variables) — a labeled dataset. Think of regression: you measured x1, x2, x3 and you want to predict y, which you also measured.

Unsupervised machine learning
You have no labels. (You did not measure y.)
Again, you already know some techniques from other courses to find out how x1, x2, . . . xi co-occur:
• Principal Component Analysis
(PCA)
• Cluster analysis
• . . .
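To make the distinction concrete, a minimal sketch using scikit-learn and toy data invented purely for illustration (none of this is from the debate study):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.cluster import KMeans

# Toy data: four cases measured on two variables x1 and x2
X = np.array([[1, 2], [2, 1], [8, 9], [9, 8]], dtype=float)

# Supervised: we also measured y and learn to predict it from X
y = np.array([3.0, 3.0, 17.0, 17.0])
model = LinearRegression().fit(X, y)

# Unsupervised: no y at all; we only look for structure in X itself
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(clusters)
```

The regression learns the mapping from measured predictors to a measured outcome; the clustering groups the cases without ever seeing an outcome.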
PCA

Principal Component Analysis? How does that fit in here?
In fact, PCA is used everywhere, even in image compression.

PCA in ACA
• Find out which words co-occur (inductive frame analysis)
• Basically, transform each document into a vector of word frequencies and do a PCA
A so-called term-document matrix

   w1, w2, w3, w4, w5, w6 ...
text1, 2, 0, 0, 1, 2, 3 ...
text2, 0, 0, 1, 2, 3, 4 ...
text3, 9, 0, 1, 1, 0, 0 ...
...

The cells can be simple counts, but also more advanced metrics, like tf-idf scores (where you weight a term's frequency by the inverse of the number of documents in which it occurs), cosine distances, etc.
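A sketch of the whole procedure on four invented mini-documents, using scikit-learn (an assumption for illustration; any tool that can run a PCA on such a matrix works just as well):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import PCA

# Toy documents: two about the economy, two about football
docs = ["tax tax deficit economy",
        "economy deficit tax",
        "football world cup",
        "world cup football goal"]

# Build the matrix (rows = documents, columns = word frequencies)
dtm = CountVectorizer().fit_transform(docs).toarray()

# Reduce the word space to two principal components
pca = PCA(n_components=2)
scores = pca.fit_transform(dtm)
print(scores.shape)  # one row of component scores per document
```

Documents that use similar words end up with similar component scores, which is the basis for reading the components as inductively found frames or topics.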
PCA: implications and problems
• given a term-document matrix, it is easy to do with any tool
• but: probably extremely skewed distributions
• some problematic assumptions: does the goal of PCA (finding a solution in which each word loads on one component) match real life, where a word can belong to several topics or frames?
Enter topic modeling with Latent Dirichlet Allocation (LDA)
LDA, what’s that?
No mathematical details here, but the general idea:
• There are k topics, T1 . . . Tk
• Each document Di consists of a mixture of these topics, e.g. 80% T1, 15% T2, 0% T3, . . . 5% Tk
• On the next level, each topic consists of a specific probability distribution of words
• Thus, based on the frequencies of words in Di, one can infer its distribution of topics
• Note that LDA (like PCA) is a bag-of-words (BOW) approach
Doing an LDA in Python
You can use gensim (Řehůřek & Sojka, 2010) for this.

sudo pip3 install gensim

Furthermore, let us assume you have a list of lists of words (!) called texts:

articles = ['The tax deficit is higher than expected. This said xxx ...',
            'Germany won the World Cup. After a']
texts = [art.split() for art in articles]

which looks like this:

[['The', 'tax', 'deficit', 'is', 'higher', 'than', 'expected.', 'This', 'said', 'xxx', '...'],
 ['Germany', 'won', 'the', 'World', 'Cup.', 'After', 'a']]
Řehůřek, R., & Sojka, P. (2010). Software framework for topic modelling with large corpora. Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks (pp. 45–50). Valletta, Malta: ELRA.
from gensim import corpora, models

NTOPICS = 100
LDAOUTPUTFILE = "topicscores.tsv"

# Create a BOW representation of the texts
id2word = corpora.Dictionary(texts)
mm = [id2word.doc2bow(text) for text in texts]

# Train the LDA model
lda = models.ldamodel.LdaModel(corpus=mm, id2word=id2word,
                               num_topics=NTOPICS, alpha="auto")

# Print the topics
for top in lda.print_topics(num_topics=NTOPICS, num_words=5):
    print("\n", top)

print("\nFor further analysis, a dataset with the topic score for each "
      "document is saved to", LDAOUTPUTFILE)

scoresperdoc = lda.inference(mm)

with open(LDAOUTPUTFILE, "w", encoding="utf-8") as fo:
    for row in scoresperdoc[0]:
        fo.write("\t".join(["{:0.3f}".format(score) for score in row]))
        fo.write("\n")
Next meetings
Wednesday, 11–5: Lab session
Conduct an analysis based on word co-occurrences (Chapter 8 and/or 9.2). Install Gephi in advance!

No meeting on Monday (Pentecost)

Wednesday, 18–5: Supervised machine learning