Co-occurrences Networks Other co-occurrence based methods Next meetings
Big Data and Automated Content Analysis
Week 7 – Monday
»Co-occurring words«
Damian Trilling
d.c.trilling@uva.nl
@damian0604
www.damiantrilling.net
Afdeling Communicatiewetenschap
Universiteit van Amsterdam
9 May 2016
Big Data and Automated Content Analysis Damian Trilling
Today
1 Integrating word counts and network analysis: Word co-occurrences
The idea
A real-life example
2 Other co-occurrence based methods
PCA
LDA
3 Next meetings, & final project
Integrating word counts and network analysis:
Word co-occurrences
The idea
Simple word count
We already know this.
from collections import Counter

tekst = "this is a test where many test words occur several times this is because it is a test yes indeed it is"
# Split the text into words and count how often each word occurs
c = Counter(tekst.split())
print("The top 5 are: ")
for woord, aantal in c.most_common(5):
    print(aantal, woord)
The output (the order of words with equal counts may differ):
The top 5 are:
4 is
3 test
2 a
2 this
2 it
What if we could...
... count the frequency of combinations of words?
As in: Which words typically occur together in the same tweet (or paragraph, or sentence, ...)?
We can — with the combinations() function
>>> from itertools import combinations
>>> words = "Hoi this is a test test test a test it is".split()
>>> print(list(combinations(words, 2)))
[('Hoi', 'this'), ('Hoi', 'is'), ('Hoi', 'a'), ('Hoi', 'test'), ('Hoi', 'test'),
 ('Hoi', 'test'), ('Hoi', 'a'), ('Hoi', 'test'), ('Hoi', 'it'), ('Hoi', 'is'),
 ('this', 'is'), ('this', 'a'), ('this', 'test'), ('this', 'test'), ('this', 'test'),
 ('this', 'a'), ('this', 'test'), ('this', 'it'), ('this', 'is'), ('is', 'a'),
 ('is', 'test'), ('is', 'test'), ('is', 'test'), ('is', 'a'), ('is', 'test'),
 ('is', 'it'), ('is', 'is'), ('a', 'test'), ('a', 'test'), ('a', 'test'),
 ('a', 'a'), ('a', 'test'), ('a', 'it'), ('a', 'is'), ('test', 'test'),
 ('test', 'test'), ('test', 'a'), ('test', 'test'), ('test', 'it'), ('test', 'is'),
 ('test', 'test'), ('test', 'a'), ('test', 'test'), ('test', 'it'), ('test', 'is'),
 ('test', 'a'), ('test', 'test'), ('test', 'it'), ('test', 'is'), ('a', 'test'),
 ('a', 'it'), ('a', 'is'), ('test', 'it'), ('test', 'is'), ('it', 'is')]
Count co-occurrences
from collections import defaultdict
from itertools import combinations

tweets = ["i am having coffee with my friend", "i like coffee", "i like coffee and beer", "beer i like"]
cooc = defaultdict(int)

for tweet in tweets:
    words = tweet.split()
    # set() ensures that each pair is counted only once per tweet
    for a, b in set(combinations(words, 2)):
        # If the pair is already stored in reverse order, reuse that key
        if (b, a) in cooc:
            a, b = b, a
        # Skip pairs of a word with itself
        if a != b:
            cooc[(a, b)] += 1

# Print the pairs, most frequent first
for combi in sorted(cooc, key=cooc.get, reverse=True):
    print(cooc[combi], "\t", combi)
The output:
3 	 ('i', 'coffee')
3 	 ('i', 'like')
2 	 ('i', 'beer')
2 	 ('like', 'beer')
2 	 ('like', 'coffee')
1 	 ('coffee', 'beer')
1 	 ('and', 'beer')
...
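The same counts can also be obtained without the order check, by sorting the words of each tweet before building the pairs. What follows is only a minimal sketch of that variant, not the code from the slide: the function name count_cooccurrences is made up for illustration, and because pairs are stored in alphabetical order, the keys look slightly different from the output above (for example ('coffee', 'i') instead of ('i', 'coffee')).

```python
from collections import Counter
from itertools import combinations

def count_cooccurrences(documents):
    """Count in how many documents each unordered pair of words occurs."""
    cooc = Counter()
    for doc in documents:
        # set() counts each word once per document;
        # sorted() gives every pair one fixed (alphabetical) order
        unique_words = sorted(set(doc.split()))
        cooc.update(combinations(unique_words, 2))
    return cooc

tweets = ["i am having coffee with my friend", "i like coffee",
          "i like coffee and beer", "beer i like"]
cooc = count_cooccurrences(tweets)
for pair, n in cooc.most_common(5):
    print(n, "\t", pair)
```

Because every pair has exactly one possible key, no lookup of the reversed pair is needed.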
From a list of co-occurrences to a network
Let's conceptualize each word as a node and each co-occurrence as an edge
• node weight = word frequency
• edge weight = number of co-occurrences
A GDF file offers all of this and looks like this:
nodedef>name VARCHAR, width DOUBLE
coffee,3
beer,2
i,4
and,1
with,1
friend,1
having,1
like,3
am,1
my,1
edgedef>node1 VARCHAR,node2 VARCHAR, weight DOUBLE
coffee,beer,1
i,beer,2
and,beer,1
with,friend,1
coffee,with,1
i,and,1
having,friend,1
like,beer,2
am,friend,1
i,am,1
i,coffee,3
i,with,1
am,having,1
i,having,1
coffee,and,1
like,coffee,2
How to represent the co-occurrences graphically?
A two-step approach
1 Save as a GDF file (the format seems easy to understand, so we could write a function for this in Python)
2 Open the GDF file in Gephi for visualization and/or network analysis
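Step 1 could be sketched roughly as follows. This is an illustration under assumptions, not the function used in the course: the name build_gdf is hypothetical, the two header lines are copied from the GDF example above, and words are assumed to contain no commas (a comma inside a word would break the format).

```python
from collections import Counter
from itertools import combinations

def build_gdf(documents):
    """Return GDF text: word frequencies as nodes, co-occurrence counts as edges."""
    word_counts = Counter()
    cooc = Counter()
    for doc in documents:
        words = doc.split()
        word_counts.update(words)                          # node weight
        cooc.update(combinations(sorted(set(words)), 2))   # edge weight
    lines = ["nodedef>name VARCHAR, width DOUBLE"]
    for word, n in word_counts.items():
        lines.append("{},{}".format(word, n))
    lines.append("edgedef>node1 VARCHAR,node2 VARCHAR, weight DOUBLE")
    for (a, b), n in cooc.items():
        lines.append("{},{},{}".format(a, b, n))
    return "\n".join(lines) + "\n"

tweets = ["i am having coffee with my friend", "i like coffee",
          "i like coffee and beer", "beer i like"]
gdf = build_gdf(tweets)
print(gdf)
```

Writing the returned string to a file with the extension .gdf gives a file that can then be opened in Gephi for step 2.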
Gephi
• Install (NOT in the VM) from https://gephi.org
• If you run into problems on macOS, see what I wrote about Gephi here: http://www.damiantrilling.net/setting-up-my-new-macbook/
• I made a screencast on how to visualize the GDF file in Gephi: https://streamingmedia.uva.nl/asset/detail/t2KWKVZtQWZIe2Cj8qXcW5KF
• Further: see the materials I mailed to you
A real-life example
Trilling, D. (2015). Two different debates? Investigating the relationship between a political debate on TV and simultaneous comments on Twitter. Social Science Computer Review, 33, 259–276. doi: 10.1177/0894439314537886
Commenting on the TV debate on Twitter
The viewers
• Commenting on television programs on social networks has become a regular pattern of behavior (Courtois & d'Heer, 2012)
• User comments have been shown to reflect the structure of the debate (Shamma, Churchill, & Kennedy, 2010; Shamma, Kennedy, & Churchill, 2009)
• Topic and speaker effects are more influential than, e.g., rhetorical skills (Nagel, Maurer, & Reinemann, 2012; De Mooy & Maier, 2014)
Research Questions
To what extent are the statements politicians make during a TV debate reflected in online live discussions of the debate?
RQ1 Which topics are emphasized by the candidates?
RQ2 Which topics are emphasized by the Twitter users?
RQ3 With which topics are the two candidates associated on Twitter?
Method
The data
• debate transcript
• tweets containing #tvduell
• N = 120,557 tweets by N = 24,796 users
• 22-9-2013, 20.30-22.00
The analysis
• Series of self-written Python scripts:
  1 preprocessing (stemming, stopword removal)
  2 word counts
  3 word log likelihood (corpus comparison)
• Stata: regression analysis
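The slides do not say which log-likelihood formula was used for the corpus comparison. A minimal sketch, assuming a commonly used two-corpus variant in which the observed frequency of a word in each corpus is compared with its expected frequency, could look like this. The function name and the corpus sizes of 5000 words are made up for illustration.

```python
from math import log

def log_likelihood(a, b, c, d):
    """Log likelihood of a word that occurs a times in corpus 1 (c words in total)
    and b times in corpus 2 (d words in total)."""
    e1 = c * (a + b) / (c + d)   # expected frequency in corpus 1
    e2 = d * (a + b) / (c + d)   # expected frequency in corpus 2
    ll = 0.0
    # an observed frequency of 0 contributes nothing to the sum
    if a > 0:
        ll += a * log(a / e1)
    if b > 0:
        ll += b * log(b / e2)
    return 2 * ll

# A word used 0 times in one corpus and 20 times in an equally large other corpus
print(round(log_likelihood(0, 20, 5000, 5000), 2))   # prints 27.73
```

With equally sized corpora, frequencies of 0 and 20 give 27.73, which is the value reported for "merkel" in the table of distinctive words on TV. That is consistent with this variant, but whether the original scripts computed it exactly this way remains an assumption.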
[Figure: time series; vertical axis from 0 to 8000, horizontal axis from −60 to 150, with "start" and "end" marked]
Relationship between words on TV and on Twitter
[Figure: ln(word on Twitter + 1) on the vertical axis (0 to 10) plotted against ln(word on TV + 1) on the horizontal axis (0 to 3)]
Word frequency TV ⇒ word frequency Twitter
                    Model 1                Model 2                Model 3
                    ln(Twitter +1)         ln(Twitter +1)         ln(Twitter +1)
                                           together w/ M.         together w/ S.
                    b (SE)                 b (SE)                 b (SE)
                    beta                   beta                   beta
ln(TV M. +1)        1.59 (.052) ***        1.54 (.041) ***        .77 (.037) ***
                    .21                    .26                    .14
ln(TV S. +1)        1.29 (.051) ***        .88 (.041) ***         1.25 (.037) ***
                    .17                    .15                    .24
intercept           1.64 (.008) ***        .87 (.007) ***         .60 (.006) ***
R2                  .100                   .115                   .100
b M. & S. differ?   F(1, 21408) = 12.29    F(1, 21408) = 96.69    F(1, 21408) = 63.38
                    p < .001               p < .001               p < .001
M = Merkel; S = Steinbrück
Most distinctive words on TV
LL       word                      Frequency Merkel   Frequency Steinbrück
27.73    merkel                    0                  20
19.41    arbeitsplatz [job]        14                 0
15.25    steinbruck                11                 0
9.70     koalition [coalition]     7                  0
9.70     international             7                  0
9.70     gemeinsam [together]      7                  0
8.55     griechenland [Greece]     10                 1
8.32     investi [investment]      6                  0
6.93     uberzeug [belief]         5                  0
6.93     okonom [economic]         0                  5
Most distinctive words on Twitter
LL         word                        Frequency Merkel   Frequency Steinbrück
32443.39   merkel                      29672              0
30751.65   steinbrueck                 0                  17780
1507.08    kett [necklace]             1628               34
1241.14    vertrau [trust]             1240               12
863.84     fdp [a coalition partner]   985                29
775.93     nsa                         1809               298
626.49     wikipedia                   40                 502
574.65     twittert [tweets]           40                 469
544.87     koalition [coalition]       864                77
517.99     gold                        669                34
Putting the pieces together
Merkel
• necklace
• trust (sarcastic)
• NSA affair
• coalition partners
Steinbrück
• suggestion to look something up on Wikipedia
• tweets from his account during the debate
Other (non-network-based, statistical) co-occurrence based methods
Enter unsupervised machine learning
(something you already did in your Bachelor – no kidding.)
Some terminology
Supervised machine learning
You have a dataset with both predictor and outcome (independent and dependent variables): a labeled dataset.
Think of regression: You measured x1, x2, x3 and you want to predict y, which you also measured.
Unsupervised machine learning
You have no labels. (You did not measure y.)
Again, you already know some techniques from other courses to find out how x1, x2, ..., x_i co-occur:
• Principal Component Analysis (PCA)
• Cluster analysis
• ...
PCA
Principal Component Analysis? How does that fit in here?
In fact, PCA is used everywhere, even in image compression
PCA in ACA
• Find out which words co-occur (inductive frame analysis)
• Basically, transform each document into a vector of word frequencies and do a PCA
A so-called term-document matrix
       w1, w2, w3, w4, w5, w6 ...
text1,  2,  0,  0,  1,  2,  3 ...
text2,  0,  0,  1,  2,  3,  4 ...
text3,  9,  0,  1,  1,  0,  0 ...
...
These can be simple counts, but also more advanced metrics, like tf-idf scores (where the frequency of a word in a document is weighted down by the number of documents in which the word occurs), cosine distances, etc.
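To make the difference between raw counts and tf-idf scores concrete, here is a small sketch in plain Python. The three toy documents are made up, and the weighting shown (count multiplied by log(N / df)) is only one of several common tf-idf variants; libraries often use slightly different formulas.

```python
from collections import Counter
from math import log

# Made-up toy documents
texts = ["the tax deficit is higher than expected",
         "germany won the world cup",
         "the world cup final"]

docs = [Counter(t.split()) for t in texts]
vocabulary = sorted(set(w for d in docs for w in d))

# One row per document, one column per word: raw counts
count_matrix = [[d[w] for w in vocabulary] for d in docs]

# Document frequency: in how many documents does each word occur?
df = {w: sum(1 for d in docs if w in d) for w in vocabulary}
n_docs = len(docs)

# tf-idf: the count multiplied by log(N / df)
tfidf_matrix = [[d[w] * log(n_docs / df[w]) for w in vocabulary] for d in docs]

for w, count, score in zip(vocabulary, count_matrix[0], tfidf_matrix[0]):
    print(w, count, round(score, 2))
```

A word such as "the", which occurs in every document, gets a score of 0 under this variant, while a word that occurs in only one document keeps a high score.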
PCA: implications and problems
• given a term-document matrix, easy to do with any tool
• probably extremely skewed distributions
• some problematic assumptions: does the goal of PCA, to find a solution in which one word loads on one component, match real life, where a word can belong to several topics or frames?
LDA
Enter topic modeling with Latent Dirichlet Allocation (LDA)
LDA, what's that?
No mathematical details here, but the general idea:
• There are k topics, T1 ... Tk
• Each document Di consists of a mixture of these topics, e.g. 80% T1, 15% T2, 0% T3, ..., 5% Tk
• On the next level, each topic consists of a specific probability distribution of words
• Thus, based on the frequencies of words in Di, one can infer its distribution of topics
• Note that LDA (like PCA) is a Bag-of-Words (BOW) approach
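The two levels described above can be illustrated by simulating how a document would come about under these assumptions. The sketch below is not an LDA estimation: it only generates a document from a given topic mixture, which is the process that LDA tries to reverse. All topics, words, and probabilities are made up.

```python
import random

random.seed(42)

# Two made-up topics, each a probability distribution over words
topics = {
    "T1": {"tax": 0.5, "deficit": 0.3, "bank": 0.2},
    "T2": {"goal": 0.5, "match": 0.3, "cup": 0.2},
}
# A made-up document that consists of 80% T1 and 20% T2
doc_topic_mixture = {"T1": 0.8, "T2": 0.2}

def generate_document(mixture, topics, length):
    """Draw `length` words: first pick a topic, then a word from that topic."""
    words = []
    for _ in range(length):
        # level 1: draw a topic according to the document's topic mixture
        topic = random.choices(list(mixture), weights=list(mixture.values()))[0]
        # level 2: draw a word from that topic's word distribution
        dist = topics[topic]
        words.append(random.choices(list(dist), weights=list(dist.values()))[0])
    return words

doc = generate_document(doc_topic_mixture, topics, 20)
print(doc)
```

Word order plays no role in this process, which is why the approach counts as Bag-of-Words.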
Doing an LDA in Python
You can use gensim (Řehůřek & Sojka, 2010) for this.
sudo pip3 install gensim
Furthermore, let us assume you have a list of lists of words (!) called texts:
articles = ['The tax deficit is higher than expected. This said xxx ...', 'Germany won the World Cup. After a']
texts = [art.split() for art in articles]
which looks like this:
[['The', 'tax', 'deficit', 'is', 'higher', 'than', 'expected.', 'This', 'said', 'xxx', '...'], ['Germany', 'won', 'the', 'World', 'Cup.', 'After', 'a']]
Řehůřek, R., & Sojka, P. (2010). Software framework for topic modelling with large corpora. Proceedings of the
LREC 2010 Workshop on New Challenges for NLP Frameworks, pp. 45–50. Valletta, Malta: ELRA.
from gensim import corpora, models

NTOPICS = 100
LDAOUTPUTFILE = "topicscores.tsv"

# Create a BOW representation of the texts
id2word = corpora.Dictionary(texts)
mm = [id2word.doc2bow(text) for text in texts]

# Train the LDA model
lda = models.ldamodel.LdaModel(corpus=mm, id2word=id2word, num_topics=NTOPICS, alpha="auto")

# Print the topics
for top in lda.print_topics(num_topics=NTOPICS, num_words=5):
    print("\n", top)

print("\nFor further analysis, a dataset with the topic score for each document is saved to", LDAOUTPUTFILE)

# inference() returns a tuple; its first element holds the topic scores per document
scoresperdoc = lda.inference(mm)

# Write one tab-separated row of topic scores per document
with open(LDAOUTPUTFILE, "w", encoding="utf-8") as fo:
    for row in scoresperdoc[0]:
        fo.write("\t".join(["{:0.3f}".format(score) for score in row]))
        fo.write("\n")
Output: Topics (below) & topic scores (next slide)
0.069*fusie + 0.058*brussel + 0.045*europesecommissie + 0.036*europese + 0.023*overname
0.109*bank + 0.066*britse + 0.041*regering + 0.035*financien + 0.033*minister
0.114*nederlandse + 0.106*nederland + 0.070*bedrijven + 0.042*rusland + 0.038*russische
0.093*nederlandsespoorwegen + 0.074*den + 0.036*jaar + 0.029*onderzoek + 0.027*raad
0.099*banen + 0.045*jaar + 0.045*productie + 0.036*ton + 0.029*aantal
0.041*grote + 0.038*bedrijven + 0.027*ondernemers + 0.023*goed + 0.015*jaar
0.108*werknemers + 0.037*jongeren + 0.035*werkgevers + 0.029*jaar + 0.025*werk
0.171*bank + 0.122* + 0.041*klanten + 0.035*verzekeraar + 0.028*euro
0.162*banken + 0.055*bank + 0.039*centrale + 0.027*leningen + 0.024*financiele
0.052*post + 0.042*media + 0.038*nieuwe + 0.034*netwerk + 0.025*personeel
...
Next meetings
Wednesday, 11 May
Lab session
Conduct an analysis based on word co-occurrences (Chapter 8 and/or 9.2). Install Gephi in advance!
No meeting on Monday (Pentecost)
Wednesday, 18 May
Supervised machine learning
Big Data and Automated Content Analysis Damian Trilling

More Related Content

What's hot

VLDB 2015 Tutorial: On Uncertain Graph Modeling and Queries
VLDB 2015 Tutorial: On Uncertain Graph Modeling and QueriesVLDB 2015 Tutorial: On Uncertain Graph Modeling and Queries
VLDB 2015 Tutorial: On Uncertain Graph Modeling and QueriesArijit Khan
 
Distributed Natural Language Processing Systems in Python
Distributed Natural Language Processing Systems in PythonDistributed Natural Language Processing Systems in Python
Distributed Natural Language Processing Systems in PythonClare Corthell
 

What's hot (20)

BDACA - Lecture4
BDACA - Lecture4BDACA - Lecture4
BDACA - Lecture4
 
BDACA1617s2 - Lecture7
BDACA1617s2 - Lecture7BDACA1617s2 - Lecture7
BDACA1617s2 - Lecture7
 
BDACA1617s2 - Lecture6
BDACA1617s2 - Lecture6BDACA1617s2 - Lecture6
BDACA1617s2 - Lecture6
 
BDACA1617s2 - Lecture4
BDACA1617s2 - Lecture4BDACA1617s2 - Lecture4
BDACA1617s2 - Lecture4
 
BDACA1617s2 - Lecture5
BDACA1617s2 - Lecture5BDACA1617s2 - Lecture5
BDACA1617s2 - Lecture5
 
BDACA - Lecture2
BDACA - Lecture2BDACA - Lecture2
BDACA - Lecture2
 
BDACA1617s2 - Lecture3
BDACA1617s2 - Lecture3BDACA1617s2 - Lecture3
BDACA1617s2 - Lecture3
 
BDACA - Lecture5
BDACA - Lecture5BDACA - Lecture5
BDACA - Lecture5
 
BDACA - Tutorial5
BDACA - Tutorial5BDACA - Tutorial5
BDACA - Tutorial5
 
BDACA - Lecture7
BDACA - Lecture7BDACA - Lecture7
BDACA - Lecture7
 
BDACA1617s2 - Lecture 2
BDACA1617s2 - Lecture 2BDACA1617s2 - Lecture 2
BDACA1617s2 - Lecture 2
 
BDACA1617s2 - Tutorial 1
BDACA1617s2 - Tutorial 1BDACA1617s2 - Tutorial 1
BDACA1617s2 - Tutorial 1
 
BD-ACA week5
BD-ACA week5BD-ACA week5
BD-ACA week5
 
BDACA - Lecture6
BDACA - Lecture6BDACA - Lecture6
BDACA - Lecture6
 
BDACA - Lecture8
BDACA - Lecture8BDACA - Lecture8
BDACA - Lecture8
 
Analyzing social media with Python and other tools (2/4)
Analyzing social media with Python and other tools (2/4) Analyzing social media with Python and other tools (2/4)
Analyzing social media with Python and other tools (2/4)
 
Analyzing social media with Python and other tools (4/4)
Analyzing social media with Python and other tools (4/4) Analyzing social media with Python and other tools (4/4)
Analyzing social media with Python and other tools (4/4)
 
Algorithm
AlgorithmAlgorithm
Algorithm
 
VLDB 2015 Tutorial: On Uncertain Graph Modeling and Queries
VLDB 2015 Tutorial: On Uncertain Graph Modeling and QueriesVLDB 2015 Tutorial: On Uncertain Graph Modeling and Queries
VLDB 2015 Tutorial: On Uncertain Graph Modeling and Queries
 
Distributed Natural Language Processing Systems in Python
Distributed Natural Language Processing Systems in PythonDistributed Natural Language Processing Systems in Python
Distributed Natural Language Processing Systems in Python
 

Viewers also liked (13)

Viennapharmacia ltd
Viennapharmacia ltdViennapharmacia ltd
Viennapharmacia ltd
 
Reunió pedagògica 1r 2016 2017 definitiva
Reunió pedagògica 1r 2016 2017 definitivaReunió pedagògica 1r 2016 2017 definitiva
Reunió pedagògica 1r 2016 2017 definitiva
 
BDACA1516s2 - Lecture4
 BDACA1516s2 - Lecture4 BDACA1516s2 - Lecture4
BDACA1516s2 - Lecture4
 
e0101
e0101e0101
e0101
 
Taller de refuerzo 1
Taller de refuerzo 1Taller de refuerzo 1
Taller de refuerzo 1
 
PCM
PCMPCM
PCM
 
Presentacion
PresentacionPresentacion
Presentacion
 
People, Culture, & Perceptions
People, Culture, & PerceptionsPeople, Culture, & Perceptions
People, Culture, & Perceptions
 
Conceptualizing and measuring news exposure as network of users and news items
Conceptualizing and measuring news exposure as network of users and news itemsConceptualizing and measuring news exposure as network of users and news items
Conceptualizing and measuring news exposure as network of users and news items
 
6.1 the roman republic
6.1 the roman republic6.1 the roman republic
6.1 the roman republic
 
Andean Summit Mini Agenda
Andean Summit Mini AgendaAndean Summit Mini Agenda
Andean Summit Mini Agenda
 
Ejercicios de retroalimentacion
Ejercicios de retroalimentacionEjercicios de retroalimentacion
Ejercicios de retroalimentacion
 
Tipos de cromosomas
Tipos de cromosomasTipos de cromosomas
Tipos de cromosomas
 

Similar to Co-occurrence word network analysis

Tutorial - Recommender systems meet linked open data - ICWE 2016 - Lugano - 0...
Tutorial - Recommender systems meet linked open data - ICWE 2016 - Lugano - 0...Tutorial - Recommender systems meet linked open data - ICWE 2016 - Lugano - 0...
Tutorial - Recommender systems meet linked open data - ICWE 2016 - Lugano - 0...Polytechnic University of Bari
 
Portfolio_JenniHawk
Portfolio_JenniHawkPortfolio_JenniHawk
Portfolio_JenniHawkJenniHawk
 
Bayesian Network 을 활용한 예측 분석
Bayesian Network 을 활용한 예측 분석Bayesian Network 을 활용한 예측 분석
Bayesian Network 을 활용한 예측 분석datasciencekorea
 
Private Distributed Collaborative Filtering
Private Distributed Collaborative FilteringPrivate Distributed Collaborative Filtering
Private Distributed Collaborative FilteringNeal Lathia
 
GPU Acceleration of Set Similarity Joins
GPU Acceleration of Set Similarity JoinsGPU Acceleration of Set Similarity Joins
GPU Acceleration of Set Similarity JoinsMateus S. H. Cruz
 
Frontiers of Computational Journalism week 1 - Introduction and High Dimensio...
Frontiers of Computational Journalism week 1 - Introduction and High Dimensio...Frontiers of Computational Journalism week 1 - Introduction and High Dimensio...
Frontiers of Computational Journalism week 1 - Introduction and High Dimensio...Jonathan Stray
 
Telefonica Lunch Seminar
Telefonica Lunch SeminarTelefonica Lunch Seminar
Telefonica Lunch SeminarNeal Lathia
 
Big-data analytics: challenges and opportunities
Big-data analytics: challenges and opportunitiesBig-data analytics: challenges and opportunities
Big-data analytics: challenges and opportunities台灣資料科學年會
 
Machine Learning ICS 273A
Machine Learning ICS 273AMachine Learning ICS 273A
Machine Learning ICS 273Abutest
 
Machine Learning ICS 273A
Machine Learning ICS 273AMachine Learning ICS 273A
Machine Learning ICS 273Abutest
 
Google socialnetworksmarch08
Google socialnetworksmarch08Google socialnetworksmarch08
Google socialnetworksmarch08Yves Caseau
 
Fyp ideas
Fyp ideasFyp ideas
Fyp ideasMr SMAK
 
Recsys Presentation
Recsys PresentationRecsys Presentation
Recsys PresentationNeal Lathia
 
Governance Services iTime Project: Automated Filing - DevCon 2019
Governance Services iTime Project: Automated Filing - DevCon 2019Governance Services iTime Project: Automated Filing - DevCon 2019
Governance Services iTime Project: Automated Filing - DevCon 2019Tom Page
 
Cloudera Data Science Challenge 3 Solution by Doug Needham
Cloudera Data Science Challenge 3 Solution by Doug NeedhamCloudera Data Science Challenge 3 Solution by Doug Needham
Cloudera Data Science Challenge 3 Solution by Doug NeedhamDoug Needham
 
CML's Presentation at FengChia University
CML's Presentation at FengChia UniversityCML's Presentation at FengChia University
CML's Presentation at FengChia UniversityTunghai University
 

Similar to Co-occurrence word network analysis (20)

BD-ACA week3a
BD-ACA week3aBD-ACA week3a
BD-ACA week3a
 
Tutorial - Recommender systems meet linked open data - ICWE 2016 - Lugano - 0...
Tutorial - Recommender systems meet linked open data - ICWE 2016 - Lugano - 0...Tutorial - Recommender systems meet linked open data - ICWE 2016 - Lugano - 0...
Tutorial - Recommender systems meet linked open data - ICWE 2016 - Lugano - 0...
 
Portfolio_JenniHawk
Portfolio_JenniHawkPortfolio_JenniHawk
Portfolio_JenniHawk
 
Bayesian Network 을 활용한 예측 분석
Bayesian Network 을 활용한 예측 분석Bayesian Network 을 활용한 예측 분석
Bayesian Network 을 활용한 예측 분석
 
BD-ACA week2
BD-ACA week2BD-ACA week2
BD-ACA week2
 
Private Distributed Collaborative Filtering
Private Distributed Collaborative FilteringPrivate Distributed Collaborative Filtering
Private Distributed Collaborative Filtering
 
GPU Acceleration of Set Similarity Joins
GPU Acceleration of Set Similarity JoinsGPU Acceleration of Set Similarity Joins
GPU Acceleration of Set Similarity Joins
 
Frontiers of Computational Journalism week 1 - Introduction and High Dimensio...
Frontiers of Computational Journalism week 1 - Introduction and High Dimensio...Frontiers of Computational Journalism week 1 - Introduction and High Dimensio...
Frontiers of Computational Journalism week 1 - Introduction and High Dimensio...
 
Telefonica Lunch Seminar
Telefonica Lunch SeminarTelefonica Lunch Seminar
Telefonica Lunch Seminar
 
Similarity at Scale
Similarity at ScaleSimilarity at Scale
Similarity at Scale
 
Big-data analytics: challenges and opportunities
Big-data analytics: challenges and opportunitiesBig-data analytics: challenges and opportunities
Big-data analytics: challenges and opportunities
 
Machine Learning ICS 273A
Machine Learning ICS 273AMachine Learning ICS 273A
Machine Learning ICS 273A
 
Machine Learning ICS 273A
Machine Learning ICS 273AMachine Learning ICS 273A
Machine Learning ICS 273A
 
Google socialnetworksmarch08
Google socialnetworksmarch08Google socialnetworksmarch08
Google socialnetworksmarch08
 
Fyp ideas
Fyp ideasFyp ideas
Fyp ideas
 
Recsys Presentation
Recsys PresentationRecsys Presentation
Recsys Presentation
 
Governance Services iTime Project: Automated Filing - DevCon 2019
Governance Services iTime Project: Automated Filing - DevCon 2019Governance Services iTime Project: Automated Filing - DevCon 2019
Governance Services iTime Project: Automated Filing - DevCon 2019
 
Why Data Science is a Science
Why Data Science is a ScienceWhy Data Science is a Science
Why Data Science is a Science
 
Cloudera Data Science Challenge 3 Solution by Doug Needham
Cloudera Data Science Challenge 3 Solution by Doug NeedhamCloudera Data Science Challenge 3 Solution by Doug Needham
Cloudera Data Science Challenge 3 Solution by Doug Needham
 
CML's Presentation at FengChia University
CML's Presentation at FengChia UniversityCML's Presentation at FengChia University
CML's Presentation at FengChia University
 

More from Department of Communication Science, University of Amsterdam (8)

BDACA - Tutorial1
BDACA - Tutorial1BDACA - Tutorial1
BDACA - Tutorial1
 
BDACA - Lecture1
BDACA - Lecture1BDACA - Lecture1
BDACA - Lecture1
 
BDACA1617s2 - Lecture 1
BDACA1617s2 - Lecture 1BDACA1617s2 - Lecture 1
BDACA1617s2 - Lecture 1
 
Media diets in an age of apps and social media: Dealing with a third layer of...
Media diets in an age of apps and social media: Dealing with a third layer of...Media diets in an age of apps and social media: Dealing with a third layer of...
Media diets in an age of apps and social media: Dealing with a third layer of...
 
Data Science: Case "Political Communication 2/2"
Data Science: Case "Political Communication 2/2"Data Science: Case "Political Communication 2/2"
Data Science: Case "Political Communication 2/2"
 
Data Science: Case "Political Communication 1/2"
Data Science: Case "Political Communication 1/2"Data Science: Case "Political Communication 1/2"
Data Science: Case "Political Communication 1/2"
 
BDACA1516s2 - Lecture1
BDACA1516s2 - Lecture1BDACA1516s2 - Lecture1
BDACA1516s2 - Lecture1
 
Should we worry about filter bubbles?
Should we worry about filter bubbles?Should we worry about filter bubbles?
Should we worry about filter bubbles?
 

Recently uploaded

Accessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactAccessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactdawncurless
 
1029-Danh muc Sach Giao Khoa khoi 6.pdf
1029-Danh muc Sach Giao Khoa khoi  6.pdf1029-Danh muc Sach Giao Khoa khoi  6.pdf
1029-Danh muc Sach Giao Khoa khoi 6.pdfQucHHunhnh
 
Co-occurrence word network analysis

  • 1. Co-occurrences Networks Other co-occurrence based methods Next meetings Big Data and Automated Content Analysis Week 7 – Monday »Co-occurring words« Damian Trilling d.c.trilling@uva.nl @damian0604 www.damiantrilling.net Afdeling Communicatiewetenschap Universiteit van Amsterdam 9 May 2016 Big Data and Automated Content Analysis Damian Trilling
  • 2. Today
      1. Integrating word counts and network analysis: Word co-occurrences
           The idea
           A real-life example
      2. Other co-occurrence based methods
           PCA
           LDA
      3. Next meetings, & final project
  • 3. Integrating word counts and network analysis: Word co-occurrences
  • 4. The idea: Simple word count
      We already know this.
        from collections import Counter

        tekst = ("this is a test where many test words occur several times this is "
                 "because it is a test yes indeed it is")
        # Split on whitespace and count how often each word occurs
        c = Counter(tekst.split())
        print("The top 5 are: ")
        # most_common(5) yields (word, count) pairs, most frequent first
        for woord, aantal in c.most_common(5):
            print(aantal, woord)
  • 5. The idea: Simple word count
      The output (words with equal counts may appear in a different order):
        The top 5 are:
        4 is
        3 test
        2 a
        2 this
        2 it
  • 7. The idea: What if we could...
      ...count the frequency of combinations of words?
      As in: Which words typically occur together in the same tweet (or paragraph, or sentence, ...)?
  • 8. The idea: We can, with the combinations() function
        >>> from itertools import combinations
        >>> words = "Hoi this is a test test test a test it is".split()
        >>> print(list(combinations(words, 2)))
        [('Hoi', 'this'), ('Hoi', 'is'), ('Hoi', 'a'), ('Hoi', 'test'), ('Hoi', 'test'),
         ('Hoi', 'test'), ('Hoi', 'a'), ('Hoi', 'test'), ('Hoi', 'it'), ('Hoi', 'is'),
         ('this', 'is'), ('this', 'a'), ('this', 'test'), ('this', 'test'), ('this', 'test'),
         ('this', 'a'), ('this', 'test'), ('this', 'it'), ('this', 'is'), ('is', 'a'),
         ('is', 'test'), ('is', 'test'), ('is', 'test'), ('is', 'a'), ('is', 'test'),
         ('is', 'it'), ('is', 'is'), ('a', 'test'), ('a', 'test'), ('a', 'test'),
         ('a', 'a'), ('a', 'test'), ('a', 'it'), ('a', 'is'), ('test', 'test'),
         ('test', 'test'), ('test', 'a'), ('test', 'test'), ('test', 'it'), ('test', 'is'),
         ('test', 'test'), ('test', 'a'), ('test', 'test'), ('test', 'it'), ('test', 'is'),
         ('test', 'a'), ('test', 'test'), ('test', 'it'), ('test', 'is'), ('a', 'test'),
         ('a', 'it'), ('a', 'is'), ('test', 'it'), ('test', 'is'), ('it', 'is')]
  • 9. The idea: Count co-occurrences
        from collections import defaultdict
        from itertools import combinations

        tweets = ["i am having coffee with my friend", "i like coffee",
                  "i like coffee and beer", "beer i like"]
        cooc = defaultdict(int)

        for tweet in tweets:
            words = tweet.split()
            # set() ensures that each ordered pair is counted at most once per tweet
            for a, b in set(combinations(words, 2)):
                # Treat (b, a) as the same pair as (a, b) if it was stored before
                if (b, a) in cooc:
                    a, b = b, a
                # Skip pairs of a word with itself
                if a != b:
                    cooc[(a, b)] += 1

        # Print the pairs, most frequent first
        for combi in sorted(cooc, key=cooc.get, reverse=True):
            print(cooc[combi], "\t", combi)
  • 10. The idea: Count co-occurrences
      The output:
        3    ('i', 'coffee')
        3    ('i', 'like')
        2    ('i', 'beer')
        2    ('like', 'beer')
        2    ('like', 'coffee')
        1    ('coffee', 'beer')
        1    ('and', 'beer')
        ...
  • 11. The idea: From a list of co-occurrences to a network
      Let's conceptualize each word as a node and each co-occurrence as an edge:
        • node weight = word frequency
        • edge weight = number of co-occurrences
      A GDF file offers all of this and looks like this:
  • 12.
        nodedef>name VARCHAR, width DOUBLE
        coffee,3
        beer,2
        i,4
        and,1
        with,1
        friend,1
        having,1
        like,3
        am,1
        my,1
        edgedef>node1 VARCHAR,node2 VARCHAR, weight DOUBLE
        coffee,beer,1
        i,beer,2
        and,beer,1
        with,friend,1
        coffee,with,1
        i,and,1
        having,friend,1
        like,beer,2
        am,friend,1
        i,am,1
        i,coffee,3
        i,with,1
        am,having,1
        i,having,1
        coffee,and,1
        like,coffee,2
  • 13. The idea: How to represent the co-occurrences graphically?
      A two-step approach:
        1. Save as a GDF file (the format seems easy to understand, so we could write a function for this in Python)
        2. Open the GDF file in Gephi for visualization and/or network analysis
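Step 1 could be sketched as follows. This is a minimal sketch, not the course's own script: it assumes the GDF layout shown on the earlier slide (a `nodedef>` header with node lines, then an `edgedef>` header with edge lines). The function name `write_gdf` and the file name `cooccurrences.gdf` are hypothetical, and words containing commas would need quoting, which is ignored here.

```python
from collections import Counter
from itertools import combinations


def write_gdf(tweets, filename):
    """Write word frequencies (nodes) and co-occurrence counts (edges) to a GDF file."""
    nodes = Counter()
    edges = Counter()
    for tweet in tweets:
        words = tweet.split()
        nodes.update(words)
        # Sorting the unique words gives every pair one fixed orientation,
        # so (a, b) and (b, a) always end up under the same key.
        for a, b in combinations(sorted(set(words)), 2):
            edges[(a, b)] += 1

    with open(filename, "w", encoding="utf-8") as fo:
        fo.write("nodedef>name VARCHAR,width DOUBLE\n")
        for word, freq in nodes.items():
            fo.write("{},{}\n".format(word, freq))
        fo.write("edgedef>node1 VARCHAR,node2 VARCHAR,weight DOUBLE\n")
        for (a, b), weight in edges.items():
            fo.write("{},{},{}\n".format(a, b, weight))
    return nodes, edges


tweets = ["i am having coffee with my friend", "i like coffee",
          "i like coffee and beer", "beer i like"]
nodes, edges = write_gdf(tweets, "cooccurrences.gdf")
```

Sorting the unique words of each tweet replaces the `(b, a) in cooc` check from the counting slide: every pair is stored in alphabetical order, so it cannot be counted under two different keys.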
  • 14. The idea: Gephi
        • Install (NOT in the VM) from https://gephi.org
        • In case of problems on macOS, see what I wrote about Gephi here: http://www.damiantrilling.net/setting-up-my-new-macbook/
        • I made a screencast on how to visualize the GDF file in Gephi: https://streamingmedia.uva.nl/asset/detail/t2KWKVZtQWZIe2Cj8qXcW5KF
        • Further: see the materials I mailed to you
  • 15. A real-life example
      Trilling, D. (2015). Two different debates? Investigating the relationship between a political debate on TV and simultaneous comments on Twitter. Social Science Computer Review, 33, 259–276. doi: 10.1177/0894439314537886
  • 16. A real-life example: Commenting on the TV debate on Twitter
      The viewers
        • Commenting on television programs on social networks has become a regular pattern of behavior (Courtois & d'Heer, 2012)
        • User comments have been shown to reflect the structure of the debate (Shamma, Churchill, & Kennedy, 2010; Shamma, Kennedy, & Churchill, 2009)
        • Topic and speaker effects are more influential than, e.g., rhetorical skills (Nagel, Maurer, & Reinemann, 2012; De Mooy & Maier, 2014)
  • 17. A real-life example: Research Questions
      To what extent are the statements politicians make during a TV debate reflected in online live discussions of the debate?
        RQ1 Which topics are emphasized by the candidates?
        RQ2 Which topics are emphasized by the Twitter users?
        RQ3 With which topics are the two candidates associated on Twitter?
  • 18. A real-life example: Method
  • 20. A real-life example: Method
      The data
        • debate transcript
        • tweets containing #tvduell
        • N = 120,557 tweets by N = 24,796 users
        • 22-9-2013, 20.30-22.00
      The analysis
        • Series of self-written Python scripts:
            1. preprocessing (stemming, stopword removal)
            2. word counts
            3. word log likelihood (corpus comparison)
        • Stata: regression analysis
  • 21. [Figure: time series; y-axis from 0 to 8000, x-axis from −60 to 150, with "start" and "end" markers]
  • 22. A real-life example: Relationship between words on TV and on Twitter
      [Figure: scatter plot of ln(word on Twitter + 1), 0 to 10, against ln(word on TV + 1), 0 to 3]
  • 23. A real-life example: Word frequency TV ⇒ word frequency Twitter
                               Model 1            Model 2            Model 3
      Dependent variable       ln(Twitter +1)     ln(Twitter +1)     ln(Twitter +1)
                                                  together w/ M.     together w/ S.
      ln(TV M. +1)   b (SE)    1.59 (.052) ***    1.54 (.041) ***    .77 (.037) ***
                     beta      .21                .26                .14
      ln(TV S. +1)   b (SE)    1.29 (.051) ***    .88 (.041) ***     1.25 (.037) ***
                     beta      .17                .15                .24
      intercept      b (SE)    1.64 (.008) ***    .87 (.007) ***     .60 (.006) ***
      R2                       .100               .115               .100
      b M. & S. differ?        F(1, 21408) =      F(1, 21408) =      F(1, 21408) =
                               12.29, p < .001    96.69, p < .001    63.38, p < .001
      M = Merkel; S = Steinbrück
  • 24. A real-life example: Most distinctive words on TV
      LL       word                      Frequency Merkel   Frequency Steinbrück
      27,73    merkel                     0                 20
      19,41    arbeitsplatz [job]        14                  0
      15,25    steinbruck                11                  0
       9,70    koalition [coalition]      7                  0
       9,70    international              7                  0
       9,70    gemeinsam [together]       7                  0
       8,55    griechenland [Greece]     10                  1
       8,32    investi [investment]       6                  0
       6,93    uberzeug [belief]          5                  0
       6,93    okonom [economic]          0                  5
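The LL column is a log-likelihood score for corpus comparison. The sketch below uses a common two-term form, LL = 2 · Σ O · ln(O / E), where O is a word's observed frequency in one corpus and E the frequency expected if the word were equally likely in both. The function name and the corpus sizes are hypothetical; the slides do not state the sizes actually used. Treating both speakers' transcripts as equally long happens to reproduce the values 27,73 and 8,55 from the table.

```python
from math import log


def log_likelihood(freq_a, freq_b, total_a, total_b):
    """Log-likelihood score for how strongly a word's frequency differs between corpus A and B."""
    combined = freq_a + freq_b
    # Expected frequencies if the word were equally likely in both corpora
    expected_a = total_a * combined / (total_a + total_b)
    expected_b = total_b * combined / (total_a + total_b)
    ll = 0.0
    for observed, expected in ((freq_a, expected_a), (freq_b, expected_b)):
        if observed > 0:  # a zero count contributes nothing to the sum
            ll += observed * log(observed / expected)
    return 2 * ll


# Assumption: both corpora have the same (hypothetical) size of 1000 words.
print(round(log_likelihood(0, 20, 1000, 1000), 2))   # "merkel": 27.73
print(round(log_likelihood(10, 1, 1000, 1000), 2))   # "griechenland": 8.55
```

With unequal corpus sizes, the expected frequencies shift towards the larger corpus, so the same raw counts yield a different score.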
  • 25. A real-life example: Most distinctive words on Twitter
      LL          word                         Frequency Merkel   Frequency Steinbrück
      32443,39    merkel                       29672                  0
      30751,65    steinbrueck                      0              17780
       1507,08    kett [necklace]               1628                 34
       1241,14    vertrau [trust]               1240                 12
        863,84    fdp [a coalition partner]      985                 29
        775,93    nsa                           1809                298
        626,49    wikipedia                       40                502
        574,65    twittert [tweets]               40                469
        544,87    koalition [coalition]          864                 77
        517,99    gold                           669                 34
  • 26. A real-life example: Putting the pieces together
      Merkel
        • necklace
        • trust (sarcastic)
        • nsa affair
        • coalition partners
      Steinbrück
        • suggestion to look sth. up on Wikipedia
        • tweets from his account during the debate
  • 27.
  • 28. Other (non-network-based, statistical) co-occurrence based methods
  • 30. Enter unsupervised machine learning (something you already did in your Bachelor – no kidding.)
  • 32. Some terminology
      Supervised machine learning
      You have a dataset with both predictors and outcome (independent and dependent variables): a labeled dataset.
      Think of regression: You measured x1, x2, x3 and you want to predict y, which you also measured.
  • 34. Some terminology
      Unsupervised machine learning
      You have no labels. (You did not measure y.)
  • 35. Some terminology
      Unsupervised machine learning
      You have no labels.
      Again, you already know some techniques to find out how x1, x2, ..., xi co-occur from other courses:
        • Principal Component Analysis (PCA)
        • Cluster analysis
        • ...
  • 37. PCA: Principal Component Analysis? How does that fit in here?
      In fact, PCA is used everywhere, even in image compression.
  • 38. PCA: Principal Component Analysis? How does that fit in here?
      PCA in ACA
        • Find out which words co-occur (inductive frame analysis)
        • Basically, transform each document into a vector of word frequencies and do a PCA
  • 40. PCA: A so-called term-document matrix
               w1, w2, w3, w4, w5, w6 ...
        text1,  2,  0,  0,  1,  2,  3 ...
        text2,  0,  0,  1,  2,  3,  4 ...
        text3,  9,  0,  1,  1,  0,  0 ...
        ...
      These can be simple counts, but also more advanced metrics, like tf-idf scores (where a word's frequency is weighted down the more documents it occurs in), cosine distances, etc.
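As an illustration of such a matrix, the sketch below builds one from three made-up texts, with one row per text and one column per word as on the slide, and then derives tf-idf scores. The example texts and all variable names are hypothetical, and tf-idf exists in several variants; this one multiplies the raw count by ln(number of documents / number of documents containing the word).

```python
from collections import Counter
from math import log

# Hypothetical example texts
texts = ["tax deficit higher than expected tax",
         "germany won the world cup",
         "tax on world cup tickets"]

docs = [Counter(t.split()) for t in texts]
vocabulary = sorted(set(w for d in docs for w in d))

# Matrix of raw counts: one row per text, one column per word
counts = [[d[w] for w in vocabulary] for d in docs]

# Document frequency: in how many texts does each word occur?
df = {w: sum(1 for d in docs if w in d) for w in vocabulary}
n_docs = len(docs)

# tf-idf: the raw count, weighted down for words that occur in many texts
tfidf = [[d[w] * log(n_docs / df[w]) for w in vocabulary] for d in docs]

for name, row in zip(["text1", "text2", "text3"], counts):
    print(name, row)
```

A word that occurs in every text gets a tf-idf score of zero under this variant, because ln(1) = 0.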
  • 41. PCA: implications and problems
        • given a term-document matrix, easy to do with any tool
        • probably extremely skewed distributions
        • some problematic assumptions: does the goal of PCA, to find a solution in which one word loads on one component, match real life, where a word can belong to several topics or frames?
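A minimal sketch of a PCA on such a matrix, assuming NumPy is available. The first three rows are the counts from the matrix slide and the fourth is made up for illustration. The sketch computes an unrotated solution via a singular value decomposition; statistics packages often apply a rotation on top, which is not done here.

```python
import numpy as np

# Rows = texts, columns = words w1..w6 (fourth row is hypothetical)
X = np.array([[2, 0, 0, 1, 2, 3],
              [0, 0, 1, 2, 3, 4],
              [9, 0, 1, 1, 0, 0],
              [8, 1, 0, 0, 1, 0]], dtype=float)

# Center each word column, then decompose
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

explained = S**2 / np.sum(S**2)   # share of variance per component
loadings = Vt                     # rows = components, columns = words
scores = Xc @ Vt.T                # texts projected onto the components

print(np.round(explained, 3))
print(np.round(loadings[0], 2))   # how each word relates to the first component
```

Each word gets a value on every component, which is where the slide's concern comes in: interpreting the result as "one word, one component" is a simplification.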
  • 42. LDA: Enter topic modeling with Latent Dirichlet Allocation (LDA)
  • 43. LDA: LDA, what's that?
      No mathematical details here, but the general idea:
        • There are k topics, T1 ... Tk
        • Each document Di consists of a mixture of these topics, e.g. 80% T1, 15% T2, 0% T3, ..., 5% Tk
        • On the next level, each topic consists of a specific probability distribution of words
        • Thus, based on the frequencies of words in Di, one can infer its distribution of topics
        • Note that LDA (like PCA) is a Bag-of-Words (BOW) approach
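The two levels described above can be illustrated by running the assumed process forwards: pick a topic according to the document's mixture, then pick a word according to that topic's distribution. This is only a sketch of the idea, not an implementation of LDA, which works in the opposite direction and infers topics and mixtures from observed words. All topics, words, and probabilities below are made up.

```python
import random

random.seed(42)

# Hypothetical topics: each is a probability distribution over words
topics = {
    "T1": {"tax": 0.5, "deficit": 0.3, "bank": 0.2},
    "T2": {"goal": 0.6, "cup": 0.3, "team": 0.1},
}
# Hypothetical document mixture: 80% topic T1, 20% topic T2
doc_mixture = {"T1": 0.8, "T2": 0.2}


def generate_document(mixture, topics, length):
    """Draw a bag of words: first a topic per word, then a word from that topic."""
    words = []
    for _ in range(length):
        topic = random.choices(list(mixture), weights=list(mixture.values()))[0]
        dist = topics[topic]
        word = random.choices(list(dist), weights=list(dist.values()))[0]
        words.append(word)
    return words


doc = generate_document(doc_mixture, topics, 1000)
share_t1 = sum(1 for w in doc if w in topics["T1"]) / len(doc)
print(share_t1)  # close to the 0.8 set in doc_mixture
```

Because the two hypothetical topics share no words, the share of T1 words in the generated document directly reflects the mixture; with overlapping vocabularies, recovering the mixture requires actual inference.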
  • 44. LDA: Doing an LDA in Python
      You can use gensim (Řehůřek & Sojka, 2010) for this.
        sudo pip3 install gensim
      Furthermore, let us assume you have a list of lists of words (!) called texts:
        articles = ['The tax deficit is higher than expected. This said xxx ...', 'Germany won the World Cup. After a']
        texts = [art.split() for art in articles]
      which looks like this:
        [['The', 'tax', 'deficit', 'is', 'higher', 'than', 'expected.', 'This', 'said', 'xxx', '...'], ['Germany', 'won', 'the', 'World', 'Cup.', 'After', 'a']]
      Řehůřek, R., & Sojka, P. (2010). Software framework for topic modelling with large corpora. Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, pp. 45–50. Valletta, Malta: ELRA.
  • 45.
        from gensim import corpora, models

        NTOPICS = 100
        LDAOUTPUTFILE = "topicscores.tsv"

        # Create a BOW representation of the texts
        id2word = corpora.Dictionary(texts)
        mm = [id2word.doc2bow(text) for text in texts]

        # Train the LDA model
        lda = models.ldamodel.LdaModel(corpus=mm, id2word=id2word, num_topics=NTOPICS, alpha="auto")

        # Print the topics
        for top in lda.print_topics(num_topics=NTOPICS, num_words=5):
            print("\n", top)

        print("\nFor further analysis, a dataset with the topic score for each document is saved to", LDAOUTPUTFILE)

        # inference() returns a tuple; its first element holds one row of topic scores per document
        scoresperdoc = lda.inference(mm)

        # Write one tab-separated line of topic scores per document
        with open(LDAOUTPUTFILE, "w", encoding="utf-8") as fo:
            for row in scoresperdoc[0]:
                fo.write("\t".join(["{:0.3f}".format(score) for score in row]))
                fo.write("\n")
  • 46. LDA: Output: Topics (below) & topic scores (next slide)
        0.069*fusie + 0.058*brussel + 0.045*europesecommissie + 0.036*europese + 0.023*overname
        0.109*bank + 0.066*britse + 0.041*regering + 0.035*financien + 0.033*minister
        0.114*nederlandse + 0.106*nederland + 0.070*bedrijven + 0.042*rusland + 0.038*russische
        0.093*nederlandsespoorwegen + 0.074*den + 0.036*jaar + 0.029*onderzoek + 0.027*raad
        0.099*banen + 0.045*jaar + 0.045*productie + 0.036*ton + 0.029*aantal
        0.041*grote + 0.038*bedrijven + 0.027*ondernemers + 0.023*goed + 0.015*jaar
        0.108*werknemers + 0.037*jongeren + 0.035*werkgevers + 0.029*jaar + 0.025*werk
        0.171*bank + 0.122* + 0.041*klanten + 0.035*verzekeraar + 0.028*euro
        0.162*banken + 0.055*bank + 0.039*centrale + 0.027*leningen + 0.024*financiele
        0.052*post + 0.042*media + 0.038*nieuwe + 0.034*netwerk + 0.025*personeel
        ...
  • 47.
  • 48. Next meetings
  • 49. Next meetings
      Wednesday, 11–5: Lab session
        Conduct an analysis based on word co-occurrences (Chapter 8 and/or 9.2). Install Gephi in advance!
      No meeting on Monday (Pentecost)
      Wednesday, 18–5: Supervised machine learning