Word co-occurrences Some suggestions on where to look further Next meetings
Big Data and Automated Content Analysis
Week 7 – Monday
»Word co-occurrences, Gephi, and some suggestions«
Damian Trilling
d.c.trilling@uva.nl
@damian0604
www.damiantrilling.net
Department of Communication Science
University of Amsterdam
11 May 2015
Big Data and Automated Content Analysis Damian Trilling
Today
1 Integrating word counts and network analysis: Word
co-occurrences
The idea
A real-life example
2 Some suggestions on where to look further
Useful packages
Some more tips
3 Next meetings, & final project
Integrating word counts and network analysis:
Word co-occurrences
The idea
Simple word count
We already know this.
from collections import Counter

tekst = "this is a test where many test words occur several times this is because it is a test yes indeed it is"
# Split on whitespace and count how often each word occurs
c = Counter(tekst.split())
print("The top 5 are: ")
for woord, aantal in c.most_common(5):
    print(aantal, woord)
The idea
Simple word count
The output:
The top 5 are:
4 is
3 test
2 this
2 a
2 it
(In Python 3.7 and later, words with equal counts appear in order of first occurrence.)
The idea
What if we could. . .
. . . count the frequency of combinations of words?
As in: Which words typically occur together in the same tweet (or paragraph, or sentence, . . . )?
The idea
We can — with the combinations() function
>>> from itertools import combinations
>>> words = "Hoi this is a test test test a test it is".split()
>>> print(list(combinations(words, 2)))
[('Hoi', 'this'), ('Hoi', 'is'), ('Hoi', 'a'), ('Hoi', 'test'), ('Hoi', 'test'), ('Hoi', 'test'), ('Hoi', 'a'), ('Hoi', 'test'), ('Hoi', 'it'), ('Hoi', 'is'),
 ('this', 'is'), ('this', 'a'), ('this', 'test'), ('this', 'test'), ('this', 'test'), ('this', 'a'), ('this', 'test'), ('this', 'it'), ('this', 'is'),
 ('is', 'a'), ('is', 'test'), ('is', 'test'), ('is', 'test'), ('is', 'a'), ('is', 'test'), ('is', 'it'), ('is', 'is'),
 ('a', 'test'), ('a', 'test'), ('a', 'test'), ('a', 'a'), ('a', 'test'), ('a', 'it'), ('a', 'is'),
 ('test', 'test'), ('test', 'test'), ('test', 'a'), ('test', 'test'), ('test', 'it'), ('test', 'is'),
 ('test', 'test'), ('test', 'a'), ('test', 'test'), ('test', 'it'), ('test', 'is'),
 ('test', 'a'), ('test', 'test'), ('test', 'it'), ('test', 'is'),
 ('a', 'test'), ('a', 'it'), ('a', 'is'),
 ('test', 'it'), ('test', 'is'),
 ('it', 'is')]
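As a quick check on the size of this list (a sketch; the variable names are chosen for illustration only): n words yield n(n-1)/2 pairs, so the 11 words above give 55 pairs. Many of these are repeats or pair a word with itself. Keeping each unordered pair of two different words only once leaves 15 pairs for the 6 distinct words.

```python
from itertools import combinations
from math import comb  # available from Python 3.8

words = "Hoi this is a test test test a test it is".split()
pairs = list(combinations(words, 2))
# Number of pairs, computed twice: by counting and by the formula n(n-1)/2
print(len(pairs), comb(len(words), 2))

# Keep each unordered pair of two different words only once
unique_pairs = {tuple(sorted(p)) for p in pairs if p[0] != p[1]}
print(len(unique_pairs))
```

Because the number of pairs grows quadratically with text length, short units such as tweets or sentences keep the co-occurrence list manageable.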
The idea
Count co-occurrences
from collections import defaultdict
from itertools import combinations

tweets = ["i am having coffee with my friend", "i like coffee", "i like coffee and beer", "beer i like"]
cooc = defaultdict(int)

for tweet in tweets:
    words = tweet.split()
    # set() makes sure that identical pairs are counted only once per tweet
    for a, b in set(combinations(words, 2)):
        # treat (b, a) and (a, b) as the same pair
        if (b, a) in cooc:
            a, b = b, a
        # skip pairs of a word with itself
        if a != b:
            cooc[(a, b)] += 1

# print the pairs, most frequent first
for combi in sorted(cooc, key=cooc.get, reverse=True):
    print(cooc[combi], "\t", combi)
The idea
Count co-occurrences
The output:
3    ('i', 'coffee')
3    ('i', 'like')
2    ('i', 'beer')
2    ('like', 'beer')
2    ('like', 'coffee')
1    ('coffee', 'beer')
1    ('and', 'beer')
...
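A compact variant of the same idea, as a sketch rather than the code used in class: sorting the unique words of each tweet before pairing them puts every pair in one fixed (alphabetical) order, so the check for (b,a) is not needed. The function name count_cooccurrences is made up for this illustration. Note that the pairs come out alphabetically ordered, so ('i', 'coffee') from the output above appears as ('coffee', 'i'), and each pair is counted at most once per tweet.

```python
from collections import Counter
from itertools import combinations


def count_cooccurrences(texts):
    """Count in how many texts each unordered word pair occurs (hypothetical helper)."""
    cooc = Counter()
    for text in texts:
        # set() removes repeated words, sorted() fixes the order within each pair
        unique_words = sorted(set(text.split()))
        cooc.update(combinations(unique_words, 2))
    return cooc


tweets = ["i am having coffee with my friend", "i like coffee",
          "i like coffee and beer", "beer i like"]
cooc = count_cooccurrences(tweets)
for pair, n in cooc.most_common(5):
    print(n, "\t", pair)
```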
The idea
From a list of co-occurrences to a network
Let’s conceptualize each word as a node and each co-occurrence as an edge
• node weight = word frequency
• edge weight = number of co-occurrences
A GDF file offers all of this and looks like this:
nodedef>name VARCHAR, width DOUBLE
coffee,3
beer,2
i,4
and,1
with,1
friend,1
having,1
like,3
am,1
my,1
edgedef>node1 VARCHAR,node2 VARCHAR, weight DOUBLE
coffee,beer,1
i,beer,2
and,beer,1
with,friend,1
coffee,with,1
i,and,1
having,friend,1
like,beer,2
am,friend,1
i,am,1
i,coffee,3
i,with,1
am,having,1
i,having,1
coffee,and,1
like,coffee,2
The idea
How to represent the co-occurrences graphically?
A two-step approach
1 Save as a GDF file (the format seems easy to understand, so we could write a function for this in Python)
2 Open the GDF file in Gephi for visualization and/or network analysis
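Step 1 could be sketched as follows. This is only an illustration: the function name write_gdf, the file name, and the three example tweets are made up, and the sketch assumes that the words themselves contain no commas (which would break the comma-separated layout). The two header lines are copied from the GDF example above.

```python
import os
import tempfile
from collections import Counter
from itertools import combinations


def write_gdf(path, word_counts, cooc):
    """Write node and edge tables in the GDF layout shown above (hypothetical helper)."""
    with open(path, "w", encoding="utf-8") as f:
        # node section: one line per word with its frequency as node weight
        f.write("nodedef>name VARCHAR, width DOUBLE\n")
        for word, n in word_counts.items():
            f.write("{},{}\n".format(word, n))
        # edge section: one line per word pair with its co-occurrence count
        f.write("edgedef>node1 VARCHAR,node2 VARCHAR, weight DOUBLE\n")
        for (a, b), n in cooc.items():
            f.write("{},{},{}\n".format(a, b, n))


tweets = ["i like coffee", "i like coffee and beer", "beer i like"]
word_counts = Counter()
cooc = Counter()
for tweet in tweets:
    words = sorted(set(tweet.split()))
    word_counts.update(words)              # in how many tweets each word occurs
    cooc.update(combinations(words, 2))    # in how many tweets each pair occurs

# ASSUMPTION: writing to the system's temp directory, just for this example
path = os.path.join(tempfile.gettempdir(), "cooc_example.gdf")
write_gdf(path, word_counts, cooc)
```

The resulting file can then be opened in Gephi as described in step 2.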
The idea
Gephi
• Install (NOT in the VM) from https://gephi.org
• If you run into problems on macOS, see what I wrote about Gephi here: http://www.damiantrilling.net/setting-up-my-new-macbook/
• I made a screencast on how to visualize the GDF file in Gephi: https://streamingmedia.uva.nl/asset/detail/t2KWKVZtQWZIe2Cj8qXcW5KF
• Further: see the materials I mailed to you
A real-life example
A real-life example
Trilling, D. (2014). Two different debates? Investigating the
relationship between a political debate on TV and simultaneous
comments on Twitter. Social Science Computer Review, Advance
online publication. doi: 10.1177/0894439314537886
A real-life example
Commenting on the TV debate on Twitter
The debating politicians
• issues largely set by the interviewers
• but candidates actively try to highlight the issues (⇒ agenda
setting) and aspects of the issues (⇒ framing).
A real-life example
Commenting on the TV debate on Twitter
The viewers
• Commenting on television programs on social networks has become a regular pattern of behavior (Courtois & d’Heer, 2012)
• User comments have been shown to reflect the structure of the debate (Shamma, Churchill, & Kennedy, 2010; Shamma, Kennedy, & Churchill, 2009)
• Topic and speaker effects are more influential than, e.g., rhetorical skills (Nagel, Maurer, & Reinemann, 2012; De Mooy & Maier, 2014)
A real-life example
Research Questions
To what extent are the statements politicians make during a TV debate reflected in online live discussions of the debate?
RQ1 Which topics are emphasized by the candidates?
RQ2 Which topics are emphasized by the Twitter users?
RQ3 With which topics are the two candidates associated
on Twitter?
A real-life example
Method
The data
• debate transcript
• tweets containing #tvduell
• N = 120,557 tweets by N = 24,796 users
• 22-9-2013, 20.30-22.00
The analysis
• Series of self-written Python scripts:
1 preprocessing (stemming, stopword removal)
2 word counts
3 word log likelihood (corpus comparison)
• Stata: regression analysis
[Figure: plot with y-axis ticks from 0 to 8000 and x-axis ticks from −60 to 150; the labels "start" and "end" are marked on the x-axis]
A real-life example
Relationship between words on TV and on Twitter
[Figure: scatterplot of ln(word on Twitter + 1) on the y-axis (0 to 10) against ln(word on TV + 1) on the x-axis (0 to 3)]
A real-life example
Word frequency TV ⇒ word frequency Twitter
                 Model 1            Model 2            Model 3
                 ln(Twitter + 1)    ln(Twitter + 1)    ln(Twitter + 1)
                                    together w/ M.     together w/ S.
                 b (SE)             b (SE)             b (SE)
                 beta               beta               beta
ln(TV M. + 1)    1.59 (.052) ***    1.54 (.041) ***    .77 (.037) ***
                 .21                .26                .14
ln(TV S. + 1)    1.29 (.051) ***    .88 (.041) ***     1.25 (.037) ***
                 .17                .15                .24
intercept        1.64 (.008) ***    .87 (.007) ***     .60 (.006) ***
R2               .100               .115               .100
b M. & S.        F(1, 21408) =      F(1, 21408) =      F(1, 21408) =
differ?          12.29, p < .001    96.69, p < .001    63.38, p < .001
M = Merkel; S = Steinbrück
A real-life example
Most distinctive words on TV
LL       word                       Frequency Merkel    Frequency Steinbrück
27.73    merkel                     0                   20
19.41    arbeitsplatz [job]         14                  0
15.25    steinbruck                 11                  0
9.70     koalition [coalition]      7                   0
9.70     international              7                   0
9.70     gemeinsam [together]       7                   0
8.55     griechenland [Greece]      10                  1
8.32     investi [investment]       6                   0
6.93     uberzeug [belief]          5                   0
6.93     okonom [economic]          0                   5
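How an LL value of this kind can be computed is sketched below, using a log-likelihood (G2) statistic that is common in corpus comparison. The slides state neither the exact formula nor the corpus sizes, so both are assumptions here: the function name log_likelihood is made up, and the two corpora are assumed to be equally large (1000 is a placeholder). Under that assumption, the sketch returns the values listed in the TV table for merkel (27.73) and griechenland (8.55).

```python
from math import log


def log_likelihood(freq1, freq2, size1, size2):
    """Log-likelihood (G2) of a word's frequencies in two corpora (sketch)."""
    total = freq1 + freq2
    # Expected frequencies if the word were equally likely in both corpora
    expected1 = size1 * total / (size1 + size2)
    expected2 = size2 * total / (size1 + size2)
    ll = 0.0
    for observed, expected in ((freq1, expected1), (freq2, expected2)):
        if observed > 0:  # a frequency of 0 contributes nothing to the sum
            ll += observed * log(observed / expected)
    return 2 * ll


# ASSUMPTION: both corpora are equally large; 1000 is a placeholder size
ll_merkel = round(log_likelihood(0, 20, 1000, 1000), 2)
ll_griechenland = round(log_likelihood(10, 1, 1000, 1000), 2)
print(ll_merkel, ll_griechenland)
```

The larger the value, the more strongly the word's frequency differs between the two corpora; the statistic itself does not say in which corpus the word is overrepresented, which is why the table lists both frequencies.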
A real-life example
Most distinctive words on Twitter
LL          word                         Frequency Merkel    Frequency Steinbrück
32443.39    merkel                       29672               0
30751.65    steinbrueck                  0                   17780
1507.08     kett [necklace]              1628                34
1241.14     vertrau [trust]              1240                12
863.84      fdp [a coalition partner]    985                 29
775.93      nsa                          1809                298
626.49      wikipedia                    40                  502
574.65      twittert [tweets]            40                  469
544.87      koalition [coalition]        864                 77
517.99      gold                         669                 34
A real-life example
Putting the pieces together
Merkel
• necklace
• trust (sarcastic)
• NSA affair
• coalition partners
Steinbrück
• suggestion to look something up on Wikipedia
• tweets from his account during the debate
Useful packages
Some suggestions on where to look further
Useful packages
Further analysis
Ways to further analyze the data
• Write the data in a specific format to link to a specialized external program (GDF example)
• Export to CSV files and analyze using R, Stata, SPSS, Excel, . . .
• Do it in Python, using . . .
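The CSV route could look like the following minimal sketch, which uses Python's built-in csv module. The file name, the column names, and the example sentence are placeholders chosen for this illustration.

```python
import csv
import os
import tempfile
from collections import Counter

word_counts = Counter("i like coffee and i like beer".split())

# ASSUMPTION: writing to the system's temp directory, just for this example
path = os.path.join(tempfile.gettempdir(), "wordcounts.csv")
with open(path, "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["word", "frequency"])        # header row
    writer.writerows(word_counts.most_common())   # one (word, count) row per word
```

A file written this way has one row per word and can be imported as a regular data set by the programs listed above.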
Useful packages
Packages for statistics and graphics
Already installed with Anaconda:
• numpy
• scipy
• pandas
• matplotlib
We won’t cover these packages in detail, but you are very much
encouraged to have a look at these packages yourself if you feel
they are useful.
Useful packages
numpy
>>> import numpy as np
>>> x = [1,2,3,4,3,2]
>>> y = [2,2,4,3,4,2]
>>> np.mean(x)
2.5
>>> np.std(x)
0.9574271077563381
>>> np.corrcoef(x,y)
array([[ 1.        ,  0.67883359],
       [ 0.67883359,  1.        ]])
Useful packages
pandas
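import pandas as pd
# Note: pandas.stats.api (including ols) was removed in pandas 0.20, so this
# example needs an older pandas version; statsmodels offers OLS regression instead.
from pandas.stats.api import ols
df = pd.DataFrame({"A": [10,20,30,40,50], "B": [20, 30, 10, 40, 50], "C": [32, 234, 23, 23, 42523]})
result = ols(y=df['A'], x=df[['B','C']])
print(result)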
prints a regression table like you would expect from any statistics
program:
Useful packages
-------------------------Summary of Regression Analysis-------------------------
Formula: Y ~ <B> + <C> + <intercept>
Number of Observations: 5
Number of Degrees of Freedom: 3
R-squared: 0.5789
Adj R-squared: 0.1577
Rmse: 14.5108
F-stat (2, 2): 1.3746, p-value: 0.4211
Degrees of Freedom: model 2, resid 2
-----------------------Summary of Estimated Coefficients------------------------
Variable        Coef    Std Err    t-stat    p-value     CI 2.5%    CI 97.5%
--------------------------------------------------------------------------------
B             0.4012     0.6497      0.62     0.5999     -0.8723      1.6746
C             0.0004     0.0005      0.65     0.5826     -0.0007      0.0014
intercept    14.9525    17.7643      0.84     0.4886    -19.8655     49.7705
---------------------------------End of Summary---------------------------------
... but you can get much more, like a list of predicted values
(result.y_predict), . . .
Useful packages
matplotlib
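import matplotlib.pyplot as plt
x = [1,2,3,4,3,2]
y = [2,2,4,3,4,2]
plt.hist(x)      # histogram of x
plt.plot(x,y)    # line plot of y against x, drawn into the same figure
plt.show()       # display the figure (needed outside interactive consoles)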
Some more tips
Some tips
• Make use of IPython features in Spyder (tab completion, object inspector)
• Try things out in the IPython console (think of RStudio or Stata!)
• Watch this video on “Python for data analysis” with pandas: https://vimeo.com/59324550
Final project Next meetings
Final project
On 29–5, you have to hand in your final project
• Details and rules: ⇒ course manual
• Similar to take-home exam
• But: Much more advanced, and now, the result counts as well
• And: Be creative! You can use code from class, but you need
to extend it
• Start working on it!
Next meeting
Wednesday, 13–5
Lab session, focus on INDIVIDUAL PROJECTS! Prepare!
(No common exercise)
Big Data and Automated Content Analysis Damian Trilling

More Related Content

Viewers also liked (11)

BD-ACA Week8a
BD-ACA Week8aBD-ACA Week8a
BD-ACA Week8a
 
BD-ACA week4a
BD-ACA week4aBD-ACA week4a
BD-ACA week4a
 
MAD skills for analysis and big data Machine Learning
MAD skills for analysis and big data Machine LearningMAD skills for analysis and big data Machine Learning
MAD skills for analysis and big data Machine Learning
 
Media diets in an age of apps and social media: Dealing with a third layer of...
Media diets in an age of apps and social media: Dealing with a third layer of...Media diets in an age of apps and social media: Dealing with a third layer of...
Media diets in an age of apps and social media: Dealing with a third layer of...
 
01 c++ Intro.ppt
01 c++ Intro.ppt01 c++ Intro.ppt
01 c++ Intro.ppt
 
Presentation on storage devices
Presentation on storage devicesPresentation on storage devices
Presentation on storage devices
 
C++ ppt
C++ pptC++ ppt
C++ ppt
 
The Great State of Design with CSS Grid Layout and Friends
The Great State of Design with CSS Grid Layout and FriendsThe Great State of Design with CSS Grid Layout and Friends
The Great State of Design with CSS Grid Layout and Friends
 
2015 Upload Campaigns Calendar - SlideShare
2015 Upload Campaigns Calendar - SlideShare2015 Upload Campaigns Calendar - SlideShare
2015 Upload Campaigns Calendar - SlideShare
 
What to Upload to SlideShare
What to Upload to SlideShareWhat to Upload to SlideShare
What to Upload to SlideShare
 
Getting Started With SlideShare
Getting Started With SlideShareGetting Started With SlideShare
Getting Started With SlideShare
 

Similar to BD-ACA week7a

AI and Python: Developing a Conversational Interface using Python
AI and Python: Developing a Conversational Interface using PythonAI and Python: Developing a Conversational Interface using Python
AI and Python: Developing a Conversational Interface using Pythonamyiris
 
Final Presentation
Final PresentationFinal Presentation
Final PresentationLove Tyagi
 
Similarity at scale
Similarity at scaleSimilarity at scale
Similarity at scaleKen Krugler
 
Technology Choices for Enterprise Integration
Technology Choices for Enterprise IntegrationTechnology Choices for Enterprise Integration
Technology Choices for Enterprise IntegrationSandeep Purao
 
Applied Data Science: Building a Beer Recommender | Data Science MD - Oct 2014
Applied Data Science: Building a Beer Recommender | Data Science MD - Oct 2014Applied Data Science: Building a Beer Recommender | Data Science MD - Oct 2014
Applied Data Science: Building a Beer Recommender | Data Science MD - Oct 2014Austin Ogilvie
 
2020 02 29 TechDay Conf - Getting started with Machine Learning.Net
2020 02 29 TechDay Conf - Getting started with Machine Learning.Net2020 02 29 TechDay Conf - Getting started with Machine Learning.Net
2020 02 29 TechDay Conf - Getting started with Machine Learning.NetBruno Capuano
 

Similar to BD-ACA week7a (20)

BDACA1516s2 - Lecture3
BDACA1516s2 - Lecture3BDACA1516s2 - Lecture3
BDACA1516s2 - Lecture3
 
BD-ACA week3a
BD-ACA week3aBD-ACA week3a
BD-ACA week3a
 
BDACA1617s2 - Lecture3
BDACA1617s2 - Lecture3BDACA1617s2 - Lecture3
BDACA1617s2 - Lecture3
 
BDACA1617s2 - Lecture7
BDACA1617s2 - Lecture7BDACA1617s2 - Lecture7
BDACA1617s2 - Lecture7
 
BDACA - Lecture7
BDACA - Lecture7BDACA - Lecture7
BDACA - Lecture7
 
Similarity at Scale
Similarity at ScaleSimilarity at Scale
Similarity at Scale
 
BDACA - Lecture3
BDACA - Lecture3BDACA - Lecture3
BDACA - Lecture3
 
BDACA1617s2 - Lecture 2
BDACA1617s2 - Lecture 2BDACA1617s2 - Lecture 2
BDACA1617s2 - Lecture 2
 
AI and Python: Developing a Conversational Interface using Python
AI and Python: Developing a Conversational Interface using PythonAI and Python: Developing a Conversational Interface using Python
AI and Python: Developing a Conversational Interface using Python
 
Final Presentation
Final PresentationFinal Presentation
Final Presentation
 
Similarity at scale
Similarity at scaleSimilarity at scale
Similarity at scale
 
BDACA1516s2 - Lecture2
BDACA1516s2 - Lecture2BDACA1516s2 - Lecture2
BDACA1516s2 - Lecture2
 
BDACA1617s2 - Lecture4
BDACA1617s2 - Lecture4BDACA1617s2 - Lecture4
BDACA1617s2 - Lecture4
 
BDACA1516s2 - Lecture5
BDACA1516s2 - Lecture5BDACA1516s2 - Lecture5
BDACA1516s2 - Lecture5
 
BDACA - Lecture4
BDACA - Lecture4BDACA - Lecture4
BDACA - Lecture4
 
BDACA1516s2 - Lecture4
 BDACA1516s2 - Lecture4 BDACA1516s2 - Lecture4
BDACA1516s2 - Lecture4
 
Technology Choices for Enterprise Integration
Technology Choices for Enterprise IntegrationTechnology Choices for Enterprise Integration
Technology Choices for Enterprise Integration
 
Applied Data Science: Building a Beer Recommender | Data Science MD - Oct 2014
Applied Data Science: Building a Beer Recommender | Data Science MD - Oct 2014Applied Data Science: Building a Beer Recommender | Data Science MD - Oct 2014
Applied Data Science: Building a Beer Recommender | Data Science MD - Oct 2014
 
BD-ACA week5
BD-ACA week5BD-ACA week5
BD-ACA week5
 
2020 02 29 TechDay Conf - Getting started with Machine Learning.Net
2020 02 29 TechDay Conf - Getting started with Machine Learning.Net2020 02 29 TechDay Conf - Getting started with Machine Learning.Net
2020 02 29 TechDay Conf - Getting started with Machine Learning.Net
 

More from Department of Communication Science, University of Amsterdam

More from Department of Communication Science, University of Amsterdam (16)

BDACA - Lecture8
BDACA - Lecture8BDACA - Lecture8
BDACA - Lecture8
 
BDACA - Lecture6
BDACA - Lecture6BDACA - Lecture6
BDACA - Lecture6
 
BDACA - Lecture5
BDACA - Lecture5BDACA - Lecture5
BDACA - Lecture5
 
BDACA - Lecture2
BDACA - Lecture2BDACA - Lecture2
BDACA - Lecture2
 
BDACA - Tutorial1
BDACA - Tutorial1BDACA - Tutorial1
BDACA - Tutorial1
 
BDACA - Lecture1
BDACA - Lecture1BDACA - Lecture1
BDACA - Lecture1
 
BDACA1617s2 - Lecture6
BDACA1617s2 - Lecture6BDACA1617s2 - Lecture6
BDACA1617s2 - Lecture6
 
BDACA1617s2 - Lecture5
BDACA1617s2 - Lecture5BDACA1617s2 - Lecture5
BDACA1617s2 - Lecture5
 
BDACA1617s2 - Tutorial 1
BDACA1617s2 - Tutorial 1BDACA1617s2 - Tutorial 1
BDACA1617s2 - Tutorial 1
 
BDACA1617s2 - Lecture 1
BDACA1617s2 - Lecture 1BDACA1617s2 - Lecture 1
BDACA1617s2 - Lecture 1
 
Conceptualizing and measuring news exposure as network of users and news items
Conceptualizing and measuring news exposure as network of users and news itemsConceptualizing and measuring news exposure as network of users and news items
Conceptualizing and measuring news exposure as network of users and news items
 
Data Science: Case "Political Communication 2/2"
Data Science: Case "Political Communication 2/2"Data Science: Case "Political Communication 2/2"
Data Science: Case "Political Communication 2/2"
 
Data Science: Case "Political Communication 1/2"
Data Science: Case "Political Communication 1/2"Data Science: Case "Political Communication 1/2"
Data Science: Case "Political Communication 1/2"
 
BDACA1516s2 - Lecture8
BDACA1516s2 - Lecture8BDACA1516s2 - Lecture8
BDACA1516s2 - Lecture8
 
BDACA1516s2 - Lecture6
BDACA1516s2 - Lecture6BDACA1516s2 - Lecture6
BDACA1516s2 - Lecture6
 
BDACA1516s2 - Lecture1
BDACA1516s2 - Lecture1BDACA1516s2 - Lecture1
BDACA1516s2 - Lecture1
 

Recently uploaded

Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Krashi Coaching
 
Introduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher EducationIntroduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher Educationpboyjonauth
 
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxSOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxiammrhaywood
 
Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991
Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991
Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991RKavithamani
 
Presiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha electionsPresiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha electionsanshu789521
 
How to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptxHow to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptxmanuelaromero2013
 
APM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across SectorsAPM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across SectorsAssociation for Project Management
 
Solving Puzzles Benefits Everyone (English).pptx
Solving Puzzles Benefits Everyone (English).pptxSolving Puzzles Benefits Everyone (English).pptx
Solving Puzzles Benefits Everyone (English).pptxOH TEIK BIN
 
Employee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxEmployee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxNirmalaLoungPoorunde1
 
Grant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingGrant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingTechSoup
 
Hybridoma Technology ( Production , Purification , and Application )
Hybridoma Technology  ( Production , Purification , and Application  ) Hybridoma Technology  ( Production , Purification , and Application  )
Hybridoma Technology ( Production , Purification , and Application ) Sakshi Ghasle
 
A Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformA Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformChameera Dedduwage
 
Paris 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityParis 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityGeoBlogs
 
URLs and Routing in the Odoo 17 Website App
URLs and Routing in the Odoo 17 Website AppURLs and Routing in the Odoo 17 Website App
URLs and Routing in the Odoo 17 Website AppCeline George
 
Alper Gobel In Media Res Media Component
Alper Gobel In Media Res Media ComponentAlper Gobel In Media Res Media Component
Alper Gobel In Media Res Media ComponentInMediaRes1
 
Organic Name Reactions for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions  for the students and aspirants of Chemistry12th.pptxOrganic Name Reactions  for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions for the students and aspirants of Chemistry12th.pptxVS Mahajan Coaching Centre
 
Measures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and ModeMeasures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and ModeThiyagu K
 
Contemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptx
Contemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptxContemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptx
Contemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptxRoyAbrique
 
Science 7 - LAND and SEA BREEZE and its Characteristics
Science 7 - LAND and SEA BREEZE and its CharacteristicsScience 7 - LAND and SEA BREEZE and its Characteristics
Science 7 - LAND and SEA BREEZE and its CharacteristicsKarinaGenton
 

Recently uploaded (20)

Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
 
Introduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher EducationIntroduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher Education
 
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxSOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
 
Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991
Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991
Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991
 
Presiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha electionsPresiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha elections
 
How to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptxHow to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptx
 
Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝
 
APM Welcome, APM North West Network Conference, Synergies Across Sectors
BD-ACA week7a

  • 1. Word co-occurrences Some suggestions on where to look further Next meetings Big Data and Automated Content Analysis Week 7 – Monday »Word co-occurrences, Gephi – and some suggestions« Damian Trilling d.c.trilling@uva.nl @damian0604 www.damiantrilling.net Afdeling Communicatiewetenschap Universiteit van Amsterdam 11 May 2015 Big Data and Automated Content Analysis Damian Trilling
  • 2. Today
    1 Integrating word counts and network analysis: Word co-occurrences
        The idea
        A real-life example
    2 Some suggestions on where to look further
        Useful packages
        Some more tips
    3 Next meetings & final project
  • 3. Integrating word counts and network analysis: Word co-occurrences
  • 4. The idea: Simple word count
    We already know this.

        from collections import Counter

        tekst = "this is a test where many test words occur several times this is because it is a test yes indeed it is"
        c = Counter(tekst.split())   # maps each word to its number of occurrences
        print("The top 5 are: ")
        for woord, aantal in c.most_common(5):
            print(aantal, woord)
  • 5. Simple word count: the output

        The top 5 are:
        4 is
        3 test
        2 a
        2 this
        2 it

    (Words with equal counts may come out in a different order, depending on the Python version.)
  • 7. What if we could ...
    ... count the frequency of combinations of words?
    As in: Which words typically occur together in the same tweet (or paragraph, or sentence, ...)?
  • 8. We can, with the combinations() function

        >>> from itertools import combinations
        >>> words = "Hoi this is a test test test a test it is".split()
        >>> print([e for e in combinations(words, 2)])
        [('Hoi', 'this'), ('Hoi', 'is'), ('Hoi', 'a'), ('Hoi', 'test'), ('Hoi', 'test'),
         ('Hoi', 'test'), ('Hoi', 'a'), ('Hoi', 'test'), ('Hoi', 'it'), ('Hoi', 'is'),
         ('this', 'is'), ('this', 'a'), ('this', 'test'), ('this', 'test'), ('this', 'test'),
         ('this', 'a'), ('this', 'test'), ('this', 'it'), ('this', 'is'),
         ('is', 'a'), ('is', 'test'), ('is', 'test'), ('is', 'test'), ('is', 'a'),
         ('is', 'test'), ('is', 'it'), ('is', 'is'),
         ('a', 'test'), ('a', 'test'), ('a', 'test'), ('a', 'a'), ('a', 'test'),
         ('a', 'it'), ('a', 'is'),
         ('test', 'test'), ('test', 'test'), ('test', 'a'), ('test', 'test'),
         ('test', 'it'), ('test', 'is'),
         ('test', 'test'), ('test', 'a'), ('test', 'test'), ('test', 'it'), ('test', 'is'),
         ('test', 'a'), ('test', 'test'), ('test', 'it'), ('test', 'is'),
         ('a', 'test'), ('a', 'it'), ('a', 'is'),
         ('test', 'it'), ('test', 'is'),
         ('it', 'is')]
  • 9. Count co-occurrences

        from collections import defaultdict
        from itertools import combinations

        tweets = ["i am having coffee with my friend", "i like coffee",
                  "i like coffee and beer", "beer i like"]
        cooc = defaultdict(int)

        for tweet in tweets:
            words = tweet.split()
            # set() makes sure each ordered pair is counted at most once per tweet
            for a, b in set(combinations(words, 2)):
                # treat (b, a) and (a, b) as the same pair
                if (b, a) in cooc:
                    a, b = b, a
                if a != b:
                    cooc[(a, b)] += 1

        # print the pairs, most frequent first
        for combi in sorted(cooc, key=cooc.get, reverse=True):
            print(cooc[combi], "\t", combi)
  • 10. Count co-occurrences: the output

        3    ('i', 'coffee')
        3    ('i', 'like')
        2    ('i', 'beer')
        2    ('like', 'beer')
        2    ('like', 'coffee')
        1    ('coffee', 'beer')
        1    ('and', 'beer')
        ...
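A variant of the counting step above, as a minimal sketch rather than the code used in class: if the words of each tweet are de-duplicated and sorted first, ('beer', 'i') and ('i', 'beer') automatically become the same key, so the explicit swap check is not needed. The function name count_cooccurrences is hypothetical. Note that the pairs come out in alphabetical order, so a pair printed as ('i', 'coffee') on the slide appears here as ('coffee', 'i').

```python
from collections import Counter
from itertools import combinations


def count_cooccurrences(documents):
    """Count in how many documents each unordered word pair occurs."""
    cooc = Counter()
    for document in documents:
        # set() drops repeated words, sorted() fixes the order within each pair
        unique_words = sorted(set(document.split()))
        cooc.update(combinations(unique_words, 2))
    return cooc


tweets = ["i am having coffee with my friend", "i like coffee",
          "i like coffee and beer", "beer i like"]
cooc = count_cooccurrences(tweets)
for pair, n in cooc.most_common(5):
    print(n, "\t", pair)
```

Because every word occurs at most once in the de-duplicated list, a pair can never be counted twice within the same tweet.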
  • 11. From a list of co-occurrences to a network
    Let's conceptualize each word as a node and each co-occurrence as an edge:
      • node weight = word frequency
      • edge weight = number of co-occurrences
    A GDF file offers all of this and looks like this:
  • 12.
        nodedef>name VARCHAR, width DOUBLE
        coffee,3
        beer,2
        i,4
        and,1
        with,1
        friend,1
        having,1
        like,3
        am,1
        my,1
        edgedef>node1 VARCHAR,node2 VARCHAR, weight DOUBLE
        coffee,beer,1
        i,beer,2
        and,beer,1
        with,friend,1
        coffee,with,1
        i,and,1
        having,friend,1
        like,beer,2
        am,friend,1
        i,am,1
        i,coffee,3
        i,with,1
        am,having,1
        i,having,1
        coffee,and,1
        like,coffee,2
  • 13. How to represent the co-occurrences graphically?
    A two-step approach:
      1 Save as a GDF file (the format seems easy to understand, so we could write a function for this in Python)
      2 Open the GDF file in Gephi for visualization and/or network analysis
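Step 1 could be sketched as follows. This is one possible way to write such a function, not the version used in the course: the names gdf_lines and write_gdf are hypothetical, the two header lines are copied from the GDF example shown earlier, and the sketch assumes that no word contains a comma.

```python
from collections import Counter
from itertools import combinations


def gdf_lines(documents):
    """Build the lines of a GDF file: word nodes first, then co-occurrence edges."""
    word_counts = Counter()   # node weight: how often each word occurs
    cooc = Counter()          # edge weight: in how many documents a pair occurs
    for document in documents:
        words = document.split()
        word_counts.update(words)
        # sorted(set(...)) yields each unordered pair once per document
        cooc.update(combinations(sorted(set(words)), 2))

    lines = ["nodedef>name VARCHAR, width DOUBLE"]
    lines += ["{},{}".format(word, n) for word, n in word_counts.items()]
    lines.append("edgedef>node1 VARCHAR,node2 VARCHAR, weight DOUBLE")
    lines += ["{},{},{}".format(a, b, n) for (a, b), n in cooc.items()]
    return lines


def write_gdf(documents, filename):
    """Save the GDF lines to a file that can then be opened in Gephi."""
    with open(filename, mode="w", encoding="utf-8") as f:
        f.write("\n".join(gdf_lines(documents)) + "\n")


tweets = ["i am having coffee with my friend", "i like coffee",
          "i like coffee and beer", "beer i like"]
lines = gdf_lines(tweets)
print("\n".join(lines))
```

Calling write_gdf(tweets, "tweets.gdf") would save the same lines to disk; the file name is only an example.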
  • 14. Gephi
      • Install (NOT in the VM) from https://gephi.org
      • If you run into problems on macOS, see what I wrote about Gephi here: http://www.damiantrilling.net/setting-up-my-new-macbook/
      • I made a screencast on how to visualize the GDF file in Gephi: https://streamingmedia.uva.nl/asset/detail/t2KWKVZtQWZIe2Cj8qXcW5KF
      • Further: see the materials I mailed to you
  • 15. A real-life example
    Trilling, D. (2014). Two different debates? Investigating the relationship between a political debate on TV and simultaneous comments on Twitter. Social Science Computer Review, Advance online publication. doi: 10.1177/0894439314537886
  • 16. Commenting on the TV debate on Twitter
    The debating politicians
      • issues largely set by the interviewers
      • but candidates actively try to highlight the issues (⇒ agenda setting) and aspects of the issues (⇒ framing)
  • 17. Commenting on the TV debate on Twitter
    The viewers
      • Commenting on television programs on social networks has become a regular pattern of behavior (Courtois & d’Heer, 2012)
      • User comments have been shown to reflect the structure of the debate (Shamma, Churchill, & Kennedy, 2010; Shamma, Kennedy, & Churchill, 2009)
      • Topic and speaker effects are more influential than, e.g., rhetorical skills (Nagel, Maurer, & Reinemann, 2012; De Mooy & Maier, 2014)
  • 18. Research Questions
    To what extent are the statements politicians make during a TV debate reflected in online live discussions of the debate?
      RQ1 Which topics are emphasized by the candidates?
      RQ2 Which topics are emphasized by the Twitter users?
      RQ3 With which topics are the two candidates associated on Twitter?
  • 19. Method
  • 21. Method
    The data
      • debate transcript
      • tweets containing #tvduell
      • N = 120,557 tweets by N = 24,796 users
      • 22-9-2013, 20.30-22.00
    The analysis
      • Series of self-written Python scripts:
          1 preprocessing (stemming, stopword removal)
          2 word counts
          3 word log likelihood (corpus comparison)
      • Stata: regression analysis
  • 22. [Figure: only the axis ticks survive; vertical axis from 0 to 8000, horizontal axis from −60 to 150, with "start" and "end" marked]
  • 23. Relationship between words on TV and on Twitter
    [Figure: ln(word on Twitter + 1), axis from 0 to 10, plotted against ln(word on TV + 1), axis from 0 to 3]
  • 24. Word frequency TV ⇒ word frequency Twitter

                         Model 1           Model 2           Model 3
                         ln(Twitter + 1)   ln(Twitter + 1)   ln(Twitter + 1)
                                           together w/ M.    together w/ S.
                         b (SE)            b (SE)            b (SE)
                         beta              beta              beta
        ln(TV M. + 1)    1.59 (.052) ***   1.54 (.041) ***    .77 (.037) ***
                          .21               .26               .14
        ln(TV S. + 1)    1.29 (.051) ***    .88 (.041) ***   1.25 (.037) ***
                          .17               .15               .24
        intercept        1.64 (.008) ***    .87 (.007) ***    .60 (.006) ***
        R2                .100              .115              .100
        b M. & S.        F(1, 21408) =     F(1, 21408) =     F(1, 21408) =
        differ?          12.29, p < .001   96.69, p < .001   63.38, p < .001

    M = Merkel; S = Steinbrück
  • 25. Most distinctive words on TV

        LL      word                     Frequency Merkel   Frequency Steinbrück
        27.73   merkel                    0                 20
        19.41   arbeitsplatz [job]       14                  0
        15.25   steinbruck               11                  0
         9.70   koalition [coalition]     7                  0
         9.70   international             7                  0
         9.70   gemeinsam [together]      7                  0
         8.55   griechenland [Greece]    10                  1
         8.32   investi [investment]      6                  0
         6.93   uberzeug [belief]         5                  0
         6.93   okonom [economic]         0                  5
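The LL column can be illustrated with the two-term log-likelihood statistic that is widely used for corpus comparison. This is a sketch under assumptions: the slides do not state which variant was used, the function name log_likelihood is hypothetical, and the corpus sizes below are made up. With two equally large corpora, the formula does reproduce the values 27.73 (0 vs. 20 occurrences) and 8.55 (10 vs. 1 occurrences) from the table above.

```python
from math import log


def log_likelihood(freq1, freq2, size1, size2):
    """Log-likelihood of one word's frequencies in two corpora of given sizes."""
    total = freq1 + freq2
    # expected frequencies if the word were equally common in both corpora
    expected1 = size1 * total / (size1 + size2)
    expected2 = size2 * total / (size1 + size2)
    ll = 0.0
    for observed, expected in ((freq1, expected1), (freq2, expected2)):
        if observed > 0:   # a term with zero observations contributes nothing
            ll += observed * log(observed / expected)
    return 2 * ll


# Hypothetical corpus sizes: both speakers' texts are assumed to be equally long
print(round(log_likelihood(0, 20, 10000, 10000), 2))   # 27.73
print(round(log_likelihood(10, 1, 10000, 10000), 2))   # 8.55
```

The larger the value, the more strongly the word's frequency differs between the two corpora; the statistic itself does not say in which corpus the word is overrepresented, which is why the table also lists both frequencies.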
  • 26. Most distinctive words on Twitter

        LL         word                        Frequency Merkel   Frequency Steinbrück
        32443.39   merkel                      29672                  0
        30751.65   steinbrueck                     0              17780
         1507.08   kett [necklace]              1628                 34
         1241.14   vertrau [trust]              1240                 12
          863.84   fdp [a coalition partner]     985                 29
          775.93   nsa                          1809                298
          626.49   wikipedia                      40                502
          574.65   twittert [tweets]              40                469
          544.87   koalition [coalition]         864                 77
          517.99   gold                          669                 34
  • 27. Putting the pieces together
    Merkel
      • necklace
      • trust (sarcastic)
      • NSA affair
      • coalition partners
    Steinbrück
      • suggestion to look something up on Wikipedia
      • tweets from his account during the debate
  • 28.
  • 29. Some suggestions on where to look further
  • 33. Useful packages: Further analysis
    Ways to further analyze the data
      • Write the data in a specific format that a specialized external program can read (GDF example)
      • Export to CSV files and analyze using R, Stata, SPSS, Excel, ...
      • Do it in Python, using ...
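The CSV route could look roughly like this, using the word counts from the beginning of the lecture as data. The function name write_counts_csv and the column names are hypothetical, and the in-memory buffer only stands in for a real file so that the result can be printed directly.

```python
import csv
import io
from collections import Counter


def write_counts_csv(counts, fileobj):
    """Write (word, frequency) rows, most frequent first, to an open file."""
    writer = csv.writer(fileobj)
    writer.writerow(["word", "frequency"])
    writer.writerows(counts.most_common())


tekst = "this is a test where many test words occur several times"
c = Counter(tekst.split())

buffer = io.StringIO()   # stands in for a real file in this sketch
write_counts_csv(c, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

To write an actual file, the buffer would be replaced by something like open("wordcounts.csv", "w", newline="", encoding="utf-8"); the resulting file can then be imported into R, Stata, SPSS, or Excel.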
  • 34. Packages for statistics and graphics
    Already installed with Anaconda:
      • numpy
      • scipy
      • pandas
      • matplotlib
    We won’t cover these packages in detail, but you are very much encouraged to have a look at them yourself if you feel they are useful.
  • 35. numpy

        >>> import numpy as np
        >>> x = [1,2,3,4,3,2]
        >>> y = [2,2,4,3,4,2]
        >>> np.mean(x)
        2.5
        >>> np.std(x)
        0.9574271077563381
        >>> np.corrcoef(x,y)
        array([[ 1.        ,  0.67883359],
               [ 0.67883359,  1.        ]])
  • 36. pandas

        import pandas as pd
        from pandas.stats.api import ols   # only available in older pandas versions

        df = pd.DataFrame({"A": [10,20,30,40,50], "B": [20, 30, 10, 40, 50], "C": [32, 234, 23, 23, 42523]})
        result = ols(y=df['A'], x=df[['B','C']])
        print(result)

    prints a regression table like you would expect from any statistics program:
  • 37.
        -------------------------Summary of Regression Analysis-------------------------

        Formula: Y ~ <B> + <C> + <intercept>

        Number of Observations:         5
        Number of Degrees of Freedom:   3

        R-squared:         0.5789
        Adj R-squared:     0.1577

        Rmse:             14.5108

        F-stat (2, 2):     1.3746, p-value:     0.4211

        Degrees of Freedom: model 2, resid 2

        -----------------------Summary of Estimated Coefficients------------------------
              Variable       Coef    Std Err     t-stat    p-value    CI 2.5%   CI 97.5%
        --------------------------------------------------------------------------------
                     B     0.4012     0.6497       0.62     0.5999    -0.8723     1.6746
                     C     0.0004     0.0005       0.65     0.5826    -0.0007     0.0014
             intercept    14.9525    17.7643       0.84     0.4886   -19.8655    49.7705
        ---------------------------------End of Summary---------------------------------

    ... but you can get much more, like a list of predicted values (result.y_predict), ...
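Newer pandas releases no longer include pandas.stats.api, so the ols call shown above only runs on older installations. As a minimal sketch of an alternative, and not the approach taken in the lecture, the same coefficients can be estimated with numpy's least-squares solver. It returns only the point estimates; standard errors and p-values would require a dedicated statistics package.

```python
import numpy as np

A = np.array([10, 20, 30, 40, 50], dtype=float)
B = np.array([20, 30, 10, 40, 50], dtype=float)
C = np.array([32, 234, 23, 23, 42523], dtype=float)

# Design matrix: predictors B and C plus a column of ones for the intercept
X = np.column_stack([B, C, np.ones_like(A)])
coefs, residuals, rank, singular_values = np.linalg.lstsq(X, A, rcond=None)

for name, coef in zip(["B", "C", "intercept"], coefs):
    print(name, round(float(coef), 4))
```

The three estimates correspond to the Coef column of the regression table above.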
  • 38. matplotlib

        import matplotlib.pyplot as plt

        x = [1,2,3,4,3,2]
        y = [2,2,4,3,4,2]
        plt.hist(x)      # histogram of x
        plt.plot(x,y)    # line plot of y against x, drawn into the same axes
        plt.show()       # display the figure (needed outside interactive mode)
  • 39.
  • 40. Some more tips
      • Make use of IPython features in Spyder (tab completion, object inspector)
      • Try things out in the IPython console (think of RStudio or Stata!)
      • Watch this video on "Python for data analysis" with pandas: https://vimeo.com/59324550
  • 41.
  • 42. Next meetings
  • 43. Final project
    On 29 May, you have to hand in your final project
      • Details and rules: ⇒ course manual
      • Similar to take-home exam
      • But: Much more advanced, and now, the result counts as well
      • And: Be creative! You can use code from class, but you need to extend it
      • Start working on it!
  • 44. Next meeting
    Wednesday, 13 May
    Lab session, focus on INDIVIDUAL PROJECTS! Prepare! (No common exercise)