Analyzing social media with Python and other tools (2/4)

  1. Hands-on Workshop: Big (Twitter) Data. Damian Trilling (d.c.trilling@uva.nl, @damian0604, www.damiantrilling.net), Department of Communication Science (Afdeling Communicatiewetenschap), Universiteit van Amsterdam. 30 January 2014, 10.45. #bigdata. Sections: The data, The script, Your turn, Questions?
  2. In this session (2/4):
     1 The data: Recording tweets with yourTwapperkeeper; CSV-files; Other ways to collect tweets; Not that different: Facebook posts
     2 The script: Pseudo-code; Python code; The output
     3 Your turn
     4 Questions?
  3. The data: Recording tweets with yourTwapperkeeper. http://datacollection.followthenews-uva.cloudlet.sara.nl
  4. Recording tweets with yourTwapperkeeper: yourTwapperkeeper
  5. Recording tweets with yourTwapperkeeper: Storage. yourTwapperkeeper continuously calls the Twitter API and saves all tweets containing specific hashtags to a MySQL database. You tell it once which data to collect, and then wait some months.
  6. Recording tweets with yourTwapperkeeper: yourTwapperkeeper
  7. Recording tweets with yourTwapperkeeper: Retrieving the data. You could access the MySQL database directly, but yourTwapperkeeper has a nice interface that lets you export the data to a format we can use for the analysis.
  8. The data: CSV-files
  9. CSV-files: the format of our choice
     • All programs can read it
     • Even human-readable in a simple text editor:
     • Plain text, with a comma (or a semicolon) denoting column breaks
     • No limits regarding the size
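To make the point about column breaks concrete, here is a minimal Python 3 sketch using the standard-library `csv` module. The two sample strings and their column names are made up for illustration; the only difference between them is the delimiter.

```python
import csv
import io

# Hypothetical sample data: the same two rows, once comma- and once semicolon-separated.
comma_text = "user,tweet\nalice,hello world\n"
semicolon_text = "user;tweet\nalice;hello world\n"


def read_rows(text, delimiter):
    """Parse CSV text into a list of rows (each row is a list of strings)."""
    return list(csv.reader(io.StringIO(text), delimiter=delimiter))


comma_rows = read_rows(comma_text, ",")
semicolon_rows = read_rows(semicolon_text, ";")

print(comma_rows)
# Both variants yield the same rows once the right delimiter is given.
print(comma_rows == semicolon_rows)
```

If a file looks like one long column after loading, the delimiter is the first thing worth checking.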
  10. CSV-files: a header line followed by two tweets, as exported:

         text,to_user_id,from_user,id,from_user_id,iso_language_code,source,profile_image_url,geo_type,geo_coordinates_0,geo_coordinates_1,created_at,time
         :-) #Lectrr #wereldleiders #uitspraken #Wikileaks #klimaattop http://t.co/Udjpk48EIB,,henklbr,407085917011079169,118374840,nl,web,http://pbs.twimg.com/profile_images/378800000673845195/b47785b1595e6a1c63b93e463f3d0ccc_normal.jpeg,,0,0,Sun Dec 01 09:57:00 +0000 2013,1385891820
         Wat zijn de resulaten vd #klimaattop in #Warschau waard? @EP_Environment ontmoet voorzitter klimaattop @MarcinKorolec http://t.co/4Lmiaopf60,,Europarl_NL,406058792573730816,37623918,en,<a href="http://www.hootsuite.com" rel="nofollow">HootSuite</a>,http://pbs.twimg.com/profile_images/2943831271/b6631b23a86502fae808ca3efde23d0d_normal.png,,0,0,Thu Nov 28 13:55:35 +0000 2013,1385646935
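The last two columns of the sample, `created_at` and `time`, appear to describe the same moment, once as text and once as seconds since 1 January 1970 (UTC). The following Python 3 sketch checks this for the second tweet. The format string is an assumption based on how the value looks on the slide, and parsing relies on English day and month abbreviations.

```python
from datetime import datetime

# Values copied from the last two columns of the second tweet on the slide.
created_at = "Thu Nov 28 13:55:35 +0000 2013"
time_column = 1385646935

# Assumed format; %a and %b expect English abbreviations (the default C locale).
parsed = datetime.strptime(created_at, "%a %b %d %H:%M:%S %z %Y")

print(parsed.isoformat())
# True if both columns encode the same moment.
print(int(parsed.timestamp()) == time_column)
```

For time-based analyses, the numeric `time` column is usually the easier one to sort and compare.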
  11. The data: Other ways to collect tweets
  12. Other ways to collect tweets. Again, we want a CSV file...
     • If you want tweets per person: www.allmytweets.net
     • Up to six days backwards: www.scraperwiki.com
     • Buy it from a commercial vendor
     • TCAT (from the guys at DMI/mediastudies)
     • For specific purposes, write your own Python script to access the Twitter API (if you want to, I can show you more about this tomorrow)
  13. The data: Not that different: Facebook posts
  14. Not that different: Facebook posts. Have a look at netvizz:
     • Gephi files for network analysis
     • ... and a tab-separated file (essentially the same as CSV) with the content
  15. Not that different: Facebook posts. Have a look at netvizz:
     • Gephi files for network analysis
     • ... and a tab-separated file (essentially the same as CSV) with the content
     An alternative: Facepager
     • Tool to query different APIs (among others Twitter and Facebook) and to store the result in a CSV table
     • http://www.ls1.ifkw.uni-muenchen.de/personen/wiss_ma/keyling_till/software.html
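A tab-separated file can be read with the same tools as a CSV file. The Python 3 sketch below uses `csv.DictReader` with a tab as delimiter. The column names `post_id` and `message` and the two posts are hypothetical; the real netvizz or Facepager columns may well differ.

```python
import csv
import io

# Hypothetical tab-separated content; the real export columns may differ.
tsv_text = "post_id\tmessage\n1\tHello, world\n2\tSecond post\n"

# DictReader uses the first line as column names.
reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
posts = list(reader)

for post in posts:
    print(post["post_id"], post["message"])
```

Note that the comma inside the first message does not split the field, because the tab is the only column separator here.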
  16. The script: Pseudo-code
  17. Pseudo-code. Our task: identify all tweets that include a reference to Poland. Let's start with some pseudo-code!

         open csv-table
         for each line:
             append column 1 to a list of tweets
             append column 3 to a list of corresponding users
             look for searchstring in column 1
             append search result to a list of results
         save lists to a new csv-file
  18. The script: Python code
  19. The complete script:

         #!/usr/bin/python
         from unicsv import CsvUnicodeReader
         from unicsv import CsvUnicodeWriter
         import re
         inputfilename="mytweets.csv"
         outputfilename="myoutput.csv"
         user_list=[]
         tweet_list=[]
         search_list=[]
         searchstring1 = re.compile(r'[Pp]olen|[Pp]ool|[Ww]arschau|[Ww]arszawa')
         print "Opening "+inputfilename
         reader=CsvUnicodeReader(open(inputfilename,"r"))
         for row in reader:
             tweet_list.append(row[0])
             user_list.append(row[2])
             matches1 = searchstring1.findall(row[0])
             matchcount1=0
             for word in matches1:
                 matchcount1=matchcount1+1
             search_list.append(matchcount1)
         print "Constructing data matrix"
         outputdata=zip(tweet_list,user_list,search_list)
         headers=zip(["tweet"],["user"],["how often is Poland mentioned?"])
         print "Write data matrix to ",outputfilename
         writer=CsvUnicodeWriter(open(outputfilename,"wb"))
         writer.writerows(headers)
         writer.writerows(outputdata)
  20. Python code:

         #!/usr/bin/python
         # We start with importing some modules:
         from unicsv import CsvUnicodeReader
         from unicsv import CsvUnicodeWriter
         import re

         # Let us define two variables that contain
         # the names of the files we want to use
         inputfilename="mytweets.csv"
         outputfilename="myoutput.csv"

  21. Python code:

         # We create some empty lists that we will use later on.
         # A list can contain several variables
         # and is denoted by square brackets.
         user_list=[]
         tweet_list=[]
         search_list=[]

  22. Python code:

         # What do we want to look for?
         searchstring1 = re.compile(r'[Pp]olen|[Pp]ool|[Ww]arschau|[Ww]arszawa')

         # Enough preparation, let the program begin!
         # We tell the user what is going on...
         print "Opening "+inputfilename

         # ... and call the module that reads the input file.
         reader=CsvUnicodeReader(open(inputfilename,"r"))
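To see what the search pattern actually finds, here is a small Python 3 sketch. The first tweet text comes from the sample data in the slides; the second string is made up to show that the pattern also matches inside longer words.

```python
import re

searchstring1 = re.compile(r"[Pp]olen|[Pp]ool|[Ww]arschau|[Ww]arszawa")

# Tweet text taken from the sample data shown in the slides.
tweet = "Wat zijn de resulaten vd #klimaattop in #Warschau waard?"
tweet_matches = searchstring1.findall(tweet)
print(tweet_matches)  # ['Warschau']

# Made-up example: "pool" is found inside "carpoolen" as well.
substring_matches = searchstring1.findall("carpoolen naar Polen")
print(substring_matches)  # ['pool', 'Polen']
```

If such partial hits are unwanted, one possible refinement (not part of the slides) is to wrap the alternatives in word boundaries, for example `r"\b(?:[Pp]olen|[Pp]ool)\b"`.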
  23. Python code:

         # Now we read the file line by line.
         # The indented block is repeated for each row
         # (thus, each tweet)
         for row in reader:
             # append data from the current row to our lists.
             # Note that we start counting with 0.
             tweet_list.append(row[0])
             user_list.append(row[2])

             # Let us count how often our searchstring is used
             # in this tweet
             matches1 = searchstring1.findall(row[0])
             matchcount1=0
             for word in matches1:
                 matchcount1=matchcount1+1
             search_list.append(matchcount1)
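The counting loop makes each step explicit, which is useful for learning. As a side note, since `findall` returns a list, the same number can also be obtained with `len`. A minimal Python 3 sketch, using a made-up example text:

```python
import re

searchstring1 = re.compile(r"[Pp]olen|[Pp]ool|[Ww]arschau|[Ww]arszawa")
text = "Van Warschau naar Warszawa"  # made-up example text

matches1 = searchstring1.findall(text)

# Counting as on the slide: one step per match.
matchcount1 = 0
for word in matches1:
    matchcount1 = matchcount1 + 1

# The same count in one call.
print(matchcount1, len(matches1))  # 2 2
```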
  24. Python code:

         # Time to put all the data in one container
         # and save it:

         print "Constructing data matrix"
         outputdata=zip(tweet_list,user_list,search_list)
         headers=zip(["tweet"],["user"],["how often is Poland mentioned?"])
         print "Write data matrix to ",outputfilename
         writer=CsvUnicodeWriter(open(outputfilename,"wb"))
         writer.writerows(headers)
         writer.writerows(outputdata)
  25. The complete script once more (identical to slide 19).
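The script on the slides is written for Python 2 and relies on the workshop's helper module `unicsv`. For readers on Python 3, the following is a rough sketch of the same idea using only the standard library. It is not the author's code: the function name `count_mentions` and the tiny demo file are made up, and it writes each row directly instead of collecting lists first. Like the slide script, it assumes that every line of the input file is a tweet.

```python
import csv
import os
import re
import tempfile

searchstring1 = re.compile(r"[Pp]olen|[Pp]ool|[Ww]arschau|[Ww]arszawa")


def count_mentions(inputfilename, outputfilename):
    """Copy tweet text and user to a new CSV and add a match count per tweet."""
    with open(inputfilename, newline="", encoding="utf-8") as infile, \
         open(outputfilename, "w", newline="", encoding="utf-8") as outfile:
        reader = csv.reader(infile)
        writer = csv.writer(outfile)
        writer.writerow(["tweet", "user", "how often is Poland mentioned?"])
        for row in reader:
            # Column 1 (index 0) holds the tweet text, column 3 (index 2) the user.
            count = len(searchstring1.findall(row[0]))
            writer.writerow([row[0], row[2], count])


# Demo with a tiny stand-in for mytweets.csv (two shortened, made-up rows).
workdir = tempfile.mkdtemp()
inputfilename = os.path.join(workdir, "mytweets.csv")
outputfilename = os.path.join(workdir, "myoutput.csv")
with open(inputfilename, "w", newline="", encoding="utf-8") as f:
    csv.writer(f).writerows([
        ["#klimaattop in #Warschau", "", "Europarl_NL"],
        [":-) #Lectrr #klimaattop", "", "henklbr"],
    ])

count_mentions(inputfilename, outputfilename)
with open(outputfilename, newline="", encoding="utf-8") as f:
    result = list(csv.reader(f))
print(result)
```

Opening the files with `newline=""` and an explicit encoding is what the `csv` module documentation recommends; it takes over the role that `unicsv` plays in the Python 2 version.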
  26. The script: myoutput.csv
  27. The output:

         tweet,user,how often is Poland mentioned?
         :-) #Lectrr #wereldleiders #uitspraken #Wikileaks #klimaattop http://t.co/Udjpk48EIB,henklbr,0
         Wat zijn de resulaten vd #klimaattop in #Warschau waard? @EP_Environment ontmoet voorzitter klimaattop @MarcinKorolec http://t.co/4Lmiaopf60,Europarl_NL,1
         RT @greenami1: De winnaars en verliezers van de lachwekkende #klimaattop in #Warschau (interview): http://t.co/DEYqnqXHdy #Misserfolg #Kli...,LarsMoratis,1
         De winnaars en verliezers van de lachwekkende #klimaattop in #Warschau (interview): http://t.co/DEYqnqXHdy #Misserfolg #Klimaschutz #FAZ,greenami1,1
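Once the output file exists, it can be analysed further. As an illustration, the Python 3 sketch below reads the first lines of the output shown on the slide (embedded as a string, so the sketch runs without the file) and lists the users whose tweets mention Poland at least once. The variable names are made up.

```python
import csv
import io

# The first three lines of myoutput.csv as shown on the slide.
output_text = (
    "tweet,user,how often is Poland mentioned?\n"
    ":-) #Lectrr #wereldleiders #uitspraken #Wikileaks #klimaattop "
    "http://t.co/Udjpk48EIB,henklbr,0\n"
    "Wat zijn de resulaten vd #klimaattop in #Warschau waard? @EP_Environment "
    "ontmoet voorzitter klimaattop @MarcinKorolec http://t.co/4Lmiaopf60,Europarl_NL,1\n"
)

rows = list(csv.DictReader(io.StringIO(output_text)))

# Keep the users with at least one match in the last column.
mentioning = [
    row["user"] for row in rows
    if int(row["how often is Poland mentioned?"]) > 0
]
print(len(rows), mentioning)  # 2 ['Europarl_NL']
```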
  28. The output
  29. Your turn: Try it yourself! We'll help you get started. Please go to http://beehub.nl/bigdata-cw/workshop and download some files. Save the Python files unicsv.py and myfirstscript.py, as well as the dataset mytweets.csv, in a new folder called workshop on your H-drive. When you are done, start Python (GUI) from the Windows Start Menu.
  30. Recap:
     1 The data: Recording tweets with yourTwapperkeeper; CSV-files; Other ways to collect tweets; Not that different: Facebook posts
     2 The script: Pseudo-code; Python code; The output
     3 Your turn
     4 Questions?
  31. This afternoon: Your own script
  32. Questions or comments? Damian Trilling, d.c.trilling@uva.nl, @damian0604, www.damiantrilling.net
