
Analyzing social media with Python and other tools (2/4)


  1. Hands-on Workshop: Big (Twitter) Data. Damian Trilling (d.c.trilling@uva.nl, @damian0604, www.damiantrilling.net), Afdeling Communicatiewetenschap, Universiteit van Amsterdam, 30 January 2014, 10.45. #bigdata
  2. In this session (2/4): 1 The data (recording tweets with yourTwapperkeeper; CSV files; other ways to collect tweets; not that different: Facebook posts) 2 The script (pseudo-code; Python code; the output) 3 Your turn 4 Questions?
  3. The data: Recording tweets with yourTwapperkeeper. http://datacollection.followthenews-uva.cloudlet.sara.nl
  4. yourTwapperkeeper
  5. yourTwapperkeeper: Storage. Continuously calls the Twitter API and saves all tweets containing specific hashtags to a MySQL database. You tell it once which data to collect, and then wait some months.
  6. yourTwapperkeeper
  7. yourTwapperkeeper: Retrieving the data. You could access the MySQL database directly, but yourTwapperkeeper has a nice interface that allows you to export the data to a format we can use for the analysis.
  8. The data: CSV files
  9. CSV files: the format of our choice. • All programs can read it • Even human-readable in a simple text editor • Plain text, with a comma (or a semicolon) denoting column breaks • No limits regarding the size
  10. A header line and two example records (each record is a single line in the file):

      text,to_user_id,from_user,id,from_user_id,iso_language_code,source,profile_image_url,geo_type,geo_coordinates_0,geo_coordinates_1,created_at,time
      :-) #Lectrr #wereldleiders #uitspraken #Wikileaks #klimaattop http://t.co/Udjpk48EIB,,henklbr,407085917011079169,118374840,nl,web,http://pbs.twimg.com/profile_images/378800000673845195/b47785b1595e6a1c63b93e463f3d0ccc_normal.jpeg,,0,0,Sun Dec 01 09:57:00 +0000 2013,1385891820
      Wat zijn de resulaten vd #klimaattop in #Warschau waard? @EP_Environment ontmoet voorzitter klimaattop @MarcinKorolec http://t.co/4Lmiaopf60,,Europarl_NL,406058792573730816,37623918,en,<a href="http://www.hootsuite.com" rel="nofollow">HootSuite</a>,http://pbs.twimg.com/profile_images/2943831271/b6631b23a86502fae808ca3efde23d0d_normal.png,,0,0,Thu Nov 28 13:55:35 +0000 2013,1385646935
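In Python 3, the standard-library csv module reads this format directly and handles unicode natively, so a helper like unicsv is no longer required; a minimal sketch with a made-up two-column sample:

```python
import csv
import io

# A tiny invented sample in the same comma-separated layout;
# quoting protects the comma inside the tweet text.
sample = 'text,from_user\n"Wat zijn de resultaten, vraag ik?",Europarl_NL\n'

rows = list(csv.reader(io.StringIO(sample)))
print(rows[0])     # the header: ['text', 'from_user']
print(rows[1][0])  # the tweet text, comma preserved
```

In a real script, io.StringIO would be replaced by open("mytweets.csv", encoding="utf-8").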
  11. The data: Other ways to collect tweets
  12. Other ways to collect tweets. Again, we want a CSV file... • If you want tweets per person: www.allmytweets.net • Up to six days backwards: www.scraperwiki.com • Buy them from a commercial vendor • TCAT (from the people at DMI/mediastudies) • For specific purposes, write your own Python script to access the Twitter API (if you want to, I can show you more about this tomorrow)
  13. The data: Not that different: Facebook posts
  14. Not that different: Facebook posts. Have a look at netvizz: • Gephi files for network analysis • ...and a tab-separated file (essentially the same as CSV) with the content
  15. Not that different: Facebook posts. Have a look at netvizz: • Gephi files for network analysis • ...and a tab-separated file (essentially the same as CSV) with the content. An alternative: Facepager • A tool to query different APIs (among others, Twitter and Facebook) and to store the result in a CSV table • http://www.ls1.ifkw.uni-muenchen.de/personen/wiss_ma/keyling_till/software.html
  16. The script: Pseudo-code
  17. Our task: identify all tweets that include a reference to Poland. Let's start with some pseudo-code!

      open csv-table
      for each line:
          append column 1 to a list of tweets
          append column 3 to a list of corresponding users
          look for searchstring in column 1
          append search result to a list of results
      save lists to a new csv-file
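The pseudo-code can be sketched in modern Python 3 with the standard-library csv and re modules; the in-memory rows and the output filename here are hypothetical stand-ins for the workshop data:

```python
import csv
import re

searchstring = re.compile(r'[Pp]olen|[Pp]ool|[Ww]arschau|[Ww]arszawa')

# Hypothetical rows standing in for the csv-table;
# column 1 is the tweet text, column 3 the user (indices 0 and 2).
rows = [
    ["Morgen naar Warschau!", "", "alice"],
    ["Just some other tweet", "", "bob"],
]

tweets, users, results = [], [], []
for row in rows:
    tweets.append(row[0])                              # column 1: tweet text
    users.append(row[2])                               # column 3: user
    results.append(len(searchstring.findall(row[0])))  # search result

# save lists to a new csv-file
with open("myoutput_sketch.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["tweet", "user", "matches"])
    writer.writerows(zip(tweets, users, results))
```

For the real dataset, the rows list would come from csv.reader over the input file rather than being typed in.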
  18. The script: Python code
  19. The complete script:

      #!/usr/bin/python
      from unicsv import CsvUnicodeReader
      from unicsv import CsvUnicodeWriter
      import re

      inputfilename="mytweets.csv"
      outputfilename="myoutput.csv"

      user_list=[]
      tweet_list=[]
      search_list=[]

      searchstring1 = re.compile(r'[Pp]olen|[Pp]ool|[Ww]arschau|[Ww]arszawa')

      print "Opening "+inputfilename
      reader=CsvUnicodeReader(open(inputfilename,"r"))
      for row in reader:
          tweet_list.append(row[0])
          user_list.append(row[2])
          matches1 = searchstring1.findall(row[0])
          matchcount1=0
          for word in matches1:
              matchcount1=matchcount1+1
          search_list.append(matchcount1)

      print "Constructing data matrix"
      outputdata=zip(tweet_list,user_list,search_list)
      headers=zip(["tweet"],["user"],["how often is Poland mentioned?"])
      print "Write data matrix to ",outputfilename
      writer=CsvUnicodeWriter(open(outputfilename,"wb"))
      writer.writerows(headers)
      writer.writerows(outputdata)
  20. The script, step by step:

      #!/usr/bin/python
      # We start with importing some modules:
      from unicsv import CsvUnicodeReader
      from unicsv import CsvUnicodeWriter
      import re

      # Let us define two variables that contain
      # the names of the files we want to use
      inputfilename="mytweets.csv"
      outputfilename="myoutput.csv"
  21.
      # We create some empty lists that we will use later on.
      # A list can contain several variables
      # and is denoted by square brackets.
      user_list=[]
      tweet_list=[]
      search_list=[]
  22.
      # What do we want to look for?
      searchstring1 = re.compile(r'[Pp]olen|[Pp]ool|[Ww]arschau|[Ww]arszawa')

      # Enough preparation, let the program begin!
      # We tell the user what is going on...
      print "Opening "+inputfilename

      # ... and call the module that reads the input file.
      reader=CsvUnicodeReader(open(inputfilename,"r"))
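The regular expression alternates four patterns, each starting with a character class so that both capitalized and lower-case spellings match; a quick sketch with an invented tweet:

```python
import re

searchstring1 = re.compile(r'[Pp]olen|[Pp]ool|[Ww]arschau|[Ww]arszawa')

tweet = "De klimaattop in Warschau: wat doet Polen?"  # invented example
print(searchstring1.findall(tweet))  # ['Warschau', 'Polen']
```

An alternative would be re.compile(r'polen|pool|warschau|warszawa', re.IGNORECASE), which would additionally match all-caps spellings such as "POLEN"; the character classes restrict matching to exactly these two capitalizations.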
  23.
      # Now we read the file line by line.
      # The indented block is repeated for each row
      # (thus, each tweet)
      for row in reader:
          # append data from the current row to our lists.
          # Note that we start counting with 0.
          tweet_list.append(row[0])
          user_list.append(row[2])

          # Let us count how often our searchstring is used
          # in this tweet
          matches1 = searchstring1.findall(row[0])
          matchcount1=0
          for word in matches1:
              matchcount1=matchcount1+1
          search_list.append(matchcount1)
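The counting loop on this slide is instructive, but len() gives the same number in one step; a small equivalence check with an invented string:

```python
import re

searchstring1 = re.compile(r'[Pp]olen|[Pp]ool|[Ww]arschau|[Ww]arszawa')
text = "Polen speelt in Warschau, Polen wint"  # invented example

# The loop as on the slide:
matches1 = searchstring1.findall(text)
matchcount1 = 0
for word in matches1:
    matchcount1 = matchcount1 + 1

# The shorter equivalent:
print(matchcount1, len(matches1))  # both are 3
```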
  24.
      # Time to put all the data in one container
      # and save it:
      print "Constructing data matrix"
      outputdata=zip(tweet_list,user_list,search_list)
      headers=zip(["tweet"],["user"],["how often is Poland mentioned?"])
      print "Write data matrix to ",outputfilename
      writer=CsvUnicodeWriter(open(outputfilename,"wb"))
      writer.writerows(headers)
      writer.writerows(outputdata)
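zip() pairs the i-th elements of the three lists into one row per tweet, and zipping three one-element lists yields exactly one header row. Note that in Python 3 zip() returns an iterator, so list() is needed to inspect it (the Python 2 on the slide returns a list directly); the data here is an invented example:

```python
tweet_list = ["tweet A", "tweet B"]  # invented example data
user_list = ["alice", "bob"]
search_list = [1, 0]

outputdata = list(zip(tweet_list, user_list, search_list))
headers = list(zip(["tweet"], ["user"], ["matches"]))

print(outputdata)  # [('tweet A', 'alice', 1), ('tweet B', 'bob', 0)]
print(headers)     # [('tweet', 'user', 'matches')]
```

csv.writer.writerows (and the CsvUnicodeWriter used in the deck) accepts either form, since it only iterates over the rows once.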
  25. The complete script again:

      #!/usr/bin/python
      from unicsv import CsvUnicodeReader
      from unicsv import CsvUnicodeWriter
      import re

      inputfilename="mytweets.csv"
      outputfilename="myoutput.csv"

      user_list=[]
      tweet_list=[]
      search_list=[]

      searchstring1 = re.compile(r'[Pp]olen|[Pp]ool|[Ww]arschau|[Ww]arszawa')

      print "Opening "+inputfilename
      reader=CsvUnicodeReader(open(inputfilename,"r"))
      for row in reader:
          tweet_list.append(row[0])
          user_list.append(row[2])
          matches1 = searchstring1.findall(row[0])
          matchcount1=0
          for word in matches1:
              matchcount1=matchcount1+1
          search_list.append(matchcount1)

      print "Constructing data matrix"
      outputdata=zip(tweet_list,user_list,search_list)
      headers=zip(["tweet"],["user"],["how often is Poland mentioned?"])
      print "Write data matrix to ",outputfilename
      writer=CsvUnicodeWriter(open(outputfilename,"wb"))
      writer.writerows(headers)
      writer.writerows(outputdata)
  26. The script: the output (myoutput.csv)
  27. The output file:

      tweet,user,how often is Poland mentioned?
      :-) #Lectrr #wereldleiders #uitspraken #Wikileaks #klimaattop http://t.co/Udjpk48EIB,henklbr,0
      Wat zijn de resulaten vd #klimaattop in #Warschau waard? @EP_Environment ontmoet voorzitter klimaattop @MarcinKorolec http://t.co/4Lmiaopf60,Europarl_NL,1
      RT @greenami1: De winnaars en verliezers van de lachwekkende #klimaattop in #Warschau (interview): http://t.co/DEYqnqXHdy #Misserfolg #Kli...,LarsMoratis,1
      De winnaars en verliezers van de lachwekkende #klimaattop in #Warschau (interview): http://t.co/DEYqnqXHdy #Misserfolg #Klimaschutz #FAZ,greenami1,1
  28. The output
  29. Try it yourself! We'll help you get started. Please go to http://beehub.nl/bigdata-cw/workshop and download some files. Save the Python files unicsv.py and myfirstscript.py, as well as the dataset mytweets.csv, in a new folder called workshop on your H-drive. When you are done, start Python (GUI) from the Windows Start Menu.
  30. Recap: 1 The data (recording tweets with yourTwapperkeeper; CSV files; other ways to collect tweets; not that different: Facebook posts) 2 The script (pseudo-code; Python code; the output) 3 Your turn 4 Questions?
  31. This afternoon: your own script
  32. Questions or remarks? Damian Trilling, d.c.trilling@uva.nl, @damian0604, www.damiantrilling.net
