Analyzing social media with Python and other tools (1/4)
Transcript

  • 1. Good morning! Enjoy your coffee and install Putty and NotepadPlus via "Software Maintenance/Application Catalogue". And the Pattern-package (see my e-mail). Thanks.
  • 2. Big Data? What are we talking about? The process: collect, store, analyze Python Questions? Hands-on-Workshop Big (Twitter) Data Damian Trilling d.c.trilling@uva.nl @damian0604 www.damiantrilling.net Afdeling Communicatiewetenschap Universiteit van Amsterdam 30 January 2014 9.30 #bigdata Damian Trilling
  • 3. The next one and a half days: You’ll hear about • Collecting social media data via APIs, RSS and scraping (and the tools for it) • Technical infrastructure (via surfsara) • Python • Sentiment analysis • Automated coding • Frequencies and other statistics • Social network analysis with Gephi • ...
  • 4. In this session (1/4): 1 Big Data? What are we talking about? (Exploring the field; Some examples) 2 The process: collect, store, analyze (A scheme; Our implementation) 3 Python (What it is; When to use it; When not to use it) 4 Questions?
  • 5. What’s big data? What are we talking about?
  • 6. What are we talking about? Today, it’s a hands-on workshop, so let’s keep this important (!) discussion for later.
  • 7. What are we talking about? So, no definition, but some brief thoughts • Existing data (≠ experiments or surveys) • Too big to code manually • Too big to handle with normal tools • New research questions • Call to revisit the relationship between theory and empirical research
  • 9. What are we talking about? Today, . . . • we are not going to talk about REALLY BIG data, • but we will have some exercises on datasets a normal computer can handle Tomorrow, . . . • we will also learn about scaling up these techniques • SurfSARA provides infrastructure for this
  • 10. What are we talking about? Some sources • Social Network Sites • RSS-feeds • Databases • Scraping text from the web • ...
  • 11. It’s out there! You only have to collect it.
  • 13. But why should we care? We can answer new questions • Find needles in haystacks • Identify networks, co-word analysis, linguistic analysis, . . . • Verify our theories in larger datasets It makes sense • There are things that computers are simply better at than humans, e.g. in counting things • Having human coders look for words in texts is like calculating a regression analysis by hand
  • 14. Some examples
  • 17. A recent master thesis: The needle in the haystack. Imagine you want to analyze some very rare content. Normal sampling won’t work, that’s for sure.
  • 21. So you’d better collect everything first: Getting all news coverage from Dutch news sites. 1 Collect all articles from nine news sites during a period of two months, resulting in a database with 74,000 articles. 2 Filter articles containing specific keywords. 3 The resulting 292 articles were then manually coded. Pöll, B. (2013). Social media: new sources, new profession? A content analysis of the use of social media as a source for journalists in online news articles. Master Thesis, Universiteit van Amsterdam.
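Step 2 of this workflow (filtering a large collection down to the articles that contain specific keywords) could be sketched roughly as follows. This is only an illustration, not the script used in the thesis: the three sample articles, the keyword list and the helper `mentions_keyword` are all made up for the example.

```python
# Hypothetical mini-corpus; in the thesis this was a database of about 74,000 articles.
articles = [
    {"id": 1, "text": "Minister reageert op Twitter op de kritiek."},
    {"id": 2, "text": "Het weer blijft deze week wisselvallig."},
    {"id": 3, "text": "Video op Facebook leidt tot ophef."},
]

# Assumed keyword list; the slides do not say which keywords were used.
keywords = ["twitter", "facebook"]


def mentions_keyword(text, keywords):
    """Return True if any keyword occurs in the text (case-insensitive)."""
    lowered = text.lower()
    return any(keyword in lowered for keyword in keywords)


# Keep only the articles that mention at least one keyword.
selected = [a for a in articles if mentions_keyword(a["text"], keywords)]
print([a["id"] for a in selected])
```

Only the selected articles would then go to the human coders, which is what makes the needle-in-the-haystack design feasible.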
  • 23. It’s just one line of code!
    urls.txt:
    http://www.gmx.at/themen/wissen/mensch/108g5xi-baeuerlich-schiefe-zaehne
    http://www.gmx.at/themen/unterhaltung/klatsch-tratsch/408g740-fuermannbittet-um-verzeihung
    http://www.gmx.at/themen/nachrichten/aufruhr-arabien/268g70u-regierungwill-zuruecktreten
    http://www.gmx.at/themen/nachrichten/panorama/828g54y-neues-zur-klagegegen-republik
    http://www.gmx.at/themen/nachrichten/panorama/968g72s-millionstrafewegen-oelpest
    http://www.gmx.at/themen/unterhaltung/klatsch-tratsch/368g6yc-keinbabybauch-nur-fast-food
    ...
    wget command:
    wget -i urls.txt
  • 26. A recent bachelor thesis: Tone in tweets. Imagine you want to know something about someone’s behavior on Twitter. Or how a specific topic is discussed on Twitter. Do you really want to go through thousands of tweets by hand?
  • 30. So you’d better think about automating your coding: Finding out how negative or positive politicians are towards their opponents. The student took lists with positive and negative words and made additional ones with a politician’s opponents. She used a Python script to check which type of words was used to refer to opponents. For further analysis, the results were imported into SPSS. Schut, L. (2013). Verenigde Staten vs. Verenigd Koningrijk: Een automatische inhoudsanalyse naar verklarende factoren voor het gebruik van positive campaigning en negative campaigning door vooraanstaande politici en politieke partijen op Twitter. Bachelor Thesis, Universiteit van Amsterdam.
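A word-list approach of this kind might look roughly like the sketch below. It is not the student's script: the word lists, the opponent names and the function `code_tweet` are invented for illustration, and a real analysis would use much longer lists.

```python
import re

# Assumed toy word lists; a real study would use much longer ones.
POSITIVE = {"good", "great", "strong"}
NEGATIVE = {"bad", "weak", "failed"}
OPPONENTS = {"cameron", "tories"}  # hypothetical opponents of one politician


def code_tweet(tweet):
    """Return (mentions_opponent, n_positive, n_negative) for one tweet."""
    words = re.findall(r"[a-z]+", tweet.lower())
    mentions_opponent = any(w in OPPONENTS for w in words)
    n_positive = sum(w in POSITIVE for w in words)
    n_negative = sum(w in NEGATIVE for w in words)
    return mentions_opponent, n_positive, n_negative


print(code_tweet("Cameron has failed: a weak budget, bad for families"))
```

Writing one such tuple per tweet to a CSV file gives a table that a statistics package can read for further analysis.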
  • 33. Frame adoption on Twitter: Which phrases used by Merkel and Steinbrück on TV make it to the #tvduell discussion on Twitter? Identify frequently used words in the transcript of the debate and in tweets. Find co-occurrences.
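The idea of comparing frequent words in the debate with words in the tweets can be sketched as below. The texts are invented English stand-ins, the threshold of two occurrences is an arbitrary assumption, and simple word overlap is used here as a stand-in for the co-occurrence analysis mentioned on the slide.

```python
import re
from collections import Counter

# Hypothetical snippets standing in for the debate transcript and the tweets.
transcript = "We need justice. Justice and jobs for everyone."
tweets = [
    "Justice? He keeps saying that #tvduell",
    "Jobs, jobs, jobs #tvduell",
]


def word_counts(text):
    """Count lower-cased word tokens."""
    return Counter(re.findall(r"\w+", text.lower()))


debate_counts = word_counts(transcript)
tweet_counts = word_counts(" ".join(tweets))

# Words used at least twice in the debate (assumed threshold) that also appear in tweets.
frequent_in_debate = {w for w, n in debate_counts.items() if n >= 2}
adopted = sorted(w for w in frequent_in_debate if w in tweet_counts)
print(adopted)
```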
  • 34. Frame adoption on Twitter
  • 35. The process: collect, store, analyze. A scheme
  • 39. Our implementation: datacollection.followthenews-uva.cloudlet.sara.nl. yourTwapperkeeper: Continuously calls the Twitter-API and saves all tweets containing specific hashtags to a MySQL-database. rsshond: Calls the RSS-feeds of news sites 1x/hour, saves title, time, header, and teaser of all new articles into a CSV-table, follows the link to the full text and downloads it. snapshot: Visits some URLs 4x/day and downloads them.
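To give an impression of what an RSS collector does with a feed, here is a minimal sketch that turns the items of an RSS 2.0 document into CSV rows. It is not the rsshond code: the feed snippet, the function `feed_to_rows` and the column order are assumptions, and the download step is left out so that the example runs without a network connection.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical RSS 2.0 snippet; a real collector would download this from the feed URL.
FEED = """<?xml version="1.0"?>
<rss version="2.0"><channel>
  <title>Example news site</title>
  <item>
    <title>First headline</title>
    <link>http://example.com/1</link>
    <pubDate>Thu, 30 Jan 2014 09:30:00 +0100</pubDate>
    <description>Teaser of the first article</description>
  </item>
  <item>
    <title>Second headline</title>
    <link>http://example.com/2</link>
    <pubDate>Thu, 30 Jan 2014 10:30:00 +0100</pubDate>
    <description>Teaser of the second article</description>
  </item>
</channel></rss>"""


def feed_to_rows(feed_xml):
    """Extract [title, time, link, teaser] for every item in an RSS 2.0 feed."""
    root = ET.fromstring(feed_xml)
    rows = []
    for item in root.iter("item"):
        rows.append([
            item.findtext("title", default=""),
            item.findtext("pubDate", default=""),
            item.findtext("link", default=""),
            item.findtext("description", default=""),
        ])
    return rows


rows = feed_to_rows(FEED)
buffer = io.StringIO()  # stands in for an open CSV file
csv.writer(buffer).writerows(rows)
print(buffer.getvalue())
```

A scheduled job (for example once per hour) would append only the items it has not seen before and then fetch each link for the full text.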
  • 43. How to access the collected data? Apache-webserver: Download the data from http://datacollection.followthenews-uva.cloudlet.sara.nl. SSH (scp): Transfer data directly to your computer or another server (like speeltuin.followthenews-uva.cloudlet.sara.nl). Beehub: Connect the server to beehub, which can be mounted like the "p-schijf" (the P: drive) or accessed online.
  • 44. Python
  • 47. One tool to rule them all? Of course there are ready-made tools for some of the questions we want to answer. But for many, there aren’t. Python offers us the possibility to build exactly the tool we need. And it’s fun!
  • 50. What is Python? It is a programming language • It is flexible. You can use it for (in principle) any kind of data • There are virtually no limits regarding the amount of data to process • You can run it on every platform • And yet it is easy to learn! It is widely used for content analysis • Many online resources and toolkits • Books about NLP and Web Scraping with Python
  • 54. You do not have to become a programmer. If you know how to write SPSS or STATA syntax, you will understand Python. (But if you have ever had contact with any programming language, it helps.) It’s enough if you can read and modify the code.
  • 58. Think of the following task. RQ: What are the differences in terms of actors mentioned between Israeli and Palestinian news coverage? 1 The data structure: You have a folder with articles 2 The desired output: You want a table with the file names and a column per actor, counting how often they are mentioned 3 A typical task for a short Python script!
  • 59. You need something like this (such a notation is called pseudo-code):
    for every file in folder:
        read the file
        count actors
        add new row to table with filename and actor counts
    save table
  • 60. The same task as a Python script (shown for one actor pattern):
    import csv
    import re
    from os import listdir
    from os.path import isfile, join

    # Folder with the articles (backslashes restored; adjust to your own path)
    mypath = r"C:\Users\Ricarda\Documents\Artikelen"
    # Pattern for one actor group. The brackets on the slide were garbled;
    # the grouping below is the assumed intent.
    regex54 = re.compile(r"Israel.*(?:minister|politician.*|[Aa]uthorit)")

    filename_list = []
    matchcount54_list = []
    onlyfiles = [f for f in listdir(mypath) if isfile(join(mypath, f))]
    for f in onlyfiles:
        matchcount54 = 0
        # Count the matches line by line
        with open(join(mypath, f), "r") as artikel:
            for line in artikel:
                matchcount54 += len(regex54.findall(line))
        filename_list.append(f)
        matchcount54_list.append(matchcount54)

    # One row per file: file name and match count
    with open("overzichtstabel.csv", "w", newline="") as csvfile:
        csv.writer(csvfile).writerows(zip(filename_list, matchcount54_list))
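The desired output described earlier has one column per actor, while the script on the slide counts a single pattern. One possible way to generalize it is sketched below; the actor names, the patterns, the function `count_actors` and the file name `overview.csv` are all assumptions made for the example, and the demo runs on a temporary folder with two made-up articles.

```python
import csv
import re
import tempfile
from pathlib import Path

# Hypothetical actor patterns; real ones would come from the codebook.
ACTORS = {
    "netanyahu": re.compile(r"Netanyahu"),
    "abbas": re.compile(r"Abbas"),
}


def count_actors(folder):
    """Return one row per file: [filename, count for each actor...]."""
    rows = []
    for path in sorted(Path(folder).iterdir()):
        if path.is_file():
            text = path.read_text(encoding="utf-8")
            rows.append([path.name] + [len(rx.findall(text)) for rx in ACTORS.values()])
    return rows


# Demo on a throwaway folder with two made-up articles.
with tempfile.TemporaryDirectory() as folder:
    Path(folder, "a.txt").write_text("Netanyahu met Abbas. Netanyahu spoke.", encoding="utf-8")
    Path(folder, "b.txt").write_text("Abbas responded.", encoding="utf-8")
    table = count_actors(folder)
    # Write the table with a header row: filename plus one column per actor.
    with open(Path(folder, "overview.csv"), "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["filename"] + list(ACTORS))
        writer.writerows(table)

print(table)
```

Keeping the patterns in a dictionary means that adding an actor is a one-line change, and the column order in the CSV follows the order of the dictionary.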
  • 61. This is not too different from a script Jelle uses for his dissertation. The main difference: He doesn’t code regular expressions, but calculates document similarity. (slides-jelle.pdf)
  • 62. When to use Python
  • 63. 1st group of tasks: Highly repetitive tasks. Simple tasks (counting things, comparing texts, . . . ) that can be described in a formalized way. Saves time even with few cases, but there is virtually no size limit. Example: Retweets start with RT, optionally followed by a space, and some letters. So it is very easy to identify them automatically.
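The retweet rule can be written as a short regular expression. The sketch below assumes the common convention that "RT" is followed by an @username, which is slightly stricter than "some letters"; the pattern name `RETWEET` and the sample tweets are made up for illustration.

```python
import re

# "RT", an optional space, then an @username (assumed, simplified convention).
RETWEET = re.compile(r"^RT ?@\w+")


def is_retweet(tweet):
    """Return True if the tweet text starts like a retweet."""
    return RETWEET.match(tweet) is not None


tweets = [
    "RT @damian0604: Hands-on workshop today #bigdata",
    "RT@someone old style retweet",
    "Really tired today",
    "RTL Nieuws is on",
]
flags = [is_retweet(t) for t in tweets]
print(flags)
```

Requiring the "@" keeps ordinary tweets that merely start with the letters "RT" from being counted as retweets.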
  • 64. 2nd group of tasks: Tasks for which specific Python modules exist. There are thousands of modules suitable for text analysis. You basically only have to write code for data input and output. Example: Sentiment analysis
  • 65. 3rd group of tasks: APIs, RSS, webscraping . . . You can use Python if you want to collect and store information. Example: Collecting bios of Twitter users, scraping the web (data journalism!), downloading Facebook data
  • 66. When not to use Python
  • 68. Maybe you do not need to write a Python script . . . . . . when there are already suitable tools available. Sometimes, the perfect ready-made tool already exists. Example: Axel Bruns’ awk-scripts for Twitter analysis (www.mappingonlinepublics.net). If I had to write such a tool, I’d do it in Python, but hey, he did it already with awk and it works. But still, sometimes it is more efficient to write something that does exactly what you want.
  • 70. And, let’s face it, . . . we are no programmers. So maybe, some tasks are too complex for us to program ourselves. But there is a huge online community that helps you.
  • 71. Recap: 1 Big Data? What are we talking about? (Exploring the field; Some examples) 2 The process: collect, store, analyze (A scheme; Our implementation) 3 Python (What it is; When to use it; When not to use it) 4 Questions?
  • 72. After the break: Hands-on! Exploring a basic Python script
  • 73. Questions or comments? Damian Trilling d.c.trilling@uva.nl @damian0604 www.damiantrilling.net