SlideShare a Scribd company logo
1 of 58
Download to read offline
Last week’s exercise Data harvesting and storage Storing data Next meetings
Big Data and Automated Content Analysis
Week 3 – Monday
»Data harvesting and storage«
Damian Trilling
d.c.trilling@uva.nl
@damian0604
www.damiantrilling.net
Afdeling Communicatiewetenschap
Universiteit van Amsterdam
11 April 2016
Big Data and Automated Content Analysis Damian Trilling
Last week’s exercise Data harvesting and storage Storing data Next meetings
Today
1 Last week’s exercise
Step by step
Concluding remarks
2 Data harvesting and storage
APIs
RSS feeds
Scraping and crawling
Parsing text files
3 Storing data
CSV tables
JSON and XML
4 Next meetings
Big Data and Automated Content Analysis Damian Trilling
Last week’s exercise
Discussing the code
Last week’s exercise Data harvesting and storage Storing data Next meetings
Step by step
Reading a JSON file into a dict, looping over the dict
Task 1: Print all titles of all videos
1 import json
2
3 with open("/home/damian/pornexercise/xhamster.json") as fi:
4 data=json.load(fi)
5
6 for item in data:
7 print (data[item]["title"])
Big Data and Automated Content Analysis Damian Trilling
Last week’s exercise Data harvesting and storage Storing data Next meetings
Step by step
Reading a JSON file into a dict, looping over the dict
Task 1: Print all titles of all videos
1 import json
2
3 with open("/home/damian/pornexercise/xhamster.json") as fi:
4 data=json.load(fi)
5
6 for item in data:
7 print (data[item]["title"])
NB: You have to know (e.g., by reading the documentation of the dataset)
that the key is called title
Big Data and Automated Content Analysis Damian Trilling
Last week’s exercise Data harvesting and storage Storing data Next meetings
Step by step
Reading a JSON file into a dict, looping over the dict
Task 1: Print all titles of all videos
1 import json
2
3 with open("/home/damian/pornexercise/xhamster.json") as fi:
4 data=json.load(fi)
5
6 for item in data:
7 print (data[item]["title"])
NB: You have to know (e.g., by reading the documentation of the dataset)
that the key is called title
NB: data is in fact a dict of dicts, such that data[item] is another dict.
For each of these dicts, we retrieve the value that corresponds to the key title
Big Data and Automated Content Analysis Damian Trilling
Last week’s exercise Data harvesting and storage Storing data Next meetings
Step by step
Initializing variables, merging two lists, using a counter
Task 2: Average tags per video and most frequently used tags
1 from collections import Counter
2
3 alltags=[]
4 i=0
5 for item in data:
6 i+=1
7 alltags+=data[item]["channels"]
8
9 print(len(alltags),"tags are describing",i,"different videos")
10 print("Thus, we have an average of",len(alltags)/i,"tags per video")
11
12 c=Counter(alltags)
13 print (c.most_common(100))
(there are other, more efficient ways of doing this)
Big Data and Automated Content Analysis Damian Trilling
Last week’s exercise Data harvesting and storage Storing data Next meetings
Step by step
Nesting blocks, using a defaultdict to count, error handling
Task 3: What porn category is most frequently commented on?
1 from collections import defaultdict
2
3 commentspercat=defaultdict(int)
4 for item in data:
5 for tag in data[item]["channels"]:
6 try:
7 commentspercat[tag]+=int(data[item]["nb_comments"])
8 except:
9 pass
10 print(commentspercat)
11 # if you want to print in a fancy way, you can do it like this:
12 for tag in sorted(commentspercat, key=commentspercat.get, reverse=True):
13 print( tag,"t", commentspercat[tag])
A defaultdict is a normal dict, with the difference that the type of each value is
pre-defined and it doesn’t give an error if you look up a non-existing key
Big Data and Automated Content Analysis Damian Trilling
Last week’s exercise Data harvesting and storage Storing data Next meetings
Step by step
Nesting blocks, using a defaultdict to count, error handling
Task 3: What porn category is most frequently commented on?
1 from collections import defaultdict
2
3 commentspercat=defaultdict(int)
4 for item in data:
5 for tag in data[item]["channels"]:
6 try:
7 commentspercat[tag]+=int(data[item]["nb_comments"])
8 except:
9 pass
10 print(commentspercat)
11 # if you want to print in a fancy way, you can do it like this:
12 for tag in sorted(commentspercat, key=commentspercat.get, reverse=True):
13 print( tag,"t", commentspercat[tag])
A defaultdict is a normal dict, with the difference that the type of each value is
pre-defined and it doesn’t give an error if you look up a non-existing key
NB: In line 7, we assume the value to be an int, but the datasets sometimes contains
the string “NA” instead of a string representing an int. That’s why we need the
try/except construction
Big Data and Automated Content Analysis Damian Trilling
Last week’s exercise Data harvesting and storage Storing data Next meetings
Step by step
Adding elements to a list, sum() and len()
Task 4: Average length of descriptions
1 length=[]
2 for item in data:
3 length.append(len(data[item]["description"]))
4
5 print ("Average length",sum(length)/len(length))
Big Data and Automated Content Analysis Damian Trilling
Last week’s exercise Data harvesting and storage Storing data Next meetings
Step by step
Tokenizing with .split()
Task 5: Most frequently used words
1 allwords=[]
2 for item in data:
3 allwords+=data[item]["description"].split()
4 c2=Counter(allwords)
5 print(c2.most_common(100))
.split() changes a string to a list of words.
"This is cool”.split()
results in
["This", "is", "cool"]
Big Data and Automated Content Analysis Damian Trilling
Last week’s exercise Data harvesting and storage Storing data Next meetings
Concluding remarks
Concluding remarks
Make sure you fully understand the code!
Re-read the corresponding chapters
PLAY AROUND!!!
Big Data and Automated Content Analysis Damian Trilling
Data harvesting and storage
An overview of APIs, scrapers, crawlers, RSS-feeds, and different
file formats
Last week’s exercise Data harvesting and storage Storing data Next meetings
APIs
Collecting data:
APIs
Big Data and Automated Content Analysis Damian Trilling
Last week’s exercise Data harvesting and storage Storing data Next meetings
APIs
APIs
Big Data and Automated Content Analysis Damian Trilling
Last week’s exercise Data harvesting and storage Storing data Next meetings
APIs
Querying an API
1 # contact the Twitter API
2 auth = OAuth(access_key, access_secret, consumer_key, consumer_secret)
3 twitter = Twitter(auth = auth)
4
5 # get all info about the user ’username’)
6 tweepinfo=twitter.users.show(screen_name=username)
7
8 # save his bio statement to the variable bio
9 bio=tweepinfo["description"])
10
11 # save his location to the variable location
12 location=tweepinfo["location"]
(abbreviated Python example of how to query the Twitter REST API)
Big Data and Automated Content Analysis Damian Trilling
Last week’s exercise Data harvesting and storage Storing data Next meetings
APIs
Who offers APIs?
The usual suspects: Twitter, Facebook, Google – but also Reddit,
Youtube, . . .
Big Data and Automated Content Analysis Damian Trilling
Last week’s exercise Data harvesting and storage Storing data Next meetings
APIs
Who offers APIs?
The usual suspects: Twitter, Facebook, Google – but also Reddit,
Youtube, . . .
If you ever leave your bag on a bus on Chicago
Big Data and Automated Content Analysis Damian Trilling
Last week’s exercise Data harvesting and storage Storing data Next meetings
APIs
Who offers APIs?
The usual suspects: Twitter, Facebook, Google – but also Reddit,
Youtube, . . .
If you ever leave your bag on a bus on Chicago
. . . but do have Python on your laptop, watch this:
https://www.youtube.com/watch?v=RrPZza_vZ3w.
That guy queries the Chicago bus company’s API to calculate
when exactly the vehicle with his bag arrives the next time at the
bus stop in front of his office.
(Yes, he tried calling the help desk before, but they didn’t know. He got his bag back.)
Big Data and Automated Content Analysis Damian Trilling
Last week’s exercise Data harvesting and storage Storing data Next meetings
APIs
APIs
Pro
• Structured data
• Easy to process automatically
• Can be directy embedded in your script
Morstatter, F., Pfeffer, J., Liu, H., & Carley, K. M. (2013). Is the sample good enough? Comparing data from
Twitter’s Streaming API with Twitter’s Firehose. International AAAI Conference on Weblogs and Social Media.
Big Data and Automated Content Analysis Damian Trilling
Last week’s exercise Data harvesting and storage Storing data Next meetings
APIs
APIs
Pro
• Structured data
• Easy to process automatically
• Can be directy embedded in your script
Con
• Often limitations (requests per minute, sampling, . . . )
• You have to trust the provider that he delivers the right
content (⇒ Morstatter e.a., 2013)
• Some APIs won’t allow you to go back in time!
Morstatter, F., Pfeffer, J., Liu, H., & Carley, K. M. (2013). Is the sample good enough? Comparing data from
Twitter’s Streaming API with Twitter’s Firehose. International AAAI Conference on Weblogs and Social Media.
Big Data and Automated Content Analysis Damian Trilling
Last week’s exercise Data harvesting and storage Storing data Next meetings
APIs
So we have learned that we can access an API directly.
But what if we have to do so 24/7?
Big Data and Automated Content Analysis Damian Trilling
Last week’s exercise Data harvesting and storage Storing data Next meetings
APIs
So we have learned that we can access an API directly.
But what if we have to do so 24/7?
Collecting tweets with a tool running on a server that does query
the API 24/7.
Big Data and Automated Content Analysis Damian Trilling
Last week’s exercise Data harvesting and storage Storing data Next meetings
APIs
So we have learned that we can access an API directly.
But what if we have to do so 24/7?
Collecting tweets with a tool running on a server that does query
the API 24/7.
For example, we are running DMI-TCAT (Borra & Rieder, 2014) on
http://datacollection3.followthenews-uva.cloudlet.sara.nl
Borra, E., & Rieder, B. (2014). Programmed method: Developing a toolset for capturing and analyzing tweets.
Aslib Journal of Information Management, 66(3), 262–278. doi:10.1108/AJIM-09-2013-0094
Big Data and Automated Content Analysis Damian Trilling
Last week’s exercise Data harvesting and storage Storing data Next meetings
APIs
DMI-TCAT
Big Data and Automated Content Analysis Damian Trilling
Last week’s exercise Data harvesting and storage Storing data Next meetings
APIs
DMI-TCAT
It queries the API and stores the result
It continuosly calls the Twitter-API and saves
all tweets containing specific hashtags to a
MySQL-database.
You tell it once which data to collect – and
wait some months.
Big Data and Automated Content Analysis Damian Trilling
Last week’s exercise Data harvesting and storage Storing data Next meetings
APIs
DMI-TCAT
Big Data and Automated Content Analysis Damian Trilling
Last week’s exercise Data harvesting and storage Storing data Next meetings
APIs
DMI-TCAT
Retrieving the data for analysis
You could access the MySQL-database directly.
But that’s not neccessary, as DMI-TCAT has a
nice interface that allows you to save the data
as a CSV file.
(and offers tools for direct analysis as well)
Big Data and Automated Content Analysis Damian Trilling
Last week’s exercise Data harvesting and storage Storing data Next meetings
RSS feeds
Collecting data:
RSS feeds
Big Data and Automated Content Analysis Damian Trilling
Last week’s exercise Data harvesting and storage Storing data Next meetings
RSS feeds
RSS feeds
What’s that?
• A structured (XML) format in which for example news sites
and blogs offer their content
Big Data and Automated Content Analysis Damian Trilling
Last week’s exercise Data harvesting and storage Storing data Next meetings
RSS feeds
RSS feeds
What’s that?
• A structured (XML) format in which for example news sites
and blogs offer their content
• You get only the new news items (and that’s great!)
Big Data and Automated Content Analysis Damian Trilling
Last week’s exercise Data harvesting and storage Storing data Next meetings
RSS feeds
RSS feeds
What’s that?
• A structured (XML) format in which for example news sites
and blogs offer their content
• You get only the new news items (and that’s great!)
• Title, teaser (or full text), date and time, link
Big Data and Automated Content Analysis Damian Trilling
Last week’s exercise Data harvesting and storage Storing data Next meetings
RSS feeds
RSS feeds
What’s that?
• A structured (XML) format in which for example news sites
and blogs offer their content
• You get only the new news items (and that’s great!)
• Title, teaser (or full text), date and time, link
http://www.nu.nl/rss
http://datacollection3.followthenews-uva.cloudlet.sara.nl/rsshond
Big Data and Automated Content Analysis Damian Trilling
Last week’s exercise Data harvesting and storage Storing data Next meetings
RSS feeds
RSS feeds
Pro
• One protocol for all services
• Easy to use
Con
• Full text often not included, you have to download the link
separately (⇒ Problems associated with scraping)
• You can’t go back in time! But we have archived a lot of RSS
feeds
Big Data and Automated Content Analysis Damian Trilling
Last week’s exercise Data harvesting and storage Storing data Next meetings
Scraping and crawling
Collecting data:
Scraping and crawling
Big Data and Automated Content Analysis Damian Trilling
Last week’s exercise Data harvesting and storage Storing data Next meetings
Scraping and crawling
Scraping and crawling
If you have no chance of getting already structured data via
one of the approaches above
Big Data and Automated Content Analysis Damian Trilling
Last week’s exercise Data harvesting and storage Storing data Next meetings
Scraping and crawling
Scraping and crawling
If you have no chance of getting already structured data via
one of the approaches above
• Download web pages, try to identify the structure yourself
• You have to parse the data
Big Data and Automated Content Analysis Damian Trilling
Last week’s exercise Data harvesting and storage Storing data Next meetings
Scraping and crawling
Scraping and crawling
If you have no chance of getting already structured data via
one of the approaches above
• Download web pages, try to identify the structure yourself
• You have to parse the data
• Can get very complicated (depending on the specific task),
especially if the structure of the web pages changes
Further reading:
http://scrapy.org
https:
//github.com/anthonydb/python-get-started/blob/master/5-web-scraping.py
Big Data and Automated Content Analysis Damian Trilling
Last week’s exercise Data harvesting and storage Storing data Next meetings
Parsing text files
Collecting data:
Parsing text files
Big Data and Automated Content Analysis Damian Trilling
Last week’s exercise Data harvesting and storage Storing data Next meetings
Parsing text files
For messy input data or for semi-structured data
Guiding question: Can we identify some kind of pattern?
Big Data and Automated Content Analysis Damian Trilling
Last week’s exercise Data harvesting and storage Storing data Next meetings
Parsing text files
For messy input data or for semi-structured data
Guiding question: Can we identify some kind of pattern?
Examples
• Lewis, Zamith, & Hermida (2013) had a corrupt CSV-file
Lewis, S. C., Zamith, R., & Hermida, A. (2013). Content analysis in an era of Big Data: A hybrid approach to
computational and manual methods. Journal of Broadcasting & Electronic Media, 57(1), 34–52.
doi:10.1080/08838151.2012.761702
Big Data and Automated Content Analysis Damian Trilling
Last week’s exercise Data harvesting and storage Storing data Next meetings
Parsing text files
For messy input data or for semi-structured data
Guiding question: Can we identify some kind of pattern?
Examples
• Lewis, Zamith, & Hermida (2013) had a corrupt CSV-file
• LexisNexis gives you a chunk of text (rather than, e.g., a
structured JSON or XML object)
Big Data and Automated Content Analysis Damian Trilling
Last week’s exercise Data harvesting and storage Storing data Next meetings
Parsing text files
For messy input data or for semi-structured data
Guiding question: Can we identify some kind of pattern?
Examples
• Lewis, Zamith, & Hermida (2013) had a corrupt CSV-file
• LexisNexis gives you a chunk of text (rather than, e.g., a
structured JSON or XML object)
But in both cases, as long as you can find any pattern or structure
in it, you can try to write a Python script to parse the data.
Big Data and Automated Content Analysis Damian Trilling
1 with open(bestandsnaam) as f:
2 for line in f:
3 line=line.replace("r","")
4 if line=="n":
5 continue
6 matchObj=re.match(r"s+(d+) of (d+) DOCUMENTS",line)
7 if matchObj:
8 artikel= int(matchObj.group(1))
9 tekst[artikel]=""
10 continue
11 if line.startswith("SECTION"):
12 section[artikel]=line.replace("SECTION: ","").rstrip("n")
13 elif line.startswith("LENGTH"):
14 length[artikel]=line.replace("LENGTH: ","").rstrip("n")
15 ...
16 ...
17 ...
18
19 else:
20 tekst[artikel]=tekst[artikel]+line
Last week’s exercise Data harvesting and storage Storing data Next meetings
CSV tables
Storing data:
CSV tables
Big Data and Automated Content Analysis Damian Trilling
Last week’s exercise Data harvesting and storage Storing data Next meetings
CSV tables
CSV-files
Always a good choice
• All programs can read it
• Even human-readable in a simple text editor:
• Plain text, with a comma (or a semicolon) denoting column
breaks
• No limits regarging the size
Big Data and Automated Content Analysis Damian Trilling
Last week’s exercise Data harvesting and storage Storing data Next meetings
CSV tables
A CSV-file with tweets
1 text,to_user_id,from_user,id,from_user_id,iso_language_code,source,
profile_image_url,geo_type,geo_coordinates_0,geo_coordinates_1,
created_at,time
2 :-) #Lectrr #wereldleiders #uitspraken #Wikileaks #klimaattop http://t.
co/Udjpk48EIB,,henklbr,407085917011079169,118374840,nl,web,http://
pbs.twimg.com/profile_images/378800000673845195/
b47785b1595e6a1c63b93e463f3d0ccc_normal.jpeg,,0,0,Sun Dec 01
09:57:00 +0000 2013,1385891820
3 Wat zijn de resulaten vd #klimaattop in #Warschau waard? @EP_Environment
ontmoet voorzitter klimaattop @MarcinKorolec http://t.co/4
Lmiaopf60,,Europarl_NL,406058792573730816,37623918,en,<a href="http
://www.hootsuite.com" rel="nofollow">HootSuite</a>,http://pbs.twimg
.com/profile_images/2943831271/
b6631b23a86502fae808ca3efde23d0d_normal.png,,0,0,Thu Nov 28
13:55:35 +0000 2013,1385646935
Big Data and Automated Content Analysis Damian Trilling
Last week’s exercise Data harvesting and storage Storing data Next meetings
JSON and XML
Storing data:
JSON and XML
Big Data and Automated Content Analysis Damian Trilling
Last week’s exercise Data harvesting and storage Storing data Next meetings
JSON and XML
JSON and XML
Great if we have a nested data structure
Big Data and Automated Content Analysis Damian Trilling
Last week’s exercise Data harvesting and storage Storing data Next meetings
JSON and XML
JSON and XML
Great if we have a nested data structure
• Items within feeds
Big Data and Automated Content Analysis Damian Trilling
Last week’s exercise Data harvesting and storage Storing data Next meetings
JSON and XML
JSON and XML
Great if we have a nested data structure
• Items within feeds
• Personal data within authors within books
Big Data and Automated Content Analysis Damian Trilling
Last week’s exercise Data harvesting and storage Storing data Next meetings
JSON and XML
JSON and XML
Great if we have a nested data structure
• Items within feeds
• Personal data within authors within books
• Tweets within followers within users
Big Data and Automated Content Analysis Damian Trilling
Last week’s exercise Data harvesting and storage Storing data Next meetings
JSON and XML
A JSON object containing GoogleBooks data
1 {’totalItems’: 574, ’items’: [{’kind’: ’books#volume’, ’volumeInfo’: {’
publisher’: ’"O’Reilly Media, Inc."’, ’description’: u’Get a
comprehensive, in-depth introduction to the core Python language
with this hands-on book. Based on author Mark Lutzu2019s popular
training course, this updated fifth edition will help you quickly
write efficient, high-quality code with Python. Itu2019s an ideal
way to begin, whether youu2019re new to programming or a
professional developer versed in other languages. Complete with
quizzes, exercises, and helpful illustrations, this easy-to-follow,
self-paced tutorial gets you started with both Python 2.7 and 3.3
u2014 the
2 ...
3 ...
4 ’kind’: ’books#volumes’}
Big Data and Automated Content Analysis Damian Trilling
Last week’s exercise Data harvesting and storage Storing data Next meetings
JSON and XML
An XML object containing an RSS feed
1 ...
2 <item>
3 <title>Agema doet aangifte tegen Samsom en Spekman</title>
4 <link>http://www.nu.nl/politiek/3743441/agema-doet-aangifte-
samsom-en-spekman.html</link>
5 <guid>http://www.nu.nl/politiek/3743441/index.html</guid>
6 <description>PVV-Kamerlid Fleur Agema gaat vrijdag aangifte doen
tegen PvdA-leider Diederik Samsom en PvdA-voorzitter Hans
Spekman wegens uitspraken die zij hebben gedaan over
Marokkanen. </description>
7 <pubDate>Thu, 03 Apr 2014 21:58:48 +0200</pubDate>
8 <category>Algemeen</category>
9 <enclosure url="http://bin.snmmd.nl/m/m1mxwpka6nn2_sqr256.jpg"
type="image/jpeg" />
10 <copyrightPhoto>nu.nl</copyrightPhoto>
11 </item>
12 ...
Big Data and Automated Content Analysis Damian Trilling
Last week’s exercise Data harvesting and storage Storing data Next meetings
Next meetings
Big Data and Automated Content Analysis Damian Trilling
Last week’s exercise Data harvesting and storage Storing data Next meetings
Next meetings
Wednesday, 15–4
Writing some first data collection scripts
Big Data and Automated Content Analysis Damian Trilling

More Related Content

What's hot

The Web of Data: do we actually understand what we built?
The Web of Data: do we actually understand what we built?The Web of Data: do we actually understand what we built?
The Web of Data: do we actually understand what we built?Frank van Harmelen
 
Using the search engine as recommendation engine
Using the search engine as recommendation engineUsing the search engine as recommendation engine
Using the search engine as recommendation engineLars Marius Garshol
 
From TREC to Watson: is open domain question answering a solved problem?
From TREC to Watson: is open domain question answering a solved problem?From TREC to Watson: is open domain question answering a solved problem?
From TREC to Watson: is open domain question answering a solved problem?Constantin Orasan
 
Big Data LDN 2017: Machine Learning on Structured Data. Why Is Learning Rules...
Big Data LDN 2017: Machine Learning on Structured Data. Why Is Learning Rules...Big Data LDN 2017: Machine Learning on Structured Data. Why Is Learning Rules...
Big Data LDN 2017: Machine Learning on Structured Data. Why Is Learning Rules...Matt Stubbs
 
Presentation1.pdf
Presentation1.pdfPresentation1.pdf
Presentation1.pdfZixunZhou
 

What's hot (20)

BDACA - Lecture2
BDACA - Lecture2BDACA - Lecture2
BDACA - Lecture2
 
BDACA1617s2 - Lecture7
BDACA1617s2 - Lecture7BDACA1617s2 - Lecture7
BDACA1617s2 - Lecture7
 
BDACA - Lecture4
BDACA - Lecture4BDACA - Lecture4
BDACA - Lecture4
 
BDACA1617s2 - Lecture5
BDACA1617s2 - Lecture5BDACA1617s2 - Lecture5
BDACA1617s2 - Lecture5
 
BDACA1617s2 - Lecture6
BDACA1617s2 - Lecture6BDACA1617s2 - Lecture6
BDACA1617s2 - Lecture6
 
BDACA1617s2 - Lecture 2
BDACA1617s2 - Lecture 2BDACA1617s2 - Lecture 2
BDACA1617s2 - Lecture 2
 
BDACA1516s2 - Lecture4
 BDACA1516s2 - Lecture4 BDACA1516s2 - Lecture4
BDACA1516s2 - Lecture4
 
BDACA1617s2 - Tutorial 1
BDACA1617s2 - Tutorial 1BDACA1617s2 - Tutorial 1
BDACA1617s2 - Tutorial 1
 
Analyzing social media with Python and other tools (1/4)
Analyzing social media with Python and other tools (1/4)Analyzing social media with Python and other tools (1/4)
Analyzing social media with Python and other tools (1/4)
 
BD-ACA week3a
BD-ACA week3aBD-ACA week3a
BD-ACA week3a
 
BDACA - Lecture7
BDACA - Lecture7BDACA - Lecture7
BDACA - Lecture7
 
BDACA - Lecture8
BDACA - Lecture8BDACA - Lecture8
BDACA - Lecture8
 
BDACA - Lecture6
BDACA - Lecture6BDACA - Lecture6
BDACA - Lecture6
 
BD-ACA week5
BD-ACA week5BD-ACA week5
BD-ACA week5
 
Analyzing social media with Python and other tools (2/4)
Analyzing social media with Python and other tools (2/4) Analyzing social media with Python and other tools (2/4)
Analyzing social media with Python and other tools (2/4)
 
The Web of Data: do we actually understand what we built?
The Web of Data: do we actually understand what we built?The Web of Data: do we actually understand what we built?
The Web of Data: do we actually understand what we built?
 
Using the search engine as recommendation engine
Using the search engine as recommendation engineUsing the search engine as recommendation engine
Using the search engine as recommendation engine
 
From TREC to Watson: is open domain question answering a solved problem?
From TREC to Watson: is open domain question answering a solved problem?From TREC to Watson: is open domain question answering a solved problem?
From TREC to Watson: is open domain question answering a solved problem?
 
Big Data LDN 2017: Machine Learning on Structured Data. Why Is Learning Rules...
Big Data LDN 2017: Machine Learning on Structured Data. Why Is Learning Rules...Big Data LDN 2017: Machine Learning on Structured Data. Why Is Learning Rules...
Big Data LDN 2017: Machine Learning on Structured Data. Why Is Learning Rules...
 
Presentation1.pdf
Presentation1.pdfPresentation1.pdf
Presentation1.pdf
 

Viewers also liked (12)

C.V Sukaina Harasis-2-2
C.V Sukaina Harasis-2-2C.V Sukaina Harasis-2-2
C.V Sukaina Harasis-2-2
 
Apararto reproductor. zabdiel rivera
Apararto reproductor. zabdiel riveraApararto reproductor. zabdiel rivera
Apararto reproductor. zabdiel rivera
 
Miguelde libes
Miguelde libesMiguelde libes
Miguelde libes
 
Leith Planning Brochure
Leith Planning BrochureLeith Planning Brochure
Leith Planning Brochure
 
Pedro diapositivas
Pedro diapositivasPedro diapositivas
Pedro diapositivas
 
Techyy
TechyyTechyy
Techyy
 
Ejercicio 14
Ejercicio 14Ejercicio 14
Ejercicio 14
 
Aparato Genital Masculino
Aparato Genital MasculinoAparato Genital Masculino
Aparato Genital Masculino
 
Presentación ANGEL y las enfermedades lisosomales
Presentación ANGEL y las enfermedades lisosomalesPresentación ANGEL y las enfermedades lisosomales
Presentación ANGEL y las enfermedades lisosomales
 
Región Glútea
Región GlúteaRegión Glútea
Región Glútea
 
IN-salary-guide-2015
IN-salary-guide-2015IN-salary-guide-2015
IN-salary-guide-2015
 
Cromosomas tipos, funciones, Careotipo
Cromosomas tipos, funciones, CareotipoCromosomas tipos, funciones, Careotipo
Cromosomas tipos, funciones, Careotipo
 

Similar to BDACA1516s2 - Lecture3

Adventure in Data: A tour of visualization projects at Twitter
Adventure in Data: A tour of visualization projects at TwitterAdventure in Data: A tour of visualization projects at Twitter
Adventure in Data: A tour of visualization projects at TwitterKrist Wongsuphasawat
 
Concepts, use cases and principles to build big data systems (1)
Concepts, use cases and principles to build big data systems (1)Concepts, use cases and principles to build big data systems (1)
Concepts, use cases and principles to build big data systems (1)Trieu Nguyen
 
Introduction to Text Mining and Visualization with Interactive Web Application
Introduction to Text Mining and Visualization with Interactive Web ApplicationIntroduction to Text Mining and Visualization with Interactive Web Application
Introduction to Text Mining and Visualization with Interactive Web ApplicationOlga Scrivner
 
Strata 2017 (San Jose): Building a healthy data ecosystem around Kafka and Ha...
Strata 2017 (San Jose): Building a healthy data ecosystem around Kafka and Ha...Strata 2017 (San Jose): Building a healthy data ecosystem around Kafka and Ha...
Strata 2017 (San Jose): Building a healthy data ecosystem around Kafka and Ha...Shirshanka Das
 
Building a healthy data ecosystem around Kafka and Hadoop: Lessons learned at...
Building a healthy data ecosystem around Kafka and Hadoop: Lessons learned at...Building a healthy data ecosystem around Kafka and Hadoop: Lessons learned at...
Building a healthy data ecosystem around Kafka and Hadoop: Lessons learned at...Yael Garten
 
How Lyft Drives Data Discovery
How Lyft Drives Data DiscoveryHow Lyft Drives Data Discovery
How Lyft Drives Data DiscoveryNeo4j
 
Into The Wonderful
Into The WonderfulInto The Wonderful
Into The WonderfulMatt Wood
 
Monitoring as Software Validation
Monitoring as Software ValidationMonitoring as Software Validation
Monitoring as Software ValidationBioDec
 
Advances in Exploratory Data Analysis, Visualisation and Quality for Data Cen...
Advances in Exploratory Data Analysis, Visualisation and Quality for Data Cen...Advances in Exploratory Data Analysis, Visualisation and Quality for Data Cen...
Advances in Exploratory Data Analysis, Visualisation and Quality for Data Cen...Hima Patel
 
Introduction to Data Science With R Notes
Introduction to Data Science With R NotesIntroduction to Data Science With R Notes
Introduction to Data Science With R NotesLakshmiSarvani6
 
eScience: A Transformed Scientific Method
eScience: A Transformed Scientific MethodeScience: A Transformed Scientific Method
eScience: A Transformed Scientific MethodDuncan Hull
 
Big data and APIs for PHP developers - SXSW 2011
Big data and APIs for PHP developers - SXSW 2011Big data and APIs for PHP developers - SXSW 2011
Big data and APIs for PHP developers - SXSW 2011Eli White
 
Recommendation systems
Recommendation systemsRecommendation systems
Recommendation systemsAnton Ermak
 
Python webinar 4th june
Python webinar 4th junePython webinar 4th june
Python webinar 4th juneEdureka!
 
Off-Label Data Mesh: A Prescription for Healthier Data
Off-Label Data Mesh: A Prescription for Healthier DataOff-Label Data Mesh: A Prescription for Healthier Data
Off-Label Data Mesh: A Prescription for Healthier DataHostedbyConfluent
 

Similar to BDACA1516s2 - Lecture3 (20)

BD-ACA week7a
BD-ACA week7aBD-ACA week7a
BD-ACA week7a
 
BD-ACA week2
BD-ACA week2BD-ACA week2
BD-ACA week2
 
BDACA - Tutorial5
BDACA - Tutorial5BDACA - Tutorial5
BDACA - Tutorial5
 
Adventure in Data: A tour of visualization projects at Twitter
Adventure in Data: A tour of visualization projects at TwitterAdventure in Data: A tour of visualization projects at Twitter
Adventure in Data: A tour of visualization projects at Twitter
 
Concepts, use cases and principles to build big data systems (1)
Concepts, use cases and principles to build big data systems (1)Concepts, use cases and principles to build big data systems (1)
Concepts, use cases and principles to build big data systems (1)
 
Introduction to Text Mining and Visualization with Interactive Web Application
Introduction to Text Mining and Visualization with Interactive Web ApplicationIntroduction to Text Mining and Visualization with Interactive Web Application
Introduction to Text Mining and Visualization with Interactive Web Application
 
Strata 2017 (San Jose): Building a healthy data ecosystem around Kafka and Ha...
Strata 2017 (San Jose): Building a healthy data ecosystem around Kafka and Ha...Strata 2017 (San Jose): Building a healthy data ecosystem around Kafka and Ha...
Strata 2017 (San Jose): Building a healthy data ecosystem around Kafka and Ha...
 
Building a healthy data ecosystem around Kafka and Hadoop: Lessons learned at...
Building a healthy data ecosystem around Kafka and Hadoop: Lessons learned at...Building a healthy data ecosystem around Kafka and Hadoop: Lessons learned at...
Building a healthy data ecosystem around Kafka and Hadoop: Lessons learned at...
 
Python webinar 2nd july
Python webinar 2nd julyPython webinar 2nd july
Python webinar 2nd july
 
How Lyft Drives Data Discovery
How Lyft Drives Data DiscoveryHow Lyft Drives Data Discovery
How Lyft Drives Data Discovery
 
Into The Wonderful
Into The WonderfulInto The Wonderful
Into The Wonderful
 
Monitoring as Software Validation
Monitoring as Software ValidationMonitoring as Software Validation
Monitoring as Software Validation
 
Advances in Exploratory Data Analysis, Visualisation and Quality for Data Cen...
Advances in Exploratory Data Analysis, Visualisation and Quality for Data Cen...Advances in Exploratory Data Analysis, Visualisation and Quality for Data Cen...
Advances in Exploratory Data Analysis, Visualisation and Quality for Data Cen...
 
Introduction to Data Science With R Notes
Introduction to Data Science With R NotesIntroduction to Data Science With R Notes
Introduction to Data Science With R Notes
 
Analyzing social media with Python and other tools (4/4)
Analyzing social media with Python and other tools (4/4) Analyzing social media with Python and other tools (4/4)
Analyzing social media with Python and other tools (4/4)
 
eScience: A Transformed Scientific Method
eScience: A Transformed Scientific MethodeScience: A Transformed Scientific Method
eScience: A Transformed Scientific Method
 
Big data and APIs for PHP developers - SXSW 2011
Big data and APIs for PHP developers - SXSW 2011Big data and APIs for PHP developers - SXSW 2011
Big data and APIs for PHP developers - SXSW 2011
 
Recommendation systems
Recommendation systemsRecommendation systems
Recommendation systems
 
Python webinar 4th june
Python webinar 4th junePython webinar 4th june
Python webinar 4th june
 
Off-Label Data Mesh: A Prescription for Healthier Data
Off-Label Data Mesh: A Prescription for Healthier DataOff-Label Data Mesh: A Prescription for Healthier Data
Off-Label Data Mesh: A Prescription for Healthier Data
 

More from Department of Communication Science, University of Amsterdam (10)

BDACA - Lecture5
BDACA - Lecture5BDACA - Lecture5
BDACA - Lecture5
 
BDACA - Tutorial1
BDACA - Tutorial1BDACA - Tutorial1
BDACA - Tutorial1
 
BDACA - Lecture1
BDACA - Lecture1BDACA - Lecture1
BDACA - Lecture1
 
BDACA1617s2 - Lecture 1
BDACA1617s2 - Lecture 1BDACA1617s2 - Lecture 1
BDACA1617s2 - Lecture 1
 
Media diets in an age of apps and social media: Dealing with a third layer of...
Media diets in an age of apps and social media: Dealing with a third layer of...Media diets in an age of apps and social media: Dealing with a third layer of...
Media diets in an age of apps and social media: Dealing with a third layer of...
 
Conceptualizing and measuring news exposure as network of users and news items
Conceptualizing and measuring news exposure as network of users and news itemsConceptualizing and measuring news exposure as network of users and news items
Conceptualizing and measuring news exposure as network of users and news items
 
Data Science: Case "Political Communication 2/2"
Data Science: Case "Political Communication 2/2"Data Science: Case "Political Communication 2/2"
Data Science: Case "Political Communication 2/2"
 
Data Science: Case "Political Communication 1/2"
Data Science: Case "Political Communication 1/2"Data Science: Case "Political Communication 1/2"
Data Science: Case "Political Communication 1/2"
 
BDACA1516s2 - Lecture1
BDACA1516s2 - Lecture1BDACA1516s2 - Lecture1
BDACA1516s2 - Lecture1
 
Should we worry about filter bubbles?
Should we worry about filter bubbles?Should we worry about filter bubbles?
Should we worry about filter bubbles?
 

Recently uploaded

Pharmacognosy Flower 3. Compositae 2023.pdf
Pharmacognosy Flower 3. Compositae 2023.pdfPharmacognosy Flower 3. Compositae 2023.pdf
Pharmacognosy Flower 3. Compositae 2023.pdfMahmoud M. Sallam
 
Solving Puzzles Benefits Everyone (English).pptx
Solving Puzzles Benefits Everyone (English).pptxSolving Puzzles Benefits Everyone (English).pptx
Solving Puzzles Benefits Everyone (English).pptxOH TEIK BIN
 
_Math 4-Q4 Week 5.pptx Steps in Collecting Data
_Math 4-Q4 Week 5.pptx Steps in Collecting Data_Math 4-Q4 Week 5.pptx Steps in Collecting Data
_Math 4-Q4 Week 5.pptx Steps in Collecting DataJhengPantaleon
 
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...Marc Dusseiller Dusjagr
 
Hybridoma Technology ( Production , Purification , and Application )
Hybridoma Technology  ( Production , Purification , and Application  ) Hybridoma Technology  ( Production , Purification , and Application  )
Hybridoma Technology ( Production , Purification , and Application ) Sakshi Ghasle
 
CARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptxCARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptxGaneshChakor2
 
Crayon Activity Handout For the Crayon A
Crayon Activity Handout For the Crayon ACrayon Activity Handout For the Crayon A
Crayon Activity Handout For the Crayon AUnboundStockton
 
भारत-रोम व्यापार.pptx, Indo-Roman Trade,
भारत-रोम व्यापार.pptx, Indo-Roman Trade,भारत-रोम व्यापार.pptx, Indo-Roman Trade,
भारत-रोम व्यापार.pptx, Indo-Roman Trade,Virag Sontakke
 
Biting mechanism of poisonous snakes.pdf
Biting mechanism of poisonous snakes.pdfBiting mechanism of poisonous snakes.pdf
Biting mechanism of poisonous snakes.pdfadityarao40181
 
Introduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher EducationIntroduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher Educationpboyjonauth
 
Computed Fields and api Depends in the Odoo 17
Computed Fields and api Depends in the Odoo 17Computed Fields and api Depends in the Odoo 17
Computed Fields and api Depends in the Odoo 17Celine George
 
Organic Name Reactions for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions  for the students and aspirants of Chemistry12th.pptxOrganic Name Reactions  for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions for the students and aspirants of Chemistry12th.pptxVS Mahajan Coaching Centre
 
Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)eniolaolutunde
 
Presiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha electionsPresiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha electionsanshu789521
 
internship ppt on smartinternz platform as salesforce developer
internship ppt on smartinternz platform as salesforce developerinternship ppt on smartinternz platform as salesforce developer
internship ppt on smartinternz platform as salesforce developerunnathinaik
 
Final demo Grade 9 for demo Plan dessert.pptx
Final demo Grade 9 for demo Plan dessert.pptxFinal demo Grade 9 for demo Plan dessert.pptx
Final demo Grade 9 for demo Plan dessert.pptxAvyJaneVismanos
 
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17Incoming and Outgoing Shipments in 1 STEP Using Odoo 17
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17Celine George
 
The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13Steve Thomason
 

Recently uploaded (20)

Pharmacognosy Flower 3. Compositae 2023.pdf
Pharmacognosy Flower 3. Compositae 2023.pdfPharmacognosy Flower 3. Compositae 2023.pdf
Pharmacognosy Flower 3. Compositae 2023.pdf
 
Solving Puzzles Benefits Everyone (English).pptx
Solving Puzzles Benefits Everyone (English).pptxSolving Puzzles Benefits Everyone (English).pptx
Solving Puzzles Benefits Everyone (English).pptx
 
_Math 4-Q4 Week 5.pptx Steps in Collecting Data
_Math 4-Q4 Week 5.pptx Steps in Collecting Data_Math 4-Q4 Week 5.pptx Steps in Collecting Data
_Math 4-Q4 Week 5.pptx Steps in Collecting Data
 
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
 
Hybridoma Technology ( Production , Purification , and Application )
Hybridoma Technology  ( Production , Purification , and Application  ) Hybridoma Technology  ( Production , Purification , and Application  )
Hybridoma Technology ( Production , Purification , and Application )
 
Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝
 
CARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptxCARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptx
 
Crayon Activity Handout For the Crayon A
Crayon Activity Handout For the Crayon ACrayon Activity Handout For the Crayon A
Crayon Activity Handout For the Crayon A
 
भारत-रोम व्यापार.pptx, Indo-Roman Trade,
भारत-रोम व्यापार.pptx, Indo-Roman Trade,भारत-रोम व्यापार.pptx, Indo-Roman Trade,
भारत-रोम व्यापार.pptx, Indo-Roman Trade,
 
Biting mechanism of poisonous snakes.pdf
Biting mechanism of poisonous snakes.pdfBiting mechanism of poisonous snakes.pdf
Biting mechanism of poisonous snakes.pdf
 
Introduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher EducationIntroduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher Education
 
Computed Fields and api Depends in the Odoo 17
Computed Fields and api Depends in the Odoo 17Computed Fields and api Depends in the Odoo 17
Computed Fields and api Depends in the Odoo 17
 
Organic Name Reactions for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions  for the students and aspirants of Chemistry12th.pptxOrganic Name Reactions  for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions for the students and aspirants of Chemistry12th.pptx
 
Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)
 
Presiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha electionsPresiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha elections
 
9953330565 Low Rate Call Girls In Rohini Delhi NCR
9953330565 Low Rate Call Girls In Rohini  Delhi NCR9953330565 Low Rate Call Girls In Rohini  Delhi NCR
9953330565 Low Rate Call Girls In Rohini Delhi NCR
 
internship ppt on smartinternz platform as salesforce developer
internship ppt on smartinternz platform as salesforce developerinternship ppt on smartinternz platform as salesforce developer
internship ppt on smartinternz platform as salesforce developer
 
Final demo Grade 9 for demo Plan dessert.pptx
Final demo Grade 9 for demo Plan dessert.pptxFinal demo Grade 9 for demo Plan dessert.pptx
Final demo Grade 9 for demo Plan dessert.pptx
 
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17Incoming and Outgoing Shipments in 1 STEP Using Odoo 17
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17
 
The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13
 

BDACA1516s2 - Lecture3

  • 1. Last week’s exercise Data harvesting and storage Storing data Next meetings Big Data and Automated Content Analysis Week 3 – Monday »Data harvesting and storage« Damian Trilling d.c.trilling@uva.nl @damian0604 www.damiantrilling.net Afdeling Communicatiewetenschap Universiteit van Amsterdam 11 April 2016 Big Data and Automated Content Analysis Damian Trilling
  • 2. Last week’s exercise Data harvesting and storage Storing data Next meetings Today 1 Last week’s exercise Step by step Concluding remarks 2 Data harvesting and storage APIs RSS feeds Scraping and crawling Parsing text files 3 Storing data CSV tables JSON and XML 4 Next meetings Big Data and Automated Content Analysis Damian Trilling
  • 4. Last week’s exercise Data harvesting and storage Storing data Next meetings Step by step Reading a JSON file into a dict, looping over the dict Task 1: Print all titles of all videos 1 import json 2 3 with open("/home/damian/pornexercise/xhamster.json") as fi: 4 data=json.load(fi) 5 6 for item in data: 7 print (data[item]["title"]) Big Data and Automated Content Analysis Damian Trilling
  • 5. Last week’s exercise Data harvesting and storage Storing data Next meetings Step by step Reading a JSON file into a dict, looping over the dict Task 1: Print all titles of all videos 1 import json 2 3 with open("/home/damian/pornexercise/xhamster.json") as fi: 4 data=json.load(fi) 5 6 for item in data: 7 print (data[item]["title"]) NB: You have to know (e.g., by reading the documentation of the dataset) that the key is called title Big Data and Automated Content Analysis Damian Trilling
  • 6. Last week’s exercise Data harvesting and storage Storing data Next meetings Step by step Reading a JSON file into a dict, looping over the dict Task 1: Print all titles of all videos 1 import json 2 3 with open("/home/damian/pornexercise/xhamster.json") as fi: 4 data=json.load(fi) 5 6 for item in data: 7 print (data[item]["title"]) NB: You have to know (e.g., by reading the documentation of the dataset) that the key is called title NB: data is in fact a dict of dicts, such that data[item] is another dict. For each of these dicts, we retrieve the value that corresponds to the key title Big Data and Automated Content Analysis Damian Trilling
  • 7. Last week’s exercise Data harvesting and storage Storing data Next meetings Step by step Initializing variables, merging two lists, using a counter Task 2: Average tags per video and most frequently used tags 1 from collections import Counter 2 3 alltags=[] 4 i=0 5 for item in data: 6 i+=1 7 alltags+=data[item]["channels"] 8 9 print(len(alltags),"tags are describing",i,"different videos") 10 print("Thus, we have an average of",len(alltags)/i,"tags per video") 11 12 c=Counter(alltags) 13 print (c.most_common(100)) (there are other, more efficient ways of doing this) Big Data and Automated Content Analysis Damian Trilling
  • 8. Last week’s exercise Data harvesting and storage Storing data Next meetings Step by step Nesting blocks, using a defaultdict to count, error handling Task 3: What porn category is most frequently commented on? 1 from collections import defaultdict 2 3 commentspercat=defaultdict(int) 4 for item in data: 5 for tag in data[item]["channels"]: 6 try: 7 commentspercat[tag]+=int(data[item]["nb_comments"]) 8 except: 9 pass 10 print(commentspercat) 11 # if you want to print in a fancy way, you can do it like this: 12 for tag in sorted(commentspercat, key=commentspercat.get, reverse=True): 13 print( tag,"t", commentspercat[tag]) A defaultdict is a normal dict, with the difference that the type of each value is pre-defined and it doesn’t give an error if you look up a non-existing key Big Data and Automated Content Analysis Damian Trilling
  • 9. Last week’s exercise Data harvesting and storage Storing data Next meetings Step by step Nesting blocks, using a defaultdict to count, error handling Task 3: What porn category is most frequently commented on? 1 from collections import defaultdict 2 3 commentspercat=defaultdict(int) 4 for item in data: 5 for tag in data[item]["channels"]: 6 try: 7 commentspercat[tag]+=int(data[item]["nb_comments"]) 8 except: 9 pass 10 print(commentspercat) 11 # if you want to print in a fancy way, you can do it like this: 12 for tag in sorted(commentspercat, key=commentspercat.get, reverse=True): 13 print( tag,"t", commentspercat[tag]) A defaultdict is a normal dict, with the difference that the type of each value is pre-defined and it doesn’t give an error if you look up a non-existing key NB: In line 7, we assume the value to be an int, but the datasets sometimes contains the string “NA” instead of a string representing an int. That’s why we need the try/except construction Big Data and Automated Content Analysis Damian Trilling
  • 10. Last week’s exercise Data harvesting and storage Storing data Next meetings Step by step Adding elements to a list, sum() and len() Task 4: Average length of descriptions 1 length=[] 2 for item in data: 3 length.append(len(data[item]["description"])) 4 5 print ("Average length",sum(length)/len(length)) Big Data and Automated Content Analysis Damian Trilling
  • 11. Last week’s exercise Data harvesting and storage Storing data Next meetings Step by step Tokenizing with .split() Task 5: Most frequently used words 1 allwords=[] 2 for item in data: 3 allwords+=data[item]["description"].split() 4 c2=Counter(allwords) 5 print(c2.most_common(100)) .split() changes a string to a list of words. "This is cool”.split() results in ["This", "is", "cool"] Big Data and Automated Content Analysis Damian Trilling
  • 12. Last week’s exercise Data harvesting and storage Storing data Next meetings Concluding remarks Concluding remarks Make sure you fully understand the code! Re-read the corresponding chapters PLAY AROUND!!! Big Data and Automated Content Analysis Damian Trilling
  • 13. Data harvesting and storage An overview of APIs, scrapers, crawlers, RSS-feeds, and different file formats
  • 14. Last week’s exercise Data harvesting and storage Storing data Next meetings APIs Collecting data: APIs Big Data and Automated Content Analysis Damian Trilling
  • 15. Last week’s exercise Data harvesting and storage Storing data Next meetings APIs APIs Big Data and Automated Content Analysis Damian Trilling
  • 16. Last week’s exercise Data harvesting and storage Storing data Next meetings APIs Querying an API 1 # contact the Twitter API 2 auth = OAuth(access_key, access_secret, consumer_key, consumer_secret) 3 twitter = Twitter(auth = auth) 4 5 # get all info about the user ’username’) 6 tweepinfo=twitter.users.show(screen_name=username) 7 8 # save his bio statement to the variable bio 9 bio=tweepinfo["description"]) 10 11 # save his location to the variable location 12 location=tweepinfo["location"] (abbreviated Python example of how to query the Twitter REST API) Big Data and Automated Content Analysis Damian Trilling
  • 17. Last week’s exercise Data harvesting and storage Storing data Next meetings APIs Who offers APIs? The usual suspects: Twitter, Facebook, Google – but also Reddit, Youtube, . . . Big Data and Automated Content Analysis Damian Trilling
  • 18. Last week’s exercise Data harvesting and storage Storing data Next meetings APIs Who offers APIs? The usual suspects: Twitter, Facebook, Google – but also Reddit, Youtube, . . . If you ever leave your bag on a bus on Chicago Big Data and Automated Content Analysis Damian Trilling
  • 19. Last week’s exercise Data harvesting and storage Storing data Next meetings APIs Who offers APIs? The usual suspects: Twitter, Facebook, Google – but also Reddit, Youtube, . . . If you ever leave your bag on a bus on Chicago . . . but do have Python on your laptop, watch this: https://www.youtube.com/watch?v=RrPZza_vZ3w. That guy queries the Chicago bus company’s API to calculate when exactly the vehicle with his bag arrives the next time at the bus stop in front of his office. (Yes, he tried calling the help desk before, but they didn’t know. He got his bag back.) Big Data and Automated Content Analysis Damian Trilling
  • 20. Last week’s exercise Data harvesting and storage Storing data Next meetings APIs APIs Pro • Structured data • Easy to process automatically • Can be directy embedded in your script Morstatter, F., Pfeffer, J., Liu, H., & Carley, K. M. (2013). Is the sample good enough? Comparing data from Twitter’s Streaming API with Twitter’s Firehose. International AAAI Conference on Weblogs and Social Media. Big Data and Automated Content Analysis Damian Trilling
  • 21. Last week’s exercise Data harvesting and storage Storing data Next meetings APIs APIs Pro • Structured data • Easy to process automatically • Can be directy embedded in your script Con • Often limitations (requests per minute, sampling, . . . ) • You have to trust the provider that he delivers the right content (⇒ Morstatter e.a., 2013) • Some APIs won’t allow you to go back in time! Morstatter, F., Pfeffer, J., Liu, H., & Carley, K. M. (2013). Is the sample good enough? Comparing data from Twitter’s Streaming API with Twitter’s Firehose. International AAAI Conference on Weblogs and Social Media. Big Data and Automated Content Analysis Damian Trilling
  • 22. Last week’s exercise Data harvesting and storage Storing data Next meetings APIs So we have learned that we can access an API directly. But what if we have to do so 24/7? Big Data and Automated Content Analysis Damian Trilling
  • 23. Last week’s exercise Data harvesting and storage Storing data Next meetings APIs So we have learned that we can access an API directly. But what if we have to do so 24/7? Collecting tweets with a tool running on a server that does query the API 24/7. Big Data and Automated Content Analysis Damian Trilling
  • 24. Last week’s exercise Data harvesting and storage Storing data Next meetings APIs So we have learned that we can access an API directly. But what if we have to do so 24/7? Collecting tweets with a tool running on a server that does query the API 24/7. For example, we are running DMI-TCAT (Borra & Rieder, 2014) on http://datacollection3.followthenews-uva.cloudlet.sara.nl Borra, E., & Rieder, B. (2014). Programmed method: Developing a toolset for capturing and analyzing tweets. Aslib Journal of Information Management, 66(3), 262–278. doi:10.1108/AJIM-09-2013-0094 Big Data and Automated Content Analysis Damian Trilling
  • 25. Last week’s exercise Data harvesting and storage Storing data Next meetings APIs DMI-TCAT Big Data and Automated Content Analysis Damian Trilling
  • 26. Last week’s exercise Data harvesting and storage Storing data Next meetings APIs DMI-TCAT It queries the API and stores the result It continuosly calls the Twitter-API and saves all tweets containing specific hashtags to a MySQL-database. You tell it once which data to collect – and wait some months. Big Data and Automated Content Analysis Damian Trilling
  • 27. Last week’s exercise Data harvesting and storage Storing data Next meetings APIs DMI-TCAT Big Data and Automated Content Analysis Damian Trilling
  • 28. Last week’s exercise Data harvesting and storage Storing data Next meetings APIs DMI-TCAT Retrieving the data for analysis You could access the MySQL-database directly. But that’s not neccessary, as DMI-TCAT has a nice interface that allows you to save the data as a CSV file. (and offers tools for direct analysis as well) Big Data and Automated Content Analysis Damian Trilling
  • 29. Last week’s exercise Data harvesting and storage Storing data Next meetings RSS feeds Collecting data: RSS feeds Big Data and Automated Content Analysis Damian Trilling
  • 30. Last week’s exercise Data harvesting and storage Storing data Next meetings RSS feeds RSS feeds What’s that? • A structured (XML) format in which for example news sites and blogs offer their content Big Data and Automated Content Analysis Damian Trilling
  • 31. Last week’s exercise Data harvesting and storage Storing data Next meetings RSS feeds RSS feeds What’s that? • A structured (XML) format in which for example news sites and blogs offer their content • You get only the new news items (and that’s great!) Big Data and Automated Content Analysis Damian Trilling
  • 32. Last week’s exercise Data harvesting and storage Storing data Next meetings RSS feeds RSS feeds What’s that? • A structured (XML) format in which for example news sites and blogs offer their content • You get only the new news items (and that’s great!) • Title, teaser (or full text), date and time, link Big Data and Automated Content Analysis Damian Trilling
  • 33. Last week’s exercise Data harvesting and storage Storing data Next meetings RSS feeds RSS feeds What’s that? • A structured (XML) format in which for example news sites and blogs offer their content • You get only the new news items (and that’s great!) • Title, teaser (or full text), date and time, link http://www.nu.nl/rss http://datacollection3.followthenews-uva.cloudlet.sara.nl/rsshond Big Data and Automated Content Analysis Damian Trilling
  • 34. Last week’s exercise Data harvesting and storage Storing data Next meetings RSS feeds RSS feeds Pro • One protocol for all services • Easy to use Con • Full text often not included, you have to download the link separately (⇒ Problems associated with scraping) • You can’t go back in time! But we have archived a lot of RSS feeds Big Data and Automated Content Analysis Damian Trilling
  • 35. Last week’s exercise Data harvesting and storage Storing data Next meetings Scraping and crawling Collecting data: Scraping and crawling Big Data and Automated Content Analysis Damian Trilling
  • 36. Last week’s exercise Data harvesting and storage Storing data Next meetings Scraping and crawling Scraping and crawling If you have no chance of getting already structured data via one of the approaches above Big Data and Automated Content Analysis Damian Trilling
  • 37. Last week’s exercise Data harvesting and storage Storing data Next meetings Scraping and crawling Scraping and crawling If you have no chance of getting already structured data via one of the approaches above • Download web pages, try to identify the structure yourself • You have to parse the data Big Data and Automated Content Analysis Damian Trilling
  • 38. Last week’s exercise Data harvesting and storage Storing data Next meetings Scraping and crawling Scraping and crawling If you have no chance of getting already structured data via one of the approaches above • Download web pages, try to identify the structure yourself • You have to parse the data • Can get very complicated (depending on the specific task), especially if the structure of the web pages changes Further reading: http://scrapy.org https: //github.com/anthonydb/python-get-started/blob/master/5-web-scraping.py Big Data and Automated Content Analysis Damian Trilling
  • 39. Last week’s exercise Data harvesting and storage Storing data Next meetings Parsing text files Collecting data: Parsing text files Big Data and Automated Content Analysis Damian Trilling
  • 40. Last week’s exercise Data harvesting and storage Storing data Next meetings Parsing text files For messy input data or for semi-structured data Guiding question: Can we identify some kind of pattern? Big Data and Automated Content Analysis Damian Trilling
  • 41. Last week’s exercise Data harvesting and storage Storing data Next meetings Parsing text files For messy input data or for semi-structured data Guiding question: Can we identify some kind of pattern? Examples • Lewis, Zamith, & Hermida (2013) had a corrupt CSV-file Lewis, S. C., Zamith, R., & Hermida, A. (2013). Content analysis in an era of Big Data: A hybrid approach to computational and manual methods. Journal of Broadcasting & Electronic Media, 57(1), 34–52. doi:10.1080/08838151.2012.761702 Big Data and Automated Content Analysis Damian Trilling
  • 42. Last week’s exercise Data harvesting and storage Storing data Next meetings Parsing text files For messy input data or for semi-structured data Guiding question: Can we identify some kind of pattern? Examples • Lewis, Zamith, & Hermida (2013) had a corrupt CSV-file • LexisNexis gives you a chunk of text (rather than, e.g., a structured JSON or XML object) Big Data and Automated Content Analysis Damian Trilling
  • 43. Last week’s exercise Data harvesting and storage Storing data Next meetings Parsing text files For messy input data or for semi-structured data Guiding question: Can we identify some kind of pattern? Examples • Lewis, Zamith, & Hermida (2013) had a corrupt CSV-file • LexisNexis gives you a chunk of text (rather than, e.g., a structured JSON or XML object) But in both cases, as long as you can find any pattern or structure in it, you can try to write a Python script to parse the data. Big Data and Automated Content Analysis Damian Trilling
  • 44.
  • 45.
  • 46. 1 with open(bestandsnaam) as f: 2 for line in f: 3 line=line.replace("r","") 4 if line=="n": 5 continue 6 matchObj=re.match(r"s+(d+) of (d+) DOCUMENTS",line) 7 if matchObj: 8 artikel= int(matchObj.group(1)) 9 tekst[artikel]="" 10 continue 11 if line.startswith("SECTION"): 12 section[artikel]=line.replace("SECTION: ","").rstrip("n") 13 elif line.startswith("LENGTH"): 14 length[artikel]=line.replace("LENGTH: ","").rstrip("n") 15 ... 16 ... 17 ... 18 19 else: 20 tekst[artikel]=tekst[artikel]+line
  • 47. Last week’s exercise Data harvesting and storage Storing data Next meetings CSV tables Storing data: CSV tables Big Data and Automated Content Analysis Damian Trilling
  • 48. Last week’s exercise Data harvesting and storage Storing data Next meetings CSV tables CSV-files Always a good choice • All programs can read it • Even human-readable in a simple text editor: • Plain text, with a comma (or a semicolon) denoting column breaks • No limits regarging the size Big Data and Automated Content Analysis Damian Trilling
  • 49. Last week’s exercise Data harvesting and storage Storing data Next meetings CSV tables A CSV-file with tweets 1 text,to_user_id,from_user,id,from_user_id,iso_language_code,source, profile_image_url,geo_type,geo_coordinates_0,geo_coordinates_1, created_at,time 2 :-) #Lectrr #wereldleiders #uitspraken #Wikileaks #klimaattop http://t. co/Udjpk48EIB,,henklbr,407085917011079169,118374840,nl,web,http:// pbs.twimg.com/profile_images/378800000673845195/ b47785b1595e6a1c63b93e463f3d0ccc_normal.jpeg,,0,0,Sun Dec 01 09:57:00 +0000 2013,1385891820 3 Wat zijn de resulaten vd #klimaattop in #Warschau waard? @EP_Environment ontmoet voorzitter klimaattop @MarcinKorolec http://t.co/4 Lmiaopf60,,Europarl_NL,406058792573730816,37623918,en,<a href="http ://www.hootsuite.com" rel="nofollow">HootSuite</a>,http://pbs.twimg .com/profile_images/2943831271/ b6631b23a86502fae808ca3efde23d0d_normal.png,,0,0,Thu Nov 28 13:55:35 +0000 2013,1385646935 Big Data and Automated Content Analysis Damian Trilling
  • 50. Last week’s exercise Data harvesting and storage Storing data Next meetings JSON and XML Storing data: JSON and XML Big Data and Automated Content Analysis Damian Trilling
  • 51. Last week’s exercise Data harvesting and storage Storing data Next meetings JSON and XML JSON and XML Great if we have a nested data structure Big Data and Automated Content Analysis Damian Trilling
  • 52. Last week’s exercise Data harvesting and storage Storing data Next meetings JSON and XML JSON and XML Great if we have a nested data structure • Items within feeds Big Data and Automated Content Analysis Damian Trilling
  • 53. Last week’s exercise Data harvesting and storage Storing data Next meetings JSON and XML JSON and XML Great if we have a nested data structure • Items within feeds • Personal data within authors within books Big Data and Automated Content Analysis Damian Trilling
  • 54. Last week’s exercise Data harvesting and storage Storing data Next meetings JSON and XML JSON and XML Great if we have a nested data structure • Items within feeds • Personal data within authors within books • Tweets within followers within users Big Data and Automated Content Analysis Damian Trilling
  • 55. Last week’s exercise Data harvesting and storage Storing data Next meetings JSON and XML A JSON object containing GoogleBooks data 1 {’totalItems’: 574, ’items’: [{’kind’: ’books#volume’, ’volumeInfo’: {’ publisher’: ’"O’Reilly Media, Inc."’, ’description’: u’Get a comprehensive, in-depth introduction to the core Python language with this hands-on book. Based on author Mark Lutzu2019s popular training course, this updated fifth edition will help you quickly write efficient, high-quality code with Python. Itu2019s an ideal way to begin, whether youu2019re new to programming or a professional developer versed in other languages. Complete with quizzes, exercises, and helpful illustrations, this easy-to-follow, self-paced tutorial gets you started with both Python 2.7 and 3.3 u2014 the 2 ... 3 ... 4 ’kind’: ’books#volumes’} Big Data and Automated Content Analysis Damian Trilling
  • 56. Last week’s exercise Data harvesting and storage Storing data Next meetings JSON and XML An XML object containing an RSS feed 1 ... 2 <item> 3 <title>Agema doet aangifte tegen Samsom en Spekman</title> 4 <link>http://www.nu.nl/politiek/3743441/agema-doet-aangifte- samsom-en-spekman.html</link> 5 <guid>http://www.nu.nl/politiek/3743441/index.html</guid> 6 <description>PVV-Kamerlid Fleur Agema gaat vrijdag aangifte doen tegen PvdA-leider Diederik Samsom en PvdA-voorzitter Hans Spekman wegens uitspraken die zij hebben gedaan over Marokkanen. </description> 7 <pubDate>Thu, 03 Apr 2014 21:58:48 +0200</pubDate> 8 <category>Algemeen</category> 9 <enclosure url="http://bin.snmmd.nl/m/m1mxwpka6nn2_sqr256.jpg" type="image/jpeg" /> 10 <copyrightPhoto>nu.nl</copyrightPhoto> 11 </item> 12 ... Big Data and Automated Content Analysis Damian Trilling
  • 57. Last week’s exercise Data harvesting and storage Storing data Next meetings Next meetings Big Data and Automated Content Analysis Damian Trilling
  • 58. Last week’s exercise Data harvesting and storage Storing data Next meetings Next meetings Wednesday, 15–4 Writing some first data collection scripts Big Data and Automated Content Analysis Damian Trilling