Part I: File I/O, URL I/O, Dictionaries and other Data Structures in Python
The goal of this part of the lab is to practice working with file I/O, reading from a URL, and using
a dictionary in Python.
Output files: your program should produce an output file corresponding to each of the input files.
Please upload these files to dropbox along with your source code (as plain text files, please do
not archive them). Each output file should contain the top 25 terms (and their counts) found in
the corresponding input file's URLs. Sample output file:
Assignment Description: We will attempt to answer the following question: Are Internet
Programming practices significantly different in different countries? As our input data, we will
consider the web pages of the top five universities of several countries: US, Brazil, China, France,
Germany, India, Russia, and South Africa. I have already gathered the relevant URLs in the
input files (available in top5_unis.zip). Please note that all of the source code is of interest for the
purposes of answering this question (so you do not want to drop or parse out various html tags,
script lines, etc. -- this is all relevant data).
Look at the output files your code produces. What do you think? Are there differences in the way
web pages are made in different countries? You do not need to submit anything to answer these
questions, but we will likely discuss the results in class.
URL Addresses:
Brazil
http://www5.usp.br/english/?lang=en
http://www.unicamp.br/unicamp/
http://www.unesp.br/international/
http://www.puc-rio.br/english/
http://www.ufrgs.br/english/home
China
http://www.tsinghua.edu.cn/publish/newthuen/
http://english.pku.edu.cn/
http://www.fudan.edu.cn/en/
http://en.ustc.edu.cn/
http://en.sjtu.edu.cn/
France
http://www.ens.fr/?lang=en
https://www.polytechnique.edu/en
http://www.upmc.fr/en/
http://www.u-psud.fr/en/index.html
http://www.ens-lyon.fr/en/english-ens-de-lyon-269761.kjsp
Germany
http://www.rwth-aachen.de/cms/~a/root/?lidx=1
https://www.uni-heidelberg.de/en
http://www.uni-freiburg.de/universitaet-en
http://www.portal.uni-koeln.de/uoc_home.html?&L=1
http://www.fu-berlin.de/en/
India
http://www.iisc.ac.in/
http://www.iitb.ac.in/
http://www.iitd.ac.in/
https://www.iitm.ac.in/
http://iitk.ac.in/
Russia
http://www.msu.ru/en/
http://english.spbu.ru/
http://www.nsu.ru/?lang=en
http://www.bmstu.ru/en/
http://en.tsu.ru/
South Africa
https://www.uct.ac.za/
https://www.wits.ac.za/
http://www.sun.ac.za/english
http://www.up.ac.za/
https://www.uj.ac.za/
U.S.A.
http://www.caltech.edu
http://www.stanford.edu
http://www.harvard.edu
http://www.mit.edu
http://www.princeton.edu
Part II: Reading from a URL while working with an API (using Mediawiki API as an example)
Input: Will be obtained from a URL using Mediawiki API -- starter code below
Output: Up to you... sort of.
Assignment Description: Compare how Wikipedia articles describe various items in the same
category. The choice of items and category is up to you. Briefly describe
your hypothesis in your report. Example categories/items/questions:
1) Automotive Brands; Toyota vs. Honda vs. Ford vs. Chevy; Do Wikipedia articles use
significantly different terms when describing these brands? Are brands associated with certain
countries described differently?
2) College football teams; similar questions as in (1)
3) Universities; similar questions as in (1)
4) Historical eras or significant events; Classical/bronze age history topics vs. Medieval vs.
Modern; Does the terminology historians use change significantly (not the content being
described -- obviously that will be different, but the historians' language itself)?
Detailed information about the API can be found here:
https://www.mediawiki.org/w/api.php?action=help&modules=query
https://www.mediawiki.org/wiki/Extension:TextExtracts
Starter code to help you get started using the Mediawiki API:
___________________________________________________
import requests

response = requests.get(
    'https://en.wikipedia.org/w/api.php',
    params={
        'action': 'query',
        'format': 'json',
        'titles': 'Moscow_State_University',
        'prop': 'extracts',
        'exintro': True,
        'explaintext': True,
    },
).json()

page = next(iter(response['query']['pages'].values()))
print(page['extract'])
__________________________________________________
action, format, and titles are standard API parameters.
prop: extracts -- uses the TextExtracts extension
exintro: True -- Return only content before the first section
explaintext: Return extracts as plain text instead of HTML
(see "detailed information" section's link for more info)
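To see what the starter code actually sends, it helps to look at the query string the params dict turns into. The sketch below uses only the standard library: urllib.parse.urlencode performs essentially the same encoding that requests applies to the params dict (the URL is built locally; nothing is sent over the network).

```python
# Build the query URL locally to inspect it -- no network request is made.
from urllib.parse import urlencode

params = {
    'action': 'query',
    'format': 'json',
    'titles': 'Moscow_State_University',
    'prop': 'extracts',
    'exintro': True,
    'explaintext': True,
}
url = 'https://en.wikipedia.org/w/api.php?' + urlencode(params)
print(url)
```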
You may choose to work with extracts or full articles -- this is up to you.
Note: You may use one of the many "third-party" Python Wikipedia parsers available online if
you choose. Please cite it properly if you do. I'm not 100% sure about this, but I think it may
actually make the lab more difficult though... We could say this: "If you'd like to make Part II
of the lab more challenging, learn how to use a third-party parser to extract text from Wikipedia
articles".
______________________________________________________
Part I Hints
1) Use functions/modularity (def somefunction(): ... ) to keep your code organized. Start by
creating a function that takes a string, breaks it up into terms, and stores key-value (term-count)
pairs in a dictionary. See hint #5 for a note on how to split the input strings best for this
particular problem.
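Hint 1 can be sketched as follows (the function name count_terms and the sample HTML string are placeholders, not part of the assignment):

```python
# Sketch of hint 1: split a string into terms and tally them in a dictionary.
def count_terms(text, counts):
    # Replace separator characters with spaces before splitting (see hint 5).
    for ch in ['"', "'", '<', '>', '=', '/']:
        text = text.replace(ch, ' ')
    for term in text.split(' '):
        term = term.strip()
        if term == '':
            continue
        counts[term] = counts.get(term, 0) + 1

counts = {}
count_terms('<div class="nav"><a href="/en/">en</a></div>', counts)
print(counts)  # {'div': 2, 'class': 1, 'nav': 1, 'a': 2, 'href': 1, 'en': 2}
```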
2) Read the URLs from each input file line-by-line; don't read in any '\n' characters.
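For example (the filename below is a stand-in for one of the input files; the sketch writes a tiny file first so it is self-contained):

```python
# Create a small demo input file, then read it the way hint 2 describes:
# line by line, stripping trailing newlines and skipping blank lines.
with open('urls_example.txt', 'w') as f:
    f.write('http://www.mit.edu\n\nhttp://www.caltech.edu\n')

with open('urls_example.txt') as urlf:
    links = [line.strip() for line in urlf if line.strip() != '']

print(links)  # ['http://www.mit.edu', 'http://www.caltech.edu']
```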
3) For each link read in from input file, use a try-except block when reading:
try:
    remote = urllib.request.urlopen(link)
    ... (more code that does stuff)
    ...
except IOError:
    print("failed to open:", link, "successfully :(")
Note: this is necessary because we can't guarantee that reading from each URL will be
successful. If it fails, we need to know. There could be all kinds of reasons, and the way we
handle it depends on why we think that operation failed.
4) Read the entire content from a URL as a single string
5) Split the string on spaces, but prior to doing so, replace certain characters with spaces. You
can do that by either using reg ex (re module in Python), or just the string replace function:
line = line.replace('"', " ")
line = line.replace("'", " ")
line = line.replace('<', " ")
line = line.replace('>', " ")
line = line.replace('=', " ")
line = line.replace('/', " ")
line = line.replace('\n', " ")
(and so on)
6) Avoid blanks/spaces. Use elem = elem.strip(). Also, if an element is blank (empty string), skip it:
if elem == "":
    continue
7) Use a dictionary. Terms should be the keys, counts - the values.
8) When done, sort the dictionary by values:
for elem in sorted(data, key=data.get, reverse=True):
....
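With a few counts borrowed from the sample output, hint 8 looks like this (the dictionary here is just illustrative data):

```python
# Sort the dictionary's keys by their counts, largest count first.
data = {'div': 1181, 'a': 1599, 'li': 1091}
ranked = sorted(data, key=data.get, reverse=True)
print(ranked)  # ['a', 'div', 'li']
```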
9) You can add a counter to the loop in (8) to print out only the top 25 terms. The printing to file
code should also go in that loop. Don't forget to close the file after you're done writing to it.
10) Use the join function to get your data in the right format:
f.write(' '.join((elem, str(data[elem]), '\n')))

Sample output file (top 25 terms from one input file): a 1599, Class 1514, div 1181, li 1091,
href 848, en 640, WWW 448, fr 413, http 410, nav-list 367, html 298, … 287, title 284,
script 268, link 264, span 234, ens-lyon 228, type 226, text 221, Src 218, ul 198, CSS 196,
img 184, id 167, Content 161
Solution
import operator
import urllib.request

with open('urlfile.txt') as urlf:
    uf = urlf.readlines()

for i in range(len(uf)):
    link = uf[i].strip()
    try:
        f = urllib.request.urlopen(link)
        myfile = f.read().decode('utf-8', errors='ignore')
    except IOError:
        print("failed to open:", link)
        continue
    fline = myfile.split(' ')
    di = {}
    for j in range(len(fline)):
        line = fline[j]
        line = line.replace('"', " ")
        line = line.replace("'", " ")
        line = line.replace('<', " ")
        line = line.replace('>', " ")
        line = line.replace('=', " ")
        line = line.replace('/', " ")
        line = line.replace('\n', " ")
        ffline = line.split(' ')
        for k in range(len(ffline)):
            elem = ffline[k].strip()
            if elem == "":
                continue
            di[elem] = di.get(elem, 0) + 1
    sx = sorted(di.items(), key=operator.itemgetter(1), reverse=True)
    rr = 0
    for key, value in sx:
        if rr == 25:
            break
        print(key, value)
        rr += 1