Part I: File I/O, URL I/O, Dictionaries and other Data Structures in Python
The goal of this part of the lab is to practice working with file I/O, reading from a URL, and using
a dictionary in Python.
Output files: your program should produce an output file corresponding to each of the input files.
Please upload these files to dropbox along with your source code (as plain text files, please do
not archive them). Each output file should contain the top 25 terms (and their counts) found in
the corresponding input file's URLs. Sample output file:
Assignment Description: We will attempt to answer the following question: Are Internet
Programming practices significantly different in different countries? As our input data, we will
consider the web pages of the top five universities of several countries: Brazil, China, France,
Germany, India, Russia, South Africa, and the US. I have already gathered the relevant URLs in the
input files (available in top5_unis.zip). Please note that all of the source code is of interest for the
purposes of answering this question, so do not drop or parse out HTML tags,
script lines, etc.; this is all relevant data.
Look at the output files your code produces. What do you think? Are there differences in the way
web pages are made in different countries? You do not need to submit anything to answer these
questions, but we will likely discuss the results in class.
URL Addresses:
Brazil
http://www5.usp.br/english/?lang=en
http://www.unicamp.br/unicamp/
http://www.unesp.br/international/
http://www.puc-rio.br/english/
http://www.ufrgs.br/english/home
China
http://www.tsinghua.edu.cn/publish/newthuen/
http://english.pku.edu.cn/
http://www.fudan.edu.cn/en/
http://en.ustc.edu.cn/
http://en.sjtu.edu.cn/
France
http://www.ens.fr/?lang=en
https://www.polytechnique.edu/en
http://www.upmc.fr/en/
http://www.u-psud.fr/en/index.html
http://www.ens-lyon.fr/en/english-ens-de-lyon-269761.kjsp
Germany
http://www.rwth-aachen.de/cms/~a/root/?lidx=1
https://www.heidelberg.edu/
http://www.uni-freiburg.de/universitaet-en
http://www.portal.uni-koeln.de/uoc_home.html?&L=1
http://www.fu-berlin.de/en/
India
http://www.iisc.ac.in/
http://www.iitb.ac.in/
http://www.iitd.ac.in/
https://www.iitm.ac.in/
http://iitk.ac.in/
Russia
http://www.msu.ru/en/
http://english.spbu.ru/
http://www.nsu.ru/?lang=en
http://www.bmstu.ru/en/
http://en.tsu.ru/
South Africa
https://www.uct.ac.za/
https://www.wits.ac.za/
http://www.sun.ac.za/english
http://www.up.ac.za/
https://www.uj.ac.za/
U.S.A.
http://www.caltech.edu
http://www.stanford.edu
http://www.harvard.edu
http://www.mit.edu
http://www.princeton.edu
Part II: Reading from a URL while working with an API (using the MediaWiki API as an example)
Input: Will be obtained from a URL using Mediawiki API -- starter code below
Output: Up to you... sort of.
Assignment Description: Compare how Wikipedia articles describe various items in the same
category. The choice of items and category is up to you. Briefly describe the category, items, and
your hypothesis in your report. Example categories/items/questions:
1) Automotive Brands; Toyota vs. Honda vs. Ford vs. Chevy; Do Wikipedia articles use
significantly different terms when describing these brands? Are brands associated with certain
countries described differently?
2) College football teams; similar questions as in (1)
3) Universities; similar questions as in (1)
4) Historical eras or significant events; Classical/bronze age history topics vs. Medieval vs.
Modern; Does the terminology historians use change significantly (not the content being
described -- obviously that will be different, but the historians' language itself)?
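One possible way to compare the language of two articles is to count the terms in each and look at which terms are relatively more frequent in one than in the other. The sketch below is only an illustration in Python 3: the function names (term_frequencies, distinctive_terms), the simple lowercase word tokenizer, and the two sample sentences are assumptions, not part of the assignment.

```python
import re
from collections import Counter


def term_frequencies(text):
    """Return a dict mapping each lowercase word in text to its relative frequency."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    total = sum(counts.values())
    # Guard against empty input so we never divide by zero.
    return {w: c / total for w, c in counts.items()} if total else {}


def distinctive_terms(text_a, text_b, n=5):
    """Return the n terms most over-represented in text_a relative to text_b."""
    freq_a = term_frequencies(text_a)
    freq_b = term_frequencies(text_b)
    diff = {w: f - freq_b.get(w, 0.0) for w, f in freq_a.items()}
    # Largest difference first; ties are broken alphabetically.
    ranked = sorted(diff, key=lambda w: (-diff[w], w))
    return ranked[:n]


# Made-up sample texts standing in for two article extracts.
a = "the truck is a large truck built for towing"
b = "the sedan is a small sedan built for commuting"
print(distinctive_terms(a, b, n=2))  # prints ['truck', 'large']
```

Relative frequencies are used instead of raw counts so that a long article and a short one can be compared on the same scale.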
Detailed information about the API can be found here:
https://www.mediawiki.org/w/api.php?action=help&modules=query
https://www.mediawiki.org/wiki/Extension:TextExtracts
Starter code to help you get started using the Mediawiki API:
___________________________________________________
import requests

response = requests.get(
    'https://en.wikipedia.org/w/api.php',
    params={
        'action': 'query',
        'format': 'json',
        'titles': 'Moscow_State_University',
        'prop': 'extracts',
        'exintro': True,
        'explaintext': True,
    }
).json()
# The result maps page IDs to pages; take the single page that was requested.
page = next(iter(response['query']['pages'].values()))
print(page['extract'])
__________________________________________________
action, format, and titles are standard API parameters.
prop: extracts -- uses the TextExtracts extension
exintro: True -- return only the content before the first section
explaintext: True -- return extracts as plain text instead of HTML
(see the links under "Detailed information" above for more info)
You may choose to work with extracts or full articles -- this is up to you.
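To fetch several articles the same way, the starter request can be wrapped in small helper functions. This is a minimal Python 3 sketch, not the required approach: the helper names (build_params, extract_from_response, fetch_extract), the User-Agent string, and the use of the standard-library urllib instead of requests are all assumptions made for illustration.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://en.wikipedia.org/w/api.php"  # same endpoint as the starter code


def build_params(title, intro_only=True):
    """Build the query parameters from the starter code for one article title."""
    params = {
        "action": "query",
        "format": "json",
        "titles": title,
        "prop": "extracts",
        "explaintext": 1,
    }
    if intro_only:
        # Only request the content before the first section.
        params["exintro"] = 1
    return params


def extract_from_response(data):
    """Pull the plain-text extract out of a decoded API response.

    Returns an empty string when no page or no 'extract' key is present.
    """
    pages = data.get("query", {}).get("pages", {})
    page = next(iter(pages.values()), {})
    return page.get("extract", "")


def fetch_extract(title, intro_only=True):
    """Request one article extract over the network and return it as a string."""
    url = API_URL + "?" + urllib.parse.urlencode(build_params(title, intro_only))
    # Hypothetical User-Agent value; replace it with one identifying your script.
    request = urllib.request.Request(
        url, headers={"User-Agent": "lab-script/0.1 (student project)"}
    )
    with urllib.request.urlopen(request) as response:
        return extract_from_response(json.load(response))
```

Keeping the response parsing in its own function means it can be checked against a hand-written dictionary without touching the network. A call such as fetch_extract('Moscow_State_University') is expected to return the same kind of text the starter code prints, and passing intro_only=False leaves out exintro to request the full article text.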
Note: You may use one of the many third-party Python Wikipedia parsers available online if
you choose. Please cite it properly if you do. Be aware that this may actually make the lab more
difficult: if you'd like to make Part II of the lab more challenging, learn how to use a third-party
parser to extract text from Wikipedia articles.
______________________________________________________
Part I Hints
1) Use functions/modularity (def somefunction(): ... ) to keep your code organized. Start by
creating a function that takes a string, breaks it up into terms, and stores key-value (term-count)
pairs in a dictionary. See hint #5 for a note on how to split the input strings best for this
particular problem.
2) Read the URLs from each input file line by line, and don't read in any '\n' characters (strip the trailing newline from each line).
3) For each link read in from the input file, use a try-except block when reading:
import urllib.request
try:
    remote = urllib.request.urlopen(link)
    # ... (more code that does stuff)
    # ...
except OSError:  # IOError is an alias of OSError in Python 3
    print("failed to open:", link)
Note: this is necessary because we can't guarantee that reading from each URL will be
successful. If it fails, we need to know. There could be all kinds of reasons, and the way we
handle it depends on why we think that operation failed.
4) Read the entire content from a URL as a single string
5) Split the string on spaces, but prior to doing so, replace certain characters with spaces. You
can do that either with regular expressions (the re module in Python) or with the string replace function:
line = line.replace('"', " ")
line = line.replace("'", " ")
line = line.replace('<', " ")
line = line.replace('>', " ")
line = line.replace('=', " ")
line = line.replace('/', " ")
line = line.replace("\n", " ")
(and so on)
6) Avoid blanks/spaces. Use elem = elem.strip(). Also, if an element is blank (empty string), skip it:
if elem == "":
    continue
7) Use a dictionary. Terms should be the keys, counts - the values.
8) When done, sort the dictionary by values:
for elem in sorted(data, key=data.get, reverse=True):
....
9) You can add a counter to the loop in (8) to print out only the top 25 terms. The code that prints to the
file should also go in that loop. Don't forget to close the file after you're done writing to it.
10) Use the join function to get your data in the right format:
f.write(' '.join((elem, str(data[elem]))) + '\n')
Sample output file (top25):
a 1599
Class 1514
div 1181
li 1091
href 848
en 640
WWW 448
fr 413
http 410
nav-list 367
html 298
[term not shown] 287
title 284
script 268
link 264
span 234
ens-lyon 228
type 226
text 221
Src 218
ul 198
CSS 196
img 184
id 167
Content 161
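Hint 5 mentions the re module as an alternative to chained replace calls. The sketch below shows one way that could look in Python 3, together with collections.Counter for hints 7 to 9. The function names (count_terms, format_top_terms), the exact separator set, the space-separated line format, and the sample HTML string are assumptions for illustration.

```python
import re
from collections import Counter

# Characters treated as separators: whitespace plus the ones listed in hint 5.
SEPARATORS = re.compile(r"""[\s"'<>=/]+""")


def count_terms(text):
    """Split text on the separator characters and count each non-empty term."""
    terms = [t for t in SEPARATORS.split(text) if t]
    return Counter(terms)


def format_top_terms(counts, n=25):
    """Return the n most common terms as 'term count' lines, one per term."""
    return "".join(f"{term} {count}\n" for term, count in counts.most_common(n))


# Made-up page source standing in for the content read from a URL.
html = '<div class="nav"><a href="/en">en</a></div>'
counts = count_terms(html)
print(format_top_terms(counts, n=3))
```

The string returned by format_top_terms can be written with a single f.write(...) call inside a with open(...) block, which also closes the file when the block ends.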
Solution
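# Python 3
import operator
import urllib.request

di = {}  # term -> count, accumulated over all URLs in the input file
with open('urlfile.txt') as urlf:
    for link in urlf:
        link = link.strip()  # drop the trailing newline
        if link == "":
            continue
        try:
            with urllib.request.urlopen(link) as f:
                # Decode bytes to text; undecodable bytes are replaced.
                line = f.read().decode('utf-8', errors='replace')
        except OSError:
            print("failed to open:", link)
            continue
        # Replace separator characters with spaces before splitting (hint 5).
        line = line.replace('"', " ")
        line = line.replace("'", " ")
        line = line.replace('<', " ")
        line = line.replace('>', " ")
        line = line.replace('=', " ")
        line = line.replace('/', " ")
        line = line.replace("\n", " ")
        # split() with no argument splits on whitespace and drops empty strings.
        for term in line.split():
            di[term] = di.get(term, 0) + 1

# Sort (term, count) pairs by count, highest first, and print the top 25.
sx = sorted(di.items(), key=operator.itemgetter(1), reverse=True)
for key, value in sx[:25]:
    print(key, value)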

 
Difference Between Search & Browse Methods in Odoo 17
Difference Between Search & Browse Methods in Odoo 17Difference Between Search & Browse Methods in Odoo 17
Difference Between Search & Browse Methods in Odoo 17
 
Roles & Responsibilities in Pharmacovigilance
Roles & Responsibilities in PharmacovigilanceRoles & Responsibilities in Pharmacovigilance
Roles & Responsibilities in Pharmacovigilance
 
MICROBIOLOGY biochemical test detailed.pptx
MICROBIOLOGY biochemical test detailed.pptxMICROBIOLOGY biochemical test detailed.pptx
MICROBIOLOGY biochemical test detailed.pptx
 
OS-operating systems- ch04 (Threads) ...
OS-operating systems- ch04 (Threads) ...OS-operating systems- ch04 (Threads) ...
OS-operating systems- ch04 (Threads) ...
 
9953330565 Low Rate Call Girls In Rohini Delhi NCR
9953330565 Low Rate Call Girls In Rohini  Delhi NCR9953330565 Low Rate Call Girls In Rohini  Delhi NCR
9953330565 Low Rate Call Girls In Rohini Delhi NCR
 
DATA STRUCTURE AND ALGORITHM for beginners
DATA STRUCTURE AND ALGORITHM for beginnersDATA STRUCTURE AND ALGORITHM for beginners
DATA STRUCTURE AND ALGORITHM for beginners
 

Part I File IO, URL IO, Dictionaries and other Data Structures in.pdf

  • 1. Part I: File I/O, URL I/O, Dictionaries and other Data Structures in Python

    The goal of this part of the lab is to practice working with file I/O, reading from a URL, and using a dictionary in Python.

    Output files: your program should produce one output file for each of the input files. Please upload these files to dropbox along with your source code (as plain text files; please do not archive them). Each output file should contain the top 25 terms (and their counts) found in the pages at the corresponding input file's URLs. Sample output file: see the top25 listing after the Part I hints.

    Assignment Description: We will attempt to answer the following question: Are Internet Programming practices significantly different in different countries? As our input data, we will consider the web pages of the top five universities of several countries: US, Brazil, France, Germany, India, Russia, and South Africa. I have already gathered the relevant URLs in the input files (available in top5_unis.zip). Please note that all of the source code is of interest for the purposes of answering this question (so you do not want to drop or parse out various html tags, script lines, etc. -- this is all relevant data).

    Look at the output files your code produces. What do you think? Are there differences in the way web pages are made in different countries? You do not need to submit anything to answer these questions, but we will likely discuss the results in class.

    URL Addresses:

    Brazil
    http://www5.usp.br/english/?lang=en
    http://www.unicamp.br/unicamp/
    http://www.unesp.br/international/
    http://www.puc-rio.br/english/
    http://www.ufrgs.br/english/home

    China
    http://www.tsinghua.edu.cn/publish/newthuen/
    http://english.pku.edu.cn/
    http://www.fudan.edu.cn/en/
    http://en.ustc.edu.cn/
    http://en.sjtu.edu.cn/

    France
    http://www.ens.fr/?lang=en
    https://www.polytechnique.edu/en
    http://www.upmc.fr/en/
    http://www.u-psud.fr/en/index.html
  • 2. http://www.ens-lyon.fr/en/english-ens-de-lyon-269761.kjsp

    Germany
    http://www.rwth-aachen.de/cms/~a/root/?lidx=1
    https://www.heidelberg.edu/
    http://www.uni-freiburg.de/universitaet-en
    http://www.portal.uni-koeln.de/uoc_home.html?&L=1
    http://www.fu-berlin.de/en/

    India
    http://www.iisc.ac.in/
    http://www.iitb.ac.in/
    http://www.iitd.ac.in/
    https://www.iitm.ac.in/
    http://iitk.ac.in/

    Russia
    http://www.msu.ru/en/
    http://english.spbu.ru/
    http://www.nsu.ru/?lang=en
    http://www.bmstu.ru/en/
    http://en.tsu.ru/

    South Africa
    https://www.uct.ac.za/
    https://www.wits.ac.za/
    http://www.sun.ac.za/english
    http://www.up.ac.za/
    https://www.uj.ac.za/

    U.S.A.
    http://www.caltech.edu
    http://www.stanford.edu
    http://www.harvard.edu
    http://www.mit.edu
    http://www.princeton.edu

    Part II: Reading from a URL while working with an API (using Mediawiki API as an example)

    Input: Will be obtained from a URL using Mediawiki API -- starter code below
    Output: Up to you... sort of.

    Assignment Description: Compare how Wikipedia articles describe various items in the same category. The choice of items and category is up to you. Briefly describe the category, items, and
  • 3. your hypothesis in your report.

    Example categories/items/questions:
    1) Automotive Brands; Toyota vs. Honda vs. Ford vs. Chevy; Do Wikipedia articles use significantly different terms when describing these brands? Are brands associated with certain countries described differently?
    2) College football teams; similar questions as in (1)
    3) Universities; similar questions as in (1)
    4) Historical eras or significant events; Classical/bronze age history topics vs. Medieval vs. Modern; Does the terminology historians use change significantly (not the content being described -- obviously that will be different, but the historians' language itself)?

    Detailed information about the API can be found here:
    https://www.mediawiki.org/w/api.php?action=help&modules=query
    https://www.mediawiki.org/wiki/Extension:TextExtracts

    Starter code to help you get started using the Mediawiki API:
    ___________________________________________________
    import requests

    response = requests.get(
        'https://en.wikipedia.org/w/api.php',
        params={
            'action': 'query',
            'format': 'json',
            'titles': 'Moscow_State_University',
            'prop': 'extracts',
            'exintro': True,
            'explaintext': True,
        }
    ).json()
    page = next(iter(response['query']['pages'].values()))
    print(page['extract'])
    __________________________________________________

    action, format, and titles are standard API parameters.
    prop: extracts -- uses the TextExtracts extension
    exintro: True -- return only content before the first section
    explaintext: True -- return extracts as plain text instead of HTML
    (see the links under "Detailed information" above for more info)

    You may choose to work with extracts or full articles -- this is up to you.

    Note: You may use one of the many "third-party" Python Wikipedia parsers available online if
  • 4. you choose. Please cite it properly if you do. I'm not 100% sure about this, but I think it may actually make the lab more difficult though... We could say this: "If you'd like to make Part II of the lab more challenging, learn how to use a third-party parser to extract text from Wikipedia articles".
    ______________________________________________________
    Part I Hints

    1) Use functions/modularity (def somefunction(): ...) to keep your code organized. Start by creating a function that takes a string, breaks it up into terms, and stores key-value (term-count) pairs in a dictionary. See hint #5 for a note on how best to split the input strings for this particular problem.

    2) Read the URLs from each input file line by line; don't read in any '\n' characters.

    3) For each link read in from the input file, use a try-except block when reading:

        try:
            remote = urllib.request.urlopen(link)
            ... (more code that does stuff) ...
        except IOError:
            print("failed to open:", link, "successfully :(")

    Note: this is necessary because we can't guarantee that reading from each URL will be successful. If it fails, we need to know. There could be all kinds of reasons, and the way we handle it depends on why we think that operation failed.

    4) Read the entire content from a URL as a single string.

    5) Split the string on spaces, but prior to doing so, replace certain characters with spaces. You can do that either with regular expressions (the re module in Python) or with the string replace function:

        line = line.replace('"', " ")
        line = line.replace("'", " ")
        line = line.replace('<', " ")
        line = line.replace('>', " ")
        line = line.replace('=', " ")
        line = line.replace('/', " ")
        line = line.replace("", " ")
        (and so on)

    6) Avoid blanks/spaces. Use str = str.strip(). Also, if an element is blank (empty string), skip it:

        if elem == "":
            continue

    7) Use a dictionary. Terms should be the keys, counts - the values.
  • 5. 8) When done, sort the dictionary by values:

        for elem in sorted(data, key=data.get, reverse=True):
            ....

    9) You can add a counter to the loop in (8) to print out only the top 25 terms. The code that prints to the file should also go in that loop. Don't forget to close the file after you're done writing to it.

    10) Use the join function to get your data in the right format:

        f.write(' '.join((elem, str(data[elem]), '\n')))

    top25

        a          1599
        Class      1514
        div        1181
        li         1091
        href        848
        en          640
        WWW         448
        fr          413
        http        410
        nav-list    367
        html        298
                    287
        title       284
        script      268
        link        264
        span        234
        ens-lyon    228
        type        226
        text        221
        Src         218
        ul          198
        CSS         196
        img         184
        id          167
        Content     161

    Solution

        import operator
        import urllib.request

        di = {}  # term -> count, accumulated over every URL in the input file

        with open('urlfile.txt') as urlf:
            uf = urlf.readlines()

        for link in uf:
            link = link.strip()  # drop the trailing '\n'
            if link == "":
                continue
            try:
                f = urllib.request.urlopen(link)
                myfile = f.read().decode('utf-8', errors='replace')
                f.close()
            except IOError:
                print("failed to open:", link)
                continue
            for line in myfile.split('\n'):
                line = line.replace('"', " ")
                line = line.replace("'", " ")
                line = line.replace('<', " ")
                line = line.replace('>', " ")
                line = line.replace('=', " ")
                line = line.replace('/', " ")
                # split() with no argument also skips empty strings
                for term in line.split():
                    di[term] = di.get(term, 0) + 1

        # Sort by count, largest first, and print the top 25 terms.
        sx = sorted(di.items(), key=operator.itemgetter(1), reverse=True)
        rr = 0
        for key, value in sx:
            if rr == 25:
                break
            print(key, value)
            rr += 1
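The Part I hints (functions, try-except around each URL, one dictionary of term counts, a sorted top-25 written to a file) can also be arranged as a few small functions. What follows is only a sketch in Python 3, not the course's reference solution: the names count_terms, top_terms, read_url, process_input_file and SEPARATORS are hypothetical, the 10-second timeout is an assumption, and only the six characters listed explicitly in hint 5 are treated as separators.

```python
import urllib.request

# Characters replaced with spaces before splitting (the ones listed in hint 5).
SEPARATORS = '"\'<>=/'


def count_terms(text, counts=None):
    """Add the terms found in text to the counts dictionary and return it."""
    if counts is None:
        counts = {}
    for ch in SEPARATORS:
        text = text.replace(ch, " ")
    # split() with no argument splits on any whitespace and drops empty strings.
    for term in text.split():
        counts[term] = counts.get(term, 0) + 1
    return counts


def top_terms(counts, n=25):
    """Return the n (term, count) pairs with the highest counts, largest first."""
    ranked = sorted(counts, key=counts.get, reverse=True)
    return [(term, counts[term]) for term in ranked[:n]]


def read_url(link):
    """Return the page source as one string, or None if the URL cannot be read."""
    try:
        # timeout=10 is an assumed value so that one slow site cannot hang the run.
        with urllib.request.urlopen(link, timeout=10) as remote:
            return remote.read().decode("utf-8", errors="replace")
    except (IOError, ValueError):
        print("failed to open:", link)
        return None


def process_input_file(in_path, out_path, n=25):
    """Count terms over every URL listed in in_path; write the top n to out_path."""
    counts = {}
    with open(in_path) as urls:
        for line in urls:
            link = line.strip()  # removes the trailing '\n'
            if link == "":
                continue
            source = read_url(link)
            if source is not None:
                count_terms(source, counts)
    with open(out_path, "w") as out:
        for term, count in top_terms(counts, n):
            out.write(" ".join((term, str(count))) + "\n")
    return counts
```

A call such as process_input_file('france.txt', 'france_top25.txt') (both file names are made up here) would be repeated once per input file to get one output file per country. Keeping count_terms free of any network access makes it easy to try on a short string before running it against live pages.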