7. Course agenda – Day 3
1. Introduction to web scraping
2. Setting up the development environment
3. Retrieving HTML data
4. Parsing that data
5. Project 1: Geo-coordinates from Wikipedia, displayed in Google Maps
6. Storing the information
7. Advanced topics
8. Accessing APIs
9. Project 2: Dress for the weather
14. The Internet at a Glance
Application Layer (data): e.g. HTTP, FTP, email, Telnet, … (this class's focus)
Transport Layer (segments): e.g. TCP, UDP
Network Layer (packets): e.g. IP
Link Layer (frames): e.g. Ethernet, Wi-Fi
Physical Layer (bits): e.g. Ethernet cable, fiber optics
16. HTTP vs. HTML
HTML: HyperText Markup Language
• Definitions of tags that are added to Web documents to control their appearance
HTTP: HyperText Transfer Protocol
• The rules governing the conversation between a Web client and a Web server
21. HTTP – status codes
• 200 OK
• 201 Created
• 202 Accepted
• 204 No Content
• 301 Moved Permanently
• 302 Moved Temporarily (Found)
• 304 Not Modified
• 400 Bad Request
• 401 Unauthorized
• 403 Forbidden
• 404 Not Found
• 500 Internal Server Error
• 501 Not Implemented
• 502 Bad Gateway
• 503 Service Unavailable
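A minimal sketch of how these codes surface in Python 3's urllib (the URL is the local demo server used later in these slides):
from urllib.request import urlopen
from urllib.error import HTTPError

try:
    conn = urlopen("http://localhost:5000/static/demo1.html")
    print(conn.status)   # 200 OK on success
except HTTPError as e:
    print(e.code)        # e.g. 404 Not Found or 500 Internal Server Error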
32. SAMBA
• Use the Raspberry Pi as a simple Network-Attached Storage (NAS) device that can share files with Windows and Linux
$ sudo apt-get install samba samba-common-bin
$ sudo nano /etc/samba/smb.conf
workgroup = WORKGROUP
wins support = yes
$ mkdir ~/share
$ sudo nano /etc/samba/smb.conf
[PiShare]
comment=Raspberry Pi Share
path=/home/pi/share
browseable=Yes
writeable=Yes
only guest=no
create mask=0777
directory mask=0777
public=no   # set to yes for public access without a password
$ sudo smbpasswd -a pi
http://raspberrypihq.com/how-to-share-a-folder-with-a-windows-computer-from-a-raspberry-pi/
43. Exercise 14 – run web server
• Set up a Jupyter notebook server that can be accessed from the local network
#target (Pi)
$ jupyter notebook --generate-config
$ nano ~/.jupyter/jupyter_notebook_config.py
# Set ip to '*' to bind on all interfaces (IPs) for a public server
c.NotebookApp.ip = '*'
c.NotebookApp.open_browser = False
c.NotebookApp.port = 9999
http://jupyter-notebook.readthedocs.io/en/latest/public_server.html
44. Exercise 14 – run web server
#target (Pi)
$ jupyter notebook
#local
http://172.20.10.8:9999/tree
Open it with Google Chrome and create a new notebook named e.g. web_practice_1
45. Exercise 14 – run web server
• Access a page with urlopen() by running the code below in IPython
from urllib.request import urlopen
html = urlopen("http://google.com/")
print(html.read())
48. BeautifulSoup
from urllib.request import urlopen
from bs4 import BeautifulSoup as bs
html = urlopen("http://localhost:5000/static/demo1.html")
bsObj = bs(html.read(), "html.parser")
print(bsObj.h1)   # html → body → h1; bsObj.body.h1 produces the same result
print(bsObj.div)
BeautifulSoup converts the byte string (html.read()) into an HTML hierarchy (the DOM).
51. Common problem using urllib: Connecting Reliably (2)
if htmlConn is None:
    print("URL is not found")
else:
    # program continues
    pass
If the server is not found, urlopen returns a None object. This object is analogous to null in other programming languages.
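The other common failure is the server returning an HTTP error status (see the status-code table earlier); a minimal sketch of guarding against both situations with urllib:
from urllib.request import urlopen
from urllib.error import HTTPError, URLError

try:
    htmlConn = urlopen("http://localhost:5000/static/demo1.html")
except HTTPError as e:
    print("The server returned an HTTP error:", e.code)
except URLError as e:
    print("The server could not be reached:", e.reason)
else:
    if htmlConn is None:
        print("URL is not found")
    else:
        pass  # program continues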
52. Common problem using urllib: Connecting Reliably (2)
htmlConn = urlopen("http://localhost:5000/static/demo1.html")
bsObj = bs(htmlConn.read(), "html.parser")
print(bsObj.fooTag)           # a nonexistent tag returns None
print(bsObj.fooTag.someTag)   # accessing an attribute of None raises an exception
AttributeError: 'NoneType' object has no attribute 'someTag'
Even if the page is retrieved successfully, there is still the issue of the content on the page. If you attempt to access a tag that does not exist, BeautifulSoup returns None.
53. Common problem using urllib: Connecting Reliably (2)
To explicitly check for both failure situations:
try:
    # guard against both situations: the tag not existing (None) and
    # the exception raised when accessing an attribute of a None object
    badContent = bsObj.foo.anotherTag
except AttributeError as e:
    print("Tag was not found")
else:
    if badContent == None:
        print("Tag was not found")
    else:
        print(badContent)
57. Searching for tags by attributes
• Use findAll() to extract a Python list of proper nouns
• The most popular method in the bs4 API
• findAll(tagName, tagAttributes)
• Unlike bsObj.tagName, which only gets the first occurrence of the tag
• get_text() separates the content from the tags and returns a str
htmlConn = urlopen("http://localhost:5000/static/demo2.html")
bsObj = bs(htmlConn, "html.parser")
name_list = bsObj.findAll("span", {"class": "green"})
for n in name_list:
    print(type(n))
    print(n.get_text())
Anna Pavlovna Scherer
Empress Marya Fedorovna
58. Searching for tags by attributes
• The two functions you will likely use the most:
• findAll(tag, attributes, recursive, text, limit, keywords)
• find(tag, attributes, recursive, text, keywords)
• recursive defaults to True (look at children and children's children)
• Setting limit=1 is equivalent to find()
• The keyword argument allows you to select tags that contain a particular attribute
• 95% of the time you only use tag and attributes
bsObj.findAll("span", {"class": {"green", "red"}})   # returns both the red and the green span tags in the document
bsObj.findAll({'h1', 'h2', 'h3'})   # returns a list of all the header tags in a document
nameList = bsObj.findAll(text="the prince")   # find occurrences of the text "the prince"
print(len(nameList))   # 7
bsObj.findAll(id="text")   # same as bsObj.findAll("", {"id": "text"})
The keyword argument is technically redundant.
60. BeautifulSoup objects
1) BeautifulSoup objects
2) Tag objects: returned in a bs4.element.ResultSet by calling find() and findAll()
3) NavigableString objects: represent the text within a tag, accessed via .string
4) Comment objects: e.g. <!--like this one-->
[<span class="green">Anna Pavlovna Scherer</span>,
 <span class="green">Empress Marya Fedorovna</span>]
Each element in the ResultSet is a Tag that you can keep drilling down into.
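A minimal sketch showing these types in practice, using the demo2.html page from the findAll example:
from urllib.request import urlopen
from bs4 import BeautifulSoup as bs

bsObj = bs(urlopen("http://localhost:5000/static/demo2.html"), "html.parser")  # BeautifulSoup object

results = bsObj.findAll("span", {"class": "green"})  # bs4.element.ResultSet returned by findAll()
tag = results[0]                                     # each element is a Tag object
text = tag.string                                    # the text inside the tag is a NavigableString
print(type(results), type(tag), type(text))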
61. Accessing by the location of a tag
Navigating Trees
1) Dealing with children and other descendants
2) Children are always exactly one tag below a parent, whereas descendants can be at any level in the tree below a parent
tr is a child of the table tag
tr, th, td, img, and span are all descendants of the table tag
All children are descendants, but not all descendants are children
bsObj.body.h1
bsObj.div.findAll("img")
BeautifulSoup functions always deal with the descendants of the currently selected tag.
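A minimal sketch of the children vs. descendants distinction (the URL is assumed; any demo page containing the giftList table used on the next slide works):
from urllib.request import urlopen
from bs4 import BeautifulSoup as bs

# assumed URL: whichever local demo page contains the table with id="giftList"
bsObj = bs(urlopen("http://localhost:5000/static/demo3.html"), "html.parser")

# .children yields only the direct children of the table (its tr rows),
# while .descendants would also yield the th, td, img and span tags inside them
for child in bsObj.find("table", {"id": "giftList"}).children:
    print(child)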
64. Accessing by the location of a tag
for sibling in bsObj.find("table", {"id": "giftList"}).tr.next_siblings:
    print(sibling)
Prints all rows of products from the product table, except for the first title row (the row selected by .tr is not included among its own siblings).
The next_siblings attribute makes it trivial to collect data from tables, especially ones with title rows.
66. BeautifulSoup + regex
import re
html = urlopen("http://localhost:5000/static/demo3.html")
bsObj = bs(html, "html.parser")
images = bsObj.findAll("img", {"src": re.compile(r".*img.*\.jpg")})
for image in images:
    print(image["src"])
A plain bsObj.findAll("img") may find extra images: hidden images, and blank images used for spacing and aligning elements on modern websites.
76. Exercise
1. Go to the Google Developers Console.
2. Select a project, or create a new one.
3. Open the API Library in the Google Developers Console. If prompted, select a project or create a new one. Select the Enabled APIs link in the API section to see a list of all your enabled APIs. Make sure that the API is on the list of enabled APIs. If you have not enabled it, select the API from the list of APIs, then select the Enable API button for the API.
4. In the sidebar on the left, select Credentials.
5. To create a browser API key, select [Add credentials] > [API key] > [Browser key].
a. Set up the API key
78. Exercise
1. jupyter notebook   # start the server; an IPython notebook can be used
2. Access http://localhost:9999/ with any preferred browser
3. Type in the setup code and execute it with SHIFT + ENTER
from urllib.request import urlopen
import urllib
import requests   # conda/pip install requests
import sys
from bs4 import BeautifulSoup as bs
URL = "https://en.wikipedia.org/wiki/Taipei"
b. IPython
Refer to Official_Project1_Day3.ipynb
79. req = requests.get(URL, headers={'User-Agent': "Mining the Wiki"})
soup = bs(req.text, "html.parser")
geoTag = soup(class_='geo')   # finds all the tags with class "geo" in the document
geoTag = soup(True, 'geo')
geoTag
geoTag = soup.find(class_='geo-dms')
lat = geoTag.find(True, 'latitude').string
lat
geoTag = soup.find(True, 'geo')
geoTag.string.split(';')
b. IPython
4. Type in the connecting and parsing code
!"#$%&'()%##*+,-.+/012343567'89:29161;"<#$%&/='"#$%&'()%##*+,-.+/012343567'89:29161;"<#$%&/='"#$%&'()%##*+,-.+/012343567'8
9:29161;"<#$%&/>
'35°55′45″N'
['35.92917', ' -86.85750']
Exercise
80. Exercise
5. Put it together
def geolookup():
    geoTag = soup.find(True, 'geo')
    if geoTag and len(geoTag) > 1:
        lat = geoTag.find(True, 'latitude').string
        lon = geoTag.find(True, 'longitude').string
        print('a. Location is at', lat, lon)
        return lat, lon
    elif geoTag and len(geoTag) == 1:
        (lat, lon) = geoTag.string.split(';')
        (lat, lon) = (lat.strip(), lon.strip())
        print('b. Location is at', lat, lon)
        return lat, lon
    else:
        print('No location found')

def geolookup_dms():
    geoTag = soup.find(True, 'geo-dms')
    if geoTag and len(geoTag) > 1:
        lat = geoTag.find(True, 'latitude').string
        lon = geoTag.find(True, 'longitude').string
        print('Location is at', lat, lon)
        return lat, lon

(lat, lon) = geolookup_dms()
81. from IPython.display import IFrame
from IPython.core.display import display
api = 'key=yourKEY'
maptype = '&maptype=satellite'
zoom = '&zoom=18'
google_maps__view_url = embedRawUrl + mode + api + "&q={0}+{1}".format(lat, lon)
google_maps__view_url
display(IFrame(google_maps__view_url, '600px', '400px'))
6. Displaying in Google Maps
Exercise
'https://www.google.com/maps/embed/v1/search?key=YOURKEY&q=35°55′45″N+86°51′27″W'
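The cell above references embedRawUrl and mode, which are defined in earlier cells of the notebook; a plausible reconstruction, inferred from the example URL above (the names and values here are assumptions, not the official notebook code):
# assumed values, reconstructed from the example URL on this slide
embedRawUrl = 'https://www.google.com/maps/embed/v1/'   # Google Maps Embed API base URL
mode = 'search?'                                        # "search" mode, followed by the query string
# api, maptype, zoom, lat and lon are set as in the cells above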
86. Media File
• If you only need to download a single file:
from urllib.request import urlopen
from urllib.request import urlretrieve
from bs4 import BeautifulSoup as bs
from urllib.error import HTTPError

html = urlopen("http://www.tutorialspoint.com/python/")
baseURL = 'http://www.tutorialspoint.com'
bsObj = bs(html, "html.parser")
imageLocation = bsObj.find("a", {"title": "tutorialspoint"}).find("img")["src"]
imageLocation = baseURL + imageLocation
urlretrieve(imageLocation, "logo.jpg")
'http://www.tutorialspoint.com/python/images/logo.png'
89. CSV
• Storing data to CSV
• CSV, or comma-separated values, is one of the most popular file formats in which to store spreadsheet data.
• Supported by Microsoft Excel and OpenOffice
fruit,cost
apple,1.00
banana,0.30
pear,1.25
90. CSV
Create and write data into a CSV file:
import csv

filename = "test.csv"
try:
    csvFile = open(filename, 'w')
    writer = csv.writer(csvFile, lineterminator='\n')
    writer.writerow(('number', 'number plus 2', 'number times 2'))
    for i in range(10):
        writer.writerow((i, i+2, i*2))
except csv.Error as e:
    print('file %s, line %d: %s' % (filename, writer.line_num, e))
finally:
    csvFile.close()
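To check what was written, a minimal sketch of reading the file back with the same csv module:
import csv

with open("test.csv", newline='') as csvFile:
    for row in csv.reader(csvFile):
        print(row)   # each row comes back as a list of strings, e.g. ['0', '2', '0']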
92. RDBMS – SQLite
Content credited: http://www-03.ibm.com/ibm/history/ibm100/us/en/icons/reldb/
• Two other major closed-source database systems: Microsoft's SQL Server and Oracle's DBMS
• MySQL is the most popular open-source relational database, used by YouTube and Facebook
• Flat-file DB vs. relational DB (see the SQLite sketch below)
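SQLite itself ships with Python as the sqlite3 module, so no server setup is needed; a minimal sketch of storing scraped rows locally (the table and column names are made up for illustration):
import sqlite3

conn = sqlite3.connect("scraped.db")          # creates the file if it does not exist
cur = conn.cursor()
cur.execute("CREATE TABLE IF NOT EXISTS pages (title TEXT, url TEXT)")
cur.execute("INSERT INTO pages VALUES (?, ?)", ("Taipei", "https://en.wikipedia.org/wiki/Taipei"))
conn.commit()

for row in cur.execute("SELECT * FROM pages"):
    print(row)
conn.close()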
99. Submitting a Basic Form
<h2>請輸入你的大名!(PHP)</h2>
<form method="post" action="hello.php">
姓: <input type="text" name="firstname"><br>
名: <input type="text" name="lastname"><br>
<input type="submit" value="Submit" id="submit">
</form>
• Most web forms consist of a few HTML fields, a submit button, and an "action" page, where the actual form processing is done
• HTML forms are essentially a way for users to format POST requests in a way the server can interpret
100. Submitting a Basic Form
• Submitting a form with the Requests library; r.text contains the response generated by the PHP action page
import requests

params = {'firstname': 'Paul', 'lastname': 'Yang'}
r = requests.post("http://localhost:5000/demo_submit_form_1", data=params)
print(r.text)
103. Radio Buttons, Checkboxes
• Two things to worry about: the name of the element and its value
<form method="GET" action="someProcessor.php">
<h2>請輸入你的名</h2><br>
<input type="radio" name="firstname" value="Paul" />Paul<br>
<input type="radio" name="firstname" value="Jack" />Jack<br>
<h2>請輸入你的姓</h2><br>
<input type="radio" name="lastname" value="Yang" />Yang<br>
<input type="radio" name="lastname" value="Wang" />Wang<br>
<input type="submit" value="Submit" />
</form>
104. Radio Buttons, Checkboxes
• Submitting a form using the GET method with the Requests library
• requests is used for GET too, but the values go in "params" instead of data
import requests

params = {'firstname': 'Paul', 'lastname': 'Yang'}
r = requests.get("http://localhost:5000/demo_submit_form_1", params=params)
print(r.text)
105. Exercise 20 - Know whether GET or POST is used
• Run the Flask server with
• sudo python WebServer.py
• Open the URLs with Google Chrome
• Form using the GET method
• http://localhost:5000/python_demo/demo5.html
• Form using the POST method
• http://localhost:5000/python_demo/demo6.html
• Enter your first name (名) and last name (姓)
106. Exercise 20 - Know whether GET or POST is used
• Check by looking at the URL bar and the developer tools (press Ctrl+Shift+I)
• GET sends "params" in the URL (?firstname=... etc.)
• POST sends the form data in the request body
108. Cookie
• How is this different from a login form, which lets you remain in a permanent "logged in" state throughout your visit to the site?
• Most modern websites use cookies to keep track of who is logged in and who is not. Once a site authenticates your login credentials, it stores a cookie in your browser, which usually contains a server-generated token, a timeout, and tracking information.
<h2>Log In Here!</h2>
Warning: Your browser must be able to use cookies in order to view our site!
<form method="post" action="welcome.php">
Username (use anything!): <input type="text" name="username"><br>
Password (try "password"): <input type="password" name="password"><br>
<input type="submit" value="Login">
</form>
Visit http://pythonscraping.com/pages/cookies/login.html
109. Cookie
import requests
params = {'username': 'Ryan', 'password': 'password'}
r = requests.post("http://pythonscraping.com/pages/cookies/welcome.php", params)
print("Cookie is set to:")
print(r.cookies.get_dict())
print("-----------")
print("Going to profile page...")
r = requests.get("http://pythonscraping.com/pages/cookies/profile.php",cookies=r.cookies)
print(r.text)
Cookie is set to:
{'username': 'Ryan', 'loggedin': '1'}
-----------
Going to profile page...
Hey Ryan! Looks like you're still logged into the site!
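Rather than passing r.cookies to every request by hand, the Requests library also provides a Session object that keeps track of cookies automatically; a minimal sketch against the same demo site:
import requests

session = requests.Session()
params = {'username': 'Ryan', 'password': 'password'}
session.post("http://pythonscraping.com/pages/cookies/welcome.php", params)
print("Cookie is set to:", session.cookies.get_dict())

# the session re-sends the stored cookies automatically
r = session.get("http://pythonscraping.com/pages/cookies/profile.php")
print(r.text)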
128. Exercise
1. Go to https://home.openweathermap.org/users/sign_up to create a new account
2. Go to the account settings to create the key and copy it
3. Put city.list.json in the same folder where you run IPython (*.ipynb)
4. Run the ipython notebook command
a. Setup
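Once the key is ready, a minimal sketch of a first call to the current-weather endpoint with requests (the endpoint and parameter names follow the OpenWeatherMap documentation; YOURKEY is the key created in step 2):
import requests

API_KEY = "YOURKEY"   # the key created in step 2
params = {"q": "Taipei", "appid": API_KEY, "units": "metric"}
r = requests.get("http://api.openweathermap.org/data/2.5/weather", params=params)
data = r.json()
print(data["weather"][0]["description"], data["main"]["temp"])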
141. Scanning an IP range
$ nmap -sP -n 172.20.10.1-30
Nmap scan report for 172.20.10.1
Host is up (1.0s latency).
Nmap scan report for 172.20.10.3
Host is up (1.0s latency).
Nmap scan report for 172.20.10.15
Host is up (0.0013s latency).
Nmap done: 30 IP addresses (3 hosts up) scanned in 15.63 seconds
# Windows alternative: Advanced IP Scanner
http://www.stevendobbelaere.be/how-to-do-a-network-ip-range-scan-with-nmap/