Advanced Python
Day 4
CHA PLADIN
cpladin@ibex.co | cpladin@ama.edu
Introduction to Web Scraping and Data Analysis
AGENDA
Web scraping
HTML Tag familiarization
and inspecting elements
Request
Data scraping process
File reading and writing
(.csv)
QUICK
RECAP
● Introduction to classes
● Instance of a class,
instance variables (self)
and class variables.
● Inheritance (parent and
child class)
● Method overriding
Activity - Super Idol
Create a parent class called Worker which has
instance variables such as name, email_address,
employee_id and basic_pay
Create a subclass called CEO which inherits the Worker
attributes and adds one more attribute,
employee_department, which is a list that is assigned
only one value.
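
One possible shape for this activity is sketched below. It is only an illustration, not a model answer: the constructor argument order and the sample values are assumptions, and the single department is wrapped in a list because the activity asks for a list holding one value.

```python
class Worker:
    """Parent class holding the attributes every employee shares."""

    def __init__(self, name, email_address, employee_id, basic_pay):
        self.name = name
        self.email_address = email_address
        self.employee_id = employee_id
        self.basic_pay = basic_pay


class CEO(Worker):
    """Subclass that inherits Worker's attributes and adds a department list."""

    def __init__(self, name, email_address, employee_id, basic_pay, department):
        # Reuse the parent initializer instead of repeating the assignments.
        super().__init__(name, email_address, employee_id, basic_pay)
        # The activity asks for a list that holds a single value.
        self.employee_department = [department]


# Sample values are made up for illustration.
boss = CEO("Ana Cruz", "ana@example.com", "E-001", 100000, "Executive")
print(boss.name, boss.employee_department)
```

Calling super().__init__() keeps the shared attributes defined in one place, so a change to Worker is picked up by CEO automatically.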
Introduction to
Web scraping
Web scraping
- a form of data scraping used to extract data
from websites;
- usually refers to automated processes
implemented using a bot or web
crawler.
Libraries to use
● beautifulsoup4
● lxml
● requests
● html5lib
Libraries
beautifulsoup4 - library designed for quick turnaround projects like
screen-scraping.
lxml - easy-to-use library for processing XML and HTML in the
Python language.
requests - allows you to send organic, grass-fed HTTP/1.1 requests,
without the need for manual labor.
html5lib - standards-compliant library for parsing and serializing
HTML documents and fragments in Python.
Basic HTML Tags
Inspecting Elements
Basic Rules of Web Scraping
● Use an API if one is provided, instead of scraping
data.
● Respect the Terms of Service (ToS).
● Respect the rules of robots.txt.
● Use a reasonable crawl rate. Respect the crawl-delay
setting provided in robots.txt; if there's none, use a
conservative crawl rate (e.g. 1 request per 10-15
seconds).
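
The robots.txt and crawl-rate rules above can be checked in code. The sketch below uses the standard library's urllib.robotparser and parses a made-up robots.txt from a string so that it runs offline; the example.com URLs, the sample rules, and the helper names build_parser and delay_for are assumptions for illustration. For a live site, RobotFileParser.set_url() followed by read() loads the real file instead.

```python
from urllib import robotparser

# Hypothetical robots.txt content, parsed from a string so the sketch runs offline.
SAMPLE_ROBOTS = """\
User-agent: *
Crawl-delay: 12
Disallow: /private/
"""

DEFAULT_DELAY = 10  # seconds; conservative fallback when no crawl-delay is set


def build_parser(robots_text):
    """Return a RobotFileParser loaded with the given robots.txt text."""
    parser = robotparser.RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


def delay_for(parser, user_agent="*"):
    """Use the site's crawl-delay if present, else the conservative default."""
    delay = parser.crawl_delay(user_agent)
    return delay if delay is not None else DEFAULT_DELAY


rules = build_parser(SAMPLE_ROBOTS)
allowed = rules.can_fetch("*", "https://example.com/search/page-1")
blocked = rules.can_fetch("*", "https://example.com/private/data")
wait_seconds = delay_for(rules)
print(allowed, blocked, wait_seconds)
```

In a real scraper, time.sleep(wait_seconds) between requests would keep the crawl rate within these limits.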
Practice
We will scrape data from
https://old.yellow-pages.ph/search/jollibee/metro-manila/page-1,
extracting a few fields such as establishment
name, company, address and
phone number.
Import Libraries and setup source
HTML parser
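
A minimal sketch of this step, assuming the requests and beautifulsoup4 packages are installed. The helper names fetch_html and make_soup and the 30-second timeout are illustrative choices, not part of the original slides. The built-in "html.parser" is used here; "lxml" or "html5lib" can be passed instead when those packages are available.

```python
import requests
from bs4 import BeautifulSoup

SOURCE_URL = "https://old.yellow-pages.ph/search/jollibee/metro-manila/page-1"


def fetch_html(url):
    """Download a page and return its HTML text, raising on HTTP errors."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return response.text


def make_soup(html):
    """Parse HTML text into a BeautifulSoup tree using the built-in parser."""
    return BeautifulSoup(html, "html.parser")


# Offline check with a tiny inline document. For the live page, use
# make_soup(fetch_html(SOURCE_URL)) instead.
soup = make_soup("<html><head><title>Search results</title></head></html>")
print(soup.title.string)
```

Keeping the download and the parsing in separate functions makes it possible to test the parsing on saved HTML without sending a request each time.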
Locating and grabbing data
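
The sketch below shows the general pattern with find_all() and find(). The HTML sample, its tag names, and its class names are hypothetical: the real page's markup has to be read from the browser's Inspect Element panel and will likely differ. The helpers text_or_blank and extract_listings are likewise illustrative names.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; replace tags and classes with what the real page uses.
SAMPLE_HTML = """
<div class="result">
  <h2 class="name">Jollibee Sample Branch</h2>
  <span class="company">Sample Foods Corp.</span>
  <p class="address">123 Sample St., Metro Manila</p>
  <span class="phone">(02) 0000-0000</span>
</div>
<div class="result">
  <h2 class="name">Jollibee Another Branch</h2>
  <p class="address">456 Example Ave., Metro Manila</p>
</div>
"""


def text_or_blank(parent, tag, class_name):
    """Return the stripped text of the first matching child, or ''."""
    node = parent.find(tag, class_=class_name)
    return node.get_text(strip=True) if node else ""


def extract_listings(html):
    """Collect one dict per result card found in the HTML."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for card in soup.find_all("div", class_="result"):
        rows.append({
            "name": text_or_blank(card, "h2", "name"),
            "company": text_or_blank(card, "span", "company"),
            "address": text_or_blank(card, "p", "address"),
            "phone": text_or_blank(card, "span", "phone"),
        })
    return rows


listings = extract_listings(SAMPLE_HTML)
for row in listings:
    print(row)
```

Returning an empty string for a missing field keeps every row the same shape, which matters once the rows are written to a .csv file.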
File Writing
File Writing
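
A minimal sketch of writing scraped rows to a .csv file with the standard library's csv module. The rows, the file name listings.csv, and the helper write_rows are made up for illustration; the file is placed in the system's temporary folder so the example does not clutter the working directory.

```python
import csv
import os
import tempfile

FIELDS = ["name", "company", "address", "phone"]

# Made-up rows standing in for scraped results.
rows = [
    {"name": "Jollibee Sample Branch", "company": "Sample Foods Corp.",
     "address": "123 Sample St., Metro Manila", "phone": "(02) 0000-0000"},
    {"name": "Jollibee Another Branch", "company": "",
     "address": "456 Example Ave., Metro Manila", "phone": ""},
]


def write_rows(path, records):
    """Write a header line followed by one line per record."""
    # newline="" prevents blank lines between rows on Windows.
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(records)


output_path = os.path.join(tempfile.gettempdir(), "listings.csv")
write_rows(output_path, rows)
print("Saved", len(rows), "rows to", output_path)
```

The csv module quotes any field that contains a comma, such as the addresses here, so the columns stay aligned when the file is opened later.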
PRACTICE 2
Create a program that scrapes
http://books.toscrape.com/catalogue/page-2.html
and generates the following for each book:
● Product/Book Name
● Book Link
● Star rating
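
One way to approach this exercise is sketched below. The inline HTML is a made-up sample that imitates the markup the site is assumed to use (an article with class product_pod, a p whose second class names the rating, and a link inside h3); confirm this with Inspect Element before relying on it. The function name parse_books and the book titles are illustrative only.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

PAGE_URL = "http://books.toscrape.com/catalogue/page-2.html"
RATING_WORDS = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}

# Made-up sample imitating the assumed page structure.
SAMPLE_HTML = """
<article class="product_pod">
  <p class="star-rating Three"></p>
  <h3><a href="sample-book_1/index.html" title="A Sample Book Title">A Sample ...</a></h3>
</article>
<article class="product_pod">
  <p class="star-rating Five"></p>
  <h3><a href="another-book_2/index.html" title="Another Book">Another Book</a></h3>
</article>
"""


def parse_books(html, page_url=PAGE_URL):
    """Return name, absolute link and numeric star rating for each book."""
    soup = BeautifulSoup(html, "html.parser")
    books = []
    for article in soup.find_all("article", class_="product_pod"):
        link = article.h3.a
        rating_tag = article.find("p", class_="star-rating")
        # The rating is stored as a class name such as "Three".
        rating_word = next(
            (c for c in rating_tag["class"] if c in RATING_WORDS), None
        )
        books.append({
            "name": link["title"],  # full title; the link text may be shortened
            "link": urljoin(page_url, link["href"]),  # make the link absolute
            "stars": RATING_WORDS.get(rating_word),
        })
    return books


books = parse_books(SAMPLE_HTML)
for book in books:
    print(book)
```

To run it against the live page, pass the downloaded HTML text to parse_books in place of SAMPLE_HTML.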
File I/O
Practice
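
For the reading half of file I/O, the sketch below loads a .csv file with csv.DictReader and filters the rows. It first writes a small made-up file so that it runs on its own; the file name books_sample.csv, the column names, and the helper read_rows are assumptions for illustration.

```python
import csv
import os
import tempfile

# Create a small made-up file so the reading step has something to load.
path = os.path.join(tempfile.gettempdir(), "books_sample.csv")
with open(path, "w", newline="", encoding="utf-8") as handle:
    handle.write(
        "name,link,stars\n"
        "A Sample Book Title,http://example.com/a,3\n"
        "Another Book,http://example.com/b,5\n"
    )


def read_rows(csv_path):
    """Return every data row as a dict keyed by the header line."""
    with open(csv_path, newline="", encoding="utf-8") as source:
        return list(csv.DictReader(source))


records = read_rows(path)
# Values read from a .csv file are strings, so convert before comparing.
top_rated = [row["name"] for row in records if int(row["stars"]) >= 4]
print(top_rated)
```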

Day 4 - Advance Python - Ground Gurus
