Shubham Pralhad presented on the topic of web scraping. The presentation covered what web scraping is, the workflow of a web scraper, useful libraries for scraping including BeautifulSoup, lxml, and re, and advantages of scraping over using an API. Web scraping involves getting a website using HTTP requests, parsing the HTML document using a parsing library, and storing the results. BeautifulSoup is easy to use but slow, lxml is very fast but not purely Python, and re is part of the standard library but requires learning regular expressions.
2. TOPIC: WEB SCRAPING
•What is Web Scraping
•Workflow of a web scraper
•Useful libraries available
•Which library to use for which job
•Legality
3. WEB SCRAPING
WHAT IS IT ?
Web scraping is a technique for gathering data or information from web pages. You could revisit your favorite website every time it updates with new information, or you could write a web scraper to do it for you!
4. WEB SCRAPING
•It is a method to extract data from a website that does not have an API, or when we want to extract so much data that an API's rate limits make it impractical.
•Through web scraping we can extract any data that we can see while browsing the web.
5. USAGE
WEB SCRAPING IN REAL LIFE
•Extract product information
•Extract job postings and internships
•Extract offers and discounts from deal-of-the-day websites
•Extract data to make a search engine
•Gathering weather data
•etc.
6. ADVANTAGES
WEB SCRAPING VS. USING AN API
•Web Scraping is not rate limited
•Anonymously access the website and gather data
•Some websites do not have an API
•Some data is not accessible through an API
•and many more !
7. WORKFLOW
ESSENTIAL PARTS OF WEB SCRAPING
Web scraping follows this workflow:
• Get the website - using an HTTP library
• Parse the HTML document - using a parsing library
• Store the results - in a database, a CSV file, a text file, etc.
We will focus mostly on parsing.
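The three steps above can be sketched with the standard library alone. This is a minimal illustration, not the presenter's code: the HTML literal stands in for a page fetched with an HTTP library, and the URL and title are made-up values.

```python
import csv
import io
from html.parser import HTMLParser

# Step 1 (get the website) would normally be an HTTP call, e.g.:
#   html_doc = requests.get('http://pydelhi.org/').text
# Here we use a literal so the sketch runs offline.
html_doc = '<html><head><title>PyDelhi</title></head><body></body></html>'


class TitleParser(HTMLParser):
    """Step 2: parse the HTML, collecting the text inside <title>."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ''

    def handle_starttag(self, tag, attrs):
        if tag == 'title':
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == 'title':
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data


parser = TitleParser()
parser.feed(html_doc)

# Step 3: store the results - here a CSV written to an in-memory buffer.
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(['url', 'title'])
writer.writerow(['http://pydelhi.org/', parser.title])

print(parser.title)  # -> PyDelhi
```

In a real scraper the buffer would be a file on disk or a database table; the structure of the loop stays the same.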
9. HTTP LIBRARIES
USEFUL LIBRARIES AVAILABLE
• Requests
html = requests.get('https://www.google.com').text
•urllib/urllib2
html = urllib2.urlopen('http://python.org/').read()
• httplib/httplib2
h = httplib2.Http(".cache")
(resp_headers, content) = h.request("http://pydelhi.org/", "GET")
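On Python 3, urllib2's functionality lives in urllib.request. As a small sketch (no network call is made here; the User-Agent string is a made-up example), this shows how a request with custom headers is built before fetching:

```python
import urllib.request

# Build a request with a custom User-Agent header; some sites
# reject requests that use the default Python User-Agent.
req = urllib.request.Request(
    'http://pydelhi.org/',
    headers={'User-Agent': 'my-scraper/0.1'},
)

print(req.full_url)                  # -> http://pydelhi.org/
print(req.get_header('User-agent'))  # -> my-scraper/0.1

# To actually fetch the page you would then call:
#   html = urllib.request.urlopen(req).read().decode('utf-8')
```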
10. PARSING LIBRARIES
USEFUL LIBRARIES AVAILABLE
•BeautifulSoup (bs4)
tree = BeautifulSoup(html_doc, 'html.parser')
tree.title
•lxml
tree = lxml.html.fromstring(html_doc)
title = tree.xpath('//title/text()')
•re
title = re.findall(r'<title>(.*?)</title>', html_doc)
11. BEAUTIFULSOUP
PROS AND CONS !
•We can learn it fast
soup = BeautifulSoup(html_doc, 'html.parser')
last_a_tag = soup.find("a", id="link3")
all_b_tags = soup.find_all("b")
•very easy to use
•purely in Python
•slow :(
12. LXML
PROS AND CONS !
The lxml XML toolkit provides Pythonic bindings for the C
libraries libxml2 and libxslt, combining their speed with a simple native Python API
•very fast
•not purely in Python
•If you have no "pure Python" requirement, use lxml
•lxml works with all Python versions from 2.x to 3.x
13. RE
PROS AND CONS !
•requires you to learn its symbols, e.g. '.', '*', '$', '^', '\b', '\w'
•can become complex
•built into Python - a part of the standard library
•very fast
•works with every Python version
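A short illustration of those symbols in use, extracting a title the same way the comparison on the next slide does. The HTML snippet and patterns are made up for the example:

```python
import re

html_doc = '<html><head><title>PyDelhi Meetup</title></head></html>'

# '.' matches any character, '*?' repeats it non-greedily,
# and the parentheses capture just the title text.
title = re.findall(r'<title>(.*?)</title>', html_doc)[0]
print(title)  # -> PyDelhi Meetup

# '^' and '$' anchor to the start and end of the string,
# '\w' matches a word character, '\b' a word boundary.
assert re.match(r'^\w+$', 'PyDelhi')
assert re.search(r'\bMeetup\b', title)
```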
14. COMPARISON
BS4 VS. LXML VS. RE
import re
import time
import urllib.request

from bs4 import BeautifulSoup
from lxml import html as lxmlhtml


def timeit(fn, *args):
    # Time 100 runs of fn and report the total in milliseconds.
    t1 = time.time()
    for i in range(100):
        fn(*args)
    t2 = time.time()
    print('%s took %0.3f ms' % (fn.__name__, (t2 - t1) * 1000.0))


def bs_test(html):
    soup = BeautifulSoup(html, 'html.parser')
    return soup.html.head.title


def lxml_test(html):
    tree = lxmlhtml.fromstring(html)
    return tree.xpath('//title')[0].text_content()


def regex_test(html):
    return re.findall(r'<title>(.*?)</title>', html)[0]


if __name__ == '__main__':
    url = 'http://pydelhi.org'
    html = urllib.request.urlopen(url).read().decode('utf-8')
    for fn in (bs_test, lxml_test, regex_test):
        timeit(fn, html)
16. Is web scraping legal?
•In short, the act of web scraping itself isn't illegal.
However, some rules need to be followed: web scraping
becomes illegal when data that is not publicly available
is extracted.
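One widely followed convention (not named on the slide, and an etiquette rule rather than a law) is to honor a site's robots.txt, which the standard library can parse. The robots.txt content and URLs below are made up for illustration:

```python
from urllib import robotparser

# A sample robots.txt that disallows crawling under /private/.
robots_txt = """\
User-agent: *
Disallow: /private/
"""

rp = robotparser.RobotFileParser()
rp.parse(robots_txt.splitlines())

# Check whether our scraper ('*' = any user agent) may fetch a URL.
print(rp.can_fetch('*', 'http://example.com/jobs'))       # -> True
print(rp.can_fetch('*', 'http://example.com/private/x'))  # -> False
```

In practice a scraper would load the live file with rp.set_url(...) and rp.read() before crawling.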