Scraping and APIs
Session 8
By Kevin
Content
• What is Web Scraping?
• Workflow of a web scraper
• Useful libraries available
• Which library to use for which job
• Legality
What?
Web scraping is a technique for gathering data or
information from web pages. You could revisit your favorite
website every time it updates for new information.
Or you could write a web scraper to do it for you!
What?
A technique to automatically access and extract large
amounts of information from a website
Converts information from a website into another format
Can be used to collect updates, track progress, or gather initial data
Usage
• Extract product information
• Extract job postings and internships
• Extract offers and discounts from deal-of-the-day websites
• Extract data to make a search engine
• Gathering weather data
• etc.
How?
Web Scraping follows this workflow:
• Get the website – using an HTTP library
• Parse the HTML document – using a parsing library
• Store the results – in a database, CSV, text file, etc.
We will focus more on parsing.
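The three-step workflow above can be sketched with only the standard library. This is a minimal sketch, not a production scraper: the sample page is inlined so it runs offline, `fetch` is only defined (not called), and the regex-based parse stands in for whichever parsing library you pick later.

```python
import csv
import re
import urllib.request

def fetch(url):
    # Step 1: get the website with an HTTP library (urllib.request here)
    with urllib.request.urlopen(url) as resp:
        return resp.read().decode("utf-8", errors="replace")

def parse_title(html_doc):
    # Step 2: parse the HTML document (a simple regex stands in for a parser)
    match = re.search(r"<title>(.*?)</title>", html_doc, re.S)
    return match.group(1) if match else ""

def store(rows, path):
    # Step 3: store the results, here as a CSV file
    with open(path, "w", newline="") as f:
        csv.writer(f).writerows(rows)

# Demo on an inline page so the sketch needs no network access:
sample = "<html><head><title>The Dormouse's story</title></head></html>"
title = parse_title(sample)
print(title)  # The Dormouse's story
store([["title"], [title]], "results.csv")
```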
Structure of HTML
<html>
<head>
<title>The Dormouse's story</title>
</head>
<body>
<p id="story-title"><b>The Dormouse's story</b></p>
<p class="story">
Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.
</p>
<p>The story continues...</p>
</body>
</html>
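As a sketch of how a parser walks this tree, Beautiful Soup (covered below; assumes bs4 is installed) can pull out the title, a paragraph by its id, and the links by their class:

```python
from bs4 import BeautifulSoup

html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p id="story-title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
</body></html>"""

soup = BeautifulSoup(html_doc, "html.parser")
print(soup.title.string)                            # The Dormouse's story
print(soup.find("p", id="story-title").get_text())  # The Dormouse's story

# class_ (with the trailing underscore) filters by CSS class
sisters = [a["href"] for a in soup.find_all("a", class_="sister")]
print(sisters)  # the three example.com links
```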
Libraries
• BeautifulSoup (bs4)
• lxml
• selenium
• re
• scrapy
HTTP Libraries
• Requests
r = requests.get('https://www.google.com').text
• urllib/urllib2
html = urllib2.urlopen('http://python.org/').read()
• httplib/httplib2
h = httplib2.Http(".cache")
(resp_headers, content) = h.request("http://pydelhi.org/", "GET")
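Note that urllib2 and httplib are Python 2 names; on Python 3 the same call lives in urllib.request. A minimal sketch, using a data: URL in place of a real website so it runs offline:

```python
import urllib.request

# urllib2.urlopen moved to urllib.request.urlopen in Python 3;
# the data: URL stands in for a real website so the sketch needs no network
html = urllib.request.urlopen(
    "data:text/html,<title>PyDelhi</title>").read().decode()
print(html)  # <title>PyDelhi</title>
```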
Parsing Libraries
• BeautifulSoup (bs4)
tree = BeautifulSoup(html_doc, 'html.parser')
tree.title
• lxml
tree = lxml.html.fromstring(html_doc)
title = tree.xpath('//title/text()')
• re
title = re.findall('<title>(.*?)</title>', html_doc)
Beautiful Soup
Pros
• We can learn it fast
soup = BeautifulSoup(html_doc, 'html.parser')
last_a_tag = soup.find("a", id="link3")
all_b_tags = soup.find_all("b")
• very easy to use
• purely in Python
Cons
• slow :(
LXML
• The lxml XML toolkit provides Pythonic bindings for the C libraries
libxml2 and libxslt without sacrificing speed
Pros
• very fast
• If you have no "pure Python" requirement use lxml
• lxml works with all Python versions from 2.x to 3.x
Cons
• not purely in Python
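A minimal lxml sketch (assumes lxml is installed). Note the `//` prefix in the XPath: `<title>` sits under `<head>`, not at the document root, so `/title/text()` would match nothing.

```python
from lxml import html as lxml_html

# parse a small document and query it with XPath
tree = lxml_html.fromstring(
    "<html><head><title>The Dormouse's story</title></head></html>")
print(tree.xpath('//title/text()'))  # ["The Dormouse's story"]
```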
re
Pros
• pure Python, part of the standard library
• very fast
• works in every Python version
Cons
• requires you to learn its symbols, e.g. ., *, $, ^, \b, \w
• patterns can become complex
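Those symbols are regular-expression metacharacters. A short sketch of the common ones, run against a single link tag:

```python
import re

tag = '<a href="http://example.com/elsie" id="link1">Elsie</a>'

# . matches any character, * means zero or more, (.*?) is a non-greedy group
print(re.findall(r'<a .*?>(.*?)</a>', tag))  # ['Elsie']

# \w matches word characters, \b a word boundary
print(re.findall(r'\bid="(\w+)"', tag))      # ['link1']

# ^ anchors the start of the string, $ the end
print(bool(re.search(r'^<a .*</a>$', tag)))  # True
```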
Comparison
import re
import time
import urllib2
from bs4 import BeautifulSoup
from lxml import html as lxmlhtml

def timeit(fn, *args):
    t1 = time.time()
    for i in range(100):
        fn(*args)
    t2 = time.time()
    print '%s took %0.3f ms' % (fn.func_name, (t2 - t1) * 1000.0)

def bs_test(html):
    soup = BeautifulSoup(html)
    return soup.html.head.title

def lxml_test(html):
    tree = lxmlhtml.fromstring(html)
    return tree.xpath('//title')[0].text_content()

def regex_test(html):
    return re.findall('<title>(.*?)</title>', html)[0]

if __name__ == '__main__':
    url = 'http://pydelhi.org'
    html = urllib2.urlopen(url).read()
    for fn in (bs_test, lxml_test, regex_test):
        timeit(fn, html)
Result
manoj@manoj:~/Desktop$ python test.py
bs_test took 1851.457 ms
lxml_test took 232.942 ms
regex_test took 7.186 ms
lxml took about 32x more time than re, and BeautifulSoup about 245x
more time than re
Is it legal?
• In short, the act of web scraping isn't illegal in itself. However,
some rules need to be followed. Web scraping becomes illegal when
non-publicly available data is extracted.