Scraping and APIs
Session 8
By Kevin
Content
• What is Web Scraping?
• Workflow of a web scraper
• Useful libraries available
• Which library to use for which job
• Legality
What?
Web scraping is a technique for gathering data or
information from web pages. You could revisit your favorite
website every time it updates for new information.
Or you could write a web scraper to do it for you!
What?
A technique to automatically access and extract large
amounts of information from a website
Converts information from a website into another format
Can be used to collect updates, track progress, or gather initial data
Usage
• Extract product information
• Extract job postings and internships
• Extract offers and discounts from deal-of-the-day websites
• Extract data to make a search engine
• Gathering weather data
• etc.
How?
Web Scraping follows this workflow:
• Get the website – using an HTTP library
• Parse the HTML document – using a parsing library
• Store the results – in a database, CSV, text file, etc.
We will focus more on parsing.
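The three-step workflow above can be sketched with only the standard library. This is a minimal sketch, not a production scraper: the sample page is inlined so it runs offline, `fetch` is only defined (not called), and the regex-based parse stands in for whichever parsing library you pick later.

```python
import csv
import re
import urllib.request

def fetch(url):
    # Step 1: get the website with an HTTP library (urllib.request here)
    with urllib.request.urlopen(url) as resp:
        return resp.read().decode("utf-8", errors="replace")

def parse_title(html_doc):
    # Step 2: parse the HTML document (a simple regex stands in for a parser)
    match = re.search(r"<title>(.*?)</title>", html_doc, re.S)
    return match.group(1) if match else ""

def store(rows, path):
    # Step 3: store the results, here as a CSV file
    with open(path, "w", newline="") as f:
        csv.writer(f).writerows(rows)

# Demo on an inline page so the sketch needs no network access:
sample = "<html><head><title>The Dormouse's story</title></head></html>"
title = parse_title(sample)
print(title)  # The Dormouse's story
store([["title"], [title]], "results.csv")
```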
Structure of HTML
<html>
<head>
<title>The Dormouse's story</title>
</head>
<body>
<p id="story-title"><b>The Dormouse's story</b></p>
<p class="story">
Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.
</p>
<p>The story continues...</p>
</body>
</html>
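As a sketch of how a parser walks this tree, Beautiful Soup (covered below; assumes bs4 is installed) can pull out the title, a paragraph by its id, and the links by their class:

```python
from bs4 import BeautifulSoup

html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p id="story-title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
</body></html>"""

soup = BeautifulSoup(html_doc, "html.parser")
print(soup.title.string)                            # The Dormouse's story
print(soup.find("p", id="story-title").get_text())  # The Dormouse's story

# class_ (with the trailing underscore) filters by CSS class
sisters = [a["href"] for a in soup.find_all("a", class_="sister")]
print(sisters)  # the three example.com links
```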
Libraries
• BeautifulSoup (bs4)
• lxml
• selenium
• re
• scrapy
HTTP Libraries
• Requests
r = requests.get('https://www.google.com').text
• urllib/urllib2
html = urllib2.urlopen('http://python.org/').read()
• httplib/httplib2
h = httplib2.Http(".cache")
(resp_headers, content) = h.request("http://pydelhi.org/", "GET")
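Note that urllib2 and httplib are Python 2 names; on Python 3 the same call lives in urllib.request. A minimal sketch, using a data: URL in place of a real website so it runs offline:

```python
import urllib.request

# urllib2.urlopen moved to urllib.request.urlopen in Python 3;
# the data: URL stands in for a real website so the sketch needs no network
html = urllib.request.urlopen(
    "data:text/html,<title>PyDelhi</title>").read().decode()
print(html)  # <title>PyDelhi</title>
```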
Parsing Libraries
• BeautifulSoup (bs4)
tree = BeautifulSoup(html_doc, 'html.parser')
tree.title
• lxml
tree = lxml.html.fromstring(html_doc)
title = tree.xpath('//title/text()')
• re
title = re.findall('<title>(.*?)</title>', html_doc)
Beautiful Soup
Pros
• We can learn it fast
soup = BeautifulSoup(html_doc, 'html.parser')
last_a_tag = soup.find("a", id="link3")
all_b_tags = soup.find_all("b")
• very easy to use
• purely in Python
Cons
• slow :(
LXML
• The lxml XML toolkit provides Pythonic bindings for the C libraries
libxml2 and libxslt without sacrificing speed
Pros
• very fast
• If you have no "pure Python" requirement use lxml
• lxml works with all Python versions from 2.x to 3.x
Cons
• not purely in Python
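A minimal lxml sketch (assumes lxml is installed). Note the `//` prefix in the XPath: `<title>` sits under `<head>`, not at the document root, so `/title/text()` would match nothing.

```python
from lxml import html as lxml_html

# parse a small document and query it with XPath
tree = lxml_html.fromstring(
    "<html><head><title>The Dormouse's story</title></head></html>")
print(tree.xpath('//title/text()'))  # ["The Dormouse's story"]
```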
re
Pros
• pure Python, part of the standard library
• very fast
• works in every Python version
Cons
• requires you to learn its symbols, e.g. ., *, $, ^, \b, \w
• patterns can become complex
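Those symbols are regular-expression metacharacters. A short sketch of the common ones, run against a single link tag:

```python
import re

tag = '<a href="http://example.com/elsie" id="link1">Elsie</a>'

# . matches any character, * means zero or more, (.*?) is a non-greedy group
print(re.findall(r'<a .*?>(.*?)</a>', tag))  # ['Elsie']

# \w matches word characters, \b a word boundary
print(re.findall(r'\bid="(\w+)"', tag))      # ['link1']

# ^ anchors the start of the string, $ the end
print(bool(re.search(r'^<a .*</a>$', tag)))  # True
```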
Comparison
import re
import time
import urllib2
from bs4 import BeautifulSoup
from lxml import html as lxmlhtml

def timeit(fn, *args):
    t1 = time.time()
    for i in range(100):
        fn(*args)
    t2 = time.time()
    print '%s took %0.3f ms' % (fn.func_name, (t2 - t1) * 1000.0)

def bs_test(html):
    soup = BeautifulSoup(html)
    return soup.html.head.title

def lxml_test(html):
    tree = lxmlhtml.fromstring(html)
    return tree.xpath('//title')[0].text_content()

def regex_test(html):
    return re.findall('<title>(.*?)</title>', html)[0]

if __name__ == '__main__':
    url = 'http://pydelhi.org'
    html = urllib2.urlopen(url).read()
    for fn in (bs_test, lxml_test, regex_test):
        timeit(fn, html)
Result
manoj@manoj:~/Desktop$ python test.py
bs_test took 1851.457 ms
lxml_test took 232.942 ms
regex_test took 7.186 ms
lxml took about 32x more time than re, and BeautifulSoup about 245x
more time than re
Is it legal?
• In short, the act of web scraping isn't illegal in itself. However,
some rules need to be followed. Web scraping becomes illegal when
non-publicly available data is extracted.