Shubham Pralhad presented on the topic of web scraping. The presentation covered what web scraping is, the workflow of a web scraper, useful libraries for scraping including BeautifulSoup, lxml, and re, and advantages of scraping over using an API. Web scraping involves getting a website using HTTP requests, parsing the HTML document using a parsing library, and storing the results. BeautifulSoup is easy to use but slow, lxml is very fast but not purely Python, and re is part of the standard library but requires learning regular expressions.
2. TOPIC: WEB SCRAPING
•What is Web Scraping
•Workflow of a web scraper
•Useful libraries available
•Which library to use for which job
•Legality
3. WEB SCRAPING
WHAT IS IT ?
Web scraping is a technique for gathering data or information from web pages. You could revisit your favorite website every time it updates with new information, or you could write a web scraper to do it for you!
4. WEB SCRAPING
•It is a method to extract data from a website that does not have an API, or when we want to extract so much data that an API's rate limits make it impractical.
•Through web scraping we can extract any data that we can see while browsing the web.
5. USAGE
WEB SCRAPING IN REAL LIFE
•Extract product information
•Extract job postings and internships
•Extract offers and discounts from deal-of-the-day websites
•Extract data to make a search engine
•Gathering weather data
•etc.
6. ADVANTAGES
WEB SCRAPING VS. USING AN API
•Web Scraping is not rate limited
•Anonymously access the website and gather data
•Some websites do not have an API
•Some data is not accessible through an API
•and many more !
7. WORKFLOW
ESSENTIAL PARTS OF WEB SCRAPING
Web scraping follows this workflow:
• Get the website - using an HTTP library
• Parse the HTML document - using a parsing library
• Store the results - in a database, a CSV file, a text file, etc.
We will focus mostly on parsing.
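The three steps above can be sketched with the standard library alone. This is a minimal illustration, not the presenter's code: the HTML literal stands in for a page fetched with an HTTP library, and the URL and title are made-up values.

```python
import csv
import io
from html.parser import HTMLParser

# Step 1 (get the website) would normally be an HTTP call, e.g.:
#   html_doc = requests.get('http://pydelhi.org/').text
# Here we use a literal so the sketch runs offline.
html_doc = '<html><head><title>PyDelhi</title></head><body></body></html>'


class TitleParser(HTMLParser):
    """Step 2: parse the HTML, collecting the text inside <title>."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ''

    def handle_starttag(self, tag, attrs):
        if tag == 'title':
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == 'title':
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data


parser = TitleParser()
parser.feed(html_doc)

# Step 3: store the results - here a CSV written to an in-memory buffer.
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(['url', 'title'])
writer.writerow(['http://pydelhi.org/', parser.title])

print(parser.title)  # -> PyDelhi
```

In a real scraper the buffer would be a file on disk or a database table; the structure of the loop stays the same.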
9. HTTP LIBRARIES
USEFUL LIBRARIES AVAILABLE
• Requests
html = requests.get('https://www.google.com').text
•urllib/urllib2
html = urllib2.urlopen('http://python.org/').read()
• httplib/httplib2
h = httplib2.Http(".cache")
(resp_headers, content) = h.request("http://pydelhi.org/", "GET")
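On Python 3, urllib2's functionality lives in urllib.request. As a small sketch (no network call is made here; the User-Agent string is a made-up example), this shows how a request with custom headers is built before fetching:

```python
import urllib.request

# Build a request with a custom User-Agent header; some sites
# reject requests that use the default Python User-Agent.
req = urllib.request.Request(
    'http://pydelhi.org/',
    headers={'User-Agent': 'my-scraper/0.1'},
)

print(req.full_url)                  # -> http://pydelhi.org/
print(req.get_header('User-agent'))  # -> my-scraper/0.1

# To actually fetch the page you would then call:
#   html = urllib.request.urlopen(req).read().decode('utf-8')
```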
10. PARSING LIBRARIES
USEFUL LIBRARIES AVAILABLE
•BeautifulSoup (bs4)
tree = BeautifulSoup(html_doc, 'html.parser')
tree.title
•lxml
tree = lxml.html.fromstring(html_doc)
title = tree.xpath('//title/text()')
•re
title = re.findall(r'<title>(.*?)</title>', html_doc)
11. BEAUTIFULSOUP
PROS AND CONS !
•We can learn it fast
soup = BeautifulSoup(html_doc, 'html.parser')
last_a_tag = soup.find("a", id="link3")
all_b_tags = soup.find_all("b")
•very easy to use
•purely in Python
•slow :(
12. LXML
PROS AND CONS !
The lxml XML toolkit provides Pythonic bindings for the C
libraries libxml2 and libxslt, combining their speed with a simple native Python API
•very fast
•not purely in Python
•If you have no "pure Python" requirement, use lxml
•lxml works with all Python versions from 2.x to 3.x
13. RE
PROS AND CONS !
•requires you to learn its symbols, e.g. '.', '*', '$', '^', '\b', '\w'
•can become complex
•built into Python - a part of the standard library
•very fast
•works with every Python version
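A short illustration of those symbols in use, extracting a title the same way the comparison on the next slide does. The HTML snippet and patterns are made up for the example:

```python
import re

html_doc = '<html><head><title>PyDelhi Meetup</title></head></html>'

# '.' matches any character, '*?' repeats it non-greedily,
# and the parentheses capture just the title text.
title = re.findall(r'<title>(.*?)</title>', html_doc)[0]
print(title)  # -> PyDelhi Meetup

# '^' and '$' anchor to the start and end of the string,
# '\w' matches a word character, '\b' a word boundary.
assert re.match(r'^\w+$', 'PyDelhi')
assert re.search(r'\bMeetup\b', title)
```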
14. COMPARISON
BS4 VS. LXML VS. RE
import re
import time
import urllib.request

from bs4 import BeautifulSoup
from lxml import html as lxmlhtml


def timeit(fn, *args):
    # Time 100 runs of fn and report the total in milliseconds.
    t1 = time.time()
    for i in range(100):
        fn(*args)
    t2 = time.time()
    print('%s took %0.3f ms' % (fn.__name__, (t2 - t1) * 1000.0))


def bs_test(html):
    soup = BeautifulSoup(html, 'html.parser')
    return soup.html.head.title


def lxml_test(html):
    tree = lxmlhtml.fromstring(html)
    return tree.xpath('//title')[0].text_content()


def regex_test(html):
    return re.findall(r'<title>(.*?)</title>', html)[0]


if __name__ == '__main__':
    url = 'http://pydelhi.org'
    html = urllib.request.urlopen(url).read().decode('utf-8')
    for fn in (bs_test, lxml_test, regex_test):
        timeit(fn, html)
16. Is web scraping legal?
•In short, the act of web scraping itself isn't illegal.
However, some rules need to be followed: web scraping
becomes illegal when data that is not publicly available
is extracted.
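One widely followed convention (not named on the slide, and an etiquette rule rather than a law) is to honor a site's robots.txt, which the standard library can parse. The robots.txt content and URLs below are made up for illustration:

```python
from urllib import robotparser

# A sample robots.txt that disallows crawling under /private/.
robots_txt = """\
User-agent: *
Disallow: /private/
"""

rp = robotparser.RobotFileParser()
rp.parse(robots_txt.splitlines())

# Check whether our scraper ('*' = any user agent) may fetch a URL.
print(rp.can_fetch('*', 'http://example.com/jobs'))       # -> True
print(rp.can_fetch('*', 'http://example.com/private/x'))  # -> False
```

In practice a scraper would load the live file with rp.set_url(...) and rp.read() before crawling.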