{Web Scraping}
https://www.linkedin.com/in/littinrajan
An Introduction to Web Scraping using Python
http://littinrajan.wordpress.com/
AGENDA
• What is Web Scraping?
• Why is it needed?
• How does it work?
• How to do Massive Web Scraping?
• Can we automate it?
WEB SCRAPING
‘Web Scraping’ is a technique for gathering structured data or information
from web pages.
It offers a quick way to acquire data that is presented on the web in a
particular format.
What is it?
WEB SCRAPING
In some cases, APIs are not capable of delivering all of the data we want
from web pages.
With scraping we can access a website anonymously and gather its data.
It is not limited in the amount of data it can collect.
Why is it needed?
WEB SCRAPING
1. Access the target website using an HTTP library such as requests, urllib,
or httplib.
2. Parse the content using a web-parsing library such as Beautiful Soup,
lxml, or regular expressions.
3. Save the result in the required format: a database table, CSV, Excel, text
file, etc.
How does it work?
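The whole flow fits in a few lines. A minimal sketch, assuming requests and Beautiful Soup are installed; the URL and the 'title' CSS class are placeholders for illustration:

```python
import csv

import requests
from bs4 import BeautifulSoup

# Step 1: access the target website (the URL is a placeholder).
response = requests.get("https://example.com/products")
response.raise_for_status()

# Step 2: parse the content; the 'title' class is a hypothetical selector.
soup = BeautifulSoup(response.text, "html.parser")
titles = [tag.get_text(strip=True) for tag in soup.find_all("h2", class_="title")]

# Step 3: save the result in the required format (CSV here).
with open("titles.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["title"])
    writer.writerows([t] for t in titles)
```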
WEB SCRAPING
Requests
Requests is a Python HTTP library that allows us to send HTTP requests
from Python code.
Part1: Accessing Data
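A minimal sketch of accessing a page with requests (the URL is a placeholder):

```python
import requests

response = requests.get("https://example.com", timeout=10)
print(response.status_code)              # e.g. 200
print(response.headers["Content-Type"])  # e.g. text/html; charset=UTF-8
html = response.text                     # body decoded as text
```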
WEB SCRAPING
Urllib3
urllib3 is a powerful, user-friendly HTTP client for Python.
Part1: Accessing Data
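The same request sketched with urllib3 (the URL is a placeholder):

```python
import urllib3

# A PoolManager handles connection pooling and thread safety for us.
http = urllib3.PoolManager()
response = http.request("GET", "https://example.com")
print(response.status)
html = response.data.decode("utf-8")
```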
WEB SCRAPING
httplib2
httplib2 is a small, fast HTTP client library for Python. It features persistent
connections, caching, and Google App Engine support.
Part1: Accessing Data
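And with httplib2; passing a directory name enables its on-disk cache (the URL is a placeholder):

```python
import httplib2

h = httplib2.Http(".cache")  # ".cache" is an arbitrary cache directory
response, content = h.request("https://example.com", "GET")
print(response.status)
html = content.decode("utf-8")
```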
WEB SCRAPING
BeautifulSoup4
Beautiful Soup is a parsing library that makes it easy to scrape information
from web pages.
It sits atop an HTML or XML parser, providing Pythonic idioms for iterating,
searching, and modifying the parse tree.
It is very easy to use, but slow at parsing.
Part2: Parsing Content
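A small sketch of those idioms on an inline HTML snippet:

```python
from bs4 import BeautifulSoup

html = """
<html><body>
  <p class="story">Once upon a time <a href="/elsie">three</a>
  little <a href="/lacie">sisters</a> lived.</p>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")
# Iterate and search the parse tree with Pythonic idioms.
for link in soup.find_all("a"):
    print(link["href"], "->", link.get_text())
print(soup.find("p", class_="story").get_text(strip=True))
```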
WEB SCRAPING
BeautifulSoup4
It can handle broken markup, and it is implemented purely in Python.
Part2: Parsing Content
WEB SCRAPING
lxml
lxml is the most feature-rich and easy-to-use library for processing XML
and HTML in Python; it represents documents as an element tree.
It is very fast at processing.
Its code is not purely Python (it binds the C libraries libxml2 and libxslt).
Part2: Parsing Content
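A minimal lxml sketch; the XPath query runs against the element tree:

```python
from lxml import html

doc = html.fromstring("""
<html><body>
  <div id="prices"><span>10.99</span><span>12.50</span></div>
</body></html>
""")

# XPath queries against the element tree are very fast.
for price in doc.xpath('//div[@id="prices"]/span/text()'):
    print(price)
```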
WEB SCRAPING
lxml
lxml works with all Python versions, from 2.x to 3.x.
Part2: Parsing Content
WEB SCRAPING
RegEx
Python's re library is used to work with regular expressions.
It parses data based on the pattern we supply.
It is best used to extract small amounts of text.
To use it we have to learn its symbols, e.g. '.', '*', '$', '^', '\b', '\w', '\d'.
Part2: Parsing Content
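A small sketch using Python's built-in re module; the pattern is a deliberately simplified email matcher, for illustration only:

```python
import re

text = "Contact us at info@example.com or sales@example.org"

# \w = word character, + = one or more, \. = a literal dot.
pattern = r"[\w.]+@[\w.]+\.\w+"
print(re.findall(pattern, text))
# ['info@example.com', 'sales@example.org']
```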
WEB SCRAPING
RegEx
We can code purely in Python.
It is very fast and supports all versions of Python.
Part2: Parsing Content
WEB SCRAPING
After parsing we have the collection of data that we want to work with.
We can then convert it into a convenient format for later use.
The data can be saved in various formats: a database table, a comma-
separated values (CSV) file, an Excel file, or a plain text file.
Part3: Saving Result
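A sketch of saving the same rows to both a CSV file and a database table (SQLite here for simplicity; the sample rows are made up):

```python
import csv
import sqlite3

rows = [("Book A", "24.99"), ("Book B", "39.99")]  # sample scraped data

# Comma-separated values (CSV) file.
with open("books.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["title", "price"])
    writer.writerows(rows)

# Database table.
conn = sqlite3.connect("books.db")
conn.execute("CREATE TABLE IF NOT EXISTS books (title TEXT, price TEXT)")
conn.executemany("INSERT INTO books VALUES (?, ?)", rows)
conn.commit()
conn.close()
```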
WEB SCRAPING
The requests library is much slower than the others, but it has the advantage
of supporting RESTful APIs.
httplib2 consumes the least execution time, but it is hard to combine with
other languages.
Time Comparison:
Comparison: Http Libraries
WEB SCRAPING
Beautiful Soup consumes more time to parse the data, but it is widely used
because of its high level of support.
RegEx is very easy to use and runs faster, but it cannot handle complex
situations.
Time Comparison:
Comparison: Parsing Libraries
WEB SCRAPING
Sometimes millions of web pages need to be scraped every day to arrive at a
solution.
Often the source web pages change, and keeping your extraction of the
required data working becomes havoc.
In some cases regex won't work but Beautiful Soup will; the trade-off is that
the output is generated very slowly.
How to do Massive Web Scraping?
WEB SCRAPING
Scrapy is the solution for massive web scraping.
It is a free and open-source web-crawling framework written in Python.
It can also be used to extract data using APIs or as a general-purpose web
crawler.
It comprises almost all the tools we need for web scraping.
How to do Massive Web Scraping?
WEB SCRAPING
 When there are millions of pages to scrape.
 When you want asynchronous processing (multiple requests at a time).
 When the data is funky in nature and not properly formatted.
 When pages have server issues.
 When websites sit behind a login wall.
Scrapy: When to Use?
WEB SCRAPING
1. Define a scraper.
2. Define the items to extract.
3. Create a spider to crawl.
4. Run the scraper.
Scrapy: Workflow
WEB SCRAPING
First we define the scraper by creating a project.
This creates a directory with the required files and subdirectories.
Scrapy: Defining Scraper
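For example (the project name 'demo' is arbitrary):

```
scrapy startproject demo
```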
WEB SCRAPING
The root directory contains a configuration file, 'scrapy.cfg', and the project's
Python module.
The module folder contains an items file, a pipelines file, a settings file, a
middlewares file, a directory for putting spiders, and an __init__.py file.
Scrapy: Defining Scraper
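The generated layout looks like this (for a project named 'demo'):

```
demo/
    scrapy.cfg            # deploy/configuration file
    demo/                 # project's Python module
        __init__.py
        items.py          # item definitions
        middlewares.py    # project middlewares
        pipelines.py      # project pipelines
        settings.py       # project settings
        spiders/          # directory for putting spiders
            __init__.py
```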
WEB SCRAPING
Items are the containers used to collect the data that is scraped from the
websites.
We define our items by editing 'items.py'.
Scrapy: Defining Items to Extract
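A minimal items.py sketch; the item name and fields are hypothetical examples for a book-listing site:

```python
import scrapy

class BookItem(scrapy.Item):
    # Each Field() declares one piece of data the spider will collect.
    title = scrapy.Field()
    price = scrapy.Field()
    url = scrapy.Field()
```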
WEB SCRAPING
Spiders are classes which define:
 how a certain site will be scraped,
 how to perform the crawl, and
 how to extract structured data from its pages.
Scrapy: Creating a Spider to Crawl
Here is how to create a spider from a sample template:
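The spider name and target domain here are placeholders:

```
scrapy genspider books example.com
```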
WEB SCRAPING
In order to crawl our data we have to define the callback function parse().
It will collect the data of our interest.
We can also define settings in the spider, such as the allowed domains,
callback responses, etc.
Scrapy: Creating a Spider to Crawl
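A minimal spider sketch; the site, CSS selectors, and pagination structure are hypothetical:

```python
import scrapy

class BooksSpider(scrapy.Spider):
    name = "books"
    allowed_domains = ["example.com"]
    start_urls = ["https://example.com/books"]

    def parse(self, response):
        # parse() is the default callback, invoked for each downloaded page.
        for book in response.css("article.product"):
            yield {
                "title": book.css("h3 a::text").get(),
                "price": book.css("p.price::text").get(),
            }
        # Follow pagination, if a 'next' link exists.
        next_page = response.css("li.next a::attr(href)").get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)
```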
WEB SCRAPING
After defining the items and the crawler, we can run our scraper with the
scrapy crawl command. We can also store the scraped data by using Feed Exports.
Scrapy also provides interactive scripting through the built-in Scrapy shell,
which can be launched as shown below.
Scrapy: Run the Scraper
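For example (the spider name and URL are placeholders; -o exports the items via Feed Exports):

```
scrapy crawl books -o books.csv
scrapy shell "https://example.com/books"
```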
WEB SCRAPING
Automation lets the whole process complete without any human
intervention.
An automated browser can also pass through the walls of web pages without
getting blocked.
The solution is Selenium. It is one of the best-known packages for automating
web browser interaction, and it supports Python.
Can we automate it?
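A minimal Selenium sketch, assuming a Chrome driver is installed; the login page and form field names are hypothetical:

```python
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://example.com/login")

# Fill a hypothetical login form and submit it.
driver.find_element(By.NAME, "username").send_keys("user")
driver.find_element(By.NAME, "password").send_keys("secret")
driver.find_element(By.CSS_SELECTOR, "button[type=submit]").click()

# Once the browser has rendered the page, hand the HTML to a parser.
html = driver.page_source
driver.quit()
```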
THANK YOU