Scrapinghub PyCon Philippines 2015

Scraping
the web using
scrapy
Jolo Balbin

Mikko Gozalo

“Getting information off
the Internet is like taking
a drink from a fire
hydrant.”
– Mitchell Kapor,
a variation of former MIT President Jerome Wiesner’s quote

Tons of information
on the internet
News / Rappler, ABS-CBN News, GMA News Online
Social media / Facebook, Twitter
Transportation / MMDA, Waze, DOTC website
Weather / Project NOAH, PAGASA
E-commerce / Lazada, Zalora, eBay, OLX
Government data / PhilGEPS

Tons of information
on the internet
News / What’s trending? What’s happening?
Social media / What is the public sentiment on subject X?
Transportation / What’s the traffic like later?
Weather / What’s the effect of weather on traffic?
E-commerce / Who’s selling item X at the lowest price?
Government data / Where are our taxes going?

The Problem
Not all data are
structured!
How do we turn
unstructured data
into structured
data?

Web scraping
๏ computer software technique of extracting
information from websites.
๏ focuses more on the transformation of unstructured
data on the web, typically in HTML format, into
structured data that can be stored and analyzed in a
central local database or spreadsheet.
https://en.wikipedia.org/wiki/Web_scraping

Conventional Way
๏ Fetch the webpage using urllib, httplib or requests.
๏ Use beautifulsoup4, lxml or regular expressions to
extract information.
๏ Analyze/store the information!
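The three steps above can be sketched with the standard library alone. This is a minimal illustration, not the speakers' code: it assumes Python 3 (where fetching lives in urllib.request), uses regular expressions for the extraction step, and the names fetch, extract and SAMPLE_HTML are made up for this example. Regular expressions are fragile on real-world HTML, so a parser such as beautifulsoup4 or lxml is usually the safer choice.

```python
import re
from urllib.request import urlopen

# Deliberately simple patterns; real pages often need a proper HTML parser.
TITLE_RE = re.compile(r"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)
LINK_RE = re.compile(r'<a\s[^>]*href="([^"]+)"', re.IGNORECASE)


def fetch(url, timeout=10):
    """Step 1: download a page and return its body as text (blocking call)."""
    with urlopen(url, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def extract(html):
    """Step 2: turn raw HTML into a small structured record."""
    match = TITLE_RE.search(html)
    title = match.group(1).strip() if match else None
    return {"title": title, "links": LINK_RE.findall(html)}


# Step 3: analyze or store the record. Inline sample HTML keeps this
# runnable offline; fetch() would supply the text for a live page.
SAMPLE_HTML = (
    "<html><head><title> Demo page </title></head>"
    '<body><a href="/a">A</a> <a class="nav" href="/b">B</a></body></html>'
)
record = extract(SAMPLE_HTML)
```

For a live page, extract(fetch(url)) chains the first two steps, with url being whatever page you are allowed to scrape.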

Conventional Way
๏ Blocking! We have to wait for each request to finish
before we move on to the next.
๏ If it encounters an error somewhere, we’re doomed.
Everything will just halt!

Conventional Way
๏ We can use threading, gevent or other libraries to
make it asynchronous.
๏ We can wrap parts of the code in try-except blocks
to catch possible Exceptions.
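Both workarounds can be combined in a few lines. The sketch below uses concurrent.futures from the standard library (a thread pool, chosen here for illustration in place of raw threading or gevent) and wraps each result in try-except so one failing URL does not halt the others. The names fetch_all and fetcher are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from urllib.request import urlopen


def fetch(url, timeout=10):
    """Blocking download of one URL; returns the raw response body."""
    with urlopen(url, timeout=timeout) as response:
        return response.read()


def fetch_all(urls, fetcher=fetch, max_workers=8):
    """Fetch URLs in worker threads and collect failures instead of halting."""
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map each future back to the URL it was submitted for.
        futures = {pool.submit(fetcher, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                results[url] = future.result()
            except Exception as exc:  # broad on purpose: record and move on
                errors[url] = exc
    return results, errors
```

Calling fetch_all(list_of_urls) returns a dict of successful bodies and a dict of exceptions keyed by URL. Retries, delays and cookie handling would still have to be written by hand.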

Why Scrapy?
๏ Processes requests and responses asynchronously.
๏ Customizable! You can override almost everything.
๏ Handles cookies, delays, timeouts, etc., so you won’t
have to. No need to reinvent the wheel!
๏ Includes Selectors, a parsing library that can parse
HTML and XML using XPath or CSS; or you can just
use Beautiful Soup!

History of Scrapy
๏ An open source framework to scrape websites
๏ Scrapy was started by Pablo Hoffman and Shane
Evans (2007)
๏ Originally a tool used by Shane’s company
๏ They saw the potential, and open sourced it.

Getting Started with scrapy
๏ As easy as pip install Scrapy.
๏ Start a project with scrapy startproject
project_name.
๏ Create your first spider!

At Scrapinghub
๏ Company that provides scraping-related services to
clients around the globe.
๏ Distributed team of 105 people around the world.
๏ Active in contributing to open-source!
๏ Project owner of Scrapy!

Academe/Research
๏ A U.S. Department of Energy National Laboratory
operated by a university in California.
๏ Analyzes the relation between product price, energy
efficiency and other product features of typical home
appliances.
๏ Partnered with Scrapinghub for academic research!

Market Analytics
๏ A UK company that provides price, promotion and online
product positioning analytics.
๏ Helps consumers find the best prices!
๏ Helps online retailers compare their prices with other
retailers.
๏ Helps brands check if retailers are providing accurate
product information.
๏ Partnered with Scrapinghub for their scraping needs!

Government Research
๏ Scrapinghub is participating in DARPA’s Memex.
๏ Crawls the deep web.
๏ Aids in systematically tracking down criminal activity.

using Scrapy
FOR SIDE PROJECTS

MRT Passenger Traffic
๏ Crawls the MRT3 website using Scrapy.
๏ Downloads the CCTV images for each station.
๏ Approximates the relative passenger traffic at a
given moment using computer vision!

*Line status as of July 1 (Wednesday), 6:40pm

*CUBAO STATION status as of July 1

*AYALA STATION NB status as of July 1

MiniBalita.com
๏ A news reader for Philippine news.
๏ Crawls Philippine news websites such as Rappler,
ABS-CBN News, Inquirer, Spin.ph, etc.
๏ Integrated with TextTeaser to produce “mini” balita.

2013 General Elections
๏ Crawled the 2013 General Elections results to find trends.
๏ 70 clustered precincts registered 100% turnout, most
of them in ARMM.
๏ One clustered precinct voted for only one senator.
No one voted for anyone else despite the fact that a
voter may choose up to 12 candidates!

Is SCRAPING legal?
๏ The legality of scraping is a gray area.
๏ Scraping public data is somewhat legal.
๏ Illegality may arise from how the data is used.
๏ Some websites explicitly prohibit scraping.
๏ Always obey robots.txt.
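Checking robots.txt before fetching can be done with the standard library's urllib.robotparser. The sketch below parses an inline rule set so it runs offline; the rules, the user agent name demo-bot and the example.com URLs are all made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Invented rules: every crawler is asked to stay out of /private/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_checker(robots_text):
    """Parse robots.txt content that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


checker = build_checker(ROBOTS_TXT)

# can_fetch() answers: may this user agent request this URL?
allowed = checker.can_fetch("demo-bot", "https://example.com/news/")
blocked = checker.can_fetch("demo-bot", "https://example.com/private/report")
```

For a live site, RobotFileParser also offers set_url() and read() to download the file itself. In Scrapy, the ROBOTSTXT_OBEY setting asks the framework to perform this check for you.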

End. Any Questions?
Jolo Balbin

Twitter: @mojojolo

http://www.summarizerman.com
Mikko Gozalo

Twitter: @mikkogozalo

http://www.mikkogozalo.com
