Scraping
the web using
scrapy
Jolo Balbin

Mikko Gozalo
“Getting information off
the Internet is like taking
a drink from a fire
hydrant.”
– Mitchell Kapor,
a variation of former MIT President Jerome Wiesner’s quote
Tons of information
on the internet
News / Rappler, ABS-CBN News, GMA News Online
SOCIAL MEDIA / Facebook, Twitter
TRANSPORTATION / MMDA, WAZE, DOTC WEBSITE
WEATHER / Project NOAH, PAGASA
E-COMMERCE / LAZADA, ZALORA, EBAY, OLX
Government DATA / PHILGEPS
Tons of information
on the internet
News / What’s trending? What’s HAPPENING?
SOCIAL MEDIA / What is the public sentiment on subject x?
TRANSPORTATION / What’s the traffic like later?
WEATHER / What’s the effect of weather on traffic?
E-COMMERCE / Who’s selling the cheapest item x?
Government DATA / Where are our taxes going?
The Problem
Not all data are
structured!
HOW DO WE TURN
UNSTRUCTURED DATA
INTO STRUCTURED
DATA?
Web scraping
๏ computer software technique of extracting
information from websites.
๏ focuses more on the transformation of unstructured
data on the web, typically in HTML format, into
structured data that can be stored and analyzed in a
central local database or spreadsheet.
https://en.wikipedia.org/wiki/Web_scraping
Conventional Way
๏ Fetch the webpage using urllib, httplib or requests.
๏ Use beautifulsoup4, lxml or regular expressions to
extract information.
๏ Analyze/store the information!
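The steps above can be sketched with the standard library alone. This is a minimal illustration, not the code shown in the talk: it uses urllib for fetching and a regular expression for extraction, and it assumes (hypothetically) that the headlines we want sit inside <h2> tags.

```python
import re
from urllib.request import urlopen

# Hypothetical pattern: assumes headlines are marked up as <h2>...</h2>.
HEADLINE_RE = re.compile(r"<h2[^>]*>(.*?)</h2>", re.IGNORECASE | re.DOTALL)


def fetch(url):
    """Fetch a page and return its body as text (blocks until done)."""
    with urlopen(url, timeout=10) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def extract_headlines(html):
    """Return the text of every <h2> in an HTML string, inner tags removed."""
    return [re.sub(r"<[^>]+>", "", match).strip()
            for match in HEADLINE_RE.findall(html)]


def scrape(urls):
    """Fetch pages one at a time and collect their headlines."""
    results = {}
    for url in urls:
        # Each request must finish before the next one starts.
        results[url] = extract_headlines(fetch(url))
    return results
```

Regular expressions are brittle on real-world HTML; beautifulsoup4 or lxml would be the sturdier choice for the extraction step.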
Conventional Way
Conventional Way
๏ Blocking! We have to wait for each request to finish
before we move on to the next.
๏ If it encounters an error somewhere, we’re doomed.
Everything will just halt!
Conventional Way
๏ We can use threading, gevent or other libraries to
make it asynchronous.
๏ We can wrap parts of the code in try-except blocks
to catch possible Exceptions.
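One way to apply both fixes at once is a thread pool with a try-except around each result. The sketch below uses concurrent.futures from the standard library (built on threading) rather than gevent; fetch_all and its fetch argument are hypothetical names, and fetch is assumed to be any function that takes a URL and returns the page.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def fetch_all(urls, fetch, max_workers=8):
    """Run fetch(url) for every URL in a thread pool.

    Returns (results, errors): two dicts keyed by URL.
    """
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Submit everything up front so requests overlap instead of queueing.
        futures = {pool.submit(fetch, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                results[url] = future.result()
            except Exception as exc:
                # One bad page is recorded and the rest keep going.
                errors[url] = exc
    return results, errors
```

This works, but retries, delays and cookies would still have to be written by hand.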
Here comes…
Why Scrapy?
๏ Processes requests and responses asynchronously.
๏ Customizable! You can override almost everything.
๏ Handles cookies, delays, timeouts, etc., so you won’t
have to. No need to reinvent the wheel!
๏ Includes Selectors, a parsing library that can parse
HTML and XML using XPath or CSS; or you can just
use Beautiful Soup!
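Much of this behavior is controlled from a project's settings.py. The fragment below is only a sketch: the setting names are standard Scrapy settings, but the values are illustrative assumptions, not recommendations from the talk.

```python
# settings.py (fragment); values are illustrative assumptions

COOKIES_ENABLED = True      # keep a cookie jar across requests
DOWNLOAD_DELAY = 1.0        # seconds to wait between requests to the same site
DOWNLOAD_TIMEOUT = 30       # give up on a response after this many seconds
RETRY_TIMES = 2             # retry failed requests this many extra times
CONCURRENT_REQUESTS = 16    # upper bound on requests in flight at once
ROBOTSTXT_OBEY = True       # respect each site's robots.txt
```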
History of Scrapy
๏ An open source framework to scrape websites
๏ Scrapy was started by Pablo Hoffman and Shane
Evans (2007)
๏ Originally a tool used by Shane’s company
๏ They saw the potential, and open sourced it.
Getting Started with scrapy
๏ As easy as pip install Scrapy.
๏ Start a project with scrapy startproject
project_name.
๏ Create your first spider!
Short Demo
using Scrapy
at Work
At Scrapinghub
๏ Company that provides scraping-related services to
clients around the globe.
๏ Distributed team of 105 people around the world.
๏ Active in contributing to open-source!
๏ Project owner of Scrapy!
Academe/Research
๏ A U.S. Department of Energy National Laboratory
operated by a university in California.
๏ Analyzes the relationship between product price, energy
efficiency and other product features of typical home
appliances.
๏ Partnered with Scrapinghub for academic research!
Market Analytics
๏ A UK company that provides price, promotion and online
product positioning analytics.
๏ Helps consumers find the best prices!
๏ Helps online retailers compare their prices with other
retailers.
๏ Helps brands check if retailers are providing accurate
product information.
๏ Partnered with Scrapinghub for their scraping needs!
Government Research
๏ Scrapinghub is participating in DARPA’s Memex.
๏ Crawls the deep web.
๏ Aids in systematically tracking down criminal activity.
using Scrapy
FOR SIDE PROJECTS
MRT Passenger Traffic
๏ Crawls the MRT3 website using Scrapy.
๏ Downloads the CCTV images for each station.
๏ Approximates the relative passenger traffic at a
given moment using computer vision!
MRT Passenger Traffic
*Line status as of July 1 (Wednesday), 6:40pm
MRT Passenger Traffic
*CUBAO STATION status as of July 1
MRT Passenger Traffic
*AYALA STATION NB status as of July 1
MiniBalita.com
๏ A news reader for Philippine news.
๏ Crawls Philippine news websites such as Rappler,
ABS-CBN News, Inquirer, Spin.ph, etc.
๏ Integrated with TextTeaser to produce “mini” balita.
2013 General Elections
๏ Crawled the 2013 General Elections to find trends.
๏ 70 clustered precincts registered 100% turnout, most
of them in ARMM.
๏ One clustered precinct voted for only one senator.
No one voted for anyone else despite the fact that a
voter may choose up to 12 candidates!
Is SCRAPING legal?
๏ The legality of scraping is a gray area.
๏ Scraping public data is somewhat legal.
๏ Illegality may arise from how the data is used.
๏ Some websites explicitly prohibit scraping.
๏ Always obey robots.txt.
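For scrapers written by hand, the robots.txt check can be sketched with urllib.robotparser from the standard library. The helper name, user agent and rules below are hypothetical; a real crawler would download the site's actual robots.txt first.

```python
from urllib.robotparser import RobotFileParser


def build_checker(robots_txt, user_agent):
    """Return a function that says whether user_agent may fetch a URL."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network access is needed here.
    parser.parse(robots_txt.splitlines())
    return lambda url: parser.can_fetch(user_agent, url)


# Hypothetical rules for illustration only.
EXAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

allowed = build_checker(EXAMPLE_ROBOTS_TXT, "my-scraper")
```

Here allowed("https://example.com/private/data.html") is False, while a URL outside /private/ is permitted. In Scrapy, the same check is handled by the ROBOTSTXT_OBEY setting.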
End. Any Questions?
Jolo Balbin

Twitter: @mojojolo

http://www.summarizerman.com
Mikko Gozalo

Twitter: @mikkogozalo

http://www.mikkogozalo.com

Scrapinghub PyCon Philippines 2015
