Dr. Carlos Rodríguez Contreras
UNAM
Web Scraping
Web Scraping
 Web scraping, web harvesting, or web data extraction is data
scraping used for extracting data from websites.
 Web scraping software may access the World Wide Web directly
using the Hypertext Transfer Protocol, or through a web browser.
 While web scraping can be done manually by a software user, the
term typically refers to automated processes implemented using
a bot or web crawler.
Web Scraping
 It is a form of copying, in which specific data is gathered and copied
from the web, typically into a central local database or spreadsheet,
for later retrieval or analysis.
 Scraping a web page involves fetching it and extracting data from it.
 Fetching is the downloading of a page (which a browser does when
you view the page).
 Therefore, web crawling is a main component of web scraping, to
fetch pages for later processing.
Web Scraping
 Once fetched, the extraction can take place. The content of a page
may be parsed, searched, reformatted, its data copied into a
spreadsheet, and so on.
 Web scrapers typically take something out of a page, to make use
of it for another purpose somewhere else.
 An example would be to find and copy names and phone numbers,
or companies and their URLs, to a list (contact scraping).
Web Scraping
 Web scraping is used as a component of applications for web
indexing, web mining and data mining, online price change
monitoring and price comparison, product review scraping (to watch
the competition), gathering real estate listings, weather data
monitoring, website change detection, research, tracking online
presence and reputation, web mashup and, web data integration.
 Newer forms of web scraping involve listening to data feeds from
web servers. For example, JSON is commonly used as a transport
storage mechanism between the client and the web server.
Web Scraping
Web Scraping
An Exercise on Forbes Lists
 Each year Forbes ranks the world based on a variety categories
ranging from the wealthiest people on the planet to the best
colleges America has to offer.
 Forbes is the preeminent maintainer of covering a wide range of
business related topics including sports, entertainment, individual
wealth, and locations.
 The lists are chocked full of data that can be analyzed, visualized
and merged with other data.
 Upon discovering that Forbes
went to the pains of building an
API, even though it is
undocumented, Alex Bresler
decided to build forbesListR
package to wrap that API and
make it as easy as possible to
access data with a few simple
functions.
Global 2000
Global 2000
Global 2000
Forbes Global 2000
Scraping with R
Web Scraping

Web Scraping

  • 1.
    Dr. Carlos RodríguezContreras UNAM Web Scraping
  • 2.
    Web Scraping  Webscraping, web harvesting, or web data extraction is data scraping used for extracting data from websites.  Web scraping software may access the World Wide Web directly using the Hypertext Transfer Protocol, or through a web browser.  While web scraping can be done manually by a software user, the term typically refers to automated processes implemented using a bot or web crawler.
  • 3.
    Web Scraping  Itis a form of copying, in which specific data is gathered and copied from the web, typically into a central local database or spreadsheet, for later retrieval or analysis.  Scraping a web page involves fetching it and extracting data from it.  Fetching is the downloading of a page (which a browser does when you view the page).  Therefore, web crawling is a main component of web scraping, to fetch pages for later processing.
  • 4.
    Web Scraping  Oncefetched, the extraction can take place. The content of a page may be parsed, searched, reformatted, its data copied into a spreadsheet, and so on.  Web scrapers typically take something out of a page, to make use of it for another purpose somewhere else.  An example would be to find and copy names and phone numbers, or companies and their URLs, to a list (contact scraping).
  • 5.
    Web Scraping  Webscraping is used as a component of applications for web indexing, web mining and data mining, online price change monitoring and price comparison, product review scraping (to watch the competition), gathering real estate listings, weather data monitoring, website change detection, research, tracking online presence and reputation, web mashup and, web data integration.  Newer forms of web scraping involve listening to data feeds from web servers. For example, JSON is commonly used as a transport storage mechanism between the client and the web server.
  • 6.
  • 7.
  • 8.
    An Exercise onForbes Lists
  • 9.
     Each yearForbes ranks the world based on a variety categories ranging from the wealthiest people on the planet to the best colleges America has to offer.  Forbes is the preeminent maintainer of covering a wide range of business related topics including sports, entertainment, individual wealth, and locations.  The lists are chocked full of data that can be analyzed, visualized and merged with other data.
  • 10.
     Upon discoveringthat Forbes went to the pains of building an API, even though it is undocumented, Alex Bresler decided to build forbesListR package to wrap that API and make it as easy as possible to access data with a few simple functions.
  • 11.
  • 12.
  • 13.
  • 14.