Mining the web, no experience required.
Ruairí Fahy, 25th
October 2015
Scrapinghub - Who are we?
● Provider of cloud-based web-crawling
solutions
● Builder of spiders and crawling
solutions
● Creator of open source projects like
Scrapy, Portia and Splash
● Find out more at scrapinghub.com
Mining the web, no experience required. - Ruairi Fahy, 25 October 2015 - Scrapinghub ⓒ 2015
The Project
Obtain and compare house types and
prices across the country
● Build a spider for daft.ie using Portia
● Crawl daft.ie to obtain housing data
● Process the data using Pandas
● Visualise the data using CartoDB
The Basics
Web Scraping - The process of extracting
data from the web
Spider - A piece of software designed to
extract links and items from webpages
Crawl - Visit all pages of interest on a site
using your spider
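As a rough illustration of these terms, the sketch below uses only the Python standard library to pull links (where a crawl could go next) and items (the data itself) out of a single page. It is not how Portia or Scrapy work internally; the sample HTML, the `price` class name and the `TinySpider` class are all hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical listing page; real pages on any site will be structured differently.
SAMPLE_PAGE = """
<html><body>
  <div class="listing"><span class="price">€250,000</span></div>
  <div class="listing"><span class="price">€180,000</span></div>
  <a href="/listings?page=2">Next</a>
  <a href="/listing/123">Details</a>
</body></html>
"""


class TinySpider(HTMLParser):
    """Collects links to follow and price items from one page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []   # pages a crawl could visit next
        self.items = []   # data extracted from this page
        self._in_price = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            # Resolve relative links against the page URL.
            self.links.append(urljoin(self.base_url, attrs["href"]))
        elif tag == "span" and attrs.get("class") == "price":
            self._in_price = True

    def handle_data(self, data):
        if self._in_price:
            self.items.append({"price": data.strip()})

    def handle_endtag(self, tag):
        if tag == "span":
            self._in_price = False


spider = TinySpider("http://example.com/")
spider.feed(SAMPLE_PAGE)
print(spider.links)
print(spider.items)
```

A crawl would repeat this for each collected link until every page of interest has been visited.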
Build a spider using Portia
● Portia is a tool for building spiders
without having to write any code.
● It has a simple UI for loading pages
that you want to extract data from.
● Create Samples by highlighting data
that you want on a page.
● Use these samples to train the
extraction algorithm.
https://github.com/scrapinghub/portia
Run our spider
● Scrapy Cloud - Hosted crawling at scrapinghub.com
● Scrapyd - Run your own server for crawling
● Portiacrawl - Run the spider locally using Scrapy
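For the Scrapyd option, a run is started by POSTing to the server's `schedule.json` endpoint. The sketch below builds that request with the standard library, assuming a Scrapyd server on its default port 6800 and a project that has already been deployed to it. The project name `housing` and spider name `daft` are hypothetical.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_schedule_request(project, spider, host="http://localhost:6800"):
    """Build a POST request for Scrapyd's schedule.json endpoint."""
    body = urlencode({"project": project, "spider": spider}).encode("ascii")
    return Request(host + "/schedule.json", data=body, method="POST")


def schedule(project, spider):
    """Send the request; this only works with a Scrapyd server running."""
    with urlopen(build_schedule_request(project, spider)) as response:
        return response.read().decode("utf-8")


# Hypothetical project and spider names.
request = build_schedule_request("housing", "daft")
print(request.full_url, request.data)
```

Building the request separately from sending it makes it possible to inspect what would be sent without a server available.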
Process our data with Pandas
● The spider has extracted the house type,
price, BER, number of bedrooms and
address for all houses for sale on daft.ie.
● Clean and normalise data
● Add a geopoint column so the houses can
be placed on a map.
● Process fields to prepare them for plotting
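The steps above might look roughly like the following in Pandas. This is a minimal sketch rather than the notebook's actual code: the rows, the column names and the `COUNTY_COORDS` lookup are hypothetical, and the coordinates are approximate values used only for illustration.

```python
import pandas as pd

# Hypothetical rows standing in for the spider's output.
raw = pd.DataFrame({
    "house_type": [" Semi-Detached ", "apartment", "Detached"],
    "price": ["€250,000", "Price on Application", "€1,200,000"],
    "bedrooms": ["3", "2", "5"],
    "county": ["Dublin", "Cork", "Galway"],
})

# Approximate coordinates, for illustration only.
COUNTY_COORDS = {
    "Dublin": (53.35, -6.26),
    "Cork": (51.90, -8.47),
    "Galway": (53.27, -9.05),
}

df = raw.copy()

# Clean: drop everything except digits, then convert to numbers.
# Prices with no digits at all become NaN and are removed.
df["price"] = pd.to_numeric(
    df["price"].str.replace(r"[^0-9]", "", regex=True), errors="coerce"
)
df = df.dropna(subset=["price"]).reset_index(drop=True)
df["bedrooms"] = df["bedrooms"].astype(int)

# Normalise: consistent casing and whitespace for the house type.
df["house_type"] = df["house_type"].str.strip().str.lower()

# Geopoint: latitude and longitude columns so each house can be mapped.
df["latitude"] = df["county"].map(lambda c: COUNTY_COORDS[c][0])
df["longitude"] = df["county"].map(lambda c: COUNTY_COORDS[c][1])

print(df)
```

A real dataset would need a proper geocoding step to turn each address into coordinates; the fixed lookup here only stands in for that.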
Notebook: https://gist.github.com/ruairif/80102746320d0229a0ce
Visualise the data using CartoDB
● Create a dataset from our CSV file
● Plot our data on a map
● Compare prices across the country
● Compare property type
● Compare BER
● http://cdb.io/1POBIU8
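Before uploading, it can help to check the CSV and preview the kind of comparison the map will show. The sketch below writes a few rows as CSV text and computes a median price per county using only the standard library. The rows and column names are hypothetical, and it assumes the mapping tool can read latitude and longitude from columns in the uploaded file.

```python
import csv
import io
from collections import defaultdict
from statistics import median

# Hypothetical cleaned rows; the real dataset comes from the crawl.
rows = [
    {"county": "Dublin", "price": 250000, "latitude": 53.35, "longitude": -6.26},
    {"county": "Dublin", "price": 350000, "latitude": 53.35, "longitude": -6.26},
    {"county": "Cork", "price": 180000, "latitude": 51.90, "longitude": -8.47},
]

# Write the rows as CSV text. For a file on disk, use
# open("houses.csv", "w", newline="") in place of the StringIO buffer.
buffer = io.StringIO()
writer = csv.DictWriter(
    buffer, fieldnames=["county", "price", "latitude", "longitude"]
)
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()

# Read the CSV back and compare median prices per county.
prices = defaultdict(list)
for row in csv.DictReader(io.StringIO(csv_text)):
    prices[row["county"]].append(int(row["price"]))

median_by_county = {county: median(values) for county, values in prices.items()}
print(median_by_county)
```

The median is used here because a handful of very expensive houses would pull a mean upwards.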
We’re Hiring - scrapinghub.com/jobs
Thank you!
Ruairí Fahy, 25th October 2015
ruairi@scrapinghub.com
