Mining the web, no experience required.
Ruairí Fahy, 25th October 2015
Scrapinghub - Who are we?
● Provider of cloud-based web-crawling
solutions
● Builder of spiders and crawling
solutions
● Creator of open source projects like
Scrapy, Portia and Splash
● Find out more at scrapinghub.com
Mining the web, no experience required. - Ruairi Fahy, 25 October 2015 - Scrapinghub ⓒ 2015
The Project
Obtain and compare house types and
prices across the country
● Build a spider for daft.ie using Portia
● Crawl daft.ie to obtain housing data
● Process the data using Pandas
● Visualise the data using CartoDB
The Basics
Web Scraping - The process of extracting
data from the web
Spider - A piece of software designed to
extract links and items from webpages
Crawl - Visit all pages of interest on a site
using your spider
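
To make the terms above concrete, here is a minimal sketch of what a spider does with one page: it collects links to follow and items to keep. It uses only the Python standard library, not Scrapy or Portia. The sample HTML, the `price` class name and the `example.com` base URL are assumptions made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical page content, standing in for a real listing page.
SAMPLE_PAGE = """
<html><body>
  <span class="price">EUR 250,000</span>
  <a href="/listing/1">Listing 1</a>
  <a href="/search?page=2">Next page</a>
</body></html>
"""


class LinkAndItemParser(HTMLParser):
    """Collects links to follow and the text of elements marked as items."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self.items = []
        self._in_item = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # Every anchor with an href is a candidate page for the crawl.
        if tag == "a" and attrs.get("href"):
            self.links.append(urljoin(self.base_url, attrs["href"]))
        # Assumption: the data we want sits in elements with class "price".
        if attrs.get("class") == "price":
            self._in_item = True

    def handle_endtag(self, tag):
        # Good enough for flat markup like the sample; nested items need a stack.
        self._in_item = False

    def handle_data(self, data):
        if self._in_item and data.strip():
            self.items.append(data.strip())


def parse_page(html, base_url):
    """Return (links, items) found in one page."""
    parser = LinkAndItemParser(base_url)
    parser.feed(html)
    return parser.links, parser.items


links, items = parse_page(SAMPLE_PAGE, "https://example.com/")
```

A crawl is then a loop: fetch each link that has not been seen yet, parse it the same way, and repeat until no pages of interest remain.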
Build a spider using Portia
● Portia is a tool for building spiders
without having to write any code.
● It has a simple UI for loading pages
that you want to extract data from.
● Create Samples by highlighting data
that you want on a page.
● Use these samples to train the
extraction algorithm.
https://github.com/scrapinghub/portia
Run our spider
● Scrapy Cloud - Hosted crawling at scrapinghub.com
● Scrapyd - Run your own server for crawling
● Portiacrawl - Run the spider locally using Scrapy
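
For the Scrapyd option, a run is started by sending a POST request to the server's `schedule.json` endpoint. The sketch below only builds that request and does not send it. The project name `housing`, the spider name `daft.ie` and the host (Scrapyd's default port 6800 on the local machine) are assumptions for illustration.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_schedule_request(project, spider, host="http://localhost:6800"):
    """Build (but do not send) a request asking Scrapyd to run a spider."""
    data = urlencode({"project": project, "spider": spider}).encode("utf-8")
    return Request(host + "/schedule.json", data=data, method="POST")


# Hypothetical project and spider names.
request = build_schedule_request("housing", "daft.ie")
```

Once a Scrapyd server is running and the project has been deployed to it, passing `request` to `urllib.request.urlopen` would schedule the job.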
Process our data with Pandas
● The spider has extracted the house type,
price, BER, number of bedrooms and
address for all houses for sale on daft.ie.
● Clean and normalise data
● Add a geopoint column so the houses can
be placed on a map.
● Process fields to prepare them for plotting
Notebook: https://gist.github.com/ruairif/80102746320d0229a0ce
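
The cleaning steps on this slide might look roughly like the sketch below. It is not the code from the notebook. The column names, the raw value formats and the presence of `latitude` and `longitude` columns are all assumptions; in practice the coordinates would have to come from geocoding the address.

```python
import pandas as pd

# Hypothetical raw rows, shaped like what a spider might extract.
raw = pd.DataFrame({
    "house_type": [" Detached House", "apartment ", "Terraced House"],
    "price": ["€250,000", "€1,200,000", "Price on Application"],
    "bedrooms": ["3 Beds", "2 Beds", "4 Beds"],
    "latitude": [53.35, 51.90, 53.27],
    "longitude": [-6.26, -8.47, -9.05],
})


def clean_listings(df):
    """Normalise text fields, parse numbers and add a geopoint column."""
    df = df.copy()
    # Normalise house types so "apartment " and "Apartment" compare equal.
    df["house_type"] = df["house_type"].str.strip().str.lower()
    # Keep only digits; values with no digits become NaN.
    df["price"] = pd.to_numeric(
        df["price"].str.replace(r"[^\d]", "", regex=True), errors="coerce"
    )
    # "3 Beds" -> 3
    df["bedrooms"] = df["bedrooms"].str.extract(r"(\d+)", expand=False).astype(int)
    # Assumed "lat,lng" text format for the geopoint column.
    df["geopoint"] = df["latitude"].astype(str) + "," + df["longitude"].astype(str)
    # Listings without a usable price cannot be compared, so drop them.
    return df.dropna(subset=["price"])


listings = clean_listings(raw)
```

Dropping rows without a price is a design choice for this sketch; keeping them and plotting only location would also be reasonable.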
Visualise the data using CartoDB
● Create a dataset from our CSV file
● Plot our data on a map
● Compare prices across the country
● Compare property type
● Compare BER
● http://cdb.io/1POBIU8
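
Before uploading, the CSV can be produced and the price comparison sanity-checked in Pandas. This is a stand-in for a quick check, not the CartoDB workflow itself. The `county` column and the sample rows are assumptions for illustration.

```python
import io

import pandas as pd

# Hypothetical cleaned listings.
listings = pd.DataFrame({
    "county": ["Dublin", "Dublin", "Cork", "Galway"],
    "house_type": ["apartment", "detached house", "terraced house", "detached house"],
    "price": [300000, 500000, 200000, 250000],
})

# Median asking price per county, highest first.
median_by_county = (
    listings.groupby("county")["price"].median().sort_values(ascending=False)
)

# Write the CSV to an in-memory buffer; a file path works the same way.
buffer = io.StringIO()
listings.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
```

The median is used instead of the mean so that a handful of very expensive properties does not dominate a county's figure.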
We’re Hiring - scrapinghub.com/jobs
Thank you!
Ruairí Fahy, 25th October 2015
ruairi@scrapinghub.com