SlideShare a Scribd company logo
Mining the web, no experience required.
Ruairí Fahy, 25th
October 2015
Scrapinghub - Who are we?
● Provider of cloud based web-crawling
solutions
● Builder of spiders and crawling
solutions
● Creator of open source projects like
Scrapy, Portia and Splash
● Find out more at scrapinghub.com
Mining the web, no experience required. - Ruairi Fahy, 25 October 2015 - Scrapinghub ⓒ 2015
Splash
Portia
Scrapy
The Project
Obtain and compare house types and
prices across the country
● Build a spider for daft.ie using Portia
● Crawl daft.ie to obtain housing data
● Process the data using Pandas
● Visualise the data using CartoDB
Mining the web, no experience required. - Ruairi Fahy, 25 October 2015 - Scrapinghub ⓒ 2015
The Basics
Web Scraping - The process of extracting
data from the web
Spider - A piece of software designed to
extract links and items from webpages
Crawl - Visit all pages of interest on a site
using your spider
Mining the web, no experience required. - Ruairi Fahy, 25 October 2015 - Scrapinghub ⓒ 2015
Build a spider using Portia
● Portia is a tool for building spiders
without having to write any code.
● It has a simple UI for loading pages
that you want to extract data from.
● Create Samples by highlighting data
that you want on a page.
● Use these samples to train the
extraction algorithm.
Mining the web, no experience required. - Ruairi Fahy, 25 October 2015 - Scrapinghub ⓒ 2015
https://github.com/scrapinghub/portia
Run our spider
● Scrapy Cloud - Hosted crawling at scrapinghub.com
● Scrapyd - Run your own server for crawling
● Portiacrawl - Run the spider locally using scrapy
Mining the web, no experience required. - Ruairi Fahy, 25 October 2015 - Scrapinghub ⓒ 2015
Process our data with Pandas
● The spider has extracted the house type,
price, BER, number of bedrooms and
address for all houses for sale on daft.ie.
● Clean and normalise data
● Add a geopoint column so the houses can
be placed on a map.
● Process fields to prepare them for plotting
Mining the web, no experience required. - Ruairi Fahy, 25 October 2015 - Scrapinghub ⓒ 2015
Notebook: https://gist.github.com/ruairif/80102746320d0229a0ce
Visualise the data using CartoDB
● Create a dataset from our csv file
● Plot our data on a map
● Compare prices across the country
● Compare property type
● Compare BER
● http://cdb.io/1POBIU8
Mining the web, no experience required. - Ruairi Fahy, 25 October 2015 - Scrapinghub ⓒ 2015
We’re Hiring - scrapinghub.com/jobs
Thank you!
Ruairi Fahy, 25th
October 2015
ruairi@scrapinghub.com

More Related Content

Similar to Mining the web, no experience required

ATLRUG Announcements/Upgrade News - August 2016
ATLRUG Announcements/Upgrade News - August 2016ATLRUG Announcements/Upgrade News - August 2016
ATLRUG Announcements/Upgrade News - August 2016
jasnow
 
ATLRUG Community Announcements for December 2016
ATLRUG Community Announcements for December 2016ATLRUG Community Announcements for December 2016
ATLRUG Community Announcements for December 2016
jasnow
 
ATLRUG December 2015
ATLRUG December 2015ATLRUG December 2015
ATLRUG December 2015
jasnow
 
Hong kong drupal user group dec13th responsive web design for dummy
Hong kong drupal user group dec13th responsive web design for dummyHong kong drupal user group dec13th responsive web design for dummy
Hong kong drupal user group dec13th responsive web design for dummy
Ann Lam
 
OSGi IoT Demo & Contest 2015
OSGi IoT Demo & Contest 2015OSGi IoT Demo & Contest 2015
OSGi IoT Demo & Contest 2015
mfrancis
 
ATLRUG Community Announcements - Sept. 2015
ATLRUG Community Announcements - Sept. 2015ATLRUG Community Announcements - Sept. 2015
ATLRUG Community Announcements - Sept. 2015
jasnow
 
ATLRUG Announcements - July 2016
ATLRUG Announcements - July 2016ATLRUG Announcements - July 2016
ATLRUG Announcements - July 2016
jasnow
 
Big Data Processing Utilizing Open-source Technologies - May 2015
Big Data Processing Utilizing Open-source Technologies - May 2015Big Data Processing Utilizing Open-source Technologies - May 2015
Big Data Processing Utilizing Open-source Technologies - May 2015
Amir Sedighi
 
The current state of SAP Integration, SAPPHIRENOW 2018
The current state of SAP Integration, SAPPHIRENOW 2018The current state of SAP Integration, SAPPHIRENOW 2018
The current state of SAP Integration, SAPPHIRENOW 2018
Daniel Graversen
 
ATLRUG May 2015 Announcements
ATLRUG May 2015 AnnouncementsATLRUG May 2015 Announcements
ATLRUG May 2015 Announcements
jasnow
 
ATLRUG Community Announcements - Oct. 2015
ATLRUG Community Announcements - Oct. 2015ATLRUG Community Announcements - Oct. 2015
ATLRUG Community Announcements - Oct. 2015
jasnow
 
ATLRUG Announcements for Feb. 2016
ATLRUG Announcements for Feb. 2016ATLRUG Announcements for Feb. 2016
ATLRUG Announcements for Feb. 2016
jasnow
 
Prototype your dream
Prototype your dreamPrototype your dream
Prototype your dream
Paul Ardeleanu
 
Publishing your open source project
Publishing your open source projectPublishing your open source project
Publishing your open source project
Rishi Pithadiya
 
Web scraping with Ruby
Web scraping with RubyWeb scraping with Ruby
Web scraping with Ruby
Hidehiro Nagaoka
 
Build and Deploy a Python Web App to Amazon in 30 Mins
Build and Deploy a Python Web App to Amazon in 30 MinsBuild and Deploy a Python Web App to Amazon in 30 Mins
Build and Deploy a Python Web App to Amazon in 30 Mins
Jeff Hull
 
ATLRUG Community/Giveback Announcments
ATLRUG Community/Giveback AnnouncmentsATLRUG Community/Giveback Announcments
ATLRUG Community/Giveback Announcments
jasnow
 
ATLRUG Announcements - October 2016
ATLRUG Announcements - October 2016ATLRUG Announcements - October 2016
ATLRUG Announcements - October 2016
jasnow
 
20150624 Belgian GraphDB meetup at Ordina
20150624 Belgian GraphDB meetup at Ordina20150624 Belgian GraphDB meetup at Ordina
20150624 Belgian GraphDB meetup at Ordina
Rik Van Bruggen
 
Drawbridge_MeetUp_June19_072414
Drawbridge_MeetUp_June19_072414Drawbridge_MeetUp_June19_072414
Drawbridge_MeetUp_June19_072414Nitin Panjwani
 

Similar to Mining the web, no experience required (20)

ATLRUG Announcements/Upgrade News - August 2016
ATLRUG Announcements/Upgrade News - August 2016ATLRUG Announcements/Upgrade News - August 2016
ATLRUG Announcements/Upgrade News - August 2016
 
ATLRUG Community Announcements for December 2016
ATLRUG Community Announcements for December 2016ATLRUG Community Announcements for December 2016
ATLRUG Community Announcements for December 2016
 
ATLRUG December 2015
ATLRUG December 2015ATLRUG December 2015
ATLRUG December 2015
 
Hong kong drupal user group dec13th responsive web design for dummy
Hong kong drupal user group dec13th responsive web design for dummyHong kong drupal user group dec13th responsive web design for dummy
Hong kong drupal user group dec13th responsive web design for dummy
 
OSGi IoT Demo & Contest 2015
OSGi IoT Demo & Contest 2015OSGi IoT Demo & Contest 2015
OSGi IoT Demo & Contest 2015
 
ATLRUG Community Announcements - Sept. 2015
ATLRUG Community Announcements - Sept. 2015ATLRUG Community Announcements - Sept. 2015
ATLRUG Community Announcements - Sept. 2015
 
ATLRUG Announcements - July 2016
ATLRUG Announcements - July 2016ATLRUG Announcements - July 2016
ATLRUG Announcements - July 2016
 
Big Data Processing Utilizing Open-source Technologies - May 2015
Big Data Processing Utilizing Open-source Technologies - May 2015Big Data Processing Utilizing Open-source Technologies - May 2015
Big Data Processing Utilizing Open-source Technologies - May 2015
 
The current state of SAP Integration, SAPPHIRENOW 2018
The current state of SAP Integration, SAPPHIRENOW 2018The current state of SAP Integration, SAPPHIRENOW 2018
The current state of SAP Integration, SAPPHIRENOW 2018
 
ATLRUG May 2015 Announcements
ATLRUG May 2015 AnnouncementsATLRUG May 2015 Announcements
ATLRUG May 2015 Announcements
 
ATLRUG Community Announcements - Oct. 2015
ATLRUG Community Announcements - Oct. 2015ATLRUG Community Announcements - Oct. 2015
ATLRUG Community Announcements - Oct. 2015
 
ATLRUG Announcements for Feb. 2016
ATLRUG Announcements for Feb. 2016ATLRUG Announcements for Feb. 2016
ATLRUG Announcements for Feb. 2016
 
Prototype your dream
Prototype your dreamPrototype your dream
Prototype your dream
 
Publishing your open source project
Publishing your open source projectPublishing your open source project
Publishing your open source project
 
Web scraping with Ruby
Web scraping with RubyWeb scraping with Ruby
Web scraping with Ruby
 
Build and Deploy a Python Web App to Amazon in 30 Mins
Build and Deploy a Python Web App to Amazon in 30 MinsBuild and Deploy a Python Web App to Amazon in 30 Mins
Build and Deploy a Python Web App to Amazon in 30 Mins
 
ATLRUG Community/Giveback Announcments
ATLRUG Community/Giveback AnnouncmentsATLRUG Community/Giveback Announcments
ATLRUG Community/Giveback Announcments
 
ATLRUG Announcements - October 2016
ATLRUG Announcements - October 2016ATLRUG Announcements - October 2016
ATLRUG Announcements - October 2016
 
20150624 Belgian GraphDB meetup at Ordina
20150624 Belgian GraphDB meetup at Ordina20150624 Belgian GraphDB meetup at Ordina
20150624 Belgian GraphDB meetup at Ordina
 
Drawbridge_MeetUp_June19_072414
Drawbridge_MeetUp_June19_072414Drawbridge_MeetUp_June19_072414
Drawbridge_MeetUp_June19_072414
 

Recently uploaded

Large Language Models and the End of Programming
Large Language Models and the End of ProgrammingLarge Language Models and the End of Programming
Large Language Models and the End of Programming
Matt Welsh
 
Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...
Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...
Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...
Globus
 
First Steps with Globus Compute Multi-User Endpoints
First Steps with Globus Compute Multi-User EndpointsFirst Steps with Globus Compute Multi-User Endpoints
First Steps with Globus Compute Multi-User Endpoints
Globus
 
Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...
Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...
Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...
Globus
 
Quarkus Hidden and Forbidden Extensions
Quarkus Hidden and Forbidden ExtensionsQuarkus Hidden and Forbidden Extensions
Quarkus Hidden and Forbidden Extensions
Max Andersen
 
Utilocate provides Smarter, Better, Faster, Safer Locate Ticket Management
Utilocate provides Smarter, Better, Faster, Safer Locate Ticket ManagementUtilocate provides Smarter, Better, Faster, Safer Locate Ticket Management
Utilocate provides Smarter, Better, Faster, Safer Locate Ticket Management
Utilocate
 
Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...
Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...
Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...
Globus
 
Navigating the Metaverse: A Journey into Virtual Evolution"
Navigating the Metaverse: A Journey into Virtual Evolution"Navigating the Metaverse: A Journey into Virtual Evolution"
Navigating the Metaverse: A Journey into Virtual Evolution"
Donna Lenk
 
Text-Summarization-of-Breaking-News-Using-Fine-tuning-BART-Model.pptx
Text-Summarization-of-Breaking-News-Using-Fine-tuning-BART-Model.pptxText-Summarization-of-Breaking-News-Using-Fine-tuning-BART-Model.pptx
Text-Summarization-of-Breaking-News-Using-Fine-tuning-BART-Model.pptx
ShamsuddeenMuhammadA
 
Understanding Globus Data Transfers with NetSage
Understanding Globus Data Transfers with NetSageUnderstanding Globus Data Transfers with NetSage
Understanding Globus Data Transfers with NetSage
Globus
 
Top Features to Include in Your Winzo Clone App for Business Growth (4).pptx
Top Features to Include in Your Winzo Clone App for Business Growth (4).pptxTop Features to Include in Your Winzo Clone App for Business Growth (4).pptx
Top Features to Include in Your Winzo Clone App for Business Growth (4).pptx
rickgrimesss22
 
Cracking the code review at SpringIO 2024
Cracking the code review at SpringIO 2024Cracking the code review at SpringIO 2024
Cracking the code review at SpringIO 2024
Paco van Beckhoven
 
Providing Globus Services to Users of JASMIN for Environmental Data Analysis
Providing Globus Services to Users of JASMIN for Environmental Data AnalysisProviding Globus Services to Users of JASMIN for Environmental Data Analysis
Providing Globus Services to Users of JASMIN for Environmental Data Analysis
Globus
 
How to Position Your Globus Data Portal for Success Ten Good Practices
How to Position Your Globus Data Portal for Success Ten Good PracticesHow to Position Your Globus Data Portal for Success Ten Good Practices
How to Position Your Globus Data Portal for Success Ten Good Practices
Globus
 
Graphic Design Crash Course for beginners
Graphic Design Crash Course for beginnersGraphic Design Crash Course for beginners
Graphic Design Crash Course for beginners
e20449
 
Gamify Your Mind; The Secret Sauce to Delivering Success, Continuously Improv...
Gamify Your Mind; The Secret Sauce to Delivering Success, Continuously Improv...Gamify Your Mind; The Secret Sauce to Delivering Success, Continuously Improv...
Gamify Your Mind; The Secret Sauce to Delivering Success, Continuously Improv...
Shahin Sheidaei
 
Launch Your Streaming Platforms in Minutes
Launch Your Streaming Platforms in MinutesLaunch Your Streaming Platforms in Minutes
Launch Your Streaming Platforms in Minutes
Roshan Dwivedi
 
Webinar: Salesforce Document Management 2.0 - Smarter, Faster, Better
Webinar: Salesforce Document Management 2.0 - Smarter, Faster, BetterWebinar: Salesforce Document Management 2.0 - Smarter, Faster, Better
Webinar: Salesforce Document Management 2.0 - Smarter, Faster, Better
XfilesPro
 
APIs for Browser Automation (MoT Meetup 2024)
APIs for Browser Automation (MoT Meetup 2024)APIs for Browser Automation (MoT Meetup 2024)
APIs for Browser Automation (MoT Meetup 2024)
Boni García
 
BoxLang: Review our Visionary Licenses of 2024
BoxLang: Review our Visionary Licenses of 2024BoxLang: Review our Visionary Licenses of 2024
BoxLang: Review our Visionary Licenses of 2024
Ortus Solutions, Corp
 

Recently uploaded (20)

Large Language Models and the End of Programming
Large Language Models and the End of ProgrammingLarge Language Models and the End of Programming
Large Language Models and the End of Programming
 
Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...
Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...
Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...
 
First Steps with Globus Compute Multi-User Endpoints
First Steps with Globus Compute Multi-User EndpointsFirst Steps with Globus Compute Multi-User Endpoints
First Steps with Globus Compute Multi-User Endpoints
 
Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...
Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...
Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...
 
Quarkus Hidden and Forbidden Extensions
Quarkus Hidden and Forbidden ExtensionsQuarkus Hidden and Forbidden Extensions
Quarkus Hidden and Forbidden Extensions
 
Utilocate provides Smarter, Better, Faster, Safer Locate Ticket Management
Utilocate provides Smarter, Better, Faster, Safer Locate Ticket ManagementUtilocate provides Smarter, Better, Faster, Safer Locate Ticket Management
Utilocate provides Smarter, Better, Faster, Safer Locate Ticket Management
 
Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...
Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...
Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...
 
Navigating the Metaverse: A Journey into Virtual Evolution"
Navigating the Metaverse: A Journey into Virtual Evolution"Navigating the Metaverse: A Journey into Virtual Evolution"
Navigating the Metaverse: A Journey into Virtual Evolution"
 
Text-Summarization-of-Breaking-News-Using-Fine-tuning-BART-Model.pptx
Text-Summarization-of-Breaking-News-Using-Fine-tuning-BART-Model.pptxText-Summarization-of-Breaking-News-Using-Fine-tuning-BART-Model.pptx
Text-Summarization-of-Breaking-News-Using-Fine-tuning-BART-Model.pptx
 
Understanding Globus Data Transfers with NetSage
Understanding Globus Data Transfers with NetSageUnderstanding Globus Data Transfers with NetSage
Understanding Globus Data Transfers with NetSage
 
Top Features to Include in Your Winzo Clone App for Business Growth (4).pptx
Top Features to Include in Your Winzo Clone App for Business Growth (4).pptxTop Features to Include in Your Winzo Clone App for Business Growth (4).pptx
Top Features to Include in Your Winzo Clone App for Business Growth (4).pptx
 
Cracking the code review at SpringIO 2024
Cracking the code review at SpringIO 2024Cracking the code review at SpringIO 2024
Cracking the code review at SpringIO 2024
 
Providing Globus Services to Users of JASMIN for Environmental Data Analysis
Providing Globus Services to Users of JASMIN for Environmental Data AnalysisProviding Globus Services to Users of JASMIN for Environmental Data Analysis
Providing Globus Services to Users of JASMIN for Environmental Data Analysis
 
How to Position Your Globus Data Portal for Success Ten Good Practices
How to Position Your Globus Data Portal for Success Ten Good PracticesHow to Position Your Globus Data Portal for Success Ten Good Practices
How to Position Your Globus Data Portal for Success Ten Good Practices
 
Graphic Design Crash Course for beginners
Graphic Design Crash Course for beginnersGraphic Design Crash Course for beginners
Graphic Design Crash Course for beginners
 
Gamify Your Mind; The Secret Sauce to Delivering Success, Continuously Improv...
Gamify Your Mind; The Secret Sauce to Delivering Success, Continuously Improv...Gamify Your Mind; The Secret Sauce to Delivering Success, Continuously Improv...
Gamify Your Mind; The Secret Sauce to Delivering Success, Continuously Improv...
 
Launch Your Streaming Platforms in Minutes
Launch Your Streaming Platforms in MinutesLaunch Your Streaming Platforms in Minutes
Launch Your Streaming Platforms in Minutes
 
Webinar: Salesforce Document Management 2.0 - Smarter, Faster, Better
Webinar: Salesforce Document Management 2.0 - Smarter, Faster, BetterWebinar: Salesforce Document Management 2.0 - Smarter, Faster, Better
Webinar: Salesforce Document Management 2.0 - Smarter, Faster, Better
 
APIs for Browser Automation (MoT Meetup 2024)
APIs for Browser Automation (MoT Meetup 2024)APIs for Browser Automation (MoT Meetup 2024)
APIs for Browser Automation (MoT Meetup 2024)
 
BoxLang: Review our Visionary Licenses of 2024
BoxLang: Review our Visionary Licenses of 2024BoxLang: Review our Visionary Licenses of 2024
BoxLang: Review our Visionary Licenses of 2024
 

Mining the web, no experience required

  • 1. Mining the web, no experience required. Ruairí Fahy, 25th October 2015
  • 2. Scrapinghub - Who are we? ● Provider of cloud based web-crawling solutions ● Builder of spiders and crawling solutions ● Creator of open source projects like Scrapy, Portia and Splash ● Find out more at scrapinghub.com Mining the web, no experience required. - Ruairi Fahy, 25 October 2015 - Scrapinghub ⓒ 2015 Splash Portia Scrapy
  • 3. The Project Obtain and compare house types and prices across the country ● Build a spider for daft.ie using Portia ● Crawl daft.ie to obtain housing data ● Process the data using Pandas ● Visualise the data using CartoDB Mining the web, no experience required. - Ruairi Fahy, 25 October 2015 - Scrapinghub ⓒ 2015
  • 4. The Basics Web Scraping - The process of extracting data from the web Spider - A piece of software designed to extract links and items from webpages Crawl - Visit all pages of interest on a site using your spider Mining the web, no experience required. - Ruairi Fahy, 25 October 2015 - Scrapinghub ⓒ 2015
  • 5. Build a spider using Portia ● Portia is a tool for building spiders without having to write any code. ● It has a simple UI for loading pages that you want to extract data from. ● Create Samples by highlighting data that you want on a page. ● Use these samples to train the extraction algorithm. Mining the web, no experience required. - Ruairi Fahy, 25 October 2015 - Scrapinghub ⓒ 2015 https://github.com/scrapinghub/portia
  • 6. Run our spider ● Scrapy Cloud - Hosted crawling at scrapinghub.com ● Scrapyd - Run your own server for crawling ● Portiacrawl - Run the spider locally using scrapy Mining the web, no experience required. - Ruairi Fahy, 25 October 2015 - Scrapinghub ⓒ 2015
  • 7. Process our data with Pandas ● The spider has extracted the house type, price, BER, number of bedrooms and address for all houses for sale on daft.ie. ● Clean and normalise data ● Add a geopoint column so the houses can be placed on a map. ● Process fields to prepare them for plotting Mining the web, no experience required. - Ruairi Fahy, 25 October 2015 - Scrapinghub ⓒ 2015 Notebook: https://gist.github.com/ruairif/80102746320d0229a0ce
  • 8. Visualise the data using CartoDB ● Create a dataset from our csv file ● Plot our data on a map ● Compare prices across the country ● Compare property type ● Compare BER ● http://cdb.io/1POBIU8 Mining the web, no experience required. - Ruairi Fahy, 25 October 2015 - Scrapinghub ⓒ 2015
  • 9. We’re Hiring - scrapinghub.com/jobs
  • 10. Thank you! Ruairi Fahy, 25th October 2015 ruairi@scrapinghub.com