Make It Rain With Web Scraping
Weather Data
CTPUG - 24 February 2018
Gavin Wiener
Who Am I
What is Web-Scraping
“Web scraping, web harvesting, or web data extraction is data
scraping used for extracting data from websites”
https://en.wikipedia.org/wiki/Web_scraping
Throw Your Hands in The Air Like You Just Don't Care
Who has web scraped?
Why did you web scrape?
Reasons
● Poor (or No) API (Application Programming Interface)
○ “Where there is a will, there is a way”
● Poor design
○ Non-responsive
○ Cluttered
○ E.g. https://whereisthemyciti.com/
● Curiosity
○ Python
○ Web-scraping
○ New skill
Tools
● Python
○ https://www.python.org/
● Requests
○ http://docs.python-requests.org/en/master/user/install/
● BeautifulSoup
○ https://www.crummy.com/software/BeautifulSoup/bs4/doc/
Minimum previous knowledge required
Disclaimer
● Read the website’s policy
○ robots.txt
○ Automatic scraping
● Do not circumvent paid APIs
○ Might be illegal
○ Time is better used
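As a rough illustration of the robots.txt check above, Python's standard library ships `urllib.robotparser`. The sketch below parses a made-up robots.txt (the rules, the `my-scraper` user agent, and the example.com URLs are all assumptions, not taken from any real site) so that it runs offline.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used here so the example runs offline.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(SAMPLE_ROBOTS_TXT)

# "my-scraper" is a placeholder user agent name.
allowed = parser.can_fetch("my-scraper", "https://example.com/weather/cape-town")
blocked = parser.can_fetch("my-scraper", "https://example.com/private/data")
delay = parser.crawl_delay("my-scraper")

print(allowed, blocked, delay)
```

Against a live site, `parser.set_url(...)` followed by `parser.read()` would replace the `parse` call. Passing this check does not replace reading the site's terms of use.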
The Website
https://www.worldweatheronline.com/cape-town-weather/western-cape/za.aspx?day=20&tp=1#hourly
Get Data - Investigate URLs
https://www.worldweatheronline.com/cape-town-weather/western-cape/za.aspx?day=20&tp=1#hourly
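One way to investigate a URL like this is to split it into its parts and rebuild it with different values. The sketch below assumes that `day` selects which day is shown; the slides do not say what `tp` means, so it is simply passed through unchanged. `build_url` is a hypothetical helper name.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

BASE_URL = "https://www.worldweatheronline.com/cape-town-weather/western-cape/za.aspx"


def build_url(day: int, tp: int = 1, fragment: str = "hourly") -> str:
    """Rebuild the page URL for a given day (assumed meaning of `day`)."""
    query = urlencode({"day": day, "tp": tp})
    return f"{BASE_URL}?{query}#{fragment}"


url = build_url(20)

# Take the URL apart again to see which pieces can be varied.
parts = urlsplit(url)
params = parse_qs(parts.query)

print(url)
print(parts.path, params, parts.fragment)
```

Changing only `day` and comparing the resulting pages is a quick way to confirm what each parameter controls.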
Goal
● Like a puzzle
○ Patterns
○ Repetition
● Good HTML structure
○ Web-framework templates
Identify Structure - Inspecting Elements
Single Instance
Isolate Elements
The Goods
I Can Haz Your Data - Code
Getting the Rows
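The code on the original slides is not reproduced in this transcript, so the following is only a minimal sketch of getting table rows with BeautifulSoup. The sample HTML, the `hourly` table id, and the column names are invented for illustration and are not the real page's markup.

```python
from bs4 import BeautifulSoup

# Invented markup standing in for the real page.
SAMPLE_HTML = """
<table id="hourly">
  <tr><th>Time</th><th>Temp (C)</th><th>Rain (mm)</th></tr>
  <tr><td>00:00</td><td>19</td><td>0.0</td></tr>
  <tr><td>03:00</td><td>18</td><td>0.2</td></tr>
</table>
"""


def get_rows(html: str, table_id: str) -> list[list[str]]:
    """Return the text of every <td> cell, one list per table row."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", id=table_id)
    if table is None:
        return []
    rows = []
    for tr in table.find_all("tr"):
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if cells:  # the header row only has <th> cells, so it is skipped
            rows.append(cells)
    return rows


rows = get_rows(SAMPLE_HTML, "hourly")
print(rows)
```

Returning an empty list when the table is missing keeps the caller simple when the page layout is not what was expected.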
The Goods - Banner
I Can Haz Your Data - Code
I Can Haz Your Data - Raw
Summary
1. Identify the structure, and interesting components e.g. <table>, ids, classes
2. Identify how to reach the data e.g. urls
3. ‘Scrape’ the data with code e.g. Requests and BeautifulSoup
4. Profit??
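The first three steps can be tied together roughly as follows. This is a sketch under assumptions: the `temp` class name, the `weather-scraper-demo` user agent, and the function names are hypothetical, and the fetch step is passed in as a parameter so the parsing can be tried without touching a real site.

```python
import requests
from bs4 import BeautifulSoup


def fetch_html(url: str) -> str:
    """Step 2: reach the data by downloading the page."""
    response = requests.get(
        url,
        timeout=10,
        headers={"User-Agent": "weather-scraper-demo"},  # placeholder name
    )
    response.raise_for_status()
    return response.text


def parse_temperatures(html: str) -> list[str]:
    """Steps 1 and 3: pick out the cells identified while inspecting.

    The "temp" class is an assumption, not the real page's class name.
    """
    soup = BeautifulSoup(html, "html.parser")
    return [td.get_text(strip=True) for td in soup.find_all("td", class_="temp")]


def scrape(url: str, fetch=fetch_html) -> list[str]:
    """Download a page with `fetch` and extract the temperature cells."""
    return parse_temperatures(fetch(url))


# Offline usage with a stand-in for the network call.
fake_page = '<table><tr><td class="temp">21</td><td class="rain">0</td></tr></table>'
result = scrape("https://example.com/weather", fetch=lambda url: fake_page)
print(result)
```

Keeping fetching and parsing in separate functions means only the parser needs to change when the site's HTML does.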
Summary - Web Scraping Sometimes #2
● Inconsistent
○ Structure can change
● Code can be messy
● Lots of data manipulation
○ Paying for a well-maintained API is better than headaches
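Much of the data manipulation mentioned above is turning scraped strings into numbers while tolerating missing or odd values. A minimal sketch, assuming cells look like "19 c" or "0.2 mm" (the formats and the `to_number` helper are illustrative, not taken from the site):

```python
import re
from typing import Optional

# Matches an optional minus sign, digits, and an optional decimal part.
NUMBER = re.compile(r"-?\d+(?:\.\d+)?")


def to_number(text: str) -> Optional[float]:
    """Return the first number found in a scraped cell, or None if there is none."""
    match = NUMBER.search(text)
    return float(match.group()) if match else None


def clean_row(row: list[str]) -> list[Optional[float]]:
    """Convert every cell in a row, leaving None where no number was found."""
    return [to_number(cell) for cell in row]


cleaned = clean_row(["19 c", "0.2 mm", "-", "-3 c"])
print(cleaned)
```

Returning `None` rather than raising makes gaps visible in the output without stopping a long scrape.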
gavinwiener@gmail.com
http://github.com/divisionMax/
