Getting started with Web Scraping in Python


All the necessary tricks, libraries, and tools that a beginner should know to scrape a site successfully with Python. Instead of concentrating on code, I focus on building the reader's intuition so that they can decide which path to take.

Scraping to the rescue
(Web scraping using Python)
By: Satwik Kansal and Pradhvan Bisht

What is web scraping?
Web scraping is a technique for extracting large amounts of
data from websites and saving it to a local file on your computer.
The data can then be used for several purposes, such as displaying it on
your own website or application, or performing data analysis.

Why should you scrape?
- An API may not provide what you need
- No rate limit
- Take what you really want!
- Reduces manual effort
- Swag!

Things that might come in handy
- HTML
- CSS
- XPath
- Regular expressions

How is it done?
Broadly, a three-step process:
1. Getting the content (in most cases HTML).
2. Parsing the response.
3. Optimizing performance and preserving the data.

GETTING THE CONTENT
● Use modules such as urllib, urllib2 (Python 2 only), requests, mechanize and selenium.
● This involves sending a GET or POST request to the server.
● The response contains the information to be extracted.
● Sometimes it is not as easy as it may seem.
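The first step can be sketched with the standard library's urllib alone. This is a minimal illustration, not the authors' code: the function name `fetch_html`, the `example-scraper/0.1` User-Agent string and the 10-second timeout are all assumptions chosen for the example.

```python
import urllib.request


def fetch_html(url, user_agent="example-scraper/0.1", timeout=10):
    """Send a GET request and return the response body as text."""
    # Some servers reject the default Python User-Agent, so set one explicitly.
    # The value used here is a made-up placeholder.
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        # Fall back to UTF-8 when the server does not declare a charset.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

Calling `fetch_html("https://example.com")` (a placeholder address) would return the page source as a string, ready for the parsing step.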

Extracting The Data
1. Using regular expressions and basic Python
Tricky, complex and rather fragile.
2. Using parsing libraries
❏ Two different approaches are possible: simple parsing and search-tree parsing.
❏ Some popular libraries are BeautifulSoup, lxml and html5lib.
❏ Each module has its own techniques and therefore its own strengths and trade-offs.
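The fragility of regular expressions compared with a real parser can be seen in a small sketch. To keep it runnable without extra installs, it uses the standard library's `html.parser` in place of the libraries named above; the sample markup, the `post` class name and the `LinkCollector` class are invented for the illustration.

```python
import re
from html.parser import HTMLParser

# Made-up markup: the two links differ in quoting style and attribute order.
SAMPLE = """
<ul>
  <li><a class="post" href="/q/1">First question</a></li>
  <li><a href='/q/2' class="post">Second question</a></li>
</ul>
"""

# Approach 1: a regular expression. It assumes double quotes and a fixed
# attribute order, so it silently misses the second link.
HREF_RE = re.compile(r'<a class="post" href="([^"]+)"')
regex_links = HREF_RE.findall(SAMPLE)


# Approach 2: an event-driven parser, which copes with either quoting style
# and any attribute order.
class LinkCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("class") == "post":
            self.links.append(attrs.get("href"))


collector = LinkCollector()
collector.feed(SAMPLE)
parser_links = collector.links

print(regex_links)   # ['/q/1']
print(parser_links)  # ['/q/1', '/q/2']
```

Libraries such as BeautifulSoup and lxml build a searchable tree on top of this kind of parsing, which is usually more convenient than handling tag events by hand.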

Comparing Parsers
Beautiful Soup
lxml
Scrapy
html5lib

Preserving The Data
1. Writing to a file.
2. Exporting as a CSV or Excel file.
3. Storing in a database.
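Two of these options can be sketched with the standard library's `csv` and `sqlite3` modules. The records, the `posts` table and both function names are hypothetical; SQLite is chosen here only because it needs no server.

```python
import csv
import sqlite3

# Hypothetical scraped records; real ones would come from the parsing step.
rows = [
    {"title": "First question", "url": "/q/1"},
    {"title": "Second question", "url": "/q/2"},
]


def save_csv(records, path):
    """Write records to a CSV file with a header row."""
    # newline="" lets the csv module control line endings itself.
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=["title", "url"])
        writer.writeheader()
        writer.writerows(records)


def save_sqlite(records, connection):
    """Insert records into a 'posts' table, creating it if needed."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS posts (title TEXT, url TEXT UNIQUE)"
    )
    # INSERT OR IGNORE skips rows whose url is already stored, so running
    # the scraper again does not create duplicates.
    connection.executemany(
        "INSERT OR IGNORE INTO posts (title, url) VALUES (:title, :url)",
        records,
    )
    connection.commit()
```

For example, `save_csv(rows, "posts.csv")` writes a two-row file, and `save_sqlite(rows, sqlite3.connect("posts.db"))` stores the same records in a local database file; both file names are placeholders.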

Examples
Example 1: Scraping tweets from Twitter using BeautifulSoup and Python's Requests module
Code
Example 2: Scraping top Stack Overflow posts using Scrapy
Code
Example 3: Using Selenium to log in and fetch library details from a university library site that uses dynamic HTML.

WHAT TO USE WHERE
1. Handling dynamically generated HTML
Solutions: Selenium or SpiderMonkey
2. Cookie-based authentication
Solution: Requests module
3. Simple scraping
Solutions: BeautifulSoup + Requests, Scrapy, Selenium

Scraping hacks
1. Overcoming captchas
Lookup tables, one-time manual entry, Death By Captcha (paid service)
2. Per-IP-address query limit
Using tsocks, ssh -D and socks monkey.
3. Improving performance
Multiprocessing, gevent and requests.async (since moved out of Requests into the separate grequests package).
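The performance point can be illustrated with a thread-backed pool from the `multiprocessing` package. This is only a sketch under assumptions: the function names and the worker count of 4 are made up, and a real scraper would also need error handling and some throttling.

```python
import urllib.request
from multiprocessing.pool import ThreadPool


def fetch(url, timeout=10):
    """Download one URL and return (url, body bytes)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return url, response.read()


def fetch_all(urls, workers=4):
    """Fetch several URLs concurrently; results keep the input order."""
    # Threads suit this job because most of the time is spent waiting on
    # the network rather than using the CPU.
    with ThreadPool(workers) as pool:
        return pool.map(fetch, urls)
```

Given a list of page addresses, `fetch_all(urls)` returns one `(url, body)` pair per address, in the same order as the input.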

Example 3
Automating My College Library
Problems:
1. Authentication
2. Dynamically generated <iframe> tag
Solution
Selenium with a headless browser such as PhantomJS
Alternative: Mechanize
Code

Ethics Of Scraping
Exceeding authorized use of the site
This means doing anything that is prohibited in the Terms of Use.
(See CFAA, breach of contract, unjust enrichment, trespass to chattels, and various state laws similar to CFAA.)
Copyright issues
If the material you are scraping is not factual, but something that required some amount of creativity to create, you have copyright to worry about.
Quick tip: conform to the site's robots.txt file.
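Checking robots.txt can be automated with the standard library's `urllib.robotparser`. The rules, the `example-scraper` agent name and the URLs below are invented for the illustration; a real scraper would load the file from the target site with `set_url()` followed by `read()`.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the sketch runs offline.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_checker(robots_text):
    """Return a parser loaded with the given robots.txt rules."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


checker = build_checker(ROBOTS_TXT)
# The first URL falls under the Disallow rule, the second does not.
print(checker.can_fetch("example-scraper", "https://example.com/private/data"))  # False
print(checker.can_fetch("example-scraper", "https://example.com/posts"))         # True
```

A scraper can call `can_fetch` before every request and skip any URL for which it returns False.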

● The brute-force way to get the information required.
● Absolutely Legal
● Not always that easy.
