• Share
  • Email
  • Embed
  • Like
  • Save
  • Private Content
Web scraping in python

Web scraping in python



It is a getting started guide to web scraping with Python and was presented at Dev Fest Google Developers Group Pune.

It is a getting started guide to web scraping with Python and was presented at Dev Fest Google Developers Group Pune.



Total Views
Views on SlideShare
Embed Views



0 Embeds 0

No embeds


Upload Details

Uploaded via as Adobe PDF

Usage Rights

© All Rights Reserved

Report content

Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

  • Full Name Full Name Comment goes here.
    Are you sure you want to
    Your message goes here
Post Comment
Edit your comment

    Web scraping in python Web scraping in python Presentation Transcript

    • Web Scraping with Python Virendra Rajput, Hacker @Markitty
    • Agenda ● What is scraping ● Why we scrape ● My experiments with web scraping ● How do we do it ● Tools to use ● Online demo ● Some more tools ● Ethics for scraping
    • converting unstructured documents into structured information scraping:
    • What is Web Scraping? ● Web scraping (web harvesting) is a software technique of extracting information from websites ● It focuses on transformation of unstructured data on the web (typically HTML), into structured data that can be stored and analyzed
    • RSS is meta data and not HTML replacement
    • Why we scrape? ● Web pages contain wealth of information (in text form), designed mostly for human consumption ● Static websites (legacy systems) ● Interfacing with 3rd party with no API access ● Websites are more important than API’s ● The data is already available (in the form of web pages) ● No rate limiting ● Anonymous access
    • How search engines use it
    • My Experiments with Scraping
    • and more..! IMDb API Did you mean! Facebook Bot for Brahma Kumaris
    • Getting started!
    • Fetching the data ● Involves finding the endpoint - URL or URL’s ● Sending HTTP requests to the server ● Using requests library: import requests data = requests.get(‘http://google.com/’) html = data.content
    • Processing (say no to Reg-ex) ● use reg-ex ● Avoid using reg-ex ● Reasons why not to use it: 1. Its fragile 2. Really hard to maintain 3. Improper HTML & Encoding handling
    • Use BeautifulSoup for parsing ● Provides simple methods to- ○ search ○ navigate ○ select ● Deals with broken web-pages really well ● Auto-detects encoding Philosophy- “You didn't write that awful page. You're just trying to get some data out of it. Beautiful Soup is here to help.”
    • Export the data ● Database (relational or non-relational) ● CSV ● JSON ● File (XML, YAML, etc.) ● API
    • Live example demo
    • Challenges ● External sites can change without warning ○ Figuring out the frequency is difficult (TEST, and test) ○ Changes can break scrapers easily ● Bad HTTP status codes ○ example: using 200 OK to signal an error ○ cannot always trust your HTTP libraries default behaviour ● Messy HTML markup
    • Mechanize ● Stateful web-browsing with mechanize ○ Fill up forms ○ Follow links ○ Handle cookies ○ Browse history ● After Andy Lester’s WWW: Mechanize
    • Filling forms with Mechanize
    • Scrapy - a framework for web scraping ● Uses XPath to select elements ● Interactive shell scripting ● Using Scrapy: ○ define a model to store items ○ create your spider to extract items ○ write a Pipeline to store them
    • Conclusion ● Scrape wisely ● Do not steal ● Use cloud ● Share your scrapers scraperwiki.com
    • The End! Virendra Rajput http://virendra.me/ http://twitter.com/bkvirendra