
Scrapy workshop

Abstract

If you want to get data from the web, and there are no APIs available, then you need to use web scraping! Scrapy is the most effective and popular choice for web scraping and is used in many areas such as data science, journalism, business intelligence, web development, etc.

This workshop will provide an overview of Scrapy, starting from the fundamentals and working through each new topic with hands-on examples.

Participants will come away with a good understanding of Scrapy, the principles behind its design, and how to apply the best practices encouraged by Scrapy to any scraping task.

Goals:

Set up a Python environment.
Learn the basic concepts of the Scrapy framework.


Scrapy workshop

  1. SCRAPY WORKSHOP — Karthik Ananth, karthik@scrapinghub.com
  2. Who am I? Karthik Ananth • Leading professional services @ Scrapinghub • I have a vision to synergise data generation and analytics • Open source promoter
  3. Why Web Scraping? APIs • Semantic web
  4. What is Web Scraping? The main goal of scraping is to extract structured data from unstructured sources, typically web pages.
  5. What for? • Monitor prices • Lead generation • Aggregate information • Your imagination is the limit
  6. Do you speak HTTP? • Methods: GET, POST, PUT, HEAD… • Status codes: 2XX, 3XX, 4XX, 418, 5XX, 999 • Headers and query string: Accept-Language, UA*… • Persistence: cookies
  7. Let's perform a request • urllib2 (standard library) • python-requests (HTTP for humans)
  8. Show me the code!
import requests
req = requests.get('http://scrapinghub.com/about/')
What now?
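Before parsing the body, the response object itself exposes the HTTP pieces from slide 6; a minimal sketch using standard python-requests attributes (the URL is the one from the slide):

import requests

req = requests.get('http://scrapinghub.com/about/')
print(req.status_code)              # e.g. 200, a 2XX success code
print(req.headers['Content-Type'])  # response headers
print(req.cookies)                  # persistence via cookies
print(req.text[:200])               # the start of the raw HTML body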
  9. HTML is not a regular language
  10. HTML parsers • lxml: pythonic binding for the C libraries libxml2 and libxslt • beautifulsoup: works with the html.parser, lxml, and html5lib parsers
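For comparison with the lxml example that follows, a minimal beautifulsoup sketch (assuming the beautifulsoup4 package; the HTML string is invented for illustration, and the second argument selects one of the parsers listed above):

from bs4 import BeautifulSoup

html = '<html><body><h1>Example Domain</h1></body></html>'
soup = BeautifulSoup(html, 'html.parser')  # or 'lxml' / 'html5lib'
print(soup.h1.get_text())                  # Example Domain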
  11. Show me the code!
import requests
import lxml.html

req = requests.get('http://nyc2015.pydata.org/schedule/')
tree = lxml.html.fromstring(req.text)
for span in tree.xpath('//span[@class="speaker"]'):
    name = span.xpath('text()')  # list of text nodes
    url = span.xpath('@href')    # list of href attribute values
    print(name)
    print(url)
  12. “Those who don't understand XPath are cursed to reinvent it, poorly.”
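To make the quote concrete, a small sketch of common XPath patterns using Scrapy's Selector (the HTML string is invented for illustration):

from scrapy.selector import Selector

html = '<div><h1 class="title">PyData</h1><a href="/schedule">Schedule</a></div>'
sel = Selector(text=html)
print(sel.xpath('//h1/text()').extract())      # text nodes: [u'PyData']
print(sel.xpath('//a/@href').extract_first())  # attribute value: '/schedule'
print(sel.xpath('//h1[@class="title"]'))       # predicate on an attribute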
  13. Scrapy-ify early on
  14. “An open source and collaborative framework for extracting the data you need from websites. In a fast, simple, yet extensible way.”
  15. $ conda install -c scrapinghub scrapy
  16. $ scrapy shell <url>
An interactive shell console: an invaluable tool for developing and debugging your spiders
  17. An interactive shell console
>>> response.url
'http://example.com'
>>> response.xpath('//h1/text()')
[<Selector xpath='//h1/text()' data=u'Example Domain'>]
>>> view(response)                 # open in browser
>>> fetch('http://www.google.com') # fetch another URL
  18. Starting a project
$ scrapy startproject <name>
pydata
├── pydata
│   ├── __init__.py
│   ├── items.py
│   ├── pipelines.py
│   ├── settings.py
│   └── spiders
│       └── __init__.py
└── scrapy.cfg
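As a hedged illustration of what items.py is for, a minimal Item definition (the class and field names are invented for this example):

import scrapy

class TalkItem(scrapy.Item):
    # declared fields give the scraped data a stable structure
    title = scrapy.Field()
    speaker = scrapy.Field()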
  19. What is a spider?
  20. What is a Spider?
import scrapy

class MySpider(scrapy.Spider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com/']

    def parse(self, response):
        msg = 'A response from %s just arrived!' % response.url
        self.logger.info(msg)
  21. What is a Spider? (1.0)
import scrapy

class MySpider(scrapy.Spider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com/']

    def parse(self, response):
        for h3 in response.xpath('//h3/text()').extract():
            yield {'title': h3}
        for url in response.xpath('//a/@href').extract():
            yield scrapy.Request(url, callback=self.parse)
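Assuming the spider above lives inside a project created with scrapy startproject, it can be run with $ scrapy crawl example.com -o items.json, which uses Scrapy's feed exports to write the yielded dicts to a JSON file.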
  22. Batteries included • Logging • Stats collection • Testing: contracts • Telnet console: inspect a running Scrapy process
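As one sketch of the testing contracts mentioned above, a spider whose parse docstring carries contract annotations (the spider name and XPath are invented; checked against a live page with $ scrapy check contracts-demo):

import scrapy

class ContractsSpider(scrapy.Spider):
    name = 'contracts-demo'

    def parse(self, response):
        """Extract page titles.

        @url http://www.example.com/
        @returns items 0 10
        @scrapes title
        """
        for h1 in response.xpath('//h1/text()').extract():
            yield {'title': h1}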
  23. Avoid getting banned • Rotate your user agent • Disable cookies • Randomize download delays • Use a pool of rotating IPs • Crawlera
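A hedged settings.py sketch of the first three points (the values are illustrative, not recommendations; all four names are standard Scrapy settings):

# settings.py
USER_AGENT = 'Mozilla/5.0 (compatible; pydata-workshop)'  # rotate via middleware in practice
COOKIES_ENABLED = False          # disable cookies
DOWNLOAD_DELAY = 2               # base delay between requests, in seconds
RANDOMIZE_DOWNLOAD_DELAY = True  # vary the delay between 0.5x and 1.5x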
  24. Deployment 1.0: scrapyd, a service daemon to run Scrapy spiders
$ scrapyd-deploy
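Once a project is deployed, runs can be scheduled through scrapyd's JSON API; a minimal sketch with python-requests, assuming a scrapyd instance on its default port and the project and spider names from earlier slides:

import requests

# schedule.json starts a spider run and returns a job id
resp = requests.post('http://localhost:6800/schedule.json',
                     data={'project': 'pydata', 'spider': 'example.com'})
print(resp.json())  # e.g. {'status': 'ok', 'jobid': '...'}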
  25. Scrapy Cloud
$ shub deploy
  26. About us • Tons of open source • Fully remote, distributed team
  27. Mandatory sales slide: try.scrapinghub.com/pydatanyc • Crawl the web, at scale: cloud-based platform, smart proxy rotator • Get data, hassle-free: off-the-shelf datasets, turn-key web scraping
  28. We’re hiring!
  29. Thanks
