Web Scraping in Python with Scrapy

Web Scraping in Python with Scrapy @鮨会

Published in: Technology

  1. 1. Web Scraping in Python with Scrapy Kota Kato @orangain 2015-09-08, 鮨会
  2. 2. Who am I? • Kota Kato • @orangain • Software Engineer • Interested in automation such as Jenkins, Chef, Docker etc.
  3. 3. Definition: Web Scraping • Web scraping (web harvesting or web data extraction) is a computer software technique of extracting information from websites. Web scraping - Wikipedia, the free encyclopedia
 https://en.wikipedia.org/wiki/Web_scraping
  4. 4. eBook-1 • Cross-store search engine for ebooks. • Retrieve ebook data from 9 ebook stores. http://ebook-1.com/
  5. 5. QB Meter • Visualize the crowdedness of QB HOUSE, a 10-minute barbershop. • Retrieve crowdedness data from QB HOUSE's website every 5 minutes. http://qbmeter.capybala.com/
  6. 6. Prototype of Glance • Prototype of a simple, newspaper-like news app. • Retrieve news from NHK NEWS WEB 4 times a day.
  7. 7. Pokedos • Web app to find the nearest bus stops and see bus arrival information. • Retrieve the locations of all bus stops in Kyoto City. http://bus.capybala.com/
  8. 8. Why Web Scraping? • For Web Developers: • Develop mash-up applications. • For Data Analysts: • Retrieve data to analyze. • For Everybody: • Automate operations on websites.
  9. 9. Why Use Python? • Easy to use • Powerful libraries, especially Scrapy • Seamlessness between data processing and developing application
  10. 10. Web Scraping in Python • Combination of lightweight libraries: • Retrieving: Requests • Scraping: lxml, Beautiful Soup • Full stack framework: • Scrapy (today's topic)
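To make the "retrieve, then scrape" split concrete, here is a minimal sketch of the scraping half. Requests, lxml, and Beautiful Soup are third-party packages, so this sketch substitutes the standard library's `html.parser` (Python 3) and parses an inline document instead of fetching one. `LinkCollector`, `extract_links`, and the sample page are hypothetical names made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect absolute URLs from every <a href="..."> in a document."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            # Resolve relative links against the page they were found on.
            self.links.append(urljoin(self.base_url, href))


def extract_links(html, base_url):
    parser = LinkCollector(base_url)
    parser.feed(html)
    parser.close()
    return parser.links


# Hypothetical page. In the two-library approach this text would come from
# an HTTP client such as Requests.
SAMPLE_HTML = """
<ul>
  <li><a href="/2015/09/">September 2015</a></li>
  <li><a href="about.html">About</a></li>
  <li><a name="top">No link here</a></li>
</ul>
"""

links = extract_links(SAMPLE_HTML, "http://example.com/blog/")
print(links)
```

With the lightweight approach, everything around this step (scheduling, retries, politeness delays) is left to your own code, which is the gap a full stack framework fills.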
  11. 11. Scrapy
  12. 12. Scrapy • Fast, simple and extensible Web scraping framework in Python • Currently compatible only with Python 2.7 • In-progress Python 3 support • Maintained by Scrapinghub • BSD License http://scrapy.org/
  13. 13. Why Use Scrapy? • Scrapy takes care of the tedious parts of crawling and scraping: • Extracting links • Throttling • Concurrency • robots.txt and <meta> tags • XML sitemaps • Filtering duplicated URLs • Retry on error • Job control
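To give a feel for one of these chores, the sketch below hand-rolls duplicate URL filtering for a breadth-first crawl. It is only an illustration of the idea, not how Scrapy does it (Scrapy's own duplicate filter is more elaborate). `crawl_order` and `GRAPH` are hypothetical: the dict stands in for the web so that the example runs without a network (Python 3).

```python
from collections import deque
from urllib.parse import urldefrag


def crawl_order(start_url, link_graph, max_pages=100):
    """Return pages in breadth-first visit order, visiting each URL once.

    link_graph is a stand-in for the web: it maps a URL to the URLs
    found on that page.
    """
    seen = {start_url}
    queue = deque([start_url])
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)
        for link in link_graph.get(url, []):
            # "#section" variants point at the same page, so drop the fragment.
            link, _fragment = urldefrag(link)
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


GRAPH = {
    "http://example.com/": ["http://example.com/a", "http://example.com/b"],
    "http://example.com/a": ["http://example.com/", "http://example.com/b#top"],
    "http://example.com/b": ["http://example.com/a", "http://example.com/c"],
}

order = crawl_order("http://example.com/", GRAPH)
print(order)
```

Even this small loop already needs a queue, a seen-set, and URL normalization. Throttling, retries, and robots.txt handling would each add more code of this kind.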
  14. 14. Getting Started with Scrapy
      Requirements: Python 2.7, libxml2 and libxslt

          $ pip install scrapy
          $ cat > myspider.py <<EOF
          import scrapy

          class BlogSpider(scrapy.Spider):
              name = 'blogspider'
              start_urls = ['http://blog.scrapinghub.com']

              def parse(self, response):
                  # Follow only links that end in /YYYY/MM/ (monthly archive pages).
                  for url in response.css('ul li a::attr("href")').re(r'.*/\d\d\d\d/\d\d/$'):
                      yield scrapy.Request(response.urljoin(url), self.parse_titles)

              def parse_titles(self, response):
                  # Emit one item per post title found on the archive page.
                  for post_title in response.css('div.entries > ul > li a::text').extract():
                      yield {'title': post_title}
          EOF
          $ scrapy runspider myspider.py

      http://scrapy.org/
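The link filter in `parse` is easier to follow in isolation. The sketch below assumes the pattern is meant to match monthly archive links such as `/2015/09/` and applies it with the standard library's `re` and `urljoin` (Python 3) rather than Scrapy's selector `.re()` method, so it approximates the behaviour rather than reproducing it. `archive_urls` and the sample hrefs are hypothetical.

```python
import re
from urllib.parse import urljoin

# Ends with a four-digit year segment followed by a two-digit month segment.
ARCHIVE_LINK = re.compile(r'.*/\d\d\d\d/\d\d/$')


def archive_urls(page_url, hrefs):
    """Keep hrefs that look like monthly archives and make them absolute."""
    return [urljoin(page_url, href) for href in hrefs if ARCHIVE_LINK.match(href)]


sample_hrefs = [
    '/2015/09/',
    '/about/',
    'http://blog.scrapinghub.com/2014/12/',
    '/2015/09/08/some-post/',
]
matched = archive_urls('http://blog.scrapinghub.com', sample_hrefs)
print(matched)
```

Individual post URLs and static pages are rejected, so the second callback only ever sees archive pages.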
  15. 15. Let's Collect Sushi Images
  16. 16. Create a Scrapy Project

          $ scrapy startproject sushibot
          $ tree sushibot/
          sushibot/
          ├── scrapy.cfg
          └── sushibot
              ├── __init__.py
              ├── items.py
              ├── pipelines.py
              ├── settings.py
              └── spiders
                  └── __init__.py

          2 directories, 6 files
  17. 17. Generate a Spider

          $ cd sushibot
          $ scrapy genspider sushi api.flickr.com
          $ cat sushibot/spiders/sushi.py
          # -*- coding: utf-8 -*-
          import scrapy


          class SushiSpider(scrapy.Spider):
              name = "sushi"
              allowed_domains = ["api.flickr.com"]
              start_urls = (
                  'http://www.api.flickr.com/',
              )

              def parse(self, response):
                  pass
  18. 18. Flickr API to Search Photos

          $ curl 'https://api.flickr.com/services/rest/?method=flickr.photos.search&api_key=******&text=sushi&sort=relevance' > photos.xml
          $ cat photos.xml
          <?xml version="1.0" encoding="utf-8" ?>
          <rsp stat="ok">
          <photos page="1" pages="871" perpage="100" total="87088">
          <photo id="4794344495" owner="38553162@N00" secret="d907790937" server="4093" farm="5" title="Sushi!" ispublic="1" isfriend="0" isfamily="0" />
          <photo id="8486536177" owner="78779574@N00" secret="f77b824ebb" server="8382" farm="9" title="Best Salmon Sushi" ispublic="1" isfriend="0" isfamily="0" />
          ...

      https://www.flickr.com/services/api/flickr.photos.search.html
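When the same search URL is built in code rather than typed into curl, it helps to let the standard library handle query-string escaping. The sketch below uses `urllib.parse.urlencode` (Python 3). The endpoint and parameter names come from the curl command above, while `search_url` and the placeholder key are hypothetical.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

FLICKR_REST_ENDPOINT = 'https://api.flickr.com/services/rest/'


def search_url(api_key, text, sort='relevance'):
    """Build a flickr.photos.search URL with properly escaped parameters."""
    params = [
        ('method', 'flickr.photos.search'),
        ('api_key', api_key),
        ('text', text),
        ('sort', sort),
    ]
    return FLICKR_REST_ENDPOINT + '?' + urlencode(params)


# 'PLACEHOLDER_KEY' is not a real key; substitute your own.
url = search_url('PLACEHOLDER_KEY', 'salmon sushi')
print(url)
```

The space in the search text is escaped as `+`, which plain string concatenation would pass through unescaped.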
  19. 19. Construct Photo's URL
      Photo element:

          <photo id="4794344495" owner="38553162@N00" secret="d907790937" server="4093" farm="5" title="Sushi!" ispublic="1" isfriend="0" isfamily="0" />

      Photo's URL template:

          https://farm{farm-id}.staticflickr.com/{server-id}/{id}_{secret}_[mstzb].jpg

      Result:

          https://farm5.staticflickr.com/4093/4794344495_d907790937_b.jpg

      https://www.flickr.com/services/api/misc.urls.html
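The template substitution can be tried without Scrapy. This sketch parses a trimmed copy of the API response from the previous slide with the standard library's `xml.etree.ElementTree` and fills in the template for each `<photo>` element. `photo_urls` and `SAMPLE_RESPONSE` are hypothetical names, and the XML declaration is omitted from the sample for simplicity.

```python
import xml.etree.ElementTree as ET

PHOTO_URL_TEMPLATE = (
    'https://farm{farm}.staticflickr.com/{server}/{id}_{secret}_{size}.jpg'
)

SAMPLE_RESPONSE = """<rsp stat="ok">
<photos page="1" pages="871" perpage="100" total="87088">
<photo id="4794344495" owner="38553162@N00" secret="d907790937" server="4093" farm="5" title="Sushi!" ispublic="1" isfriend="0" isfamily="0" />
<photo id="8486536177" owner="78779574@N00" secret="f77b824ebb" server="8382" farm="9" title="Best Salmon Sushi" ispublic="1" isfriend="0" isfamily="0" />
</photos>
</rsp>"""


def photo_urls(xml_text, size='b'):
    """Return one image URL per <photo> element; size is one of m, s, t, z, b."""
    root = ET.fromstring(xml_text)
    return [
        PHOTO_URL_TEMPLATE.format(
            farm=photo.get('farm'),
            server=photo.get('server'),
            id=photo.get('id'),
            secret=photo.get('secret'),
            size=size,
        )
        for photo in root.iter('photo')
    ]


urls = photo_urls(SAMPLE_RESPONSE)
print(urls[0])
```

The first URL matches the "Result" line on the slide, which is a quick way to check the attribute-to-placeholder mapping.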
  20. 20. spiders/sushi.py (Modified)

          # -*- coding: utf-8 -*-
          import os

          import scrapy

          from sushibot.items import SushibotItem


          class SushiSpider(scrapy.Spider):
              name = "sushi"
              # staticflickr.com is allowed too, because the image files are hosted there.
              allowed_domains = ["api.flickr.com", "staticflickr.com"]
              # The API key is read from the FLICKR_KEY environment variable.
              start_urls = (
                  'https://api.flickr.com/services/rest/?method=flickr.photos.search&api_key=' +
                  os.environ['FLICKR_KEY'] + '&text=sushi&sort=relevance',
              )

              def parse(self, response):
                  # Request the image file for every <photo> element in the API response.
                  for photo in response.css('photo'):
                      yield scrapy.Request(photo_url(photo), self.handle_image)

              def handle_image(self, response):
                  # Wrap the downloaded image bytes in an item for the pipeline.
                  return SushibotItem(url=response.url, body=response.body)


          def photo_url(photo):
              # Fill the photo URL template from the attributes of a <photo> element.
              return 'https://farm{farm}.staticflickr.com/{server}/{id}_{secret}_{size}.jpg'.format(
                  farm=photo.xpath('@farm').extract_first(),
                  server=photo.xpath('@server').extract_first(),
                  id=photo.xpath('@id').extract_first(),
                  secret=photo.xpath('@secret').extract_first(),
                  size='b',
              )
  21. 21. Scrapy's Architecture http://doc.scrapy.org/en/1.0/topics/architecture.html
  22. 22. items.py

          # -*- coding: utf-8 -*-
          from pprint import pformat

          import scrapy


          class SushibotItem(scrapy.Item):
              url = scrapy.Field()
              body = scrapy.Field()

              def __str__(self):
                  # Show only the first 10 bytes of the body so logs stay readable.
                  return pformat({
                      'url': self['url'],
                      'body': self['body'][:10] + '...',
                  })
  23. 23. pipelines.py

          # -*- coding: utf-8 -*-
          import os


          class SaveImagePipeline(object):
              def process_item(self, item, spider):
                  output_dir = 'images'
                  if not os.path.exists(output_dir):
                      os.makedirs(output_dir)

                  # Use the last path segment of the image URL as the file name.
                  filename = item['url'].split('/')[-1]
                  with open(os.path.join(output_dir, filename), 'wb') as f:
                      f.write(item['body'])

                  return item
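Because an item pipeline is a plain Python class, its logic can be exercised without running a crawl. The sketch below repeats the slide's logic with one change that is an assumption of this example: the output directory is a constructor argument, so the test run can write into a temporary directory. A plain dict stands in for `SushibotItem`, and `save_sample` is a hypothetical helper (Python 3).

```python
import os
import tempfile


class SaveImagePipeline(object):
    """Same logic as the slide, with a configurable output directory."""

    def __init__(self, output_dir='images'):
        self.output_dir = output_dir

    def process_item(self, item, spider):
        if not os.path.exists(self.output_dir):
            os.makedirs(self.output_dir)
        # The last path segment of the URL becomes the file name.
        filename = item['url'].split('/')[-1]
        with open(os.path.join(self.output_dir, filename), 'wb') as f:
            f.write(item['body'])
        return item


def save_sample(output_dir):
    """Push one fake item through the pipeline and list what was written."""
    item = {
        'url': 'https://farm5.staticflickr.com/4093/4794344495_d907790937_b.jpg',
        'body': b'\xff\xd8 not a real JPEG',  # placeholder bytes
    }
    SaveImagePipeline(output_dir).process_item(item, spider=None)
    return sorted(os.listdir(output_dir))


with tempfile.TemporaryDirectory() as tmp:
    saved = save_sample(os.path.join(tmp, 'images'))
print(saved)
```

Deriving the file name from the URL keeps the pipeline simple, at the cost of overwriting files if two URLs ever share a final path segment.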
  24. 24. settings.py • Appended settings:

          # Crawl responsibly by identifying yourself (and your website) on the user-agent
          USER_AGENT = 'sushibot (+orangain@gmail.com)'

          # Configure a delay for requests for the same website (default: 0)
          # See http://scrapy.readthedocs.org/en/latest/topics/settings.html#download-delay
          # See also autothrottle settings and docs
          DOWNLOAD_DELAY = 1

          # Configure item pipelines
          # See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
          ITEM_PIPELINES = {
              'sushibot.pipelines.SaveImagePipeline': 300,
          }
  25. 25. Run Spider

          $ FLICKR_KEY=********** scrapy crawl sushi

      NOTE: Provide the Flickr API key through the FLICKR_KEY environment variable.
  26. 26. Thank you! • Web scraping has the power to propose improvements. • Source code is available at https://github.com/orangain/sushibot @orangain
