Web Scraping in Python with Scrapy
Kota Kato
@orangain
2015-09-08, 鮨会
Who am I?
• Kota Kato
• @orangain
• Software Engineer
• Interested in automation tools such as Jenkins, Chef, and Docker
Definition: Web Scraping
• Web scraping (web harvesting or web data
extraction) is a computer software technique
of extracting information from websites.
Web scraping - Wikipedia, the free encyclopedia

https://en.wikipedia.org/wiki/Web_scraping
eBook-1
• Cross-store search engine for ebooks.
• Retrieve ebook data from 9 ebook stores.
http://ebook-1.com/
QB Meter
• Visualize the crowdedness of QB HOUSE, a 10-minute barbershop.
• Retrieve crowdedness from QB HOUSE's web site every 5 minutes.
http://qbmeter.capybala.com/
Prototype of Glance
• Prototype of a simple news app, like a newspaper.
• Retrieve news from NHK NEWS WEB 4 times per day.
Pokedos
• Web app to find the nearest bus stops and see bus arrival information.
• Retrieve the locations of all bus stops in Kyoto City.
http://bus.capybala.com/
Why Web Scraping?
• For Web Developers:
• Develop mash-up applications.
• For Data Analysts:
• Retrieve data to analyze.
• For Everybody:
• Automate operations on web sites.
Why Use Python?
• Easy to use
• Powerful libraries, especially Scrapy
• Seamless transition between data processing and application development
Web Scraping in Python
• Combination of lightweight libraries (see the sketch below):
• Retrieving: Requests
• Scraping: lxml, Beautiful Soup
• Full stack framework:
• Scrapy ← Today's topic
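For comparison, a minimal sketch of the lightweight-library combination (Requests to retrieve, lxml to scrape); the URL and XPath here are illustrative assumptions, not part of the talk:

from __future__ import print_function

import requests
import lxml.html

# Retrieve the page with Requests and parse the HTML with lxml.
response = requests.get('http://blog.scrapinghub.com')
root = lxml.html.fromstring(response.content)

# Print each link's href attribute and text.
for a in root.xpath('//ul/li/a'):
    print(a.get('href'), a.text_content())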
Scrapy
Scrapy
• Fast, simple and extensible Web scraping
framework in Python
• Currently compatible only with Python 2.7
• Python 3 support is in progress
• Maintained by Scrapinghub
• BSD License
http://scrapy.org/
Why Use Scrapy?
• Scrapy takes care of the annoying parts of crawling and scraping, mostly driven by configuration rather than hand-written code (see the settings sketch below):
• Extracting links
• Throttling concurrency
• robots.txt and <meta> tags
• XML sitemaps
• Filtering duplicated URLs
• Retry on error
• Job control
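A rough sketch of settings that switch these features on; the setting names come from the Scrapy settings reference, while the values here are illustrative assumptions:

ROBOTSTXT_OBEY = True                # respect robots.txt
CONCURRENT_REQUESTS_PER_DOMAIN = 8   # throttle concurrency per domain
DOWNLOAD_DELAY = 0.5                 # pause between requests to the same site
RETRY_ENABLED = True                 # retry failed requests
RETRY_TIMES = 2                      # ...up to two extra attempts
JOBDIR = 'crawls/sushi-1'            # job control: persist state so a crawl can pause/resume
# Duplicated URLs are filtered out by the built-in dupefilter by default,
# and XML sitemaps are supported via scrapy.spiders.SitemapSpider.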
Getting Started with Scrapy
$ pip install scrapy
$ cat > myspider.py <<EOF
import scrapy


class BlogSpider(scrapy.Spider):
    name = 'blogspider'
    start_urls = ['http://blog.scrapinghub.com']

    def parse(self, response):
        for url in response.css('ul li a::attr("href")').re(r'.*/\d\d\d\d/\d\d/$'):
            yield scrapy.Request(response.urljoin(url), self.parse_titles)

    def parse_titles(self, response):
        for post_title in response.css('div.entries > ul > li a::text').extract():
            yield {'title': post_title}
EOF
$ scrapy runspider myspider.py

Requirements: Python 2.7, libxml2 and libxslt
http://scrapy.org/
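Items yielded by the spider can also be written straight to a feed file with the -o option:

$ scrapy runspider myspider.py -o titles.json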
Let's Collect Sushi Images
Create a Scrapy Project
$ scrapy startproject sushibot
$ tree sushibot/
sushibot/
├── scrapy.cfg
└── sushibot
    ├── __init__.py
    ├── items.py
    ├── pipelines.py
    ├── settings.py
    └── spiders
        └── __init__.py

2 directories, 6 files
Generate a Spider
$ cd sushibot
$ scrapy genspider sushi api.flickr.com
$ cat sushibot/spiders/sushi.py
# -*- coding: utf-8 -*-
import scrapy


class SushiSpider(scrapy.Spider):
    name = "sushi"
    allowed_domains = ["api.flickr.com"]
    start_urls = (
        'http://www.api.flickr.com/',
    )

    def parse(self, response):
        pass
Flickr API to Search Photos
$ curl 'https://api.flickr.com/services/rest/?method=flickr.photos.search&api_key=******&text=sushi&sort=relevance' > photos.xml
$ cat photos.xml
<?xml version="1.0" encoding="utf-8" ?>
<rsp stat="ok">
<photos page="1" pages="871" perpage="100" total="87088">
<photo id="4794344495" owner="38553162@N00" secret="d907790937"
server="4093" farm="5" title="Sushi!" ispublic="1" isfriend="0"
isfamily="0" />
<photo id="8486536177" owner="78779574@N00" secret="f77b824ebb"
server="8382" farm="9" title="Best Salmon Sushi" ispublic="1"
isfriend="0" isfamily="0" />
...
https://www.flickr.com/services/api/flickr.photos.search.html
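Before writing the spider, the saved response can be explored with Scrapy's Selector class on its own; a small sketch (the file name follows the curl example above):

from __future__ import print_function

import io

from scrapy.selector import Selector

# Load the saved API response and parse it as XML.
with io.open('photos.xml', encoding='utf-8') as f:
    sel = Selector(text=f.read(), type='xml')

# Print the id and title of every <photo> element.
for photo in sel.xpath('//photo'):
    print(photo.xpath('@id').extract_first(),
          photo.xpath('@title').extract_first())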
Construct Photo's URL
Photo element:
<photo id="4794344495" owner="38553162@N00" secret="d907790937"
       server="4093" farm="5" title="Sushi!" ispublic="1" isfriend="0"
       isfamily="0" />

Photo's URL template:
https://farm{farm-id}.staticflickr.com/{server-id}/{id}_{secret}_[mstzb].jpg

Result:
https://farm5.staticflickr.com/4093/4794344495_d907790937_b.jpg

https://www.flickr.com/services/api/misc.urls.html
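Filling the template with the attributes of the sample element reproduces the result above; a quick check in plain Python ('b' selects the large size):

photo = {'farm': '5', 'server': '4093',
         'id': '4794344495', 'secret': 'd907790937'}
url = 'https://farm{farm}.staticflickr.com/{server}/{id}_{secret}_b.jpg'.format(**photo)
assert url == 'https://farm5.staticflickr.com/4093/4794344495_d907790937_b.jpg'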
spiders/sushi.py (Modified)
# -*- coding: utf-8 -*-
import os

import scrapy

from sushibot.items import SushibotItem


class SushiSpider(scrapy.Spider):
    name = "sushi"
    allowed_domains = ["api.flickr.com", "staticflickr.com"]
    start_urls = (
        'https://api.flickr.com/services/rest/?method=flickr.photos.search&api_key=' +
        os.environ['FLICKR_KEY'] + '&text=sushi&sort=relevance',
    )

    def parse(self, response):
        for photo in response.css('photo'):
            yield scrapy.Request(photo_url(photo), self.handle_image)

    def handle_image(self, response):
        return SushibotItem(url=response.url, body=response.body)


def photo_url(photo):
    return 'https://farm{farm}.staticflickr.com/{server}/{id}_{secret}_{size}.jpg'.format(
        farm=photo.xpath('@farm').extract_first(),
        server=photo.xpath('@server').extract_first(),
        id=photo.xpath('@id').extract_first(),
        secret=photo.xpath('@secret').extract_first(),
        size='b',
    )
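Note that staticflickr.com is listed in allowed_domains along with the API host; otherwise Scrapy's offsite middleware would filter out the image requests yielded from parse().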
Scrapy's Architecture
http://doc.scrapy.org/en/1.0/topics/architecture.html
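(The linked diagram shows the Engine coordinating the Scheduler, Downloader and Spiders, with Item Pipelines receiving scraped items; downloader and spider middlewares sit on the paths between them.)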
items.py
# -*- coding: utf-8 -*-
from pprint import pformat

import scrapy


class SushibotItem(scrapy.Item):
    url = scrapy.Field()
    body = scrapy.Field()

    def __str__(self):
        return pformat({
            'url': self['url'],
            'body': self['body'][:10] + '...',
        })
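__str__ truncates body to its first 10 bytes, so logged items don't dump raw image data to the console.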
pipelines.py
# -*- coding: utf-8 -*-
import os


class SaveImagePipeline(object):
    def process_item(self, item, spider):
        output_dir = 'images'
        if not os.path.exists(output_dir):
            os.makedirs(output_dir)

        filename = item['url'].split('/')[-1]
        with open(os.path.join(output_dir, filename), 'wb') as f:
            f.write(item['body'])

        return item
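process_item returns the item so that any further pipelines (those with higher numbers in ITEM_PIPELINES) still receive it.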
settings.py
• Appended settings:
# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'sushibot (+orangain@gmail.com)'

# Configure a delay for requests for the same website (default: 0)
# See http://scrapy.readthedocs.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 1

# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'sushibot.pipelines.SaveImagePipeline': 300,
}
Run Spider
$ FLICKR_KEY=********** scrapy crawl sushi
NOTE: Provide your Flickr API key via the FLICKR_KEY environment variable.
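When the crawl finishes, the downloaded JPEG files are in the images/ directory created by SaveImagePipeline.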
Thank you!
• Web scraping gives you the power to propose improvements.
• Source code is available at

https://github.com/orangain/sushibot
@orangain
