Downloading the internet with Python + scrapy 💻🐍
Erin Shellman @erinshellman
Puget Sound Programming Python meet-up
January 14, 2015
hi!
I’m a data scientist in the Nordstrom Data Lab. I’ve built scrapers to monitor the product catalogs of various sports retailers.
Getting data can be hard
Despite the open-data movement and popularity of APIs, volumes of data are locked up in DOMs all over the internet.
Monitoring competitor prices
• As a retailer, I want to strategically set prices in relation to my competitors.
• But they aren’t interested in sharing their prices and mark-down strategies with me. 😭
• “Scrapy is an application framework for crawling web sites and extracting structured data which can be used for a wide range of useful applications, like data mining, information processing or historical archival.”
• scrapin’ on rails!
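If you want to follow along, Scrapy installs from PyPI (assuming pip is available):

pip install Scrapy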
scrapy startproject prices
scrapy project
prices/
    scrapy.cfg
    prices/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            ...
scrapy project, now with our first spider added
prices/
    scrapy.cfg
    prices/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            backcountry.py
            ...
Define what to scrape in items.py:

# items.py
from scrapy.item import Item, Field

class Product(Item):
    product_title = Field()
    description = Field()
    price = Field()
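Once instantiated, items behave like dictionaries. A quick sketch of using the Product item defined above (the values are illustrative):

item = Product()
item['product_title'] = "Cush Slipper - Men's"
item['price'] = 23.95

print(item['price'])   # 23.95
print(dict(item))      # {'product_title': "Cush Slipper - Men's", 'price': 23.95}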
protip: get to know the DOM.
Sometimes there are hidden gems.
SKU-level inventory availability? Score!
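Gems like that often live in embedded <script> tags rather than visible markup. A hypothetical sketch of digging one out; the 'skuInventory' variable name and the regex are made up for illustration and are not Backcountry's actual markup:

import json
import re

# Find a script tag that mentions the (hypothetical) inventory variable.
scripts = response.xpath("//script[contains(., 'skuInventory')]/text()").extract()
if scripts:
    match = re.search(r'skuInventory\s*=\s*(\{.*?\});', scripts[0], re.DOTALL)
    if match:
        inventory = json.loads(match.group(1))  # e.g. {'BAF0028-BKA-XL': 40, ...}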
Spider design
Spiders have two primary components:
1. Crawling (navigation) instructions
2. Parsing instructions
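In code, those two components map onto yielding requests (crawl) and yielding data (parse). A minimal sketch using Scrapy 1.0+ idioms; the spider name and URL are placeholders, not part of the talk's project:

import scrapy

class ExampleSpider(scrapy.Spider):
    name = 'example'                        # placeholder
    start_urls = ['http://example.com']     # placeholder

    def parse(self, response):
        # 1. Crawling: follow links by yielding more requests
        for href in response.xpath('//a/@href').extract():
            yield scrapy.Request(response.urljoin(href), callback=self.parse)

        # 2. Parsing: pull structured data out of the page
        yield {'title': response.xpath('//title/text()').extract_first()}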
Define the crawl behavior in spiders/backcountry.py
After spending some time on backcountry.com, I decided the all-brands landing page was the best starting URL.
Part I: Crawl Setup

# spiders/backcountry.py
import scrapy
from scrapy.spiders import CrawlSpider


class BackcountrySpider(CrawlSpider):
    name = 'backcountry'

    def __init__(self, *args, **kwargs):
        super(BackcountrySpider, self).__init__(*args, **kwargs)
        self.base_url = 'http://www.backcountry.com'
        self.start_urls = ['http://www.backcountry.com/Store/catalog/shopAllBrands.jsp']

    def parse_start_url(self, response):
        # Collect the relative URL of every brand on the all-brands page.
        brands = response.xpath("//a[@class='qa-brand-link']/@href").extract()

        for brand in brands:
            brand_url = str(self.base_url + brand)
            self.log("Queued up: %s" % brand_url)

            # e.g. brand_url = http://www.backcountry.com/burton
            yield scrapy.Request(url=brand_url,
                                 callback=self.parse_brand_landing_pages)
    def parse_brand_landing_pages(self, response):
        shop_all_pattern = "//a[@class='subcategory-link brand-plp-link qa-brand-plp-link']/@href"
        shop_all_link = response.xpath(shop_all_pattern).extract()

        if shop_all_link:
            # Brands with a "Shop All" link get crawled from their full product listing.
            all_product_url = str(self.base_url + shop_all_link[0])

            yield scrapy.Request(url=all_product_url,
                                 callback=self.parse_product_pages)
        else:
            # Otherwise the brand landing page itself lists the products.
            yield scrapy.Request(url=response.url,
                                 callback=self.parse_product_pages)
    def parse_product_pages(self, response):
        product_page_pattern = "//a[contains(@class, 'qa-product-link')]/@href"
        pagination_pattern = "//li[@class='page-link page-number']/a/@href"

        product_pages = response.xpath(product_page_pattern).extract()
        more_pages = response.xpath(pagination_pattern).extract()

        # Paginate! Each results page feeds back through this same method;
        # Scrapy's built-in dupefilter drops already-seen URLs, so the recursion terminates.
        for page in more_pages:
            next_page = str(self.base_url + page)
            yield scrapy.Request(url=next_page,
                                 callback=self.parse_product_pages)

        # Hand every product link off to the item parser.
        for product in product_pages:
            product_url = str(self.base_url + product)

            yield scrapy.Request(url=product_url,
                                 callback=self.parse_item)
Part II: Parsing

    def parse_item(self, response):
        item = Product()
        dirty_data = {}

        dirty_data['product_title'] = response.xpath("//*[@id='product-buy-box']/div/div[1]/h1/text()").extract()
        dirty_data['description'] = response.xpath("//div[@class='product-description']/text()").extract()
        dirty_data['price'] = response.xpath("//span[@itemprop='price']/text()").extract()

        for variable in dirty_data.keys():
            if dirty_data[variable]:
                if variable == 'price':
                    # '$1,299.95' -> 1299.95
                    item[variable] = float(''.join(dirty_data[variable]).strip().replace('$', '').replace(',', ''))
                else:
                    item[variable] = ''.join(dirty_data[variable]).strip()

        yield item
Part II: Clean it now!
The cleaning loop above runs inline in parse_item: strip whitespace and coerce prices to floats before items ever leave the spider, so the exported feed is analysis-ready.
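The same cleanup could instead live in pipelines.py, which is already sitting in the project skeleton. A minimal sketch of an Item Pipeline doing the price conversion, assuming the spider yielded the raw price string:

# pipelines.py -- a sketch; the talk's spider cleans inline instead
class PriceCleanupPipeline(object):

    def process_item(self, item, spider):
        if item.get('price'):
            item['price'] = float(str(item['price']).strip().replace('$', '').replace(',', ''))
        return item

Enable it in settings.py with ITEM_PIPELINES = {'prices.pipelines.PriceCleanupPipeline': 300}.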
scrapy crawl backcountry -o bc.json
2015-01-02 12:32:52-0800 [backcountry] INFO: Closing spider (finished)
2015-01-02 12:32:52-0800 [backcountry] INFO: Stored json feed (38881 items) in: bc.json
2015-01-02 12:32:52-0800 [backcountry] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 33068379,
'downloader/request_count': 41848,
'downloader/request_method_count/GET': 41848,
'downloader/response_bytes': 1715232531,
'downloader/response_count': 41848,
'downloader/response_status_count/200': 41835,
'downloader/response_status_count/301': 9,
'downloader/response_status_count/404': 4,
'dupefilter/filtered': 12481,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2015, 1, 2, 20, 32, 52, 524929),
'item_scraped_count': 38881,
'log_count/DEBUG': 81784,
'log_count/ERROR': 23,
'log_count/INFO': 26,
'request_depth_max': 7,
'response_received_count': 41839,
'scheduler/dequeued': 41848,
'scheduler/dequeued/memory': 41848,
'scheduler/enqueued': 41848,
'scheduler/enqueued/memory': 41848,
'spider_exceptions/IndexError': 23,
'start_time': datetime.datetime(2015, 1, 2, 20, 14, 16, 892071)}
2015-01-02 12:32:52-0800 [backcountry] INFO: Spider closed (finished)
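The -o flag uses Scrapy's feed exports, which infer the output format from the file extension, so switching formats is a one-character change:

scrapy crawl backcountry -o bc.csv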
One item from bc.json; the full scraper captures more fields than the three shown in items.py above:

{
  "review_count": 18,
  "product_id": "BAF0028",
  "brand": "Baffin",
  "product_url": "http://www.backcountry.com/baffin-cush-slipper-mens",
  "source": "backcountry",
  "inventory": {
    "BAF0028-ESP-S3XL": 27,
    "BAF0028-BKA-XL": 40,
    "BAF0028-NVA-XL": 5,
    "BAF0028-NVA-L": 7,
    "BAF0028-BKA-L": 17,
    "BAF0028-ESP-XXL": 12,
    "BAF0028-NVA-XXL": 6,
    "BAF0028-BKA-XXL": 44,
    "BAF0028-NVA-S3XL": 10,
    "BAF0028-ESP-L": 50,
    "BAF0028-ESP-XL": 52,
    "BAF0028-BKA-S3XL": 19
  },
  "price_high": 24.95,
  "price": 23.95,
  "description_short": "Cush Slipper - Men's",
  "price_low": 23.95,
  "review_score": 4
}
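Since -o bc.json writes a single JSON array, inspecting the haul afterwards takes a few lines (a sketch; field names match the item above):

import json

with open('bc.json') as f:
    products = json.load(f)

print(len(products))           # 38881 items
print(products[0]['price'])    # e.g. 23.95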
prices/
    scrapy.cfg
    prices/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            backcountry.py
            evo.py
            rei.py
            ...
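With one spider per retailer, refreshing everything is just a loop over scrapy list (a sketch; evo and rei follow the same pattern as backcountry):

for spider in $(scrapy list); do
    scrapy crawl $spider -o $spider.json
done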
“Data wrangling is a huge — and surprisingly so — part of the job. It’s something that is not appreciated by data civilians. At times, it feels like everything we do.”
–Monica Rogati, VP of Data at Jawbone
Resources
• Code here!
  • https://github.com/erinshellman/backcountry-scraper
• Lynn Root’s excellent end-to-end tutorial
  • http://newcoder.io/Intro-Scrape/
• Web scraping - It’s your civic duty
  • http://pbpython.com/web-scraping-mn-budget.html
Ladies!! Bring your projects to hacknight!
http://www.meetup.com/Seattle-PyLadies

Thursday, January 29th, 6PM
Intro to IPython and Matplotlib
Ada Developers Academy
1301 5th Avenue #1350