Downloading the internet with Python + scrapy 💻🐍
Erin Shellman @erinshellman
Puget Sound Programming Python meet-up
January 14, 2015
hi!
I’m a data scientist in the Nordstrom Data Lab. I’ve built scrapers to monitor the product catalogs of various sports retailers.
Getting data can be hard
Despite the open-data movement and popularity of APIs, volumes of data are locked up in DOMs all over the internet.
Monitoring competitor prices
• As a retailer, I want to strategically set prices in relation to my competitors.
• But they aren’t interested in sharing their prices and mark-down strategies with me. 😭
• “Scrapy is an application framework for crawling web sites and extracting structured data which can be used for a wide range of useful applications, like data mining, information processing or historical archival.”
• scrapin’ on rails!
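If you want to follow along, Scrapy installs from PyPI (assuming pip is available):

pip install Scrapy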
scrapy startproject prices
scrapy project
prices/
    scrapy.cfg
    prices/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            ...
scrapy project, now with our first spider added
prices/
    scrapy.cfg
    prices/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            backcountry.py
            ...
Define what to scrape in items.py:

# items.py
from scrapy.item import Item, Field

class Product(Item):
    product_title = Field()
    description = Field()
    price = Field()
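Once instantiated, items behave like dictionaries. A quick sketch of using the Product item defined above (the values are illustrative):

item = Product()
item['product_title'] = "Cush Slipper - Men's"
item['price'] = 23.95

print(item['price'])   # 23.95
print(dict(item))      # {'product_title': "Cush Slipper - Men's", 'price': 23.95}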
protip: get to know the DOM.
Sometimes there are hidden gems.
SKU-level inventory availability? Score!
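Gems like that often live in embedded <script> tags rather than visible markup. A hypothetical sketch of digging one out; the 'skuInventory' variable name and the regex are made up for illustration and are not Backcountry's actual markup:

import json
import re

# Find a script tag that mentions the (hypothetical) inventory variable.
scripts = response.xpath("//script[contains(., 'skuInventory')]/text()").extract()
if scripts:
    match = re.search(r'skuInventory\s*=\s*(\{.*?\});', scripts[0], re.DOTALL)
    if match:
        inventory = json.loads(match.group(1))  # e.g. {'BAF0028-BKA-XL': 40, ...}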
Spider design
Spiders have two primary components:
1. Crawling (navigation) instructions
2. Parsing instructions
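In code, those two components map onto yielding requests (crawl) and yielding data (parse). A minimal sketch using Scrapy 1.0+ idioms; the spider name and URL are placeholders, not part of the talk's project:

import scrapy

class ExampleSpider(scrapy.Spider):
    name = 'example'                        # placeholder
    start_urls = ['http://example.com']     # placeholder

    def parse(self, response):
        # 1. Crawling: follow links by yielding more requests
        for href in response.xpath('//a/@href').extract():
            yield scrapy.Request(response.urljoin(href), callback=self.parse)

        # 2. Parsing: pull structured data out of the page
        yield {'title': response.xpath('//title/text()').extract_first()}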
Define the crawl behavior in spiders/backcountry.py
After spending some time on backcountry.com, I decided the all-brands landing page was the best starting URL.
Part I: Crawl Setup

# spiders/backcountry.py
import scrapy
from scrapy.spiders import CrawlSpider


class BackcountrySpider(CrawlSpider):
    name = 'backcountry'

    def __init__(self, *args, **kwargs):
        super(BackcountrySpider, self).__init__(*args, **kwargs)
        self.base_url = 'http://www.backcountry.com'
        self.start_urls = ['http://www.backcountry.com/Store/catalog/shopAllBrands.jsp']

    def parse_start_url(self, response):
        # Collect the relative URL of every brand on the all-brands page.
        brands = response.xpath("//a[@class='qa-brand-link']/@href").extract()

        for brand in brands:
            brand_url = str(self.base_url + brand)
            self.log("Queued up: %s" % brand_url)

            # e.g. brand_url = http://www.backcountry.com/burton
            yield scrapy.Request(url=brand_url,
                                 callback=self.parse_brand_landing_pages)
    def parse_brand_landing_pages(self, response):
        shop_all_pattern = "//a[@class='subcategory-link brand-plp-link qa-brand-plp-link']/@href"
        shop_all_link = response.xpath(shop_all_pattern).extract()

        if shop_all_link:
            # Brands with a "Shop All" link get crawled from their full product listing.
            all_product_url = str(self.base_url + shop_all_link[0])

            yield scrapy.Request(url=all_product_url,
                                 callback=self.parse_product_pages)
        else:
            # Otherwise the brand landing page itself lists the products.
            yield scrapy.Request(url=response.url,
                                 callback=self.parse_product_pages)
    def parse_product_pages(self, response):
        product_page_pattern = "//a[contains(@class, 'qa-product-link')]/@href"
        pagination_pattern = "//li[@class='page-link page-number']/a/@href"

        product_pages = response.xpath(product_page_pattern).extract()
        more_pages = response.xpath(pagination_pattern).extract()

        # Paginate! Each results page feeds back through this same method;
        # Scrapy's built-in dupefilter drops already-seen URLs, so the recursion terminates.
        for page in more_pages:
            next_page = str(self.base_url + page)
            yield scrapy.Request(url=next_page,
                                 callback=self.parse_product_pages)

        # Hand every product link off to the item parser.
        for product in product_pages:
            product_url = str(self.base_url + product)

            yield scrapy.Request(url=product_url,
                                 callback=self.parse_item)
Part II: Parsing

    def parse_item(self, response):
        item = Product()
        dirty_data = {}

        dirty_data['product_title'] = response.xpath("//*[@id='product-buy-box']/div/div[1]/h1/text()").extract()
        dirty_data['description'] = response.xpath("//div[@class='product-description']/text()").extract()
        dirty_data['price'] = response.xpath("//span[@itemprop='price']/text()").extract()

        for variable in dirty_data.keys():
            if dirty_data[variable]:
                if variable == 'price':
                    # '$1,299.95' -> 1299.95
                    item[variable] = float(''.join(dirty_data[variable]).strip().replace('$', '').replace(',', ''))
                else:
                    item[variable] = ''.join(dirty_data[variable]).strip()

        yield item
Part II: Clean it now!
The cleaning loop above runs inline in parse_item: strip whitespace and coerce prices to floats before items ever leave the spider, so the exported feed is analysis-ready.
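The same cleanup could instead live in pipelines.py, which is already sitting in the project skeleton. A minimal sketch of an Item Pipeline doing the price conversion, assuming the spider yielded the raw price string:

# pipelines.py -- a sketch; the talk's spider cleans inline instead
class PriceCleanupPipeline(object):

    def process_item(self, item, spider):
        if item.get('price'):
            item['price'] = float(str(item['price']).strip().replace('$', '').replace(',', ''))
        return item

Enable it in settings.py with ITEM_PIPELINES = {'prices.pipelines.PriceCleanupPipeline': 300}.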
scrapy crawl backcountry -o bc.json
2015-01-02 12:32:52-0800 [backcountry] INFO: Closing spider (finished)
2015-01-02 12:32:52-0800 [backcountry] INFO: Stored json feed (38881 items) in: bc.json
2015-01-02 12:32:52-0800 [backcountry] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 33068379,
'downloader/request_count': 41848,
'downloader/request_method_count/GET': 41848,
'downloader/response_bytes': 1715232531,
'downloader/response_count': 41848,
'downloader/response_status_count/200': 41835,
'downloader/response_status_count/301': 9,
'downloader/response_status_count/404': 4,
'dupefilter/filtered': 12481,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2015, 1, 2, 20, 32, 52, 524929),
'item_scraped_count': 38881,
'log_count/DEBUG': 81784,
'log_count/ERROR': 23,
'log_count/INFO': 26,
'request_depth_max': 7,
'response_received_count': 41839,
'scheduler/dequeued': 41848,
'scheduler/dequeued/memory': 41848,
'scheduler/enqueued': 41848,
'scheduler/enqueued/memory': 41848,
'spider_exceptions/IndexError': 23,
'start_time': datetime.datetime(2015, 1, 2, 20, 14, 16, 892071)}
2015-01-02 12:32:52-0800 [backcountry] INFO: Spider closed (finished)
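The -o flag uses Scrapy's feed exports, which infer the output format from the file extension, so switching formats is a one-character change:

scrapy crawl backcountry -o bc.csv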
One item from bc.json; the full scraper captures more fields than the three shown in items.py above:

{
  "review_count": 18,
  "product_id": "BAF0028",
  "brand": "Baffin",
  "product_url": "http://www.backcountry.com/baffin-cush-slipper-mens",
  "source": "backcountry",
  "inventory": {
    "BAF0028-ESP-S3XL": 27,
    "BAF0028-BKA-XL": 40,
    "BAF0028-NVA-XL": 5,
    "BAF0028-NVA-L": 7,
    "BAF0028-BKA-L": 17,
    "BAF0028-ESP-XXL": 12,
    "BAF0028-NVA-XXL": 6,
    "BAF0028-BKA-XXL": 44,
    "BAF0028-NVA-S3XL": 10,
    "BAF0028-ESP-L": 50,
    "BAF0028-ESP-XL": 52,
    "BAF0028-BKA-S3XL": 19
  },
  "price_high": 24.95,
  "price": 23.95,
  "description_short": "Cush Slipper - Men's",
  "price_low": 23.95,
  "review_score": 4
}
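Since -o bc.json writes a single JSON array, inspecting the haul afterwards takes a few lines (a sketch; field names match the item above):

import json

with open('bc.json') as f:
    products = json.load(f)

print(len(products))           # 38881 items
print(products[0]['price'])    # e.g. 23.95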
prices/
    scrapy.cfg
    prices/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            backcountry.py
            evo.py
            rei.py
            ...
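With one spider per retailer, refreshing everything is just a loop over scrapy list (a sketch; evo and rei follow the same pattern as backcountry):

for spider in $(scrapy list); do
    scrapy crawl $spider -o $spider.json
done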
“Data wrangling is a huge — and surprisingly so — part of the job. It’s something that is not appreciated by data civilians. At times, it feels like everything we do.”
–Monica Rogati, VP of Data at Jawbone
Resources
• Code here!
  • https://github.com/erinshellman/backcountry-scraper
• Lynn Root’s excellent end-to-end tutorial
  • http://newcoder.io/Intro-Scrape/
• Web scraping - It’s your civic duty
  • http://pbpython.com/web-scraping-mn-budget.html
Ladies!! Bring your projects to hacknight!
http://www.meetup.com/Seattle-PyLadies

Thursday, January 29th, 6PM
Intro to IPython and Matplotlib
Ada Developers Academy
1301 5th Avenue #1350