Downloading the internet with Python + scrapy 💻🐍
Erin Shellman @erinshellman
Puget Sound Programming Python meet-up
January 14, 2015
hi!
I’m a data scientist in the Nordstrom Data Lab. I’ve built
scrapers to monitor the product catalogs of various
sports retailers.
Getting data can be hard
Despite the open-data movement and popularity
of APIs, volumes of data are locked up in DOMs all
over the internet.
Monitoring competitor prices
• As a retailer, I want to strategically set prices in
relation to my competitors.
• But they aren’t interested in sharing their prices and
mark-down strategies with me. 😭
• “Scrapy is an application framework for crawling web
sites and extracting structured data which can be
used for a wide range of useful applications, like data
mining, information processing or historical archival.”
• scrapin’ on rails!
scrapy startproject prices
scrapy project
prices/
    scrapy.cfg
    prices/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            ...
scrapy project
prices/
    scrapy.cfg
    prices/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            backcountry.py
            ...
from scrapy import Item, Field

class Product(Item):
    # One Field per attribute we want to capture for each product.
    product_title = Field()
    description = Field()
    price = Field()

Define what to scrape in items.py
protip: get to know
the DOM.
Sometimes there are hidden gems.
SKU-level inventory availability? Score!
Spider design
Spiders have two primary components:
1. Crawling (navigation) instructions
2. Parsing instructions
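To make the two components concrete, here is a minimal sketch that uses only the standard library rather than scrapy. The four-page `SITE`, the `PageParser` class, and the `crawl` function are all made up for illustration; a real spider would fetch pages over HTTP and let scrapy manage the queue.

```python
from collections import deque
from html.parser import HTMLParser

# Hypothetical in-memory site standing in for a real product catalog.
SITE = {
    "/brands": '<a class="brand" href="/acme">Acme</a>',
    "/acme": '<a class="product" href="/acme/boot">Boot</a>'
             '<a class="product" href="/acme/sock">Sock</a>',
    "/acme/boot": '<h1>Boot</h1><span class="price">$120.00</span>',
    "/acme/sock": '<h1>Sock</h1><span class="price">$12.50</span>',
}


class PageParser(HTMLParser):
    """Collects link targets plus the text of <h1> and price <span> tags."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.fields = {}
        self._current = None  # field whose text we are currently inside

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and "href" in attrs:
            self.links.append(attrs["href"])
        elif tag == "h1":
            self._current = "product_title"
        elif tag == "span" and attrs.get("class") == "price":
            self._current = "price"

    def handle_data(self, data):
        if self._current:
            self.fields[self._current] = data.strip()

    def handle_endtag(self, tag):
        self._current = None


def crawl(start_url, fetch):
    """Breadth-first crawl: follow links (1), then parse product pages (2)."""
    queue, seen, items = deque([start_url]), {start_url}, []
    while queue:
        url = queue.popleft()
        parser = PageParser()
        parser.feed(fetch(url))
        parser.close()
        # 1. Crawling instructions: queue every link we have not seen yet.
        for link in parser.links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
        # 2. Parsing instructions: a page with a price is a product page.
        if "price" in parser.fields:
            items.append({"url": url, **parser.fields})
    return items


items = crawl("/brands", SITE.__getitem__)
print(items)
```

The `seen` set plays the role that scrapy's duplicate filter plays in a real crawl: a page linked from many places is only visited once.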
Define the crawl behavior in spiders/backcountry.py
After spending some time on backcountry.com, I decided the all-brands landing page was the best starting URL.
import scrapy
# Scrapy 1.0+ import path; older releases used scrapy.contrib.spiders.
from scrapy.spiders import CrawlSpider

from prices.items import Product


class BackcountrySpider(CrawlSpider):
    name = 'backcountry'

    def __init__(self, *args, **kwargs):
        super(BackcountrySpider, self).__init__(*args, **kwargs)
        self.base_url = 'http://www.backcountry.com'
        self.start_urls = ['http://www.backcountry.com/Store/catalog/shopAllBrands.jsp']

    def parse_start_url(self, response):
        # Collect the relative link to every brand on the all-brands page.
        brands = response.xpath("//a[@class='qa-brand-link']/@href").extract()

        for brand in brands:
            brand_url = str(self.base_url + brand)
            self.log("Queued up: %s" % brand_url)

            # Visit each brand's landing page next.
            yield scrapy.Request(url=brand_url,
                                 callback=self.parse_brand_landing_pages)
Part I: Crawl Setup
e.g. brand_url = http://www.backcountry.com/burton
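The spider builds absolute URLs by gluing `base_url` onto each scraped href, which works as long as every href is site-relative. As a hedged alternative, here is a small sketch using `urllib.parse.urljoin` from the standard library; the helper name `absolute_url` is made up for this example.

```python
from urllib.parse import urljoin

BASE_URL = "http://www.backcountry.com"


def absolute_url(href, base=BASE_URL):
    """Resolve an href scraped from a page against the site's base URL."""
    return urljoin(base, href)


# A site-relative href, like the ones on the all-brands page.
brand_url = absolute_url("/burton")

# An href that is already absolute comes back unchanged, whereas plain
# string concatenation would produce a malformed URL.
already_absolute = absolute_url("http://www.backcountry.com/burton")

print(brand_url)
print(already_absolute)
```

Inside a spider callback, scrapy's `response.urljoin(href)` does the same resolution against the URL of the page being parsed.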
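    def parse_brand_landing_pages(self, response):
        shop_all_pattern = "//a[@class='subcategory-link brand-plp-link qa-brand-plp-link']/@href"
        shop_all_link = response.xpath(shop_all_pattern).extract()

        if shop_all_link:
            # Follow the brand's "shop all" link when the page has one.
            all_product_url = str(self.base_url + shop_all_link[0])

            yield scrapy.Request(url=all_product_url,
                                 callback=self.parse_product_pages)
        else:
            # Otherwise treat the landing page itself as the product listing.
            # This URL was already fetched, so dont_filter=True keeps the
            # duplicate filter from dropping the request.
            yield scrapy.Request(url=response.url,
                                 callback=self.parse_product_pages,
                                 dont_filter=True)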
    def parse_product_pages(self, response):
        product_page_pattern = "//a[contains(@class, 'qa-product-link')]/@href"
        pagination_pattern = "//li[@class='page-link page-number']/a/@href"

        product_pages = response.xpath(product_page_pattern).extract()
        more_pages = response.xpath(pagination_pattern).extract()

        # Paginate! Pages queued more than once are dropped by the duplicate filter.
        for page in more_pages:
            next_page = str(self.base_url + page)
            yield scrapy.Request(url=next_page,
                                 callback=self.parse_product_pages)

        # Queue every product on this listing page for parsing.
        for product in product_pages:
            product_url = str(self.base_url + product)

            yield scrapy.Request(url=product_url,
                                 callback=self.parse_item)
    def parse_item(self, response):
        item = Product()
        dirty_data = {}

        dirty_data['product_title'] = response.xpath("//*[@id='product-buy-box']/div/div[1]/h1/text()").extract()
        dirty_data['description'] = response.xpath("//div[@class='product-description']/text()").extract()
        dirty_data['price'] = response.xpath("//span[@itemprop='price']/text()").extract()

        for variable in dirty_data.keys():
            # Only fill fields that matched something on the page.
            if dirty_data[variable]:
                if variable == 'price':
                    # Strip the dollar sign and thousands separators before converting.
                    item[variable] = float(''.join(dirty_data[variable]).strip().replace('$', '').replace(',', ''))
                else:
                    item[variable] = ''.join(dirty_data[variable]).strip()

        yield item
Part II: Parsing
Part II: Clean it now!
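The price clean-up in `parse_item` can also be pulled into a small function that is easy to test away from the spider. This is a sketch, not the code from the talk: the name `clean_price` and the choice to return `None` for unparseable text are assumptions made for this example.

```python
def clean_price(fragments):
    """Join XPath text fragments and convert them to a float.

    Returns None when the fragments are empty or do not hold a number,
    so that one odd page does not raise inside the spider.
    """
    text = "".join(fragments).strip().replace("$", "").replace(",", "")
    try:
        return float(text)
    except ValueError:
        return None


print(clean_price([" $1,299.95 "]))
print(clean_price([]))
print(clean_price(["Sold out"]))
```

Cleaning at scrape time keeps the output file uniform, so whatever reads bc.json later can treat `price` as a number.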
scrapy crawl backcountry -o bc.json
2015-01-02 12:32:52-0800 [backcountry] INFO: Closing spider (finished)
2015-01-02 12:32:52-0800 [backcountry] INFO: Stored json feed (38881 items) in: bc.json
2015-01-02 12:32:52-0800 [backcountry] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 33068379,
'downloader/request_count': 41848,
'downloader/request_method_count/GET': 41848,
'downloader/response_bytes': 1715232531,
'downloader/response_count': 41848,
'downloader/response_status_count/200': 41835,
'downloader/response_status_count/301': 9,
'downloader/response_status_count/404': 4,
'dupefilter/filtered': 12481,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2015, 1, 2, 20, 32, 52, 524929),
'item_scraped_count': 38881,
'log_count/DEBUG': 81784,
'log_count/ERROR': 23,
'log_count/INFO': 26,
'request_depth_max': 7,
'response_received_count': 41839,
'scheduler/dequeued': 41848,
'scheduler/dequeued/memory': 41848,
'scheduler/enqueued': 41848,
'scheduler/enqueued/memory': 41848,
'spider_exceptions/IndexError': 23,
'start_time': datetime.datetime(2015, 1, 2, 20, 14, 16, 892071)}
2015-01-02 12:32:52-0800 [backcountry] INFO: Spider closed (finished)
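The stats dump is a plain Python dict, so a few lines of arithmetic turn it into crawl duration and throughput. The sketch below copies five of the values from the log above; the variable names are made up for this example.

```python
import datetime

# Values copied from the stats dump above.
stats = {
    "downloader/request_count": 41848,
    "downloader/response_bytes": 1715232531,
    "item_scraped_count": 38881,
    "start_time": datetime.datetime(2015, 1, 2, 20, 14, 16, 892071),
    "finish_time": datetime.datetime(2015, 1, 2, 20, 32, 52, 524929),
}

# Wall-clock length of the crawl in seconds.
elapsed = (stats["finish_time"] - stats["start_time"]).total_seconds()

requests_per_second = stats["downloader/request_count"] / elapsed
items_per_second = stats["item_scraped_count"] / elapsed
megabytes = stats["downloader/response_bytes"] / 1e6

print(f"{elapsed / 60:.1f} min, {requests_per_second:.1f} req/s, "
      f"{items_per_second:.1f} items/s, {megabytes:.0f} MB downloaded")
```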
{
    "review_count": 18,
    "product_id": "BAF0028",
    "brand": "Baffin",
    "product_url": "http://www.backcountry.com/baffin-cush-slipper-mens",
    "source": "backcountry",
    "inventory": {
        "BAF0028-ESP-S3XL": 27,
        "BAF0028-BKA-XL": 40,
        "BAF0028-NVA-XL": 5,
        "BAF0028-NVA-L": 7,
        "BAF0028-BKA-L": 17,
        "BAF0028-ESP-XXL": 12,
        "BAF0028-NVA-XXL": 6,
        "BAF0028-BKA-XXL": 44,
        "BAF0028-NVA-S3XL": 10,
        "BAF0028-ESP-L": 50,
        "BAF0028-ESP-XL": 52,
        "BAF0028-BKA-S3XL": 19
    },
    "price_high": 24.95,
    "price": 23.95,
    "description_short": "Cush Slipper - Men's",
    "price_low": 23.95,
    "review_score": 4
}
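Once inventory is captured per SKU, simple roll-ups fall out of the scraped item. The sketch below copies the `inventory` block from the item above and assumes each SKU has the form product_id-color-size; that reading of the SKU format is a guess from the sample, not something the retailer documents.

```python
from collections import Counter

# Inventory copied from the scraped item above.
inventory = {
    "BAF0028-ESP-S3XL": 27,
    "BAF0028-BKA-XL": 40,
    "BAF0028-NVA-XL": 5,
    "BAF0028-NVA-L": 7,
    "BAF0028-BKA-L": 17,
    "BAF0028-ESP-XXL": 12,
    "BAF0028-NVA-XXL": 6,
    "BAF0028-BKA-XXL": 44,
    "BAF0028-NVA-S3XL": 10,
    "BAF0028-ESP-L": 50,
    "BAF0028-ESP-XL": 52,
    "BAF0028-BKA-S3XL": 19,
}

total_units = sum(inventory.values())

# Assumption: SKUs look like "<product_id>-<color>-<size>".
units_by_color = Counter()
for sku, units in inventory.items():
    _product_id, color, _size = sku.split("-")
    units_by_color[color] += units

print(total_units)
print(dict(units_by_color))
```

Scraping the same product on a schedule and differencing these totals gives a rough picture of how fast each color is selling.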
prices/
    scrapy.cfg
    prices/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            backcountry.py
            evo.py
            rei.py
            ...
–Monica Rogati, VP of Data at Jawbone
“Data wrangling is a huge — and
surprisingly so — part of the job. It’s
something that is not appreciated by data
civilians. At times, it feels like everything
we do.”
Resources
• Code here!:
• https://github.com/erinshellman/backcountry-scraper
• Lynn Root’s excellent end-to-end tutorial.
• http://newcoder.io/Intro-Scrape/
• Web scraping - It’s your civic duty
• http://pbpython.com/web-scraping-mn-budget.html
Bring your projects to hacknight!
http://www.meetup.com/Seattle-PyLadies
Ladies!!
Thursday, January 29th, 6PM
Intro to IPython and Matplotlib
Ada Developers Academy
1301 5th Avenue #1350

Monitoring Java Application Security with JDK Tools and JFR EventsMonitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR Events
 
Introduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - CybersecurityIntroduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - Cybersecurity
 
Elevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object CalisthenicsElevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object Calisthenics
 
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
 
PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)
 
How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...
 

Downloading the internet with Python + Scrapy

  • 1. Downloading the internet with Python + scrapy 💻🐍 Erin Shellman @erinshellman Puget Sound Programming Python meet-up January 14, 2015
  • 2. hi! I’m a data scientist in the Nordstrom Data Lab. I’ve built scrapers to monitor the product catalogs of various sports retailers.
  • 3. Getting data can be hard Despite the open-data movement and popularity of APIs, volumes of data are locked up in DOMs all over the internet.
  • 4. Monitoring competitor prices • As a retailer, I want to strategically set prices in relation to my competitors. • But they aren’t interested in sharing their prices and mark-down strategies with me. 😭
  • 5. • “Scrapy is an application framework for crawling web sites and extracting structured data which can be used for a wide range of useful applications, like data mining, information processing or historical archival.” • scrapin’ on rails!
  • 9. Define what to scrape in items.py
      class Product(Item):
          product_title = Field()
          description = Field()
          price = Field()
  • 10. protip: get to know the DOM.
  • 12. Sometimes there are hidden gems. SKU-level inventory availability? Score!
  • 13. Spider design Spiders have two primary components: 1. Crawling (navigation) instructions 2. Parsing instructions
  • 14. Define the crawl behavior in spiders/backcountry.py After spending some time on backcountry.com, I decided the all brands landing page was the best starting URL.
  • 15. Part I: Crawl Setup
      class BackcountrySpider(CrawlSpider):
          name = 'backcountry'
          def __init__(self, *args, **kwargs):
              super(BackcountrySpider, self).__init__(*args, **kwargs)
              self.base_url = 'http://www.backcountry.com'
              self.start_urls = ['http://www.backcountry.com/Store/catalog/shopAllBrands.jsp']
          def parse_start_url(self, response):
              brands = response.xpath("//a[@class='qa-brand-link']/@href").extract()
              for brand in brands:
                  brand_url = str(self.base_url + brand)
                  self.log("Queued up: %s" % brand_url)
                  yield scrapy.Request(url=brand_url,
                                       callback=self.parse_brand_landing_pages)
  • 16. e.g. brand_url = http://www.backcountry.com/burton
      def parse_start_url(self, response):
          brands = response.xpath("//a[@class='qa-brand-link']/@href").extract()
          for brand in brands:
              brand_url = str(self.base_url + brand)
              self.log("Queued up: %s" % brand_url)
              yield scrapy.Request(url=brand_url,
                                   callback=self.parse_brand_landing_pages)
  • 17.
      def parse_brand_landing_pages(self, response):
          shop_all_pattern = "//a[@class='subcategory-link brand-plp-link qa-brand-plp-link']/@href"
          shop_all_link = response.xpath(shop_all_pattern).extract()
          if shop_all_link:
              all_product_url = str(self.base_url + shop_all_link[0])
              yield scrapy.Request(url=all_product_url,
                                   callback=self.parse_product_pages)
          else:
              yield scrapy.Request(url=response.url,
                                   callback=self.parse_product_pages)
  • 18.
      def parse_product_pages(self, response):
          product_page_pattern = "//a[contains(@class, 'qa-product-link')]/@href"
          pagination_pattern = "//li[@class='page-link page-number']/a/@href"
          product_pages = response.xpath(product_page_pattern).extract()
          more_pages = response.xpath(pagination_pattern).extract()
          # Paginate!
          for page in more_pages:
              next_page = str(self.base_url + page)
              yield scrapy.Request(url=next_page,
                                   callback=self.parse_product_pages)
          for product in product_pages:
              product_url = str(self.base_url + product)
              yield scrapy.Request(url=product_url,
                                   callback=self.parse_item)
  • 20.
      # Paginate!
      for page in more_pages:
          next_page = str(self.base_url + page)
          yield scrapy.Request(url=next_page,
                               callback=self.parse_product_pages)
      for product in product_pages:
          product_url = str(self.base_url + product)
          yield scrapy.Request(url=product_url,
                               callback=self.parse_item)
  • 21. Part II: Parsing
      def parse_item(self, response):
          item = Product()
          dirty_data = {}
          dirty_data['product_title'] = response.xpath("//*[@id='product-buy-box']/div/div[1]/h1/text()").extract()
          dirty_data['description'] = response.xpath("//div[@class='product-description']/text()").extract()
          dirty_data['price'] = response.xpath("//span[@itemprop='price']/text()").extract()
          for variable in dirty_data.keys():
              if dirty_data[variable]:
                  if variable == 'price':
                      item[variable] = float(''.join(dirty_data[variable]).strip().replace('$', '').replace(',', ''))
                  else:
                      item[variable] = ''.join(dirty_data[variable]).strip()
          yield item
  • 22. Part II: Clean it now!
      for variable in dirty_data.keys():
          if dirty_data[variable]:
              if variable == 'price':
                  item[variable] = float(''.join(dirty_data[variable]).strip().replace('$', '').replace(',', ''))
              else:
                  item[variable] = ''.join(dirty_data[variable]).strip()
  • 24.
      2015-01-02 12:32:52-0800 [backcountry] INFO: Closing spider (finished)
      2015-01-02 12:32:52-0800 [backcountry] INFO: Stored json feed (38881 items) in: bc.json
      2015-01-02 12:32:52-0800 [backcountry] INFO: Dumping Scrapy stats:
          {'downloader/request_bytes': 33068379,
           'downloader/request_count': 41848,
           'downloader/request_method_count/GET': 41848,
           'downloader/response_bytes': 1715232531,
           'downloader/response_count': 41848,
           'downloader/response_status_count/200': 41835,
           'downloader/response_status_count/301': 9,
           'downloader/response_status_count/404': 4,
           'dupefilter/filtered': 12481,
           'finish_reason': 'finished',
           'finish_time': datetime.datetime(2015, 1, 2, 20, 32, 52, 524929),
           'item_scraped_count': 38881,
           'log_count/DEBUG': 81784,
           'log_count/ERROR': 23,
           'log_count/INFO': 26,
           'request_depth_max': 7,
           'response_received_count': 41839,
           'scheduler/dequeued': 41848,
           'scheduler/dequeued/memory': 41848,
           'scheduler/enqueued': 41848,
           'scheduler/enqueued/memory': 41848,
           'spider_exceptions/IndexError': 23,
           'start_time': datetime.datetime(2015, 1, 2, 20, 14, 16, 892071)}
      2015-01-02 12:32:52-0800 [backcountry] INFO: Spider closed (finished)
  • 25.
      {
        "review_count": 18,
        "product_id": "BAF0028",
        "brand": "Baffin",
        "product_url": "http://www.backcountry.com/baffin-cush-slipper-mens",
        "source": "backcountry",
        "inventory": {
          "BAF0028-ESP-S3XL": 27,
          "BAF0028-BKA-XL": 40,
          "BAF0028-NVA-XL": 5,
          "BAF0028-NVA-L": 7,
          "BAF0028-BKA-L": 17,
          "BAF0028-ESP-XXL": 12,
          "BAF0028-NVA-XXL": 6,
          "BAF0028-BKA-XXL": 44,
          "BAF0028-NVA-S3XL": 10,
          "BAF0028-ESP-L": 50,
          "BAF0028-ESP-XL": 52,
          "BAF0028-BKA-S3XL": 19
        },
        "price_high": 24.95,
        "price": 23.95,
        "description_short": "Cush Slipper - Men's",
        "price_low": 23.95,
        "review_score": 4
      }
  • 27. –Monica Rogati, VP of Data at Jawbone “Data wrangling is a huge — and surprisingly so — part of the job. It’s something that is not appreciated by data civilians. At times, it feels like everything we do.”
  • 28. Resources
      • Code here!: https://github.com/erinshellman/backcountry-scraper
      • Lynn Root’s excellent end-to-end tutorial: http://newcoder.io/Intro-Scrape/
      • Web scraping - It’s your civic duty: http://pbpython.com/web-scraping-mn-budget.html
  • 29. Bring your projects to hacknight! http://www.meetup.com/Seattle-PyLadies Ladies!!
  • 30. Bring your projects to hacknight! http://www.meetup.com/Seattle-PyLadies Ladies!! Thursday, January 29th, 6PM: Intro to IPython and Matplotlib. Ada Developers Academy, 1301 5th Avenue #1350
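
The "Clean it now!" step from the parsing slides can be tried out without running a crawl. Below is a minimal sketch that pulls the same join/strip/convert logic into a standalone function. The function name `clean_field`, its parameters, and the sample price are hypothetical and made up for illustration; the slides do this inline inside `parse_item`, and only the product title is borrowed from the sample output above.

```python
def clean_field(name, raw_values):
    """Tidy the list of text fragments that .extract() returns for one field.

    `name` and `raw_values` are hypothetical names chosen for this sketch.
    Returns None when nothing was extracted, so callers can skip the field.
    """
    if not raw_values:
        return None
    # .extract() yields a list of strings; glue them together and trim.
    text = ''.join(raw_values).strip()
    if name == 'price':
        # Drop the currency symbol and thousands separator before converting.
        return float(text.replace('$', '').replace(',', ''))
    return text


# Made-up sample input shaped like the dirty_data dict in parse_item.
dirty_data = {
    'product_title': ["  Cush Slipper - Men's  "],
    'description': [],                # nothing matched the XPath
    'price': ['\n $1,023.95 '],       # invented price to exercise the comma
}

cleaned = {}
for field, values in dirty_data.items():
    value = clean_field(field, values)
    if value is not None:
        cleaned[field] = value

print(cleaned)
```

Fields with no extracted text are left out rather than set to an empty string, which mirrors the `if dirty_data[variable]:` check in the slides.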