Downloading the internet with Python + Scrapy 💻🐍
Erin Shellman @erinshellman
Puget Sound Programming Python meet-up
January 14, 2015
hi!
I’m a data scientist in the Nordstrom Data Lab. I’ve built
scrapers to monitor the product catalogs of various
sports retailers.
Getting data can be hard
Despite the open-data movement and popularity
of APIs, volumes of data are locked up in DOMs all
over the internet.
Monitoring competitor prices
• As a retailer, I want to strategically set prices in
relation to my competitors.
• But they aren’t interested in sharing their prices and
mark-down strategies with me. 😭
• “Scrapy is an application framework for crawling web
sites and extracting structured data which can be
used for a wide range of useful applications, like data
mining, information processing or historical archival.”
• scrapin’ on rails!
scrapy startproject prices
scrapy project
prices/
    scrapy.cfg
    prices/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            ...
scrapy project
prices/
    scrapy.cfg
    prices/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            backcountry.py
            ...
from scrapy.item import Item, Field

class Product(Item):
    # One Field per attribute the spider should collect.
    product_title = Field()
    description = Field()
    price = Field()
Define what to
scrape in items.py
protip: get to know
the DOM.
Sometimes
there are
hidden gems.
SKU-level inventory
availability?
Score!
Spider design
Spiders have two primary components:
1. Crawling (navigation) instructions
2. Parsing instructions
Define the crawl behavior
in spiders/backcountry.py
After spending some
time on backcountry.com,
I decided the all brands
landing page was the
best starting URL.
import scrapy
from scrapy.spiders import CrawlSpider  # scrapy.contrib.spiders in Scrapy releases before 1.0

class BackcountrySpider(CrawlSpider):
    name = 'backcountry'

    def __init__(self, *args, **kwargs):
        super(BackcountrySpider, self).__init__(*args, **kwargs)
        self.base_url = 'http://www.backcountry.com'
        self.start_urls = ['http://www.backcountry.com/Store/catalog/shopAllBrands.jsp']

    def parse_start_url(self, response):
        # Every brand link on the all-brands landing page.
        brands = response.xpath("//a[@class='qa-brand-link']/@href").extract()

        for brand in brands:
            brand_url = str(self.base_url + brand)
            self.log("Queued up: %s" % brand_url)

            yield scrapy.Request(url=brand_url,
                                 callback=self.parse_brand_landing_pages)
Part I: Crawl Setup
e.g. brand_url = http://www.backcountry.com/burton
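The spider builds absolute URLs by concatenating `self.base_url` with each extracted `href`. Here is a minimal sketch of an alternative using the standard library's `urljoin`, assuming the hrefs may be either relative or already absolute. The helper name `absolute_url` and the sample hrefs are hypothetical, not from the talk.

```python
from urllib.parse import urljoin

BASE_URL = 'http://www.backcountry.com'  # same base URL the spider uses


def absolute_url(href, base=BASE_URL):
    """Resolve an extracted href against the site's base URL."""
    return urljoin(base, href)


# A relative href, as the brand links are assumed to be.
print(absolute_url('/burton'))
# An already-absolute href is returned unchanged instead of being prefixed twice.
print(absolute_url('http://www.backcountry.com/burton'))
```

Plain concatenation works while every link is relative; `urljoin` is a little more forgiving if the site mixes both styles.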
    def parse_brand_landing_pages(self, response):
        shop_all_pattern = "//a[@class='subcategory-link brand-plp-link qa-brand-plp-link']/@href"
        shop_all_link = response.xpath(shop_all_pattern).extract()

        if shop_all_link:
            # Follow the brand's "shop all" link when it has one.
            all_product_url = str(self.base_url + shop_all_link[0])

            yield scrapy.Request(url=all_product_url,
                                 callback=self.parse_product_pages)
        else:
            # Otherwise treat the landing page itself as the product listing.
            yield scrapy.Request(url=response.url,
                                 callback=self.parse_product_pages)
    def parse_product_pages(self, response):
        product_page_pattern = "//a[contains(@class, 'qa-product-link')]/@href"
        pagination_pattern = "//li[@class='page-link page-number']/a/@href"

        product_pages = response.xpath(product_page_pattern).extract()
        more_pages = response.xpath(pagination_pattern).extract()

        # Paginate!
        for page in more_pages:
            next_page = str(self.base_url + page)
            yield scrapy.Request(url=next_page,
                                 callback=self.parse_product_pages)

        for product in product_pages:
            product_url = str(self.base_url + product)

            yield scrapy.Request(url=product_url,
                                 callback=self.parse_item)
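Each results page links to the other results pages, so the pagination loop asks for the same URLs many times over. The crawl still ends because Scrapy filters duplicate requests by default (the `dupefilter/filtered` count in the run stats). Below is a rough plain-Python sketch of that idea, using a hypothetical in-memory link graph in place of real pages. It illustrates the principle and is not how Scrapy implements it.

```python
from collections import deque

# Hypothetical link graph: every results page links to all the others.
LINKS = {
    '/burton?page=1': ['/burton?page=2', '/burton?page=3'],
    '/burton?page=2': ['/burton?page=1', '/burton?page=3'],
    '/burton?page=3': ['/burton?page=1', '/burton?page=2'],
}


def crawl(start):
    """Visit every reachable page once and count the requests that were skipped."""
    seen = {start}
    queue = deque([start])
    visited, filtered = [], 0

    while queue:
        page = queue.popleft()
        visited.append(page)
        for link in LINKS.get(page, []):
            if link in seen:
                filtered += 1  # duplicate request, dropped
            else:
                seen.add(link)
                queue.append(link)

    return visited, filtered


visited, filtered = crawl('/burton?page=1')
print(visited, filtered)
```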
    def parse_item(self, response):

        item = Product()
        dirty_data = {}

        # Raw extraction: each xpath returns a list of text fragments.
        dirty_data['product_title'] = response.xpath("//*[@id='product-buy-box']/div/div[1]/h1/text()").extract()
        dirty_data['description'] = response.xpath("//div[@class='product-description']/text()").extract()
        dirty_data['price'] = response.xpath("//span[@itemprop='price']/text()").extract()

        # Clean each field that was actually found on the page.
        for variable in dirty_data.keys():
            if dirty_data[variable]:
                if variable == 'price':
                    item[variable] = float(''.join(dirty_data[variable]).strip().replace('$', '').replace(',', ''))
                else:
                    item[variable] = ''.join(dirty_data[variable]).strip()

        yield item
Part II: Parsing
for variable in dirty_data.keys():
    if dirty_data[variable]:
        if variable == 'price':
            item[variable] = float(''.join(dirty_data[variable]).strip().replace('$', '').replace(',', ''))
        else:
            item[variable] = ''.join(dirty_data[variable]).strip()
Part II: Clean it now!
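The cleaning step can be pulled out into a small function that is easy to test away from the spider. This is a minimal sketch, assuming prices arrive as a list of text fragments like the ones `extract()` returns. The name `clean_price` and the fall-back to `None` are assumptions, not part of the original spider.

```python
def clean_price(fragments):
    """Join extracted text fragments and convert them to a float.

    Returns None when nothing was extracted or the text is not a number.
    """
    text = ''.join(fragments).strip().replace('$', '').replace(',', '')
    try:
        return float(text)
    except ValueError:
        return None


print(clean_price(['\n  $1,299.95 ']))  # whitespace, dollar sign and comma removed
print(clean_price([]))                  # nothing extracted
```

Returning `None` instead of raising keeps one odd product page from stopping the item; whether that is the right trade-off depends on how the data is used downstream.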
scrapy crawl backcountry -o bc.json
2015-01-02 12:32:52-0800 [backcountry] INFO: Closing spider (finished)
2015-01-02 12:32:52-0800 [backcountry] INFO: Stored json feed (38881 items) in: bc.json
2015-01-02 12:32:52-0800 [backcountry] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 33068379,
'downloader/request_count': 41848,
'downloader/request_method_count/GET': 41848,
'downloader/response_bytes': 1715232531,
'downloader/response_count': 41848,
'downloader/response_status_count/200': 41835,
'downloader/response_status_count/301': 9,
'downloader/response_status_count/404': 4,
'dupefilter/filtered': 12481,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2015, 1, 2, 20, 32, 52, 524929),
'item_scraped_count': 38881,
'log_count/DEBUG': 81784,
'log_count/ERROR': 23,
'log_count/INFO': 26,
'request_depth_max': 7,
'response_received_count': 41839,
'scheduler/dequeued': 41848,
'scheduler/dequeued/memory': 41848,
'scheduler/enqueued': 41848,
'scheduler/enqueued/memory': 41848,
'spider_exceptions/IndexError': 23,
'start_time': datetime.datetime(2015, 1, 2, 20, 14, 16, 892071)}
2015-01-02 12:32:52-0800 [backcountry] INFO: Spider closed (finished)
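The stats dump has enough in it to estimate throughput: the whole crawl finished in under twenty minutes. Here is a small sketch of the arithmetic, with the values copied from the log above; the variable names are only illustrative.

```python
from datetime import datetime

# Values copied from the Scrapy stats dump.
start_time = datetime(2015, 1, 2, 20, 14, 16, 892071)
finish_time = datetime(2015, 1, 2, 20, 32, 52, 524929)
item_scraped_count = 38881
request_count = 41848

elapsed = (finish_time - start_time).total_seconds()
items_per_second = item_scraped_count / elapsed
requests_per_second = request_count / elapsed

print('elapsed: %.1f minutes' % (elapsed / 60))
print('items/s: %.1f' % items_per_second)
print('requests/s: %.1f' % requests_per_second)
```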
{
    "review_count": 18,
    "product_id": "BAF0028",
    "brand": "Baffin",
    "product_url": "http://www.backcountry.com/baffin-cush-slipper-mens",
    "source": "backcountry",
    "inventory": {
        "BAF0028-ESP-S3XL": 27,
        "BAF0028-BKA-XL": 40,
        "BAF0028-NVA-XL": 5,
        "BAF0028-NVA-L": 7,
        "BAF0028-BKA-L": 17,
        "BAF0028-ESP-XXL": 12,
        "BAF0028-NVA-XXL": 6,
        "BAF0028-BKA-XXL": 44,
        "BAF0028-NVA-S3XL": 10,
        "BAF0028-ESP-L": 50,
        "BAF0028-ESP-XL": 52,
        "BAF0028-BKA-S3XL": 19
    },
    "price_high": 24.95,
    "price": 23.95,
    "description_short": "Cush Slipper - Men's",
    "price_low": 23.95,
    "review_score": 4
}
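Once items are stored as JSON, the SKU-level inventory is easy to summarize. The sketch below uses a trimmed copy of the item above and assumes SKUs follow a `<product_id>-<color>-<size>` pattern, which is an inference from this one record and not something the talk states.

```python
import json
from collections import Counter

# A trimmed copy of the scraped item shown above.
raw = '''
{
    "product_id": "BAF0028",
    "inventory": {
        "BAF0028-ESP-S3XL": 27, "BAF0028-BKA-XL": 40, "BAF0028-NVA-XL": 5,
        "BAF0028-NVA-L": 7, "BAF0028-BKA-L": 17, "BAF0028-ESP-XXL": 12,
        "BAF0028-NVA-XXL": 6, "BAF0028-BKA-XXL": 44, "BAF0028-NVA-S3XL": 10,
        "BAF0028-ESP-L": 50, "BAF0028-ESP-XL": 52, "BAF0028-BKA-S3XL": 19
    }
}
'''

item = json.loads(raw)

# Total units available across every SKU.
total_units = sum(item['inventory'].values())

# Assumption: the middle segment of each SKU is a color code.
units_by_color = Counter()
for sku, units in item['inventory'].items():
    _, color, _ = sku.split('-')
    units_by_color[color] += units

print(total_units)
print(dict(units_by_color))
```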
prices/
    scrapy.cfg
    prices/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            backcountry.py
            evo.py
            rei.py
            ...
–Monica Rogati, VP of Data at Jawbone
“Data wrangling is a huge — and
surprisingly so — part of the job. It’s
something that is not appreciated by data
civilians. At times, it feels like everything
we do.”
Resources
• Code here!:
• https://github.com/erinshellman/backcountry-scraper
• Lynn Root’s excellent end-to-end tutorial.
• http://newcoder.io/Intro-Scrape/
• Web scraping - It’s your civic duty
• http://pbpython.com/web-scraping-mn-budget.html
Bring your projects to hacknight!
http://www.meetup.com/Seattle-PyLadies
Ladies!!
Thursday, January 29th, 6PM
Intro to IPython and Matplotlib
Ada Developers Academy
1301 5th Avenue #1350

Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
Scale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterScale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL Router
 
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to Hero
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 

Downloading the internet with Python + Scrapy

  • 1. Downloading the internet with Python + scrapy 💻🐍 Erin Shellman @erinshellman Puget Sound Programming Python meet-up January 14, 2015
  • 2. hi! I’m a data scientist in the Nordstrom Data Lab. I’ve built scrapers to monitor the product catalogs of various sports retailers.
  • 3. Getting data can be hard Despite the open-data movement and popularity of APIs, volumes of data are locked up in DOMs all over the internet.
  • 4. Monitoring competitor prices • As a retailer, I want to strategically set prices in relation to my competitors. • But they aren’t interested in sharing their prices and mark-down strategies with me. 😭
  • 5. • “Scrapy is an application framework for crawling web sites and extracting structured data which can be used for a wide range of useful applications, like data mining, information processing or historical archival.” • scrapin’ on rails!
  • 9. Define what to scrape in items.py
        from scrapy.item import Item, Field  # import not shown on the slide

        class Product(Item):
            product_title = Field()
            description = Field()
            price = Field()
  • 10. protip: get to know the DOM.
  • 12. Sometimes there are hidden gems. SKU-level inventory availability? Score!
  • 13. Spider design Spiders have two primary components: 1. Crawling (navigation) instructions 2. Parsing instructions
  • 14. Define the crawl behavior in spiders/backcountry.py After spending some time on backcountry.com, I decided the all brands landing page was the best starting URL.
  • 15. Part I: Crawl Setup
        # Imports are not shown on the slide; these are the ones the code needs.
        # Older Scrapy releases expose CrawlSpider from scrapy.contrib.spiders.
        import scrapy
        from scrapy.spiders import CrawlSpider

        class BackcountrySpider(CrawlSpider):
            name = 'backcountry'

            def __init__(self, *args, **kwargs):
                super(BackcountrySpider, self).__init__(*args, **kwargs)
                self.base_url = 'http://www.backcountry.com'
                self.start_urls = ['http://www.backcountry.com/Store/catalog/shopAllBrands.jsp']

            def parse_start_url(self, response):
                brands = response.xpath("//a[@class='qa-brand-link']/@href").extract()

                for brand in brands:
                    brand_url = str(self.base_url + brand)
                    self.log("Queued up: %s" % brand_url)

                    yield scrapy.Request(url=brand_url,
                                         callback=self.parse_brand_landing_pages)
  • 16. e.g. brand_url = http://www.backcountry.com/burton
        def parse_start_url(self, response):
            brands = response.xpath("//a[@class='qa-brand-link']/@href").extract()

            for brand in brands:
                brand_url = str(self.base_url + brand)
                self.log("Queued up: %s" % brand_url)

                yield scrapy.Request(url=brand_url,
                                     callback=self.parse_brand_landing_pages)
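The slide builds each brand URL by concatenating `base_url` with the scraped `href`. A minimal sketch of an alternative, assuming Python 3 and the standard library's `urllib.parse.urljoin`, which also copes with hrefs that are already absolute or that lack a leading slash. The helper name `absolute_url` is hypothetical and is not part of the talk's code.

```python
from urllib.parse import urljoin

BASE_URL = "http://www.backcountry.com"


def absolute_url(href, base=BASE_URL):
    """Resolve a scraped href against the site's base URL.

    Hypothetical helper: the slides use plain string concatenation instead.
    """
    # The trailing slash makes relative hrefs resolve against the site root.
    return urljoin(base + "/", href)


# Root-relative, bare, and already-absolute hrefs all resolve sensibly.
print(absolute_url("/burton"))
print(absolute_url("burton"))
print(absolute_url("http://www.backcountry.com/burton"))
```

Inside the spider, `response.urljoin(href)` plays a similar role in newer Scrapy releases; whether it is available depends on the installed version.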
  • 17.
        def parse_brand_landing_pages(self, response):
            shop_all_pattern = "//a[@class='subcategory-link brand-plp-link qa-brand-plp-link']/@href"
            shop_all_link = response.xpath(shop_all_pattern).extract()

            if shop_all_link:
                all_product_url = str(self.base_url + shop_all_link[0])

                yield scrapy.Request(url=all_product_url,
                                     callback=self.parse_product_pages)
            else:
                yield scrapy.Request(url=response.url,
                                     callback=self.parse_product_pages)
  • 18.
        def parse_product_pages(self, response):
            product_page_pattern = "//a[contains(@class, 'qa-product-link')]/@href"
            pagination_pattern = "//li[@class='page-link page-number']/a/@href"

            product_pages = response.xpath(product_page_pattern).extract()
            more_pages = response.xpath(pagination_pattern).extract()

            # Paginate!
            for page in more_pages:
                next_page = str(self.base_url + page)
                yield scrapy.Request(url=next_page,
                                     callback=self.parse_product_pages)

            for product in product_pages:
                product_url = str(self.base_url + product)

                yield scrapy.Request(url=product_url,
                                     callback=self.parse_item)
  • 20.
        # Paginate!
        for page in more_pages:
            next_page = str(self.base_url + page)
            yield scrapy.Request(url=next_page,
                                 callback=self.parse_product_pages)

        for product in product_pages:
            product_url = str(self.base_url + product)

            yield scrapy.Request(url=product_url,
                                 callback=self.parse_item)
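Every listing page queues a request for each pagination link it shows, so the same page gets requested from many places. The crawl still terminates because repeated requests are dropped; the stats on slide 24 report 'dupefilter/filtered': 12481. The toy sketch below illustrates the idea with a "seen" set over a made-up link graph. It is not Scrapy's implementation, and the `LINKS` data and `crawl` function are hypothetical.

```python
from collections import deque

# Hypothetical three-page listing where each page links to the other two,
# much like a row of pagination links.
LINKS = {
    "/burton?page=1": ["/burton?page=2", "/burton?page=3"],
    "/burton?page=2": ["/burton?page=1", "/burton?page=3"],
    "/burton?page=3": ["/burton?page=1", "/burton?page=2"],
}


def crawl(start):
    """Visit every reachable page once, counting the requests filtered out."""
    seen = {start}
    queue = deque([start])
    visited = []
    filtered = 0
    while queue:
        page = queue.popleft()
        visited.append(page)
        for link in LINKS.get(page, []):
            if link in seen:
                # Already queued or visited: drop it, as a dupe filter would.
                filtered += 1
                continue
            seen.add(link)
            queue.append(link)
    return visited, filtered


visited, filtered = crawl("/burton?page=1")
print(visited)
print("filtered duplicates:", filtered)
```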
  • 21. Part II: Parsing
        def parse_item(self, response):

            item = Product()
            dirty_data = {}

            dirty_data['product_title'] = response.xpath("//*[@id='product-buy-box']/div/div[1]/h1/text()").extract()
            dirty_data['description'] = response.xpath("//div[@class='product-description']/text()").extract()
            dirty_data['price'] = response.xpath("//span[@itemprop='price']/text()").extract()

            for variable in dirty_data.keys():
                if dirty_data[variable]:
                    if variable == 'price':
                        item[variable] = float(''.join(dirty_data[variable]).strip().replace('$', '').replace(',', ''))
                    else:
                        item[variable] = ''.join(dirty_data[variable]).strip()

            yield item
  • 22. Part II: Clean it now!
        for variable in dirty_data.keys():
            if dirty_data[variable]:
                if variable == 'price':
                    item[variable] = float(''.join(dirty_data[variable]).strip().replace('$', '').replace(',', ''))
                else:
                    item[variable] = ''.join(dirty_data[variable]).strip()
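The price branch above can be pulled into a small function so it can be checked without running a crawl. This is a sketch under assumptions: the name `clean_price` is hypothetical, and returning `None` for missing or unparseable text is a choice made here, not something the slides specify.

```python
def clean_price(fragments):
    """Turn the list of text fragments from .extract() into a float.

    Applies the same steps as the slide: join, strip, drop '$' and ','.
    Returns None when there is nothing usable (an assumption of this sketch).
    """
    text = "".join(fragments).strip()
    if not text:
        return None
    try:
        return float(text.replace("$", "").replace(",", ""))
    except ValueError:
        # For example, a price range such as "$23.95 - $24.95".
        return None


print(clean_price([" $1,299.95 "]))
print(clean_price([]))
print(clean_price(["$23.95 - $24.95"]))
```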
  • 24.
        2015-01-02 12:32:52-0800 [backcountry] INFO: Closing spider (finished)
        2015-01-02 12:32:52-0800 [backcountry] INFO: Stored json feed (38881 items) in: bc.json
        2015-01-02 12:32:52-0800 [backcountry] INFO: Dumping Scrapy stats:
            {'downloader/request_bytes': 33068379,
             'downloader/request_count': 41848,
             'downloader/request_method_count/GET': 41848,
             'downloader/response_bytes': 1715232531,
             'downloader/response_count': 41848,
             'downloader/response_status_count/200': 41835,
             'downloader/response_status_count/301': 9,
             'downloader/response_status_count/404': 4,
             'dupefilter/filtered': 12481,
             'finish_reason': 'finished',
             'finish_time': datetime.datetime(2015, 1, 2, 20, 32, 52, 524929),
             'item_scraped_count': 38881,
             'log_count/DEBUG': 81784,
             'log_count/ERROR': 23,
             'log_count/INFO': 26,
             'request_depth_max': 7,
             'response_received_count': 41839,
             'scheduler/dequeued': 41848,
             'scheduler/dequeued/memory': 41848,
             'scheduler/enqueued': 41848,
             'scheduler/enqueued/memory': 41848,
             'spider_exceptions/IndexError': 23,
             'start_time': datetime.datetime(2015, 1, 2, 20, 14, 16, 892071)}
        2015-01-02 12:32:52-0800 [backcountry] INFO: Spider closed (finished)
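The stats dump is easier to read as rates. A small sketch, using only values copied from the log above, that derives the crawl duration and throughput; the variable names are illustrative.

```python
from datetime import datetime

# Values copied from the stats dump on slide 24.
stats = {
    "start_time": datetime(2015, 1, 2, 20, 14, 16, 892071),
    "finish_time": datetime(2015, 1, 2, 20, 32, 52, 524929),
    "item_scraped_count": 38881,
    "downloader/request_count": 41848,
}

elapsed = (stats["finish_time"] - stats["start_time"]).total_seconds()
items_per_second = stats["item_scraped_count"] / elapsed
requests_per_second = stats["downloader/request_count"] / elapsed

print("elapsed: %.1f minutes" % (elapsed / 60))
print("items/s: %.1f" % items_per_second)
print("requests/s: %.1f" % requests_per_second)
```

The run took a little under 19 minutes, which works out to roughly 35 items per second.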
  • 25.
        {
          "review_count": 18,
          "product_id": "BAF0028",
          "brand": "Baffin",
          "product_url": "http://www.backcountry.com/baffin-cush-slipper-mens",
          "source": "backcountry",
          "inventory": {
            "BAF0028-ESP-S3XL": 27,
            "BAF0028-BKA-XL": 40,
            "BAF0028-NVA-XL": 5,
            "BAF0028-NVA-L": 7,
            "BAF0028-BKA-L": 17,
            "BAF0028-ESP-XXL": 12,
            "BAF0028-NVA-XXL": 6,
            "BAF0028-BKA-XXL": 44,
            "BAF0028-NVA-S3XL": 10,
            "BAF0028-ESP-L": 50,
            "BAF0028-ESP-XL": 52,
            "BAF0028-BKA-S3XL": 19
          },
          "price_high": 24.95,
          "price": 23.95,
          "description_short": "Cush Slipper - Men's",
          "price_low": 23.95,
          "review_score": 4
        }
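Once items land in a JSON feed, the SKU-level inventory can be summarized with the standard library. The sketch below uses the inventory from the item above and assumes that each SKU has the form product-variant-size, with the middle segment acting as a variant (probably color) code. That reading of the SKU is a guess from the data shown, not something the talk states.

```python
import json
from collections import defaultdict

# Inventory copied from the scraped item on slide 25.
item = json.loads("""
{
  "product_id": "BAF0028",
  "inventory": {
    "BAF0028-ESP-S3XL": 27, "BAF0028-BKA-XL": 40, "BAF0028-NVA-XL": 5,
    "BAF0028-NVA-L": 7, "BAF0028-BKA-L": 17, "BAF0028-ESP-XXL": 12,
    "BAF0028-NVA-XXL": 6, "BAF0028-BKA-XXL": 44, "BAF0028-NVA-S3XL": 10,
    "BAF0028-ESP-L": 50, "BAF0028-ESP-XL": 52, "BAF0028-BKA-S3XL": 19
  }
}
""")

total_units = sum(item["inventory"].values())

# Assumption: the SKU's middle segment identifies the variant.
units_by_variant = defaultdict(int)
for sku, units in item["inventory"].items():
    _product, variant, _size = sku.split("-")
    units_by_variant[variant] += units

print("total units:", total_units)
print(dict(units_by_variant))
```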
  • 27. –Monica Rogati, VP of Data at Jawbone “Data wrangling is a huge — and surprisingly so — part of the job. It’s something that is not appreciated by data civilians. At times, it feels like everything we do.”
  • 28. Resources
        • Code here: https://github.com/erinshellman/backcountry-scraper
        • Lynn Root’s excellent end-to-end tutorial: http://newcoder.io/Intro-Scrape/
        • Web scraping - It’s your civic duty: http://pbpython.com/web-scraping-mn-budget.html
  • 30. Bring your projects to hacknight! http://www.meetup.com/Seattle-PyLadies Ladies!! Thursday, January 29th, 6PM: Intro to IPython and Matplotlib. Ada Developers Academy, 1301 5th Avenue #1350