
Downloading the internet with Python + Scrapy

Here I walk through a quick example of how to build a scraper in Python to monitor retail prices.



  1. Downloading the internet with Python + scrapy 💻🐍
     Erin Shellman, @erinshellman
     Puget Sound Programming Python meet-up, January 14, 2015
  2. hi! I’m a data scientist in the Nordstrom Data Lab. I’ve built scrapers to monitor the product catalogs of various sports retailers.
  3. Getting data can be hard. Despite the open-data movement and the popularity of APIs, volumes of data are locked up in DOMs all over the internet.
  4. Monitoring competitor prices
     • As a retailer, I want to strategically set prices in relation to my competitors.
     • But they aren’t interested in sharing their prices and mark-down strategies with me. 😭
  5. “Scrapy is an application framework for crawling web sites and extracting structured data which can be used for a wide range of useful applications, like data mining, information processing or historical archival.”
     • scrapin’ on rails!
  6. scrapy startproject prices
  7. scrapy project

     prices/
         scrapy.cfg
         prices/
             __init__.py
             items.py
             pipelines.py
             settings.py
             spiders/
                 __init__.py
                 ...
  8. scrapy project

     prices/
         scrapy.cfg
         prices/
             __init__.py
             items.py
             pipelines.py
             settings.py
             spiders/
                 __init__.py
                 backcountry.py
                 ...
  9. Define what to scrape in items.py:

     from scrapy.item import Item, Field

     class Product(Item):
         product_title = Field()
         description = Field()
         price = Field()
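
     An Item behaves like a dict restricted to its declared fields, so a misspelled key fails loudly instead of silently creating a new column. A quick sketch of that behavior (the values are made up):

         item = Product(product_title='Cush Slipper', price=23.95)
         item['description'] = 'A warm slipper.'  # fine: declared field
         item['pricee'] = 19.95                   # raises KeyError, typo caught at scrape time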
  10. protip: get to know the DOM.
  11. protip: get to know the DOM.
  12. Sometimes there are hidden gems. SKU-level inventory availability? Score!
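
      A handy way to do that DOM spelunking is Scrapy's built-in interactive shell: it fetches a page and drops you into a Python session with `response` ready to query, so XPath patterns can be tested before they go into a spider. A sketch (this brand URL is just an example):

          $ scrapy shell 'http://www.backcountry.com/burton'
          >>> response.xpath("//span[@itemprop='price']/text()").extract()
          >>> response.xpath("//a[@class='qa-brand-link']/@href").extract()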
  13. Spider design. Spiders have two primary components:
      1. Crawling (navigation) instructions
      2. Parsing instructions
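
      In code, those two components become two kinds of callbacks: ones that yield new scrapy.Request objects (crawling) and ones that yield structured data (parsing). A minimal skeleton of that shape, assuming Scrapy >= 1.0 (the name and URL are placeholders, not from the talk):

          import scrapy

          class SkeletonSpider(scrapy.Spider):
              name = 'skeleton'
              start_urls = ['http://example.com/']  # placeholder

              # 1. Crawling: navigate by yielding more Requests
              def parse(self, response):
                  for href in response.xpath('//a/@href').extract():
                      yield scrapy.Request(url=response.urljoin(href),
                                           callback=self.parse_item)

              # 2. Parsing: turn a response into structured data
              def parse_item(self, response):
                  yield {'url': response.url,
                         'title': ''.join(response.xpath('//title/text()').extract()).strip()}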
  14. Define the crawl behavior in spiders/backcountry.py. After spending some time on backcountry.com, I decided the all-brands landing page was the best starting URL.
  15. Part I: Crawl Setup

      import scrapy
      from scrapy.contrib.spiders import CrawlSpider  # in Scrapy >= 1.0: from scrapy.spiders import CrawlSpider

      class BackcountrySpider(CrawlSpider):
          name = 'backcountry'

          def __init__(self, *args, **kwargs):
              super(BackcountrySpider, self).__init__(*args, **kwargs)
              self.base_url = 'http://www.backcountry.com'
              self.start_urls = ['http://www.backcountry.com/Store/catalog/shopAllBrands.jsp']

          def parse_start_url(self, response):
              brands = response.xpath("//a[@class='qa-brand-link']/@href").extract()
              for brand in brands:
                  brand_url = str(self.base_url + brand)
                  self.log("Queued up: %s" % brand_url)
                  yield scrapy.Request(url=brand_url,
                                       callback=self.parse_brand_landing_pages)
  16. e.g. brand_url = http://www.backcountry.com/burton

      def parse_start_url(self, response):
          brands = response.xpath("//a[@class='qa-brand-link']/@href").extract()
          for brand in brands:
              brand_url = str(self.base_url + brand)
              self.log("Queued up: %s" % brand_url)
              yield scrapy.Request(url=brand_url,
                                   callback=self.parse_brand_landing_pages)
  17. def parse_brand_landing_pages(self, response):
          shop_all_pattern = "//a[@class='subcategory-link brand-plp-link qa-brand-plp-link']/@href"
          shop_all_link = response.xpath(shop_all_pattern).extract()

          if shop_all_link:
              all_product_url = str(self.base_url + shop_all_link[0])
              yield scrapy.Request(url=all_product_url,
                                   callback=self.parse_product_pages)
          else:
              yield scrapy.Request(url=response.url,
                                   callback=self.parse_product_pages)
  18. def parse_product_pages(self, response):
          product_page_pattern = "//a[contains(@class, 'qa-product-link')]/@href"
          pagination_pattern = "//li[@class='page-link page-number']/a/@href"

          product_pages = response.xpath(product_page_pattern).extract()
          more_pages = response.xpath(pagination_pattern).extract()

          # Paginate!
          for page in more_pages:
              next_page = str(self.base_url + page)
              yield scrapy.Request(url=next_page,
                                   callback=self.parse_product_pages)

          for product in product_pages:
              product_url = str(self.base_url + product)
              yield scrapy.Request(url=product_url,
                                   callback=self.parse_item)
  20. # Paginate!
      for page in more_pages:
          next_page = str(self.base_url + page)
          yield scrapy.Request(url=next_page,
                               callback=self.parse_product_pages)

      for product in product_pages:
          product_url = str(self.base_url + product)
          yield scrapy.Request(url=product_url,
                               callback=self.parse_item)
  21. Part II: Parsing

      def parse_item(self, response):
          item = Product()
          dirty_data = {}

          dirty_data['product_title'] = response.xpath("//*[@id='product-buy-box']/div/div[1]/h1/text()").extract()
          dirty_data['description'] = response.xpath("//div[@class='product-description']/text()").extract()
          dirty_data['price'] = response.xpath("//span[@itemprop='price']/text()").extract()

          for variable in dirty_data.keys():
              if dirty_data[variable]:
                  if variable == 'price':
                      item[variable] = float(''.join(dirty_data[variable]).strip().replace('$', '').replace(',', ''))
                  else:
                      item[variable] = ''.join(dirty_data[variable]).strip()

          yield item
  22. Part II: Clean it now!

      for variable in dirty_data.keys():
          if dirty_data[variable]:
              if variable == 'price':
                  item[variable] = float(''.join(dirty_data[variable]).strip().replace('$', '').replace(',', ''))
              else:
                  item[variable] = ''.join(dirty_data[variable]).strip()
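
      Cleaning inline like this works, but the pipelines.py that startproject generated is the other natural home for it: every yielded item passes through process_item. A sketch of moving the price coercion there (my arrangement, not from the slides):

          # pipelines.py
          class PriceCleaningPipeline(object):
              def process_item(self, item, spider):
                  price = item.get('price')
                  if isinstance(price, str):
                      # e.g. '$1,234.56' -> 1234.56
                      item['price'] = float(price.strip().replace('$', '').replace(',', ''))
                  return item

          # settings.py -- enable it:
          # ITEM_PIPELINES = {'prices.pipelines.PriceCleaningPipeline': 300}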
  23. scrapy crawl backcountry -o bc.json
  24. 2015-01-02 12:32:52-0800 [backcountry] INFO: Closing spider (finished)
      2015-01-02 12:32:52-0800 [backcountry] INFO: Stored json feed (38881 items) in: bc.json
      2015-01-02 12:32:52-0800 [backcountry] INFO: Dumping Scrapy stats:
      {'downloader/request_bytes': 33068379,
       'downloader/request_count': 41848,
       'downloader/request_method_count/GET': 41848,
       'downloader/response_bytes': 1715232531,
       'downloader/response_count': 41848,
       'downloader/response_status_count/200': 41835,
       'downloader/response_status_count/301': 9,
       'downloader/response_status_count/404': 4,
       'dupefilter/filtered': 12481,
       'finish_reason': 'finished',
       'finish_time': datetime.datetime(2015, 1, 2, 20, 32, 52, 524929),
       'item_scraped_count': 38881,
       'log_count/DEBUG': 81784,
       'log_count/ERROR': 23,
       'log_count/INFO': 26,
       'request_depth_max': 7,
       'response_received_count': 41839,
       'scheduler/dequeued': 41848,
       'scheduler/dequeued/memory': 41848,
       'scheduler/enqueued': 41848,
       'scheduler/enqueued/memory': 41848,
       'spider_exceptions/IndexError': 23,
       'start_time': datetime.datetime(2015, 1, 2, 20, 14, 16, 892071)}
      2015-01-02 12:32:52-0800 [backcountry] INFO: Spider closed (finished)
  25. {
        "review_count": 18,
        "product_id": "BAF0028",
        "brand": "Baffin",
        "product_url": "http://www.backcountry.com/baffin-cush-slipper-mens",
        "source": "backcountry",
        "inventory": {
          "BAF0028-ESP-S3XL": 27,
          "BAF0028-BKA-XL": 40,
          "BAF0028-NVA-XL": 5,
          "BAF0028-NVA-L": 7,
          "BAF0028-BKA-L": 17,
          "BAF0028-ESP-XXL": 12,
          "BAF0028-NVA-XXL": 6,
          "BAF0028-BKA-XXL": 44,
          "BAF0028-NVA-S3XL": 10,
          "BAF0028-ESP-L": 50,
          "BAF0028-ESP-XL": 52,
          "BAF0028-BKA-S3XL": 19
        },
        "price_high": 24.95,
        "price": 23.95,
        "description_short": "Cush Slipper - Men's",
        "price_low": 23.95,
        "review_score": 4
      }
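
      Note the record has grown well beyond the three-field Product item defined earlier. Either way, the -o exporter writes one top-level JSON array, so downstream analysis is a plain json.load away. A sketch:

          import json

          with open('bc.json') as f:
              products = json.load(f)  # one list, one dict per scraped item

          prices = [p['price'] for p in products if 'price' in p]
          print('%d products; min price: $%.2f' % (len(products), min(prices)))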
  26. prices/
          scrapy.cfg
          prices/
              __init__.py
              items.py
              pipelines.py
              settings.py
              spiders/
                  __init__.py
                  backcountry.py
                  evo.py
                  rei.py
                  ...
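
      Each spider in the project runs independently by name, and scrapy list enumerates them. Assuming the new files define spiders named 'evo' and 'rei':

          $ scrapy list
          backcountry
          evo
          rei
          $ scrapy crawl evo -o evo.json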
  27. “Data wrangling is a huge — and surprisingly so — part of the job. It’s something that is not appreciated by data civilians. At times, it feels like everything we do.”
      – Monica Rogati, VP of Data at Jawbone
  28. Resources
      • Code here!: https://github.com/erinshellman/backcountry-scraper
      • Lynn Root’s excellent end-to-end tutorial: http://newcoder.io/Intro-Scrape/
      • Web scraping - it’s your civic duty: http://pbpython.com/web-scraping-mn-budget.html
  29. Ladies!! Bring your projects to hacknight! http://www.meetup.com/Seattle-PyLadies
  30. Ladies!! Bring your projects to hacknight! http://www.meetup.com/Seattle-PyLadies
      Thursday, January 29th, 6 PM
      Intro to IPython and Matplotlib
      Ada Developers Academy, 1301 5th Avenue #1350
