Web Scraping 1-2-3 with Python + Scrapy (Summer BarCampHK 2012 version)


Published in: Technology

  1. Web Scraping 1-2-3 with Python + Scrapy
     Sammy Fung
  2. Today's Agenda
     ● Some cases
     ● Python and Scrapy
  3. Web Scraping
     ● A computer software technique for extracting information from websites. (Wikipedia)
     ● Used for business, hobbies, research, ...
     ● Business cases are NOT covered today.
  4. CableTV & NOWTV Programme (Past)
     ● 2004.
     ● Slow, slow, slow, or worse: can't connect.
     ● Uses Flash.
  5. HK Observatory and Joint Typhoon Warning Center
     ● No easy data exchange format, e.g. RSS/Atom.
     ● We won't check websites every day.
  6. Transportation - KMB, PTES
     ● In the past, the KMB website had no map view for a bus route.
     ● Extremely poor, ugly (or much worse) map UI on PTES.
  7. My experience with web scraping
     ● 2004: PHP
     ● The year after: Python
     ● Recent years: Python with Scrapy
  8. Document Types
     ● HTML, XML, ...
     ● Text
     ● Others, e.g. pictures, videos, ...
  9. Web Scraping
     ● Find the right URLs to scrape.
     ● Find the right content in the webpages.
     ● Save the data into a data store.
     ● When should the web scraping program run?
  10. What is Scrapy?
      ● An open source web scraping framework for Python.
      ● Scrapy is a fast high-level screen scraping and web crawling framework, used to crawl websites and extract structured data from their pages. It can be used for a wide range of purposes, from data mining to monitoring and automated testing.
  11. Features of Scrapy
      ● Define the data you want to scrape.
      ● Write a spider to extract the data.
      ● Built-in: selecting and extracting data from HTML and XML.
      ● Built-in: JSON, CSV, XML output.
      ● Interactive shell console.
      ● Built-in: web service, telnet console, logging.
      ● Others.
  12. Installation of Scrapy
      ● pip
      ● APT repo
      ● RPM
      ● tarball (binary/source)
  13. Create a new Scrapy project
      $ scrapy startproject mybot
      mybot/scrapy.cfg
      mybot/mybot/items.py
      mybot/mybot/pipelines.py
      mybot/mybot/settings.py
      mybot/mybot/spiders/myspider.py
      etc.
  14. items.py
      from scrapy.item import Item, Field
      class HKOCurrentItem(Item):
          time = Field()
          station = Field()
          temperature = Field()
          humidity = Field()
          # ......
  15. spiders/ (1/5)
      from scrapy.spider import BaseSpider
      from scrapy.selector import HtmlXPathSelector
      from weatherhk.items import HKOCurrentItem
      import datetime, re
  16. spiders/ (2/5)
      class HKOCurrentSpider(BaseSpider):
          name = "HKOCurrentSpider"
          #allowed_domains = [""]
          start_urls = [ "" ]
  17. spiders/ (3/5)
          def parse(self, response):
              hxs = HtmlXPathSelector(response)
              stations = []
              # Getting weather data from each station.
              tx = hxs.select("//pre[1]/text()").re(r"[^\n]*\n")
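  18. spiders/ (4/5)
              for i in tx:
                  # Keep only lines containing "度" (degrees). The call name is
                  # garbled in the transcript; re.search is assumed here.
                  if re.search(u"度", i):
                      data = HKOCurrentItem()
                      data["time"] = int(dt)  # dt is not defined on the slides shown
                      data["station"] = self.station.code(i)
                      data["temperature"] = int(re.findall(r"\d+", i)[0])
                      stations.append(data)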
  19. spiders/ (5/5)
              return stations
  20. (1/2)
      class HKOCurrentPipeline(object):
          def process_item(self, item, spider):
              station = self.db[item["station"]]
              storeditem = dict(item.__dict__)["_values"]
  21. (2/2)
              try:
                  if "temperature" in storeditem:
                      lasttime = station.find({"temperature": {"$gt": 0}}).sort("time", -1).limit(1)
                      if lasttime[0]["time"] != storeditem["time"]:
                          id = self.insert(storeditem)
              # The matching except clause is not shown on the slide.
              return item
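
The parsing and de-duplication steps on the last slides can be tried out without Scrapy or a database. What follows is a minimal sketch using only the Python standard library, not the code from the talk. The sample text, the station names, the helper names `parse_readings` and `store_if_new`, and the in-memory dict standing in for the database collection are all assumptions made for illustration.

```python
import re

# Hypothetical sample of the plain text found inside the page's <pre> block.
SAMPLE_PRE_TEXT = (
    "Station A  28 度\n"
    "Station B  N/A\n"
    "Station C  31 度\n"
)


def parse_readings(pre_text, timestamp):
    """Return one dict per line that reports a temperature in 度 (degrees)."""
    readings = []
    for line in pre_text.splitlines():
        # Station name, whitespace, digits, then the 度 marker.
        match = re.search(r"^(.*?)\s+(\d+)\s*度", line)
        if match is None:
            continue  # this line carries no temperature reading
        readings.append({
            "time": timestamp,
            "station": match.group(1).strip(),
            "temperature": int(match.group(2)),
        })
    return readings


def store_if_new(store, reading):
    """Append the reading unless the station's latest entry has the same time.

    `store` is a plain dict mapping station name to a list of readings; it is
    a stand-in for a per-station database collection.
    """
    history = store.setdefault(reading["station"], [])
    if history and history[-1]["time"] == reading["time"]:
        return False  # already stored for this timestamp
    history.append(reading)
    return True


if __name__ == "__main__":
    store = {}
    for reading in parse_readings(SAMPLE_PRE_TEXT, 201208041500):
        print(reading, store_if_new(store, reading))
```

Running the parse step twice with the same timestamp leaves the store unchanged the second time, which mirrors the time comparison the pipeline slide performs before inserting.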