Web Scraping 1-2-3 with Python + Scrapy (Summer BarCampHK 2012 version)
 

© All Rights Reserved


    Presentation Transcript

    • Web Scraping 1-2-3 with Python + Scrapy
      Sammy Fung
      sammy.hk, gownjob.com
    • Today's Agenda
      ● Some Cases
      ● Python and Scrapy
    • Web Scraping
      ● A computer software technique of extracting information from websites. (Wikipedia)
      ● For business, hobbies, research.......
      ● Not talking about business cases today.
    • CableTV & NOWTV Programme (Past)
      ● 2004.
      ● Slow, slow, slow, or worse - can't connect.
      ● Used Flash.
    • HK Observatory and Joint Typhoon Warning Center
      ● No easy data exchange format, e.g. RSS/Atom.
      ● We won't check websites every day.
    • Transportation - KMB, PTES
      ● No map view on the KMB website for a bus route in the past.
      ● Extremely poor, ugly (or much worse) map UI on PTES.
    • My experience with web scraping
      ● 2004: PHP
      ● year after: Python
      ● recent years: Python with Scrapy
    • Document Types
      ● HTML, XML,......
      ● Text
      ● Others, e.g. pictures, videos,......
    • Web Scraping
      ● Find the right URLs to scrape.
      ● Find the right content in the web pages.
      ● Save the data into a data store.
      ● When to run the web scraping program?
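The steps on this slide can be sketched with the Python standard library alone. This is a minimal illustration and not code from the talk: the sample HTML, the URL, the `TitleParser` class, and the `pages` table are all hypothetical, and the fetch step is replaced by an inline string so the sketch runs offline.

```python
import sqlite3
from html.parser import HTMLParser

# Hypothetical page content; a real scraper would download this from the URL.
SAMPLE_HTML = (
    "<html><head><title>Bus Route 1A</title></head>"
    "<body><p>Star Ferry</p></body></html>"
)


class TitleParser(HTMLParser):
    """Collects the text inside the <title> element."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data


def extract_title(html):
    """Step 2: pick the wanted content out of a page."""
    parser = TitleParser()
    parser.feed(html)
    return parser.title.strip()


def save_title(conn, url, title):
    """Step 3: save the extracted data into a data store."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, title TEXT)"
    )
    conn.execute("INSERT OR REPLACE INTO pages VALUES (?, ?)", (url, title))
    conn.commit()


# Step 1: the URL to scrape (hypothetical).
url = "http://example.com/route/1A"
conn = sqlite3.connect(":memory:")
save_title(conn, url, extract_title(SAMPLE_HTML))
rows = conn.execute("SELECT url, title FROM pages").fetchall()
print(rows)
```

The last question on the slide, when to run the program, is usually answered outside the program itself, for example by a scheduler such as cron.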
    • What is Scrapy?
      ● An open source web scraping framework for Python.
      ● Scrapy is a fast high-level screen scraping and web crawling framework, used to crawl websites and extract structured data from their pages. It can be used for a wide range of purposes, from data mining to monitoring and automated testing.
    • Features of Scrapy
      ● Define the data you want to scrape
      ● Write a spider to extract the data
      ● Built-in: selecting and extracting data from HTML and XML
      ● Built-in: JSON, CSV, XML output
      ● Interactive shell console
      ● Built-in: web service, telnet console, logging
      ● Others
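As a rough illustration of what JSON and CSV output of scraped items amounts to, the sketch below serialises a list of item dictionaries with the standard library. It is not Scrapy's exporter code, and the station codes and readings are made-up sample data shaped like the weather fields used later in the talk.

```python
import csv
import io
import json

# Hypothetical scraped items; the values are invented for illustration.
items = [
    {"station": "HKO", "temperature": 28, "humidity": 80},
    {"station": "KP", "temperature": 27, "humidity": 82},
]


def to_json(rows):
    """Serialise the items as one JSON array."""
    return json.dumps(rows, sort_keys=True)


def to_csv(rows):
    """Serialise the items as CSV with a header row."""
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf,
        fieldnames=["station", "temperature", "humidity"],
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


json_text = to_json(items)
csv_text = to_csv(items)
print(json_text)
print(csv_text)
```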
    • Installation of Scrapy
      ● pip
      ● APT repo
      ● RPM
      ● tarball (binary/source)
    • Create a new Scrapy project
        $ scrapy startproject mybot
        mybot/
        mybot/scrapy.cfg
        mybot/mybot/items.py
        mybot/mybot/pipelines.py
        mybot/mybot/settings.py
        mybot/mybot/spiders/myspider.py
        etc.......
    • items.py
        from scrapy.item import Item, Field

        class HKOCurrentItem(Item):
            time = Field()
            station = Field()
            temperature = Field()
            humidity = Field()
            #......
    • spiders/hko_spider.py (1/5)
        from scrapy.spider import BaseSpider
        from scrapy.selector import HtmlXPathSelector
        from weatherhk.items import HKOCurrentItem
        import datetime, re
    • spiders/hko_spider.py (2/5)
        class HKOCurrentSpider(BaseSpider):
            name = "HKOCurrentSpider"
            #allowed_domains = ["www.weather.gov.hk"]
            start_urls = [
                "http://www.weather.gov.hk/textonly/forecast/chinesewx.htm"
            ]
    • spiders/hko_spider.py (3/5)
            def parse(self, response):
                hxs = HtmlXPathSelector(response)
                stations = []
                # Get weather data for each station.
                tx = hxs.select("//pre[1]/text()").re('[^\n]*\n')
    • spiders/hko_spider.py (4/5)
                for i in tx:
                    if re.search(u'\d 度', i):
                        data = HKOCurrentItem()
                        # dt is not defined on these slides
                        data['time'] = int(dt)
                        data['station'] = self.station.code(i)
                        data['temperature'] = int(re.findall(u'\d+', i)[0])
                        stations.append(data)
    • spiders/hko_spider.py (5/5)
                return stations
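The line splitting and temperature extraction in `parse()` can be tried without Scrapy. The sketch below is a Python 3 approximation using only the `re` module. The sample text is hypothetical and may not match the real page layout, and `parse_temperatures` is a name introduced here for illustration.

```python
import re

# Hypothetical text in the style of a plain-text weather report;
# the real page layout may differ.
SAMPLE_TEXT = "天文台 28 度\n京士柏 27 度\n本港地區天氣預報\n"


def parse_temperatures(text):
    """Return (line, temperature) pairs for lines that contain a reading."""
    readings = []
    # Split into lines the same way as the spider: a run of non-newline
    # characters followed by a newline.
    for line in re.findall(r"[^\n]*\n", text):
        # Keep only lines with a digit, a space, and 度 (degrees).
        if re.search(r"\d 度", line):
            # The first run of digits on the line is taken as the temperature.
            temperature = int(re.findall(r"\d+", line)[0])
            readings.append((line.strip(), temperature))
    return readings


readings = parse_temperatures(SAMPLE_TEXT)
print(readings)
```

Taking the first run of digits is fragile if a station name itself contains a number, so a real spider may need a stricter pattern.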
    • pipelines.py (1/2)
        class HKOCurrentPipeline(object):
            def process_item(self, item, spider):
                station = self.db[item['station']]
                storeditem = dict(item.__dict__)['_values']
    • pipelines.py (2/2)
                try:
                    if 'temperature' in storeditem:
                        lasttime = station.find({'temperature': {'$gt': 0}}).sort('time', -1).limit(1)
                        if lasttime[0]['time'] != storeditem['time']:
                            id = self.insert(storeditem)
                # (the except clause is not shown on the slides)
                return item
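The idea in this pipeline, storing a reading only when its time differs from the latest one already stored for that station, can be sketched without a database. In the sketch below, `InMemoryStore` is a hypothetical stand-in for the per-station collections, the sample readings are invented, and the `temperature > 0` filter from the slide's query is left out for brevity. Storing the item when no earlier reading exists is an assumption, since the slides do not show that case.

```python
class InMemoryStore:
    """Hypothetical stand-in for the per-station database collections."""

    def __init__(self):
        self.readings = {}  # station code -> list of stored reading dicts

    def latest(self, station):
        """Return the stored reading with the greatest time, or None."""
        stored = self.readings.get(station, [])
        return max(stored, key=lambda r: r["time"]) if stored else None

    def insert(self, station, reading):
        self.readings.setdefault(station, []).append(reading)


def process_item(store, item):
    """Store the item unless the latest stored reading has the same time."""
    if "temperature" in item:
        last = store.latest(item["station"])
        # Assumption: an empty history means the reading should be stored.
        if last is None or last["time"] != item["time"]:
            store.insert(item["station"], dict(item))
    return item


store = InMemoryStore()
process_item(store, {"station": "HKO", "time": 1200, "temperature": 28})
# Same time as the latest stored reading, so this one is skipped.
process_item(store, {"station": "HKO", "time": 1200, "temperature": 28})
process_item(store, {"station": "HKO", "time": 1300, "temperature": 29})
stored_times = [r["time"] for r in store.readings["HKO"]]
print(stored_times)
```

Returning the item in every case lets any later pipeline stage still receive it, whether or not it was stored.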