Five Star Open Data
1. Make your stuff available on the Web (whatever format)
under an open license.
2. Make it available as structured data (e.g., Excel instead of
image scan of a table).
3. Use non-proprietary formats (e.g., CSV instead of Excel).
4. Use URIs to denote things, so that people can point at your
stuff.
5. Link your data to other data to provide context.
Source: 5stardata.info; the scheme was proposed by Tim Berners-Lee, the inventor of the Web.
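As a rough illustration of stars 3 and 4, the sketch below writes a few readings to CSV (a non-proprietary format) and gives each station a URI. The readings, the field names and the example.org URIs are all made up for this example.

```python
import csv
import io

# Made-up readings; the example.org URIs are placeholders, not real identifiers.
READINGS = [
    {"uri": "http://example.org/station/a", "station": "Station A",
     "temperature": 25, "humidity": 80},
    {"uri": "http://example.org/station/b", "station": "Station B",
     "temperature": 27, "humidity": 74},
]


def to_csv(rows):
    """Serialize a list of dicts to CSV text, one row per reading."""
    fieldnames = ["uri", "station", "temperature", "humidity"]
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    print(to_csv(READINGS))
```

Reaching the fifth star would additionally mean linking each station URI to related data published elsewhere.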
– Led by the OGCIO of the Hong Kong Government.
– Uses the term “public sector information” (PSI)
instead of “open data”.
– Much of the data is not available in a machine-readable
format with a useful data structure.
– A lot of data still requires web scraping with
customized data extraction to collect useful
Creating a Scrapy Project
● Define your data structure
● Write your first spider
– Test with scrapy shell console
● Output / Store collected data
– Output with built-in supported formats
– Store to database / object store.
Define your data structure
reporttime = Field()
station = Field()
temperature = Field()
humidity = Field()
Write your first spider
● Import a class of your own data structure.
– $ scrapy genspider -t basic <YOUR SPIDER NAME> <DOMAIN>
– $ scrapy list
● Import any Scrapy classes you require.
– e.g. Spider, XPath Selector
● Implement the parse() method in your Spider subclass.
Test with scrapy shell console
– $ scrapy shell <URL>
Output / Store collected data
● Use built-in JSON, CSV, XML output at
– $ scrapy crawl <Spider Name> -t json -o <Output
– Import a class of your own data structure.
– Implement the process_item() method.
– Add the pipeline to ITEM_PIPELINES in settings.
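A pipeline is a plain Python class with a process_item() method. The sketch below stores items in SQLite using only the standard library; the class name, table name and column layout are assumptions chosen to match the item fields shown earlier in the slides.

```python
import sqlite3


class SQLitePipeline:
    """Stores each scraped item as one row in a SQLite table (illustrative)."""

    def __init__(self, path=":memory:"):
        # In-memory database by default; pass a file path to persist data.
        self.path = path
        self.conn = None

    def open_spider(self, spider):
        # Called once when the spider starts.
        self.conn = sqlite3.connect(self.path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS weather "
            "(reporttime TEXT, station TEXT, temperature REAL, humidity REAL)"
        )

    def process_item(self, item, spider):
        # Called for every item the spider yields.
        self.conn.execute(
            "INSERT INTO weather VALUES (?, ?, ?, ?)",
            (
                item.get("reporttime"),
                item.get("station"),
                item.get("temperature"),
                item.get("humidity"),
            ),
        )
        self.conn.commit()
        return item  # hand the item on to the next pipeline

    def close_spider(self, spider):
        # Called once when the spider finishes.
        self.conn.close()
```

To enable it, the class path is registered in the project settings, for example ITEM_PIPELINES = {"myproject.pipelines.SQLitePipeline": 300}, where myproject is a hypothetical project name and 300 is the ordering priority.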
Scrapy + Django
● Define the Django environment in the Scrapy settings.
– Load the Django configuration.
● Use the Scrapy DjangoItem class
– Instead of the Item and Field classes
– Define which Django data model should be linked
● Query and insert data in Scrapy pipelines.
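A configuration sketch of the steps above, assuming a recent setup where DjangoItem comes from the separate scrapy-djangoitem package (older Scrapy releases shipped it as scrapy.contrib.djangoitem). The project path, settings module, app name and model name are all hypothetical, and the fragment only works inside a real Scrapy project with a configured Django project next to it.

```python
# --- settings.py of the Scrapy project (fragment) ---
import os
import sys

import django

# Hypothetical location and settings module of the Django project.
sys.path.append("/path/to/django_project")
os.environ.setdefault("DJANGO_SETTINGS_MODULE", "django_project.settings")
django.setup()  # needed on Django 1.7+ before models can be imported

# --- items.py of the Scrapy project (fragment) ---
from scrapy_djangoitem import DjangoItem

from weather.models import WeatherData  # hypothetical Django app and model


class WeatherItem(DjangoItem):
    # Item fields are derived from the linked Django model.
    django_model = WeatherData
```

In a pipeline, calling item.save() on such an item creates and stores the corresponding model instance.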
● Weather Data Project.
– convert weather information to JSON data from
– python + scrapy + django
● Hong Kong Weather Data.
– 20+ HKO weather stations in Hong Kong.
– Regional weather data.
– Rainfall data.
– Weather forecast report.
● Set up and activate a Python virtual environment,
and install Scrapy and Django with pip.
● Clone hk0weather from GitHub
– $ git clone https://github.com/sammyfung/hk0weather.git
● Set up the database connection in Django and create the
database, tables and first Django user.
● Scrape regional weather data
– $ scrapy crawl regionalwx -t json -o regional.json
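Scrapy's JSON exporter writes a single JSON array of item objects, so the output can be read back with the standard json module. A small sketch follows; the sample records and their field names are assumptions, and the real fields in regional.json may differ.

```python
import json

# Stand-in for the contents of regional.json; values and field names are made up.
SAMPLE = """[
  {"station": "Station A", "temperature": 25, "humidity": 80},
  {"station": "Station B", "temperature": 27, "humidity": 74}
]"""


def load_records(text):
    """Parse the exported JSON array into a list of dicts."""
    return json.loads(text)


def warmest_station(records):
    """Return the record with the highest temperature reading."""
    return max(records, key=lambda record: record["temperature"])


if __name__ == "__main__":
    records = load_records(SAMPLE)
    print(warmest_station(records)["station"])
```

For a real run, SAMPLE would be replaced by the text read from the regional.json file.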