Creating Open Data with Open Source (beta2)



  1. Creating Open Data with Open Source
     Sammy Fung, sammy.hk
     [ITFest.HK] Seminar of Free / Open Source in Hong Kong, April 2013.
  2. Agenda
     ● What is Open Data?
     ● Use of Open Source Software in web crawling.
     ● Starting new Open Source projects to create Open Data.
  3. Sammy Fung
     ● Software developer using open source.
       – Perl → PHP → Python.
       – Data mining / web crawling.
       – Also deploying OpenStack cloud and Linux solutions.
     ● Open source community leader.
       – opensource.hk, HKLUG, GNOME Asia committee, Mozilla Rep, and program committee member of COSCUP, the largest open source conference in Taiwan.
     ● Blogger at sammy.hk.
  4. Open Data
     Three Laws of Open Government Data, by David Eaves:
     1. If it can't be spidered or indexed, it doesn't exist.
     2. If it isn't available in an open and machine-readable format, it can't engage.
     3. If a legal framework doesn't allow it to be repurposed, it doesn't empower.
     http://eaves.ca/2009/09/30/three-law-of-open-government-data/
  5. Open Data
     ● Tim Berners-Lee, the inventor of the Web.
       – 5stardata.info
       – 5-star deployment scheme for Open Data.
  6-10. * One Star to ***** Five Star - Open Data
        *      Make your stuff available on the Web (whatever format) under an open license.
        **     Make it available as structured data (e.g., Excel instead of an image scan of a table).
        ***    Use non-proprietary formats (e.g., CSV instead of Excel).
        ****   Use URIs to denote things, so that people can point at your stuff.
        *****  Link your data to other data to provide context.
        5stardata.info by Tim Berners-Lee, the inventor of the Web.
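To make the step from two stars to three stars concrete, here is a minimal sketch that serializes a small table as CSV with Python's standard csv module. The rows, the column names and the to_csv helper are invented for illustration; they only echo the kind of weather data discussed in this talk.

```python
import csv
import io

# Hypothetical rows, for illustration only (not real measurements).
rows = [
    {"station": "hko", "temperature": 17, "humidity": 80},
    {"station": "kingspark", "temperature": 16, "humidity": 75},
]


def to_csv(records):
    """Serialize a list of dicts as CSV text, a structured and non-proprietary format."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=["station", "temperature", "humidity"],
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


csv_text = to_csv(rows)
print(csv_text)
```

Any spreadsheet or program can read the result back, for example with csv.DictReader, without needing a proprietary file format.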
  11. Open Data from HK Government?
      ● 2 use cases of data:
        – Legco meeting minutes and voting results.
        – Weather at Data.One.
  12. Legco Meeting Minutes and Voting Results
  13. Legco Meeting Minutes and Voting Results
  14. Legco Meeting Minutes and Voting Results
      ● All Legco voting results are scanned and released as PDF, so voting results can only be retrieved manually.
      ● In recent years, minutes scanned from paper sheets seem to have been replaced by minutes converted from the original computer document files.
  15. Improving Legco Vote Result Data?
      ● Legcovotes.net was created by Hong Kong netizens(?).
      ● Only 20 well-known vote results are included.
      ● The public could be allowed to enter other vote results by hand; submissions should be verified by legcovotes.net.
      ● Other data could be included, e.g. minutes in plain text, or paragraphs related to a councillor.
  16. Weather at Data.One
      ● My Chinese blog post 「香港政府機構開放資料 Open Data 情況」 ("The state of Open Data at Hong Kong government agencies"), 2013/1/17.
      ● Data.One was released on 2011/3/31.
      ● Weather at Data.One provides 7 dataset URLs, which return RSS (XML) in English, Traditional Chinese and Simplified Chinese.
        – One word: useless.
        – The Data.One dataset (RSS) is completely different from HKO's own paid service (XML).
  17. Weather at Data.One
      ● Example: the current local weather report.
      ● It is a plain-text report wrapped in RSS.
      ● The only difference is how the report content is quoted:
        – Website: a pair of HTML tags, e.g. <PRE>....</PRE>.
        – Data.One: a pair of RSS description tags, <description>....</description>.
      ● Other weather data is missing, e.g. the regional temperature updates issued every 12 minutes.
  18. Weather at Data.One
      ● Weather at Data.One is a report, not data.
      ● The weather RSS was already released by HKO before the launch of Data.One.
      ● Technically, JSON/XML formats are easier for computer programs to read.
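The difference between a report and data can be sketched in a few lines of Python. The RSS fragment and the JSON record below are invented for illustration; they are not copied from Data.One or HKO, and the wording of the report text is an assumption. The point is only that a value buried in free text needs a hand-written pattern, while a value in a structured record can be addressed by key.

```python
import json
import re
import xml.etree.ElementTree as ET

# Made-up RSS fragment: the whole report is one block of plain text
# inside <description>, as described for the Data.One feed.
RSS = """<rss version="2.0"><channel><item>
<title>Current Weather Report</title>
<description>At 4 a.m. at the Hong Kong Observatory: Air temperature: 17 degrees Celsius. Relative humidity: 80 per cent.</description>
</item></channel></rss>"""

# The same facts as a made-up JSON record.
JSON_RECORD = '{"station": "hko", "temperature": 17, "humidity": 80}'


def temperature_from_rss(rss_text):
    """Parse the XML, then scrape the number out of the free text with a regex."""
    description = ET.fromstring(rss_text).findtext("./channel/item/description")
    match = re.search(r"Air temperature\s*:\s*(-?\d+)", description or "")
    return int(match.group(1)) if match else None


def temperature_from_json(json_text):
    """The value is addressable by key; no text scraping is needed."""
    return json.loads(json_text)["temperature"]


rss_value = temperature_from_rss(RSS)
json_value = temperature_from_json(JSON_RECORD)
print(rss_value, json_value)
```

If the provider rewords the report, the regular expression silently stops matching, whereas the JSON lookup keeps working as long as the key is stable.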
  19. Overseas Open Data Project Examples
      ● Toronto:
        – City data: http://map.toronto.ca/wellbeing/
        – Transportation: http://www.rocketradar.net/
        – Pollution: http://www.emitter.ca/
      ● US & Canada:
        – https://www.crimereports.com/
  20. Use of Open Source Software in Web Crawling
      ● Use open source tools to collect useful and meaningful machine-readable data.
      ● There is no need to wait for the provider to release data in a machine-readable format.
  21. Open Source Tools
      ● Python programming language
      ● with its regular expression library
      ● Scrapy web crawling framework
  22. Why Python + Scrapy?
      ● Python: my favourite programming language for the past few years.
      ● Scrapy: a web crawling framework written in Python.
  23. Scrapy
      ● Scrapy: a web crawling framework written in Python.
      ● HtmlXPathSelector
      ● Output: built-in JSON, CSV, XML.
      ● Python: import re
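The crawl-then-export idea can be tried without installing anything. The sketch below is not Scrapy code: it uses only the standard re and json modules on an invented HTML snippet, standing in for what a spider's selector and JSON exporter would do. The snippet, the station names, the line pattern and the extract_readings helper are all assumptions made for illustration.

```python
import json
import re

# Invented page: a plain-text report wrapped in <pre>, one reading per line.
HTML = """<html><body><pre>
Air temperatures at other places were:
King's Park          16 degrees ;
Wong Chuk Hang       17 degrees ;
Ta Kwu Ling          16 degrees .
</pre></body></html>"""

# A station name, then two or more spaces, then an integer and "degrees".
LINE = re.compile(
    r"^(?P<station>[A-Za-z' ]+?)[ ]{2,}(?P<temp>-?\d+) degrees",
    re.MULTILINE,
)


def extract_readings(html):
    """Return a list of {station, temperature} dicts found inside <pre>."""
    pre = re.search(r"<pre>(.*?)</pre>", html, re.DOTALL | re.IGNORECASE)
    if pre is None:
        return []
    return [
        {"station": m.group("station"), "temperature": int(m.group("temp"))}
        for m in LINE.finditer(pre.group(1))
    ]


readings = extract_readings(HTML)
print(json.dumps(readings, indent=2))
```

In a real Scrapy project the selector would locate the report element and the built-in exporters would write the JSON, CSV or XML file; the regular expression step stays much the same.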
  24. My Products
      ● WeatherHK ← ← ←
      ● TCTrack
  25. WeatherHK
      ● http://twitter.com/weatherhk
      ● Hourly current weather report
      ● Weather forecast report
      ● Tropical cyclone signal warnings
  26. WeatherHK
      ● Backend: Python + Scrapy + Database + Twitter + NNTP......
      ● Frontend: Twitter + Newsgroup
  27. WeatherHK
      ● http://twitter.com/weatherhk
      ● Interviewed by MetroPop in 2009.
  28. My Products
      ● WeatherHK
      ● TCTrack ← ← ←
  29. TCTrack
      ● http://sammy.hk/projects/tctrack/tctrack.php
      ● Plots current and forecast tropical cyclone (TC) tracks over Google Maps.
      ● Sources:
        – JTWC
        – HKO
  30. TCTrack
      ● http://sammy.hk/projects/tctrack/tctrack.php
      ● Probably the first TC track map in HK using Google Maps.
      ● Use of Google Maps: TCTrack -> Weather Underground Hong Kong -> HKO
  31. TCTrack
      ● http://twitter.com/tctrack
      ● Tweets JTWC updates for the Northwest Pacific.
  32. Starting new Open Source projects to create Open Data
      ● Develop an open source project.
      ● Release data in a standard machine-readable data format.
  33. Open Source Project Examples
      ● hk0weather
      ● My weather-related open source project.
  34. hk0weather
      ● https://github.com/sammyfung/hk0weather
      ● Open Source Hong Kong Weather Project.
      ● Converts HKO webpages to JSON data.
      ● Python + Scrapy
      ● 1st version: from the current weather report, extracts temperature and humidity for 20+ weather stations and exports them in JSON format.
  35. hk0weather
      ● https://github.com/sammyfung/hk0weather
      ● $ virtualenv hk0weatherenv
      ● $ source hk0weatherenv/bin/activate
      ● $ pip install scrapy
      ● $ git clone https://github.com/sammyfung/hk0weather.git
      ● $ cd hk0weather
      ● $ scrapy crawl currwx -t json -o testresult
  36. hk0weather
      [{"humidity": 80, "station": "hko", "temperture": 17, "time": 1360785720},
       {"station": "kingspark", "temperture": 16, "time": 1360785720},
       {"station": "wongchukhang", "temperture": 17, "time": 1360785720},
       {"station": "takwuling", "temperture": 16, "time": 1360785720},
       {"station": "laufaushan", "temperture": 15, "time": 1360785720},
       {"station": "taipo", "temperture": 16, "time": 1360785720},
       {"station": "shatin", "temperture": 17, "time": 1360785720},
       {"station": "tuenmun", "temperture": 17, "time": 1360785720},
       {"station": "tseungkwano", "temperture": 16, "time": 1360785720},
       {"station": "saikung", "temperture": 16, "time": 1360785720},
       {"station": "cheungchau", "temperture": 17, "time": 1360785720},
       {"station": "cheungchau", "temperture": 17, "time": 1360785720},
       {"station": "tsingyi", "temperture": 17, "time": 1360785720},
       {"station": "shekkong", "temperture": 15, "time": 1360785720},
       {"station": "tsuenwanhokoon", "temperture": 15, "time": 1360785720},
       {"station": "tsuenwanshingmunvalley", "temperture": 17, "time": 1360785720},
       {"station": "hongkongpark", "temperture": 17, "time": 1360785720},
       {"station": "shaukeiwan", "temperture": 16, "time": 1360785720},
       {"station": "kowlooncity", "temperture": 16, "time": 1360785720},
       {"station": "happyvalley", "temperture": 18, "time": 1360785720},
       {"station": "wongtaisin", "temperture": 17, "time": 1360785720},
       {"station": "stanley", "temperture": 16, "time": 1360785720},
       {"station": "kwuntong", "temperture": 15, "time": 1360785720},
       {"station": "shamshuipo", "temperture": 17, "time": 1360785720}]
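Once the data is JSON, consuming it takes only the standard library. The sketch below loads the first three records of the output shown above (field names kept exactly as spelled there) and summarizes them. It assumes that "time" is seconds since the Unix epoch, which the output does not state; the summarize helper is a hypothetical name, not part of hk0weather.

```python
import json
from datetime import datetime, timedelta, timezone

# First three records of the hk0weather output, copied as-is.
SAMPLE = """[
 {"humidity": 80, "station": "hko", "temperture": 17, "time": 1360785720},
 {"station": "kingspark", "temperture": 16, "time": 1360785720},
 {"station": "wongchukhang", "temperture": 17, "time": 1360785720}
]"""

HKT = timezone(timedelta(hours=8))  # Hong Kong Time is UTC+8


def summarize(json_text):
    """Count stations, average the temperature, and decode the timestamp."""
    records = json.loads(json_text)
    temps = [r["temperture"] for r in records]
    # Assumption: "time" is a Unix timestamp in seconds.
    observed = datetime.fromtimestamp(records[0]["time"], tz=HKT)
    return {
        "stations": len(records),
        "mean_temperature": sum(temps) / len(temps),
        "with_humidity": [r["station"] for r in records if "humidity" in r],
        "observed_at": observed.isoformat(),
    }


summary = summarize(SAMPLE)
print(summary)
```

Note that only some records carry a humidity field, so a consumer should treat it as optional.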
  37. Items.py
      from scrapy.item import Item, Field

      class Hk0WeatherItem(Item):
          time = Field()
          station = Field()
          temperture = Field()
          humidity = Field()
  38. Currwx.py
      start_urls = (
          'http://www.weather.gov.hk/wxinfo/currwx/currentc.htm',
      )
  39. Currwx.py
      def parse(self, response):
          laststation = temperture = int()
          stations = []
          hxs = HtmlXPathSelector(response)
          report = hxs.select('//div[@id="ming"]')
  40. libhk0
      class hk0:
          stations = [
              (u'天文台', 'hko'),
              (u'京士柏', 'kingspark'),
              (u'黃竹坑', 'wongchukhang'),
              (u'打鼓嶺', 'takwuling'),
              (u'流浮山', 'laufaushan'),
  41. libhk0
      class hk0:
          def gettime(self, report):
              …
          def hk0current(self, report):
              …
  42. hk0weather
      ● Future plans:
        – Add more weather reports.
        – Get ideas from and/or cooperate with pro weather hobbyists.
      ● Remarks:
        – Development of hk0weather started from ZERO; its code is different from that of my Twitter account @weatherhk.
  43. Challenge
      ● A challenge on the first day of the hk0weather release.
      ● The director of a mobile app development company told me in a Facebook comment:
        – HKO provides data in a pretty XML format under its annual service plan for commercial companies.
        – He thinks that ***MAYBE*** HKO would provide the XML to me ***without*** any charge if I asked.
      ● Remark: this is an assumption only; it is not listed on the HKO website.
  44. Challenge
      ● I replied to him with the following after googling for the HKO XML schema:
        – HKO doesn't mention a free-of-charge XML data feed service on its website.
        – I registered and got authorization from HKO to redistribute their weather information for non-profit purposes. I have received some emails from HKO about updates to the website and its HTML structure, but they never mentioned an XML data feed service.
        – The weather data available on the HKO XML data feed is still less than what is on its HTML website.
      ● So, this challenge is a FAIL! XD
  45. Open Data Project Examples
      ● Open Government initiative from HKU JMSC.
      ● http://opengov.jmsc.hku.hk/
      ● https://github.com/jmschku
  46. Agenda
      ● What is Open Data?
      ● Use of Open Source Software in web crawling.
      ● Starting new Open Source projects to create Open Data.
  47. Thank You!
      sammy.hk
