Are we there yet?

An Open Data Metadata quality checker

Published in: Education, Technology
  • Hi rossdjones, I chose a brute-force approach for now: if the API takes too long and times out, I retry, up to a maximum of three times, when fetching data. It seems some queries take a while and require the database or engine to warm up, but that way I could reliably get all the data. Thank you for your advice!
  • I don't know if this would help you work around the API timing out, but at data.gov.uk we provide a data dump in JSON every week: http://data.gov.uk/data/dumps/. Perhaps this might make the analysis easier?
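
The retry approach described in the first comment could look roughly like the following Go sketch. It is not taken from the ogdat source: `fetchWithRetry`, `errTimeout`, and the pause between attempts are hypothetical names and choices made for illustration, and the simulated `fetch` function stands in for a real CKAN API call.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errTimeout is a hypothetical sentinel standing in for an API timeout.
var errTimeout = errors.New("request timed out")

// fetchWithRetry calls fetch up to maxAttempts times and sleeps for pause
// between failed attempts. It returns the first successful result, or the
// last error wrapped with the number of attempts made.
func fetchWithRetry(maxAttempts int, pause time.Duration, fetch func() ([]byte, error)) ([]byte, error) {
	if maxAttempts < 1 {
		maxAttempts = 1 // always try at least once
	}
	var lastErr error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		data, err := fetch()
		if err == nil {
			return data, nil
		}
		lastErr = err
		if attempt < maxAttempts {
			time.Sleep(pause) // give the backend a moment to warm up
		}
	}
	return nil, fmt.Errorf("giving up after %d attempts: %w", maxAttempts, lastErr)
}

func main() {
	calls := 0
	// Simulated slow API: times out twice, then answers.
	fetch := func() ([]byte, error) {
		calls++
		if calls < 3 {
			return nil, errTimeout
		}
		return []byte(`{"success": true}`), nil
	}
	data, err := fetchWithRetry(3, 10*time.Millisecond, fetch)
	fmt.Println(string(data), err, calls)
}
```

Capping the number of attempts keeps a permanently failing query from blocking the whole run, at the cost of occasionally giving up on a query that would have succeeded on a later try.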

  1. Are we there yet?
  2. What?
     An Open (Govt.) Data Monitoring Tool
     – Metadata Quality and Consistency
     – Benchmarking: Who fixed what and how fast?
     – Is the data still there?
  3. Why?
     ● Dangling URLs into Nirvana
       – Data is meant to stay
     ● (Meta-)Data is required to be consistent in order to be useful
     ● Tendency to give without monitoring
       – Decoupled Metadata from Data
       – Question of responsibility
  4. How?
     ● Watcher
       – Get all metadata from the CKAN data portal (legacy API calls)
       – Analyse metadata and URLs
       – Write results into a staging database (SQL)
       – Watch for new / changed datasets
     ● Analyser
       – Perform analysis on the staging area (partly long-running and tedious), write results into Redis
         ● Who has released the most data? EASY!
         ● Who uploaded which datasets, and when?
         ● Who fixed the most mistakes during the last week?
         ● Who has the longest-outstanding bugs?
         ● Which datasets are no longer available?
  5. How? ctd.
     ● Presentation
       – Make some fancy display from the Redis results
       – Data drill-down
       – What else?
  6. Architecture
     ● Heroku PaaS
     ● PostgreSQL data store
     ● Redis for ephemeral data
     ● Application logic in Go
     ● Front-end using Bootstrap & AngularJS
  7. What's there
     ● Metadata spec, machine readable: http://htmlpreview.github.io/?https://github.com/the42/ogdat/blob/master/ppogdatspec/ogdat_s (automated conversion process from PDF [sic!])
     ● Watcher: stable
     ● Analyser: work in progress
     ● Presentation layer: HELP
  8. Show me and I believe
     ● Uhm … nothing fancy yet
     ● Business logic & server processes
     ● Source: https://github.com/the42/ogdat/
  9. Lessons learned
     ● There are many (minor) issues with metadata
     ● Heroku is easy to get going
     ● Go as a novel language is easy to develop in
       – Built-in concurrency features come in handy when checking e.g. URLs in parallel
     ● CKAN API@data.gv.at is not that fast and times out
  10. Contact
      Johann Höchtl
      johann.hoechtl@gmail.com
      @myprivate42
      http://www.slideshare.net/jhoechtl/
      https://www.facebook.com/myprivate42
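
The slides mention that Go's built-in concurrency helps when checking URLs in parallel. A minimal sketch of that idea, assuming a fixed pool of worker goroutines fed through a channel, might look like this. It is not the ogdat implementation: `checkAll`, `checkURL`, `result`, and `errDangling` are hypothetical names, and the use of HEAD requests and the "status 400 or above means dangling" rule are assumptions. The example starts a local test server so it runs without network access.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync"
	"time"
)

// errDangling is a hypothetical sentinel for a URL that no longer resolves to data.
var errDangling = errors.New("dangling URL")

// result pairs a URL with the outcome of checking it.
type result struct {
	URL string
	Err error // nil means the resource answered with a status below 400
}

// checkURL issues a HEAD request and reports network failures as-is and
// HTTP statuses of 400 or above as errDangling.
func checkURL(client *http.Client, url string) error {
	resp, err := client.Head(url)
	if err != nil {
		return err
	}
	resp.Body.Close()
	if resp.StatusCode >= 400 {
		return fmt.Errorf("%w: status %s", errDangling, resp.Status)
	}
	return nil
}

// checkAll runs check on every URL using at most workers goroutines.
// Results are returned in the same order as the input slice.
func checkAll(urls []string, workers int, check func(string) error) []result {
	if workers < 1 {
		workers = 1
	}
	results := make([]result, len(urls))
	jobs := make(chan int)
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			// Each index is handed to exactly one worker, so the
			// writes to results never touch the same element twice.
			for i := range jobs {
				results[i] = result{URL: urls[i], Err: check(urls[i])}
			}
		}()
	}
	for i := range urls {
		jobs <- i
	}
	close(jobs)
	wg.Wait()
	return results
}

func main() {
	// Local test server so the example runs without network access.
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.URL.Path == "/gone" {
			http.NotFound(w, r)
			return
		}
		w.WriteHeader(http.StatusOK)
	}))
	defer srv.Close()

	client := &http.Client{Timeout: 5 * time.Second}
	urls := []string{srv.URL + "/data.csv", srv.URL + "/gone"}
	for _, r := range checkAll(urls, 4, func(u string) error { return checkURL(client, u) }) {
		fmt.Println(r.URL, r.Err)
	}
}
```

Bounding the number of workers keeps the checker from opening one connection per dataset at once, which matters when a portal lists thousands of resources on a few hosts.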