Hidden data treasures. How can we make more government data available for re-use without sacrificing the privacy of citizens?
Presentation, University of Bergen, Jan 12, 2010
How to contact me: Harald Groven, web developer (actually a database guy), [email_address], twitter.com/kongharald
Public data: 150 years of no development? This is how public data was presented more than a century ago. Source: Eilert Sundt, Om dødeligheden i Norge, 1855
How the same kind of data set is presented 150 years later... Any progress (except a change of font)?
Mashup of the household registers of 1865, 1886 and 2005. Red dots = 2005; map = Friis 1861. New technology makes possible visualizations and analyses that were not imaginable when the data was collected.
Recommended reading on the cultural and democratic impact of statistics on the public sphere: Sarah Igo, The Averaged American: Surveys, Citizens, and the Making of a Mass Public. Harvard UP, 2007. Usefulness or uselessness of aggregates? In the rest of the talk I will argue as if aggregates are mostly useless. In fact they are not: they are a good starting point.
Usefulness of disaggregation
Key concepts of data warehousing:
 - Rollup = aggregate up one level
 - Drill down = disaggregate down one level
 - Slice = change variable
Visual example (thanks, Obama & Vivek!): public spending in the US
Text example (thanks to NSD, Bergen): % of students not passing exams, 2005-09
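The three operations above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the record layout, the `aggregate` helper and every figure in `RECORDS` are invented for the example and do not come from the NSD or US spending data shown in the talk.

```python
from collections import defaultdict

# Hypothetical records: (county, municipality, year, students_failed).
# All figures are invented purely for illustration.
RECORDS = [
    ("Hordaland", "Bergen", 2008, 120),
    ("Hordaland", "Bergen", 2009, 110),
    ("Hordaland", "Voss", 2008, 15),
    ("Hordaland", "Voss", 2009, 12),
    ("Rogaland", "Stavanger", 2008, 80),
    ("Rogaland", "Stavanger", 2009, 85),
]
DIMENSIONS = ("county", "municipality", "year")


def aggregate(records, keep):
    """Sum the measure (last column), grouped by the dimensions in `keep`."""
    idx = [DIMENSIONS.index(name) for name in keep]
    totals = defaultdict(int)
    for row in records:
        key = tuple(row[i] for i in idx)
        totals[key] += row[-1]
    return dict(totals)


# Drill down: from county level to municipality level.
by_county = aggregate(RECORDS, ["county"])
by_municipality = aggregate(RECORDS, ["county", "municipality"])

# Roll up: from county level to a single national total.
national = aggregate(RECORDS, [])

# Slice, in the sense used on the slide: look at another variable.
by_year = aggregate(RECORDS, ["year"])

print(by_county)
print(by_municipality)
print(national)
print(by_year)
```

None of these operations is possible if only the top-level total has been published; each needs the disaggregated rows as input.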
Practical reason for not publishing disaggregated source data in the pre-computer age: space and cost! Image: CC Harald Groven
Finding the needle in the haystack of unaggregated data: the data warehouse. Data cube, visualized in 3D
What kinds of statistical data are published?
 - High-level aggregates: accessible
 - Medium-level aggregates (e.g. municipality level): sometimes accessible
 - Low-level aggregates, untraceable to identifiable persons: a "grey zone", accessible but largely unknown
 - Anonymized raw data: inaccessible for 100 years, in some cases available for research
 - Raw data, unanonymized: inaccessible for 100 years
Creating a data.gov.no? Require government agencies to publish anonymised data:
 - Statistician's method: 1% or 10% samples?
 - CS method: randomize variables so that the data set keeps the same statistical aggregates
 - Easy method: publish data sets for each value, with a threshold value (e.g. 3 persons) to avoid tracing identities
 - Use categories, not uncoded values
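A minimal sketch of three of the methods listed above (sampling, thresholds and categories), assuming raw rows of municipality and exact age. The function names, the 10-year age bands and the rows in `ROWS` are hypothetical choices made for this example; only the threshold of 3 persons is taken from the slide.

```python
import random
from collections import Counter

THRESHOLD = 3  # smallest publishable cell size (the slide's "e.g. 3 persons")


def age_band(age, width=10):
    """Replace an exact age with a category such as '30-39'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def publishable_counts(rows, threshold=THRESHOLD):
    """Count persons per (municipality, age band).

    Cells with fewer than `threshold` persons are published as None
    (suppressed) instead of their true count.
    """
    counts = Counter((muni, age_band(age)) for muni, age in rows)
    return {cell: (n if n >= threshold else None) for cell, n in counts.items()}


def draw_sample(rows, fraction, seed=0):
    """Simple random sample without replacement, e.g. fraction=0.10 for 10%."""
    rng = random.Random(seed)
    k = max(1, round(len(rows) * fraction))
    return rng.sample(rows, k)


# Hypothetical raw rows: (municipality, exact age). Invented for illustration.
ROWS = [
    ("Bergen", 31), ("Bergen", 34), ("Bergen", 38), ("Bergen", 52),
    ("Voss", 33), ("Voss", 36), ("Voss", 37), ("Voss", 71),
]

table = publishable_counts(ROWS)
for cell, n in sorted(table.items()):
    print(cell, "suppressed" if n is None else n)

half = draw_sample(ROWS, 0.5)
print(len(half), "of", len(ROWS), "rows sampled")
```

Suppressing small cells is not sufficient on its own: if row or column totals are published alongside the table, a suppressed count can sometimes be recovered by subtraction, so totals need the same care.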
Government resources for making use of public data? Image: CC Macwagon