1. Formats for Open Data
François Bancilhon
twitter.com/fbancilhon
www.data-publica.com
Share-PSI Workshop
Brussels
May 10, 2011
2. Data Publica
● Develop the most complete and in-depth
knowledge of French electronic data. Provide a
complete directory of public data in France.
● Set up a DataStore where people can find
data provided by us (data hunting) and by
outside vendors (data reselling)
3. CAVEAT
● I strongly support the 10 principles of the
Sunlight Foundation
● There is a spectrum from bad to good; I
support improvement rather than rejection of
everything that is not perfect
● This work is derived from the recommendation of
the GFII (Groupement Français de l'Industrie de
l'Information)
4. Summary
● Open formats at the physical level
● Standard formats at the conceptual level
● Agreement on anonymization
● Providing source data with PDF
● Favoring XML
● Definition of exchange formats
5. Physical level
● At the physical level (text, image, video, etc.),
provide:
  ● an open format (a standard for which anyone can
  build tools)
  ● a format compatible with commonly used tools
6. Conceptual level
● For every vertical, define standards that take
into account the specificity of the area
● Standards to be elaborated by researchers,
users and industry representatives, at the
European level
● Examples: INSPIRE, ITS, XBRL, OAI
7. Anonymization
● Provide an operational definition of
anonymization
● Standards for it and operational qualification
● Devise ways to anonymize data while keeping
some of its meaning
● Need for a European standard and technology
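One possible reading of "anonymize while keeping some meaning" is generalization: drop direct identifiers and coarsen the remaining attributes until each combination is shared by several people. The sketch below is only an illustration of that idea, not a method taken from the GFII recommendation. The field names, the 10-year age bands, the 2-character postcode prefix, and the sample records are all invented for the example.

```python
from collections import Counter


def generalize(record):
    """Drop the name and coarsen quasi-identifiers (hypothetical scheme):
    exact age -> 10-year band, full postcode -> 2-character prefix."""
    low = (record["age"] // 10) * 10
    return {
        "age_band": f"{low}-{low + 9}",
        "area": record["postcode"][:2],
        "diagnosis": record["diagnosis"],
    }


def smallest_group(records):
    """Size of the smallest group sharing the same (age_band, area) pair.
    A larger value means each person is harder to single out."""
    counts = Counter((r["age_band"], r["area"]) for r in records)
    return min(counts.values())


# Invented sample data, for illustration only.
people = [
    {"name": "A", "age": 34, "postcode": "75011", "diagnosis": "flu"},
    {"name": "B", "age": 37, "postcode": "75020", "diagnosis": "asthma"},
    {"name": "C", "age": 52, "postcode": "69003", "diagnosis": "flu"},
    {"name": "D", "age": 58, "postcode": "69007", "diagnosis": "diabetes"},
]

anonymized = [generalize(p) for p in people]
k = smallest_group(anonymized)
print(anonymized[0], "smallest group size:", k)
```

The coarsened records still support questions such as "how many cases per area and age band", which is the kind of meaning worth preserving. An operational definition would have to say how large the smallest group must be, which is exactly what a standard could settle.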
8. Providing source data with PDF
● PDF is a good format for consumer display
● PDF is a bad format for re-use
● Most of the time PDF is produced from some
other source format
● Request that a PDF be provided together with its
source (not always that simple)
9. Pushing for XML
● Principle of improvement: a move to XML
by organizations that were publishing in
some other unfriendly format (e.g. PDF) is a
good thing
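To illustrate why XML counts as an improvement for re-use, here is a minimal sketch that reads a small XML document with Python's standard library and computes a total, something that would require error-prone text scraping if the same figures were only available in a PDF. The element names, attributes, and amounts are invented for the example and do not come from any real dataset.

```python
import xml.etree.ElementTree as ET

# Invented sample document; a real publisher would define its own schema.
xml_text = """<budget year="2011">
  <item department="Education" amount="1200"/>
  <item department="Transport" amount="800"/>
</budget>"""

root = ET.fromstring(xml_text)

# Each <item> becomes a (department, amount) pair without any text scraping.
rows = [
    (item.get("department"), int(item.get("amount")))
    for item in root.findall("item")
]
total = sum(amount for _, amount in rows)
print(rows, total)
```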
10. Define exchange formats
● Most open data formats reflect the use
that the public body makes of the data
internally
● Define instead an exchange format based on
transmission rather than on internal usage
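The difference between an internal format and an exchange format can be sketched as a small mapping step. This is a hedged illustration under invented assumptions: the internal column names (`LIB_STN`, `CD_ETAT`, `DT_MAJ`), the status code table, and the sample row are hypothetical, chosen only to show internal conventions being replaced by self-describing fields.

```python
import json
from datetime import datetime

# Hypothetical internal code table, meaningful only inside the organization.
INTERNAL_STATUS = {"01": "open", "02": "closed"}


def to_exchange(internal):
    """Map an internal row to a record meant for transmission:
    explicit field names, decoded values, ISO 8601 dates."""
    return {
        "station_name": internal["LIB_STN"].strip().title(),
        "status": INTERNAL_STATUS[internal["CD_ETAT"]],
        "updated": datetime.strptime(internal["DT_MAJ"], "%d/%m/%Y")
        .date()
        .isoformat(),
    }


# Invented internal row, for illustration only.
internal_row = {"LIB_STN": "GARE DU NORD  ", "CD_ETAT": "01", "DT_MAJ": "10/05/2011"}
exchange_row = to_exchange(internal_row)
print(json.dumps(exchange_row, indent=2))
```

A re-user of the exchange record needs no knowledge of the internal code table or date convention, which is the point of designing the format around transmission.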