Toward a Global Infrastructure for Data and Metadata: The Open Data Foundation
Arofan Gregory, Executive Manager, The Open Data Foundation
Something Really Amazing
Spaceships aren’t that amazing…
Aliens aren’t that amazing…
Mobile telephones aren’t that amazing…
Tasers aren’t that amazing…
These devices have access to the complete set of human (well, Federation) knowledge, via ship’s computer - That’s AMAZING!
An Epic Feat of Data Standardization!
A Big Idea
It might seem too outrageous to imagine that every data source could be accessible and usable via a global network, but…
Consider all the domain “grids” which are emerging
Consider the number of modern technologies for leveraging data across networks
Consider the tools we have for solving problems of semantic interoperability
Maybe Star Trek was only a few decades ahead of its time!
Something Missing…
Technology alone cannot solve this problem
For centuries, scientists, librarians, and archivists have worked to perfect taxonomies and classifications for organizing and accessing human knowledge
Technologists can’t replace the disciplines which have evolved from this work with technology alone
They can only automate it
Having an ontology doesn’t mean you have an agreed, tried, and workable standard classification system!
A thousand little ontologies still produce chaos!
Why Now?
The idea of a global data infrastructure is practical today because…
We have good, standards-based, networked technology
We have a highly sophisticated population of archivists and librarians who understand the challenges of large-scale classification, for all types of media
We have an emerging culture of data producers and users who are beginning to understand the potential offered by modern technology
The Open Data Movement
From Wikipedia:
“Open Data is a philosophy and practice requiring that certain data are freely available to everyone, without restrictions from copyright, patents or other mechanisms of control. It has a similar ethos to a number of other "Open" movements and communities such as Open Source and Open Access.”
The Open Data Foundation (ODaF)
Although we respect this traditional goal of the Open Data movement, we feel that the technology issues, as opposed to the legal ones, have a different focus:
Much public data is inaccessible or unusable
Confidential data is less accessible than it could be
The collection and publication of some critical data is lacking, notably in the Developing World
It is not enough to put the rights to data into the public domain – it must also be practically accessible to all potential users
What Do We Mean by “Data”?
Official statistics collected by government agencies and international organizations
Usually aggregates and time-series data
Covers a huge range of social, scientific, and economic topics
Numeric research data supporting social sciences and hard sciences
Often lower-level “microdata”
May be gathered by survey or sourced from registers
Qualitative data used in social sciences research
Not research papers, but source data (e.g., interviews)
ODaF’s Mission
To bring together individuals from the statistics community, the research community, and the technology standards community
To promote the creation of a global infrastructure for data and metadata by providing open-source tools and supporting the adoption of a coordinated set of open technology standards
To promote the creation and use of knowledge, and fact-based decision-making, through improved access to data and metadata
ODaF - Timeline
The idea started at IASSIST 2006 in Edinburgh
Incorporated in 2006 as a US scientific non-profit
First face-to-face meeting in Washington DC in December 2006 at the National Opinion Research Center (NORC)
September 2007: next face-to-face meeting in St. Helena, California
Next face-to-face meeting: NORC in DC, December 2007, followed by a European meeting (UK, Netherlands, or Germany) in early 2008
NOTE: We are a virtual organization – we don’t rely on face-to-face meetings for conducting work (Thanks, Skype!)
ODaF - Directors
Bob Glushko – head of the UC Berkeley Center for Document Engineering and member of OASIS Board of Directors
Julia Lane – Vice President, NORC and world-class expert in data confidentiality issues
Ernie Boyko – former President of IASSIST
Rune Gloersen – head of IT at Statistics Norway
ODaF - Executive Managers
Arofan Gregory – background in SGML/XML, technology standards (notably ebXML, UBL, UN/CEFACT, ISO TC154, DDI, and SDMX)
Pascal Heus - lead developer for World Bank and International Household Survey Network, much experience with field-work in Africa, DDI implementor
Chris Nelson – veteran OMGer (CWM), worked with many technology standards (UN/EDIFACT, GESMES, ebXML, SDMX, DDI), consummate UML modeler
Jostein Ryssevik – former CEO of Nesstar North America, now with Ideas2Evidence, associated with Gallup Europe; longtime DDI implementor
ODaF - Advisors
Sandra Cannon - Board of Governors of the Federal Reserve System
Gilles Collette - Visual Communications, Pan-American Health Organization (WHO)
Daniel Gillman - US Bureau of Labor Statistics
Eduardo Gutentag – Chair, OASIS Board of Directors
Paul Johanis - Statistics Canada
Graeme Oakley - Australian Bureau of Statistics
Dr. Andrew Nelson - Joint Research Centre of the European Commission
ODaF – Advisors (cont.)
Ken Miller - UK Data Archive / Economic and Social Data Service
Duane Nickull - Chair, OASIS SOA Reference Architecture TC
Juraj Riecan - United Nations Economic Commission for Europe (UNECE)
Gerard Salou - European Central Bank
Professor Bo Sundgren, Ph.D - Statistics Sweden
Wendy Thomas - Minnesota Population Center, University of Minnesota
Wendy Watkins - Data Centre Coordinator, Maps, Data and Government Information Centre, Carleton University Library
ODaF - Organization
We are project-oriented:
Any member can participate in projects
May be paid consultants for specific work, or volunteers
Project proposal is put before Directors by Management team in consultation with Board of Advisors for approval
Work is conducted by specified project team, using specified resources
All Directors, Managers, and Advisors are volunteers
Work is focused on coordination of projects, with resources coming from other participating organizations
The Problem Space
The flows of data can be seen as forming a type of “supply chain”
Collected data are aggregated and reported/disseminated to other organizations
The points where data are exchanged can be problematic:
Loss of metadata
No automated integration into receiving systems
Time- and resource-intensive
This exchange of data and metadata must be managed in an efficient, standard fashion if we are to build a global infrastructure
[Diagram: the information chain across 180+ countries. Labels: Individual Households; Banks, Corporates; National Statistical Organisations; Regional Organisations; International Organisations. Flows between them: transactions, accounts, statistics. Access via Internet, Search, Navigation (www.x.org, www.y.org, www.z.org, www.hub.org).]
Data Lifecycle Model
Within each level of the information chain, we see a process:
Data sourcing or collection
Data processing (re-coding, harmonization, aggregation)
Data dissemination and archiving
Data reporting and re-purposing
Throughout this cycle, each step generates important metadata which can be captured to provide better downstream processing and understanding of the data
Today, this metadata is often lost
Between steps of the lifecycle
When the final data product is exchanged in the information chain
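The point about capturing metadata at every step can be illustrated with a minimal Python sketch. It is not taken from the presentation: the `Dataset` container, the `run_step` helper, and the sample values are all hypothetical, and real lifecycle metadata (in DDI, for example) is far richer. The sketch only shows the idea that each step hands on both the data and a growing record of what was done to it.

```python
from dataclasses import dataclass, field


@dataclass
class Dataset:
    """Hypothetical container: data values plus the metadata gathered so far."""
    values: list
    metadata: list = field(default_factory=list)


def run_step(dataset, step_name, transform, note):
    """Apply one lifecycle step and append a metadata record describing it."""
    new_values = transform(dataset.values)
    new_metadata = dataset.metadata + [{"step": step_name, "note": note}]
    return Dataset(new_values, new_metadata)


def run_lifecycle(raw_values):
    """Run collection, processing, and dissemination, keeping metadata throughout."""
    ds = Dataset(list(raw_values))
    ds = run_step(ds, "collection", lambda v: v, "gathered by survey")
    ds = run_step(ds, "processing", lambda v: [sum(v) / len(v)], "aggregated to a mean")
    ds = run_step(ds, "dissemination", lambda v: v, "published as an aggregate")
    return ds


result = run_lifecycle([10, 20, 30])
for record in result.metadata:
    print(record["step"], "-", record["note"])
```

A downstream user who receives `result` can see that the published figure is a mean of survey responses, which is the kind of information that is lost when only the final number is exchanged.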
[Diagram: Data Lifecycle Model]
An Observation on Organizations
Governmental, supra-governmental, and research organizations which produce data have as a primary mission the collection of data
To support policy making
To support research
To support regulatory activities
They do not have a primary mission to focus on the exchange of data with other organizations
This is often perceived as a burden rather than a part of the primary mission of the organization
They are often not well versed in the latest technology for data exchange and interoperability
Standards organizations tend to be too busy promoting their own standards to be worried about how users might combine them with other standards in implementations
Issues
Issues with public data:
Public data which is not released: “Users won't understand it” – too little metadata!
Public data which is unusable: formats are bad, too little metadata about formats, terminology, methodology, coding, and concepts
Public data which cannot be accessed because its location/existence is not known
Public data which loses value because it cannot be published and accessed in a timely manner
Issues (cont.)
Issues with confidential data:
Public data sets derived from confidential data have been damaged by anonymization
Confidential data which are not seen because access produces unacceptable disclosure risk
There are secure “Research Data Centers” for allowing access to confidential data to qualified researchers
These are not as accessible or as open as they could be, due to their physical nature and the fact that they generally are not in communication with each other
Better metadata management and shared metadata lead to a better understanding of disclosure risk, and thus improved access for researchers
Note on Data Confidentiality
You might think proponents of Open Data would disapprove of confidential data
Response rates are falling for all types of survey data collection due to fears of disclosure
There are many new ways of collecting data about individuals (RFID chips, security cameras, cell phones, etc.)
The standards for data confidentiality are there for a good reason – to protect individuals!
We believe that confidential data should be as open as possible, but no more!
Issues (cont.)
Issues with data in the Developing World:
Absent data due to inefficient or nonexistent data collection/publication
Unsustainable data collection/publication produces insufficient continuity of data
Once educated, IT workers get jobs in Europe and America
Funding is typically not on-going, but only for a limited period
The vast majority of the world’s population is in the Developing World, and that share is growing
To understand our world and make good policy, we must support sustainable data collection and publication about this huge segment of the population!
How Can We Solve These Problems?
Many of these issues can be solved with modern technology
Better documentation using standard metadata formats
Better mechanisms for data discovery and access between organizations of all types
Better mechanisms for managing semantic interoperability
Free or inexpensive tools for metadata capture and data/metadata exchange
Improved mechanisms for sustainable collection and publication of data in the Developing World
ODaF’s Vision
A network of standard, federated registries provides the ability to discover data and metadata globally
Standard data and metadata formats and models provide the basis for automated use and integration between applications
Standard semantic registries and mappings to standard classifications/ontologies allow for semantic interoperability
All of these standards would be coordinated to work together predictably in an open architecture
Domains are self-governing – each has its own registries, classifications, etc. There must be minimum governance at the center for operation of the entire network.
Interoperability through mapping to the standards-based open architecture
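As a rough illustration of this vision, the Python sketch below shows self-governing domain registries that each keep their own local terms but map them to a shared set of concepts, so that one query can be run across all of them. Everything in it is an assumption made for the example: the `Registry` class, the `federated_search` function, the concept names, and the entry identifiers are hypothetical and do not correspond to any SDMX or ebXML registry interface.

```python
class Registry:
    """Hypothetical domain registry: local entries plus a mapping to shared concepts."""

    def __init__(self, domain, entries, concept_map):
        self.domain = domain
        self.entries = entries          # list of {"id": ..., "term": local term}
        self.concept_map = concept_map  # local term -> shared standard concept

    def search(self, standard_concept):
        """Return entries whose local term maps to the requested shared concept."""
        return [
            {"domain": self.domain, "id": entry["id"]}
            for entry in self.entries
            if self.concept_map.get(entry["term"]) == standard_concept
        ]


def federated_search(registries, standard_concept):
    """Query every registry in turn and merge the hits (the minimal 'centre')."""
    hits = []
    for registry in registries:
        hits.extend(registry.search(standard_concept))
    return hits


labour = Registry(
    "labour-statistics",
    [{"id": "lfs-2006", "term": "jobless rate"}],
    {"jobless rate": "UNEMPLOYMENT_RATE"},
)
banking = Registry(
    "central-bank",
    [{"id": "macro-q4", "term": "unemp"}, {"id": "rates-q4", "term": "policy rate"}],
    {"unemp": "UNEMPLOYMENT_RATE", "policy rate": "INTEREST_RATE"},
)

hits = federated_search([labour, banking], "UNEMPLOYMENT_RATE")
print(hits)
```

Each domain governs its own vocabulary and mapping; the only thing shared at the centre is the list of registries and the agreed concept names.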
Which Standards?
ISO 17369 Statistical Data and Metadata Exchange (SDMX)
Data Documentation Initiative (DDI)
ISO/IEC 11179 Metadata Registries
ISO 19115 Digital Geographic Data
Metadata Encoding and Transmission Standard (METS)
Extensible Business Reporting Language (XBRL)
Many others (SOA, ebXML, Web Services, Semantic Web, Dublin Core)
ISO 17369 SDMX
Produced by official statistics organizations (BIS, ECB, Eurostat, IMF, OECD, World Bank, UN/SD)
Now available as a 2.0 version
Supports all aggregate data & time-series
Supports all types of metadata (structural & “reference” metadata)
Provides standard registry interfaces for data sourcing and exchange (not specific to SDMX formats)
Based on a formal meta-model (similar to OMG’s Common Warehouse Metamodel, but more focused)
Data and metadata formats and classifications are completely configurable
Also provides recommendations for concepts, codes, and classifications for official statistics
Data Documentation Initiative (DDI)
Produced by a consortium of members (data archives and libraries, national statistical organizations, universities, etc.)
Now in 3.0 candidate version which supports full data lifecycle (release Q1 2008)
Fine-grained metadata for describing:
Data collection (surveys, registers, etc.)
Data processing (for recodes, harmonization, data comparison)
Data archiving and dissemination
Data can be stored inline or in native file formats
Supports microdata and n-dimensional cubes
Aligned with SDMX, ISO/IEC 11179, METS, ISO 19115, and Dublin Core
ISO/IEC 11179 Metadata Registries
Model for managing semantics of a data dictionary and the lifecycle of concepts/terms
There is a separate ISO specification under development for providing bindings in XML, C, and other languages
In widespread use in many other standards, as well as for terminology management within large organizations
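A much-simplified Python sketch of the registry idea follows. It assumes the commonly described ISO/IEC 11179 pattern in which a data element pairs a data element concept (an object class plus a property) with a value domain. The class names, fields, and the four-step status list are hypothetical simplifications for illustration; the standard defines its own, more detailed registration statuses and attributes.

```python
from dataclasses import dataclass

# Simplified, hypothetical status sequence (not the standard's own list).
STATUSES = ["candidate", "recorded", "standard", "retired"]


@dataclass(frozen=True)
class DataElementConcept:
    object_class: str   # the thing being described, e.g. "Person"
    property_name: str  # the characteristic of interest, e.g. "Age"


@dataclass(frozen=True)
class ValueDomain:
    datatype: str       # how values are represented, e.g. "integer"
    unit: str           # unit of measure, e.g. "years"


@dataclass
class DataElement:
    concept: DataElementConcept
    value_domain: ValueDomain
    status: str = "candidate"

    def advance(self):
        """Move to the next lifecycle status, stopping at the last one."""
        position = STATUSES.index(self.status)
        if position < len(STATUSES) - 1:
            self.status = STATUSES[position + 1]
        return self.status


age = DataElement(DataElementConcept("Person", "Age"), ValueDomain("integer", "years"))
age.advance()
print(age.concept.object_class, age.concept.property_name, age.status)
```

Separating the concept from its representation is what lets two organizations agree that they both record "Person Age" even when one stores whole years and the other stores age bands.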
ISO 19115 Digital Geographic Data
Provides the standard metadata model for describing geographies
Implemented in several XML standards, including DDI (there is also a standard ISO XML)
Well-accepted within the technology community and among communities of use (geographers, etc.)
METS
A packaging standard for digital libraries/archives
Pulls together associated sets of files and establishes their relationship to one another
Can carry metadata payloads in their native XML namespaces as “metadata sections”
Cooperatively developed with DDI
METS left the description of data to DDI
DDI supports METS for archival packaging
XBRL
XML standard from the accounting world for describing business reports
Widely used by banking supervisory organizations
Major source of financial statistics
Well marketed and widely supported
Ongoing alignment project with SDMX
[Diagram: ODaF Vision - Standards. Labels: Federated Registries (based on SDMX, ebXML, web services); Aggregated Data/Metadata (SDMX); XBRL Business Reports; DDI Microdata Sets; ISO 19115 Geographies; Dublin Core Citations; Standard classifications; ISO 11179 Semantic definitions; METS Packaging. Connector labels: "Used in", "registered", "References to source data", "Organized using".]
ODaF Activities
We are early in our efforts to create such an infrastructure
To establish a sufficient set of well-aligned standards
To build open-source tools to support the use of these standards
To otherwise support the adoption and use of standard models, formats, and registries
ODaF Projects
Standards Alignment Project : on-going effort to establish an agreed mapping between the mentioned standards
SDMX Registry Hosting : Host SDMX registries on our own servers for those wishing to do prototype implementations
DDI Development Support : provide hosting and infrastructure to support the use and development of DDI 3.0
DDI Foundation Tools Program : providing technical coordination and infrastructure for a multi-institution effort to build an Eclipse-based open-source toolkit for working with DDI 3.0, including transforms to/from SAS, SPSS, and STATA
SDMX Browser : Developing an open-source tool (using Adobe Flex) for collecting, updating, and viewing statistical data in SDMX format – working in informal collaboration with ECB and OECD
ODaF Projects (cont.)
DeXtris Browser : beta end-user tool for viewing and searching DDI 1/2.* and 3.0 metadata files – supports version transformations
UKDA QuDEX Draft Standard : Working as technical support for UKDA in their development of a standard for qualitative metadata (may become part of DDI)
Canadian RDC Network : Providing technical advice to the Canadian RDC network on metadata management and implementation in support of DDI 3.0.
NORC Virtual Data Enclave : Working to help develop and deploy the first “virtual” RDC in the US with data from NIST, others
Also involved in proposals to build a European “virtual” RDC
ODaF Projects (cont.)
Have contributed to the creation of training materials and online support for DDI 3.0, for general use
White papers: DDI & SDMX (a comparison), guidelines for open-source tools development, others
Member, DDI Alliance
Sponsored IASSIST 2007 in Montreal (planned also for IASSIST 2008 in Palo Alto, CA)
ODaF - Where We Are Today
New organization, lots of interest and support thus far
Interesting projects are emerging, some early deliverables have been finished
Looking for participation from interested, serious individuals
Still at the stage of supporting and promoting a coordinated set of standards
To Learn More…
ODaF: www.opendatafoundation.org
SDMX: www.sdmx.org
DDI: www.ddialliance.org
ISO/IEC 11179: http://metadata-stds.org/11179/
METS: http://www.loc.gov/standards/mets/
ISO 19115: http://www.iso.org/iso/iso_catalogue/catalogue_tc/catalogue_detail.htm?csnumber=26020
XBRL: http://www.xbrl.org/Home/
Tools and Training
For some free SDMX tools, an implementation support site, and SDMX and DDI training courses: www.metadatatechnology.com