NISO Forum, Denver, Sept. 24, 2012: Opening Keynote: The Many and the One: BCE themes in 21st century data curation
 

Opening Keynote: The Many and the One: BCE themes in 21st century data curation
Allen Renear, Professor and Interim Dean, Graduate School of Library and Information Science, University of Illinois at Urbana-Champaign

Two scientists can be using "the same data" even though the computer files involved appear to be quite different. This is familiar enough and, for the most part, in small communities with shared practices and familiar datasets, raises few problems. But these informal understandings do not scale to 21st century data curation. To get full value from cyberinfrastructure we must support huge quantities of heterogeneous data developed by diverse communities and used by diverse communities, often with widely varying methods, tools, and purposes. To accomplish this, our informal practices and understandings must be replaced, or at least supplemented, by a shared framework of standard terminology for describing complex cascades of representational levels and relationships. Fundamental problems in data curation, and in particular problems involving provenance, identifiers, and data citation, cannot be fully resolved without such a framework. Although the deepest problems here have ancient origins, useful practical measures are now within reach. Some recent work toward this end, carried out at the Center for Informatics Research in Science and Scholarship (CIRSS) at the Graduate School of Library and Information Science, University of Illinois at Urbana-Champaign, will be described.

  • I’ll open with some cries from the heart; bear with me while this … You can find others … And more succinctly …

Presentation Transcript

  • The Many and the One: BCE problems in 21st c. data curation
    Tracking it Back to the Source: Managing and Citing Research Data
    NISO Forum, Denver, Sept 24, 2012
    Allen H. Renear, Graduate School of Library and Information Science, University of Illinois at Urbana-Champaign
    Principal researchers of material presented: David Dubin, Karen M. Wickett, Simone Sacchi, Richard Urban, Allen H. Renear
    Center for Informatics Research in Science and Scholarship, Graduate School of Library and Information Science, University of Illinois at Urbana-Champaign
    NSF/OCI-ITR DataNet Award #0830976; IMLS/LB Award #RE-05-08-0062-08
  • Problems, Problems, Problems
    Identity problems: Is this the data we think it is? Is it the same data as that data? (involves issues of authenticity, integrity, encoding)
    Meaning problems: What is this data supposed to be telling us? (involves interpreting the semantics of the data)
    Relationship problems: How is this data related to that data? (involves issues of data provenance)
    Integration problems: How can I combine this data with other data? (involves harmonizing conflicts at multiple levels)
    Interoperation problems: How can I get this data to work with my software? (involves conversion to equivalent formats)
    An issue underlying all these is representation: how do digital files represent facts about the world?
  • Identity Problems
    Two scientists, Jill and John, used the same data. What does that mean? And how can we tell?
  • Identity Problems
    Compare: Two scientists, Jill and John, used the same statistician.
  • Identity Problems
    Compare: Two scientists, Jill and John, used the same centrifuge.
  • Identity and Representation Levels
    Consider two files
    … with the same data, but relational tables in one case and RDF triples in another
    … with the same data and the same RDF triples, but an XML serialization in one case, an N3 serialization in another
    … with the same data, the same RDF triples, the same N3 serialization, but UTF-8 character encoding in one case and UTF-16 encoding in another
    How many levels do we need? How do we define and manage them? How can they be identified and re-identified? Which identifier schemes for which level?
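The slide's point about levels can be made concrete with a small sketch. Everything here is an illustrative assumption rather than material from the talk: the triples are toy values, and the two serializers are simple stand-ins for real N3 and XML serializations. The sketch compares files at the bit, character, and content levels and shows that "the same" gets a different answer at each level.

```python
import json

# Hypothetical toy content: two observations as (subject, property, value) triples.
TRIPLES = {("sample_31", "U22376", "408"), ("sample_31", "X59417", "1784")}

def to_lines(triples):
    """Toy line-based syntax, standing in for one serialization (e.g. N3)."""
    return "\n".join(" ".join(t) for t in sorted(triples))

def from_lines(text):
    """Parse the toy line-based syntax back into a set of triples."""
    return {tuple(line.split(" ")) for line in text.splitlines()}

def to_json(triples):
    """Toy JSON syntax, standing in for a second serialization (e.g. XML)."""
    return json.dumps(sorted(triples))

def from_json(text):
    """Parse the toy JSON syntax back into a set of triples."""
    return {tuple(t) for t in json.loads(text)}

# Three "files": same content throughout, differing in syntax or encoding.
lines_utf8 = to_lines(TRIPLES).encode("utf-8")
lines_utf16 = to_lines(TRIPLES).encode("utf-16")
json_utf8 = to_json(TRIPLES).encode("utf-8")

# Same serialization, different character encoding.
same_bits = lines_utf8 == lines_utf16
same_chars = lines_utf8.decode("utf-8") == lines_utf16.decode("utf-16")

# Same encoding, different serialization.
same_syntax = lines_utf8.decode("utf-8") == json_utf8.decode("utf-8")
same_content = (from_lines(lines_utf8.decode("utf-8"))
                == from_json(json_utf8.decode("utf-8")))

print(same_bits, same_chars, same_syntax, same_content)
```

A byte-level checksum would report all three files as different, while a content-level comparison reports them as identical, which is why the choice of level matters for any identifier scheme.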
  • What is a dataset anyway?!
    Maybe we should ask a scientist. They’ll have an answer, right?
  • There are almost as many answers as scientists
  • Cries from the heart
    “the terms ‘Data Product’, ‘Data Set,’ and ‘Version’ are overlaid with multiple meanings between communities.” (Barkstrom, 2009)
    “There is ambiguity in what type of object a dataset is; with different groups of users applying different connotations. There needs to be an explicit statement of what the intended preservation of a dataset will imply.” (Pepler, 2008)
  • Forcing us to conclude…
    No single object can possibly have all those attributes.
    Therefore it is impossible to give the common colloquial notion of dataset a precise definition.
    It must instead be replaced by a family of new, more specific concepts.
    Sound familiar?
  • FRBR
  • A FRBR-inspired solution
    FRBR eliminates the ordinary “book” from our world.
    The ordinary “book” can be simultaneously about chordata, in French, typeset in neo-Bauhaus, mustard-stained,
    but FRBR replaces the book with four objects:
    the work is about chordata,
    the expression is in French,
    the manifestation is typeset in neo-Bauhaus,
    the item is mustard-stained.
  • FRBR entities and attributes
    Work: “an … intellectual or artistic creation”
    Expression: “the … realization of a work … notation … etc.”
    Manifestation: “the physical embodiment of an expression of a work”
    Item: “a single exemplar of a manifestation”
    Attribute assignments are characteristically disjoint:
    A work may have a subject. It does not have a language, typeface, or condition.
    An expression may have a language. It does not have a subject (or a typeface or a condition).
    A manifestation may have a typeface. It does not have a subject or a language (or a condition).
    An item may have a condition. It does not have a subject, language, or typeface.
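The disjoint attribute assignment can be sketched as four separate types, each carrying only its own attribute. This is a minimal illustration, not a FRBR implementation: the class and field names are assumptions, and each entity links to exactly one entity at the next level up, which is a simplification of the real FRBR relationships.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Work:
    subject: str                  # only a work has a subject

@dataclass(frozen=True)
class Expression:
    work: Work                    # the work this expression realizes
    language: str                 # only an expression has a language

@dataclass(frozen=True)
class Manifestation:
    expression: Expression        # the expression this manifestation embodies
    typeface: str                 # only a manifestation has a typeface

@dataclass(frozen=True)
class Item:
    manifestation: Manifestation  # the manifestation this item exemplifies
    condition: str                # only an item has a condition

# The "book" from the slide, split into four objects.
work = Work(subject="chordata")
expression = Expression(work=work, language="French")
manifestation = Manifestation(expression=expression, typeface="neo-Bauhaus")
item = Item(manifestation=manifestation, condition="mustard-stained")

# The item itself has no subject; the subject is reached by walking up the chain.
print(hasattr(item, "subject"))
print(item.manifestation.expression.work.subject)
```

Asking "what is the book's language?" then becomes a question about which of the four objects is meant, which is the move the talk proposes for "dataset".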
  • Entities? Really?
    Aren’t some of those rectangles just nominalized relationships?
  • Ambiguities
    Is
    <object name="sample_31"> <feature name="U22376" value="408" /> <feature name="X59417" value="1784" />
    an expression?
    Is “00001011” an expression?
  • FRBR Refactored
    [Diagram labels: Story; M:M; Symbol Structure; M:M; Symbol Structure; M:M; Matter & Energy]
  • FRBR refactored and applied to datasets (all relationships M:M)
    C1: observations [Semantic level]
    expressed by… S1: RDF triples
    encoded by… S2: N3 statements
    encoded by… S3: Unicode characters
    encoded by… S4: UTF-8 bit streams
    inscribed in… M1: RAID array state [Instantiation level]
    (S1 to S4 are labeled on the slide as the syntax level and encoding levels.)
    Based on the Systematic Assertion Model (SAM) for modeling datasets, developed by David Dubin et al.
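One way to see why the "encoded by" relationships are many-to-many is to look at the step between Unicode characters and bit streams. The sketch below is an illustration using Python's built-in codecs, not part of SAM itself, and the sample character is an arbitrary choice: one character sequence can be encoded by several bit streams, and one bit stream can be read as several character sequences.

```python
# One character sequence, many bit streams.
chars = "\u00e9"                         # the single character e-acute
as_utf8 = chars.encode("utf-8")          # two bytes: C3 A9
as_latin1 = chars.encode("latin-1")      # one byte:  E9

# One bit stream, many character sequences.
bits = b"\xc3\xa9"
read_as_utf8 = bits.decode("utf-8")      # one character: e-acute
read_as_latin1 = bits.decode("latin-1")  # two characters: A-tilde, copyright sign

print(as_utf8 != as_latin1)              # same characters, different bits
print(read_as_utf8 != read_as_latin1)    # same bits, different characters
```

Because neither direction is a function, an identifier for a bit stream does not by itself identify the characters, and so cannot by itself identify the content further up the cascade.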
  • Identifiers
    What do we identify with identifiers?
    An entity?
    Content
    Symbol structures
    Patterned matter and energy
    A nominalized relationship?
    How do we confirm identification?
  • Identification
    How do we identify an expression?
    How do we identify an encoding?
    How do we identify the data?
    On the practical side we do this every day.
    On the theoretical side it is very difficult to usefully formalize.
  • Identity and change problems in Planets
    From the Planets Conceptual Data Model, Sharpe et al. (2006)
  • Identity and change problems in Planets
    A file is a bitstream.
    A file can be modified.
    But a bitstream cannot be modified.
    Credits to Dave Dubin, Simone Sacchi, Karen Wickett. Data Concepts Group, Data Conservancy (NSF/OCI-ITR DataNet Award #0830976)
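The tension among these three statements shows up directly in code. In the sketch below (the file name, contents, and use of SHA-256 as a content-derived identifier are all assumptions for illustration), "modifying the file" leaves the original bitstream untouched and simply associates the same path with a different bitstream.

```python
import hashlib
import os
import tempfile

def bitstream_id(data: bytes) -> str:
    """Content-derived identifier: differs whenever any bit differs."""
    return hashlib.sha256(data).hexdigest()

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "dataset.csv")  # hypothetical file name

    # Create the file and capture its bitstream as an immutable bytes value.
    with open(path, "wb") as f:
        f.write(b"id,value\n1,408\n")
    with open(path, "rb") as f:
        before = f.read()

    # "Modify the file": same path, appended contents.
    with open(path, "ab") as f:
        f.write(b"2,1784\n")
    with open(path, "rb") as f:
        after = f.read()

# The earlier bitstream still exists unchanged; the path now names another one.
print(before == b"id,value\n1,408\n")
print(bitstream_id(before) == bitstream_id(after))
```

So if a file can be modified and a bitstream cannot, a file is better modeled as something that has a bitstream at a given time than as something that is one.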
  • Center for Informatics Research in Science and Scholarship (CIRSS)
    Graduate School of Library and Information Science, University of Illinois at Urbana-Champaign
    Director: Carole L. Palmer
    Associate Director: Cathy Blake
    c. 12 affiliated GSLIS faculty; 8 PhD students.
    CIRSS research groups:
    Data Practices: social science of information work
    Socio-Technical Data Analytics: algorithms + people
    *Data Concepts: modeling for integration/computation
    Professional Education:
    Data curation specialization within an ALA-accredited LIS program
    Other options are being planned
  • CIRSS Data Concepts Group
    Rationale
    Integration and interoperability require robust formal conceptual models for scientific data,
    especially if semantic technologies are going to be exploited.
    Our current models aren’t good enough.
    Mission
    The Data Concepts Group takes a logic-based approach to solving conceptual modeling problems in scientific data curation.
  • Questions?
    This research is being carried out by the Data Concepts Group at the Center for Informatics Research in Science and Scholarship (CIRSS) at the University of Illinois at Urbana-Champaign, Carole L. Palmer, Director.
    Principal contributors include David Dubin, Karen M. Wickett, Simone Sacchi, Richard Urban, Allen H. Renear
    NSF/OCI-ITR DataNet Award #0830976; IMLS/LB Award #RE-05-08-0062-08