This document summarizes a webinar on metadata for managing scientific research data. The webinar covered why metadata is important for scientific data management; definitions of data and metadata; selected metadata standards, including Dublin Core, Darwin Core, and FGDC; challenges in generating metadata and opportunities to address them; and advice for getting started with metadata. The webinar emphasized that metadata standards provide guidelines, not strict rules, and encouraged participants to keep metadata simple while aiming to facilitate the reuse of data.
About the Webinar
In May 2012, the Library of Congress announced a new modeling initiative focused on reflecting the MARC 21 library standard as a Linked Data model for the Web, with an initial model to be proposed by the consulting company Zepheira. The goal of the initiative is to translate the MARC 21 format to a Linked Data model while retaining the richness and benefits of existing data in the historical format.
In this webinar, Eric Miller of Zepheira will report on progress towards this important goal, starting with an analysis of the translation problem and concluding with potential migration scenarios for a broad-based transition from MARC to a new bibliographic framework.
Slides from my Metadata Workshop at Content Strategy Applied 2012. The session included several hands on exercises, which is where a lot of the interesting conversation took place.
Libraries around the world have a long tradition of maintaining authority files to assure the consistent presentation and indexing of names. As library authority files have become available online, the authority data has become accessible -- and many have been published as Linked Open Data (LOD) -- but names in one library authority file typically had no link to corresponding records for persons and organizations in other library authority files. After a successful experiment in matching the Library of Congress/NACO authority file with the German National Library's authority file, an online system called the Virtual International Authority File was developed to facilitate sharing by ingesting, matching, and displaying the relations between records in multiple authority files.
The Virtual International Authority File (VIAF) has grown from three source files in 2007 to more than two dozen files today. The system harvests authority records, enhances them with bibliographic information, and brings them together into clusters when it is confident the records describe the same identity. Although the most visible part of VIAF is an HTML interface, the API beneath it supports a linked data view of VIAF, with URIs representing the identities themselves, not just URIs for the clusters. It supports names for persons, corporations, geographic entities, works, and expressions. With English, French, German, and Spanish interfaces (and a Japanese interface in progress), the system is used around the world, handling over a million queries per day.
Speaker
Thomas Hickey is Chief Scientist at OCLC, where he helped found OCLC Research. His current interests include metadata creation and editing systems, authority control, parallel systems for bibliographic processing, and information retrieval and display. In addition to implementing VIAF, his group explores Web access to metadata, the identification of FRBR works and expressions in WorldCat, the algorithmic creation of authorities, and the characterization of collections. He has an undergraduate degree in Physics and a Ph.D. in Library and Information Science.
Big Linked Data - Creating Training Curricula (EUCLID project)
This presentation includes an overview of the basic rules to follow when developing training and education curricula for Linked Data and Big Linked Data.
The Dublin Core 1:1 Principle in the Age of Linked Data (Richard Urban)
Presentation given at the International Conference on Dublin Core and Metadata Applications, Austin, TX. October 9, 2014. See associated paper http://dcevents.dublincore.org/IntConf/dc-2014/paper/view/263
Usage of Linked Data: Introduction and Application ScenariosEUCLID project
This presentation introduces the main principles of Linked Data, the underlying technologies and background standards. It provides basic knowledge for how data can be published over the Web, how it can be queried, and what are the possible use cases and benefits. As an example, we use the development of a music portal (based on the MusicBrainz dataset), which facilitates access to a wide range of information and multimedia resources relating to music.
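To make the principles concrete, here is a minimal sketch of the Linked Data model in Python: resources named by URIs and described with subject-predicate-object triples. The URIs and facts below are illustrative stand-ins, not actual MusicBrainz identifiers.

```python
# Resources are named by URIs and described with subject-predicate-object
# triples. These URIs and facts are illustrative, not real MusicBrainz data.
triples = [
    ("http://example.org/artist/beatles", "http://xmlns.com/foaf/0.1/name",
     "The Beatles"),
    ("http://example.org/artist/beatles", "http://example.org/vocab/hometown",
     "Liverpool"),
    ("http://example.org/album/abbey-road", "http://purl.org/dc/terms/creator",
     "http://example.org/artist/beatles"),
]

def describe(subject):
    """All predicate-object pairs asserted about one subject URI."""
    return [(p, o) for s, p, o in triples if s == subject]

print(len(describe("http://example.org/artist/beatles")))  # 2
```

Because the album's creator is itself a URI, following that value leads to further descriptions — the "linked" part of Linked Data.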
This presentation focuses on means for exploring Linked Data. In particular, it gives an overview of current visualization tools and techniques, looking at semantic browsers and applications for presenting the data to the end user. We also describe existing search options, including faceted search, concept-based search, and hybrid search, which combines semantic information with text processing. Finally, we conclude with approaches for Linked Data analysis, describing how available data can be synthesized and processed in order to draw conclusions.
This presentation covers the whole spectrum of Linked Data production and exposure. After a grounding in the Linked Data principles and best practices, with special emphasis on the VoID vocabulary, we cover R2RML, operating on relational databases, Open Refine, operating on spreadsheets, and GATECloud, operating on natural language. Finally we describe the means to increase interlinkage between datasets, especially the use of tools like Silk.
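The tabular-to-RDF step that tools like R2RML and Open Refine perform can be sketched in a few lines of Python. This is only the mapping idea, not R2RML syntax; the columns and URI patterns are invented for the example.

```python
import csv
import io

# Spreadsheet-style input; an actual R2RML mapping would express the same
# subject/predicate rules declaratively rather than in code.
data = io.StringIO("id,name,country\n1,Radiohead,UK\n2,Bjork,Iceland\n")

def rows_to_triples(reader, base="http://example.org/artist/"):
    triples = []
    for row in reader:
        subject = base + row["id"]  # subject URI minted from the key column
        triples.append((subject, "http://xmlns.com/foaf/0.1/name", row["name"]))
        triples.append((subject, "http://example.org/vocab/country", row["country"]))
    return triples

triples = rows_to_triples(csv.DictReader(data))
print(len(triples))  # 2 rows x 2 mapped columns = 4 triples
```

Interlinking tools such as Silk then take datasets produced this way and propose links between resources that likely denote the same thing.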
SDA2013 Pundit: Creating, Exploring and Consuming Annotations (Marco Grassi)
This paper presents Pundit, a novel semantic web annotation tool, and demonstrates its use in producing structured data out of users' annotations. Pundit allows communities of scholars to produce machine-readable annotations that can be made public and thus consumed as web data via SPARQL and ad-hoc REST APIs.
Pundit is highly configurable and can be deployed in custom instances that include well-defined, agreed-upon annotation vocabularies. Such instances can be distributed as bookmarklets to community users so they can create uniformly structured data in a given application scenario. Building on the provided APIs, several demonstrative applications have been developed, exploring use scenarios ranging from philosophy to journalism and cultural heritage.
The main aim of this paper is to demonstrate how such uniformly structured annotations can be quickly re-used on the web to make information discoverable or to visualize it in interesting ways.
A quick presentation on the benefits of structured knowledge, focused on Parallax and Freebase, and on how their knowledge representation fits into the wider scope of the semantic web.
Mending the Gap between Library's Electronic and Print Collections in ILS and... (New York University)
This presentation proposed a conceptual model of users' information-seeking behavior in the context of their experience, and used that model to improve a library's collections and services, with St. John's University Libraries as a case study. It reviewed Web content technologies offered by IT vendors and compared them with the content technologies offered by library IT vendors. To fill the gap, it developed a preliminary proposal covering 1) the required data architecture in an SOA framework, 2) desired features for managing library print and electronic content on the library's website, 3) the adoption of Semantic Web standards and technologies for managing library resources, and 4) a case-study scenario with a sample conceptual model.
This presentation looks in detail at SPARQL (SPARQL Protocol and RDF Query Language) and introduces approaches for querying and updating semantic data. It covers the SPARQL algebra and the SPARQL protocol, and provides examples of reasoning over Linked Data. We use examples from the music domain, which can be tried out directly and run over the MusicBrainz dataset. This includes gaining some familiarity with the RDFS and OWL languages, which allow developers to formulate generic and conceptual knowledge that can be exploited by automatic reasoning services in order to enhance the power of querying.
Towards digitizing scholarly communication (Sören Auer)
Slides of the VIVO 2016 Conference keynote: Despite the availability of ubiquitous connectivity and information technology, scholarly communication has not changed much in the last hundred years: research findings are still encoded in and decoded from linear, static articles and the possibilities of digitization are rarely used. In this talk, we will discuss strategies for digitizing scholarly communication. This comprises in particular: the use of machine-readable, dynamic content; the description and interlinking of research artifacts using Linked Data; the crowd-sourcing of multilingual
educational and learning content. We discuss the relation of these developments to research information systems and how they could become part of an open ecosystem for scholarly communication.
Functional and Architectural Requirements for Metadata: Supporting Discovery... (Jian Qin)
The tremendous growth in digital data has led to an increase in metadata initiatives for different types of scientific data, as evident in Ball’s survey (2009). Although individual communities have specific needs, there are shared goals that need to be recognized if systems are to effectively support data sharing within and across all domains. This paper considers this need and explores the systems requirements essential for metadata supporting the discovery and management of scientific data. The paper begins with an introduction and a review of selected research specific to metadata modeling in the sciences. Next, the paper’s goals are stated, followed by the presentation of key systems requirements. The results include a base model with three chief principles: the principle of least effort, infrastructure service, and portability. The principles are intended to support “data user” tasks. Results also include a set of defined user tasks and functions, and application scenarios.
RDAP 15: Beyond Metadata: Leveraging the “README” to support disciplinary Doc... (ASIS&T)
Research Data Access and Preservation Summit, 2015
Minneapolis, MN
April 22-23, 2015
Part of “Beyond metadata: Supporting non-standardized documentation to facilitate data reuse”
LAC Group - Metadata for mere mortals (Choosing standards)
Metadata for mere mortals - Part 2: Choosing standards
Presented by Erin Antognoli, Metadata Librarian
Welcome to part 2 of our Metadata for mere mortals series, which serves as a basic introduction to the principles and function of metadata for content and digital asset managers who lack formal training in this area.
There are a lot of metadata standards out there, and in this video, we will examine:
- What questions to ask to ensure you will meet the needs of your community/user group.
- General or subject-specific metadata standards, such as Dublin Core and ISO 19115 respectively.
- The pros and cons of such metadata standards.
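As a rough illustration of what adopting a general standard like Dublin Core buys you, here is a minimal record sketched in Python. The element names are genuine Dublin Core terms, but the record values and the choice of "required" elements are invented for the example.

```python
# The element names are real Dublin Core terms; the record values and the
# "required" set are invented for illustration.
record = {
    "dc:title": "Soil moisture measurements, field site A",
    "dc:creator": "Jane Example",
    "dc:date": "2015-06-01",
    "dc:subject": ["soil science", "moisture"],
    "dc:format": "text/csv",
}

def missing_elements(record, required=("dc:title", "dc:creator", "dc:date")):
    """Report which of a chosen core set of elements a record lacks."""
    return [e for e in required if e not in record]

print(missing_elements(record))  # [] - the record covers the chosen core
```

A shared element set is what makes this kind of simple completeness check, and cross-collection search, possible at all.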
Check out Part 1 of this webinar series: https://lac.gp/MetadataIntro
Download our free metadata report, Making sense of metadata: https://lac.gp/MetadataReport
Contact us: https://lac-group.com/contact-us/
Scientific discovery and innovation in an era of data-intensive science
William (Bill) Michener, Professor and Director of e-Science Initiatives for University Libraries, University of New Mexico; DataONE Principal Investigator
The scope and nature of biological, environmental and earth sciences research are evolving rapidly in response to environmental challenges such as global climate change, invasive species and emergent diseases. Scientific studies are increasingly focusing on long-term, broad-scale, and complex questions that require massive amounts of diverse data collected by remote sensing platforms and embedded environmental sensor networks; collaborative, interdisciplinary science teams; and new tools that promote scientific data preservation, discovery, and innovation. This talk describes the challenges facing scientists as they transition into this new era of data intensive science, presents current solutions, and lays out a roadmap to the future where new information technologies significantly increase the pace of scientific discovery and innovation.
How Portable Are the Metadata Standards for Scientific Data? (Jian Qin)
The one-covers-all approach in current metadata standards for scientific data has serious limitations in keeping up with ever-growing data. This paper reports the findings of a survey of metadata standards in the scientific data domain and argues for the need for a metadata infrastructure. The survey collected 4400+ unique elements from 16 standards and categorized these elements into 9 categories. Findings included that the highest counts of elements occurred in the descriptive category, and that many of them overlapped with DC elements. This pattern was also repeated among elements that co-occurred in different standards. A small number of semantically general elements appeared across the largest numbers of standards, while the rest of the element co-occurrences formed a long tail with a wide range of specific semantics. The paper discusses the implications of these findings for metadata portability and infrastructure, and points out that large, complex standards and widely varied naming practices are the major hurdles for building a metadata infrastructure.
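The survey's co-occurrence analysis can be illustrated in miniature: count, for each element name, how many standards use it. The standards and element sets below are toy stand-ins, not the 16 standards and 4400+ elements of the actual survey.

```python
from collections import Counter

# Toy stand-ins for metadata standards and their element names.
standards = {
    "standard_A": {"title", "creator", "date", "spatialExtent"},
    "standard_B": {"title", "creator", "instrument"},
    "standard_C": {"title", "date", "taxonRank"},
}

def element_spread(standards):
    """Map each element name to the number of standards it appears in."""
    counts = Counter()
    for elements in standards.values():
        counts.update(elements)
    return counts

spread = element_spread(standards)
print(spread["title"])      # 3 - a general element shared by every standard
print(spread["taxonRank"])  # 1 - a domain-specific, long-tail element
```

The shape of this distribution, a few high-spread general elements and a long tail of specific ones, is exactly the pattern the paper reports.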
Over the past decade, vast amounts of machine-readable structured information have become available through the automation of research processes as well as the increasing popularity of knowledge graphs and semantic technologies.
Today, we count more than 10,000 datasets made available online following Semantic Web standards.
A major and yet unsolved challenge that research faces today is to perform scalable analysis of large-scale knowledge graphs in order to facilitate applications in various domains including life sciences, publishing, and the internet of things.
The main objective of this thesis is to lay foundations for efficient algorithms performing analytics, i.e. exploration, quality assessment, and querying over semantic knowledge graphs at a scale that has not been possible before.
First, we propose a novel approach for statistical calculations of large RDF datasets, which scales out to clusters of machines.
In particular, we describe the first distributed in-memory approach for computing 32 different statistical criteria for RDF datasets using Apache Spark.
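As a single-machine illustration of the kind of criteria involved (the thesis itself computes them distributed, in memory, with Apache Spark), a few dataset statistics over plain triples might look like this; the data is invented.

```python
# Invented toy data; the real computation runs distributed on Apache Spark.
triples = [
    ("ex:a", "rdf:type", "ex:Artist"),
    ("ex:a", "ex:name", "A"),
    ("ex:b", "rdf:type", "ex:Album"),
    ("ex:b", "ex:title", "B"),
]

def dataset_statistics(triples):
    """A few of the simpler criteria: size and distinct-term counts."""
    return {
        "triples": len(triples),
        "distinct_subjects": len({s for s, _, _ in triples}),
        "distinct_predicates": len({p for _, p, _ in triples}),
    }

print(dataset_statistics(triples))
# {'triples': 4, 'distinct_subjects': 2, 'distinct_predicates': 3}
```

Each criterion reduces to a map-and-aggregate over the triple set, which is what makes the computation naturally distributable.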
Many applications such as data integration, search, and interlinking, may take full advantage of the data when having a priori statistical information about its internal structure and coverage.
However, such applications may suffer from low quality, and may be unable to take full advantage of the data, when its size exceeds the capacity of the available resources.
Thus, we introduce a distributed approach to the quality assessment of large RDF datasets.
It is the first distributed, in-memory approach for computing different quality metrics for large RDF datasets using Apache Spark. We also provide a quality assessment pattern that can be used to generate new scalable metrics that can be applied to big data.
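A toy example of one such metric, sketched without Spark: the fraction of subjects carrying a human-readable label. The metric choice and data are illustrative, not taken from the thesis.

```python
# Invented toy data and metric; illustrative only.
triples = [
    ("ex:a", "rdfs:label", "Artist A"),
    ("ex:a", "rdf:type", "ex:Artist"),
    ("ex:b", "rdf:type", "ex:Album"),  # ex:b lacks a label
]

def labeled_subject_ratio(triples):
    """Fraction of distinct subjects that have at least one rdfs:label."""
    subjects = {s for s, _, _ in triples}
    labeled = {s for s, p, _ in triples if p == "rdfs:label"}
    return len(labeled) / len(subjects) if subjects else 1.0

print(labeled_subject_ratio(triples))  # 0.5 - one of two subjects is labeled
```

Like the statistics above, the metric is a filter-and-aggregate over the triples, the pattern that generalizes to new scalable metrics.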
Based on the knowledge of the internal statistics of a dataset and its quality, users typically want to query and retrieve large amounts of information.
As a result, it has become difficult to efficiently process these large RDF datasets.
Indeed, these processes require both efficient storage strategies and query-processing engines that can scale with data size.
Therefore, we propose a scalable approach to evaluate SPARQL queries over distributed RDF datasets by translating SPARQL queries into Spark executable code.
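The translation idea can be sketched on a single machine: a basic graph pattern of two triple patterns becomes a scan of each pattern followed by a join on their shared variable, which is the shape of relational plan the Spark translation produces. The data and query are invented for the example.

```python
# Invented toy data; the actual system compiles SPARQL to Spark code.
triples = [
    ("ex:abbey", "ex:by", "ex:beatles"),
    ("ex:beatles", "ex:name", "The Beatles"),
]

def scan(pattern):
    """Bindings for one triple pattern; '?'-prefixed terms are variables."""
    results = []
    for triple in triples:
        binding, ok = {}, True
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                binding[term] = value
            elif term != value:
                ok = False
                break
        if ok:
            results.append(binding)
    return results

def join(left, right):
    """Merge binding sets that agree on every shared variable."""
    return [{**a, **b} for a in left for b in right
            if all(a[k] == b[k] for k in a.keys() & b.keys())]

# Analogous to: SELECT ?name WHERE { ?album ex:by ?artist . ?artist ex:name ?name }
rows = join(scan(("?album", "ex:by", "?artist")),
            scan(("?artist", "ex:name", "?name")))
print(rows[0]["?name"])  # The Beatles
```

In the distributed setting, each scan becomes a filter over a partitioned triples table and the join becomes a Spark join keyed on the shared variable.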
We conducted several empirical evaluations to assess the scalability, effectiveness, and efficiency of our proposed approaches.
More importantly, various use cases, e.g. Ethereum analysis, mining big data logs, and scalable integration of POIs, have been developed and leverage our approach.
The empirical evaluations and concrete applications provide evidence that our methodology and techniques proposed during this thesis help to effectively analyze and process large-scale RDF datasets.
All the proposed approaches during this thesis are integrated into the larger SANSA framework.
This presentation was provided by William Mattingly of the Smithsonian Institution, during the closing segment of the NISO training series "AI & Prompt Design." Session Eight: Limitations and Potential Solutions, was held on May 23, 2024.
This presentation was provided by William Mattingly of the Smithsonian Institution, during the seventh segment of the NISO training series "AI & Prompt Design." Session 7: Open Source Language Models, was held on May 16, 2024.
This presentation was provided by William Mattingly of the Smithsonian Institution, during the sixth segment of the NISO training series "AI & Prompt Design." Session Six: Text Classification with LLMs, was held on May 9, 2024.
This presentation was provided by William Mattingly of the Smithsonian Institution, during the fifth segment of the NISO training series "AI & Prompt Design." Session Five: Named Entity Recognition with LLMs, was held on May 2, 2024.
This presentation was provided by William Mattingly of the Smithsonian Institution, during the fourth segment of the NISO training series "AI & Prompt Design." Session Four: Structured Data and Assistants, was held on April 25, 2024.
This presentation was provided by William Mattingly of the Smithsonian Institution, during the third segment of the NISO training series "AI & Prompt Design." Session Three: Beginning Conversations, was held on April 18, 2024.
This presentation was provided by Kaveh Bazargan of River Valley Technologies, during the NISO webinar "Sustainability in Publishing." The event was held April 17, 2024.
This presentation was provided by Dana Compton of the American Society of Civil Engineers (ASCE), during the NISO webinar "Sustainability in Publishing." The event was held April 17, 2024.
This presentation was provided by William Mattingly of the Smithsonian Institution, during the second segment of the NISO training series "AI & Prompt Design." Session Two: Large Language Models, was held on April 11, 2024.
This presentation was provided by Teresa Hazen of the University of Arizona, Geoff Morse of Northwestern University, and Ken Varnum of the University of Michigan, during the Spring ODI Conformance Statement Workshop for Libraries. This event was held on April 9, 2024.
This presentation was provided by William Mattingly of the Smithsonian Institution, during the opening segment of the NISO training series "AI & Prompt Design." Session One: Introduction to Machine Learning, was held on April 4, 2024.
This presentation was provided by William Mattingly of the Smithsonian Institution, for the eighth and final session of NISO's 2023 Training Series on Text and Data Mining. Session eight, "Building Data Driven Applications" was held on Thursday, December 7, 2023.
This presentation was provided by William Mattingly of the Smithsonian Institution, for the seventh session of NISO's 2023 Training Series on Text and Data Mining. Session seven, "Vector Databases and Semantic Searching" was held on Thursday, November 30, 2023.
This presentation was provided by William Mattingly of the Smithsonian Institution, for the sixth session of NISO's 2023 Training Series on Text and Data Mining. Session six, "Text Mining Techniques" was held on Thursday, November 16, 2023.
This presentation was provided by William Mattingly of the Smithsonian Institution, for the fifth session of NISO's 2023 Training Series on Text and Data Mining. Session five, "Text Processing for Library Data" was held on Thursday, November 9, 2023.
This presentation was provided by Todd Carpenter, Executive Director, during the NISO webinar on "Strategic Planning." The event was held virtually on November 8, 2023.
This presentation was provided by Rhonda Ross of CAS, a division of the American Chemical Society, and Jonathan Clark of the International DOI Foundation, during the NISO webinar on "Strategic Planning." The event was held virtually on November 8, 2023.
This presentation was provided by William Mattingly of the Smithsonian Institution, for the fourth session of NISO's 2023 Training Series on Text and Data Mining. Session four, "Data Mining Techniques" was held on Thursday, November 2, 2023.
This presentation was provided by Tiffany Straza of UNESCO, during the two-day "NISO Tech Summit: Reflections Upon The Year of Open Science." Day two was held on October 26, 2023.
Culturally, the Romans were eclectic, absorbing and adapting elements from the civilizations they encountered, particularly the Greeks. Roman art, literature, and philosophy reflected this synthesis, creating a rich cultural tapestry. Latin, the Roman language, became the lingua franca of the Western world, influencing numerous modern languages.
Roman architecture and engineering achievements were monumental. They perfected the arch, vault, and dome, constructing enduring structures like the Colosseum, Pantheon, and aqueducts. These engineering marvels not only showcased Roman ingenuity but also served practical purposes, from public entertainment to water supply.
2024.06.01 Introducing a competency framework for languag learning materials ...Sandy Millin
http://sandymillin.wordpress.com/iateflwebinar2024
Published classroom materials form the basis of syllabuses, drive teacher professional development, and have a potentially huge influence on learners, teachers and education systems. All teachers also create their own materials, whether a few sentences on a blackboard, a highly-structured fully-realised online course, or anything in between. Despite this, the knowledge and skills needed to create effective language learning materials are rarely part of teacher training, and are mostly learnt by trial and error.
Knowledge and skills frameworks, generally called competency frameworks, for ELT teachers, trainers and managers have existed for a few years now. However, until I created one for my MA dissertation, there wasn’t one drawing together what we need to know and do to be able to effectively produce language learning materials.
This webinar will introduce you to my framework, highlighting the key competencies I identified from my research. It will also show how anybody involved in language teaching (any language, not just English!), teacher training, managing schools or developing language learning materials can benefit from using the framework.
Unit 8 - Information and Communication Technology (Paper I).pdfThiyagu K
This slides describes the basic concepts of ICT, basics of Email, Emerging Technology and Digital Initiatives in Education. This presentations aligns with the UGC Paper I syllabus.
Unit 8 - Information and Communication Technology (Paper I).pdf
NISO/DCMI Webinar: Metadata for Managing Scientific Research Data
1. Metadata for Managing Scientific Research Data
NISO/DCMI Webinar: August 22, 2012
Jane Greenberg, Professor and Director of the SILS Metadata Research Center
janeg@email.unc.edu
2. Overview
▪ Why should we care?
▪ What is data?
▪ What is metadata's role w.r.t. data?
▪ Selected metadata standards
▪ Challenges, opportunities, and jumping in
▪ Concluding comments
▪ Q&A
3. Why should we care?
BIG stuff
▪ Digital data deluge (Hey & Trefethen, 2003)
▪ Big data (New York Times, 2008)
▪ The fourth paradigm (Jim Gray, 2007)
Just as important
▪ The long tail (Heidorn, 2008)
▪ CODATA/Data-at-Risk Task Group
▪ Scholarly communications, data citation
Technological affordances for improving and advancing science
4. Cultural shift toward data sharing
▪ National and international policies
– US NSF and NIH [1, 2]
– OECD (Organisation for Economic Co-operation and Development) [3]
– INSPIRE (Infrastructure for Spatial Information in the European Community), EU Commission [4]
– UK Medical Research Council [5]
Dryad "enables scientists to validate published findings, explore new analysis methodologies, repurpose data for research questions unanticipated by the original authors, and perform synthetic studies." (http://datadryad.org/)
5. Overview
6. Data
▪ No single agreed-upon definition
▪ One person's data is another person's information
▪ Data often implies the "raw" stuff lacking context
– Scholarly context, written assessment
▪ "Essence of science" (Greenberg, et al., 2009)
▪ What is science?
– The Archaeology Data Service (ADS): archaeologydataservice.ac.uk
7. Data: I know it when I see it
By example: traditional observations, numbers, and measures stored in spreadsheets and databases; fossils, phylogenetic trees, and herbarium samples (White, 2008)
Other disciplines
▪ Bioinformatics: gene expressions, DNA transcription to RNA translation
▪ Geology, agriculture, surveillance, and historical manuscript research: hyperspectral remote sensing
Data quantity by type, the Dryad Repository (email w/ R. Scherle, July 2012):
3162 Plain Text
476 Microsoft Excel
308 Adobe Portable Document Format
302 Comma-separated values
252 Nexus
153 Microsoft Excel OpenXML
108 Microsoft Word
80 Zip file
62 JPEG image
45 Microsoft Word OpenXML
40 Extensible Markup Language
35 Hypertext Markup Language
21 Rich Text Format
16 FASTA sequence file
15 Tag Image File Format
14 Postscript Files
2 Video Quicktime
2 Mathematica Notebook
1 Microsoft Powerpoint
8. Overview
9. Metadata defined
……data about data
……information about data
▪ "Metadata or 'data about data' describes the content, quality, condition, and other characteristics of data." (FGDC Metadata WG, 1998)
▪ Structured information about an object (data) that facilitates functions associated with the object. (Greenberg, 2002, 2003, 2009)
10. Typical functions
▪ Discover
▪ Manage
▪ Control rights
▪ Identify versions
▪ Certify authenticity
▪ Indicate status
▪ Mark content structure
▪ Situate geospatially
▪ Describe processes
11. Overview
13. Metadata for Scientific Research Data
Descriptive
– General to granular
▪ Value (addressing a topic, "aboutness")
– Topical (ontologies, subject heading lists/thesauri, taxonomies)
▪ Named entities
– Name authority files (people, organizations, geographical jurisdictions, structures, and events)
▪ Geo-spatial (coordinates)
▪ Temporal data (ISO 8601 / W3CDTF, or …)
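Temporal values like these stay consistent when they are generated rather than typed by hand. A minimal sketch in Python, assuming a hypothetical observation timestamp (W3CDTF is a profile of ISO 8601 that allows truncated granularities such as YYYY, YYYY-MM, and YYYY-MM-DD):

```python
from datetime import datetime, timezone

# Hypothetical collection timestamp for a dataset
collected = datetime(2012, 8, 22, 14, 30, tzinfo=timezone.utc)

# Full ISO 8601 date-time with an explicit UTC offset
full = collected.isoformat()                 # '2012-08-22T14:30:00+00:00'

# Coarser W3CDTF granularities for less precise temporal coverage
date_only = collected.strftime("%Y-%m-%d")   # '2012-08-22'
year_month = collected.strftime("%Y-%m")     # '2012-08'

print(full, date_only, year_month)
```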
14. Given the messiness…
"I cannot tell you exactly what metadata standards, vocabularies, etc. to use…"
15. Examining metadata schemes
▪ Objectives and principles: objectives; principles
▪ Domains: discipline; genre; format
▪ Architectural layout: structural design; extent; granularity
Metadata Objectives and principles, Domain, and Architectural Layout (MODAL) framework (Greenberg, 2005; Willis, et al., JASIST 2012)
16. Simple schemes [6]
▪ Dublin Core Metadata Element Set (DCMES), ver. 1.1
– Objectives and principles: interoperability; easy to generate, lower barrier to produce
– Domains: multi-disciplinary; any genre or format
– Architectural layout: primarily flat; minimal, with means to extend; general (not granular)
▪ US MARC bibliographic format
– Objectives and principles: need training
– Architectural layout: primarily flat; extensible
▪ DataCite
– Architectural layout: primarily flat
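The flat shape of DCMES can be sketched as a simple key-value record. The element names below are the real 15 DCMES 1.1 properties; the dataset values are hypothetical illustrations, not from the slides:

```python
# A flat DCMES 1.1 record: one level deep, elements optional and repeatable
record = {
    "title": "Stream temperature measurements",  # hypothetical dataset
    "creator": "J. Example",                     # hypothetical author
    "subject": ["hydrology", "temperature"],
    "date": "2012-08-22",
    "type": "Dataset",
    "format": "text/csv",
    "language": "en",
}

# The 15 DCMES 1.1 elements; a simple record uses only these keys
DCMES_1_1 = {
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
}
assert set(record) <= DCMES_1_1   # every key is a DCMES element
print(sorted(record))
```

Because the scheme is flat and every element is optional, the barrier to producing a valid record is low, which is exactly the trade-off the slide describes.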
18. DataCite example, ver. 2.2 [8]
National Institute for Environmental Studies and Center for Climate System Research, Japan
19. US MARC bibliographic format: World Ocean Circulation Experiment global data (Moss Landing Marine Labs and the Monterey Bay Aquarium Research Institute Library) [9]
20. Simple/moderate, balanced schemes
– Objectives and principles: interoperability balanced w/ specific needs; generation requires more expertise
– Domains: greater domain focus; genera diversity within a domain
– Architectural layout: primarily flat; extensibility via connecting; slightly more granular
▪ Darwin Core
▪ Access to Biological Collections Data (ABCD): not as flat
▪ Ecological Metadata Language
▪ DCMI Terms: graph approach
21. Wieczorek, et al. (2012). Darwin Core: An Evolving Community-Developed Biodiversity Data Standard. PLoS One 7(1): e29715. doi:10.1371/journal.pone.0029715
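Darwin Core data is typically exchanged as flat rows whose column names are DwC terms. A minimal sketch: the column names (occurrenceID, scientificName, eventDate, decimalLatitude, decimalLongitude, basisOfRecord) are real Darwin Core terms, while the specimen values are hypothetical:

```python
import csv
import io

fieldnames = ["occurrenceID", "scientificName", "eventDate",
              "decimalLatitude", "decimalLongitude", "basisOfRecord"]
occurrence = {
    "occurrenceID": "urn:catalog:EXAMPLE:0001",   # hypothetical identifier
    "scientificName": "Puma concolor",
    "eventDate": "2012-07-16",                    # ISO 8601 date
    "decimalLatitude": "35.9049",
    "decimalLongitude": "-79.0469",
    "basisOfRecord": "HumanObservation",
}

# Write a one-row Darwin Core occurrence file in memory
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=fieldnames)
writer.writeheader()
writer.writerow(occurrence)
print(buf.getvalue())
```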
23. Properties in the /terms/ namespace
abstract, accessRights, accrualMethod, accrualPeriodicity, accrualPolicy, alternative, audience, available, bibliographicCitation, conformsTo, contributor, coverage, created, creator, date, dateAccepted, dateCopyrighted, dateSubmitted, description, educationLevel, extent, format, hasFormat, hasPart, hasVersion, identifier, instructionalMethod, isFormatOf, isPartOf, isReferencedBy, isReplacedBy, isRequiredBy, issued, isVersionOf, language, license, mediator, medium, modified, provenance, publisher, references, relation, replaces, requires, rights, rightsHolder, source, spatial, subject, tableOfContents, temporal, title, type, valid
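Slide 20 notes that DCMI Terms takes a graph approach: each property above is a URI in the /terms/ namespace, and a description is a set of (subject, property, value) statements. A minimal sketch, assuming a hypothetical dataset URI:

```python
DCTERMS = "http://purl.org/dc/terms/"   # the DCMI /terms/ namespace

def dcterms(local_name):
    """Expand a local property name (e.g. 'created') to its full URI."""
    return DCTERMS + local_name

dataset = "http://example.org/dataset/42"   # hypothetical dataset URI
triples = [
    (dataset, dcterms("title"), "Stream temperature measurements"),
    (dataset, dcterms("created"), "2012-08-22"),
    (dataset, dcterms("license"),
     "http://creativecommons.org/publicdomain/zero/1.0/"),
    (dataset, dcterms("isPartOf"), "http://example.org/collection/7"),
]

assert dcterms("created") == "http://purl.org/dc/terms/created"
print(len(triples))
```

Because properties are URIs rather than local field names, records from different repositories can be merged into one graph without renaming anything.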
24. Complex schemes
– Objectives and principles: interoperability level; generation requires greater expertise
– Domains: genre focus; format variation
– Architectural layout: hierarchical; extensive; granular
Content Standard for Digital Geospatial Metadata (CSDGM)/FGDC:
1. Identification Information (M)
2. Data Quality Information
3. Spatial Data Organization Information
4. Spatial Reference Information
5. Entity and Attribute Information
6. Distribution Information
7. Metadata Reference Information (M)
Data Documentation Initiative (DDI) lifecycle:
1. Concept
2. Collecting
3. Processing
4. Archiving
5. Distribution
6. Discovery
7. Analysis
8. Repurposing
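The contrast between flat and complex schemes can be made concrete. A CSDGM-style record nests elements inside numbered top-level sections; the sketch below uses the real section names from the list above, only two sections, and hypothetical values throughout:

```python
# Hierarchical CSDGM-style record sketched as nested dictionaries
csdgm = {
    "identification_information": {            # section 1 (mandatory)
        "citation": {"title": "WOCE global data", "pubdate": "2012"},
        "description": {"abstract": "Hypothetical abstract text."},
        "spatial_domain": {
            "bounding": {"west": -180.0, "east": 180.0,
                         "south": -90.0, "north": 90.0},
        },
    },
    "metadata_reference_information": {        # section 7 (mandatory)
        "metadata_date": "2012-08-22",
        "metadata_standard_name": "FGDC CSDGM",
    },
}

def depth(node):
    """Nesting depth of a record; flat schemes like DCMES stay at 1."""
    if not isinstance(node, dict):
        return 0
    return 1 + max(depth(v) for v in node.values())

print(depth(csdgm))  # hierarchical: several levels deeper than a flat record
```

The extra nesting is what buys the granularity the slide describes, at the cost of needing more expertise to generate a valid record.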
25. Summary for descriptive schemes
▪ Simple: interoperable; easy to generate/low barrier; generally multidisciplinary; genre/format agnostic; primarily flat; general (not granular); 15-25 properties
▪ Simple/moderate: interoperability balanced w/ specific needs; generation requires more expertise; greater domain focus; extensible via connecting to other schemes; more granular; more properties
▪ Complex: interoperability level; generation requires expertise; genre focus/format variation; hierarchical; granular; and extensive (100+ properties)
27. Overview
28. Challenges and opportunities
▪ Challenge: workflow; when to generate the metadata?
Opportunity: educate scientists early (Qin, 2009); integrate into the social setting, as with the Center for Embedded Networked Sensing (CENS) (Borgman, Mayernik, etc., 2009-current; Mayernik's dissertation, 2011)
▪ Challenge: methods for generating metadata (labor intensive)
Opportunity: use automatic techniques as much as possible; leverage human expertise (Dryad, DataONE Excel project)
▪ Challenge: too many standards; which one do I use?
Opportunity: don't panic; join communities; look for examples. (And if you can't find them?)
▪ Challenge: do I need to implement my metadata as linked data?
Opportunity: no; explore and develop a best practice; pursue a 2-pronged approach (Greenberg, et al., 2009)
29. Jumping in…
1. DCMI/NISO Seminars!!
2. DCMI Science and Metadata Community (http://wiki.dublincore.org/index.php/DCMI_Science_And_Metadata)
3. Digital Curation Centre (DCC) (http://www.dcc.ac.uk/)
4. The Research Data Management Training (MANTRA) project (http://datalib.edina.ac.uk/mantra/)
5. DataONE workshops and tutorials (www.dataone.org/)
30. Overview
31. Concluding comments
▪ Standards are guidelines; no police
– Aim for reasonable quality
▪ KISS: Keep it simple, stupid
– What's vital; what will aid reuse?
▪ Help to move the practice forward
– Share what you learn
▪ Nothing new/it's all new
– Data documentation since ancient times
– Silos: let's break them down (Willis, et al., 2012)
– Greater connectivity than ever
– Cross-disciplinary approaches for problem solving
32. Overview
33. Footnotes
[1] NSF Data Sharing Policy: http://www.nsf.gov/bfa/dias/policy/dmp.jsp
[2] NIH Data Sharing Policy: http://grants.nih.gov/grants/policy/data_sharing/
[3] OECD (Organisation for Economic Co-operation and Development) Data and Metadata Reporting and Presentation Handbook: http://www.oecd.org/std/37671574.pdf
[4] The INSPIRE (Infrastructure for Spatial Information in the European Community) directive: http://inspire.ec.europa.eu/index.cfm/pageid/48. Released 15 May 2007; being implemented in stages, with full implementation required by 2019; aims to create a European Union (EU) spatial data infrastructure.
[5] UK Medical Research Council: http://www.mrc.ac.uk/Ourresearch/Ethicsresearchguidance/datasharing/index.html
[6] The DCMI Glossary (scroll down for the "schema" entry): http://dublincore.org/documents/usageguide/glossary.shtml#schema
[7] Dublin Core example: Data from: Divergence time estimation using fossils as terminal taxa and the origins of Lissamphibia (Dryad repository): http://datadryad.org/resource/doi:10.5061/dryad.8120?show=full
[8] National Institute for Environmental Studies and Center for Climate System Research, Japan, animation data (DataCite): http://schema.datacite.org/meta/kernel-2.2/example/datacite-metadata-sample-v2.2.xml
[9] US MARC bibliographic format: World Ocean Circulation Experiment global data (Moss Landing Marine Labs and the Monterey Bay Aquarium Research Institute Library): http://mlml.kohalibrary.com/cgi-bin/koha/opac-detail.pl?biblionumber=9282