2. • "Data about Data"
• "Data" broadly covers any information resource
  ▪ digital or physical
  ▪ narrative, multimedia, structured
  ▪ raw data, processed data, aggregates of datasets, or discrete elements within datasets
• More formally, "Metadata is structured information that describes, explains, locates, or otherwise makes it easier to retrieve, use, or manage an information resource"
  (NISO (2004) Understanding Metadata. Bethesda, NISO Press)
METADATA
3. • Descriptive metadata: supports discovery and identification
  ▪ e.g. title, author, identifiers, subjects, keywords
• Structural metadata: describes how the components of a resource are organized
  ▪ e.g. table of contents for a book, schema of database tables, manifest of files in an aggregate "research object"
• Administrative metadata: helps manage the resource
  ▪ Technical - describes technical aspects of a resource
    ▪ e.g. file type, version information, how/when created
  ▪ Rights management - explains intellectual property rights
    ▪ e.g. licensing, use restrictions, privacy concerns
  ▪ Preservation - supports maintenance and archiving of a resource
    ▪ e.g. provenance/ownership, history of use, authenticity
METADATA SERVES MANY PURPOSES . . .
http://www.niso.org/publications/press/UnderstandingMetadata.pdf
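The three metadata types can be pictured as separate parts of one record. The following is a minimal sketch, not a standard format: the record layout, the field names, and all values are hypothetical and chosen only to mirror the examples listed on the slide.

```python
# Hypothetical metadata record for an imaginary dataset.
# The grouping mirrors the three types on the slide; it is not a standard layout.
record = {
    "descriptive": {
        "title": "Example Survey Dataset",
        "author": "A. Researcher",
        "identifier": "example-0001",
        "keywords": ["survey", "example"],
    },
    "structural": {
        # manifest of files in an aggregate "research object"
        "manifest": ["data.csv", "codebook.pdf"],
    },
    "administrative": {
        "technical": {"file_type": "text/csv", "version": "1.0"},
        "rights": {"license": "CC-BY-4.0"},
        "preservation": {"provenance": "collected by the example lab"},
    },
}


def find_by_keyword(records, keyword):
    """Discovery only needs the descriptive part of each record."""
    return [r for r in records if keyword in r["descriptive"]["keywords"]]


matches = find_by_keyword([record], "survey")
print(matches[0]["descriptive"]["title"])  # Example Survey Dataset
```

Note that discovery touches only the descriptive block, while the structural and administrative blocks matter once the resource is opened, licensed, or archived.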
4. Metadata comes in many forms, serves many needs,
and operates in very diverse settings
• I. Resource metadata (on the web)
  ▪ Target: information resources as a whole
  ▪ Primary goals: resource discovery and use
  ▪ Form: structured, separate records
  ▪ Users: everyone
  ▪ Standards: many metadata frameworks/vocabularies
• II. Metadata in database systems
  ▪ Target: structured data and data elements
  ▪ Primary goals: data consistency, aggregation, analysis
  ▪ Form: ER diagrams, summary tables, data dictionaries
  ▪ Users: professional data administrators and scientists
  ▪ Standards: metadata and CDE registries
. . . AND OPERATES IN MANY CONTEXTS
5. I. Resource Metadata (on the web)
A. Overview
B. Examples
C. Metadata Frameworks
i. Schema
ii. Vocabularies
iii. Conceptual Models
iv. Practical Specifications
v. Encoding Specifications
D. Metadata Storage and Retrieval
II. Metadata in Database Systems
A. Overview
B. Data Elements
C. Data Dictionaries
D. Common Data Elements (CDEs)
E. CDE Registries
OUTLINE
6. • Metadata that all of us have used and created in work and life
• Attached to information resources we find on the web
  ▪ books, videos, images, websites, datasets, . . .
• Helps us to find a resource and understand what it is and how to use it
I. RESOURCE METADATA (ON THE WEB)
10. • Resource metadata is increasingly structured according to established schemas and standards
• Many standards exist that vary in their:
  ▪ complexity (schemas, specifications, conceptual models)
  ▪ targets (music, video, images, books, art, datasets)
  ▪ goals (descriptive, administrative and preservation)
  ▪ communities served (libraries, museums, research)
• Benefits:
  ▪ leverage existing resources
  ▪ vetted by community
  ▪ interoperability and integration
STANDARDS ARE KEY
11. • Normative standards for metadata are captured in metadata frameworks.
• There are five possible components of a metadata framework:
A. Schema
B. Vocabularies
C. Conceptual Model
D. Practical Specifications
E. Encoding Specifications
METADATA FRAMEWORKS
12. • Core of any framework: specifies the categories of information recorded
• Composed of a set of data elements along with descriptions of their attributes and rules for use
  ▪ attributes described should minimally include an identifier and/or name and a definition of each element
• Can also specify data types and "value domains" that describe allowable values for a given element
  ▪ e.g. term lists, controlled vocabularies (CVs), ontologies
• Example schemas: Dublin Core, LOM, HCLS Dataset Std.
A. METADATA SCHEMA
13. • First effort at standardizing metadata to improve resource discovery on the web
• Very simple core schema consisting of 15 general data elements representing properties of an information resource, with no value restrictions.
• Data Elements: title, identifier, type, description, creator, contributor, date, subject, format, language, source, publisher, relation, coverage, rights
• Element Attributes: URI, label, definition, domain, range, version, comment
EXAMPLE 1: DUBLIN CORE
METADATA INITIATIVE (DCMI)
http://dublincore.org/documents/dcmi-terms/
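Because the core schema is just 15 element names with no value restrictions, checking a record against it reduces to checking its keys. A minimal sketch follows; the helper `unknown_elements`, the sample record, and its `shelf_location` key are hypothetical and only serve the illustration.

```python
# The 15 core Dublin Core elements listed on the slide.
DC_ELEMENTS = {
    "title", "identifier", "type", "description", "creator",
    "contributor", "date", "subject", "format", "language",
    "source", "publisher", "relation", "coverage", "rights",
}


def unknown_elements(record):
    """Return the keys of `record` that are not core Dublin Core elements, sorted."""
    return sorted(set(record) - DC_ELEMENTS)


# Hypothetical record; values are free text because the core schema
# places no restrictions on them.
book = {
    "title": "Understanding Metadata",
    "publisher": "NISO Press",
    "date": "2004",
    "shelf_location": "hypothetical",  # not a Dublin Core element
}

print(unknown_elements(book))  # ['shelf_location']
```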
14. • Extensive set of metadata elements describing "learning objects"
  ▪ "Any digital or non-digital entity that may be used for learning, education, or training"
• Based loosely on DCMI schema, but:
  ▪ >50 new elements to describe educational attributes of learning objects
  ▪ organizes elements into a hierarchical structure
  ▪ provides detailed specifications for allowable values
  ▪ supports "application profiles" that extend model for specific domains
EXAMPLE 2: LEARNING OBJECT
METADATA (IEEE-LOM)
http://www.imsglobal.org/metadata/mdv1p3/imsmd_bestv1p3.html
16. • The LOM base schema defines 9 categories of metadata elements
• Hierarchical structure supports user understanding, metadata organization and aggregation for analysis
LOM ELEMENT HIERARCHY
17. • A unified schema that provides all key metadata fields needed to comprehensively describe research datasets
  ▪ what they are, how they are produced, where they are found
  ▪ meets pressing need in current research climate to support sharing, discovery, and re-use of public datasets in a standardized way
• Metadata elements describe general features, identifiers, provenance and change, availability and distribution, and dataset statistics
• Composed entirely of elements (properties) from existing community vocabularies, e.g. DCMI, DCAT, PROV, VoID, FOAF
  ▪ attributes and rules for element use defined in source schema
EXAMPLE 3: W3C HCLS DATASET
DESCRIPTION STANDARD
http://www.w3.org/2001/sw/hcls/notes/hcls-dataset/
18. B. VOCABULARIES
• Set of terms (often structured) that is used to constrain entry of metadata values
• Vocabularies represent general concepts
  ▪ Word or code lists
  ▪ Hierarchical classifications
    ▪ taxonomies, thesauri, ontologies
  ▪ e.g. ICD-9, SNOMED CT, MeSH, NCI Thesaurus
• Authority lists provide controlled names for proper nouns
  ▪ FundRef (organizations)
  ▪ Global Gazetteer (places)
  ▪ ORCID (people)
19. • Open Researcher and Contributor IDentifier (ORCID)
  ▪ a nonproprietary alphanumeric code that uniquely identifies scientific and other academic authors (a persistent "author DOI" for researchers)
• The ORCID identifier set is coming to serve as a de facto authority list to record persons contributing to scholarly research products
• ORCIDs facilitate efforts to track productivity, impact, and attribution based on all scholarly outputs (publications, grants, datasets, protocols, presentations, abstracts, code, blogs, etc.)
• Services can aggregate scholarly outputs for a given researcher
  ▪ resolves to a "CV" listing all scholarly contributions linked across various venues (e.g. PubMed, Scopus, SlideShare, Figshare, GitHub, Dryad, . . .)
ORCID AS AN AUTHORITY LIST
20. • An underlying model that describes how all the information and concepts inherent in a resource are related to one another
• Metadata Models
  ▪ conceptualize the metadata schema itself (hierarchical relationships or other mappings between elements)
• Domain Models
  ▪ conceptualize domain in which the metadata schema operates (classes of things that are annotated and the relationships between them)
C. CONCEPTUAL MODELS
21. EXAMPLE METADATA MODEL:
LOM ELEMENT HIERARCHY
The structure of the LOM is an example of a simple conceptual metadata model,
which organizes elements into disjoint hierarchies
22. • The summary level describes the dataset in general
• The version level describes a specific version
• The distribution level describes a representation of a version
EXAMPLE DOMAIN MODEL:
HCLS DATASET "LEVELS"
Supports recommendations for how each should be described using the standard
23. • D. Practical specifications for use
  ▪ provide guidance for how to apply metadata under a given schema
  ▪ e.g. HCLS model provides recommendations on when and how to apply certain elements to types of targets in the domain
• E. Encoding specifications for presentation & exchange
  ▪ rules for binding metadata to syntactic formats such as XML or RDF
  ▪ e.g. LOM has precise specifications for binding to XML or RDF
D/E. SPECIFICATIONS
24. STORING AND ACCESSING
RESOURCE METADATA
• Typically lives separately from annotated resources, in databases and/or XML files
• Can also be stored within a resource (e.g. photo metadata embedded in the image file itself)
• An increasing number of resource catalogs and repositories on the web provide access to metadata and often the resource itself
  ▪ you will have seen examples for books, images, and datasets
• These repositories are indexed by search tools and provide programmatic interfaces to allow for resource discovery and re-use
25. • Serves the same basic needs, but differs in scale and target of annotation, user base, and primary use cases
II. METADATA IN DATABASE SYSTEMS
• Two main categories:
1. Structural metadata
  ▪ describes the structure of database objects and the relationships between them
  ▪ commonly encoded externally as ER diagrams, or internally as summary tables
[Figure: Example ER diagram for VA autopsy data, http://www.visn20.med.va.gov/VISN20/V20/DataWarehouse/Images/LabAutopsy.jpg]
26. II. METADATA IN DATABASE SYSTEMS
• Two main categories (continued):
2. Content metadata
  ▪ describes the meaning of data at a very fine granularity
  ▪ specifies attributes of data elements, and rules for recording their values
  ▪ encoded internally or externally as "data dictionaries"
[Figure: Example of a data set that needs a dictionary to interpret]
27. • The notion of a "data element" obtains a more precise meaning and specification in the context of a database.
  ▪ elements can be specified at finer granularity in a database holding structured data in a controlled operational system
• Conceptually, a data element is composed of a concept and a value domain
  ▪ concept = the subject of the data recorded for a given element
  ▪ value domain = the defined value set for how that data is recorded
• Example: PT_ETHNIC
  ▪ concept = patient ethnicity
  ▪ value domain = [ E1 (caucasian), E2 (hispanic/latino), E3 (african), E4 (asian), E5 (mixed) ]
DATA ELEMENTS
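The concept/value-domain split maps naturally onto a small data structure. The sketch below encodes the PT_ETHNIC example from the slide; the `DataElement` class and its method names are hypothetical and not taken from any particular system.

```python
from dataclasses import dataclass


@dataclass
class DataElement:
    """A data element = a concept plus a value domain (hypothetical class)."""
    name: str
    concept: str
    value_domain: dict  # maps each permitted code to its label

    def is_valid(self, value):
        """True if `value` is one of the codes in the value domain."""
        return value in self.value_domain

    def label(self, value):
        """Human-readable label for a permitted code."""
        return self.value_domain[value]


PT_ETHNIC = DataElement(
    name="PT_ETHNIC",
    concept="patient ethnicity",
    value_domain={
        "E1": "caucasian",
        "E2": "hispanic/latino",
        "E3": "african",
        "E4": "asian",
        "E5": "mixed",
    },
)

print(PT_ETHNIC.is_valid("E2"))  # True
print(PT_ETHNIC.is_valid("E9"))  # False
print(PT_ETHNIC.label("E4"))     # asian
```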
28. • Provide detailed metadata about data elements
  ▪ element identifiers and name(s)
  ▪ definitions and descriptions
  ▪ value constraints
    ▪ data type
    ▪ default value
    ▪ length
    ▪ allowable values
    ▪ value frequency (mandatory or not)
  ▪ provenance and tracking
    ▪ version number, entry and termination dates
    ▪ indicate source table(s)
    ▪ mappings to elements in other schema dictionaries
DATA DICTIONARIES
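A data dictionary becomes directly useful when its value constraints are applied to incoming rows. The following is a minimal sketch under stated assumptions: the dictionary layout, the attribute names, the `PT_AGE` element, and the `validate` helper are all hypothetical, loosely following the attribute list on the slide.

```python
# Hypothetical data dictionary with two entries.
data_dictionary = {
    "PT_ETHNIC": {
        "definition": "Ethnicity of the patient",
        "data_type": str,
        "length": 2,
        "allowable_values": {"E1", "E2", "E3", "E4", "E5"},
        "mandatory": True,
        "default": None,
        "version": "1.0",
    },
    "PT_AGE": {  # hypothetical element, not from the slides
        "definition": "Age of the patient in years",
        "data_type": int,
        "length": None,
        "allowable_values": None,
        "mandatory": False,
        "default": None,
        "version": "1.0",
    },
}


def validate(row, dictionary):
    """Return a list of human-readable problems found in `row`."""
    problems = []
    for name, spec in dictionary.items():
        value = row.get(name, spec["default"])
        if value is None:
            if spec["mandatory"]:
                problems.append(f"{name}: missing mandatory value")
            continue
        if not isinstance(value, spec["data_type"]):
            problems.append(f"{name}: expected {spec['data_type'].__name__}")
            continue
        if spec["length"] is not None and len(value) > spec["length"]:
            problems.append(f"{name}: longer than {spec['length']}")
        allowed = spec["allowable_values"]
        if allowed is not None and value not in allowed:
            problems.append(f"{name}: value {value!r} not allowed")
    return problems


print(validate({"PT_ETHNIC": "E2", "PT_AGE": 54}, data_dictionary))  # []
print(validate({"PT_ETHNIC": "E9"}, data_dictionary))
```

Keeping the constraints in data rather than in code means the same table can double as documentation for analysts and as input to validation.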
30. • Key Functions
  ▪ unambiguous and shared understanding of the data by all users (administrators, analysts, and clients)
  ▪ consistent data representation and manipulation (addition, extraction, aggregation, and transformation)
  ▪ maintenance of the data model
  ▪ data integration, exchange, and re-use
• Encoding
  ▪ as an external document and/or represented as a table in the database itself
DATA DICTIONARIES
31. 1. Clear and thorough element definitions and value set explanations are key
2. Give persistent identifiers to data elements
3. Map data elements to community standards where possible
  ▪ common data elements (CDEs)
4. Specify value sets in terms of open controlled vocabularies (CVs) where possible
5. Provide notes and guidelines for context of use
6. Make dictionary easily accessible to all users
DATA DICTIONARY
BEST PRACTICES
32. • As research moves toward "big data", information from diverse sources is being shared and aggregated for analysis.
• A major challenge for managing this data is the diversity of ways that a given idea can be described in data elements
  ▪ Sex/gender definitions can be based on genetics, phenotype, or self-identification. Values can be recorded as local codes, abbreviations, full labels, or community vocabularies.
DATABASE METADATA
INTEROPERABILITY
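The sex/gender example shows why aggregation needs an explicit recoding step. As a rough sketch, two sites with different local encodings can be mapped onto one shared value set before counting; the site encodings, the shared codes, and the `harmonize` helper here are all hypothetical.

```python
from collections import Counter

# Hypothetical shared value set: "0" unknown, "1" female, "2" male.
SITE_A_TO_SHARED = {"F": "1", "M": "2", "U": "0"}
SITE_B_TO_SHARED = {"female": "1", "male": "2", "not recorded": "0"}


def harmonize(values, mapping):
    """Recode local values into the shared value set."""
    return [mapping[v] for v in values]


site_a = ["F", "M", "F"]              # site A uses abbreviations
site_b = ["male", "not recorded"]     # site B uses full labels

combined = harmonize(site_a, SITE_A_TO_SHARED) + harmonize(site_b, SITE_B_TO_SHARED)
counts = Counter(combined)
print(counts["1"], counts["2"], counts["0"])  # 2 2 1
```

Without the mapping tables, "F" and "female" would be counted as different values, which is exactly the interoperability problem described above.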
33. • The Common Data Element (CDE) movement aims to address this problem by providing standardized data elements that can be re-used across medical datasets
• CDEs are
  ▪ owned, managed, & curated by a single authority (e.g. NINDS, NCI)
  ▪ stored and managed in large repositories called CDE registries
  ▪ available for diverse areas of clinical practice and research, and at very fine granularity
  ▪ larger repositories hold up to 50,000 elements
• CDEs serve as a foundation for interoperability across data systems
COMMON DATA ELEMENTS (CDEs)
34. • Metadata registries that collect common data elements for a defined domain
• Resemble large scale data dictionaries, but with key differences:
  ▪ Exposed in searchable public repositories with additional services to promote extraction and re-use
  ▪ Coverage is wider as they are used across different domains and systems
  ▪ Metadata element descriptions are far richer to support discovery, provenance, versioning, mappings, meta-modeling
• The NIH maintains a portal to information about existing CDE initiatives, registries, and tools (http://www.nlm.nih.gov/cde/)
CDE REGISTRIES
35. • Houses >20,000 CDEs
  ▪ "Core" element set covers general concepts in medical domain
    ▪ patient demographics, medical history, assessment & examinations, treatments & interventions, outcomes, and study protocol
  ▪ "Supplementary" sets covering specific diseases/research areas
    ▪ spinal injury, brain injury, epilepsy and stroke, Parkinson's disease, ALS
• Metadata schema captures 30 element attributes
  ▪ this expanded set of attributes supports use cases of enabling discovery and community re-use across different implementations
• Portal has search functionality and support for generating case report forms (CRFs) with CDE mappings embedded in collected data
NINDS CDE REGISTRY
http://www.commondataelements.ninds.nih.gov/
36. • The National Cancer Institute cancer Data Standards Registry (caDSR) is the largest and most widely used CDE registry
  ▪ >50,000 total elements
• Integrates CDEs from several initiatives under a unified model and technical infrastructure
• Broad and deep coverage to fine granularity (as with NINDS)
• Metadata model is VERY complex
  ▪ captures >100 distinct attributes describing each data element in the registry (vs 30 for NINDS)
  ▪ implements a complex conceptual model based on the ISO/IEC 11179 metadata registry standard
  ▪ decomposes data elements into component parts that are mapped to NCI Thesaurus terms (formal encoding of semantics)
NCI caDSR CDE REGISTRY
https://cdebrowser.nci.nih.gov/CDEBrowser/
37. caDSR CONCEPTUAL MODEL
1. Needed to understand the data element table and to explain why it is so expansive
2. Follows a standard for database metadata registries called ISO/IEC 11179
  ▪ commonly implemented in other efforts you may encounter
  ▪ e.g. the Clinical Data Interchange Standards Consortium (CDISC), which has similar goals to the caDSR but across a broader domain
3. Is the basis for semantic mappings to ontologies such as the NCI Thesaurus, which are an important feature of the model
38. caDSR CONCEPTUAL MODEL
[Diagram: a Data Element consists of a Concept (Class, Property) and a Value Domain (Value Representation, Valid Values)]
41. VALUE DOMAIN MAPPING
Data Element: PT_GENDER_CODE
Value Domain = a set of attributes describing representational characteristics of instance data
  ▪ Value Representation = type of value the data represents (along different dimensions): "person", "gender", "code"
  ▪ Valid Values = the actual allowed values for a given value domain: "0", "1", "2", "9"
42. VALUE DOMAIN MAPPING
Data Element: PT_GENDER_CODE
Value Representation ("person", "gender", "code") Mappings
  "person" = C25190
  "gender" = C17357
  "code" = C25162
Valid Value ("0", "1", "2", "9") Mappings
  0 = unknown (C17998)
  1 = female gender (C46110)
  2 = male gender (C46109)
  9 = unspecified (n/a)
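The mappings on this slide can be written down as two lookup tables, which makes it easy to attach a concept code to each raw value. This is only an illustrative sketch: the table names and the `annotate` helper are hypothetical, while the codes are copied from the slide.

```python
# Concept codes copied from the slide; table and function names are hypothetical.
VALUE_REPRESENTATION = {
    "person": "C25190",
    "gender": "C17357",
    "code": "C25162",
}

VALID_VALUES = {
    "0": ("unknown", "C17998"),
    "1": ("female gender", "C46110"),
    "2": ("male gender", "C46109"),
    "9": ("unspecified", None),  # the slide lists no concept code for this value
}


def annotate(raw):
    """Attach the meaning and concept code to a raw PT_GENDER_CODE value."""
    meaning, concept = VALID_VALUES[raw]
    return {"value": raw, "meaning": meaning, "nci_concept": concept}


print(annotate("1"))
# {'value': '1', 'meaning': 'female gender', 'nci_concept': 'C46110'}
```

Two datasets that encode gender differently but map to the same concept codes can then be compared on the codes rather than on their local values.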
43. "SEMANTICALLY UNAMBIGUOUS INTEROPERABILITY"
[Diagram: Concept (Class, Property) and Value Domain (Value Representation, Valid Values)]
• Semantic mappings of these four elements can support more sophisticated search and analysis
• Computational tools can leverage logic in the NCI Thesaurus hierarchy for query expansion and data aggregation
44. • The structure of the NCI taxonomy supports synonym and hierarchical query expansion
LEVERAGING SEMANTICS
• User searches "cancer biology" to view all CDEs related to this concept.
• The query is expanded (1) to include any children of this term in the taxonomy, and (2) to include elements with text matching any synonym of cancer in the taxonomy
[Figure: NCI Thesaurus "Cancer Biology" branch]
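The two expansion steps can be sketched over a toy taxonomy. Everything below is hypothetical: the child terms and synonyms are invented for illustration and are not taken from the NCI Thesaurus, and `descendants` and `expand` are illustrative helpers rather than any registry's API.

```python
# Toy taxonomy; terms are invented and do not reflect the real NCI Thesaurus.
CHILDREN = {
    "cancer biology": ["tumor growth", "metastasis"],
    "tumor growth": ["tumor angiogenesis"],
}
SYNONYMS = {
    "cancer": ["neoplasm", "malignancy"],
}


def descendants(term):
    """All terms below `term` in the taxonomy, in breadth-first order."""
    found = []
    queue = list(CHILDREN.get(term, []))
    while queue:
        child = queue.pop(0)
        found.append(child)
        queue.extend(CHILDREN.get(child, []))
    return found


def expand(query):
    """Expand a query with (1) child terms and (2) synonym substitutions."""
    terms = [query] + descendants(query)
    for word, alternatives in SYNONYMS.items():
        if word in query.split():
            terms += [query.replace(word, alt) for alt in alternatives]
    return terms


print(expand("cancer biology"))
```

A search tool would then match CDE text against every term in the expanded list instead of only the literal query.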
45. • Strategies
  ▪ Map elements in local data dictionaries to CDEs
    ▪ Parkinson's Disease Biomarkers Program (PDBP) data dictionary
    ▪ NINDS registry form builder
  ▪ Build libraries of re-usable, pre-fabricated forms with embedded CDE metadata
    ▪ NINDS Case Report Form (CRF) library
    ▪ medical-data-models.org forms
  ▪ Initialize software with CDEs so that electronic forms automatically carry mappings when they are generated
    ▪ caDSR registry and CDISC tools
CDEs IN PRACTICE
46. • CDEs standardize data elements for use across multiple systems
• Available in registries that vary in size and complexity
  ▪ some resemble simple data dictionaries with expanded attributes to support discovery and provenance (NINDS)
  ▪ some are implemented with complex conceptual models and semantic mappings (caDSR)
• Tools and standards supporting practical application exist but are not yet mature
• Worlds collide: the intersection of metadata for web resources and database systems
  ▪ CDEs represent discoverable web resources that are used in the context of data collection and description in database systems
  ▪ Each registry defines a metadata framework/schema for a given domain
CDE SUMMARY
47. • Promote standardized and systematic data collection
• Improve data quality and consistency
• Facilitate data sharing and integration
• Reduce the cost and time needed to develop data collection tools
• Improve opportunities for meta-analysis comparing results across studies
• Increase the availability of data for the planning and design of new trials
BENEFITS OF CDEs
48. • Data elements across efforts are not well aligned
• Tooling support for discovery & application is immature
• Limited use of community taxonomy and ontology mappings
• Navigating complexity and redundancy
  ▪ . . . of medical data itself
    ▪ many ways to calculate and represent simple and complex measures such as tumor burden or medical prognosis
  ▪ . . . of metadata elements/schemas
    ▪ thousands of elements with very nuanced meaning and use
    ▪ redundant representation poses challenges for data collection, aggregation, and integrated analysis (even for simple measures)
CHALLENGES FOR DATA
INTEGRATION AND ANALYSIS
49. LINKS
Schema Examples:
DCMI: http://dublincore.org/documents/dcmi-terms/
IEEE-LOM: http://www.imsglobal.org/metadata/mdv1p3/imsmd_bestv1p3.html
HCLS Dataset description standard: http://www.w3.org/2001/sw/hcls/notes/hcls-dataset/
Data Dictionary Example:
http://library.ahima.org/xpedio/groups/public/documents/ahima/bok1_048618.pdf
CDE Sites:
NIH CDE Portal: http://www.nlm.nih.gov/cde/
NINDS CDE Registry: http://www.commondataelements.ninds.nih.gov/
caDSR browser: https://cdebrowser.nci.nih.gov/CDEBrowser/
caDSR tools: http://cbiit.nci.nih.gov/ncip/biomedical-informatics-resources/interoperability-and-semantics/metadata-and-models
CDEs in Practice:
PDBP gender data dictionary entry: https://dictionary.pdbp.ninds.nih.gov/portal/publicData/dataElementAction!view.action?dataElementId=5585
NINDS form builder: http://www.commondataelements.ninds.nih.gov/CRF.aspx?source=formBuilder
Downloadable forms (CRFs) from NINDS with embedded CDE links: http://www.commondataelements.ninds.nih.gov/CRF.aspx
medical-data-models.org forms: https://medical-data-models.org/forms/1049
Suite of tools on caDSR site: http://cbiit.nci.nih.gov/ncip/biomedical-informatics-resources/interoperability-and-semantics/metadata-and-models