METADATA PERSPECTIVES FROM THE WEB AND DATABASE SYSTEMS
Matthew Brush
Ontology Development Group, OHSU Library, DMICE
Sept 17, 2014
brushm@ohsu.edu
 “Data about Data”
 “Data” broadly covers any information resource
 digital or physical
 narrative, multimedia, structured
 raw data, processed data, aggregates of datasets, or
discrete elements within data sets
 More formally, “Metadata is structured information
that describes, explains, locates, or otherwise makes
it easier to retrieve, use, or manage an information
resource” (NISO (2004) Understanding Metadata. Bethesda, NISO Press)
METADATA
 Descriptive metadata: supports discovery and identification
 e.g. title, author, identifiers, subjects, keywords
 Structural metadata: describes how the components of a
resource are organized
 e.g. table of contents for a book, schema of database tables,
manifest of files in an aggregate ‘research object’
 Administrative metadata: helps manage the resource
 Technical - describes technical aspects of a resource
 e.g. file type, version information, how/when created
 Rights management - explains intellectual property rights
 e.g. licensing, use restrictions, privacy concerns
 Preservation - supports maintenance and archiving of a resource
 e.g. provenance/ownership, history of use, authenticity
METADATA SERVES MANY PURPOSES . . .
http://www.niso.org/publications/press/UnderstandingMetadata.pdf
Metadata comes in many forms, serves many needs,
and operates in very diverse settings
 I. Resource metadata (on the web)
 Target: information resources as a whole
 Primary goals: resource discovery and use
 Form: structured, separate records
 Users: everyone
 Standards: many metadata frameworks/vocabularies
 II. Metadata in database systems
 Target: structured data and data elements
 Primary goals: data consistency, aggregation, analysis
 Form: ER diagrams, summary tables, data dictionaries
 Users: professional data administrators and scientists
 Standards: metadata and CDE registries
. . . AND OPERATES IN MANY CONTEXTS
I. Resource Metadata (on the web)
A. Overview
B. Examples
C. Metadata Frameworks
i. Schema
ii. Vocabularies
iii. Conceptual Models
iv. Practical Specifications
v. Encoding Specifications
D. Metadata Storage and Retrieval
II. Metadata in Database Systems
A. Overview
B. Data Elements
C. Data Dictionaries
D. Common Data Elements (CDEs)
E. CDE Registries
OUTLINE
 Metadata in the world that all of us have used and
created in work and life
 Attached to information resources we find on the web
 books, videos, images, websites, datasets, . . .
 Helps us to find a resource and understand what it is
and how to use it
I. RESOURCE METADATA (ON THE WEB)
Example: Book Catalog Record (screenshot labeled Descriptive, Structural, Administrative)
http://ohsulibrary.worldcat.org/title/metadata/oclc/225088362
Example: Digital Photograph Library (screenshot labeled Descriptive, Structural, Administrative)
http://crdl.usg.edu/cgi/crdl?query=id:highlander_highlanderphotos_p2-wi3-3
Example: Research Data Sets and Files (datadryad.org), showing a Data Set Description and a Data File Description
http://datadryad.org/resource/doi:10.5061/dryad.4ms68
 Resource metadata is increasingly structured
according to established schemas and standards
 Many standards exist that vary in their:
 complexity (schemas, specifications, conceptual models)
 targets (music, video, images, books, art, datasets)
 goals (descriptive, administrative and preservation)
 communities served (libraries, museums, research)
 Benefits:
 leverage existing resources
 vetted by community
 interoperability and integration
STANDARDS ARE KEY
Normative standards for metadata are
captured in metadata frameworks.
There are five possible components of a
metadata framework:
A. Schema
B. Vocabularies
C. Conceptual Model
D. Practical Specifications
E. Encoding Specifications
METADATA FRAMEWORKS
 Core of any framework – specifies the categories of
information recorded
 Composed of a set of data elements along with
descriptions of their attributes and rules for use
 attributes described should minimally include an identifier
and/or name and a definition of each element
 Can also specify data types and ‘value domains’ that
describe allowable values for a given element
 e.g. term lists, CVs, ontologies
 Example schema: Dublin Core, LOM, HCLS Dataset Std.
A. METADATA SCHEMA
 First effort at standardizing metadata to improve
resource discovery on the web
 Very simple core schema consisting of 15 general
data elements representing properties of an
information resource, with no value restrictions.
 Data Elements: title, identifier, type, description,
creator, contributor, date, subject, format, language,
source, publisher, relation, coverage, rights
 Element Attributes: URI, label, definition, domain,
range, version, comment
EXAMPLE 1: DUBLIN CORE
METADATA INITIATIVE (DCMI)
http://dublincore.org/documents/dcmi-terms/
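As a rough illustration of how small the core schema is, the sketch below checks a record against the 15 element names listed above. The record contents and the `unknown_elements` helper are hypothetical. Because the core schema places no restrictions on values, only element names are checked.

```python
# The 15 core Dublin Core elements, as listed on the slide.
DC_ELEMENTS = {
    "title", "identifier", "type", "description", "creator",
    "contributor", "date", "subject", "format", "language",
    "source", "publisher", "relation", "coverage", "rights",
}


def unknown_elements(record: dict) -> set:
    """Return the keys of `record` that are not core Dublin Core elements."""
    return set(record) - DC_ELEMENTS


# Hypothetical record, for illustration only.
record = {
    "title": "Example dataset",
    "creator": "A. Researcher",
    "date": "2014-09-17",
    "keywords": "metadata",  # not a core element; "subject" would be
}

print(unknown_elements(record))  # {'keywords'}
```

A check like this only confirms that a record stays within the schema; it says nothing about whether the values themselves are sensible.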
 Extensive set of metadata elements describing
‘learning objects’
 “Any digital or non-digital entity that may be used for learning,
education, or training"
 Based loosely on DCMI schema, but:
 >50 new elements to describe educational attributes of learning
objects
 organizes elements into a hierarchical structure
 provides detailed specifications for allowable values
 supports ‘application profiles’ that extend model for
specific domains
EXAMPLE 2: LEARNING OBJECT
METADATA (IEEE-LOM)
http://www.imsglobal.org/metadata/mdv1p3/imsmd_bestv1p3.html
LOM SCHEMA ELEMENTS AND
ATTRIBUTES
 The LOM base schema defines 9 categories of metadata elements
 Hierarchical structure supports user understanding, metadata
organization and aggregation for analysis
LOM ELEMENT HIERARCHY
 A unified schema that provides all key metadata fields
needed to comprehensively describe research datasets
 what they are, how they are produced, where they are found
 meets pressing need in current research climate to support sharing,
discovery, and re-use of public datasets in a standardized way
 Metadata elements describe general features, identifiers,
provenance and change, availability and distribution, and
dataset statistics
 Composed entirely of elements (properties) from existing
community vocabularies, e.g. DCMI, DCAT, PROV, VOID,
FOAF
 attributes and rules for element use defined in source schema
EXAMPLE 3: W3C HCLS DATASET
DESCRIPTION STANDARD
http://www.w3.org/2001/sw/hcls/notes/hcls-dataset/
B. VOCABULARIES
 Set of terms (often structured) that is used to
constrain entry of metadata values
 Vocabularies represent general concepts
 Word or code lists
 Hierarchical classifications
 taxonomies, thesauri, ontologies
 e.g. ICD-9, SNOMED, MeSH, NCI Thesaurus
 Authority lists provide controlled names for proper nouns
 FundRef (organizations)
 Global Gazetteer (places)
 ORCID (people)
 Open Researcher and Contributor IDentifier (ORCID)
 a nonproprietary alphanumeric code that uniquely identifies scientific
and other academic authors (a persistent “author DOI” for researchers)
 The ORCID identifier set is coming to serve as a de facto
authority list to record persons contributing to scholarly research
products
 ORCIDs facilitate efforts to track productivity, impact, and
attribution based on all scholarly outputs (publications, grants,
datasets, protocols, presentations, abstracts, code, blogs, etc.)
 Services can aggregate scholarly outputs for a given researcher
 resolves to a “CV” listing all scholarly contributions linked across various
venues (e.g. PubMed, Scopus, SlideShare, Figshare, GitHub, Dryad, . . .)
ORCID AS AN AUTHORITY LIST
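One practical consequence of using a controlled identifier is that values can be checked before they are stored. The slides do not cover this, but ORCID iDs end in a check character computed with the ISO 7064 MOD 11-2 algorithm, so a minimal sketch of a format and checksum test might look like the following. The function names are hypothetical, and the iD used is a sample value commonly seen in ORCID documentation.

```python
def orcid_check_digit(base_digits: str) -> str:
    """Compute the ISO 7064 MOD 11-2 check character for 15 base digits."""
    total = 0
    for ch in base_digits:
        total = (total + int(ch)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def is_valid_orcid(orcid: str) -> bool:
    """Check the length, digits, and check character of an ORCID iD."""
    compact = orcid.replace("-", "")
    if len(compact) != 16 or not compact[:15].isdecimal():
        return False
    return orcid_check_digit(compact[:15]) == compact[15].upper()


print(is_valid_orcid("0000-0002-1825-0097"))  # True
print(is_valid_orcid("0000-0002-1825-0098"))  # False
```

A passing checksum only shows that the string is well formed; whether it identifies the intended person still has to be confirmed against the registry.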
 An underlying model that describes how all the
information and concepts inherent in a resource are
related to one another
 Metadata Models
 conceptualize the metadata schema itself (hierarchical
relationships or other mappings between elements )
 Domain Models
 conceptualize domain in which the metadata schema
operates (classes of things that are annotated and the
relationships between them)
C. CONCEPTUAL MODELS
EXAMPLE METADATA MODEL:
LOM ELEMENT HIERARCHY
The structure of the LOM is an example of a simple conceptual metadata model,
which organizes elements into disjoint hierarchies
 The summary level describes the dataset in general
 The version level describes a specific version
 The distribution level describes a representation of a version
EXAMPLE DOMAIN MODEL:
HCLS DATASET ‘LEVELS’
Supports recommendations for how each should be described using the standard
 D. Practical specifications for use
 provide guidance for how to apply metadata under a given schema
 e.g. HCLS model provides recommendations on when and how to apply
certain elements to types of targets in the domain
 E. Encoding specifications for presentation & exchange
 rules for binding metadata to syntactic formats such as XML or RDF
 e.g. LOM has a precise specification for binding to XML or RDF
D/E. SPECIFICATIONS
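To make the idea of an encoding specification concrete, here is a minimal sketch that binds a flat record to XML. It uses Dublin Core elements rather than LOM, since the details of the LOM binding are not given in these slides. The wrapper element name `metadata`, the `to_xml` helper, and the record values are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"  # Dublin Core element namespace
ET.register_namespace("dc", DC_NS)


def to_xml(record: dict) -> str:
    """Serialize a flat record as Dublin Core elements inside a wrapper."""
    root = ET.Element("metadata")  # hypothetical wrapper element name
    for name, value in record.items():
        child = ET.SubElement(root, f"{{{DC_NS}}}{name}")
        child.text = value
    return ET.tostring(root, encoding="unicode")


# Hypothetical record, for illustration only.
xml_text = to_xml({"title": "Example dataset", "creator": "A. Researcher"})
print(xml_text)
```

The same record could equally be bound to RDF; the schema stays the same and only the syntactic form changes.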
STORING AND ACCESSING
RESOURCE METADATA
 Typically lives separately from annotated resources,
in databases and/or XML files
 Can also be stored within a resource (e.g. photo
metadata embedded in image file itself)
 Increasing number of resource catalogs and
repositories on the web provide access to metadata
and often the resource itself
 we have seen examples for books, images, and datasets
 These repositories are indexed by search tools and
provide programmatic interfaces to allow for
resource discovery and re-use
 Serves same basic needs, but different scale and target of
annotation, user base, and primary use cases
II. METADATA IN DATABASE SYSTEMS
 Two main categories:
1. Structural metadata
 describes the structure of database objects and the relationships between them
 commonly encoded externally as ER diagrams, or internally as summary tables
Example ER diagram for VA autopsy data:
http://www.visn20.med.va.gov/VISN20/V20/DataWarehouse/Images/LabAutopsy.jpg
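Structural metadata held inside a database can usually be queried like any other data. The sketch below uses SQLite, whose `sqlite_master` catalog and `PRAGMA` statements expose tables, columns, and foreign keys. The two-table schema is hypothetical and is not the VA autopsy schema shown in the diagram.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical two-table schema, for illustration only.
conn.executescript("""
    CREATE TABLE patient (
        patient_id INTEGER PRIMARY KEY,
        pt_ethnic  TEXT NOT NULL
    );
    CREATE TABLE lab_result (
        result_id  INTEGER PRIMARY KEY,
        patient_id INTEGER NOT NULL REFERENCES patient(patient_id),
        value      REAL
    );
""")

# sqlite_master is SQLite's internal catalog of schema objects.
tables = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]

for table in tables:
    # PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk).
    columns = conn.execute(f"PRAGMA table_info({table})").fetchall()
    print(table, [(name, ctype) for _, name, ctype, *_ in columns])
    # PRAGMA foreign_key_list yields (id, seq, table, from, to, ...).
    for fk in conn.execute(f"PRAGMA foreign_key_list({table})"):
        print("  references", fk[2], "via", fk[3], "->", fk[4])
```

Other database systems expose similar information through their own catalog tables or an information schema.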
2. Content metadata
 describes the meaning of data at a very fine granularity
 specifies attributes of data elements, and rules for recording their values
 encoded internally or externally as ‘data dictionaries’
Example of a data set that needs a dictionary to interpret
 The notion of a ‘data element’ obtains a more precise meaning
and specification in the context of a database.
 elements can be specified at finer granularity in a database holding
structured data in a controlled operational system
 Conceptually, a data element consists of a concept and a
value domain
 concept = the subject of the data recorded for a given element
 value domain = the defined value set for how that data is recorded
 Example: PT_ETHNIC
 concept = patient ethnicity
 value domain = [ E1 (caucasian), E2 (hispanic/latino), E3 (african),
E4 (asian), E5 (mixed) ]
DATA ELEMENTS
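The concept plus value domain view of a data element can be sketched directly in code. The class and method names below are hypothetical; the PT_ETHNIC codes are the ones given in the example above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataElement:
    name: str           # element identifier, e.g. a column name
    concept: str        # what the recorded data is about
    value_domain: dict  # permitted code -> human-readable meaning

    def decode(self, code: str) -> str:
        """Translate a stored code into its meaning, rejecting unknown codes."""
        if code not in self.value_domain:
            raise ValueError(
                f"{code!r} is not in the value domain of {self.name}")
        return self.value_domain[code]


PT_ETHNIC = DataElement(
    name="PT_ETHNIC",
    concept="patient ethnicity",
    value_domain={
        "E1": "caucasian",
        "E2": "hispanic/latino",
        "E3": "african",
        "E4": "asian",
        "E5": "mixed",
    },
)

print(PT_ETHNIC.decode("E2"))  # hispanic/latino
```

Keeping the value domain next to the concept means a stored code such as E2 can always be interpreted, and a code outside the domain is rejected rather than silently accepted.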
 Provide detailed metadata about data elements
 element identifiers and name(s)
 definitions and descriptions
 value constraints
 data type
 default value
 length
 allowable values
 value frequency (mandatory or not)
 provenance and tracking
 version number, entry and termination dates
 indicate source table(s)
 mappings to elements in other schema dictionaries
DATA DICTIONARIES
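A data dictionary becomes more useful when it is machine readable, because records can then be checked against it. The following is a minimal sketch under several assumptions: the entries, the attribute key names, and the reading of `length` as a maximum length are all hypothetical choices made for illustration.

```python
# Hypothetical dictionary entries; attribute names follow the list above.
DATA_DICTIONARY = {
    "PT_ID": {
        "definition": "Local patient identifier",
        "data_type": int,
        "mandatory": True,
    },
    "PT_ETHNIC": {
        "definition": "Patient ethnicity code",
        "data_type": str,
        "length": 2,  # assumed here to mean maximum length
        "allowable_values": {"E1", "E2", "E3", "E4", "E5"},
        "mandatory": False,
    },
}


def validate(record: dict) -> list:
    """Return a list of human-readable problems found in `record`."""
    problems = []
    for name, spec in DATA_DICTIONARY.items():
        if name not in record:
            if spec["mandatory"]:
                problems.append(f"{name}: missing mandatory value")
            continue
        value = record[name]
        if not isinstance(value, spec["data_type"]):
            problems.append(f"{name}: expected {spec['data_type'].__name__}")
            continue
        if "length" in spec and len(value) > spec["length"]:
            problems.append(f"{name}: longer than {spec['length']}")
        if "allowable_values" in spec and value not in spec["allowable_values"]:
            problems.append(f"{name}: {value!r} is not an allowable value")
    for name in sorted(record.keys() - DATA_DICTIONARY.keys()):
        problems.append(f"{name}: not defined in the dictionary")
    return problems


print(validate({"PT_ID": 17, "PT_ETHNIC": "E2"}))  # []
print(validate({"PT_ETHNIC": "E9"}))
```

The second call reports two problems: the mandatory PT_ID is missing, and E9 is not an allowable value for PT_ETHNIC.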
DATA DICTIONARIES
Simple example of a data dictionary:
http://library.ahima.org/xpedio/groups/public/documents/ahima/bok1_048618.pdf
 Key Functions
 unambiguous and shared understanding of the data by all users
(administrators, analysts, and clients)
 consistent data representation and manipulation
(addition, extraction, aggregation, and transformation)
 maintenance of the data model
 data integration, exchange, and re-use
 Encoding
 as an external document and/or represented as a table in the
database itself
DATA DICTIONARIES
1. Clear and thorough element definitions and value
set explanations are key
2. Give persistent identifiers to data elements
3. Map data elements to community standards where
possible
 common data elements (CDEs)
4. Specify value sets in terms of open controlled
vocabularies (CVs) where possible
5. Provide notes and guidelines for context of use
6. Make dictionary easily accessible to all users
DATA DICTIONARY
BEST PRACTICES
 As research moves toward ‘big data’, information from
diverse sources is being shared and aggregated for
analysis.
 A major challenge for managing this data is the diversity of
ways that a given idea can be described in data elements
 Sex/gender definitions can be based on genetics, phenotype, or self-
identification. Values can be recorded as local codes, abbreviations,
full labels, or community vocabularies.
DATABASE METADATA
INTEROPERABILITY
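The aggregation problem described above can be illustrated with a small harmonization step. In this sketch, three hypothetical source datasets record the same idea with different local codings, and a per-source mapping translates each onto one shared value set. All site names, codes, and the `harmonize` helper are made up for illustration.

```python
# Hypothetical local codings for the same idea in three source datasets.
LOCAL_TO_COMMON = {
    "site_a": {"0": "unknown", "1": "female", "2": "male"},
    "site_b": {"F": "female", "M": "male", "U": "unknown"},
    "site_c": {"Female": "female", "Male": "male"},
}


def harmonize(source: str, value: str) -> str:
    """Map a locally coded value onto the shared value set."""
    mapping = LOCAL_TO_COMMON[source]
    # Unmapped values are reported as 'unknown' rather than guessed.
    return mapping.get(value, "unknown")


rows = [("site_a", "1"), ("site_b", "F"), ("site_c", "Female"), ("site_b", "X")]
print([harmonize(source, value) for source, value in rows])
# ['female', 'female', 'female', 'unknown']
```

Every new source needs its own mapping table, and a mapping cannot reconcile differences in definition, such as sex recorded by genetics in one dataset and by self-identification in another.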
 The Common Data Element (CDE) movement aims to
address this problem by providing standardized data
elements that can be re-used across medical datasets
 CDEs are
 owned, managed, & curated by single authority (NINDS, NCI)
 stored and managed in large repositories called CDE registries
 available for diverse areas of clinical practice and research,
and at very fine granularity
 larger repositories hold up to 50,000 elements
 CDEs serve as a foundation for interoperability across
data systems
COMMON DATA ELEMENTS (CDEs)
 Metadata registries that collect common data elements for a
defined domain
 Resemble large scale data dictionaries, but with key
differences:
 Exposed in searchable public repositories with additional
services to promote extraction and re-use
 Coverage is wider as they are used across different domains
and systems
 Metadata element descriptions are far richer to support
discovery, provenance, versioning, mappings, meta-modeling
 The NIH maintains a portal to information about existing CDE
initiatives, registries, and tools (http://www.nlm.nih.gov/cde/)
CDE REGISTRIES
 Houses >20,000 CDEs
 “Core” element set covers general concepts in medical domain
 patient demographics, medical history, assessment & examinations,
treatments & interventions, outcomes, and study protocol
 “Supplementary” sets covering specific diseases/research areas
 spinal injury, brain injury, epilepsy and stroke, Parkinson’s disease, ALS
 Metadata schema captures 30 element attributes
 this expanded set of attributes supports use cases of enabling
discovery and community re-use across different implementations
 Portal has search functionality and support for
generating clinical forms (CRFs) with CDE mappings
embedded in collected data
NINDS CDE REGISTRY
http://www.commondataelements.ninds.nih.gov/
 The National Cancer Institute cancer Data Standards Registry
(caDSR) is the largest and most widely used CDE registry
 >50,000 total elements
 Integrates CDEs from several initiatives under a unified model
and technical infrastructure
 Broad and deep coverage to fine granularity (as with NINDS)
 Metadata model is VERY complex
 captures >100 distinct attributes describing each data element in the
registry (vs 30 for NINDS)
 implements a complex conceptual model based on the
ISO/IEC 11179 metadata registry standard
 decomposes data elements into component parts that are
mapped to NCI thesaurus terms (formal encoding of semantics)
NCI caDSR CDE REGISTRY
https://cdebrowser.nci.nih.gov/CDEBrowser/
caDSR CONCEPTUAL MODEL
1. To understand the data element table and explain
why it is so expansive
2. Follows a standard for database metadata
registries called ISO/IEC 11179
 commonly implemented in other efforts you may
encounter
 e.g. the Clinical Data Interchange Standards Consortium
(CDISC), which has goals similar to those of the caDSR but across a
broader domain
3. Is the basis for semantic mappings to ontologies
such as the NCI Thesaurus, which are an important
feature of the model
Data Element
  Concept
    Class
    Property
  Value Domain
    Value Representation
    Valid Values
caDSR CONCEPTUAL MODEL
Data Element: PT_GENDER_CODE
  Concept: ‘patient gender’
    Class: ‘person’
    Property: ‘gender’
Concept = idea represented by the data element, described independently of a particular representation
Class = a set of real world objects with shared characteristics
Property = a characteristic common to all members of a class
CONCEPT ELEMENT MAPPING
Data Element: PT_GENDER_CODE
  Concept: ‘patient gender’
    Class: ‘person’ (C25190)
    Property: ‘gender’ (C17357)
Class and property concepts are mapped to NCI taxonomy terms to formally encode their semantics
Class Mapping
• person = C25190
Property Mapping
• gender = C17357
CONCEPT ELEMENT MAPPING
Data Element: PT_GENDER_CODE
  Value Domain
    Value Rep.: ‘person’, ‘gender’, ‘code’
    Valid Values: ‘0’, ‘1’, ‘2’, ‘9’
Value domain = a set of attributes describing representational characteristics of instance data
Value Representation = type of value the data represents (along different dimensions)
Valid Values = the actual allowed values for a given value domain
VALUE DOMAIN MAPPING
Data Element: PT_GENDER_CODE
  Value Domain
    Value Rep.: ‘person’, ‘gender’, ‘code’
    Valid Values: ‘0’, ‘1’, ‘2’, ‘9’
Value Representation Mappings
‘person’ = C25190
‘gender’ = C17357
‘code’ = C25162
Valid Value Mappings
0 = unknown (C17998)
1 = female gender (C46110)
2 = male gender (C46109)
9 = unspecified (n/a)
VALUE DOMAIN MAPPING
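The decomposition of PT_GENDER_CODE shown on these slides can be written down as a small nested data structure. The class names and the `meaning` helper below are hypothetical and do not reflect the actual caDSR implementation; the concept codes are the ones given on the slides.

```python
from dataclasses import dataclass


@dataclass
class Concept:
    label: str
    class_term: tuple     # (name, NCI Thesaurus code)
    property_term: tuple  # (name, NCI Thesaurus code)


@dataclass
class ValueDomain:
    representation: dict  # representation term -> NCI Thesaurus code
    valid_values: dict    # value -> (meaning, code or None if unmapped)


@dataclass
class RegistryDataElement:
    name: str
    concept: Concept
    value_domain: ValueDomain

    def meaning(self, value: str) -> tuple:
        """Look up the meaning and concept code recorded for a valid value."""
        return self.value_domain.valid_values[value]


PT_GENDER_CODE = RegistryDataElement(
    name="PT_GENDER_CODE",
    concept=Concept("patient gender", ("person", "C25190"), ("gender", "C17357")),
    value_domain=ValueDomain(
        representation={"person": "C25190", "gender": "C17357", "code": "C25162"},
        valid_values={
            "0": ("unknown", "C17998"),
            "1": ("female gender", "C46110"),
            "2": ("male gender", "C46109"),
            "9": ("unspecified", None),  # listed as n/a on the slide
        },
    ),
)

print(PT_GENDER_CODE.meaning("1"))  # ('female gender', 'C46110')
```

Because every part carries a concept code, two registries that name an element differently can still be matched on the codes rather than on labels.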
Concept (Class, Property) and Value Domain (Value Representation, Valid Values)
"SEMANTICALLY UNAMBIGUOUS INTEROPERABILITY"
 Semantic mappings of these four elements can
support more sophisticated search and analysis
 Computational tools can leverage logic in the NCI
hierarchy for query expansion and data aggregation
 The structure of the NCI taxonomy supports synonym
and hierarchical query expansion
LEVERAGING SEMANTICS
 User searches ‘cancer biology’ to view all CDEs related to this concept.
 The query is expanded (1) to include any children of this term in the taxonomy, and (2) to include elements with text matching any synonym of cancer in the taxonomy
Figure: NCI Thesaurus ‘Cancer Biology’ branch
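The two expansion steps can be sketched with a toy taxonomy. The term names, child relationships, synonyms, and CDE descriptions below are invented for illustration and are not taken from the NCI Thesaurus; a real implementation would match on concept codes rather than on text.

```python
# Toy taxonomy: terms, children, and synonyms are made up for illustration.
CHILDREN = {
    "cancer biology": ["tumor growth", "metastasis"],
    "tumor growth": ["tumor angiogenesis"],
}
SYNONYMS = {
    "cancer biology": ["tumor biology"],
    "metastasis": ["metastatic spread"],
}


def expand(term: str) -> set:
    """Return the term, all of its descendants, and the synonyms of each."""
    visited, expanded, stack = set(), set(), [term]
    while stack:
        current = stack.pop()
        if current in visited:
            continue
        visited.add(current)
        expanded.add(current)
        expanded.update(SYNONYMS.get(current, []))  # synonym expansion
        stack.extend(CHILDREN.get(current, []))     # hierarchical expansion
    return expanded


def search(cdes: dict, term: str) -> list:
    """Return names of CDEs whose description mentions any expanded term."""
    terms = expand(term)
    return sorted(name for name, text in cdes.items()
                  if any(t in text.lower() for t in terms))


# Hypothetical CDE descriptions.
cdes = {
    "CDE_1": "Site of metastatic spread",
    "CDE_2": "Tumor angiogenesis score",
    "CDE_3": "Patient date of birth",
}
print(search(cdes, "cancer biology"))  # ['CDE_1', 'CDE_2']
```

Neither matching description contains the words searched for; both are found only because the query was expanded through the hierarchy and the synonym lists.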
Strategies
Map elements in local data dictionaries to CDEs
 Parkinson’s Disease Biomarkers Program (PDBP) data dictionary
 NINDS registry form builder
Build libraries of re-usable, pre-fabricated forms with embedded CDE metadata
 NINDS Case Report Form (CRF) library
 medical-data-models.org forms
Initialize software with CDEs so that electronic forms automatically carry mappings when they are generated
 caDSR registry and CDISC tools
CDEs IN PRACTICE
 CDEs standardize data elements for use across multiple systems
 Available in registries that vary in size and complexity
 some resemble simple data dictionaries with expanded attributes to
support discovery and provenance (NINDS)
 some are implemented with complex conceptual models and semantic
mappings (caDSR)
 Tools and standards supporting practical application exist but are
not yet state of the art
 Worlds collide: the intersection of metadata for web resources and
database systems
 CDEs represent discoverable web resources that are used in the context of
data collection and description in database systems
 Each registry defines a metadata framework/schema for a given domain
CDE SUMMARY
 Promote standardized and systematic data collection
 Improve data quality and consistency
 Facilitate data sharing and integration
 Reduce the cost and time needed to develop data
collection tools
 Improve opportunities for meta-analysis comparing
results across studies
 Increase the availability of data for the planning and
design of new trials
BENEFITS OF CDEs
 Data elements across efforts are not well aligned
 Tooling support for discovery & application immature
 Limited use of community taxonomy and ontology
mappings
 Navigating complexity and redundancy
. . . of medical data itself
 many ways to calculate and represent simple and complex
measures such as tumor burden or medical prognosis
. . . of metadata elements/schemas
 thousands of elements with very nuanced meaning and use
 redundant representation poses challenges for data collection,
aggregation, and integrated analysis (even for simple measures)
CHALLENGES FOR DATA
INTEGRATION AND ANALYSIS
LINKS
Schema Examples:
DCMI: http://dublincore.org/documents/dcmi-terms/
IEEE-LOM: http://www.imsglobal.org/metadata/mdv1p3/imsmd_bestv1p3.html
HCLS Dataset description standard: http://www.w3.org/2001/sw/hcls/notes/hcls-dataset/
Data Dictionary Example:
http://library.ahima.org/xpedio/groups/public/documents/ahima/bok1_048618.pdf
CDE Sites:
NIH CDE Portal: http://www.nlm.nih.gov/cde/
NINDS CDE Registry: http://www.commondataelements.ninds.nih.gov/
caDSR browser: https://cdebrowser.nci.nih.gov/CDEBrowser/
caDSR tools: http://cbiit.nci.nih.gov/ncip/biomedical-informatics-resources/interoperability-and-semantics/metadata-and-models
CDEs in Practice:
PDBP gender data dictionary entry: https://dictionary.pdbp.ninds.nih.gov/portal/publicData/dataElementAction!view.action?dataElementId=5585
NINDS form builder: http://www.commondataelements.ninds.nih.gov/CRF.aspx?source=formBuilder
Downloadable forms (CRFs) from NINDS with embedded CDE links: http://www.commondataelements.ninds.nih.gov/CRF.aspx
medical-data-models.org forms: https://medical-data-models.org/forms/1049
Suite of tools on caDSR site: http://cbiit.nci.nih.gov/ncip/biomedical-informatics-resources/interoperability-and-semantics/metadata-and-models

Metadata lecture(9 17-14)

  • 1.
  • 2.
     “Data aboutData”  “Data” broadly covers any information resource  digital or physical  narrative, multimedia, structured  raw data, processed data, aggregates of datasets, or discrete elements within data sets  More formally, “Metadata is structured information that describes, explains, locates, or otherwise makes it easier to retrieve, use, or manage an information resource” METADATA (NISO (2004) Understanding Metadata. Bethesda, NISO Press )
  • 3.
     Descriptive metadata:supports discovery and identification  e.g. title, author, identifiers, subjects, keywords  Structural metadata: describes how the components of a resource are organized  e.g. table of contents for a book, schema of database tables, manifest of files in an aggregate ‘research object’  Administrative metadata: helps manage the resource  Technical - describes technical aspects of a resource  e.g. file type, version information, how/when created  Rights management - explains intellectual property rights  e.g. licensing, use restrictions, privacy concerns  Preservation - supports maintenance and archiving of a resource  e.g. provenance/ownership, history of use, authenticity METADATA SERVES MANY PURPOSES . . . http://www.niso.org/publications/press/UnderstandingMetadata.pdf
  • 4.
    Metadata comes inmany forms, serves many needs, and operates in very diverse settings  I. Resource metadata (on the web)  Target: information resources as a whole  1o Goals: resource discovery and use  Form: structured, separate records  Users: everyone  Standards: many metadata frameworks/vocabularies  II. Metadata in database systems  Target: structured data and data elements  1o Goals: data consistency, aggregation, analysis  Form: ER diagrams, summary tables, data dictionaries  Users: professional data administrators and scientists  Standards: metadata and CDE registries . . . AND OPERATES IN MANY CONTEXTS
  • 5.
    I. Resource Metadata(on the web) A. Overview B. Examples C. Metadata Frameworks i. Schema ii. Vocabularies iii. Conceptual Models iv. Practical Specifications v. Encoding Specifications D. Metadata Storage and Retrieval II. Metadata in Databases Systems A. Overview B. Data Elements C. Data Dictionaries D. Common Data Elements (CDEs) E. CDE Registries OUTLINE
  • 6.
     Metadata inthe world that all of us have used and created in work and life  Attached to information resources we find on the web  books, videos, images, websites, datasets, . . .  Helps us to find a resource and understand what it is and how to use it I. RESOURCE METADATA (ON THE WEB)
  • 7.
  • 8.
  • 9.
    Data Set Description http://datadryad.org/resource/doi:10.5061/dryad.4ms68 ResearchData Sets and Files (datadryad.org) Data File Description
  • 10.
     Resource metadatais increasingly structured according to established schemas and standards  Many standards exist that vary in their:  complexity (schemas, specifications, conceptual models)  targets (music, video, images, books, art, datasets)  goals (descriptive, administrative and preservation)  communities served (libraries, museums, research)  Benefits:  leverage existing resources  vetted by community  interoperability and integration STANDARDS ARE KEY
  • 11.
    Normative standards formetadata are captured in metadata frameworks. There are five possible components of a metadata framework: A. Schema B. Vocabularies C. Conceptual Model D. Practical Specifications E. Encoding Specifications METADATA FRAMEWORKS
  • 12.
     Core ofany framework – specifies the categories of information recorded  Comprised of a set of data elements along with descriptions of their attributes and rules for use  attributes described should minimally include an identifier and/or name and a definition of each element  Can also specify data types and ‘value domains’ that describe allowable values for a given element  e.g. term lists, CVs, ontologies  Example schema: Dublin Core, LOM, HCLS Dataset Std. A. METADATA SCHEMA
  • 13.
     First effortat standardizing metadata to improve resource discovery on the web  Very simple core schema consisting of 15 general data elements representing properties of a information resource, with no value restrictions.  Data Elements: title, identifier, type, description, creator, contributor, date, subject, format, language, source, publisher, relation, coverage, rights  Element Attributes: URI, label, definition, domain, range, version, comment EXAMPLE 1: DUBLIN CORE METADATA INITIATIVE (DCMI) http://dublincore.org/documents/dcmi-terms/
  • 14.
     Extensive setof metadata elements describing ‘learning objects’  “Any digital or non-digital entity that may be used for learning, education, or training"  Based loosely on DCMI schema, but:  >50 new elements to describe educational attributes of learning objects  organizes elements into a hierarchical structure  provides detailed specifications for allowable values  supports ‘application profiles’ that extend model for specific domains EXAMPLE 2: LEARNING OBJECT METADATA (IEEE-LOM) http://www.imsglobal.org/metadata/mdv1p3/imsmd_bestv1p3.html
  • 15.
    LOM SCHEMA ELEMENTSAND ATTRIBUTES
  • 16.
     The LOMbase schema defines 9 categories of metadata elements  Hierarchical structure supports user understanding, metadata organization and aggregation for analysis LOM ELEMENT HIERARCHY
  • 17.
     A unifiedschema that provides all key metadata fields needed to comprehensively describe research datasets  what they are, how they are produced, where they are found  meets pressing need in current research climate to support sharing, discovery, and re-use of public datasets in a standardized way  Metadata elements describe general features, identifiers, provenance and change, availability and distribution, and dataset statistics  Comprised entirely of elements (properties) from existing community vocabularies, e.g. DCMI, DCAT, PROV, VOID, FOAF  attributes and rules for element use defined in source schema EXAMPLE 3: W3C HCLS DATASET DESCRIPTION STANDARD http://www.w3.org/2001/sw/hcls/notes/hcls-dataset/
  • 18.
    B. VOCABULARIES  Setof terms (often structured) that is used to constrain entry of metadata values  Vocabularies represent general concepts  Word or code lists  Hierarchical classifications  taxonomies, thesauri, ontologies  e.g. ICD9, SNOMed, MeSH, NCIthesaurus  Authority lists provide controlled names for proper nouns  FundRef (organizations)  Global Gazeteer (places)  ORCID (people)
  • 19.
     Open Researcherand Contributor IDentifier (ORCID)  a nonproprietary alphanumeric code that uniquely identifies scientific and other academic authors (a persistent “author DOI” for researchers)  The ORCID identifier set is coming to serve as a de facto authority list to record persons contributing to scholarly research products  ORCIDs facilitates efforts to track productivity, impact, and attribution based on all scholarly outputs (publications, grants, datasets, protocols, presentations, abstracts, code, blogs, etc)  Services can aggregate scholarly outputs for a given researcher  resolves to a “CV” listing all scholarly contributions linked across various venues (e.g. Pubmed, Scopus, Slideshare, Figshare, Github, Dryad, . . .) ORCID AS AN AUTHORITY LIST
  • 20.
     An underlyingmodel that describes how all the information and concepts inherent in a resource are related to one another  Metadata Models  conceptualize the metadata schema itself (hierarchical relationships or other mappings between elements )  Domain Models  conceptualize domain in which the metadata schema operates (classes of things that are annotated and the relationships between them) C. CONCEPTUAL MODELS
  • 21.
    EXAMPLE METADATA MODEL: LOMELEMENT HIERARCHY The structure of the LOM is an example of a simple conceptual metadata model, which organizes elements into disjoint hierarchies
  • 22.
     The summarylevel describes the dataset in general  The version level describes a specific version  The distribution level describes a representation of a version EXAMPLE DOMAIN MODEL: HCLS DATASET ‘LEVELS’ Supports recommendations for how each should be described using the standard
  • 23.
     D. Practicalspecifications for use  provide guidance for how to apply metadata under a given schema  e.g. HCLS model provides recommendations when and how to apply certain elements to types of targets in the domain  E. Encoding specifications for presentation & exchange  rules for binding metadata to syntactic formats such as XML or RDF  e.g. LOM has precise specification for binding to XML or RDF D/E. SPECIFICATIONS
  • 24.
    STORING AND ACCESSING RESOURCEMETADATA  Typically lives separately from annotated resources, in databases and/or XML files  Can also be stored within a resource (e.g. photo metadata embedded in image file itself)  Increasing number of resource catalogs and repositories on the web provide access to metadata and often the resource itself  will have seen examples for books, images, and datasets  These repositories are indexed by search tools and provide programmatic interfaces to allow for resource discovery and re-use
  • 25.
     Serves samebasic needs, but different scale and target of annotation, user base, and primary use cases II. METADATA IN DATABASE SYSTEMS  Two main categories: 1. Structural metadata  describes the structure of database objects and the relationships between them  commonly encoded externally as ER- diagrams, or internally as summary tables http://www.visn20.med.va.gov/VISN20/V20/DataWarehouse/Images/LabAutopsy.jpg Example ER diagram for VA autopsy data
  • 26.
     Serves samebasic needs, but different scale and target of annotation, user base, and primary use cases II. METADATA IN DATABASE SYSTEMS  Two main categories: 2. Content metadata  describes meaning of data at a very fine granularity  specifies attributes of data elements , and rules for recoding their values  encoded internally or externally as ‘data dictionaries’ Example of a data set that needs a dictionary to interpret
  • 27.
     The notionof a ‘data element’ obtains a more precise meaning and specification in the context of a database.  elements can be specified at finer granularity in a databases holding structured data in a controlled operational system  Conceptually, a data element is comprised of a concept and a value domain  concept = the subject of the data recorded for a given element  value domain = the defined value set for how that data is recorded  Example: PT_ETHNIC  concept = patient ethnicity  value domain = [ E1 (caucasian), E2 (hispanic/latino), E3 (african), E4 (asian), E5 (mixed) ] DATA ELEMENTS
  • 28.
    DATA DICTIONARIES
     Provide detailed metadata about data elements
       element identifiers and name(s)
       definitions and descriptions
       value constraints
         data type
         default value
         length
         allowable values
         value frequency (mandatory or not)
       provenance and tracking
         version number, entry and termination dates
         source table(s)
       mappings to elements in other schema dictionaries
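As a rough sketch of how such a dictionary can be put to work, the Python below stores a few of the attributes listed above and checks a record against them. The entry layout, the `PT_AGE` element, and the `validate_record` function are hypothetical, chosen only for illustration; real dictionaries carry many more attributes.

```python
# A tiny data dictionary held as plain Python data (illustrative layout).
DICTIONARY = {
    "PT_ETHNIC": {
        "definition": "Ethnicity recorded for the patient",
        "data_type": str,
        "max_length": 2,
        "allowable_values": {"E1", "E2", "E3", "E4", "E5"},
        "mandatory": True,
        "version": "1.0",
    },
    "PT_AGE": {  # hypothetical element, not from the slides
        "definition": "Patient age in years",
        "data_type": int,
        "max_length": 3,
        "allowable_values": None,  # no enumerated value set
        "mandatory": False,
        "version": "1.0",
    },
}


def validate_record(record, dictionary):
    """Return a list of problems found when checking a record against the dictionary."""
    errors = []
    for name, spec in dictionary.items():
        value = record.get(name)
        if value is None:
            if spec["mandatory"]:
                errors.append(f"{name}: missing mandatory value")
            continue
        if not isinstance(value, spec["data_type"]):
            errors.append(f"{name}: expected {spec['data_type'].__name__}")
            continue
        if len(str(value)) > spec["max_length"]:
            errors.append(f"{name}: longer than {spec['max_length']} characters")
        allowed = spec["allowable_values"]
        if allowed is not None and value not in allowed:
            errors.append(f"{name}: value {value!r} not in allowable values")
    return errors
```

A record such as `{"PT_ETHNIC": "E2", "PT_AGE": 54}` passes cleanly, while a record with a missing mandatory element or a wrongly typed value comes back with one message per problem.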
  • 29.
  • 30.
    DATA DICTIONARIES
     Key Functions
       unambiguous and shared understanding of the data by all users (administrators, analysts, and clients)
       consistent data representation and manipulation (addition, extraction, aggregation, and transformation)
       maintenance of the data model
       data integration, exchange, and re-use
     Encoding
       as an external document and/or represented as a table in the database itself
  • 31.
    DATA DICTIONARY BEST PRACTICES
    1. Clear and thorough element definitions and value set explanations are key
    2. Give persistent identifiers to data elements
    3. Map data elements to community standards where possible
       common data elements (CDEs)
    4. Specify value sets in terms of open controlled vocabularies (CVs) where possible
    5. Provide notes and guidelines for context of use
    6. Make the dictionary easily accessible to all users
  • 32.
    DATABASE METADATA INTEROPERABILITY
     As research moves toward ‘big data’, information from diverse sources is being shared and aggregated for analysis.
     A major challenge for managing this data is the diversity of ways that a given idea can be described in data elements
       Sex/gender definitions can be based on genetics, phenotype, or self-identification. Values can be recorded as local codes, abbreviations, full labels, or community vocabularies.
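To make the problem concrete, here is a minimal sketch of harmonizing one idea that three sources record in three ways. The site names, local codes, and shared labels are all hypothetical, invented for this illustration.

```python
# Each source records the same idea with its own local encoding (hypothetical values).
LOCAL_TO_SHARED = {
    "site_a": {"1": "female", "2": "male", "0": "unknown"},   # local numeric codes
    "site_b": {"F": "female", "M": "male", "U": "unknown"},   # abbreviations
    "site_c": {"Female": "female", "Male": "male"},            # full labels
}


def harmonize(site, value):
    """Translate a site-specific value to the shared label, or 'unmapped' if unknown."""
    return LOCAL_TO_SHARED[site].get(value, "unmapped")


# Values from different sites can only be pooled after translation.
pooled = [harmonize("site_a", "1"), harmonize("site_b", "F"), harmonize("site_c", "Female")]
```

Every pair of sources needs a table like this unless they agree on a shared definition up front, which is why unaligned data elements make aggregation costly.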
  • 33.
    COMMON DATA ELEMENTS (CDEs)
     The Common Data Element (CDE) movement aims to address this problem by providing standardized data elements that can be re-used across medical datasets
     CDEs are
       owned, managed, & curated by a single authority (e.g. NINDS, NCI)
       stored and managed in large repositories called CDE registries
       available for diverse areas of clinical practice and research, and at very fine granularity
       larger repositories hold up to 50,000 elements
     CDEs serve as a foundation for interoperability across data systems
  • 34.
    CDE REGISTRIES
     Metadata registries that collect common data elements for a defined domain
     Resemble large-scale data dictionaries, but with key differences:
       Exposed in searchable public repositories with additional services to promote extraction and re-use
       Coverage is wider as they are used across different domains and systems
       Metadata element descriptions are far richer to support discovery, provenance, versioning, mappings, meta-modeling
     The NIH maintains a portal to information about existing CDE initiatives, registries, and tools (http://www.nlm.nih.gov/cde/)
  • 35.
    NINDS CDE REGISTRY
    http://www.commondataelements.ninds.nih.gov/
     Houses >20,000 CDEs
     “Core” element set covers general concepts in the medical domain
       patient demographics, medical history, assessment & examinations, treatments & interventions, outcomes, and study protocol
     “Supplementary” sets cover specific diseases/research areas
       spinal injury, brain injury, epilepsy and stroke, Parkinson’s disease, ALS
     Metadata schema captures 30 element attributes
       this expanded set of attributes supports discovery and community re-use across different implementations
     Portal has search functionality and support for generating clinical forms (CRFs) with CDE mappings embedded in collected data
  • 36.
    NCI caDSR CDE REGISTRY
    https://cdebrowser.nci.nih.gov/CDEBrowser/
     The National Cancer Institute cancer Data Standards Registry (caDSR) is the largest and most widely used CDE registry
       >50,000 total elements
     Integrates CDEs from several initiatives under a unified model and technical infrastructure
     Broad and deep coverage to fine granularity (as with NINDS)
     Metadata model is VERY complex
       captures >100 distinct attributes describing each data element in the registry (vs 30 for NINDS)
       implements a complex conceptual model based on the ISO/IEC 11179 metadata registry standard
       decomposes data elements into component parts that are mapped to NCI thesaurus terms (formal encoding of semantics)
  • 37.
    caDSR CONCEPTUAL MODEL
    1. To understand the data element table and explain why it is so expansive
    2. Follows a standard for database metadata registries called ISO/IEC 11179
       commonly implemented in other efforts you may encounter
       e.g. the Clinical Data Interchange Standards Consortium (CDISC), which has goals similar to those of the caDSR but across a broader domain
    3. Is the basis for semantic mappings to ontologies such as the NCI thesaurus, which are an important feature of the model
  • 38.
    caDSR CONCEPTUAL MODEL
    [Diagram: a Data Element is composed of a Concept (Class, Property) and a Value Domain (Value Representation, Valid Values)]
  • 39.
    CONCEPT ELEMENT MAPPING
    Data Element: PT_GENDER_CODE
      Concept: ‘patient gender’
        Class: ‘person’
        Property: ‘gender’
    Concept = idea represented by the data element, described independently of a particular representation
    Class = a set of real world objects with shared characteristics
    Property = a characteristic common to all members of a class
  • 40.
    CONCEPT ELEMENT MAPPING
    Data Element: PT_GENDER_CODE
      Concept: ‘patient gender’
        Class: ‘person’ (C25190)
        Property: ‘gender’ (C17357)
    Class and property concepts are mapped to NCI taxonomy terms to formally encode their semantics
    Class Mapping
      • person = C25190
    Property Mapping
      • gender = C17357
  • 41.
    VALUE DOMAIN MAPPING
    Data Element: PT_GENDER_CODE
      Value Domain
        Value Representation: ‘person’, ‘gender’, ‘code’
        Valid Values: ‘0’, ‘1’, ‘2’, ‘9’
    Value domain = a set of attributes describing representational characteristics of instance data
    Value Representation = type of value the data represents (along different dimensions)
    Valid Values = the actual allowed values for a given value domain
  • 42.
    VALUE DOMAIN MAPPING
    Data Element: PT_GENDER_CODE
      Value Domain
        Value Representation: ‘person’, ‘gender’, ‘code’
        Valid Values: ‘0’, ‘1’, ‘2’, ‘9’
    Value Representation Mappings
      ‘person’ = C25190
      ‘gender’ = C17357
      ‘code’ = C25162
    Valid Value Mappings
      0 = unknown (C17998)
      1 = female gender (C46110)
      2 = male gender (C46109)
      9 = unspecified (n/a)
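The decomposition of PT_GENDER_CODE can be written down as nested data, which makes the mappings easy to inspect. The sketch below is an assumption about layout only: the dictionary structure and the `to_concept` helper are invented for illustration and are not the caDSR schema or API. The labels and thesaurus codes are the ones shown on the slides.

```python
# PT_GENDER_CODE decomposed into concept and value domain (illustrative layout).
PT_GENDER_CODE = {
    "name": "PT_GENDER_CODE",
    "concept": {
        "label": "patient gender",
        "class": ("person", "C25190"),
        "property": ("gender", "C17357"),
    },
    "value_domain": {
        "representation": {"person": "C25190", "gender": "C17357", "code": "C25162"},
        "valid_values": {
            "0": ("unknown", "C17998"),
            "1": ("female gender", "C46110"),
            "2": ("male gender", "C46109"),
            "9": ("unspecified", None),  # listed as n/a: no thesaurus mapping
        },
    },
}


def to_concept(element, raw_value):
    """Return (label, thesaurus_code) for a stored value; reject values outside the domain."""
    valid = element["value_domain"]["valid_values"]
    if raw_value not in valid:
        raise ValueError(f"{raw_value!r} is not a valid value for {element['name']}")
    return valid[raw_value]
```

Because each valid value resolves to a shared term rather than a local code, two datasets that store gender differently can be compared on the thesaurus code alone.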
  • 43.
    "SEMANTICALLY UNAMBIGUOUS INTEROPERABILITY"
    [Diagram: Concept (Class, Property) and Value Domain (Value Representation, Valid Values)]
     Semantic mappings of these four elements (Class, Property, Value Representation, Valid Values) can support more sophisticated search and analysis
     Computational tools can leverage logic in the NCI hierarchy for query expansion and data aggregation
  • 44.
    LEVERAGING SEMANTICS
     The structure of the NCI taxonomy supports synonym and hierarchical query expansion
     User searches ‘cancer biology’ to view all CDEs related to this concept.
     The query is expanded (1) to include any children of this term in the taxonomy, and (2) to include elements with text matching any synonym of the term in the taxonomy
    [Figure: NCI Thesaurus ‘Cancer Biology’ branch]
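The two expansion steps can be sketched over a toy taxonomy. Everything in this example is hypothetical: the child terms, synonyms, and CDE descriptions are invented for illustration and are not taken from the NCI Thesaurus or from any registry's search implementation.

```python
# Toy taxonomy: term -> child terms, and term -> synonyms (all hypothetical).
CHILDREN = {
    "cancer biology": ["tumor growth", "metastasis"],
    "metastasis": ["micrometastasis"],
}
SYNONYMS = {
    "cancer biology": ["tumor biology"],
    "metastasis": ["metastatic spread"],
}

# Toy registry: CDE identifier -> descriptive text (hypothetical).
CDES = {
    "CDE_1": "Tumor growth rate",
    "CDE_2": "Metastatic spread site",
    "CDE_3": "Patient gender code",
}


def expand_query(term):
    """Return the term plus all descendants and their synonyms."""
    expanded = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current in expanded:
            continue
        expanded.add(current)
        expanded.update(SYNONYMS.get(current, []))   # synonym expansion
        stack.extend(CHILDREN.get(current, []))      # hierarchical expansion
    return expanded


def search_cdes(term):
    """Return identifiers of CDEs whose text matches any expanded term."""
    terms = expand_query(term)
    return sorted(
        cde_id for cde_id, text in CDES.items()
        if any(t in text.lower() for t in terms)
    )
```

A plain text search for ‘cancer biology’ would match none of the three toy CDEs; the expanded search finds the two that are described with a child term or a synonym.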
  • 45.
    Strategies Map elements inlocal data dictionaries to CDEs  Parkinson’s Disease Biomarkers Program (PDBP) data dictionary  NINDS registry form builder Build libraries of re-usable, pre-fabricated forms with embedded CDE metadata  NINDS Case Report Form (CRF) library  medical-data-models.org forms Initialize software with CDEs so that electronic forms automatically carry mappings when they are generated  caDSR registry and CDISC tools CDE IN PRACTICEs
  • 46.
    CDE SUMMARY
     CDEs standardize data elements for use across multiple systems
     Available in registries that vary in size and complexity
       some resemble simple data dictionaries with expanded attributes to support discovery and provenance (NINDS)
       some are implemented with complex conceptual models and semantic mappings (caDSR)
     Tools and standards supporting practical application exist but are not yet state of the art
     Worlds collide: the intersection of metadata for web resources and database systems
       CDEs represent discoverable web resources that are used in the context of data collection and description in database systems
       Each registry defines a metadata framework/schema for a given domain
  • 47.
    BENEFITS OF CDEs
     Promote standardized and systematic data collection
     Improve data quality and consistency
     Facilitate data sharing and integration
     Reduce the cost and time needed to develop data collection tools
     Improve opportunities for meta-analysis comparing results across studies
     Increase the availability of data for the planning and design of new trials
  • 48.
    CHALLENGES FOR DATA INTEGRATION AND ANALYSIS
     Data elements across efforts are not well aligned
     Tooling support for discovery & application is immature
     Limited use of community taxonomy and ontology mappings
     Navigating complexity and redundancy . . .
      . . . of medical data itself
         many ways to calculate and represent simple and complex measures such as tumor burden or medical prognosis
      . . . of metadata elements/schemas
         thousands of elements with very nuanced meaning and use
         redundant representation poses challenges for data collection, aggregation, and integrated analysis (even for simple measures)
  • 49.
    LINKS
    Schema Examples:
      DCMI: http://dublincore.org/documents/dcmi-terms/
      IEEE-LOM: http://www.imsglobal.org/metadata/mdv1p3/imsmd_bestv1p3.html
      HCLS Dataset description standard: http://www.w3.org/2001/sw/hcls/notes/hcls-dataset/
    Data Dictionary Example:
      http://library.ahima.org/xpedio/groups/public/documents/ahima/bok1_048618.pdf
    CDE Sites:
      NIH CDE Portal: http://www.nlm.nih.gov/cde/
      NINDS CDE Registry: http://www.commondataelements.ninds.nih.gov/
      caDSR browser: https://cdebrowser.nci.nih.gov/CDEBrowser/
      caDSR tools: http://cbiit.nci.nih.gov/ncip/biomedical-informatics-resources/interoperability-and-semantics/metadata-and-models
    CDEs in Practice:
      PDBP gender data dictionary entry: https://dictionary.pdbp.ninds.nih.gov/portal/publicData/dataElementAction!view.action?dataElementId=5585
      NINDS form builder: http://www.commondataelements.ninds.nih.gov/CRF.aspx?source=formBuilder
      Downloadable forms (CRFs) from NINDS with embedded CDE links: http://www.commondataelements.ninds.nih.gov/CRF.aspx
      medical-data-models.org forms: https://medical-data-models.org/forms/1049
      Suite of tools on caDSR site: http://cbiit.nci.nih.gov/ncip/biomedical-informatics-resources/interoperability-and-semantics/metadata-and-models