OpenMinTeD hosted a series of webinars on interoperability. These slides are from the webinar on interoperability at the level of metadata. The full webinar recording is accessible at: https://www.fosteropenscience.eu/content/achieving-interoperability-between-resources-involved-tdm-level-metadata
WP3 Further specification of Functionality and Interoperability - Gradmann (Europeana)
The document discusses issues and recommendations for Work Group 3.2 on semantic and multilingual aspects of the Europeana digital library. Key points include:
- Europeana surrogates need rich semantic context in areas like place, time, people and concepts.
- The types of links between surrogates and semantic nodes, as well as the semantic technologies used, need to be determined.
- Support for multiple European languages in areas like search queries, results and functionality is important but requires further scope definition and identification of language resources.
Tutorial at OAI5 (cern.ch/oai5). Abstract: This tutorial will provide a practical overview of current practices in modelling complex or compound digital objects. It will examine some of the key scenarios around creating complex objects and will explore a number of approaches to packaging and transport. Taking research papers, or scholarly works, as an example, the tutorial will explore the different ways in which these, and their descriptive metadata, can be treated as complex objects. Relevant application profiles and metadata formats will be introduced and compared, such as Dublin Core, in particular the DCMI Abstract Model, and MODS, alongside content packaging standards, such as METS, MPEG-21 DIDL and IMS CP. Finally, we will consider some future issues and activities that are seeking to address these. The tutorial will be of interest to librarians and technical staff with an interest in metadata or complex objects, their creation, management and re-use.
This document discusses annotation services provided by Brown University Library for annotating digital texts. It describes several digital humanities projects at Brown that involve annotation. It then explains how the library uses AtomPub and RDF to publish annotations on the web as Linked Open Data with metadata and links back to the annotated sources. Users can annotate portions of documents and their annotations will be ingested into the repository and syndicated as Atom feeds that others can subscribe to.
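To make the RDF side of this concrete, here is a minimal sketch, assuming the rdflib Python library and the W3C Open Annotation (oa:) vocabulary, of how an annotation with a textual body and a link back to the annotated source could be expressed. All URIs are hypothetical placeholders, not Brown's actual identifiers, and this is not Brown's pipeline.

```python
# Minimal sketch: one annotation expressed as RDF with the oa: vocabulary.
# All URIs below are hypothetical placeholders.
from rdflib import Graph, Namespace, URIRef, Literal
from rdflib.namespace import RDF, DCTERMS

OA = Namespace("http://www.w3.org/ns/oa#")

g = Graph()
g.bind("oa", OA)
g.bind("dcterms", DCTERMS)

anno = URIRef("http://example.org/annotations/1")            # hypothetical
body = URIRef("http://example.org/annotations/1/body")       # hypothetical
target = URIRef("http://example.org/texts/pico/section-42")  # hypothetical

g.add((anno, RDF.type, OA.Annotation))
g.add((anno, OA.hasBody, body))
g.add((anno, OA.hasTarget, target))          # link back to the annotated source
g.add((anno, DCTERMS.creator, Literal("A. Scholar")))
g.add((body, RDF.type, OA.TextualBody))
g.add((body, RDF.value, Literal("This passage echoes an earlier thesis.")))

print(g.serialize(format="turtle"))
```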
The document discusses three case studies related to making organizational taxonomies and resources more interoperable:
1) Integrating metadata across three Victorian government departments by aggregating, rationalizing, and harmonizing their schemas.
2) Repatriating cultural resources from the Quinkan people by using Dublin Core metadata with local extensions to provide a single access point for internal and external users.
3) The AccessForAll project, which uses metadata to match educational resources to individual learner needs and preferences to ensure equal accessibility. Standards are discussed as a way to balance local specificity and global interoperability.
JeromeDL is a social semantic digital library that allows users to easily publish and access resources online through metadata tagging and community sharing features. It integrates information from different metadata sources, provides interoperability between systems, and delivers more robust search interfaces powered by semantics. Resources are accessible by machines through rich metadata and the system involves the community in sharing knowledge through social features like comments, bookmarks, and user profiles.
Text mining seeks to extract useful information from unstructured text documents. It involves preprocessing the text, identifying features, and applying techniques from data mining, machine learning and natural language processing to discover patterns. The core operations of text mining include analyzing distributions of concepts, identifying frequent concept sets and associations between concepts. Text mining systems aim to analyze document collections over time to identify trends, ephemeral relationships and anomalous patterns.
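As a toy illustration of the core operations mentioned above (distributions of concepts and co-occurring concept sets), the sketch below counts term frequencies and pairwise co-occurrences over a few invented documents; a real system would use proper tokenization and concept extraction rather than plain word splitting.

```python
# Toy bookkeeping for two core text-mining operations: term distributions
# and pairwise co-occurrence counts. Documents are invented examples.
from collections import Counter
from itertools import combinations

docs = [
    "metadata improves discovery of text resources",
    "text mining extracts patterns from text resources",
    "metadata and text mining support interoperability",
]

term_freq = Counter()
cooccurrence = Counter()
for doc in docs:
    terms = set(doc.lower().split())          # crude stand-in for concept extraction
    term_freq.update(terms)
    cooccurrence.update(combinations(sorted(terms), 2))

print(term_freq.most_common(3))
print(cooccurrence.most_common(3))
```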
Information extraction involves extracting structured information from unstructured text. The goal is to identify named entities, relations between entities, and populate a database. This may also include event extraction, resolving temporal expressions, and wrapper induction. Common tasks include named entity recognition, co-reference resolution, relation extraction, and event extraction. Statistical methods like conditional random fields are often used. Evaluation involves measuring precision and recall.
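The evaluation step mentioned above can be sketched very simply: compare the set of extracted entity mentions against a gold standard and compute precision, recall and F1. The example spans below are invented for illustration.

```python
# Minimal precision/recall/F1 over sets of (mention, type) pairs.
def precision_recall_f1(predicted, gold):
    predicted, gold = set(predicted), set(gold)
    tp = len(predicted & gold)                     # true positives
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

gold = {("Marie Curie", "PERSON"), ("Warsaw", "LOCATION"), ("1867", "DATE")}
predicted = {("Marie Curie", "PERSON"), ("Warsaw", "PERSON")}
print(precision_recall_f1(predicted, gold))        # (0.5, 0.333..., 0.4)
```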
DBpedia Spotlight is a system that automatically annotates text documents with DBpedia URIs. It identifies mentions of entities in text and links them to the appropriate DBpedia resources, addressing the challenge of ambiguity. The system is highly configurable, allowing users to specify which types of entities to annotate and the desired balance of coverage and accuracy. An evaluation found DBpedia Spotlight performed competitively compared to other annotation systems.
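As an illustration of how such an annotation service is typically called, here is a hedged sketch of an HTTP request to a DBpedia Spotlight endpoint using Python's requests library. The endpoint URL, parameter names and response fields follow the commonly documented public API and may differ for a local deployment.

```python
# Hedged sketch: annotate a text via a DBpedia Spotlight HTTP endpoint.
import requests

def spotlight_annotate(text, confidence=0.5,
                       endpoint="https://api.dbpedia-spotlight.org/en/annotate"):
    """Return (surface form, DBpedia URI) pairs found in `text`."""
    resp = requests.get(
        endpoint,
        params={"text": text, "confidence": confidence},
        headers={"Accept": "application/json"},
        timeout=30,
    )
    resp.raise_for_status()
    resources = resp.json().get("Resources", [])   # absent if nothing was found
    return [(r["@surfaceForm"], r["@URI"]) for r in resources]

if __name__ == "__main__":
    for surface, uri in spotlight_annotate("Berlin is the capital of Germany."):
        print(surface, "->", uri)
```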
The document discusses research into enabling customized accessibility for users by combining personal needs and preferences profiles (PNPs) with digital resource description metadata. PNPs describe a user's display, control, and content needs, while digital resource descriptions indicate how resources can be adapted across different modalities. Together, PNPs and resource descriptions allow for automatic, personalized accessibility adaptation of online content to individual user needs.
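A toy sketch of the matching idea described above: a personal needs and preferences profile is compared with resource descriptions to decide whether a resource already satisfies the profile or needs further adaptation. The field names and values are invented for illustration and are not taken from the AccessForAll specification.

```python
# Toy PNP-to-resource matching; all field names and values are invented.
pnp = {"needs": {"captions", "high-contrast"}}

resources = [
    {"id": "video-1", "adaptations": {"captions", "transcript"}},
    {"id": "video-2", "adaptations": set()},
]

for res in resources:
    missing = pnp["needs"] - res["adaptations"]
    if missing:
        print(res["id"], "needs additional adaptation for:", sorted(missing))
    else:
        print(res["id"], "already satisfies the profile")
```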
The document discusses technologies and infrastructure for publishing biodiversity data from environmental impact assessments (EIA). It covers the types and formats of EIA biodiversity data, tools for data capture and digitization, platforms for data discovery and publishing, ensuring data quality, and hosting data centers to facilitate long-term archiving and publishing of EIA biodiversity data.
The document discusses the concepts of semantic technology and the semantic web. It defines key concepts like tabula rasa, the network effect, and intelligence embedded in data through relationships. It also outlines technologies used in the semantic web like RDF, OWL, SPARQL, FOAF, and DBpedia and how search engines and companies are using these technologies for applications like sentiment analysis, natural language processing, and information extraction.
This document discusses Timbuctoo, an application designed for academic research that allows for complex and heterogeneous data. It explores archiving RDF datasets from Timbuctoo instances, including handling RDF graphs and triples, versioning datasets, and verifying dataset integrity and resolving links. A potential pipeline is proposed to ingest datasets from Timbuctoo into the EASY archive, but current Timbuctoo instances and datasets have obscure URIs and insufficient metadata, and the prototype pipeline lacks specifications. Archiving linked data from Timbuctoo could change the nature of preservation for archives.
This document provides an outline and overview of a seminar on text mining. It discusses basics of text mining including definitions, similarities to data mining, preprocessing operations, document features, and representational models of documents. It also describes general architectures of text mining systems and provides examples of system architectures for generic, domain-oriented, and advanced text mining systems with background knowledge bases.
Linkator: enriching web pages by automatically adding dereferenceable semanti... - Samur Araujo
In this paper, we introduce Linkator, an application architecture that exploits semantic annotations to automatically add links to previously generated web pages. Linkator provides a mechanism for dereferencing these semantic annotations with what it calls semantic links. Automatically adding links to web pages improves users' navigation: it connects the visited page with external sources of information that the user may be interested in, but that were not identified as such during the web page design phase. The process of auto-linking encompasses finding the terms to be linked and finding the destination of the link. Linkator delegates the first stage to external semantic annotation tools and concentrates on the process of finding a relevant resource to link to. In this paper, a use case is presented that shows how this mechanism can support knowledge workers in finding publications during their navigation on the web.
This document summarizes the Knowledge Engineering efforts for the TELDAP digital library project. It discusses (1) developing metadata models for different types of digital objects, including a union catalog model and models for websites and documents; (2) establishing hyperlinks between objects and keywords to connect related resources; and (3) constructing ontologies and thesauri like Getty AAT and Chinese WordNet to link keywords and establish implicit relationships between objects. The goal is to optimize access, retrieval and understanding of the large and growing collection of digital content.
The document discusses the Semantic Web, which refers to extending the current web by giving information well-defined meaning that computers can understand. It describes the evolution of the web from Web 1.0 to 3.0 and outlines key components that enable the Semantic Web like URIs, RDF, RDFS, OWL, and SPARQL. The technology brings benefits like improved search, interoperability, and opportunities for applications in areas like healthcare, e-learning, and more. Realizing its full potential will take generating vocabularies and developing applications that make use of shared semantic data.
Longwell is a tool that provides a graphical interface for exploring RDF data in a web browser. It displays types of resources as filters along the top and facets like properties on the right. Users can browse data by selecting types to view associated resources and properties. Queries powering Longwell return type and property frequencies to display, list properties for a selected type, and populate property panels with object values to enable interactive faceted browsing of RDF datasets.
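The kind of facet-count query described above can be sketched as a SPARQL GROUP BY over an RDF graph. The example below uses rdflib with a tiny in-memory dataset; it is not Longwell's actual query code, and the data is invented.

```python
# Sketch of a "type facet" count with rdflib and SPARQL; data is invented.
from rdflib import Graph

g = Graph()
g.parse(data="""
@prefix ex: <http://example.org/> .
ex:a a ex:Book ;    ex:subject "metadata" .
ex:b a ex:Book ;    ex:subject "text mining" .
ex:c a ex:Article ; ex:subject "metadata" .
""", format="turtle")

# How many resources of each rdf:type? (the values behind a type filter)
q = """
SELECT ?cls (COUNT(?s) AS ?n)
WHERE { ?s a ?cls }
GROUP BY ?cls
"""
for cls, n in g.query(q):
    print(cls, n)
```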
Annotating Digital Texts in the Brown University Library - Timothy Cole
The document discusses annotating digital texts at Brown University Library. It describes several projects at Brown that involve textual scholarship and digital humanities. It then explains the Pico Project, which aims to annotate Giovanni Pico della Mirandola's 900 Theses. It outlines how annotations of digital objects are ingested and stored in the Brown Digital Repository using AtomPub, XML, RDF, and Linked Data standards to allow for aggregation, syndication, and addressing of annotations.
The document discusses text mining and provides examples. It defines text mining as the extraction of implicit knowledge from large amounts of textual data. It discusses applications such as marketing, industry research, and job seeking. Key text mining methods covered include information retrieval, information extraction, web mining, and clustering. The document outlines the text mining process and discusses text characteristics, learning methods such as classification and clustering, and evaluation metrics. Examples are provided to illustrate classification using decision trees and k-nearest neighbors on structured and unstructured text data.
Open Annotation Collaboration Briefing - Timothy Cole
The document summarizes a meeting of the Open Annotation Collaboration (OAC) project team. The OAC aims to develop an interoperable annotation model and specification to facilitate sharing annotations across systems. In phase 1, the OAC will analyze existing annotation practices, develop a data model and specification, integrate annotation tools into Zotero, and create a proof-of-concept implementation.
This document discusses modelling and representing social network data ontologically. It covers representing social individuals and relationships ontologically, as well as aggregating and reasoning with social network data. It discusses ontology languages like RDF, OWL, and FOAF that can be used to represent social network data and individuals semantically. It also talks about state-of-the-art approaches for representing network structure and attribute data, and the need for representations that can integrate different data sources and maintain identity.
Folksonomies: a bottom-up social categorization system - domenico79
Folksonomies are a bottom-up social system for categorizing and sharing content created by users tagging resources with freely chosen keywords or tags. They have no controlled vocabulary or defined relationships between terms, instead allowing collaboration and adaptation through shared user-generated metadata. While lacking formal structure, folksonomies lower barriers to cooperation and foster serendipitous discovery, but can suffer from ambiguity due to synonyms, polysemy and other issues with vocabulary.
Knowledge Organization System (KOS) for biodiversity information resources, G... - Dag Endresen
Presentation of the Global Biodiversity Information Facility (GBIF) knowledge organization system (KOS) work program for the National Center for Biomedical Ontology (NCBO) Web seminar series in October 2012. Available at http://www.bioontology.org/GBIF-vocabulary-management-for-biodiversity-informatics
Wikipedia as source of collaboratively created Knowledge Organization Systems - Jakob .
The document discusses Wikipedia as a source of collaboratively created knowledge organization systems. It describes the structure of Wikipedia articles, categories, infoboxes, and how this structured data can be extracted and represented in semantic formats like RDF to create knowledge bases like DBpedia that link open data on the web. It also discusses some open issues around data quality, concepts and mapping when extracting and querying structured knowledge from Wikipedia.
Networked Digital Library Of Theses And Dissertations - singlish
The Networked Digital Library of Theses and Dissertations (NDLTD) is an international organization that promotes the electronic publishing and preservation of graduate theses and dissertations. NDLTD allows students to create electronic documents, increases access to student research, and supports long-term preservation of electronic theses and dissertations (ETDs). It currently holds over 767,000 ETDs from 90 institutions in 18 countries in 17 formats, with metadata described by the ETD-MS standard to facilitate searching and discovery.
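Since ETD metadata of this kind is typically exposed for harvesting, the sketch below shows a hedged example of fetching Dublin Core records over OAI-PMH with Python. The repository base URL is a placeholder; the verb and metadataPrefix parameters come from the OAI-PMH specification, and resumptionToken-based continuation is omitted for brevity.

```python
# Hedged sketch: harvest Dublin Core records from an OAI-PMH endpoint.
import requests
import xml.etree.ElementTree as ET

BASE_URL = "https://repository.example.org/oai"   # hypothetical endpoint
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

resp = requests.get(BASE_URL,
                    params={"verb": "ListRecords", "metadataPrefix": "oai_dc"},
                    timeout=30)
resp.raise_for_status()
root = ET.fromstring(resp.content)

for record in root.findall(".//oai:record", NS):
    titles = [t.text for t in record.findall(".//dc:title", NS)]
    identifiers = [i.text for i in record.findall(".//dc:identifier", NS)]
    print(titles, identifiers)
```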
This document provides an overview of digital libraries, including definitions, benefits, limitations, components, standards, and challenges. It defines a digital library as a collection of information stored and accessed electronically, extending the functions of a traditional library digitally. Benefits include improved access and searchability, easier information sharing and preservation. Emerging technologies discussed include metadata standards, XML, and protocols like OAI-PMH for metadata harvesting. Common digital library software includes DSpace, Greenstone, and EPrints. Challenges involve digitization, description, legal issues, presentation of heterogeneous resources, and economic sustainability.
This document provides an overview of digital libraries, including definitions, benefits, limitations, components, standards, and challenges. It defines a digital library as a collection of information stored and accessed electronically, extending the functions of a traditional library digitally. Benefits include improved access, information sharing, and preservation, while limitations include technological obsolescence and rights management. Key components discussed include digital objects, metadata, and tools like DSpace and Greenstone for developing digital libraries. Emerging standards around identifiers, encoding, and metadata are also summarized.
The document discusses various technologies for metasearching or cross-searching multiple databases at once, including Z39.50 for real-time searching, SRU/SRW web services, and OAI-PMH for metadata harvesting. It explains concepts like XML, web services, SOAP, and WSDL, and provides examples of how technologies like Z39.50, SRU, and OAI-PMH enable searching across different data sources.
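For instance, an SRU searchRetrieve request is just an HTTP GET carrying a CQL query. The sketch below uses the SRU 1.2 parameter names from the specification, with a placeholder base URL and deliberately simplified response handling.

```python
# Hedged sketch of an SRU 1.2 searchRetrieve request; the base URL is a placeholder.
import requests
import xml.etree.ElementTree as ET

SRU_BASE = "https://catalogue.example.org/sru"   # hypothetical SRU endpoint

resp = requests.get(SRU_BASE, params={
    "operation": "searchRetrieve",
    "version": "1.2",
    "query": 'dc.title = "text mining"',
    "maximumRecords": 5,
}, timeout=30)
resp.raise_for_status()

root = ET.fromstring(resp.content)
# Record payloads are wrapped in srw:recordData elements in SRU responses.
for rec in root.iter("{http://www.loc.gov/zing/srw/}recordData"):
    print(ET.tostring(rec, encoding="unicode")[:200])
```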
Tools and Techniques for Creating, Maintaining, and Distributing Shareable Me... - Jenn Riley
This document discusses tools and techniques for creating, maintaining, and distributing shareable metadata. It emphasizes that metadata should be structured to be understandable outside of local contexts and useful for other institutions. Key aspects of shareable metadata include using appropriate content and vocabularies, ensuring records are coherent, providing useful context, and conforming to standards. The document also outlines example workflows and considerations for making metadata shareable.
Examines how new technologies can be applied to overcome problems in controlled vocabularies, focusing on Resource Description Framework (RDF), Simple Knowledge Organisation System (SKOS), metadata registries and web services. Part of the Cataloguing and Indexing Group in Scotland (CIGS) seminar "Toto, I've got a feeling we're not in Kansas anymore": metadata issues and Web2.0 services.
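A small sketch of what expressing a controlled-vocabulary term in SKOS can look like, using rdflib; the concept URIs and labels are invented for illustration.

```python
# Sketch: a SKOS concept with preferred/alternative labels and a broader link.
from rdflib import Graph, Namespace, Literal
from rdflib.namespace import RDF, SKOS

EX = Namespace("http://example.org/vocab/")   # hypothetical vocabulary namespace

g = Graph()
g.bind("skos", SKOS)

g.add((EX.TextMining, RDF.type, SKOS.Concept))
g.add((EX.TextMining, SKOS.prefLabel, Literal("Text mining", lang="en")))
g.add((EX.TextMining, SKOS.altLabel, Literal("Text data mining", lang="en")))
g.add((EX.TextMining, SKOS.broader, EX.DataMining))
g.add((EX.DataMining, RDF.type, SKOS.Concept))
g.add((EX.DataMining, SKOS.prefLabel, Literal("Data mining", lang="en")))

print(g.serialize(format="turtle"))
```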
The Web of Linked Open Data, or LOD, is the most relevant achievement of the Semantic Web. Initially proposed by Tim Berners-Lee in a seminal paper published in Scientific American in 2001, the Semantic Web envisions a web where software agents can interact with large volumes of structured, easy-to-process data. Users now have at their disposal the first mature results of this vision. Among them, and probably the most significant, are the various LOD initiatives and projects that publish open data in standard formats like RDF.
This presentation provides an overview and comparison of different LOD initiatives in the area of patent information, and analyses potential opportunities for building new information services based on widely available datasets of patent information. The information is based on interviews conducted with innovation agents and on an analysis of the professional bibliography and current implementations.
LOD opportunities are not restricted to information aggregators; they also extend to end users and innovation agents who have to cope with the difficulties of dealing with large amounts of data. In both cases, the opportunities offered by LOD need to be assessed, as LOD has become a standard, universal method to distribute, share and access data.
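To show what consuming LOD looks like in practice, here is a hedged sketch of querying a public SPARQL endpoint with the SPARQLWrapper library. DBpedia is used only as a well-known example endpoint, not as a source of patent information, and its availability may vary.

```python
# Hedged sketch: query a public SPARQL endpoint with SPARQLWrapper.
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("https://dbpedia.org/sparql")   # public endpoint; availability may vary
sparql.setReturnFormat(JSON)
sparql.setQuery("""
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    SELECT ?label WHERE {
        <http://dbpedia.org/resource/Text_mining> rdfs:label ?label .
        FILTER (lang(?label) = "en")
    }
""")

results = sparql.query().convert()
for binding in results["results"]["bindings"]:
    print(binding["label"]["value"])
```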
URM concept for sharing information inside of communities - Karel Charvat
The document describes the Uniform Resource Management (URM) concept for sharing information within communities. URM provides a framework for standardized description of information using metadata schemes and controlled vocabularies to improve discovery. It is implemented through various portals and tools that allow users to manage and discover knowledge according to context. Initial implementations included portals for nature, sustainability and rural information in the Czech Republic and Latvia. URM supports collaborative knowledge sharing through interoperable systems based on open standards.
This document discusses metadata, which is structured data that describes and helps manage information resources. There are different types of metadata including descriptive, structural, and administrative. Metadata serves important functions like allowing resources to be discovered and organized. Several metadata standards are discussed, including Dublin Core, METS, MODS, EAD, and LOM. The document also covers metadata creation, quality issues, and ways metadata can be improved.
Dataset description: DCAT and other vocabularies - Valeria Pesce
This document discusses metadata needed to describe datasets for applications to find and understand them when stored in data catalogs or repositories. It examines existing dataset description vocabularies like DCAT and their limitations in fully capturing necessary metadata.
Key points made:
- Machine-readable metadata is important for datasets to be discoverable and usable by applications when stored across repositories.
- Metadata should describe the dataset, distributions, dimensions, semantics, protocols/APIs, subsets etc.
- Vocabularies like DCAT provide some metadata but don't fully cover dimensions, semantics, protocols/APIs or subsets (see the DCAT sketch after this list).
- No single vocabulary or data catalog solution currently provides all necessary metadata for full semantic interoperability.
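A minimal sketch, assuming rdflib and the W3C DCAT vocabulary, of the dataset/distribution split the points above refer to; the URIs and values are invented examples, not a complete description.

```python
# Minimal DCAT sketch with rdflib; all URIs and values are invented.
from rdflib import Graph, Namespace, URIRef, Literal
from rdflib.namespace import RDF, DCAT, DCTERMS

EX = Namespace("http://example.org/")

g = Graph()
g.bind("dcat", DCAT)
g.bind("dcterms", DCTERMS)

dataset = EX.cropYields          # hypothetical dataset URI
distribution = EX.cropYieldsCSV  # hypothetical distribution URI

g.add((dataset, RDF.type, DCAT.Dataset))
g.add((dataset, DCTERMS.title, Literal("Crop yields 2010-2015", lang="en")))
g.add((dataset, DCTERMS.license, URIRef("http://example.org/licenses/open")))
g.add((dataset, DCAT.distribution, distribution))

g.add((distribution, RDF.type, DCAT.Distribution))
g.add((distribution, DCAT.mediaType, Literal("text/csv")))
g.add((distribution, DCAT.downloadURL, URIRef("http://example.org/data/crop-yields.csv")))

print(g.serialize(format="turtle"))
```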
The document discusses semantic mapping in CLARIN Component Metadata Infrastructure (CMDI). CMDI allows flexible yet semantically interoperable metadata descriptions through the use of explicit schemas and semantic registries like ISOcat and RelationRegistry. These registries define concepts and relationships that can be shared across metadata profiles and elements. Semantic mapping helps achieve recall and disambiguation in metadata searches across the diverse set of CMDI profiles and components.
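An illustrative (non-CMDI) sketch of the semantic-mapping idea: two profiles use different element names, but both elements are linked to the same concept URI in a registry, so a search can match on the concept rather than the element name. All URIs are hypothetical placeholders standing in for registry entries.

```python
# Toy concept-based matching across two metadata profiles; URIs are placeholders.
CONCEPT_LINKS = {
    ("profileA", "resourceTitle"): "http://example.org/concepts/title",
    ("profileB", "name"):          "http://example.org/concepts/title",
    ("profileA", "contactPerson"): "http://example.org/concepts/creator",
}

records = [
    {"profile": "profileA", "resourceTitle": "Europarl corpus"},
    {"profile": "profileB", "name": "Europarl parallel corpus"},
]

def values_for_concept(record, concept_uri):
    """Collect values of all elements linked to the given concept."""
    return [v for k, v in record.items()
            if CONCEPT_LINKS.get((record["profile"], k)) == concept_uri]

for rec in records:
    print(values_for_concept(rec, "http://example.org/concepts/title"))
```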
This document provides an overview of metadata, including:
1) Definitions of metadata from various sources, describing it as data that describes other data or information resources.
2) The main types of metadata - descriptive, processing, administrative, and semantic. Descriptive metadata supports finding and retrieving information, processing metadata supports processing it, and administrative metadata supports managing it.
3) How metadata can be created automatically by tools or manually by people. Metadata schemes provide a formal structure to identify a discipline's knowledge and link it to information resources.
The document discusses metadata schemas and standards for digital library projects in China. It describes several existing metadata schemes including the General Format for Digitalized Chinese Full-text (GFDCF), the CPDLP Metadata Profiles, and the Chinese Metadata Specifications (CMS). It also discusses applying ontologies to build a unified metadata framework, including ontologies of Chinese information resources and bibliographic relations. This would help address issues of lack of unified semantics, mappings between schemas, and diversification in the Chinese metadata landscape.
The document discusses metadata schemes and their components. It defines a metadata scheme as a set of defined metadata elements and rules for a specific purpose. It provides examples of common metadata schemes and discusses their semantics (meanings), content rules, and syntax. The document also outlines some key purposes and benefits of metadata such as documentation, organization, search and retrieval, and preservation of information resources.
The document discusses perspectives on metadata from web resources and database systems. It describes how metadata comes in many forms and serves various purposes, such as supporting discovery and identification of information resources on the web (resource metadata), and ensuring consistency and analysis of structured data in databases (metadata in database systems). Resource metadata commonly follows standards and is stored separately from the resources it describes, while database metadata includes both structural metadata describing data organization and content metadata in the form of data dictionaries.
The document discusses semantic web technology, which aims to make information on the web better understood by machines by giving data well-defined meaning. It outlines the evolution of web technologies from the initial web to the semantic web. Key aspects of semantic web technology include ontologies to define common vocabularies, semantic annotations to associate meaning with data, and reasoning capabilities to enable complex queries and analyses. Languages, tools, and applications are needed to implement these semantic web standards and make the web of linked data usable.
Semantic Web: Technologies and Applications for the Real-World - Amit Sheth
Amit Sheth and Susie Stephens, "Semantic Web: Technologies and Applications for the Real-World," tutorial at the 2007 World Wide Web Conference, Banff, Canada.
Tutorial discusses technologies and deployed real-world applications through 2007.
Tutorial description at: http://www2007.org/tutorial-T11.php
Similar to Webinar slides: Interoperability between resources involved in TDM at the level of metadata (20)
This document summarizes a presentation about text and data mining of scientific literature. It discusses the large and growing amounts of digital content and data being produced, and challenges around making sense of it all. It introduces text mining as an emerging solution to analyze and extract insights from unstructured text sources. The presentation describes the OpenMinTeD framework, which aims to create an open infrastructure for text and data mining services, tools, and annotated corpora. It discusses registering and discovering services, running jobs, and sharing results. Finally, it covers challenges around interoperability, legal issues, policies, and sustainability.
Resource sync overview and real-world use cases for discovery, harvesting, an... - openminted_eu
This document summarizes an overview presentation about ResourceSync and its implementations at Hyku and the Digital Public Library of America (DPLA). Some key points:
- ResourceSync was developed as an update to OAI-PMH for synchronizing web resources between systems in a more flexible way. It supports resource lists, change lists, and dumps (a minimal resource-list parsing sketch follows after this summary).
- Hyku has implemented ResourceSync publishing capabilities, and the DPLA has developed a harvester for the Hyku endpoint. This allows for incremental metadata updates rather than full resynchronization of data sets.
- Next steps include potentially supporting resource dumps in Hyku and harvesting from 3 DPLA providers using ResourceSync by the end of the year.
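As referenced above, a ResourceSync resource list is a sitemap-format XML document. The hedged sketch below fetches one and prints resource URLs with their last-modified timestamps; the list URL is a placeholder, while the namespace URIs follow the sitemap and ResourceSync specifications.

```python
# Hedged sketch: read a ResourceSync resource list (sitemap-format XML).
import requests
import xml.etree.ElementTree as ET

RESOURCE_LIST = "https://repository.example.org/resourcelist.xml"  # hypothetical
NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "rs": "http://www.openarchives.org/rs/terms/",
}

resp = requests.get(RESOURCE_LIST, timeout=30)
resp.raise_for_status()
root = ET.fromstring(resp.content)

for url in root.findall("sm:url", NS):
    loc = url.findtext("sm:loc", namespaces=NS)        # resource URL
    lastmod = url.findtext("sm:lastmod", namespaces=NS) # optional timestamp
    print(loc, lastmod)
```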
Seamless access to the world's open access research papers via ResourceSync - openminted_eu
Describes a set of scholarly communications use cases for ResourceSync and presents the development and integration of the PublisherConnector in CORE. By Petr Knoth.
Text Mining: the next data frontier. Beyond Open Access - openminted_eu
1) The presentation discusses the need for text and data mining (TDM) tools to make sense of the vast amount of digital data and literature being produced, noting that there are over 1.8 billion websites and 3.46 billion internet users producing large amounts of data daily. 2) Similarly, the global research community produces around 2.5 million new scholarly articles per year, but much of this work is never read or cited. 3) The presentation proposes establishing an open TDM platform called "OpenMinTeD" that would allow researchers to discover, share, and reuse knowledge extracted from text-based sources through the use of shared TDM services and tools.
This document discusses the work of the WG3 Legal Interoperability working group for the OpenMinTeD project. The goal of the working group is to study copyright and related rights restrictions on text and data mining (TDM) activities and identify contractual and licensing tools to support TDM. It outlines legal barriers like copyright and database rights, as well as exceptions and limitations. It also discusses the use of licenses to enable access and how policy choices could address limitations of licenses. The working group's deliverables will include a compatibility matrix of licenses and ongoing analysis presented in academic papers.
How can repositories support the text mining of their content and why? - openminted_eu
This document discusses how repositories can support text and data mining (TDM) of their content. It provides three principles for repositories to follow: (1) establish direct links from metadata to the full text content, (2) provide universal access to harvesting systems at the same level as humans, and (3) ensure metadata is correctly referenced and content is accessible. The role of repositories is to aggregate research papers at full text to enable large-scale TDM by external services. However, many repositories currently do not fully support this due to issues like incomplete metadata records and non-dereferenceable identifiers.
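A toy sketch of checking principle (3) in practice: take identifiers found in metadata records and verify that they actually dereference over HTTP, i.e. that a request on them succeeds and leads to retrievable content. The identifiers below are invented examples.

```python
# Toy dereferenceability check for identifiers found in metadata records.
import requests

identifiers = [
    "https://repository.example.org/record/123/fulltext.pdf",  # hypothetical
    "https://doi.org/10.9999/example-doi",                     # hypothetical
]

for ident in identifiers:
    try:
        resp = requests.head(ident, allow_redirects=True, timeout=15)
        ok = resp.status_code == 200
    except requests.RequestException:
        ok = False
    print(ident, "dereferences" if ok else "does NOT dereference")
```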
The document discusses the potential value of text and data mining UK theses. It notes that UK theses represent unique cutting-edge research not published elsewhere. The EThOS database contains metadata on over 430,000 UK theses totaling around 6 million pages of research annually. Several examples are provided of text and data mining projects that have extracted useful information from UK theses, such as identifying trends in dementia research and discovering new chemical compounds. While thesis metadata is openly available, accessing the full texts requires permission due to copyright. Overall, the document argues that UK theses represent a valuable untapped resource for text and data mining research.
OpenMinTeD - Repositories in the centre of new scientific knowledge - openminted_eu
OpenMinTeD aims to establish an open text and data mining platform for researchers to discover, create, share and reuse knowledge from scholarly sources. It will provide interoperable services for machine reading, information extraction and predictive analysis of structured data from unstructured text. Key challenges include making content and services discoverable and interoperable, and addressing intellectual property rights. OpenMinTeD will build on existing repositories and language resources and technologies, and involve stakeholders from its inception to evaluate outcomes.
Jisc has invested in text mining capabilities and established the National Centre for Text Mining (NaCTeM) to fund various text aggregation projects. Jisc provides open access, bibliographic, and subscription management services that include text mining of over 25 million records and 600 journal titles in CORE and journal archives. There is potential to develop user-facing text mining applications using these combined data sets to unlock hidden information and develop new knowledge.
OpenMinTeD: Its Uses and Benefits for the Social Sciences - openminted_eu
Presentation given at the ITOC workshop in Philadelphia, 20 February 2016.
Uses and Benefits for the Social Sciences research community.
By GESIS - Leibniz Institute for the Social Sciences
The document discusses text and data mining (TDM) projects in Europe. It describes how TDM can be used to understand the past by mining historical books, predict the future by mining newspapers, and save lives by mining scientific publications about diseases. It also outlines some current barriers to TDM in Europe like a lack of awareness, skills and tools, licensing and copyright issues. Two EU projects are highlighted: FutureTDM which aims to identify TDM barriers and policy solutions, and OpenMinTeD which builds a collaborative TDM infrastructure.
Infrastructure crossroads... and the way we walked them in DKPro - openminted_eu
The document discusses natural language processing (NLP) infrastructure and challenges in text and data mining. It describes DKPro, an open-source collection of NLP tools that provides interoperability between projects. DKPro Core allows running NLP pipelines with no installation through dependency fetching. Challenges discussed include balancing data protection with interoperability and moving data and analytics as needs change. The talk proposes addressing these through open APIs and repositories to discover, access, deploy and retrieve analytics and their results.
OpenMinTeD: Making Sense of Large Volumes of Data - openminted_eu
The document discusses making scientific content more accessible and useful through text and data mining. It notes that the global research community generates over 1.5 million new articles per year but many are never read or cited. Emerging solutions like machine reading, understanding and prediction can help structure and mine textual data to extract meaningful insights. The OpenMinTeD project aims to establish an open text and data mining platform and infrastructure for researchers to work collaboratively with scientific sources. It outlines challenges around content, services and processing, as well as the main routes to making content more accessible through metadata, transfer protocols and licensing. The project involves various partners and use cases across domains like scholarly communication, life sciences, agriculture and social sciences.
Experiences of Text Mining; the National Library of Austria perspective - openminted_eu
Max Kaiser discusses text mining challenges for cultural heritage institutions using the Austrian National Library as a case study. The library has digitized over 600,000 volumes and made them available online through partnerships. While technology exists for tasks like named entity recognition and topic modeling, challenges remain in integrating unstable OCR text data into production systems due to evolving source materials and algorithms. User needs must also be understood to ensure text mining benefits cultural heritage.
Text and Data Mining at the Royal Library in the Netherlands - openminted_eu
The Koninklijke Bibliotheek has a large collection of machine readable structured and semi-structured data that is the result of over 200 years of collecting, 30 years of digitization, and 10 years of collecting born-digital content. Examples of datasets include newspapers from 1840-1995 made available through an ngram viewer, political speeches from 1814 to present enriched and visualized, and radio bulletins developed through collaborations. Lessons learned are that researchers use the data in unexpected ways, collaborations provide insights, opening data creates new opportunities, and connections are built with the research community.
OpenMinTeD is an EU infrastructure project that aims to establish an open and sustainable text mining infrastructure. It will bring together accessible content, discoverable text mining services, and efficient processing capabilities. This will allow researchers to collaboratively create, discover, share and reuse knowledge extracted from a wide range of scientific text sources. The project involves 16 partners from 6 countries and will run for 3 years, starting in June 2015.
Climate Impact of Software Testing at Nordic Testing Days - Kari Kakkonen
My slides at Nordic Testing Days 6.6.2024
The talk discusses the climate impact and sustainability of software testing. ICT and testing must carry their part of the global responsibility to help address climate warming. We can minimize our carbon footprint, but we can also have a carbon handprint, a positive impact on the climate. Quality characteristics can be extended with sustainability and then measured continuously. Test environments can be used less, at a smaller scale, and on demand. Test techniques can be used to optimize or minimize the number of tests. Test automation can be used to speed up testing.
Dr. Sean Tan, Head of Data Science, Changi Airport Group
Discover how Changi Airport Group (CAG) leverages graph technologies and generative AI to revolutionize their search capabilities. This session delves into the unique search needs of CAG’s diverse passengers and customers, showcasing how graph data structures enhance the accuracy and relevance of AI-generated search results, mitigating the risk of “hallucinations” and improving the overall customer journey.
Threats to mobile devices are more prevalent and increasing in scope and complexity. Users of mobile devices want to take full advantage of the features available on those devices, but many of those features provide convenience and capability while sacrificing security. This best practices guide outlines steps users can take to better protect their personal devices and information.
Unlock the Future of Search with MongoDB Atlas: Vector Search Unleashed - Malak Abu Hammad
Discover how MongoDB Atlas and vector search technology can revolutionize your application's search capabilities. This comprehensive presentation covers:
* What is Vector Search?
* Importance and benefits of vector search
* Practical use cases across various industries
* Step-by-step implementation guide
* Live demos with code snippets
* Enhancing LLM capabilities with vector search
* Best practices and optimization strategies
Perfect for developers, AI enthusiasts, and tech leaders. Learn how to leverage MongoDB Atlas to deliver highly relevant, context-aware search results, transforming your data retrieval process. Stay ahead in tech innovation and maximize the potential of your applications.
#MongoDB #VectorSearch #AI #SemanticSearch #TechInnovation #DataScience #LLM #MachineLearning #SearchTechnology
Communications Mining Series - Zero to Hero - Session 1 - DianaGray10
This session provides an introduction to UiPath Communications Mining, its importance, and an overview of the platform. You will acquire a good understanding of the phases in Communications Mining as we go over the platform with you. Topics covered:
• Communication Mining Overview
• Why is it important?
• How can it help today’s business and the benefits
• Phases in Communication Mining
• Demo on Platform overview
• Q/A
UiPath Test Automation using UiPath Test Suite series, part 5 - DianaGray10
Welcome to part 5 of the UiPath Test Automation using UiPath Test Suite series. In this session, we will cover CI/CD with DevOps.
Topics covered:
CI/CD within UiPath
End-to-end overview of the CI/CD pipeline with Azure DevOps
Speaker:
Lyndsey Byblow, Test Suite Sales Engineer @ UiPath, Inc.
Best 20 SEO Techniques To Improve Website Visibility In SERP - Pixlogix Infotech
Boost your website's visibility with proven SEO techniques! Our latest blog dives into essential strategies to enhance your online presence, increase traffic, and rank higher on search engines. From keyword optimization to quality content creation, learn how to make your site stand out in the crowded digital landscape. Discover actionable tips and expert insights to elevate your SEO game.
Sudheer Mechineni, Head of Application Frameworks, Standard Chartered Bank
Discover how Standard Chartered Bank harnessed the power of Neo4j to transform complex data access challenges into a dynamic, scalable graph database solution. This keynote will cover their journey from initial adoption to deploying a fully automated, enterprise-grade causal cluster, highlighting key strategies for modelling organisational changes and ensuring robust disaster recovery. Learn how these innovations have not only enhanced Standard Chartered Bank’s data infrastructure but also positioned them as pioneers in the banking sector’s adoption of graph technology.
Unlocking Productivity: Leveraging the Potential of Copilot in Microsoft 365, a presentation by Christoforos Vlachos, Senior Solutions Manager – Modern Workplace, Uni Systems
Removing Uninteresting Bytes in Software Fuzzing - Aftab Hussain
Imagine a world where software fuzzing, the process of mutating bytes in test seeds to uncover hidden and erroneous program behaviors, becomes faster and more effective. A lot depends on the initial seeds, which can significantly dictate the trajectory of a fuzzing campaign, particularly in terms of how long it takes to uncover interesting behaviour in your code. We introduce DIAR, a technique designed to speedup fuzzing campaigns by pinpointing and eliminating those uninteresting bytes in the seeds. Picture this: instead of wasting valuable resources on meaningless mutations in large, bloated seeds, DIAR removes the unnecessary bytes, streamlining the entire process.
In this work, we equipped AFL, a popular fuzzer, with DIAR and examined two critical Linux libraries -- Libxml's xmllint, a tool for parsing XML documents, and Binutils' readelf, an essential debugging and security analysis command-line tool used to display detailed information about ELF (Executable and Linkable Format) files. Our preliminary results show that AFL+DIAR not only discovers new paths more quickly but also achieves higher coverage overall. This work thus showcases how starting with lean and optimized seeds can lead to faster, more comprehensive fuzzing campaigns -- and DIAR helps you find such seeds.
- These are slides of the talk given at IEEE International Conference on Software Testing Verification and Validation Workshop, ICSTW 2022.
What do a Lego brick and the XZ backdoor have in common? - Speck&Tech
ABSTRACT: At first glance, a Lego brick and the XZ backdoor might seem to have in common only the fact that both are building blocks, or dependencies, of creative and software projects. In reality, a Lego brick and the XZ backdoor case have much more than that in common.
Join the presentation to dive into a story of interoperability, standards and open formats, and then discuss the important role that contributors play in a sustainable open source community.
BIO: An advocate of free software and of standard, open formats. She has been an active member of the Fedora and openSUSE projects and co-founded the LibreItalia Association, where she was involved in several LibreOffice-related events, migrations and training activities. She previously worked on LibreOffice migrations and training courses for various public administrations and private organizations. Since January 2020 she has worked at SUSE as a Software Release Engineer for Uyuni and SUSE Manager, and when she is not pursuing her passion for computers and for Geeko she cultivates her curiosity about astronomy (hence her nickname, deneb_alpha).
For the full video of this presentation, please visit: https://www.edge-ai-vision.com/2024/06/building-and-scaling-ai-applications-with-the-nx-ai-manager-a-presentation-from-network-optix/
Robin van Emden, Senior Director of Data Science at Network Optix, presents the “Building and Scaling AI Applications with the Nx AI Manager,” tutorial at the May 2024 Embedded Vision Summit.
In this presentation, van Emden covers the basics of scaling edge AI solutions using the Nx tool kit. He emphasizes the process of developing AI models and deploying them globally. He also showcases the conversion of AI models and the creation of effective edge AI pipelines, with a focus on pre-processing, model conversion, selecting the appropriate inference engine for the target hardware and post-processing.
Van Emden shows how Nx can simplify the developer’s life and facilitate a rapid transition from concept to production-ready applications. He provides valuable insights into developing scalable and efficient edge AI solutions, with a strong focus on practical implementation.
Infrastructure Challenges in Scaling RAG with Custom AI Models - Zilliz
Building Retrieval-Augmented Generation (RAG) systems with open-source and custom AI models is a complex task. This talk explores the challenges in productionizing RAG systems, including retrieval performance, response synthesis, and evaluation. We’ll discuss how to leverage open-source models like text embeddings, language models, and custom fine-tuned models to enhance RAG performance. Additionally, we’ll cover how BentoML can help orchestrate and scale these AI components efficiently, ensuring seamless deployment and management of RAG systems in the cloud.
3. OpenMinTeD sets out to create an open, service-oriented e-Infrastructure for TDM of scientific and scholarly content. Researchers can collaboratively create, discover, share and re-use knowledge from a wide range of text-based, science-related sources in a seamless way.
"Achieving interoperability between resources involved in TDM
at the level of metadata"
4. • Text and Data Mining: “the discovery by computer of new,
previously unknown information, by automatically extracting
and relating information from different (…) resources to reveal
otherwise hidden meanings” [Hearst 1999]
• Interoperability: Relating to systems, especially of computers
or telecommunications, that are capable of working together
without being specially configured to do so. [American Heritage®
Dictionary of the English Language, Fifth Edition. (2011)]
5. • Language Resource: It encompasses both data sets (textual,
multimodal/multimedia and lexical data, grammars, language
models etc.) and tools/technologies/services used for their
processing [WG1 - Wiki glossary]
• Metadata: contains descriptive, contextual and provenance
assertions about the properties of a Digital Object [RDA - DFT Core
terms]
6. • content to be mined, i.e. in OpenMinTeD, scientific & scholarly publications (built as "corpora")
• ancillary/reference resources, e.g. typesystems, linguistic tagsets,
terminological lexica, ontologies, machine learning models, reference
corpora, training corpora etc.
• "components" in the form of
• downloadable and locally executable tools
• web services
• workflows composed of the above
7. • registry service: to register and, later on, search and find content and
s/w components that can process this content - targeting end users
including TDM experts
• workflow service: to search and find s/w components & ancillary
resources that are (or can be made) compatible (hence,
interoperable) in order to compose workflows - targeting TDM service
developers
• document properties that users use in their queries to discover the resources, but also
• document properties that will support the automatic discovery of compatibility (a) between s/w components and (b) between content & s/w components (i.e. find interoperable resources; a minimal matching sketch follows)
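A minimal sketch (in Python, not the actual OpenMinTeD implementation) of how such documented properties could drive the automatic discovery of compatibility between content and a s/w component; the property names (language, mimetype, inputLanguages, inputMimetypes) are illustrative assumptions:

# Illustrative compatibility check between a corpus and a component,
# based purely on their documented metadata properties.
corpus = {"language": "en", "mimetype": "text/plain"}

component = {
    "inputLanguages": ["en", "de"],     # languages the component can process
    "inputMimetypes": ["text/plain"],   # formats accepted as input
}

def compatible(corpus, component):
    """Return True if the corpus can be fed to the component as-is."""
    return (corpus["language"] in component["inputLanguages"]
            and corpus["mimetype"] in component["inputMimetypes"])

print(compatible(corpus, component))  # True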
9. Goal: achieve interoperability
• per language resource type
• across language resource types
Problems
• various metadata schemas
• various communities
semantics!!
crosswalks/mappings/semantic links
10. • Need to define a common core vocabulary for the description of
the resource properties, e.g.
• language of a publication/corpus & language of the input of a s/w
component
• domain/subject of a publication/corpus & domain/subject of an
ontology that can be used to annotate it
but
• how can we select the "common denominator" from all the
schemas?
• gaps in original metadata records deemed important for TDM
• wealth of original records & loss of information
• mismatches between metadata elements/values
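For illustration only, a possible shape of such a crosswalk in Python, assuming simplified Dublin-Core- and DataCite-style element names; the actual mappings belong to the OMTD-SHARE documentation, and the "leftovers" make the potential loss of information from the original records explicit:

# Hypothetical crosswalk from source schema elements to a common core vocabulary.
CROSSWALK = {
    "dc:language":     "resourceLanguage",
    "dc:subject":      "subjectClassification",
    "datacite:rights": "licence",
    "datacite:title":  "resourceName",
}

def to_core(record):
    """Map recognised source elements onto the core; keep the rest aside."""
    core, leftovers = {}, {}
    for key, value in record.items():
        if key in CROSSWALK:
            core[CROSSWALK[key]] = value
        else:
            leftovers[key] = value   # wealth of the original record, not discarded
    return core, leftovers

core, rest = to_core({"dc:language": "en", "dc:creator": "Doe, Jane"})
print(core)   # {'resourceLanguage': 'en'}
print(rest)   # {'dc:creator': 'Doe, Jane'}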
11. • organize the schema elements and accommodate common vs. particular features of resources
• be flexible enough to support varying degrees of documentation completeness
• cover documentation needs of all resource types involved in TDM
• cover needs of resource discoverability and TDM processing
• reuse what is available vs. create and recommend new elements and values
• document processing procedure and outputs
• standardize/normalize user input vs. allow for free user input
12. • OMTD Deliverable D5.2 - Interoperability Requirements Specification
[soon to be made publicly available]
• scenarios & use cases targeted by OMTD in the Areas of: scholarly
communication, life sciences, agriculture & biodiversity, social
sciences
• overview of relevant metadata schemas (e.g. OpenAIRE, CORE,
RIOXX guidelines, CrossRef, MetaShare, DataCite, DCAT, CMDI
relevant metadata profiles etc.) – cf. OMTD Deliverable D5.1 -
Interoperability Landscaping Report
19. • obligatory: record what is necessary for the intended purposes vs. what is easy to document,
• e.g. language for scholarly publications but title and author??, format and subject
of a document??
• recommended: features that can help the user or future uses or that
users find useful but providers have not yet standardized,
• e.g. documentation / help files, attribution, citation papers
• optional: all remaining information related to the lifecycle of a
resource
• e.g. funding information (still: funding agencies are becoming more and more interested in it!), projects where the resources have been used, and outputs created from them
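As a sketch of how these three levels could be enforced at ingestion time (the element names and their grouping into levels are assumptions for illustration, not the normative lists):

# Missing obligatory elements block ingestion; missing recommended ones only warn.
OBLIGATORY  = {"resourceType", "language", "licence"}
RECOMMENDED = {"documentation", "citationPaper", "attributionText"}

def check(record):
    errors   = sorted(e for e in OBLIGATORY  if e not in record)
    warnings = sorted(e for e in RECOMMENDED if e not in record)
    return errors, warnings

errors, warnings = check({"resourceType": "corpus", "language": "en"})
print("missing obligatory:", errors)     # ['licence'] -> reject the record
print("missing recommended:", warnings)  # warn, but accept the record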
20. • organize the schema into semantically coherent elements
• common to all types of resources (e.g. identification, licensing etc.)
• per resource type
• re-usable for more than one resource type but not globally applicable (e.g. subject
classification) and
• strictly applied to specific resource types (e.g. evaluation for s/w components)
22. • identification & provenance of the metadata record
• metadata record identifier
• metadata creation date
• identification of the resource
• identifiers with identificationScheme (name/URI)
• title & description (multilingual; English should be there but ?)
• distribution & licensing/access
• distribution medium/format (e.g. executable code, downloadable text etc.)
• licence and/or rightsStatement (ongoing work)
• licence text or URL (provided by system for standard licences)
• contact information
• either email or landing page
• resource type (& subtype)
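A hypothetical record with roughly this minimal element set, built with the Python standard library; the element names echo the bullets above but are not the normative OMTD-SHARE XSD names:

import xml.etree.ElementTree as ET

record = ET.Element("metadataRecord")
ET.SubElement(record, "metadataRecordIdentifier").text = "omtd-0001"
ET.SubElement(record, "metadataCreationDate").text = "2017-03-01"

resource = ET.SubElement(record, "resource")
ET.SubElement(resource, "resourceType").text = "corpus"
ET.SubElement(resource, "title", {"lang": "en"}).text = "Sample scholarly corpus"
ET.SubElement(resource, "identifier", {"identificationScheme": "DOI"}).text = "10.1234/example"

distribution = ET.SubElement(resource, "distribution")
ET.SubElement(distribution, "distributionMedium").text = "downloadable"
ET.SubElement(distribution, "licence").text = "CC-BY-4.0"

ET.SubElement(resource, "contactEmail").text = "contact@example.org"

print(ET.tostring(record, encoding="unicode"))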
23. [Diagram: properties documented per resource type, including title, abstract / full text, authors, publisher / journal, language, classification, character encoding, format, size, annotation level, tagset, typesystem, annotation resource, algorithm, dependencies, and input / output content resource.]
24. • relations between resources can be encoded
• inside each metadata record (e.g. between publication & authors)
• separately from both metadata records (e.g. between component & model)
• implementation issues for optionality and restrictions: uniformity of metadata records across sources vs. better treatment of restrictions via the registry service; which restrictions should be in the schema and which should be in a system built on top of the schema?
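The two encoding options could look roughly like this (illustrative Python structures, not the normative serialisation):

# (a) relation encoded inside a single metadata record
publication = {
    "identifier": "10.1234/example",
    "authors": [{"fullName": "Doe, Jane", "orcid": "0000-0002-1825-0097"}],
}

# (b) relation kept separately from both metadata records, by identifier
relation = {
    "relationType": "isCreatedBy",
    "sourceResource": "omtd-model-7",        # e.g. a model
    "targetResource": "omtd-component-42",   # e.g. the s/w component that produced it
}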
25. • recommend and link to authority lists for properties
• format: IANA list of media types BUT need for extensions!
• language: ISO 639-3 vs. IETF BCP47
• subject classification: DDC, LCSH, EUROVOC, discipline-specific lists… we cannot enforce one scheme, so we recommend their use and ask for a reference to the scheme used; this is currently encoded as enumerations, but a link to an external source would be a better solution
• create elements & values in attested gaps & where considered best
for OMTD purposes
• classification of components, lexical/conceptual resources etc.
• annotation set of elements and values [to be included in the output resources automatically via the platform]
• links to be provided to elements in other metadata schemas (DataCite,
CrossRef, DCAT, etc.) (ongoing work)
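A sketch of value normalisation against such authority lists; the tiny lookup tables below stand in for the full IANA media type registry and the ISO 639-3 / BCP 47 code lists:

# Illustrative authority lists (heavily truncated).
IANA_MEDIA_TYPES = {"text/plain", "text/xml", "application/pdf"}
ISO_639_3        = {"eng", "deu", "ell"}
BCP47_TO_ISO     = {"en": "eng", "de": "deu", "el": "ell"}   # partial mapping

def normalise_language(value):
    """Accept either a BCP 47 tag or an ISO 639-3 code, return ISO 639-3."""
    code = BCP47_TO_ISO.get(value.lower(), value.lower())
    if code not in ISO_639_3:
        raise ValueError("unknown language code: " + value)
    return code

def known_mediatype(value):
    # Values beyond the IANA list would need a project-specific extension.
    return value in IANA_MEDIA_TYPES

print(normalise_language("en"))        # 'eng'
print(known_mediatype("text/plain"))   # True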
26. • adopt entire metadata schemas and registries for satellite entities
• repositories & registries: openDOAR & re3data
• journals: DOAJ
BUT
• persons: ORCID & SCOPUS id
• organisations: ISNI & FundRef
& covered with own metadata elements
27. • link to other resources or satellite entities via identifier (PID) or
descriptive elements: recommend but allow for backup solutions when
the identifier is not there
• identifier preferably from an authority source, with reference to it: DOI for publications, DataCite for datasets & services, ORCID for persons, ISNI or FundRef for organizations etc.
• but allow for other identifiers too: "identifierSchemeName" &
"identifierSchemeURL"
• descriptive elements: title, full name, etc.
• value system for elements
• e.g. free text vs. controlled vocabularies
• represented as enumeration
• semantics of closed & open vocabularies
• open vocabularies with the additional value "other" but … how can one add values
and yet curate the vocabularies??
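A sketch of the "PID first, descriptive fallback" policy for satellite entities; the identifierSchemeName / identifierSchemeURL elements follow the slide, the rest is an illustrative assumption:

def describe_person(orcid=None, full_name=None):
    """Prefer a persistent identifier; fall back to descriptive elements."""
    if orcid:
        return {"identifier": orcid,
                "identifierSchemeName": "ORCID",
                "identifierSchemeURL": "https://orcid.org"}
    if full_name:
        # backup solution when no persistent identifier exists
        return {"fullName": full_name}
    raise ValueError("either an identifier or descriptive elements are required")

print(describe_person(orcid="0000-0002-1825-0097"))
print(describe_person(full_name="Doe, Jane"))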
28. • annotation set of elements:
• set of elements and values that can be added independently as a block to each
resource following the processing
• information on s/w component(s), type of annotation, tagsets, annotation
resources, annotators, format etc.
• covering provenance requirements but also to be used as input for further
processing
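A sketch of what such a block might look like when attached to an output resource after processing; all field names here are illustrative rather than normative:

def annotation_block(component, annotation_level, tagset, out_format):
    """Provenance-style block added by the platform after a processing step."""
    return {
        "isAnnotatedBy": component,           # s/w component that produced the output
        "annotationLevel": annotation_level,  # e.g. part-of-speech, named entities
        "tagset": tagset,
        "format": out_format,
    }

output_record = {"resourceType": "corpus", "title": "Sample corpus (annotated)"}
output_record["annotationInfo"] = annotation_block(
    "omtd-component-42", "part-of-speech", "Universal POS tags", "XMI")
print(output_record)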
29. • XSD schemas v1.0.0 & documentation:
https://openminted.github.io/openminted-site/releases/omtd-share/1.0.0/html/index.html
• Guidelines: on the way!
• Conversions from existing descriptors (ongoing work)
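If the published XSD is downloaded locally (assumed here as omtd-share.xsd) and the third-party lxml package is installed, a record can be validated along these lines:

from lxml import etree

schema = etree.XMLSchema(etree.parse("omtd-share.xsd"))
record = etree.parse("my-record.xml")

if schema.validate(record):
    print("record is valid against the OMTD-SHARE schema")
else:
    for error in schema.error_log:
        print(error.message)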
30. • not all information is available (e.g. licence, direct link to publication
contents, language of metadata fields, subject etc.)
• different approach between schemas (element vs. attribute)
• lack of a common API approach (as OAI-PMH across repositories)
• different mechanisms for flagging OA content
• inconsistent provision of full text links (incl. in CrossRef TDM)
• legal and technical issues around systematic full text aggregation
from publishers (including via CrossRef TDM)
• full text harvesting/crawling limits in place on publisher endpoints
• lack of support for discovery of new content
• lack of documentation on publisher systems
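For context, a minimal OAI-PMH ListRecords request, the kind of common API the slide wishes more providers exposed; the endpoint URL is a placeholder and the third-party requests package is assumed:

import requests
import xml.etree.ElementTree as ET

OAI_ENDPOINT = "https://repository.example.org/oai"   # hypothetical endpoint
params = {"verb": "ListRecords", "metadataPrefix": "oai_dc"}

response = requests.get(OAI_ENDPOINT, params=params, timeout=30)
tree = ET.fromstring(response.content)

ns = {"oai": "http://www.openarchives.org/OAI/2.0/"}
for record in tree.findall(".//oai:record", ns):
    identifier = record.find(".//oai:identifier", ns)
    print(identifier.text if identifier is not None else "(no identifier)")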
31. • largely technical information
• some non-technical information possible but seldom used (e.g.
developer information in Maven) - why?
• technical elements present but in many cases possible values not
restricted (e.g. media-type or language)
• "persistent identifier" e.g. in Maven is self-assigned and global
uniqueness is not enforced but governed by best-practice in contrast
to e.g. centrally assigned DOI- good or bad or tolerable?
32. • the closest schema to OMTD-SHARE (for obvious reasons)
• resource types converted: corpora, components, lexical/conceptual
resources & models
• main problems were the lack of persistent identifiers and the
decisions taken for further standardization/normalisation