Lost In Translation when machines meet STM content
This presentation was designed to be delivered live. To help you
understand the content we have added these notes…
Behind the shared vision held by the partners of the Resource
Identification Initiative (RII) lie a number of questions. …
REPRODUCIBILITY: In the scientific community it is difficult to find
objective qualitative information about research materials. Choosing the
wrong products often means failed experiments…
Where is it?
EFFICIENCY: Poor resource visibility means that labs around the world
waste thousands of man-hours duplicating each other's work. Greater
visibility of the research materials already produced would dramatically
improve efficiency in science and reduce waste…
CONNECTIVITY: By its nature, science is a collaborative endeavour.
Efficiently identifying knowledge and expertise when required is key to
progressing discovery…
The Role of
Research content has evolved over time as a means of communicating
conclusions. However, the real untapped value in content is the
information about the journey…
Which one? Who has used it? Where is it?
Every article contains valuable information about experimental
procedures and materials. When cross-referenced with location, author
and time data, powerful new experimental and research insights are
revealed…
Today's research articles are designed to be read one at a time by
humans. To cross-reference them we rely on our note-taking, memory and
prior knowledge. Machines have the potential to dramatically improve
the efficiency of how we glean insight from content. But…
1) Every publisher has slightly different XML standards.
2) The vocabulary for describing research entities is ambiguous.
3) There is a poor culture of facilitating data mining and enforcing best
annotation practice in the publishing industry.
The XML produced by different publishers can be significantly different.
This makes indexing and analysing content at scale challenging…
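To make the indexing problem concrete, here is a minimal sketch in Python. It assumes two invented XML layouts (neither reflects a real publisher's schema) and a hypothetical per-layout mapping, LAYOUTS, that tells the indexer where the title and methods paragraphs live. Every additional publisher layout needs its own mapping, which is where the cost at scale comes from.

```python
import xml.etree.ElementTree as ET

# Two invented XML fragments standing in for two publishers' markup.
# Neither reflects any real publisher's schema.
PUBLISHER_A = """<article>
  <front><title>Bead beating of yeast cells</title></front>
  <body><sec type="methods"><p>Cells were lysed with glass beads.</p></sec></body>
</article>"""

PUBLISHER_B = """<doc>
  <head><article-title>Bead beating of yeast cells</article-title></head>
  <content><section name="Materials and Methods">
    <para>Cells were lysed with glass beads.</para>
  </section></content>
</doc>"""

# Hypothetical mapping, keyed by root tag: where each layout keeps
# the article title and the methods paragraphs.
LAYOUTS = {
    "article": {"title": "front/title", "methods": "body/sec[@type='methods']/p"},
    "doc": {"title": "head/article-title", "methods": "content/section/para"},
}


def normalise(xml_text):
    """Convert one publisher-specific document into a common record."""
    root = ET.fromstring(xml_text)
    layout = LAYOUTS[root.tag]
    title = root.findtext(layout["title"], default="")
    methods = [p.text or "" for p in root.findall(layout["methods"])]
    return {"title": title, "methods": methods}


records = [normalise(PUBLISHER_A), normalise(PUBLISHER_B)]
```

Both fragments describe the same article, so both calls produce the same common record, but only because a mapping was written by hand for each layout.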
Insufficient annotation and naming in content makes it difficult to
disambiguate material entities. Take this glass beads example…
Sigma produces at least 5 variations of glass beads. Which version is
being referred to in the article?
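The ambiguity can be sketched in a few lines of Python. The catalogue below is entirely invented (the supplier name, variants and catalogue numbers are placeholders, not real product data), and candidates is a hypothetical helper that returns every product a sentence could be referring to.

```python
# Hypothetical catalogue: supplier, variants and catalogue numbers are
# invented for illustration and are not real product data.
CATALOGUE = [
    {"supplier": "ExampleCo", "name": "glass beads",
     "variant": "acid-washed, small", "cat_no": "EX-0001"},
    {"supplier": "ExampleCo", "name": "glass beads",
     "variant": "acid-washed, large", "cat_no": "EX-0002"},
    {"supplier": "ExampleCo", "name": "glass beads",
     "variant": "unwashed, large", "cat_no": "EX-0003"},
]


def candidates(sentence):
    """Return every catalogue entry the sentence could be referring to."""
    text = sentence.lower()
    # A cited catalogue number pins the mention to a single product.
    cited = [p for p in CATALOGUE if p["cat_no"].lower() in text]
    if cited:
        return cited
    # Otherwise fall back to supplier plus product name, which may match many.
    return [p for p in CATALOGUE
            if p["supplier"].lower() in text and p["name"] in text]


vague = candidates("Cells were lysed with glass beads (ExampleCo).")
precise = candidates("Cells were lysed with glass beads (ExampleCo, cat. no. EX-0002).")
```

With only a supplier and a product name, every variant remains a candidate. A single unambiguous identifier in the sentence narrows the match to one entry.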
Publishers have traditionally made money by attracting great content and
selling access to as many people as possible. Advances in technology
have largely been viewed by publishers as a means to do more of the
same at a lower cost. Publishers have been slow to adopt practices that
make their content machine accessible…
The RII is backed by a wide group of interests working together to change
how experimental resources are documented in new research.
The group includes publishers, academic groups, funding agencies,
resource repositories and commercial companies…
The group has a number of shared goals with the aim of improving the
machine accessibility of STM content in a practical and sustainable way…
1. Unique Identifier
1) By agreeing on and assigning standard unique identifiers for all known
research materials (commercial and non-commercial)…
2. Editor Awareness
• Drive adoption
• Better XML standards
• Content machine friendly
2) By working with publishers and other community members to
encourage the inclusion of unique identifiers at the authoring stage and
devising strategies for XML standardisation…
3) By developing technology and APIs to disseminate research material
information in a standardised form…
4) While NIF is focussing on research material annotation at the
prepublication stage, scrazzl is working on a separate initiative to drive
retrospective annotation of published research…
5) One of the main aims of the RII is to support the adoption of a
standardised public research material ontology and vocabulary that is
interoperable with other existing biological ontologies…
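Goals 1) and 2) above depend on identifiers that a machine can pick out of running text. The Python sketch below shows the idea under a loud assumption: the "RID:" syntax, the example identifiers and the helper names are all invented for illustration and do not represent the identifier format agreed by the RII.

```python
import re

# Assumed identifier syntax, for illustration only:
# "RID:" + upper-case prefix + "_" + digits. Not the RII's actual format.
ID_PATTERN = re.compile(r"\bRID:[A-Z]+_\d+\b")


def extract_identifiers(text):
    """Return every identifier found in the text, in order of appearance."""
    return ID_PATTERN.findall(text)


def unannotated_sentences(text):
    """Return sentences that contain no identifier at all."""
    sentences = re.split(r"(?<=\.)\s+", text.strip())
    return [s for s in sentences if not ID_PATTERN.search(s)]


methods = (
    "Lysates were probed with an anti-tubulin antibody (RID:AB_000001) "
    "and analysed with ExampleSoft (RID:SW_000002). "
    "Glass beads were from a commercial supplier."
)

found = extract_identifiers(methods)
missing = unannotated_sentences(methods)
```

Here found holds the two identifiers, while missing flags the glass beads sentence as something an editor or authoring tool could query before publication.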
So what does success look like?
Every new article published will contain unique identifiers either in the
visible text or in the underlying metadata. This will improve machine
readability and will dramatically improve the semantic connectivity of
Data-driven qualitative metrics of material entities will be available,
improving reproducibility and driving efficiency…
Improved geo- and time-dependent visualisation of resource availability
will be possible. Finding where resources are and identifying key technical
experts will be more efficient…