Text Data Mining: Unlocking the hidden potential from scholarly content.

•

0 likes•28 views

Emma Warren-Jones

Scholarcy's presentation at STM Innovations Seminar, London, 2019.

Technology

1
TDM: Unlocking the hidden
potential from scholarly
content

2
Until recently, text mining has mostly been
restricted to post-publication PDFs and has
proved slow and difficult. The focus for scholarly
content has often been limited to metadata and
abstracts.
TDM is evolving to extract a wealth of
information that can support the entire scholarly
community – from authors to publishers.
Making sense of unstructured
content

4
6% YoY growth in manuscript submissions
42% authors post their preprint before
journal submission
300% increase in the number of preprint
servers since 2015
The research keeps growing
Published work and preprints
6%
300%
42%

5
Too many manuscripts. Not enough time.
Submission to publication time expanding.
48 Hours
First review
round
Submission to
publication
Screening
13 Weeks 400 Days

6
XML often made available for Open Access articles, but not all publishers make XML
available to TDM services (API).
Rise of preprint servers and number of journals inviting article submission via these
servers increases need to mine non-XML content.
Most authors still submit manuscripts to publishers & preprint servers in Word or
PDF.
Some servers convert content into XML, but majority of platforms only allow for the
preprint to be downloaded in the same format it was uploaded in.
The format challenge

7
Software used by authors
Word still the preferred format
Writing software used by authors submitting to bioRxiv.
Source: Sever et al (2019) bioRxiv: the preprint server for biology. https://dx.doi.org/10.1101/833400

9
Extracting structured content from any document
Dixon WG, Beukenhorst AL, Yimer BB et al. 2019. doi:10.1038/s41746-019-
0180-3
Content extracted to a structured format

10
Distilling research into headlines and key information
Rosyadi S, Haryanto A. 2019. doi:10.31124/advance.9989639.v1 Distillation to unified format

12
Manuscript
submission
Manuscript
screening
Peer review
Promotion
TDM: What are the opportunities?
TDM can work at any stage of the publishing process, opening up a huge number of opportunities from
manuscript drafting and screening to promoting the published article.

13
• Metadata extraction to automate
population of submissions system (Title,
author, affiliations, abstract, keywords).
• Reduces author friction / duplication of
effort.
• Previous work in this area has focused on
the biomedical domain, but this
opportunity can apply to any domain.
Automating submissions process

14
• Data extraction for manuscript screening
(key methods, results, sample size,
participants, ethical compliance etc.)
• Clear article context/overview for
reviewers.
• One-click access of cited sources & main
findings.
• Table extraction for analysis of statistical
calculations.
Speeding up peer review

15
Surfacing cited sources & their main findings
Krohn L, Ruskey JA, Rudakou U et al. 2019. doi:10.1101/19010991 Cited sources and their main findings surfaced

16
• Extract, parse and link citations from
archives dating back hundreds of years.
• Large scale reference population of open
citation networks (BMJ Case study)
• Improve exposure/discovery of older
research.
Exposing more content through
citation networks

18
How publishers can help.
Make XML available for all Open Access articles rather than just the final
PDF for text mining.
Enrich citation networks with additional content (e.g. abstract,
highlights) in a machine-readable format.
Make all cited sources more easily verifiable for authors and
researchers.
Converting articles & preprints into a universally structured format for
more effective TDM. Allow authors to write articles natively in a
machine-readable format.
1
2
3
4

19
…equal rights for friendly bots!
And finally…

What's hot

Data Citation: A Critical Role for PublishersBrian Hole

Data availability and feasibility of validation – A genomics case studyVerena139

2015 NISO Forum: The Future of Library Resource DiscoveryNational Information Standards Organization (NISO)

Citation Analysis for the Free, Online LiteratureBalachandar Radhakrishnan

UKSG 2018 Breakout - Trouble(shooting) with a capital T: how categorising and...UKSG: connecting the knowledge community

Capturing and Analyzing Publication, Citation and Usage Data for Contextual C...NASIG

Where you should publishJason Price, PhD

2015 NISO Forum: The Future of Library Resource DiscoveryNational Information Standards Organization (NISO)

Data Metadata and Data Citation - Emma Ganley (PLoS)National Information Standards Organization (NISO)

2015 NISO Forum: The Future of Library Resource DiscoveryNational Information Standards Organization (NISO)

COVID-19 and Changing Paradigm in Scholarly communication Vasantha Raju N

Oct 14 NISO Webinar: Cloud and Web Services for LibrariesNational Information Standards Organization (NISO)

2015 NISO Forum: The Future of Library Resource DiscoveryNational Information Standards Organization (NISO)

How Accessible Is Our Collection? Performing an E-Resources Accessibility ReviewNASIG

Advancing the International Plant Names Index (IPNI) nickyn

CI4CC sustainability-panelRavi Madduri

Fox-Keynote-Now and Now of Data Publishing-nfdp13DataDryad

The Intersection of InterLibrary Loan and Acquisition Models: A review of rec...NASIG

Biosharing sansone-dryad-may13Susanna-Assunta Sansone

2 flash presentations for annual meeting tdm and cross check finalCrossref

What's hot (20)

Data Citation: A Critical Role for Publishers

Data availability and feasibility of validation – A genomics case study

2015 NISO Forum: The Future of Library Resource Discovery

Citation Analysis for the Free, Online Literature

UKSG 2018 Breakout - Trouble(shooting) with a capital T: how categorising and...

Capturing and Analyzing Publication, Citation and Usage Data for Contextual C...

Where you should publish

2015 NISO Forum: The Future of Library Resource Discovery

Data Metadata and Data Citation - Emma Ganley (PLoS)

2015 NISO Forum: The Future of Library Resource Discovery

COVID-19 and Changing Paradigm in Scholarly communication

Oct 14 NISO Webinar: Cloud and Web Services for Libraries

2015 NISO Forum: The Future of Library Resource Discovery

How Accessible Is Our Collection? Performing an E-Resources Accessibility Review

Advancing the International Plant Names Index (IPNI)

CI4CC sustainability-panel

Fox-Keynote-Now and Now of Data Publishing-nfdp13

The Intersection of InterLibrary Loan and Acquisition Models: A review of rec...

Biosharing sansone-dryad-may13

2 flash presentations for annual meeting tdm and cross check final

Similar to Text Data Mining: Unlocking the hidden potential from scholarly content.

OSFair2017 training | Machine accessibility of Open Access scientific publica...Open Science Fair

How can we ensure research data is re-usable? The role of Publishers in Resea...LEARN Project

UKSG 2018 Breakout - Setting your cites to open I4OC - MaccallumUKSG: connecting the knowledge community

Text Mining - Techniques & Limitations (A Pharmaceutical Industry Viewpoint)Frank Oellien

Data, Data Everywhere: What's A Publisher to Do?Anita de Waard

ALAMW14 Altmetrics Panel: Redefining Research ImpactWilliam Gunn

Elsevier - Smart Data and Algorithms for the Publishing IndustryAntonio Gulli

A scalable hybrid research paper recommender system for microaman341480

Engaging Information Professionals in the Process of Authoritative Interlinki...Lucy McKenna

CrossRef Text and Data MiningCrossref

NISO April 30th RA21 WebinarNational Information Standards Organization (NISO)

Better together: building services for public good on top of content from the...petrknoth

Supporting the ref5lshavald

A Pragmatic Approach to Facilitating Text and Data Mining Chris Shillum

From Open Access to Open DataBrian Hole

Research Data PublishingBrian Hole

Simons orcid forum canberra 2018-PIDs in researchARDC

OpenAIRE and Eudat services and tools to support FAIR DMP implementation Research Data Alliance

Similar to Text Data Mining: Unlocking the hidden potential from scholarly content. (20)

OSFair2017 training | Machine accessibility of Open Access scientific publica...

How can we ensure research data is re-usable? The role of Publishers in Resea...

UKSG 2018 Breakout - Setting your cites to open I4OC - Maccallum

Text Mining - Techniques & Limitations (A Pharmaceutical Industry Viewpoint)

Data, Data Everywhere: What's A Publisher to Do?

ALAMW14 Altmetrics Panel: Redefining Research Impact

Elsevier - Smart Data and Algorithms for the Publishing Industry

A scalable hybrid research paper recommender system for micro

Engaging Information Professionals in the Process of Authoritative Interlinki...

CrossRef Text and Data Mining

NISO April 30th RA21 Webinar

Better together: building services for public good on top of content from the...

Supporting the ref5

A Pragmatic Approach to Facilitating Text and Data Mining

From Open Access to Open Data

Research Data Publishing

Simons orcid forum canberra 2018-PIDs in research

OpenAIRE and Eudat services and tools to support FAIR DMP implementation

Recently uploaded

#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada

Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Wonjun Hwang

Key Features Of Token Development (1).pptxLBM Solutions

SIEMENS: RAPUNZEL – A Tale About Knowledge GraphNeo4j

Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski

Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada

Unlocking the Potential of the Cloud for IBM Power SystemsPrecisely

Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed

Science&tech:THE INFORMATION AGE STS.pdfjimielynbastida

Benefits Of Flutter Compared To Other FrameworksSoftradix Technologies

"Federated learning: out of reach no matter how close",Oleksandr LapshynFwdays

Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge

Artificial intelligence in the post-deep learning eraDeakin University

Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson

Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxnull - The Open Security Community

Build your next Gen AI Breakthrough - April 2024Neo4j

Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersThousandEyes

Pigging Solutions Piggable Sweeping ElbowsPigging Solutions

Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsAndrey Dotsenko

Snow Chain-Integrated Tire for a Safe Drive on Winter RoadsHyundai Motor Group

Recently uploaded (20)

#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024

Bun (KitWorks Team Study 노별마루 발표 2024.4.22)

Key Features Of Token Development (1).pptx

SIEMENS: RAPUNZEL – A Tale About Knowledge Graph

Integration and Automation in Practice: CI/CD in Mule Integration and Automat...

Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024

Unlocking the Potential of the Cloud for IBM Power Systems

Scanning the Internet for External Cloud Exposures via SSL Certs

Science&tech:THE INFORMATION AGE STS.pdf

Benefits Of Flutter Compared To Other Frameworks

"Federated learning: out of reach no matter how close",Oleksandr Lapshyn

Designing IA for AI - Information Architecture Conference 2024

Artificial intelligence in the post-deep learning era

Are Multi-Cloud and Serverless Good or Bad?

Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx

Build your next Gen AI Breakthrough - April 2024

Enhancing Worker Digital Experience: A Hands-on Workshop for Partners

Pigging Solutions Piggable Sweeping Elbows

Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics

Snow Chain-Integrated Tire for a Safe Drive on Winter Roads

Text Data Mining: Unlocking the hidden potential from scholarly content.

1. 1 TDM: Unlocking the hidden potential from scholarly content

2. 2 Until recently, text mining has mostly been restricted to post-publication PDFs and has proved slow and difficult. The focus for scholarly content has often been limited to metadata and abstracts. TDM is evolving to extract a wealth of information that can support the entire scholarly community – from authors to publishers. Making sense of unstructured content

3. 3 Landscape

4. 4 6% YoY growth in manuscript submissions 42% authors post their preprint before journal submission 300% increase in the number of preprint servers since 2015 The research keeps growing Published work and preprints 6% 300% 42%

5. 5 Too many manuscripts. Not enough time. Submission to publication time expanding. 48 Hours First review round Submission to publication Screening 13 Weeks 400 Days

6. 6 XML often made available for Open Access articles, but not all publishers make XML available to TDM services (API). Rise of preprint servers and number of journals inviting article submission via these servers increases need to mine non-XML content. Most authors still submit manuscripts to publishers & preprint servers in Word or PDF. Some servers convert content into XML, but majority of platforms only allow for the preprint to be downloaded in the same format it was uploaded in. The format challenge

7. 7 Software used by authors Word still the preferred format Writing software used by authors submitting to bioRxiv. Source: Sever et al (2019) bioRxiv: the preprint server for biology. https://dx.doi.org/10.1101/833400

8. 8 Format shouldn’t matter

9. 9 Extracting structured content from any document Dixon WG, Beukenhorst AL, Yimer BB et al. 2019. doi:10.1038/s41746-019- 0180-3 Content extracted to a structured format

10. 10 Distilling research into headlines and key information Rosyadi S, Haryanto A. 2019. doi:10.31124/advance.9989639.v1 Distillation to unified format

11. 11 Opportunities

12. 12 Manuscript submission Manuscript screening Peer review Promotion TDM: What are the opportunities? TDM can work at any stage of the publishing process, opening up a huge number of opportunities from manuscript drafting and screening to promoting the published article.

13. 13 • Metadata extraction to automate population of submissions system (Title, author, affiliations, abstract, keywords). • Reduces author friction / duplication of effort. • Previous work in this area has focused on the biomedical domain, but this opportunity can apply to any domain. Automating submissions process

14. 14 • Data extraction for manuscript screening (key methods, results, sample size, participants, ethical compliance etc.) • Clear article context/overview for reviewers. • One-click access of cited sources & main findings. • Table extraction for analysis of statistical calculations. Speeding up peer review

15. 15 Surfacing cited sources & their main findings Krohn L, Ruskey JA, Rudakou U et al. 2019. doi:10.1101/19010991 Cited sources and their main findings surfaced

16. 16 • Extract, parse and link citations from archives dating back hundreds of years. • Large scale reference population of open citation networks (BMJ Case study) • Improve exposure/discovery of older research. Exposing more content through citation networks

17. 17 What’s needed?

18. 18 How publishers can help. Make XML available for all Open Access articles rather than just the final PDF for text mining. Enrich citation networks with additional content (e.g. abstract, highlights) in a machine-readable format. Make all cited sources more easily verifiable for authors and researchers. Converting articles & preprints into a universally structured format for more effective TDM. Allow authors to write articles natively in a machine-readable format. 1 2 3 4

19. 19 …equal rights for friendly bots! And finally…

Text Data Mining: Unlocking the hidden potential from scholarly content.

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to Text Data Mining: Unlocking the hidden potential from scholarly content.

Similar to Text Data Mining: Unlocking the hidden potential from scholarly content. (20)

Recently uploaded

Recently uploaded (20)

Text Data Mining: Unlocking the hidden potential from scholarly content.