Text Data Mining: Unlocking the hidden potential from scholarly content.

The document discusses opportunities for text and data mining (TDM) across the scholarly publishing process. TDM can work at any stage, from manuscript drafting and screening to promoting published articles. This includes automating metadata extraction to populate submission systems, extracting data to assist with manuscript screening and peer review, and exposing older content by building out open citation networks that link archived research. However, for TDM to realize its full potential, publishers need to make XML versions of articles more broadly available, enrich citation networks with additional machine-readable content, and allow authors to write articles natively in structured formats that facilitate text mining.

1
TDM: Unlocking the hidden
potential from scholarly
content

2
Until recently, text mining has mostly been
restricted to post-publication PDFs and has
proved slow and difficult. The focus for scholarly
content has often been limited to metadata and
abstracts.
TDM is evolving to extract a wealth of
information that can support the entire scholarly
community – from authors to publishers.
Making sense of unstructured
content

4
6% YoY growth in manuscript submissions
42% of authors post their preprint before
journal submission
300% increase in the number of preprint
servers since 2015
The research keeps growing
Published work and preprints

5
Too many manuscripts. Not enough time.
Submission to publication time expanding.
Screening: 48 hours
First review round: 13 weeks
Submission to publication: 400 days

6
XML is often made available for Open Access articles, but not all publishers make XML
available to TDM services (API).
The rise of preprint servers, and of journals inviting article submission via these
servers, increases the need to mine non-XML content.
Most authors still submit manuscripts to publishers & preprint servers in Word or
PDF.
Some servers convert content into XML, but the majority of platforms only allow the
preprint to be downloaded in the format in which it was uploaded.
The format challenge

7
Software used by authors
Word still the preferred format
Writing software used by authors submitting to bioRxiv.
Source: Sever et al (2019) bioRxiv: the preprint server for biology. https://dx.doi.org/10.1101/833400

9
Extracting structured content from any document
Dixon WG, Beukenhorst AL, Yimer BB et al. 2019. doi:10.1038/s41746-019-0180-3
Content extracted to a structured format

10
Distilling research into headlines and key information
Rosyadi S, Haryanto A. 2019. doi:10.31124/advance.9989639.v1
Distillation to unified format

12
Manuscript submission
Manuscript screening
Peer review
Promotion
TDM: What are the opportunities?
TDM can work at any stage of the publishing process, opening up a huge number of opportunities from
manuscript drafting and screening to promoting the published article.

13
• Metadata extraction to automate population of the submission system (title, authors, affiliations, abstract, keywords).
• Reduces author friction / duplication of effort.
• Previous work in this area has focused on the biomedical domain, but this opportunity can apply to any domain.
Automating submissions process
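
As an illustration of the metadata extraction described on this slide, here is a minimal Python sketch. It assumes the manuscript has already been converted to JATS-style XML; the sample document, the function name `extract_submission_metadata`, and the returned field names are hypothetical and are not taken from the presentation. Affiliations are omitted to keep the sketch short.

```python
import xml.etree.ElementTree as ET

# Hypothetical, minimal JATS-style record used only for illustration.
SAMPLE_JATS = """<article>
  <front>
    <article-meta>
      <title-group><article-title>Example title</article-title></title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <name><surname>Doe</surname><given-names>Jane</given-names></name>
        </contrib>
        <contrib contrib-type="author">
          <name><surname>Roe</surname><given-names>Richard</given-names></name>
        </contrib>
      </contrib-group>
      <abstract><p>We describe an example.</p></abstract>
      <kwd-group><kwd>text mining</kwd><kwd>metadata</kwd></kwd-group>
    </article-meta>
  </front>
</article>"""


def _text(element):
    """Return all text inside an element, whitespace-normalised ('' if missing)."""
    if element is None:
        return ""
    return " ".join("".join(element.itertext()).split())


def extract_submission_metadata(xml_string):
    """Pull the fields a submission form typically asks for out of JATS-style XML."""
    root = ET.fromstring(xml_string)
    authors = []
    for contrib in root.findall(".//contrib"):
        # Keep only contributors explicitly marked as authors.
        if contrib.get("contrib-type") != "author":
            continue
        given = _text(contrib.find(".//given-names"))
        surname = _text(contrib.find(".//surname"))
        authors.append(" ".join(part for part in (given, surname) if part))
    return {
        "title": _text(root.find(".//article-title")),
        "authors": authors,
        "abstract": _text(root.find(".//abstract")),
        "keywords": [_text(kwd) for kwd in root.findall(".//kwd")],
    }


if __name__ == "__main__":
    print(extract_submission_metadata(SAMPLE_JATS))
```

The resulting dictionary could be used to pre-fill a submission form so that authors only need to confirm the values. Manuscripts submitted as Word or PDF would first need a conversion step, which this sketch does not cover.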

14
• Data extraction for manuscript screening (key methods, results, sample size, participants, ethical compliance etc.)
• Clear article context/overview for reviewers.
• One-click access to cited sources & main findings.
• Table extraction for analysis of statistical calculations.
Speeding up peer review
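
A rough sketch of the kind of screening extraction listed on this slide, assuming plain text has already been extracted from the manuscript. The regular expressions, the phrase list, and the function name `screen_manuscript` are illustrative assumptions, not the method used in the presentation; a production system would need far more robust extraction.

```python
import re

# Heuristic: sample sizes reported as "n = 120" or "N=1,204".
SAMPLE_SIZE_RE = re.compile(r"\b[nN]\s*=\s*(\d[\d,]*)")

# Heuristic: phrases that usually signal an ethics statement (assumed list).
ETHICS_RE = re.compile(
    r"\b(ethics committee|ethical approval|institutional review board|informed consent)\b",
    re.IGNORECASE,
)


def screen_manuscript(text):
    """Return reported sample sizes and ethics-related phrases found in the text."""
    sizes = [int(match.replace(",", "")) for match in SAMPLE_SIZE_RE.findall(text)]
    ethics = sorted({match.lower() for match in ETHICS_RE.findall(text)})
    return {
        "sample_sizes": sizes,
        "ethics_mentions": ethics,
        "has_ethics_statement": bool(ethics),
    }


if __name__ == "__main__":
    example = (
        "We recruited participants (n = 1,204) and a control group (N=87). "
        "The study was approved by the institutional review board and all "
        "participants gave informed consent."
    )
    print(screen_manuscript(example))
```

A report like this could flag manuscripts with no detectable ethics statement or sample size for closer editorial attention before they go out to reviewers.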

15
Surfacing cited sources & their main findings
Krohn L, Ruskey JA, Rudakou U et al. 2019. doi:10.1101/19010991
Cited sources and their main findings surfaced

16
• Extract, parse and link citations from archives dating back hundreds of years.
• Large-scale reference population of open citation networks (BMJ case study).
• Improve exposure/discovery of older research.
Exposing more content through
citation networks
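
To make the "extract, parse and link" step concrete, here is a small Python sketch that pulls DOIs out of free-text reference strings and records them as edges in a citation graph. The DOI pattern is a common heuristic rather than a complete grammar, and the function names and the `example-article` identifier are hypothetical. References without a DOI (typical of older archives) would need fuzzy matching against a bibliographic database, which is not shown.

```python
import re
from collections import defaultdict

# Heuristic DOI pattern: "10.", a 4-9 digit registrant code, "/", then a suffix.
DOI_RE = re.compile(r"10\.\d{4,9}/[^\s\"<>]+")


def extract_doi(reference):
    """Return the first DOI found in a reference string, or None."""
    match = DOI_RE.search(reference)
    if match is None:
        return None
    # Drop trailing punctuation that belongs to the sentence, not the DOI.
    return match.group(0).rstrip(".,;)").lower()


def build_citation_links(citing_id, references, graph=None):
    """Add edges citing_id -> cited DOI for every reference with a DOI."""
    graph = defaultdict(set) if graph is None else graph
    for reference in references:
        doi = extract_doi(reference)
        if doi:
            graph[citing_id].add(doi)
    return graph


if __name__ == "__main__":
    # Reference strings as they appear elsewhere in this deck.
    references = [
        "Sever et al (2019) bioRxiv: the preprint server for biology. "
        "https://dx.doi.org/10.1101/833400",
        "Krohn L, Ruskey JA, Rudakou U et al. 2019. doi:10.1101/19010991",
    ]
    graph = build_citation_links("example-article", references)
    print(dict(graph))
```

Storing links as a mapping from citing article to cited DOIs keeps the structure easy to export to an open citation dataset.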

18
How publishers can help.
1. Make XML, not just the final PDF, available for text mining of all Open Access articles.
2. Enrich citation networks with additional content (e.g. abstract, highlights) in a machine-readable format.
3. Make all cited sources more easily verifiable for authors and researchers.
4. Convert articles & preprints into a universally structured format for more effective TDM. Allow authors to write articles natively in a machine-readable format.

19
…equal rights for friendly bots!
And finally…




A Pragmatic Approach to Facilitating Text and Data Mining

Chris Shillum

From Open Access to Open Data

Brian Hole

The document discusses moving from open access to open data in scientific publishing. It outlines the social contract of science which involves validation, dissemination and further development of research. When these principles are not followed, it can constitute scientific malpractice by various stakeholders. The presentation advocates for data journals as an incentive that can help recognize data as a valid research output and encourage data sharing by providing metrics like citations. It provides details on what constitutes a data paper and reviews factors like peer review that are important for data journals to be successful.

Research Data Publishing

Brian Hole

The document discusses research data publishing and evaluating the impact of data. It summarizes the results of an RDA survey which found that researchers currently use a variety of methods to evaluate data impact, including citation counts, downloads, and mentions in papers. However, many felt current methods are inadequate. Researchers want standardized data citation practices and metrics in the future. The document also describes Ubiquity Press's approach to publishing research data, which aims to make data publication easy and encourage data sharing through open access policies and peer review of the data itself.

Simons orcid forum canberra 2018-PIDs in research

ARDC

OpenAIRE and Eudat services and tools to support FAIR DMP implementation

Research Data Alliance

The document provides an overview of the Open Research Data Pilot, the data management plan, and OPENAIRE tools and services to support implementation of FAIR data management plans. It discusses the aims of the Open Research Data Pilot, which Horizon 2020 projects are required to participate, and the types of data that must be deposited. It also covers topics like creating a data management plan, selecting a repository, making data FAIR, and OPENAIRE support resources like briefing papers, webinars, and the Zenodo repository.

OpenAIRE and Eudat services and tools to support FAIR DMP implementation

Research Data Alliance

Text Data Mining: Unlocking the hidden potential from scholarly content.

1. TDM: Unlocking the hidden potential from scholarly content

2. Making sense of unstructured content: Until recently, text mining has mostly been restricted to post-publication PDFs and has proved slow and difficult. The focus for scholarly content has often been limited to metadata and abstracts. TDM is evolving to extract a wealth of information that can support the entire scholarly community, from authors to publishers.

3. Landscape

4. The research keeps growing (published work and preprints): 6% YoY growth in manuscript submissions; 42% of authors post their preprint before journal submission; 300% increase in the number of preprint servers since 2015.

5. Too many manuscripts. Not enough time. Submission-to-publication time is expanding. Stages shown: screening, first review round, submission to publication. Durations shown: 48 hours, 13 weeks, 400 days.

6. The format challenge: XML is often made available for Open Access articles, but not all publishers make XML available to TDM services (API). The rise of preprint servers, and of journals inviting article submission via these servers, increases the need to mine non-XML content. Most authors still submit manuscripts to publishers and preprint servers in Word or PDF. Some servers convert content into XML, but the majority of platforms only allow the preprint to be downloaded in the format in which it was uploaded.

7. Software used by authors: Word is still the preferred format. Figure: writing software used by authors submitting to bioRxiv. Source: Sever et al. (2019), bioRxiv: the preprint server for biology. https://dx.doi.org/10.1101/833400

8. Format shouldn’t matter

9. Extracting structured content from any document. Example: Dixon WG, Beukenhorst AL, Yimer BB et al. 2019. doi:10.1038/s41746-019-0180-3 (content extracted to a structured format).

10. Distilling research into headlines and key information. Example: Rosyadi S, Haryanto A. 2019. doi:10.31124/advance.9989639.v1 (distillation to a unified format).

11. Opportunities

12. TDM: What are the opportunities? Stages shown: manuscript submission, manuscript screening, peer review, promotion. TDM can work at any stage of the publishing process, opening up a huge number of opportunities from manuscript drafting and screening to promoting the published article.

13. Automating the submissions process: • Metadata extraction to automate population of the submission system (title, authors, affiliations, abstract, keywords). • Reduces author friction and duplication of effort. • Previous work in this area has focused on the biomedical domain, but this opportunity can apply to any domain.
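As an illustration of the metadata extraction described in slide 13, here is a minimal Python sketch. It is not the presenters' method: it assumes the manuscript has already been converted to plain text, that the first non-empty line is the title and the second is the author list, and that the abstract and keywords are introduced by "Abstract" and "Keywords" labels. The `SubmissionMetadata` class and `extract_metadata` function are hypothetical names, and affiliations are left out because they rarely follow a simple pattern.

```python
import re
from dataclasses import dataclass, field


@dataclass
class SubmissionMetadata:
    """Fields a submission form might be pre-filled with (hypothetical)."""
    title: str = ""
    authors: list[str] = field(default_factory=list)
    abstract: str = ""
    keywords: list[str] = field(default_factory=list)


def extract_metadata(text: str) -> SubmissionMetadata:
    """Heuristically pull title, authors, abstract and keywords from plain text."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    meta = SubmissionMetadata()

    # Assumption: line 1 is the title, line 2 is the author list.
    if lines:
        meta.title = lines[0]
    if len(lines) > 1:
        parts = re.split(r",|\band\b", lines[1])
        meta.authors = [p.strip() for p in parts if p.strip()]

    # Abstract: everything after an "Abstract" label up to "Keywords" or the end.
    abstract = re.search(
        r"^abstract\s*:?\s*(.+?)(?=^keywords\b|\Z)",
        text,
        flags=re.IGNORECASE | re.MULTILINE | re.DOTALL,
    )
    if abstract:
        meta.abstract = " ".join(abstract.group(1).split())

    # Keywords: a single line such as "Keywords: a, b; c".
    keywords = re.search(
        r"^keywords\s*:?\s*(.+)$", text, flags=re.IGNORECASE | re.MULTILINE
    )
    if keywords:
        meta.keywords = [
            k.strip() for k in re.split(r"[,;]", keywords.group(1)) if k.strip()
        ]
    return meta
```

A production system would more likely rely on layout-aware parsing or a trained model; the fixed-position heuristics here only show the shape of the output that a submission form could consume.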

14. Speeding up peer review: • Data extraction for manuscript screening (key methods, results, sample size, participants, ethical compliance, etc.). • Clear article context/overview for reviewers. • One-click access to cited sources and main findings. • Table extraction for analysis of statistical calculations.
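Two of the screening signals named in slide 14, sample size and ethical compliance, can be sketched with simple pattern matching. This is a rough illustration under stated assumptions, not the approach used in the presentation: it assumes sample sizes are reported as "n = …" and that ethics statements use one of a few common phrases. The `screen_manuscript` function and both patterns are hypothetical.

```python
import re

# Assumption: sample sizes appear as "n = 48" or "N=1,250".
SAMPLE_SIZE = re.compile(r"\b[nN]\s*=\s*(\d[\d,]*)")

# Assumption: a short list of phrases that usually signal an ethics statement.
ETHICS = re.compile(
    r"\b(ethic(?:s|al)\s+(?:approval|committee|review)"
    r"|institutional review board|informed consent)\b",
    re.IGNORECASE,
)


def screen_manuscript(text: str) -> dict:
    """Return reported sample sizes and the ethics phrases found in the text."""
    sizes = [int(raw.replace(",", "")) for raw in SAMPLE_SIZE.findall(text)]
    ethics = sorted({m.group(0).lower() for m in ETHICS.finditer(text)})
    return {"sample_sizes": sizes, "ethics_mentions": ethics}
```

An empty `ethics_mentions` list does not prove a statement is missing; it only flags the manuscript for a closer look by an editor.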

15. Surfacing cited sources & their main findings. Example: Krohn L, Ruskey JA, Rudakou U et al. 2019. doi:10.1101/19010991

16. Exposing more content through citation networks: • Extract, parse and link citations from archives dating back hundreds of years. • Large-scale reference population of open citation networks (BMJ case study). • Improve exposure/discovery of older research.
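The "extract, parse and link" step in slide 16 can be sketched for the easy case where references already carry DOIs. This is a simplified illustration, not the pipeline from the BMJ case study: older archive references usually have no DOI and would need reference parsing and matching against a bibliographic database first. The `extract_dois` and `build_citation_network` functions and the DOI pattern are assumptions made for this example.

```python
import re
from collections import defaultdict

# Assumption: a permissive pattern for DOI-like strings ("10.<registrant>/<suffix>").
DOI_PATTERN = re.compile(r"\b10\.\d{4,9}/[^\s\"<>]+")


def extract_dois(reference_text: str) -> list[str]:
    """Return the distinct DOIs in a block of reference text, in order of appearance."""
    dois: list[str] = []
    for match in DOI_PATTERN.findall(reference_text):
        doi = match.rstrip(".,;)").lower()  # drop trailing sentence punctuation
        if doi not in dois:
            dois.append(doi)
    return dois


def build_citation_network(articles: dict[str, str]) -> dict[str, set[str]]:
    """Map each cited DOI to the set of citing article identifiers.

    `articles` maps a citing article's identifier to its raw reference text.
    """
    cited_by: defaultdict[str, set[str]] = defaultdict(set)
    for citing, references in articles.items():
        for cited in extract_dois(references):
            cited_by[cited].add(citing)
    return dict(cited_by)
```

Inverting the links into a "cited by" map is what makes older work discoverable: a reader who starts from a recent article can be pointed to everything that cites the same archived source.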

17. What’s needed?

18. How publishers can help: • Make XML available for all Open Access articles, rather than just the final PDF, for text mining. • Enrich citation networks with additional content (e.g. abstract, highlights) in a machine-readable format. • Make all cited sources more easily verifiable for authors and researchers. • Convert articles and preprints into a universally structured format for more effective TDM. • Allow authors to write articles natively in a machine-readable format.
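To show why slide 18 asks for XML rather than only the final PDF, the sketch below reads a title and abstract directly from structured markup using Python's standard library. The element names (`article-title`, `abstract`, `p`) follow JATS conventions, which is an assumption here since the slides do not name a schema; the sample document and the `read_article` function are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Invented sample using JATS-style element names (an assumption, see above).
SAMPLE_XML = """<article>
  <front>
    <article-meta>
      <title-group><article-title>An example article</article-title></title-group>
      <abstract><p>First point.</p><p>Second point.</p></abstract>
    </article-meta>
  </front>
</article>"""


def read_article(xml_text: str) -> dict:
    """Return the title and abstract text from a JATS-style XML string."""
    root = ET.fromstring(xml_text)
    title = root.findtext(".//article-title", default="")
    # itertext() also collects text nested inside inline markup such as <italic>.
    paragraphs = ["".join(p.itertext()).strip() for p in root.findall(".//abstract/p")]
    return {"title": title.strip(), "abstract": " ".join(paragraphs)}
```

With structured markup the fields are addressed by name, so no layout heuristics are needed; the same extraction from a PDF would have to infer the title and abstract from position and typography.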

19. And finally… equal rights for friendly bots!
