SlideShare a Scribd company logo
1 of 28
Download to read offline
Mendeley
22 September 2015
An Introduction to Text Mining
Research Papers
Petr Knoth
Phil Gooch
What is text mining?
‘the process of deriving high-quality information from text’ (Wikipedia)
‘the discovery by computer of new, previously unknown information, by
automatically extracting information from different written resources. A key
element is the linking ... of the extracted information ... to form new facts or
new hypotheses to be explored further’ (Hearst, 2003)
‘a burgeoning new field that attempts to glean meaningful information from
natural language text ... the process of analyzing text to extract information
that is useful for particular purposes’
(Witten, 2004)
Why text mine research papers?
“Research papers are the most complete representation of human knowledge.“
Why is it so much talked about now?
The idea is not new, but up until recently access to large amounts of research
papers was controlled by a handful of companies having bespoke
arrangements with publishers.
The Open Access movement has recently largely contributed to decreasing
the barriers to text-mining of research papers.
The availability of tools, developments in machine learning, and reduction in
the costs of computing power and storage, has removed some of the technical
and financial barriers.
What are the opportunities of text
mining research papers?
• Literature Based Discovery (LBD)
• Undiscovered Public Knowledge (Swanson, 1986).
• Mining relationships for which there is “hidden” evidence
in the research literature, yet they are not explicitly
stated, such as magnesium deficiency and migraine, fish
oil and Raynaud's disease.
• Swanson's discoveries simulated by automated
techniques (Weeber et al., 2001).
• Supporting exploratory access to research literature
• staying up-to-date with research
• analysing, comparing and contrasting research findings
• Summarisation of research findings
• Systematic literature review automation
• ‘snowballing’
• Question answering and semantic search from papers
• Understanding the research impact of articles, individuals,
institutions, countries, …
• Monitoring research trends
• Understanding how to direct research funding …
• Evidence of reuse and plagiarism detection …
Other use cases
Generic text mining workflow
Papers
Extraction and
text cleaning
Tokenisation
Sentence
splitting
Stemming,
lemmatisation
Knowledge
base
Parsing
Part of Speech
Tagging (POS)
Named Entity
Recognition (NER)
Relation extraction
Analysis
and
visualisation
Semantic
search,
reasoning, ...
Example 1: Literature Based Discovery
● A range of techniques (Smalheiser, 2012)
● A typical approach: ABC method (e.g. Hristovski et. al. 2008):
○ A affects/binds/regulates/interacts with B
○ B affects/binds/regulates/interacts with C
○ A and C are not explicitly linked in any article
=> There might be an undiscovered relationship between A and C
Various reasons why A and C might not be connected, e.g. B is a rare
term.
Example 1: Literature Based Discovery
Papers on
small-cell lung
carcinoma
Term and relation
extraction
Knowledge
base
Recognise entities,
such as symptoms,
conditions, therapeutic
agents (drugs)
A B
search engine
Papers
resulting from
term queriesList of terms
Term and
relation
extraction
List of terms
C
Lists B and C
with semantic
annotations
Ranker
Ranked list of
hypotheses
connecting A
and C
Medical
researcher
Example 1: Visualising the A-C
hypotheses
Reproduced with permission from FastVac and Public Health England.
Example 2: modeling research arguments
using citation contexts
Example 2: modeling research arguments
using citation contexts
● Label the citation network according to the context of how
they are cited
● Pioneering work by Simone Teufel in this area
Green: Contrastive or
comparative statements
Pink: Statements about
the aim of the paper
Source: Simone Teufel
(1999) Argumentative
Zoning: Information
Extraction from Scientific
Text, Phd Thesis,
University of Edinburgh.
Example 3: exploratory search over
research papers
Source: Herrmannova & Knoth (2012) Visual Search for Supporting Content Exploration in Large Document
Collections, DLib 18(8).
How can I get started?
• Get the data
• Design/build your workflow
• Select a framework, tools, services to be used in implementing the
workflow
• Understand how to evaluate the performance of each component
Full-text Open Access article sources
• Subject repositories:
• arXiv bulk data
• PubMed OA Subset
• Institutional repositories:
• ~3k across the world, see Directory of Open Access Repositories
• Open Access journals:
• >10k OA journals, see Directory of Open Access Journals (DOAJ)
• Open Access subsets from publishers:
• Elsevier OA STM Corpus
• SpringerOpen
• Aggregators:
• CORE (API, Data dumps)
Other full-text sources
Typically for non-commercial research only.
• Publisher full-text APIs:
• Elsevier’s Text and Data Mining API:
• (subscribed content on ScienceDirect)
• Domain specific corpuses:
• JSTOR Data For Research
• Full-text content from multiple publishers:
• CrossRef Text and Data Mining API (full-texts)
• Mendeley API (abstracts)
• Other useful scholarly resources:
• Microsoft Academic Graph (MAG)
• Clinical records
• i2b2 NLP Challenge Data Sets
Building a text mining application
Papers
Extraction and
text cleaning
Tokenisation
Sentence
splitting
Stemming,
lemmatisation
Knowledge
base
Parsing
Part of Speech
Tagging (POS)
Named Entity
Recognition (NER)
Relation extraction
Analysis
and
visualisation
Semantic
search,
reasoning, ...
Example: Part of speech tagging
[Acute]JJ
[lymphoblastic]JJ
[leukemia]NN
([ALL]NN
) [leads]VB
[to]IN
[an]DT
[accumulation]NN
[of]IN
[immature]JJ
[lymphoid]JJ
[cells]NN
[into]IN
[the]DT
[bone]NN
[marrow]NN
, [blood]NN
[and]CC
[other]JJ
[organs]NN
.
[A]DT
[young]JJ
[patient]NN
[was]VB
[treated]VB
[unconventionally]RB
[for]IN
[Philadelphia]NN
[positive]NN
[ALL]NN
[and]CC
[Mucormycosis]NN
[with]IN
[Amphotericin]NN
[B]NN
.
[The]DT
[patient]NN
[achieved]VB
[a]DT
[disease-free]JJ
[survival]NN
[of]IN
[12]CD
[months]NN
[with]IN
[good]JJ
[quality]NN
[of]IN
[life]NN
.
Example: Parsing
[Acute lymphoblastic leukemia]NP
([ALL]NP
) [leads to]VP
[an]DT
[accumulation]NN
[of]IN
[immature lymphoid cells]NP
[into]IN
[the]DT
[bone marrow]NP
, [blood]NP
[and]CC
[other organs]NP
.
[A]DT
[young patient]NP
[was treated unconventionally for]VP
[Philadelphia
positive ALL]NP
[and]CC
[Mucormycosis]NP
[with]IN
[Amphotericin B]NP
.
[The]DT
[patient]NP
[achieved]VP
[a]DT
[disease-free survival]NP
[of]IN
[12 months]
NP
[with]IN
[good]JJ
[quality of life]PP
.
Example: Named entity recognition
[Acute lymphoblastic leukemia]Disease
([ALL]Disease
) [RESULT]VP
[accumulation]
NN
[of]IN
[immature lymphoid cells]AnatomicalSite
[into]IN
[the]DT
[bone marrow]
AnatomicalSite
, [blood]AnatomicalSite
[and]CC
[other organs]AnatomicalSite
.
[A]DT
[young patient]Person
[TREAT]VP
[Philadelphia positive ALL]Disease
[and]CC
[Mucormycosis]Disease
[with]IN
[Amphotericin B]Treatment
.
[The]DT
[patient]Person
[RESULT]VP
[disease-free survival]Outcome
[of]IN
[12 months]
Duration
[with]IN
[good quality of life]Outcome
.
Example: Named entity recognition
● Here, ALL is automatically expanded to its full form: acute lymphoblastic leukemia
● ALL is automatically labelled as a DiseaseOrSyndrome. as the initial mention was also labelled
as DiseaseOrSyndrome
Example: Named entity recognition
● Philadelphia on its own can be labelled as a Location
● In this context it is part of a longer noun phrase ‘Philadelphia positive ALL’, of which ALL is the
Disease acute lymphoblastic leukemia
● ‘Philadelphia positive’ is also a term in a domain-specific dictionary
● Which label is chosen can be determined by a rule, either handwritten or learnt by a machine
learning algorithm
Frameworks
● General Architecture for Text Engineering (GATE)
● Apache UIMA
● Natural Language Toolkit (NLTK)
● OpenNLP
Tools and Services
● Open Calais: general purpose tagging of people, places, companies, facts, events and
relationships
● AlchemyAPI: similar to OpenCalais
● National Centre for Text Mining: terms, entities, semantic search
○ Cafetiere: text mine your own documents online
○ EventMine: relations between terms: protein binding, synthesis inhibition
● ContentMine: Fact extraction
● ReVerb: open-domain relation extraction
● XIP Parser: linguistic analysis
● Linguamatics
● Metadata and citation extraction:
○ ParsCit
○ Grobid
○ Cermine
Open source demos
● Mimir
○ analysing, comparing and contrasting research findings
● EEXCESS
○ Recommend related resources
○ Narrative paths between resources
Existing challenges
● Harmonising metadata formats across publishers, repositories, etc.
● Agreeing on standards/ontologies/formats used to share the outputs of
research publications text-mining tools and all their components.
● Integration of text-mining tools with content providers’ systems
● Building and maintaining text-mining web-services for research (building
blocks)
● Promoting and adopting end-user tools utilising text-mining in
researchers’ daily workflows
● Building gold standards for various text-mining tasks and sharing them
across researchers (issue of credit)
Events and initiatives
● Events
○ International Workshop on Mining Scientific Publications (WOSP)
○ Joint Conference on Digital Libraries (JCDL)
○ ACL, COLING, IJC-NLP, CoNLL, LREC, etc.
● Projects
○ OpenMinTeD (EC funded)
○ EEXCESS
○ OpenAIRE
Thanks for listening ...
datascience@mendeley.com

More Related Content

What's hot

What I wish I’d known at the start!
What I wish I’d known at the start!What I wish I’d known at the start!
What I wish I’d known at the start!Jisc RDM
 
RDAP 16 Lightning: An Open Science Framework for Solving Institutional Challe...
RDAP 16 Lightning: An Open Science Framework for Solving Institutional Challe...RDAP 16 Lightning: An Open Science Framework for Solving Institutional Challe...
RDAP 16 Lightning: An Open Science Framework for Solving Institutional Challe...ASIS&T
 
Rubrics for DMPs
Rubrics for DMPsRubrics for DMPs
Rubrics for DMPsJisc RDM
 
The Kaleidoscope of Impact: same data, different perspectives, constantly cha...
The Kaleidoscope of Impact: same data, different perspectives, constantly cha...The Kaleidoscope of Impact: same data, different perspectives, constantly cha...
The Kaleidoscope of Impact: same data, different perspectives, constantly cha...Kudos
 
Research data support: a growth area for academic libraries?
Research data support: a growth area for academic libraries?Research data support: a growth area for academic libraries?
Research data support: a growth area for academic libraries? Robin Rice
 
Capturing and Analyzing Publication, Citation and Usage Data for Contextual C...
Capturing and Analyzing Publication, Citation and Usage Data for Contextual C...Capturing and Analyzing Publication, Citation and Usage Data for Contextual C...
Capturing and Analyzing Publication, Citation and Usage Data for Contextual C...NASIG
 
Has anyone seen my data? Incentivising #opendata sharing with altmetrics
Has anyone seen my data? Incentivising #opendata sharing with altmetricsHas anyone seen my data? Incentivising #opendata sharing with altmetrics
Has anyone seen my data? Incentivising #opendata sharing with altmetricsNick Sheppard
 
From Open Access to Open Data: Collaborative Work in the University Libraries...
From Open Access to Open Data: Collaborative Work in the University Libraries...From Open Access to Open Data: Collaborative Work in the University Libraries...
From Open Access to Open Data: Collaborative Work in the University Libraries...LIBER Europe
 
Researcher engagement
Researcher engagementResearcher engagement
Researcher engagementJisc RDM
 
Recognising data sharing
Recognising data sharingRecognising data sharing
Recognising data sharingJisc RDM
 
Rachel Proudfoot JIBS-RLUK event July 2012
Rachel Proudfoot JIBS-RLUK event July 2012Rachel Proudfoot JIBS-RLUK event July 2012
Rachel Proudfoot JIBS-RLUK event July 2012sherif user group
 

What's hot (20)

Lafferty "Supporting Research Data Management: Perceptions from a Library Pra...
Lafferty "Supporting Research Data Management: Perceptions from a Library Pra...Lafferty "Supporting Research Data Management: Perceptions from a Library Pra...
Lafferty "Supporting Research Data Management: Perceptions from a Library Pra...
 
What I wish I’d known at the start!
What I wish I’d known at the start!What I wish I’d known at the start!
What I wish I’d known at the start!
 
RDAP 16 Lightning: An Open Science Framework for Solving Institutional Challe...
RDAP 16 Lightning: An Open Science Framework for Solving Institutional Challe...RDAP 16 Lightning: An Open Science Framework for Solving Institutional Challe...
RDAP 16 Lightning: An Open Science Framework for Solving Institutional Challe...
 
NISO Virtual Conference Scientific Data Management: Caring for Your Instituti...
NISO Virtual Conference Scientific Data Management: Caring for Your Instituti...NISO Virtual Conference Scientific Data Management: Caring for Your Instituti...
NISO Virtual Conference Scientific Data Management: Caring for Your Instituti...
 
Maxwell "Lessons Learned from Developing a Predictive Analytics Data Model"
Maxwell "Lessons Learned from Developing a Predictive Analytics Data Model"Maxwell "Lessons Learned from Developing a Predictive Analytics Data Model"
Maxwell "Lessons Learned from Developing a Predictive Analytics Data Model"
 
Butler "Building Data Science Skills: Enhancing Core Capabilities and Underst...
Butler "Building Data Science Skills: Enhancing Core Capabilities and Underst...Butler "Building Data Science Skills: Enhancing Core Capabilities and Underst...
Butler "Building Data Science Skills: Enhancing Core Capabilities and Underst...
 
Rubrics for DMPs
Rubrics for DMPsRubrics for DMPs
Rubrics for DMPs
 
The Kaleidoscope of Impact: same data, different perspectives, constantly cha...
The Kaleidoscope of Impact: same data, different perspectives, constantly cha...The Kaleidoscope of Impact: same data, different perspectives, constantly cha...
The Kaleidoscope of Impact: same data, different perspectives, constantly cha...
 
Goldman "Collaboratively Build Data Science Services and Skills"
Goldman "Collaboratively Build Data Science Services and Skills"Goldman "Collaboratively Build Data Science Services and Skills"
Goldman "Collaboratively Build Data Science Services and Skills"
 
Research data support: a growth area for academic libraries?
Research data support: a growth area for academic libraries?Research data support: a growth area for academic libraries?
Research data support: a growth area for academic libraries?
 
Attribution From Res Lib Perspective - Micah Altman, MIT
Attribution From Res Lib Perspective - Micah Altman, MITAttribution From Res Lib Perspective - Micah Altman, MIT
Attribution From Res Lib Perspective - Micah Altman, MIT
 
Lee "Supporting Research Data is a Group Effort"
Lee "Supporting Research Data is a Group Effort"Lee "Supporting Research Data is a Group Effort"
Lee "Supporting Research Data is a Group Effort"
 
Capturing and Analyzing Publication, Citation and Usage Data for Contextual C...
Capturing and Analyzing Publication, Citation and Usage Data for Contextual C...Capturing and Analyzing Publication, Citation and Usage Data for Contextual C...
Capturing and Analyzing Publication, Citation and Usage Data for Contextual C...
 
Putnam Data Quality and the IR
Putnam Data Quality and the IRPutnam Data Quality and the IR
Putnam Data Quality and the IR
 
Has anyone seen my data? Incentivising #opendata sharing with altmetrics
Has anyone seen my data? Incentivising #opendata sharing with altmetricsHas anyone seen my data? Incentivising #opendata sharing with altmetrics
Has anyone seen my data? Incentivising #opendata sharing with altmetrics
 
From Open Access to Open Data: Collaborative Work in the University Libraries...
From Open Access to Open Data: Collaborative Work in the University Libraries...From Open Access to Open Data: Collaborative Work in the University Libraries...
From Open Access to Open Data: Collaborative Work in the University Libraries...
 
Researcher engagement
Researcher engagementResearcher engagement
Researcher engagement
 
Recognising data sharing
Recognising data sharingRecognising data sharing
Recognising data sharing
 
NISO/NFAIS Joint Virtual Conference: Connecting the Library to the Wider Wor...
NISO/NFAIS Joint Virtual Conference:  Connecting the Library to the Wider Wor...NISO/NFAIS Joint Virtual Conference:  Connecting the Library to the Wider Wor...
NISO/NFAIS Joint Virtual Conference: Connecting the Library to the Wider Wor...
 
Rachel Proudfoot JIBS-RLUK event July 2012
Rachel Proudfoot JIBS-RLUK event July 2012Rachel Proudfoot JIBS-RLUK event July 2012
Rachel Proudfoot JIBS-RLUK event July 2012
 

Similar to UKSG webinar - Introduction to Text-Mining Research Papers with Petr Knoth and Phil Gooch, both Mendeley Ltd.

OpenMinTeD: Making Sense of Large Volumes of Data
OpenMinTeD: Making Sense of Large Volumes of DataOpenMinTeD: Making Sense of Large Volumes of Data
OpenMinTeD: Making Sense of Large Volumes of Dataopenminted_eu
 
Elsevier - Smart Data and Algorithms for the Publishing Industry
Elsevier - Smart Data and Algorithms for the Publishing IndustryElsevier - Smart Data and Algorithms for the Publishing Industry
Elsevier - Smart Data and Algorithms for the Publishing IndustryAntonio Gulli
 
How to write an academic paper.pptx
How to write an academic paper.pptxHow to write an academic paper.pptx
How to write an academic paper.pptxTitiA3
 
The Rhetoric of Research Objects
The Rhetoric of Research ObjectsThe Rhetoric of Research Objects
The Rhetoric of Research ObjectsCarole Goble
 
A Pragmatic Approach to Facilitating Text and Data Mining
A Pragmatic Approach to Facilitating Text and Data Mining A Pragmatic Approach to Facilitating Text and Data Mining
A Pragmatic Approach to Facilitating Text and Data Mining Chris Shillum
 
Auto Mapping Texts for Human-Machine Analysis and Sensemaking
Auto Mapping Texts for Human-Machine Analysis and SensemakingAuto Mapping Texts for Human-Machine Analysis and Sensemaking
Auto Mapping Texts for Human-Machine Analysis and SensemakingShalin Hai-Jew
 
Sci Know Mine 2013: What can we learn from topic modeling on 350M academic do...
Sci Know Mine 2013: What can we learn from topic modeling on 350M academic do...Sci Know Mine 2013: What can we learn from topic modeling on 350M academic do...
Sci Know Mine 2013: What can we learn from topic modeling on 350M academic do...William Gunn
 
information-skills-for-researchers-v3
information-skills-for-researchers-v3information-skills-for-researchers-v3
information-skills-for-researchers-v3Jacqueline Thomas
 
OpenMinTeD - Repositories in the centre of new scientific knowledge
OpenMinTeD - Repositories in the centre of new scientific knowledgeOpenMinTeD - Repositories in the centre of new scientific knowledge
OpenMinTeD - Repositories in the centre of new scientific knowledgeopenminted_eu
 
Annotation examples--Fribourg--2019-09-03
Annotation examples--Fribourg--2019-09-03Annotation examples--Fribourg--2019-09-03
Annotation examples--Fribourg--2019-09-03jodischneider
 
Institutional Repositories
Institutional RepositoriesInstitutional Repositories
Institutional RepositoriesSridhar Gutam
 
11 - qualitative research data analysis ( Dr. Abdullah Al-Beraidi - Dr. Ibrah...
11 - qualitative research data analysis ( Dr. Abdullah Al-Beraidi - Dr. Ibrah...11 - qualitative research data analysis ( Dr. Abdullah Al-Beraidi - Dr. Ibrah...
11 - qualitative research data analysis ( Dr. Abdullah Al-Beraidi - Dr. Ibrah...Rasha
 
Bythebook 2016 basaglia
Bythebook 2016 basagliaBythebook 2016 basaglia
Bythebook 2016 basagliaCERN
 

Similar to UKSG webinar - Introduction to Text-Mining Research Papers with Petr Knoth and Phil Gooch, both Mendeley Ltd. (20)

OpenMinTeD: Making Sense of Large Volumes of Data
OpenMinTeD: Making Sense of Large Volumes of DataOpenMinTeD: Making Sense of Large Volumes of Data
OpenMinTeD: Making Sense of Large Volumes of Data
 
Elsevier - Smart Data and Algorithms for the Publishing Industry
Elsevier - Smart Data and Algorithms for the Publishing IndustryElsevier - Smart Data and Algorithms for the Publishing Industry
Elsevier - Smart Data and Algorithms for the Publishing Industry
 
How to write an academic paper.pptx
How to write an academic paper.pptxHow to write an academic paper.pptx
How to write an academic paper.pptx
 
Swimming upstream: libraries and open scholarship
Swimming upstream: libraries and open scholarshipSwimming upstream: libraries and open scholarship
Swimming upstream: libraries and open scholarship
 
The Rhetoric of Research Objects
The Rhetoric of Research ObjectsThe Rhetoric of Research Objects
The Rhetoric of Research Objects
 
A Pragmatic Approach to Facilitating Text and Data Mining
A Pragmatic Approach to Facilitating Text and Data Mining A Pragmatic Approach to Facilitating Text and Data Mining
A Pragmatic Approach to Facilitating Text and Data Mining
 
Ontology
OntologyOntology
Ontology
 
Auto Mapping Texts for Human-Machine Analysis and Sensemaking
Auto Mapping Texts for Human-Machine Analysis and SensemakingAuto Mapping Texts for Human-Machine Analysis and Sensemaking
Auto Mapping Texts for Human-Machine Analysis and Sensemaking
 
Sci Know Mine 2013: What can we learn from topic modeling on 350M academic do...
Sci Know Mine 2013: What can we learn from topic modeling on 350M academic do...Sci Know Mine 2013: What can we learn from topic modeling on 350M academic do...
Sci Know Mine 2013: What can we learn from topic modeling on 350M academic do...
 
Text Mining
Text MiningText Mining
Text Mining
 
information-skills-for-researchers-v3
information-skills-for-researchers-v3information-skills-for-researchers-v3
information-skills-for-researchers-v3
 
OpenMinTeD - Repositories in the centre of new scientific knowledge
OpenMinTeD - Repositories in the centre of new scientific knowledgeOpenMinTeD - Repositories in the centre of new scientific knowledge
OpenMinTeD - Repositories in the centre of new scientific knowledge
 
Ontology at Manchester
Ontology at ManchesterOntology at Manchester
Ontology at Manchester
 
EDS for IFLA
EDS for IFLAEDS for IFLA
EDS for IFLA
 
Annotation examples--Fribourg--2019-09-03
Annotation examples--Fribourg--2019-09-03Annotation examples--Fribourg--2019-09-03
Annotation examples--Fribourg--2019-09-03
 
Paul Groth
Paul GrothPaul Groth
Paul Groth
 
Institutional Repositories
Institutional RepositoriesInstitutional Repositories
Institutional Repositories
 
11 - qualitative research data analysis ( Dr. Abdullah Al-Beraidi - Dr. Ibrah...
11 - qualitative research data analysis ( Dr. Abdullah Al-Beraidi - Dr. Ibrah...11 - qualitative research data analysis ( Dr. Abdullah Al-Beraidi - Dr. Ibrah...
11 - qualitative research data analysis ( Dr. Abdullah Al-Beraidi - Dr. Ibrah...
 
Bythebook 2016 basaglia
Bythebook 2016 basagliaBythebook 2016 basaglia
Bythebook 2016 basaglia
 
Semantic Technologies for Big Sciences including Astrophysics
Semantic Technologies for Big Sciences including AstrophysicsSemantic Technologies for Big Sciences including Astrophysics
Semantic Technologies for Big Sciences including Astrophysics
 

More from UKSG: connecting the knowledge community

UKSG 2024 Plenary 4 - Combining Open Access research and large language model...
UKSG 2024 Plenary 4 - Combining Open Access research and large language model...UKSG 2024 Plenary 4 - Combining Open Access research and large language model...
UKSG 2024 Plenary 4 - Combining Open Access research and large language model...UKSG: connecting the knowledge community
 
UKSG 2024 Plenary 3 - There is No List: (How) Can We Combat “Predatory” Publi...
UKSG 2024 Plenary 3 - There is No List: (How) Can We Combat “Predatory” Publi...UKSG 2024 Plenary 3 - There is No List: (How) Can We Combat “Predatory” Publi...
UKSG 2024 Plenary 3 - There is No List: (How) Can We Combat “Predatory” Publi...UKSG: connecting the knowledge community
 
UKSG 2024 Plenary 2 - Are we there yet? A review of transitional agreements i...
UKSG 2024 Plenary 2 - Are we there yet? A review of transitional agreements i...UKSG 2024 Plenary 2 - Are we there yet? A review of transitional agreements i...
UKSG 2024 Plenary 2 - Are we there yet? A review of transitional agreements i...UKSG: connecting the knowledge community
 
UKSG 2024 Plenary 2 - What did we Read, What did we Publish: Distilling the d...
UKSG 2024 Plenary 2 - What did we Read, What did we Publish: Distilling the d...UKSG 2024 Plenary 2 - What did we Read, What did we Publish: Distilling the d...
UKSG 2024 Plenary 2 - What did we Read, What did we Publish: Distilling the d...UKSG: connecting the knowledge community
 
UKSG 2024 Lightning 2 - How GetFTR Supports Discovery and Access of OA Content
UKSG 2024 Lightning 2 - How GetFTR Supports Discovery and Access of OA ContentUKSG 2024 Lightning 2 - How GetFTR Supports Discovery and Access of OA Content
UKSG 2024 Lightning 2 - How GetFTR Supports Discovery and Access of OA ContentUKSG: connecting the knowledge community
 
UKSG 2024 Lightning 2 - Advocating for data sharing: messaging frameworks for...
UKSG 2024 Lightning 2 - Advocating for data sharing: messaging frameworks for...UKSG 2024 Lightning 2 - Advocating for data sharing: messaging frameworks for...
UKSG 2024 Lightning 2 - Advocating for data sharing: messaging frameworks for...UKSG: connecting the knowledge community
 
UKSG 2024 Lightning 2 - All Watched Over By Machines That Love Open Research
UKSG 2024 Lightning 2 - All Watched Over By Machines That Love Open ResearchUKSG 2024 Lightning 2 - All Watched Over By Machines That Love Open Research
UKSG 2024 Lightning 2 - All Watched Over By Machines That Love Open ResearchUKSG: connecting the knowledge community
 
UKSG 2024 Lightning 1 - Responding to the UN SDG Publishers Compact – Bristol...
UKSG 2024 Lightning 1 - Responding to the UN SDG Publishers Compact – Bristol...UKSG 2024 Lightning 1 - Responding to the UN SDG Publishers Compact – Bristol...
UKSG 2024 Lightning 1 - Responding to the UN SDG Publishers Compact – Bristol...UKSG: connecting the knowledge community
 
UKSG 2024 Lightning 1 - Practical steps towards an open research culture: Bui...
UKSG 2024 Lightning 1 - Practical steps towards an open research culture: Bui...UKSG 2024 Lightning 1 - Practical steps towards an open research culture: Bui...
UKSG 2024 Lightning 1 - Practical steps towards an open research culture: Bui...UKSG: connecting the knowledge community
 
UKSG 2024 - Reckoning or Retreat? A Longitudinal Look at DEIA in Scholarly Co...
UKSG 2024 - Reckoning or Retreat? A Longitudinal Look at DEIA in Scholarly Co...UKSG 2024 - Reckoning or Retreat? A Longitudinal Look at DEIA in Scholarly Co...
UKSG 2024 - Reckoning or Retreat? A Longitudinal Look at DEIA in Scholarly Co...UKSG: connecting the knowledge community
 
UKSG 2024 - You don't know what you've got till it's gone: Future directions ...
UKSG 2024 - You don't know what you've got till it's gone: Future directions ...UKSG 2024 - You don't know what you've got till it's gone: Future directions ...
UKSG 2024 - You don't know what you've got till it's gone: Future directions ...UKSG: connecting the knowledge community
 
UKSG 2024 - Vision, mission, passion: how UK University Presses collaborate t...
UKSG 2024 - Vision, mission, passion: how UK University Presses collaborate t...UKSG 2024 - Vision, mission, passion: how UK University Presses collaborate t...
UKSG 2024 - Vision, mission, passion: how UK University Presses collaborate t...UKSG: connecting the knowledge community
 
UKSG - 2024 - Fostering an Open Research culture: ARU's Graduate Trainee Seco...
UKSG - 2024 - Fostering an Open Research culture: ARU's Graduate Trainee Seco...UKSG - 2024 - Fostering an Open Research culture: ARU's Graduate Trainee Seco...
UKSG - 2024 - Fostering an Open Research culture: ARU's Graduate Trainee Seco...UKSG: connecting the knowledge community
 
UKSG 2024 - Creating credibility through community: Encouraging high quality ...
UKSG 2024 - Creating credibility through community: Encouraging high quality ...UKSG 2024 - Creating credibility through community: Encouraging high quality ...
UKSG 2024 - Creating credibility through community: Encouraging high quality ...UKSG: connecting the knowledge community
 
UKSG 2024 - Author Identity Metadata: Why a Small Publisher Can Address a Maj...
UKSG 2024 - Author Identity Metadata: Why a Small Publisher Can Address a Maj...UKSG 2024 - Author Identity Metadata: Why a Small Publisher Can Address a Maj...
UKSG 2024 - Author Identity Metadata: Why a Small Publisher Can Address a Maj...UKSG: connecting the knowledge community
 
UKSG 2024 - Captivate, Connect, and Convert: Unlocking the art of Collections...
UKSG 2024 - Captivate, Connect, and Convert: Unlocking the art of Collections...UKSG 2024 - Captivate, Connect, and Convert: Unlocking the art of Collections...
UKSG 2024 - Captivate, Connect, and Convert: Unlocking the art of Collections...UKSG: connecting the knowledge community
 
UKSG 2024 - A critical review of transitional agreements in the UK: why, how,...
UKSG 2024 - A critical review of transitional agreements in the UK: why, how,...UKSG 2024 - A critical review of transitional agreements in the UK: why, how,...
UKSG 2024 - A critical review of transitional agreements in the UK: why, how,...UKSG: connecting the knowledge community
 
UKSG 2024 - What next for sustainable open scholarship? The Cambridge Univers...
UKSG 2024 - What next for sustainable open scholarship? The Cambridge Univers...UKSG 2024 - What next for sustainable open scholarship? The Cambridge Univers...
UKSG 2024 - What next for sustainable open scholarship? The Cambridge Univers...UKSG: connecting the knowledge community
 

More from UKSG: connecting the knowledge community (20)

UKSG 2024 Plenary 4 - Combining Open Access research and large language model...
UKSG 2024 Plenary 4 - Combining Open Access research and large language model...UKSG 2024 Plenary 4 - Combining Open Access research and large language model...
UKSG 2024 Plenary 4 - Combining Open Access research and large language model...
 
UKSG 2024 Plenary 3 - There is No List: (How) Can We Combat “Predatory” Publi...
UKSG 2024 Plenary 3 - There is No List: (How) Can We Combat “Predatory” Publi...UKSG 2024 Plenary 3 - There is No List: (How) Can We Combat “Predatory” Publi...
UKSG 2024 Plenary 3 - There is No List: (How) Can We Combat “Predatory” Publi...
 
UKSG 2024 Plenary 2 - Let's Talk About Green
UKSG 2024 Plenary 2 - Let's Talk About GreenUKSG 2024 Plenary 2 - Let's Talk About Green
UKSG 2024 Plenary 2 - Let's Talk About Green
 
UKSG 2024 Plenary 2 - Are we there yet? A review of transitional agreements i...
UKSG 2024 Plenary 2 - Are we there yet? A review of transitional agreements i...UKSG 2024 Plenary 2 - Are we there yet? A review of transitional agreements i...
UKSG 2024 Plenary 2 - Are we there yet? A review of transitional agreements i...
 
UKSG 2024 Plenary 2 - What did we Read, What did we Publish: Distilling the d...
UKSG 2024 Plenary 2 - What did we Read, What did we Publish: Distilling the d...UKSG 2024 Plenary 2 - What did we Read, What did we Publish: Distilling the d...
UKSG 2024 Plenary 2 - What did we Read, What did we Publish: Distilling the d...
 
UKSG 2024 Lightning 2 - How GetFTR Supports Discovery and Access of OA Content
UKSG 2024 Lightning 2 - How GetFTR Supports Discovery and Access of OA ContentUKSG 2024 Lightning 2 - How GetFTR Supports Discovery and Access of OA Content
UKSG 2024 Lightning 2 - How GetFTR Supports Discovery and Access of OA Content
 
UKSG 2024 Lightning 2 - Advocating for data sharing: messaging frameworks for...
UKSG 2024 Lightning 2 - Advocating for data sharing: messaging frameworks for...UKSG 2024 Lightning 2 - Advocating for data sharing: messaging frameworks for...
UKSG 2024 Lightning 2 - Advocating for data sharing: messaging frameworks for...
 
UKSG 2024 Lightning 2 - All Watched Over By Machines That Love Open Research
UKSG 2024 Lightning 2 - All Watched Over By Machines That Love Open ResearchUKSG 2024 Lightning 2 - All Watched Over By Machines That Love Open Research
UKSG 2024 Lightning 2 - All Watched Over By Machines That Love Open Research
 
UKSG 2024 Lightning 1 - Responding to the UN SDG Publishers Compact – Bristol...
UKSG 2024 Lightning 1 - Responding to the UN SDG Publishers Compact – Bristol...UKSG 2024 Lightning 1 - Responding to the UN SDG Publishers Compact – Bristol...
UKSG 2024 Lightning 1 - Responding to the UN SDG Publishers Compact – Bristol...
 
UKSG 2024 Lightning 1 - Practical steps towards an open research culture: Bui...
UKSG 2024 Lightning 1 - Practical steps towards an open research culture: Bui...UKSG 2024 Lightning 1 - Practical steps towards an open research culture: Bui...
UKSG 2024 Lightning 1 - Practical steps towards an open research culture: Bui...
 
UKSG 2024 - Open infrastructure and standards: small bodies, big impact
UKSG 2024 - Open infrastructure and standards: small bodies, big impactUKSG 2024 - Open infrastructure and standards: small bodies, big impact
UKSG 2024 - Open infrastructure and standards: small bodies, big impact
 
UKSG 2024 - Reckoning or Retreat? A Longitudinal Look at DEIA in Scholarly Co...
UKSG 2024 - Reckoning or Retreat? A Longitudinal Look at DEIA in Scholarly Co...UKSG 2024 - Reckoning or Retreat? A Longitudinal Look at DEIA in Scholarly Co...
UKSG 2024 - Reckoning or Retreat? A Longitudinal Look at DEIA in Scholarly Co...
 
UKSG 2024 - You don't know what you've got till it's gone: Future directions ...
UKSG 2024 - You don't know what you've got till it's gone: Future directions ...UKSG 2024 - You don't know what you've got till it's gone: Future directions ...
UKSG 2024 - You don't know what you've got till it's gone: Future directions ...
 
UKSG 2024 - Vision, mission, passion: how UK University Presses collaborate t...
UKSG 2024 - Vision, mission, passion: how UK University Presses collaborate t...UKSG 2024 - Vision, mission, passion: how UK University Presses collaborate t...
UKSG 2024 - Vision, mission, passion: how UK University Presses collaborate t...
 
UKSG - 2024 - Fostering an Open Research culture: ARU's Graduate Trainee Seco...
UKSG - 2024 - Fostering an Open Research culture: ARU's Graduate Trainee Seco...UKSG - 2024 - Fostering an Open Research culture: ARU's Graduate Trainee Seco...
UKSG - 2024 - Fostering an Open Research culture: ARU's Graduate Trainee Seco...
 
UKSG 2024 - Creating credibility through community: Encouraging high quality ...
UKSG 2024 - Creating credibility through community: Encouraging high quality ...UKSG 2024 - Creating credibility through community: Encouraging high quality ...
UKSG 2024 - Creating credibility through community: Encouraging high quality ...
 
UKSG 2024 - Author Identity Metadata: Why a Small Publisher Can Address a Maj...
UKSG 2024 - Author Identity Metadata: Why a Small Publisher Can Address a Maj...UKSG 2024 - Author Identity Metadata: Why a Small Publisher Can Address a Maj...
UKSG 2024 - Author Identity Metadata: Why a Small Publisher Can Address a Maj...
 
UKSG 2024 - Captivate, Connect, and Convert: Unlocking the art of Collections...
UKSG 2024 - Captivate, Connect, and Convert: Unlocking the art of Collections...UKSG 2024 - Captivate, Connect, and Convert: Unlocking the art of Collections...
UKSG 2024 - Captivate, Connect, and Convert: Unlocking the art of Collections...
 
UKSG 2024 - A critical review of transitional agreements in the UK: why, how,...
UKSG 2024 - A critical review of transitional agreements in the UK: why, how,...UKSG 2024 - A critical review of transitional agreements in the UK: why, how,...
UKSG 2024 - A critical review of transitional agreements in the UK: why, how,...
 
UKSG 2024 - What next for sustainable open scholarship? The Cambridge Univers...
UKSG 2024 - What next for sustainable open scholarship? The Cambridge Univers...UKSG 2024 - What next for sustainable open scholarship? The Cambridge Univers...
UKSG 2024 - What next for sustainable open scholarship? The Cambridge Univers...
 

Recently uploaded

What is Model Inheritance in Odoo 17 ERP
What is Model Inheritance in Odoo 17 ERPWhat is Model Inheritance in Odoo 17 ERP
What is Model Inheritance in Odoo 17 ERPCeline George
 
Employee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxEmployee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxNirmalaLoungPoorunde1
 
Procuring digital preservation CAN be quick and painless with our new dynamic...
Procuring digital preservation CAN be quick and painless with our new dynamic...Procuring digital preservation CAN be quick and painless with our new dynamic...
Procuring digital preservation CAN be quick and painless with our new dynamic...Jisc
 
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptxMULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptxAnupkumar Sharma
 
Alper Gobel In Media Res Media Component
Alper Gobel In Media Res Media ComponentAlper Gobel In Media Res Media Component
Alper Gobel In Media Res Media ComponentInMediaRes1
 
Influencing policy (training slides from Fast Track Impact)
Influencing policy (training slides from Fast Track Impact)Influencing policy (training slides from Fast Track Impact)
Influencing policy (training slides from Fast Track Impact)Mark Reed
 
Judging the Relevance and worth of ideas part 2.pptx
Judging the Relevance  and worth of ideas part 2.pptxJudging the Relevance  and worth of ideas part 2.pptx
Judging the Relevance and worth of ideas part 2.pptxSherlyMaeNeri
 
HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...
HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...
HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...Nguyen Thanh Tu Collection
 
Framing an Appropriate Research Question 6b9b26d93da94caf993c038d9efcdedb.pdf
Framing an Appropriate Research Question 6b9b26d93da94caf993c038d9efcdedb.pdfFraming an Appropriate Research Question 6b9b26d93da94caf993c038d9efcdedb.pdf
Framing an Appropriate Research Question 6b9b26d93da94caf993c038d9efcdedb.pdfUjwalaBharambe
 
ACC 2024 Chronicles. Cardiology. Exam.pdf
ACC 2024 Chronicles. Cardiology. Exam.pdfACC 2024 Chronicles. Cardiology. Exam.pdf
ACC 2024 Chronicles. Cardiology. Exam.pdfSpandanaRallapalli
 
Difference Between Search & Browse Methods in Odoo 17
Difference Between Search & Browse Methods in Odoo 17Difference Between Search & Browse Methods in Odoo 17
Difference Between Search & Browse Methods in Odoo 17Celine George
 
Full Stack Web Development Course for Beginners
Full Stack Web Development Course  for BeginnersFull Stack Web Development Course  for Beginners
Full Stack Web Development Course for BeginnersSabitha Banu
 
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdf
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdfLike-prefer-love -hate+verb+ing & silent letters & citizenship text.pdf
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdfMr Bounab Samir
 
Gas measurement O2,Co2,& ph) 04/2024.pptx
Gas measurement O2,Co2,& ph) 04/2024.pptxGas measurement O2,Co2,& ph) 04/2024.pptx
Gas measurement O2,Co2,& ph) 04/2024.pptxDr.Ibrahim Hassaan
 
Solving Puzzles Benefits Everyone (English).pptx
Solving Puzzles Benefits Everyone (English).pptxSolving Puzzles Benefits Everyone (English).pptx
Solving Puzzles Benefits Everyone (English).pptxOH TEIK BIN
 
Crayon Activity Handout For the Crayon A
Crayon Activity Handout For the Crayon ACrayon Activity Handout For the Crayon A
Crayon Activity Handout For the Crayon AUnboundStockton
 
Earth Day Presentation wow hello nice great
Earth Day Presentation wow hello nice greatEarth Day Presentation wow hello nice great
Earth Day Presentation wow hello nice greatYousafMalik24
 

Recently uploaded (20)

What is Model Inheritance in Odoo 17 ERP
What is Model Inheritance in Odoo 17 ERPWhat is Model Inheritance in Odoo 17 ERP
What is Model Inheritance in Odoo 17 ERP
 
OS-operating systems- ch04 (Threads) ...
OS-operating systems- ch04 (Threads) ...OS-operating systems- ch04 (Threads) ...
OS-operating systems- ch04 (Threads) ...
 
Employee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxEmployee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptx
 
Procuring digital preservation CAN be quick and painless with our new dynamic...
Procuring digital preservation CAN be quick and painless with our new dynamic...Procuring digital preservation CAN be quick and painless with our new dynamic...
Procuring digital preservation CAN be quick and painless with our new dynamic...
 
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptxMULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
 
9953330565 Low Rate Call Girls In Rohini Delhi NCR
9953330565 Low Rate Call Girls In Rohini  Delhi NCR9953330565 Low Rate Call Girls In Rohini  Delhi NCR
9953330565 Low Rate Call Girls In Rohini Delhi NCR
 
Alper Gobel In Media Res Media Component
Alper Gobel In Media Res Media ComponentAlper Gobel In Media Res Media Component
Alper Gobel In Media Res Media Component
 
Model Call Girl in Bikash Puri Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Bikash Puri  Delhi reach out to us at 🔝9953056974🔝Model Call Girl in Bikash Puri  Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Bikash Puri Delhi reach out to us at 🔝9953056974🔝
 
Influencing policy (training slides from Fast Track Impact)
Influencing policy (training slides from Fast Track Impact)Influencing policy (training slides from Fast Track Impact)
Influencing policy (training slides from Fast Track Impact)
 
Judging the Relevance and worth of ideas part 2.pptx
Judging the Relevance  and worth of ideas part 2.pptxJudging the Relevance  and worth of ideas part 2.pptx
Judging the Relevance and worth of ideas part 2.pptx
 
HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...
HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...
HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...
 
Framing an Appropriate Research Question 6b9b26d93da94caf993c038d9efcdedb.pdf
Framing an Appropriate Research Question 6b9b26d93da94caf993c038d9efcdedb.pdfFraming an Appropriate Research Question 6b9b26d93da94caf993c038d9efcdedb.pdf
Framing an Appropriate Research Question 6b9b26d93da94caf993c038d9efcdedb.pdf
 
ACC 2024 Chronicles. Cardiology. Exam.pdf
ACC 2024 Chronicles. Cardiology. Exam.pdfACC 2024 Chronicles. Cardiology. Exam.pdf
ACC 2024 Chronicles. Cardiology. Exam.pdf
 
Difference Between Search & Browse Methods in Odoo 17
Difference Between Search & Browse Methods in Odoo 17Difference Between Search & Browse Methods in Odoo 17
Difference Between Search & Browse Methods in Odoo 17
 
Full Stack Web Development Course for Beginners
Full Stack Web Development Course  for BeginnersFull Stack Web Development Course  for Beginners
Full Stack Web Development Course for Beginners
 
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdf
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdfLike-prefer-love -hate+verb+ing & silent letters & citizenship text.pdf
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdf
 
Gas measurement O2,Co2,& ph) 04/2024.pptx
Gas measurement O2,Co2,& ph) 04/2024.pptxGas measurement O2,Co2,& ph) 04/2024.pptx
Gas measurement O2,Co2,& ph) 04/2024.pptx
 
Solving Puzzles Benefits Everyone (English).pptx
Solving Puzzles Benefits Everyone (English).pptxSolving Puzzles Benefits Everyone (English).pptx
Solving Puzzles Benefits Everyone (English).pptx
 
Crayon Activity Handout For the Crayon A
Crayon Activity Handout For the Crayon ACrayon Activity Handout For the Crayon A
Crayon Activity Handout For the Crayon A
 
Earth Day Presentation wow hello nice great
Earth Day Presentation wow hello nice greatEarth Day Presentation wow hello nice great
Earth Day Presentation wow hello nice great
 

UKSG webinar - Introduction to Text-Mining Research Papers with Petr Knoth and Phil Gooch, both Mendeley Ltd.

  • 1. Mendeley 22 September 2015 An Introduction to Text Mining Research Papers Petr Knoth Phil Gooch
  • 2. What is text mining? ‘the process of deriving high-quality information from text’ (Wikipedia) ‘the discovery by computer of new, previously unknown information, by automatically extracting information from different written resources. A key element is the linking ... of the extracted information ... to form new facts or new hypotheses to be explored further’ (Hearst, 2003) ‘a burgeoning new field that attempts to glean meaningful information from natural language text ... the process of analyzing text to extract information that is useful for particular purposes’ (Witten, 2004)
  • 3. Why text mine research papers? “Research papers are the most complete representation of human knowledge.“
  • 4. Why is it so much talked about now? The idea is not new, but up until recently access to large amounts of research papers was controlled by a handful of companies having bespoke arrangements with publishers. The Open Access movement has recently largely contributed to decreasing the barriers to text-mining of research papers. The availability of tools, developments in machine learning, and reduction in the costs of computing power and storage, has removed some of the technical and financial barriers.
  • 5. What are the opportunities of text mining research papers? • Literature Based Discovery (LBD) • Undiscovered Public Knowledge (Swanson, 1986). • Mining relationships for which there is “hidden” evidence in the research literature, yet they are not explicitly stated, such as magnesium deficiency and migraine, fish oil and Raynaud's disease. • Swanson's discoveries simulated by automated techniques (Weeber et al., 2001).
  • 6. • Supporting exploratory access to research literature • staying up-to-date with research • analysing, comparing and contrasting research findings • Summarisation of research findings • Systematic literature review automation • ‘snowballing’ • Question answering and semantic search from papers • Understanding the research impact of articles, individuals, institutions, countries, … • Monitoring research trends • Understanding how to direct research funding … • Evidence of reuse and plagiarism detection … Other use cases
  • 7. Generic text mining workflow Papers Extraction and text cleaning Tokenisation Sentence splitting Stemming, lemmatisation Knowledge base Parsing Part of Speech Tagging (POS) Named Entity Recognition (NER) Relation extraction Analysis and visualisation Semantic search, reasoning, ...
  • 8. Example 1: Literature Based Discovery ● A range of techniques (Smalheiser, 2012) ● A typical approach: ABC method (e.g. Hristovski et. al. 2008): ○ A affects/binds/regulates/interacts with B ○ B affects/binds/regulates/interacts with C ○ A and C are not explicitly linked in any article => There might be an undiscovered relationship between A and C Various reasons why A and C might not be connected, e.g. B is a rare term.
  • 9. Example 1: Literature Based Discovery Papers on small-cell lung carcinoma Term and relation extraction Knowledge base Recognise entities, such as symptoms, conditions, therapeutic agents (drugs) A B search engine Papers resulting from term queriesList of terms Term and relation extraction List of terms C Lists B and C with semantic annotations Ranker Ranked list of hypotheses connecting A and C Medical researcher
  • 10. Example 1: Visualising the A-C hypotheses Reproduced with permission from FastVac and Public Health England.
  • 11. Example 2: modeling research arguments using citation contexts
  • 12. Example 2: modeling research arguments using citation contexts ● Label the citation network according to the context of how they are cited ● Pioneering work by Simone Teufel in this area Green: Contrastive or comparative statements Pink: Statements about the aim of the paper Source: Simone Teufel (1999) Argumentative Zoning: Information Extraction from Scientific Text, Phd Thesis, University of Edinburgh.
  • 13. Example 3: exploratory search over research papers Source: Herrmannova & Knoth (2012) Visual Search for Supporting Content Exploration in Large Document Collections, DLib 18(8).
  • 14. How can I get started? • Get the data • Design/build your workflow • Select a framework, tools, services to be used in implementing the workflow • Understand how to evaluate the performance of each component
  • 15. Full-text Open Access article sources • Subject repositories: • arXiv bulk data • PubMed OA Subset • Institutional repositories: • ~3k across the world, see Directory of Open Access Repositories • Open Access journals: • >10k OA journals, see Directory of Open Access Journals (DOAJ) • Open Access subsets from publishers: • Elsevier OA STM Corpus • SpringerOpen • Aggregators: • CORE (API, Data dumps)
  • 16. Other full-text sources Typically for non-commercial research only. • Publisher full-text APIs: • Elsevier’s Text and Data Mining API: • (subscribed content on ScienceDirect) • Domain specific corpuses: • JSTOR Data For Research • Full-text content from multiple publishers: • CrossRef Text and Data Mining API (full-texts) • Mendeley API (abstracts) • Other useful scholarly resources: • Microsoft Academic Graph (MAG) • Clinical records • i2b2 NLP Challenge Data Sets
  • 17. Building a text mining application Papers Extraction and text cleaning Tokenisation Sentence splitting Stemming, lemmatisation Knowledge base Parsing Part of Speech Tagging (POS) Named Entity Recognition (NER) Relation extraction Analysis and visualisation Semantic search, reasoning, ...
  • 18. Example: Part of speech tagging [Acute]JJ [lymphoblastic]JJ [leukemia]NN ([ALL]NN ) [leads]VB [to]IN [an]DT [accumulation]NN [of]IN [immature]JJ [lymphoid]JJ [cells]NN [into]IN [the]DT [bone]NN [marrow]NN , [blood]NN [and]CC [other]JJ [organs]NN . [A]DT [young]JJ [patient]NN [was]VB [treated]VB [unconventionally]RB [for]IN [Philadelphia]NN [positive]NN [ALL]NN [and]CC [Mucormycosis]NN [with]IN [Amphotericin]NN [B]NN . [The]DT [patient]NN [achieved]VB [a]DT [disease-free]JJ [survival]NN [of]IN [12]CD [months]NN [with]IN [good]JJ [quality]NN [of]IN [life]NN .
  • 19. Example: Parsing [Acute lymphoblastic leukemia]NP ([ALL]NP ) [leads to]VP [an]DT [accumulation]NN [of]IN [immature lymphoid cells]NP [into]IN [the]DT [bone marrow]NP , [blood]NP [and]CC [other organs]NP . [A]DT [young patient]NP [was treated unconventionally for]VP [Philadelphia positive ALL]NP [and]CC [Mucormycosis]NP [with]IN [Amphotericin B]NP . [The]DT [patient]NP [achieved]VP [a]DT [disease-free survival]NP [of]IN [12 months] NP [with]IN [good]JJ [quality of life]PP .
  • 20. Example: Named entity recognition [Acute lymphoblastic leukemia]Disease ([ALL]Disease ) [RESULT]VP [accumulation] NN [of]IN [immature lymphoid cells]AnatomicalSite [into]IN [the]DT [bone marrow] AnatomicalSite , [blood]AnatomicalSite [and]CC [other organs]AnatomicalSite . [A]DT [young patient]Person [TREAT]VP [Philadelphia positive ALL]Disease [and]CC [Mucormycosis]Disease [with]IN [Amphotericin B]Treatment . [The]DT [patient]Person [RESULT]VP [disease-free survival]Outcome [of]IN [12 months] Duration [with]IN [good quality of life]Outcome .
  • 21. Example: Named entity recognition ● Here, ALL is automatically expanded to its full form: acute lymphoblastic leukemia ● ALL is automatically labelled as a DiseaseOrSyndrome. as the initial mention was also labelled as DiseaseOrSyndrome
  • 22. Example: Named entity recognition ● Philadelphia on its own can be labelled as a Location ● In this context it is part of a longer noun phrase ‘Philadelphia positive ALL’, of which ALL is the Disease acute lymphoblastic leukemia ● ‘Philadelphia positive’ is also a term in a domain-specific dictionary ● Which label is chosen can be determined by a rule, either handwritten or learnt by a machine learning algorithm
  • 23. Frameworks ● General Architecture for Text Engineering (GATE) ● Apache UIMA ● Natural Language Toolkit (NLTK) ● OpenNLP
  • 24. Tools and Services ● Open Calais: general purpose tagging of people, places, companies, facts, events and relationships ● AlchemyAPI: similar to OpenCalais ● National Centre for Text Mining: terms, entities, semantic search ○ Cafetiere: text mine your own documents online ○ EventMine: relations between terms: protein binding, synthesis inhibition ● ContentMine: Fact extraction ● ReVerb: open-domain relation extraction ● XIP Parser: linguistic analysis ● Linguamatics ● Metadata and citation extraction: ○ ParsCit ○ Grobid ○ Cermine
  • 25. Open source demos ● Mimir ○ analysing, comparing and contrasting research findings ● EEXCESS ○ Recommend related resources ○ Narrative paths between resources
  • 26. Existing challenges ● Harmonising metadata formats across publishers, repositories, etc. ● Agreeing on standards/ontologies/formats used to share the outputs of research publications text-mining tools and all their components. ● Integration of text-mining tools with content providers’ systems ● Building and maintaining text-mining web-services for research (building blocks) ● Promoting and adopting end-user tools utilising text-mining in researchers’ daily workflows ● Building gold standards for various text-mining tasks and sharing them across researchers (issue of credit)
  • 27. Events and initiatives ● Events ○ International Workshop on Mining Scientific Publications (WOSP) ○ Joint Conference on Digital Libraries (JCDL) ○ ACL, COLING, IJC-NLP, CoNLL, LREC, etc. ● Projects ○ OpenMinTeD (EC funded) ○ EEXCESS ○ OpenAIRE
  • 28. Thanks for listening ... datascience@mendeley.com