This document discusses how natural language processing (NLP) can be used to support drug development. NLP involves computational analysis and generation of natural language text and speech. It summarizes how NLP can be applied to extract and link information from the vast amount of biomedical literature and data. Examples of NLP applications include named entity recognition, relation extraction, question answering systems, and summarization. The document argues that NLP is now essential to process the huge amount of accumulated medical knowledge and can support many steps of drug development, from discovery to clinical trials to pharmacovigilance.
4. Natural Language Processing (NLP)
= Computational analysis and generation of natural language (text or
speech)
BioNLP
= NLP related to medicine and life sciences
5. Biomedical language data
• Scientific literature
• Reports
• Research notes
• Database content
• Electronic health records
• Patents
• Doctor-Patient conversations
• Social media
• Image/video captions
7. Humans can no
longer process
the accumulated
medical
knowledge
0
5000000
10000000
15000000
20000000
25000000
30000000
35000000
40000000
1880 1920 1960 2000
>34 million in total
Research articles in PubMed
8. Drug A Gene B
Article 1
Gene B Gene C
Article 2
SARS-CoV2
Gene C
Article 3
Pieces of information are scattered across many
research articles
9.
10. NLP can be used to find and combine
knowledge fragments
Drug A Gene B
Article 1
Gene B Gene C
Article 2
Gene C
Article 3
Text mining
with NLP
Drug A
Gene B
Gene C
www.aitslab.org
SARS-CoV-2
SARS-CoV-2
12. Example NLP workflow for information
extraction
TEXT
Named entity
recognition
Named entity
linking
Relation
extraction
Relationship
extraction
13. Example NLP workflow for information
extraction
TEXT
Named entity
recognition
Named entity
linking
Relation
extraction
Relationship
extraction
14. PLoS One. 2012;7(10):e45381. doi: 10.1371/journal.pone.0045381. Epub 2012 Oct 11.
Identification of cytoskeleton-associated proteins essential for lysosomal stability and survival of human cancer cells.
Groth-Pedersen L1, Aits S, Corcelle-Termeau E, Petersen NH, Nylandsted J, Jäättelä M.
Author information
Abstract
Microtubule-disturbing drugs inhibit lysosomal trafficking and induce lysosomal membrane permeabilization followed by
cathepsin-dependent cell death. To identify specific trafficking-related proteins that control cell survival and lysosomal
stability, we screened a molecular motor siRNA library in human MCF7 breast cancer cells. SiRNAs targeting four kinesins
(KIF11/Eg5, KIF20A, KIF21A, KIF25), myosin 1G (MYO1G), myosin heavy chain 1 (MYH1) and tropomyosin 2 (TPM2) were
identified as effective inducers of non-apoptotic cell death. The cell death induced by KIF11, KIF21A, KIF25, MYH1 or TPM2
siRNAs was preceded by lysosomal membrane permeabilization, and all identified siRNAs induced several changes in the
endo-lysosomal compartment, i.e. increased lysosomal volume (KIF11, KIF20A, KIF25, MYO1G, MYH1), increased cysteine
cathepsin activity (KIF20A, KIF25), altered lysosomal localization (KIF25, MYH1, TPM2), increased dextran accumulation
(KIF20A), or reduced autophagic flux (MYO1G, MYH1). Importantly, all seven siRNAs also killed human cervix cancer (HeLa)
and osteosarcoma (U-2-OS) cells and sensitized cancer cells to other lysosome-destabilizing treatments, i.e. photo-
oxidation, siramesine, etoposide or cisplatin.
Disease Drug/treatment Gene/protein Process/location Relation
www.aitslab.org
Step 1: Named entity recognition
“This is a protein”
15. Step 2: Named entity linking
“This protein matches UniProt ID P52732”
16. Step 3: Relationship extraction
“(SARS-CoV-2)-binds to-(ACE2)”
SARS-CoV2 binding to ACE2 can lead to
excessive angiotensin II signaling, which
activates the STING pathway in mice.
Relation: binds to
22. Biomedical NLP has many applications
• Information extraction
• Question answering systems
• Semantic search
• Topic modelling
• Dialogue systems (chat bots)
• Speech recognition
• Translation
• Text classification
• Sentiment analysis
• Text summarization
• Image/Video captioning
NLP can support all tasks that
involve analysis and generation
of text or speech!
23. NLP can support many steps of drug development
Drug discovery
Gene-disease mapping
Biomarker discovery
Drug repurposing
Drug target identification
Lead ranking
Adverse event prediction
Clinical trials
Patient identification
Patient monitoring
Pharmacogenomics
Regulatory document preparation
Pharmacovigilance
Adverse event detection
Drug-drug interaction mapping
Patient data anonymization
28. Incorporating existing knowledge can improve NLP
Entity candidates Linked entities
A
F
B
G
C
H
D
I
E
A
B
C
D
A
B
C
D
Entity ranking
29. NLP is a difficult task!
The inspectors sent restaurant employees
home because they suspected a disease
outbreak.
30. Why is NLP difficult?
• Ambiguity
• The inspectors sent restaurant employees home because they suspected a disease outbreak.
• We ran a Western blot to measure RAN levels.
• Co-reference
COVID19 has spread all over the world. This disease…
• Synonymous expressions
• Synonyms: SARS-CoV2, Wuhan seafood market virus, 2019 novel coronavirus
• Synonymous expressions: This caused cell death./This led to cellular demise./This killed the
cells./The viability was greatly reduced./The cells were eradicated.
• Abbreviations
• Negations
• Uncertainty
• This could potentially result from SARS-CoV2 binding to protein X.
38. Contact me
for projects
or collaborations!
Contact me
for projects
or collaborations!
Cell Death, Lysosomes and AI Group
Computer vision
Natural language processing
Data science in medicine and sustainability
http://aitslab.org/
http://research.med.lu.se/sonja-aits
https://www.youtube.com/AitsLab
https://www.linkedin.com/in/sonjaaits/
Twitter: @Aitslab
sonja.aits@med.lu.se