PharmaCoNER: Pharmacological Substances, Compounds and proteins Named Entity Recognition track (at BioNLP-OST Workshop, November 4, SkyCity RM 2, Hong Kong EMNLP2019)

Martin Krallinger
Head of Biological Text Mining Unit
Spanish National Cancer Research Centre
martin.krallinger@bsc.es

PharmaCoNER session at BioNLP-OST

The Plan for the Advancement of Language Technology (Plan TL)
aims to promote development of NLP and machine translation in
Spanish and Spain’s co-official languages.
One of the flagship projects of the Plan TL targets the healthcare and biomedical domain.
• Increase the amount, quality and availability of linguistic
infrastructure.
• Transfer knowledge from the research field to the industry.
• Improve the quality and capacity of public services by integrating NLP and machine translation technologies.
Identify use cases in public administration
Corpora
• Bilingual corpora (MeSpEN)
• Clinical cases corpus (SPACCC): chemicals, drugs, genes, diseases, treatments, symptoms, abbreviations, procedures
• Medical literature corpus
Terminological resources
• CUTEXT: medical term recognition (incl. negation
& certainty, UMLS mapping)
• Medical Bilingual glossary, abbreviation-
definition pairs
Components
• Medical tokenizer, sentence splitter, lemmatizer, PoS tagger, clinical sectionizer, HeidelTime adaptation, NegEx adaptation
Evaluation campaigns
• Spanish medical abbreviations (IberEval2017,
2018): BARR (abstracts), BARR2 (clinical cases)
• De-identification of Spanish texts (IberLEF 2019): MEDDOCAN
• Indexing articles using DeCS-MeSH
• Apply NLP to allow for new services
Libraries
• Secondary use of EHR:
• ICD 10 codification
• Anonymization
R&D funding
agencies … AEMPS
• Technological supervision and
competitive intelligence: tools for
strategic decision-making
• Access to resources and services to
be integrated in their portfolio
• Pharmacoepidemiology studies
Plan for the Advancement of Language Technology
http://temu.bsc.es

Motivation & application domains
• Need for efficient access to mentions of drugs and chemical entities in clinical texts, articles, patents.
• NER of drugs & chemical entities is critical for subsequent detection of relations like: medication-
related allergies, drug-drug interactions, disease-drug relations, drug safety-related issues (adverse
effects), drug dosage, duration of treatments or drug repurposing,…
• Several shared tasks have addressed chemical and drug name recognition: the CHEMDNER tracks and the i2b2 medication challenge (for English).
• A lot of bio/medically relevant content in other languages (EHRs, clinical texts, also literature, social
media)
• Spanish spoken by > 572 million people worldwide, either as a native, second or foreign language (477
million native speakers).
• According to WHO statistics, Spain alone has > 180k practicing physicians, > 247k nursing and midwifery personnel and 55k pharmaceutical personnel.
• Growing number of Spanish articles in PubMed, even though it only contains a fraction of the medical literature in Spanish (more is found in other resources, e.g. SciELO, Ibecs, MEDES or Cuiden,...).
Krallinger, et al. (2014). CHEMDNER: The drugs and chemical names extraction challenge. J Cheminform.
Krallinger, et al. (2017). Information retrieval and text mining technologies for chemistry. Chemical reviews, 117(12), 7673-7761.

Description of the corpus: Selection & preprocessing
• Manually classified clinical case sections derived from open access Spanish medical publications, named the Spanish Clinical Case Corpus (SPACCC).
• Preprocessing: extraction of the clinical case section and removal of embedded figure references and citations, yielding plain text in UTF-8 encoding with each clinical case stored as a single file.
• Manual classification by a practicing oncologist and revision by a clinical documentalist, to ensure that records were relevant and representative and that their structure and content were relevant for processing clinical content.
• The final corpus: 1000 clinical cases, 16,504 sentences.
• The SPACCC corpus contains a total of 396,988 words, with an average of 396.2 words per clinical case.
• This kind of narrative shows properties of both medical literature and clinical records.
• Covers a range of medical disciplines including oncology, urology, cardiology, pneumology and infectious diseases.

Description of the corpus
• More granular annotation scheme covering four mention types:
• Entity type 1 (NORMALIZABLES): mentions of chemicals that can be manually normalized to a
unique concept identifier (primarily SNOMED-CT)
• Entity type 2 (NO_NORMALIZABLES): mentions of chemicals that could not be normalized
manually to a unique concept identifier.
• Entity type 3 (PROTEINAS): mentions of proteins/genes following an adaptation of the
BioCreative GPRO track annotation guidelines (includes peptides, peptide hormones &
antibodies).
• Entity type 4 (UNCLEAR): cases of general substance class mentions of clinical relevance, including certain pharmaceutical formulations, general treatments, chemotherapy programs, and vaccines.
• Mentions of class “UNCLEAR” will not be evaluated for the track.

Manual annotation process
• The annotation process was inspired by the schemes used for the BioCreative CHEMDNER & GPRO tracks, translating the guidelines into Spanish and adapting them to the specificities/needs of clinically oriented documents.
• The adaptation was carried out by practicing physicians and medicinal chemistry experts.
• The adaptation was carried out on a sample of the corpus and tied to an iterative process of annotation consistency analysis through inter-annotator agreement (IAA) calculation, until a high annotation quality in terms of IAA was reached.
• The final version of the 34-page annotation guidelines can be found at:
• http://zope.bioinfo.cnio.es/pharmaconer/Spanish_chemical_NE_guidelines.pdf
• The iterative refinement relied on direct interaction between annotators to resolve discrepancies, using a side-by-side visualization with the discrepancies highlighted.
• The guidelines were refined to exclude therapeutic application types that did not actually correspond to a chemical entity per se.
• IAA was measured on a set of 50 records that were double annotated (blinded) by two different expert annotators: a pairwise agreement of 93% (exact entity mention comparison).
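The exact-match comparison behind such an agreement figure can be sketched as follows. The slides do not give the formula, so this sketch assumes agreement is the share of identical mentions (same document, offsets and entity type) among all mentions marked by either annotator; the function name and the sample mentions are illustrative only.

```python
from typing import Set, Tuple

# A mention is (document id, start offset, end offset, entity type).
Mention = Tuple[str, int, int, str]


def pairwise_agreement(a: Set[Mention], b: Set[Mention]) -> float:
    """Share of exactly matching mentions among all mentions marked by either annotator.

    Assumption: the slides do not define the agreement formula; this uses
    |A intersect B| / |A union B| over exact (doc, start, end, type) tuples.
    """
    union = a | b
    if not union:
        return 1.0  # neither annotator marked anything: trivially in agreement
    return len(a & b) / len(union)


# Illustrative double annotation of one record (offsets borrowed from the sample slide).
annotator_1 = {
    ("case_01", 36, 46, "NORMALIZABLES"),
    ("case_01", 72, 79, "UNCLEAR"),
    ("case_01", 758, 769, "PROTEINAS"),
}
annotator_2 = {
    ("case_01", 36, 46, "NORMALIZABLES"),
    ("case_01", 72, 79, "NORMALIZABLES"),  # same span, different type: a disagreement
    ("case_01", 758, 769, "PROTEINAS"),
}

agreement = pairwise_agreement(annotator_1, annotator_2)
print(f"pairwise agreement: {agreement:.2f}")
```

Because the comparison is exact, a span that both annotators marked with different types counts as two distinct mentions rather than a partial match.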

Iterative annotation refinement process overview
[Figure: iterative refinement of the PharmaCoNER entity annotation and normalization]

Manual annotation process
Entity normalization was carried out primarily against the SNOMED-CT knowledgebase.
The manual annotation of the entire corpus was carried out in a multi-step approach.
(1) Initial annotation using an adapted version of the AnnotateIt tool.
(2) The annotations were exported, trailing whitespace was removed, and double annotations of the same string were sent as alerts to the human annotators for revision/correction.
(3) The annotations were uploaded into the BRAT annotation tool, where the annotators performed a final revision to correct mistakes and add missing mentions.
(4) A senior annotator carried out a last round of revision of the entire corpus.
A sample set of this corpus (plain text clinical cases and their corresponding annotation in BRAT format):
http://zope.bioinfo.cnio.es/pharmaconer/sample-set.zip
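The cleanup in step (2) could look roughly like the sketch below. It is not the organizers' script: the `Annotation` record and both helper functions are hypothetical, and "double annotation" is assumed here to mean that the same character span was annotated more than once.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Tuple


class Annotation(NamedTuple):
    ann_id: str
    label: str
    start: int
    end: int
    text: str


def strip_trailing_whitespace(ann: Annotation) -> Annotation:
    """Drop trailing whitespace from the mention and shrink the end offset to match."""
    trimmed = ann.text.rstrip()
    removed = len(ann.text) - len(trimmed)
    return ann._replace(end=ann.end - removed, text=trimmed)


def find_double_annotations(anns: List[Annotation]) -> List[List[Annotation]]:
    """Group annotations sharing the same span (assumed meaning of 'double annotation')."""
    by_span: Dict[Tuple[int, int], List[Annotation]] = defaultdict(list)
    for ann in anns:
        by_span[(ann.start, ann.end)].append(ann)
    return [group for group in by_span.values() if len(group) > 1]


# Illustrative export: T1 carries a trailing space and overlaps exactly with T2 once trimmed.
exported = [
    Annotation("T1", "NORMALIZABLES", 36, 47, "Penicilina "),
    Annotation("T2", "UNCLEAR", 36, 46, "Penicilina"),
    Annotation("T3", "NORMALIZABLES", 897, 901, "Urea"),
]
cleaned = [strip_trailing_whitespace(a) for a in exported]
alerts = find_double_annotations(cleaned)
for group in alerts:
    print("needs revision:", [a.ann_id for a in group])
```

Trimming before grouping matters: the two annotations of "Penicilina" only collide once the stray space has been removed.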

Manual mapping of entities to SNOMED CT
T1 NORMALIZABLES 2548 2554 aminas
#1 AnnotatorNotes T1 43201005
T2 NORMALIZABLES 2223 2234 Gentamicina
T3 NORMALIZABLES 2208 2220 Clindamicina
T4 PROTEINAS 2034 2048 tromboplastina
T5 PROTEINAS 1993 2004 protrombina
T6 NORMALIZABLES 1866 1870 urea
T7 NORMALIZABLES 1827 1837 creatinina
T8 NORMALIZABLES 1477 1491 Ciprofloxacino
T9 NORMALIZABLES 1435 1445 Furosemida
T11 PROTEINAS 1029 1032 GOT
T12 PROTEINAS 1016 1019 GPT
T13 PROTEINAS 1003 1006 CPK
T14 PROTEINAS 989 992 LDH
T15 PROTEINAS 960 977 Proteinas totales
T16 NORMALIZABLES 948 949 K
T17 NORMALIZABLES 934 936 Na
T18 NORMALIZABLES 914 924 Creatinina
T19 NORMALIZABLES 897 901 Urea
T20 PROTEINAS 873 882 Dimeros D
T21 PROTEINAS 758 769 Hemoglobina
T22 UNCLEAR 72 79 alcohol
T23 NORMALIZABLES 36 46 Penicilina
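A minimal sketch of reading such annotation lines is given below. It assumes the layout seen in the sample (an entity line with id, type, start, end and text, plus an `AnnotatorNotes` line whose last field is the concept identifier) and splits on any whitespace, since tabs may have been lost in extraction. Discontinuous spans and other BRAT line types are ignored; `Entity` and `parse_ann` are illustrative names.

```python
import re
from typing import Dict, List, NamedTuple, Optional


class Entity(NamedTuple):
    ann_id: str
    label: str
    start: int
    end: int
    text: str
    snomed_id: Optional[str]


# Entity line: id, type, start offset, end offset, mention text (may contain spaces).
ENTITY_RE = re.compile(r"^(T\d+)\s+(\S+)\s+(\d+)\s+(\d+)\s+(.*)$")
# Note line: note id, "AnnotatorNotes", target entity id, concept identifier.
NOTE_RE = re.compile(r"^#\d+\s+AnnotatorNotes\s+(T\d+)\s+(\S+)$")


def parse_ann(content: str) -> List[Entity]:
    """Parse entity lines and attach the identifier from a matching note line, if any."""
    entities: Dict[str, Entity] = {}
    notes: Dict[str, str] = {}
    for raw in content.splitlines():
        line = raw.strip()
        match = ENTITY_RE.match(line)
        if match:
            ann_id, label, start, end, text = match.groups()
            entities[ann_id] = Entity(ann_id, label, int(start), int(end), text, None)
            continue
        match = NOTE_RE.match(line)
        if match:
            notes[match.group(1)] = match.group(2)
    return [e._replace(snomed_id=notes.get(e.ann_id)) for e in entities.values()]


sample = """T1 NORMALIZABLES 2548 2554 aminas
#1 AnnotatorNotes T1 43201005
T15 PROTEINAS 960 977 Proteinas totales
T22 UNCLEAR 72 79 alcohol"""

parsed = parse_ann(sample)
for entity in parsed:
    print(entity)
```

Entities without a note line simply keep `snomed_id=None`, which mirrors the sample where only T1 shows an identifier.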

Evaluation methodology
Evaluation in two different scenarios or sub-tracks:
• Mentions: classical entity-based or instance-based evaluation that requires system outputs to match exactly the beginning and end locations of each entity tag, as well as the entity annotation type of the gold standard annotations.
• Indexing: concept indexing task where, for each document, the list of unique SNOMED concept identifiers has to be generated by participating teams; it is then compared to the manually annotated concept ids corresponding to chemical compounds and pharmacological substances.
The primary evaluation metrics will consist of micro-averaged precision, recall and F1-scores.
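As a rough illustration of micro-averaging, the sketch below pools true positives, false positives and false negatives over all documents before computing precision, recall and F1. It is not the official evaluation script; `micro_prf` and the toy data are assumptions made for this example. Items are compared as exact tuples, so a mention only counts when start, end and type all match.

```python
from typing import Dict, Tuple


def micro_prf(gold: Dict[str, set], pred: Dict[str, set]) -> Tuple[float, float, float]:
    """Micro-averaged precision, recall and F1 over per-document sets of items."""
    tp = fp = fn = 0
    for doc_id in set(gold) | set(pred):
        g = gold.get(doc_id, set())
        p = pred.get(doc_id, set())
        tp += len(g & p)   # predicted and in the gold standard
        fp += len(p - g)   # predicted but not in the gold standard
        fn += len(g - p)   # in the gold standard but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Mentions sub-track style items: (start, end, entity type). Values are illustrative.
gold_mentions = {
    "case_01": {(36, 46, "NORMALIZABLES"), (758, 769, "PROTEINAS")},
    "case_02": {(897, 901, "NORMALIZABLES"), (914, 924, "NORMALIZABLES")},
}
pred_mentions = {
    "case_01": {(36, 46, "NORMALIZABLES"), (758, 769, "NORMALIZABLES")},  # wrong type
    "case_02": {(897, 901, "NORMALIZABLES"), (914, 924, "NORMALIZABLES")},
}

precision, recall, f1 = micro_prf(gold_mentions, pred_mentions)
print(f"P={precision:.2f} R={recall:.2f} F1={f1:.2f}")
```

The same function would serve the indexing scenario if each document maps to its set of unique concept identifiers instead of mention tuples. Mentions of class UNCLEAR would be dropped from both sides beforehand, since that class is not evaluated.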

PharmaCoNER track additional resources
Together with the corpus we will also release the following resources:
• Spanish medical text tokenizer, sentence splitter, lemmatizer and POS tagger
• Dictionary of chemicals, compounds and drugs in Spanish
• Sense inventory of Spanish medical abbreviations and their long forms
• Spanish drug naming file with prefixes and suffixes rules
• Large background set of medical and health documents in Spanish
• NeuroNER trained tagger on the corpus
See on Github and Zenodo:
• https://github.com/PlanTL-SANIDAD
• https://zenodo.org/communities/medicalnlp
Baselines: a) simple vocabulary transfer and b) the competitive PharmaCoNER Tagger (Armengol-Estape et al., 2019): deep learning-based, default parameters, a hidden layer of size 300; models were trained using GloVe embeddings and the Medical Word Embeddings for Spanish (Soares et al., 2019)
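The slides do not describe how baseline a) works. One plausible reading, sketched below purely as an assumption, is a lookup tagger: mention strings seen in the training annotations are collected with their labels and then matched verbatim (case-insensitively) in unseen text. `build_vocabulary` and `tag` are hypothetical helpers, not part of any released resource.

```python
import re
from typing import Dict, List, Tuple


def build_vocabulary(training_mentions: List[Tuple[str, str]]) -> Dict[str, str]:
    """Map each lowercased training mention to the first label it was seen with."""
    vocab: Dict[str, str] = {}
    for text, label in training_mentions:
        vocab.setdefault(text.lower(), label)
    return vocab


def tag(text: str, vocab: Dict[str, str]) -> List[Tuple[int, int, str, str]]:
    """Return (start, end, label, surface form) for every vocabulary hit in the text."""
    if not vocab:
        return []
    # Longest terms first so that multi-word entries win over their parts.
    terms = sorted(vocab, key=len, reverse=True)
    pattern = re.compile(
        r"(?<!\w)(" + "|".join(re.escape(t) for t in terms) + r")(?!\w)",
        re.IGNORECASE,
    )
    return [
        (m.start(), m.end(), vocab[m.group(0).lower()], m.group(0))
        for m in pattern.finditer(text)
    ]


# Training mentions taken from the sample annotation slide; the test sentence is made up.
vocab = build_vocabulary([
    ("Penicilina", "NORMALIZABLES"),
    ("Hemoglobina", "PROTEINAS"),
    ("urea", "NORMALIZABLES"),
])
text = "Alergia a penicilina. Hemoglobina y urea normales."
found = tag(text, vocab)
for start, end, label, surface in found:
    print(start, end, label, surface)
```

A lookup of this kind cannot label mentions it has never seen, which is what makes it a useful lower bound against the neural tagger.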

PharmaCoNER participating teams
Sub-track 1: 22 teams (77 runs), Sub-track 2: 7 teams (19 runs)

Results PharmaCoNER
Track 1 (mentions)
Top scoring system by xiongying:
F-score 0.91052
Second FSL:
F-score 0.90968
Third mstoeckel:
F-score 0.89888
Combined systems using a voting scenario: F-score 0.92355
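How the combined run was built is not detailed on the slide. A simple voting scheme, sketched here under the assumption that a mention is kept when at least `min_votes` systems predict the exact same span and type, could look like this; the function name, threshold and toy predictions are all illustrative.

```python
from collections import Counter
from typing import Iterable, Set, Tuple

# A mention is (document id, start offset, end offset, entity type).
Mention = Tuple[str, int, int, str]


def vote(system_outputs: Iterable[Set[Mention]], min_votes: int) -> Set[Mention]:
    """Keep mentions predicted by at least `min_votes` systems (one vote per system)."""
    counts = Counter(m for output in system_outputs for m in output)
    return {m for m, n in counts.items() if n >= min_votes}


# Three hypothetical system runs over one document.
system_a = {("case_01", 36, 46, "NORMALIZABLES"), ("case_01", 758, 769, "PROTEINAS")}
system_b = {("case_01", 36, 46, "NORMALIZABLES"), ("case_01", 72, 79, "NORMALIZABLES")}
system_c = {("case_01", 36, 46, "NORMALIZABLES"), ("case_01", 758, 769, "PROTEINAS")}

combined = vote([system_a, system_b, system_c], min_votes=2)
print(sorted(combined))
```

Raising `min_votes` trades recall for precision, so the threshold would normally be tuned on held-out data.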

Results PharmaCoNER Track 1: mention types
Performance was systematically better for the NORMALIZABLES category, 4-9 points higher than for the PROTEINAS category.

Results PharmaCoNER Track 2 (SNOMED CT concept indexing)
Top scoring system FSL:
F-score of 0.91593
Second ixamed:
F-score 0.85347
Third xiongying:
F-score 0.83914
Combined systems: no improvement

Participating systems: one-line summaries
Bi Directional LSTM CRF with convolution feature maps
BERT model
Bi Directional LSTM with linguistic features
Two Bi-LSTM layers for character and token embeddings and a CRF layer for sequence labeling
Bi Directional LSTM with different subword embeddings, attention for embedding selection, training on noisy data with noisy channel
Neural networks for NER and edit distance methods for normalization
Token classification based on BERT
Fine-tuning BERT with CRF, dynamic programming to transform result format
Resource-based approach with approximate string matching over our own set of resources
Bi Directional LSTM-CRF
Multilingual BERT fine-tuned for the PharmaCoNER data
BERT+feature
Pipeline Module of Deep Neural Exhaustive Approach
Creation of an own method for indexing concepts.
Bi Directional LSTM-CRF Sequence Tagger using Pooled Contextualized Embeddings
Ft BERT
W2V[FastText]+LSTM+CR


Discussion
• Encouraging results in terms of performance and participation
• System results already reach a level of performance that makes them very valuable resources for processing the vast amount of medical data generated worldwide in Spanish.
• Future tasks building on these results: detection of medication duration, dosage, drug-drug interactions, therapeutic target relations and drug/chemical-induced adverse effects
• Certain abbreviations are still difficult
• Collaborative generation of a larger Silver Standard corpus from the predictions of all participating teams on an additional (background) set.
• Need to cover a wider range of types of documents/text, e.g. social media (SMM4H
2020 COLING, Barcelona)

Thanks!
• Martin Krallinger
• Marta Villegas
• Siamak Barzegar
• Aitor Gonzalez
• Montse Marimon
• Felipe Soares
• Alfonso Valencia (BSC Life)
• Obdulia Rabal
• Julen Oyarzabal
• David Perez (SEAD)
• Analia Lourenço
• Martin Perez Perez
• Gael Perez Rodriguez
• Florentino Fernández Riverola
• AQuAS (Miguel Gallofre López)
• AEMPS-BIFAP (Julio Bonis Sanz)
• AEMPS-FTM (JM Simarro)
• FID-Salud/MSSSI (Elena García)
• FISEVI/Hosp. Virgen del Rocio (Carlos Parra)
• Hospital 12 de Octubre (Pablo Serrano)
• IBECS/Carlos III (Elena Primo)
• Informática Médica Hosp. Clínic (Raimundo Lozano)
• MSSSI (Maribel García Fajardo)
• RANM (Cristina V. González)
• BioCreative organizers
• Cecilia Arighi/Cathy Wu (Uni. Delaware)
• Lynette Hirschman (MITRE)
