PharmaCoNER: Pharmacological Substances, Compounds and proteins Named Entity Recognition track (at BioNLP-OST Workshop, November 4, SkyCity RM 2, Hong Kong EMNLP2019)
Talk at: BioNLP-OST Workshop, November 4, SkyCity RM 2, Hong Kong EMNLP2019
Chemical compounds and drugs are among the biomedical entity types most relevant to medicine and the biosciences. The correct detection of these entities is critical for other text mining applications that build on them, such as adverse drug-reaction detection, medication-related fake news or drug-target extraction.
Although significant effort has been devoted to detecting mentions of drugs/chemicals in English texts, only very limited attempts have been made so far to recognize them in medical documents in other languages. Given the growing number of medical publications and clinical records written in Spanish, we organized the first shared task on detecting drug and chemical entities in Spanish medical documents. Additionally, we included a clinical concept-indexing sub-track that asked teams to return SNOMED-CT identifiers related to drugs/chemicals for a collection of documents. For this task, named PharmaCoNER, we generated annotation guidelines together with a corpus of 1,000 manually annotated clinical case studies. A total of 22 teams participated in sub-track 1 (77 system runs) and 7 teams in sub-track 2 (19 system runs). Top-scoring teams used sophisticated deep learning approaches, yielding very competitive results with F-measures above 0.91. These results indicate that there is real interest in promoting biomedical text mining efforts beyond English. We foresee that the PharmaCoNER annotation guidelines, corpus and participant systems will foster the development of new resources for clinical and biomedical text mining systems for Spanish medical data.
1. PharmaCoNER: Pharmacological Substances, Compounds and proteins Named Entity Recognition track
BioNLP-OST Workshop, November 4, SkyCity RM 2, Hong Kong EMNLP2019
Martin Krallinger
Head of Biological Text Mining Unit
Spanish National Cancer Research Centre
martin.krallinger@bsc.es
3. The Plan for the Advancement of Language Technology (Plan TL) aims to promote the development of NLP and machine translation in Spanish and Spain's co-official languages.
One of the flagship projects of the Plan TL concerns the healthcare and biomedical domain.
• Increase the amount, quality and availability of linguistic infrastructure.
• Transfer knowledge from the research field to industry.
• Improve the quality and capacity of public services by integrating NLP and machine translation technologies.
Identify use cases in public administration
Corpora
• Bilingual corpora (MeSpEN)
• Clinical cases corpus (SPACCC): chemicals, drugs, genes, diseases, treatments, symptoms, abbreviations, procedures
• Medical literature corpus
Terminological resources
• CUTEXT: medical term recognition (incl. negation & certainty, UMLS mapping)
• Medical bilingual glossary, abbreviation-definition pairs
Components
• Medical tokenizer, sentence splitter, lemmatizer, PoS tagger, clinical sectionizer, HeidelTime adaptation, NegEx adaptation
Evaluation campaigns
• Spanish medical abbreviations (IberEval 2017, 2018): BARR (abstracts), BARR2 (clinical cases)
• De-identification in Spanish (IberLEF 2019): MEDDOCAN
• Indexing articles using DeCS-MeSH
• Apply NLP to allow for new services
Libraries
• Secondary use of EHR:
• ICD-10 coding
• Anonymization
R&D funding agencies … AEMPS
• Technological supervision and competitive intelligence: tools for strategic decision-making
• Access to resources and services to be integrated in their portfolio
• Pharmacoepidemiology studies
Plan for the Advancement of Language Technology
http://temu.bsc.es
4. Motivation & application domains
• Need for efficient access to mentions of drugs and chemical entities in clinical texts, articles and patents.
• NER of drugs & chemical entities is critical for the subsequent detection of relations such as medication-related allergies, drug-drug interactions, disease-drug relations, drug safety-related issues (adverse effects), drug dosage, duration of treatments or drug repurposing, …
• Several shared tasks on chemical and drug name recognition exist for English: the CHEMDNER tracks and the i2b2 medication challenge.
• A lot of biomedically relevant content exists in other languages (EHRs, clinical texts, literature, social media).
• Spanish is spoken by > 572 million people worldwide, as a native, second or foreign language (477 million native speakers).
• According to WHO statistics, Spain alone has > 180k practicing physicians, > 247k nursing and midwifery personnel and 55k pharmaceutical personnel.
• Growing number of Spanish articles in PubMed, even though it contains only a fraction of the medical literature in Spanish (which is also found in other resources, e.g. Scielo, Ibecs, MEDES or Cuiden, ...).
Krallinger et al. (2014). CHEMDNER: The drugs and chemical names extraction challenge. J Cheminform.
Krallinger et al. (2017). Information retrieval and text mining technologies for chemistry. Chemical Reviews, 117(12), 7673-7761.
5. Description of the corpus: Selection & preprocessing
• Manually classified clinical case sections derived from open access Spanish medical publications, named the Spanish Clinical Case Corpus (SPACCC).
• Preprocessing: clinical case section extraction and removal of embedded figure references or citations, yielding plain text in UTF-8 encoding with each clinical case stored as a single file.
• Manual classification by a practicing oncologist and revision by a clinical documentalist to ensure that records were relevant/representative and that their structure and content were relevant for processing clinical content.
• The final corpus: 1,000 clinical cases, 16,504 sentences.
• The SPACCC corpus contains a total of 396,988 words, with an average of 396.2 words per clinical case.
• This kind of narrative shows properties of both medical literature and clinical records.
• Covers a range of medical disciplines including oncology, urology, cardiology, pneumology and infectious diseases.
6. Description of the corpus
• More granular annotation scheme covering four mention types:
• Entity type 1 (NORMALIZABLES): mentions of chemicals that can be manually normalized to a unique concept identifier (primarily SNOMED-CT).
• Entity type 2 (NO_NORMALIZABLES): mentions of chemicals that could not be normalized manually to a unique concept identifier.
• Entity type 3 (PROTEINAS): mentions of proteins/genes following an adaptation of the BioCreative GPRO track annotation guidelines (includes peptides, peptide hormones & antibodies).
• Entity type 4 (UNCLEAR): cases of general substance class mentions of clinical relevance, including certain pharmaceutical formulations, general treatments, chemotherapy programs and vaccines.
• Mentions of class "UNCLEAR" will not be evaluated for the track.
8. Manual annotation process
• The annotation process was inspired by the schemes used for the BioCreative CHEMDNER & GPRO tracks, translating the guidelines into Spanish and adapting them to the specificities/needs of clinically oriented documents.
• Adaptation was carried out by practicing physicians and medicinal chemistry experts.
• The adaptation was carried out on a sample of the corpus and coupled to an iterative process of annotation consistency analysis through inter-annotator agreement (IAA) calculation, until a high annotation quality in terms of IAA was reached.
• The final version of the 34-page annotation guidelines can be found at:
• http://zope.bioinfo.cnio.es/pharmaconer/Spanish_chemical_NE_guidelines.pdf
• The iterative refinement involved direct interaction between annotators to resolve discrepancies, using a side-by-side visualization with the discrepancies highlighted.
• The guidelines were refined to exclude therapeutic application types that did not actually correspond to a chemical entity per se.
• IAA was measured on a set of 50 records that were double annotated (blinded) by two different expert annotators: a pairwise agreement of 93% (exact entity mention comparison).
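The slide does not give the exact formula behind the pairwise agreement figure. As a rough sketch, one common way to compute agreement under exact entity mention comparison is to divide the number of mentions both annotators marked identically by the number of distinct mentions marked by either. The `Mention` tuple, the function name and the toy offsets below are illustrative assumptions, not part of the PharmaCoNER tooling.

```python
from __future__ import annotations

from typing import NamedTuple


class Mention(NamedTuple):
    """One annotated mention (hypothetical representation)."""
    doc_id: str
    start: int
    end: int
    label: str


def pairwise_agreement(a: set[Mention], b: set[Mention]) -> float:
    """Exact-match agreement: shared mentions / mentions marked by either annotator.

    Assumption: this ratio is only one possible definition of pairwise agreement.
    """
    union = a | b
    if not union:
        return 1.0  # nothing annotated by either annotator counts as full agreement
    return len(a & b) / len(union)


# Toy data: the annotators disagree on the end offset of the mention in case2.
annotator_1 = {
    Mention("case1", 10, 21, "NORMALIZABLES"),
    Mention("case1", 40, 48, "PROTEINAS"),
    Mention("case2", 5, 14, "NORMALIZABLES"),
}
annotator_2 = {
    Mention("case1", 10, 21, "NORMALIZABLES"),
    Mention("case1", 40, 48, "PROTEINAS"),
    Mention("case2", 5, 16, "NORMALIZABLES"),
}
agreement = pairwise_agreement(annotator_1, annotator_2)
print(f"{agreement:.2f}")
```

With exact comparison, a one-character boundary difference counts as a full disagreement, which is why boundary rules in the guidelines matter so much for the IAA value.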
10. Manual annotation process
Entity normalization was carried out primarily against the SNOMED-CT knowledge base.
The manual annotation of the entire corpus was carried out in a multi-step approach.
(1) Initial annotation using an adapted version of the AnnotateIt tool.
(2) The annotations were exported, trailing whitespace was removed, and double annotations of the same string were sent as an alert to the human annotators for revision/correction.
(3) The annotations were uploaded into the BRAT annotation tool. The annotators performed a final revision of the annotations to correct mistakes and add missing mentions.
(4) A senior annotator carried out a last round of revision of the entire corpus.
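Step (2) can be pictured with a small sketch. It assumes annotations are available as `(start, end, label)` character offsets over the case text, and it interprets "double annotations" as several annotations that cover the same span after trimming. Both the data layout and that interpretation are assumptions made for illustration.

```python
from collections import Counter


def strip_trailing_whitespace(text, start, end):
    """Shrink the span [start, end) so that it does not end in whitespace."""
    while end > start and text[end - 1].isspace():
        end -= 1
    return start, end


def clean_and_flag(text, annotations):
    """Trim every span and report spans that were annotated more than once.

    annotations: list of (start, end, label) tuples (hypothetical layout).
    Returns (cleaned_annotations, alerts), where alerts lists duplicated spans.
    """
    cleaned = []
    for start, end, label in annotations:
        start, end = strip_trailing_whitespace(text, start, end)
        cleaned.append((start, end, label))
    counts = Counter((start, end) for start, end, _ in cleaned)
    alerts = sorted(span for span, n in counts.items() if n > 1)
    return cleaned, alerts


text = "Se pauta paracetamol y heparina."
p = text.index("paracetamol")
h = text.index("heparina")
annotations = [
    (p, p + len("paracetamol") + 1, "NORMALIZABLES"),  # includes a trailing space
    (p, p + len("paracetamol"), "NORMALIZABLES"),      # same string, annotated twice
    (h, h + len("heparina"), "NORMALIZABLES"),
]
cleaned, alerts = clean_and_flag(text, annotations)
for start, end in alerts:
    print("review:", repr(text[start:end]))
```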
A sample set of this corpus (plain text clinical cases and their corresponding annotation in BRAT format):
http://zope.bioinfo.cnio.es/pharmaconer/sample-set.zip
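For readers unfamiliar with BRAT standoff files, the following is a minimal sketch of reading text-bound annotations (lines whose identifier starts with `T`). The sample content is made up for illustration and is not taken from the sample set; other line types and discontinuous spans are simply skipped here.

```python
from __future__ import annotations

from typing import NamedTuple


class TextBound(NamedTuple):
    ann_id: str
    label: str
    start: int
    end: int
    text: str


def parse_brat(ann_content: str) -> list[TextBound]:
    """Parse the text-bound lines ("T..." identifiers) of a BRAT .ann file."""
    mentions = []
    for line in ann_content.splitlines():
        if not line.startswith("T"):
            continue  # notes, relations, etc. are ignored in this sketch
        ann_id, type_and_span, surface = line.split("\t", 2)
        label, span = type_and_span.split(" ", 1)
        if ";" in span:
            continue  # discontinuous spans are not handled in this sketch
        start, end = (int(x) for x in span.split())
        mentions.append(TextBound(ann_id, label, start, end, surface))
    return mentions


# Made-up example content in BRAT standoff layout.
sample = (
    "T1\tNORMALIZABLES 9 20\tparacetamol\n"
    "T2\tPROTEINAS 35 43\tinsulina\n"
)
for mention in parse_brat(sample):
    print(mention.label, mention.start, mention.end, mention.text)
```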
12. Evaluation methodology
Evaluation in two different scenarios or sub-tracks:
• Mentions: classical entity-based or instance-based evaluation, which requires that system outputs match exactly the beginning and end locations of each entity tag, as well as the entity annotation type of the gold standard annotations.
• Indexing: concept indexing task where, for each document, the list of unique SNOMED concept identifiers has to be generated by participating teams; this list is compared to the manually annotated concept IDs corresponding to chemical compounds and pharmacological substances.
The primary evaluation metrics consist of micro-averaged precision, recall and F1-score: precision P = TP / (TP + FP), recall R = TP / (TP + FN) and F1 = 2PR / (P + R), where TP, FP and FN are the true positive, false positive and false negative counts summed over all documents.
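Both sub-tracks can be scored with the same micro-averaging routine once gold and predicted items are grouped per document. The sketch below is not the official evaluation script. It assumes a hypothetical layout in which each document maps to a set of items: `(start, end, label)` tuples for the mentions sub-track, or concept identifier strings for the indexing sub-track. The identifiers `C1` to `C4` are placeholders, not real SNOMED codes.

```python
def micro_prf(gold, pred):
    """Micro-averaged precision, recall and F1 over per-document item sets.

    gold, pred: dict mapping a document id to a set of hashable items.
    Counts are summed over all documents before the ratios are computed.
    """
    tp = fp = fn = 0
    for doc_id in gold.keys() | pred.keys():
        g = gold.get(doc_id, set())
        p = pred.get(doc_id, set())
        tp += len(g & p)  # exact matches only
        fp += len(p - g)
        fn += len(g - p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Mentions: the second prediction in case1 misses the end offset by one character.
gold_mentions = {
    "case1": {(26, 37, "NORMALIZABLES"), (40, 48, "NORMALIZABLES")},
    "case2": {(5, 12, "PROTEINAS")},
}
pred_mentions = {
    "case1": {(26, 37, "NORMALIZABLES"), (40, 47, "NORMALIZABLES")},
    "case2": {(5, 12, "PROTEINAS")},
}
mention_scores = micro_prf(gold_mentions, pred_mentions)

# Indexing: unique concept identifiers per document (placeholder ids).
gold_codes = {"case1": {"C1", "C2"}, "case2": {"C3"}}
pred_codes = {"case1": {"C1", "C2"}, "case2": {"C3", "C4"}}
index_scores = micro_prf(gold_codes, pred_codes)

print(mention_scores, index_scores)
```

Using sets per document mirrors the "list of unique identifiers" requirement of the indexing sub-track: repeating a concept within a document neither helps nor hurts.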
13. PharmaCoNER track additional resources
Together with the corpus we will also release the following resources:
• Spanish medical text tokenizer, sentence splitter, lemmatizer and POS tagger
• Dictionary of chemicals, compounds and drugs in Spanish
• Sense inventory of Spanish medical abbreviations and their long forms
• Spanish drug naming file with prefix and suffix rules
• Large background set of medical and health documents in Spanish
• NeuroNER tagger trained on the corpus
See on GitHub and Zenodo:
• https://github.com/PlanTL-SANIDAD
• https://zenodo.org/communities/medicalnlp
Baselines: a) simple vocabulary transfer and b) the competitive PharmaCoNER Tagger (Armengol-Estape et al., 2019), which is deep learning-based and used default parameters with a hidden layer of size 300; models were trained using GloVe embeddings and the Medical Word Embeddings for Spanish (Soares et al., 2019).
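The slides do not detail the vocabulary transfer baseline. A plausible reading is a lookup tagger: mention strings seen in the training annotations are searched for in unseen text and tagged with their training label. The sketch below follows that assumption; the function names, matching rules (case-insensitive, whole-word, longest match first) and the toy vocabulary are all illustrative choices.

```python
import re


def build_vocabulary(training_mentions):
    """Map each lower-cased training mention string to the first label seen for it."""
    vocab = {}
    for surface, label in training_mentions:
        vocab.setdefault(surface.lower(), label)
    return vocab


def tag_by_lookup(text, vocab):
    """Return sorted (start, end, label) spans for vocabulary entries found in text."""
    found = []
    # Longer entries are tried first so that they win over their substrings.
    for surface in sorted(vocab, key=len, reverse=True):
        pattern = r"(?<!\w)" + re.escape(surface) + r"(?!\w)"
        for m in re.finditer(pattern, text, flags=re.IGNORECASE):
            if any(m.start() < e and s < m.end() for s, e, _ in found):
                continue  # overlaps a match that was already accepted
            found.append((m.start(), m.end(), vocab[surface]))
    return sorted(found)


# Toy training mentions (made up for illustration).
vocab = build_vocabulary([
    ("paracetamol", "NORMALIZABLES"),
    ("insulina", "PROTEINAS"),
])
text = "Paracetamol oral e insulina."
spans = tag_by_lookup(text, vocab)
for start, end, label in spans:
    print(text[start:end], label)
```

Such a tagger cannot find mentions absent from the training data, which is what makes it a useful lower bound next to the neural baseline.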
15. Results PharmaCoNER Track 1 (mentions)
• Top scoring system, xiongying: F-score 0.91052
• Second, FSL: F-score 0.90968
• Third, mstoeckel: F-score 0.89888
• Combined systems using a voting scenario: F-score 0.92355
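The slides do not state which voting scheme produced the combined result. A simple option, sketched here purely as an assumption, is majority voting over exact mentions: a mention is kept if more than half of the systems predicted it with the same document, offsets and label.

```python
from collections import Counter


def majority_vote(system_outputs, min_votes=None):
    """Keep mentions predicted by at least min_votes systems.

    system_outputs: list with one collection of (doc_id, start, end, label)
    tuples per system. By default a strict majority is required.
    """
    if min_votes is None:
        min_votes = len(system_outputs) // 2 + 1
    # set() ensures that each system votes at most once per mention.
    votes = Counter(m for output in system_outputs for m in set(output))
    return {m for m, n in votes.items() if n >= min_votes}


# Toy outputs of three hypothetical systems.
a = ("case1", 26, 37, "NORMALIZABLES")
b = ("case1", 40, 48, "NORMALIZABLES")
c = ("case2", 5, 12, "PROTEINAS")
system_outputs = [{a, b}, {a, c}, {a, b}]
combined = majority_vote(system_outputs)
print(sorted(combined))
```

Lowering `min_votes` trades precision for recall, so the threshold is a natural parameter to tune on held-out data.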
16. Results PharmaCoNER Track 1: mention types
Performance was systematically better for the NORMALIZABLES category, 4-9 points higher than for the PROTEINAS category.
17. Results PharmaCoNER Track 2 (SNOMED CT concept indexing)
• Top scoring system, FSL: F-score 0.91593
• Second, ixamed: F-score 0.85347
• Third, xiongying: F-score 0.83914
• Combined systems: no improvement
18. Participating systems: one-line summaries
Bidirectional LSTM-CRF with convolution feature maps
BERT model
Bidirectional LSTM with linguistic features
Two Bi-LSTM layers for character and token embeddings and a CRF layer for sequence labeling
Bidirectional LSTM with different subword embeddings, attention for embedding selection, training on noisy data with noisy channel
Neural networks for NER and edit distance methods for normalization
Token classification based on BERT
Fine-tuning BERT with CRF, dynamic programming to transform result format
Resource-based approach with approximate string matching over our own set of resources
Bidirectional LSTM-CRF
Multilingual BERT fine-tuned for the PharmaCoNER data
BERT+feature
Pipeline Module of Deep Neural Exhaustive Approach
Creation of an own method for indexing concepts
Bidirectional LSTM-CRF Sequence Tagger using Pooled Contextualized Embeddings
Ft BERT
W2V[FastText]+LSTM+CR
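Several summaries above mention edit distance or approximate string matching for normalization. The sketch below illustrates the general idea only and does not reproduce any participant's system: a detected mention is mapped to the closest term in a lexicon by Levenshtein distance, subject to a distance cap. The lexicon entries and the identifiers `ID_A` and `ID_B` are placeholders, not real SNOMED-CT codes, and `max_distance` is an arbitrary choice.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def normalize(mention, lexicon, max_distance=2):
    """Return the concept id of the closest lexicon term, or None if too far."""
    best_id, best_dist = None, max_distance + 1
    for term, concept_id in lexicon.items():
        d = levenshtein(mention.lower(), term.lower())
        if d < best_dist:
            best_id, best_dist = concept_id, d
    return best_id


# Placeholder lexicon: the ids are NOT real SNOMED-CT identifiers.
lexicon = {"paracetamol": "ID_A", "ibuprofeno": "ID_B"}
print(normalize("Paracetamool", lexicon))
print(normalize("heparina", lexicon))
```

The distance cap keeps unrelated mentions from being forced onto the nearest lexicon entry, which matters for the NO_NORMALIZABLES class.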
22. Discussion
• Encouraging results in terms of performance and participation.
• System results already reach a level of performance that makes them very valuable resources for processing the vast amount of medical data generated worldwide in Spanish.
• Future tasks building on these results: detection of medication duration, dosage, drug-drug interactions, therapeutic target relations and drug/chemical-induced adverse effects.
• Certain abbreviations are still difficult.
• Collaborative generation of a larger Silver Standard corpus from the predictions of all participating teams on an additional (background) set.
• Need to cover a wider range of document/text types, e.g. social media (SMM4H 2020 COLING, Barcelona).
23. Thanks!
• Martin Krallinger
• Marta Villegas
• Siamak Barzegar
• Aitor Gonzalez
• Montse Marimon
• Felipe Soares
• Alfonso Valencia (BSC Life)
• Obdulia Rabal
• Julen Oyarzabal
• David Perez (SEAD)
• Analia Lourenço
• Martin Perez Perez
• Gael Perez Rodriguez
• Florentino Fernández Riverola
• AQuAS (Miguel Gallofre López)
• AEMPS-BIFAP (Julio Bonis Sanz)
• AEMPS-FTM (JM Simarro)
• FID-Salud/MSSSI (Elena García)
• FISEVI/Hosp. Virgen del Rocio (Carlos Parra)
• Hospital 12 de Octubre (Pablo Serrano)
• IBECS/Carlos III (Elena Primo)
• Informática Médica Hosp. Clínic (Raimundo Lozano)
• MSSSI (Maribel García Fajardo)
• RANM (Cristina V. González)
• BioCreative organizers
• Cecilia Arighi/Cathy Wu (Uni. Delaware)
• Lynette Hirschman (MITRE)