Named Entity Recognition, Concept Normalization and Clinical Coding: Overview of the Cantemist Track for Cancer Text Mining in Spanish, Corpus, Guidelines, Methods and Results (Cantemist task overview, talk at the IberLEF workshop of SEPLN 2020)
Cancer remains one of the leading causes of death worldwide and has a considerable impact on healthcare.
Recent research efforts by the clinical and molecular oncology communities have considerably increased
the life expectancy of patients with some cancer types. Most current cancer diagnoses are primarily
determined by pathology laboratories, which provide an essential source of information to guide the
treatment of patients with cancer. Pathology observations essentially characterize the results of microscopic or
macroscopic studies of cells or tissues following a biopsy or surgery. Clinicians and researchers alike require
systems that automatically detect, read and generate structured data representations from pathology
examinations. The resulting structured or coded clinical information, normalized using controlled vocabularies
like ICD-O or SNOMED-CT, is critical for large-scale analysis of specific tumor types or for determining response
to specific treatments or prognosis. Text mining and NLP approaches are showing promising results in transforming
medical text into useful clinical information, bridging the gap between free text and structured representations
of clinical information. Nonetheless, most cancer text mining efforts have focused exclusively on medical
records in English. Moreover, due to the lack of high-quality, manually labeled clinical texts annotated by
oncology experts, most previous efforts, even for English, relied mainly on customized dictionaries of names
or rules to recognize clinical concept mentions, despite the promising results of advanced deep learning
technologies. To address these issues we organized the Cantemist (CANcer TExt Mining Shared Task) track at
IberLEF 2020. It represents the first community effort to evaluate and promote the development of resources
for named entity recognition, concept normalization and clinical coding specifically focusing on cancer data
in Spanish. Participating systems were evaluated using the Cantemist corpus, a publicly accessible dataset
(released together with an annotation consistency analysis and guidelines) of manually annotated mentions of
tumor morphology entities and their mappings to the Spanish version of ICD-O. We received a total of
121 systems or runs from 25 teams across the three Cantemist sub-tasks, obtaining very competitive results.
Most participants implemented sophisticated AI approaches, mainly deep learning algorithms based on Long
Short-Term Memory units and language models (BERT, BETO, RoBERTa, etc.) with a classifier layer such as a
Conditional Random Field. In addition to pre-trained language models, word and character embeddings
were also explored. Cantemist corpus: https://doi.org/10.5281/zenodo.3773228
1. Named Entity Recognition, Concept Normalization and Clinical Coding:
Overview of the Cantemist Track for Cancer Text Mining in Spanish,
Corpus, Guidelines, Methods and Results
Antonio Miranda-Escalada, Eulàlia Farré, Martin Krallinger, Barcelona Supercomputing Center
José Antonio López Martín, Hospital 12 Octubre
antonio.miranda@bsc.es
Cantemist task overview at IberLEF workshop (SEPLN 2020)
temu.bsc.es/cantemist
tinyurl.com/yxdazqfm
doi.org/10.5281/zenodo.3878178
Cantemist
2. Cantemist Scientific Committee
➢ Ashish Tendulkar, Google Research
➢ Tristan Naumann, Microsoft Research Healthcare NExT, USA
➢ Parminder Bhatia, Amazon Health AI, USA
➢ Kirk Roberts, School of Biomedical Informatics, University of Texas Health Science Center, USA
➢ Irene Spasic, School of Computer Science & Informatics, co-Director of the Data Innovation Research Institute, Cardiff University, UK
➢ Alfonso Valencia Herrera, Barcelona Supercomputing Center (BSC-CNS), Spain
➢ Hercules Dalianis, Department of Computer and Systems Sciences, Stockholm University, Sweden
➢ Kevin Bretonnel Cohen, Colorado School of Medicine, USA; LIMSI, CNRS, Université Paris-Saclay, France
➢ Karin Verspoor, School of Computing and Information Systems, Health and Biomedical Informatics Centre, University of Melbourne, Australia
➢ Aurélie Névéol, LIMSI-CNRS, Université Paris-Sud, France
➢ Goran Nenadic, Department of Computer Science, University of Manchester
➢ Zhiyong Lu, Deputy Director for Literature Search, National Center for Biotechnology Information (NCBI)
➢ Antonio Martinez, Head Pathology, Director National EQAS GCP, Spanish Society of Pathology, SEAP-IAP
➢ Mauro Oruezabal, Head of Medical Oncology Service, Hospital Universitario Rey Juan Carlos, Spain
➢ Carlos Luis Parra Calderón, Head of Technological Innovation, Virgen del Rocío University Hospital, Institute of Biomedicine of Seville, Spain
Biomedical Text Mining - Cantemist: tinyurl.com/yxdazqfm
4. Past medical shared tasks in Spanish
Shared task | Conference | Year | Description | Links
BARR | IberEval | 2017 | Abbreviations in medical documents |
BARR2 | IberEval | 2018 | Abbreviations in medical documents | https://temu.bsc.es/BARR2/organization.html
DIANN | IberEval | 2018 | Disability annotation on biomedical domain documents | http://nlp.uned.es/diann/
eHealth-KD | IberLEF | 2018-20 | Semantic relations in health-related sentences | knowledge-learning.github.io/ehealthkd-2020/ ; knowledge-learning.github.io/ehealthkd-2019/
WMT19 | WMT19 - ACL | 2019 | Biomedical Translation Task | doi.org/10.5281/zenodo.3562535 ; statmt.org/wmt19/biomedical-translation-task.html
MEDDOCAN | IberLEF | 2019 | Anonymization of medical documents | temu.bsc.es/meddocan/
PharmaCoNER | BioNLP-EMNLP | 2019 | Recognition of drugs, medications and chemical substances in medical texts | temu.bsc.es/pharmaconer/
MESINESP | CLEF BioASQ | 2020 | Automatic indexing of medical literature summaries | temu.bsc.es/mesinesp/
CodiEsp | CLEF eHealth | 2020 | Clinical case coding | temu.bsc.es/codiesp/
Cantemist | IberLEF | 2020 | Tumor morphology named entity recognition, normalization and coding | temu.bsc.es/cantemist
5. Current scenario
Cancer causes 1 in 6 deaths worldwide
There are many unstructured text sources in oncology:
➢ Scientific literature
➢ Patents
➢ Clinical case reports
➢ Biobanks free text metadata
➢ Pathology reports
➢ Oncology reports
6. Clinical NLP and cancer
(1) Use cases
➢ Create databases from information in cancer literature
➢ Carry out population-level epidemiologic studies
➢ Identify treatment/diagnostic gaps
➢ Precision oncology
(2) NLP systems (pioneers)
(3) Needs
➢ Annotated corpora
➢ Controlled terminologies
7. International Classification of Diseases for Oncology
➢ Statistical Classification of tumor topography and morphology
➢ Domain-specific extension of the International Classification of Diseases (ICD)
➢ Created originally in 1976 [last update 2013 - heavily used worldwide, also in Spain]
➢ eCIE-O is the Spanish edition
➢ Lingua franca of pathologists with an extensive use within tumor registries
Code structure: histology / behaviour + degree
Example: adenocarcinoma, well differentiated = 8140/31
➢ 8140: tumor/cell type (adeno-)
➢ 3: behaviour (carcinoma)
➢ 1: differentiation (well differentiated)
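A minimal sketch of how a morphology code of this shape could be split into its parts. The function and class names are hypothetical, and the pattern covers only the code shapes that appear in this deck (for example 8140/31, 8010/3 and 8000/1/H, where /H is the corpus-specific modifier flag described in the notes); it is not a validator for the full ICD-O terminology.

```python
import re
from typing import NamedTuple, Optional

# Assumed shape: 4-digit histology, "/", 1-digit behaviour,
# optional 1-digit differentiation grade, optional "/H" modifier flag.
_CODE_RE = re.compile(r"^(\d{4})/(\d)(\d)?(/H)?$")


class MorphologyCode(NamedTuple):
    histology: str        # e.g. "8140"
    behaviour: str        # e.g. "3"
    grade: Optional[str]  # e.g. "1", or None when no grade digit is given
    has_modifier: bool    # True when the code ends in "/H"


def parse_morphology_code(code: str) -> MorphologyCode:
    """Split a code such as '8140/31' into histology, behaviour and grade."""
    match = _CODE_RE.match(code.strip())
    if match is None:
        raise ValueError(f"not a recognised morphology code: {code!r}")
    histology, behaviour, grade, modifier = match.groups()
    return MorphologyCode(histology, behaviour, grade, modifier is not None)


if __name__ == "__main__":
    for example in ("8140/31", "8010/3", "8000/1/H"):
        print(example, parse_morphology_code(example))
```

Keeping the grade optional mirrors the examples in the deck, where some codes carry a differentiation digit and others stop at the behaviour digit.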
8. Cantemist subtasks
➢ Teams may submit up to 5 runs for each subtask
NER subtask (Named Entity Recognition): finding tumor morphology mentions
● Prediction example: “Carcinoma” (position 3332 - 3341)
Normalization subtask: finding and normalizing tumor morphology mentions to ICD-O
● Prediction example: “Carcinoma” (position 3332 - 3341) - 8010/3
ICD-O coding subtask (clinical coding: indexing documents): returning a ranked list of codes for each document
● Prediction example: 8010/3
9. Evaluation
Clinical case: “Antecedente de haber presentado un melanoma maligno en el muslo derecho” (History of malignant melanoma in the right thigh)
Manual Gold Standard compared with Automatic Prediction:
➢ NER: the two annotations disagree on the span (character offset 36 - 51 vs. 36 - 63). Incorrect, micro-average F1 = 0
➢ Normalization: same CIE-O code 8720/3, but different spans (character offset 36 - 51 vs. 36 - 63). Incorrect, micro-average F1 = 0
➢ Coding: CIE-O code 8720/3 in both. Correct, MAP = 1
https://github.com/TeMU-BSC/cantemist-evaluation-library/
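The two metrics on this slide can be sketched as follows. This is an illustrative re-implementation under stated assumptions, not the official evaluation library linked above: the function names and the dictionary layout (document id mapped to a set of annotations, or to a ranked code list) are hypothetical, matches are exact, and average precision follows one common definition. Which of the two spans is the gold one is also an assumption; the scores are the same either way.

```python
from typing import Dict, Hashable, List, Set


def micro_f1(gold: Dict[str, Set[Hashable]], pred: Dict[str, Set[Hashable]]) -> float:
    """Micro-averaged F1 over exact matches, pooled across all documents."""
    true_pos = sum(len(gold.get(doc, set()) & items) for doc, items in pred.items())
    n_pred = sum(len(items) for items in pred.values())
    n_gold = sum(len(items) for items in gold.values())
    if true_pos == 0:
        return 0.0
    precision = true_pos / n_pred
    recall = true_pos / n_gold
    return 2 * precision * recall / (precision + recall)


def mean_average_precision(gold: Dict[str, Set[str]], ranked: Dict[str, List[str]]) -> float:
    """Mean over gold documents of the average precision of each ranked code list."""
    scores = []
    for doc, gold_codes in gold.items():
        hits, precision_sum = 0, 0.0
        seen: Set[str] = set()
        for rank, code in enumerate(ranked.get(doc, []), start=1):
            # A correct code counts once, at the rank where it first appears.
            if code in gold_codes and code not in seen:
                hits += 1
                precision_sum += hits / rank
            seen.add(code)
        scores.append(precision_sum / len(gold_codes) if gold_codes else 0.0)
    return sum(scores) / len(scores) if scores else 0.0


# The slide's example: the spans differ, the code is the same.
ner_score = micro_f1({"doc1": {(36, 51)}}, {"doc1": {(36, 63)}})
norm_score = micro_f1({"doc1": {(36, 51, "8720/3")}}, {"doc1": {(36, 63, "8720/3")}})
coding_score = mean_average_precision({"doc1": {"8720/3"}}, {"doc1": ["8720/3"]})
print(ner_score, norm_score, coding_score)
```

Because matching is exact, a prediction with the right code but a different span earns no credit in the NER and normalization subtasks, while the document-level coding subtask ignores spans altogether.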
10. Generated resources
https://doi.org/10.5281/zenodo.3878178
Gold Standard Cantemist Corpus
● Spanish oncology clinical cases
● Annotated by clinical experts
● Currently extending it to 1,900 documents
● Brat and TSV format
● Documents: 1,301 (501 + 500 + 300)
● Tokens: 1,093,501
● Manual annotations: 16,030
● Unique codes: 850
https://doi.org/10.5281/zenodo.3773228
Cantemist guidelines
● Annotating morphology neoplasms
● Mapping annotations to eCIE-O
https://doi.org/10.5281/zenodo.4010899
Cantemist Silver Standard
● Automatic predictions of Cantemist participants on a corpus of additional clinical case reports
● Documents: 4,932
11. Gold Standard Cantemist corpus example
Brat: NER subtask & Normalization subtask
TSV: ICD-O coding subtask
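A hedged sketch of reading mentions from a Brat standoff (.ann) file. The general Brat layout (tab-separated text-bound "T" lines and "#" note lines) is standard; the entity label MORFOLOGIA_NEOPLASIA and the convention of storing the code in an AnnotatorNotes line are assumptions about how the corpus files look, and the names Mention and parse_ann are hypothetical. Discontinuous spans are skipped for brevity.

```python
from typing import Dict, List, NamedTuple, Optional, Tuple


class Mention(NamedTuple):
    start: int
    end: int
    text: str
    code: Optional[str]  # None when no note is attached (NER-only files)


def parse_ann(ann_text: str) -> List[Mention]:
    """Collect text-bound annotations and attach the code from any note line."""
    spans: Dict[str, Tuple[int, int, str]] = {}
    codes: Dict[str, str] = {}
    for line in ann_text.splitlines():
        fields = line.split("\t")
        if len(fields) < 3:
            continue
        ann_id, info, value = fields[0], fields[1], fields[2]
        if ann_id.startswith("T"):
            parts = info.split(" ")
            if len(parts) != 3:
                continue  # discontinuous span ("start end;start end"), skipped here
            _label, start, end = parts
            spans[ann_id] = (int(start), int(end), value)
        elif ann_id.startswith("#"):
            note_parts = info.split(" ")  # e.g. "AnnotatorNotes T1"
            if len(note_parts) == 2:
                codes[note_parts[1]] = value.strip()
    return [Mention(s, e, t, codes.get(tid)) for tid, (s, e, t) in spans.items()]


# Hypothetical file content built from the prediction example in the subtask slide.
sample = (
    "T1\tMORFOLOGIA_NEOPLASIA 3332 3341\tCarcinoma\n"
    "#1\tAnnotatorNotes T1\t8010/3\n"
)
print(parse_ann(sample))
```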
12. Participation
● 66 registrations
● 25 submissions
● 19 papers
● 121 novel systems
● 16 countries
Many teams, multilingual and diverse
[Charts: Cantemist track difficulty; Adaptation of previous system; Participation future Cantemist; Software product/startup; Multilinguality of systems; Release of systems]
13. Results
● NER subtask: 11 teams with F1 > 0.80
● Norm subtask: 6 teams with F1 > 0.75
● ICD-O coding subtask: highly competitive results
[Chart: Top participants' runs]
14. Generated software - team systems
Generated NER, normalization and coding software for Spanish
Team | Code link
LasigeBioTM | https://github.com/lasigeBioTM/CANTEMIST-Participation
Hulat-UC3M | https://github.com/ssantamaria94/CANTEMIST-Participation
ICB-UMA | https://github.com/guilopgar/CANTEMIST-2020
Tong Wang | https://github.com/18720936539/CANTEMIST
Kathrync | https://github.com/kathrynchapman/CANTEMIST2020
Recognai | https://github.com/recognai/cantemist-ner
Biomedical Text Mining - Cantemist: tinyurl.com/yxdazqfm
temu.bsc.es/cantemist
15. Conclusions
Scientific committee: 9+3+3 (9 research, 3 industry & 3 clinical authorities in the committee)
● Cross-disciplinary community involvement
Participants: 16 countries registered in a shared task on clinical coding of Spanish documents
● Global community involvement
Data: 1900 documents in the post-workshop Gold Standard, with clinical cases joining oncology and Covid
● https://doi.org/10.5281/zenodo.3773228
Methods: 100% of participants that used machine learning reported employing deep learning
● The shift in paradigm has settled down
● Data-hungry methods
16. Need of HPC for NLP
Joint efforts, synergies & collaborations: antonio.miranda@bsc.es ; martin.krallinger@gmail.com
17. Thank you!
All task participants
Cantemist organizers
● Martin Krallinger, Eulàlia Farré
IberLEF organizers
Jose Antonio Lopez-Martin (Hospital 12 de Octubre) and the Sociedad Española de Oncología Médica (SEOM).
Plan de Tecnologías del Lenguaje
BITAC
● Gloria González, Toni Mas
Cantemist Scientific Committee
● Kirk Roberts
● Parminder Bhatia
● Irene Spasic
● Tristan Naumann
● Carlos Luis Parra
● Ashish Tendulkar
● Antonio Martinez
● Alfonso Valencia
● Hercules Dalianis
● Kevin Bretonnel Cohen
● Karin Verspoor
● Aurélie Névéol
● Goran Nenadic
● Zhiyong Lu
● Mauro Oruezabal
antonio.miranda@bsc.es ; martin.krallinger@gmail.com
18. Antonio Miranda-Escalada, Eulàlia Farré, Martin Krallinger, Barcelona Supercomputing Center
José Antonio López Martín, Hospital 12 Octubre
antonio.miranda@bsc.es
Cite: Antonio Miranda-Escalada, Eulàlia Farré and Martin Krallinger. Named Entity Recognition, Concept Normalization and Clinical Coding:
Overview of the Cantemist Track for Cancer Text Mining in Spanish, Corpus, Guidelines, Methods and Results, in: Proceedings of the Iberian
Languages Evaluation Forum (IberLEF 2020), CEUR Workshop Proceedings, 2020.
Editor's Notes
Welcome to the session,
the last session of IberLEF.
The task was organized by BSC, meaning me, Eulàlia and Martin, together with a clinician from H12O, José Antonio López Martín.
We had a considerable number of participants, but only had time for 6 talks, so we are collecting the rest of the participants' talks in the YouTube playlist.
First, before starting, thanks to the scientific committee: they are experts from industry, academia, hospitals, etc.
They helped us define the task, the evaluation procedure, etc.
Massive volume of clinical data
Most of it is unstructured
We want to use them to solve clinical problems
We want interoperability
COVID example: need of efficient search, retrieval, analysis, integration as well as exploitation strategies for a diversity of medical content types
If you structure that 80%, you can put it into the big data pool to find patterns and use it for precision medicine
Spanish spoken by > 572 million people (with 477 million native speakers)
Large healthcare professional community communicating in Spanish (incl. practicing physicians and nursing, midwifery or pharmaceutical personnel)
Active production of medical publications in Spanish worldwide hosted in several databases like PubMed, IBECS, SCIELO, LILACS, MEDES,...
Also other health-related content in Spanish: clinical trial databases (e.g. REEC), national health projects (ISCIII-FIS), patents, clinical practice guidelines, social media,
There is a lot of unstructured data in oncology and pathology
Spain is one of the leading oncology research countries
One of the general conclusions is the need for high-quality manually annotated corpora to foster the evaluation and development of systems
End with: the need to structure oncology clinical data; to do it well, we need annotated resources. To do it even better, we must have interoperable results, so we need terminologies
cite:
https://pubmed.ncbi.nlm.nih.gov/19135551/
https://www.sciencedirect.com/science/article/pii/S1532046412001712
https://pubmed.ncbi.nlm.nih.gov/22822041/
The World Health Organization maintains the International Classification of Diseases for Oncology, or ICD-O. It is key to retrieve structured information from clinical texts in oncology
Some tumor mentions contain a relevant modifier not included in the terminology for this concept. Then, we append /H to the code.
For example, in the file cc_onco158, we have the codes 8000/1 and 8000/1/H.
8000/1 corresponds to a mention of neoplasm (“neoplasia”, in Spanish).
In the 8000/1/H case, the mention is (in Spanish) “neoplasia de estirpe epitelial”. The modifier “estirpe epitelial” is present in the ICD-O terminology for many tumors. However, it is not present to modify specifically the code 8000/1. Then, we consider it a relevant modifier and add the /H.
It is a manually generated corpus with tumor mentions labelled by clinical experts following guidelines. So the process is reproducible and has quality control.
This corpus contains clinical case reports only in Spanish language.
It has 1300 annotated documents. Clinical experts have found mentions of tumor morphologies present in the ICD-O terminology in these 1300 documents.
despite pandemic, submission period over summer
growing interest in NLP for health
66 registrations
25 teams
19 proceedings papers
121 novel runs
don't say anything about the graphs -> no time
cross-disciplinary: ML industry, etc.
global
post-workshop gold standard: we are going to extend the corpus. It will include clinical cases that combine oncology and COVID -> and link
deep learning
BSC Cantemist tagger -> we will release our own tagger
Cantemist involved finding tumor histology mentions in clinical case reports. It focused on Spanish documents and introduced a terminology new to NLP shared tasks, ICD-O.
Solving a quite domain-specific task in a language other than English, using a terminology not that well known in NLP, could seem too niche. However, Cantemist has proven us wrong. We had strong support from the community, starting with the scientific committee, which united authorities from top Spanish hospitals, industry leaders and renowned researchers. Participants came from countries all around the globe, and 20% of them came from industry.
Considering the complexity of clinical texts and the scarcity of clinical-specific resources, the results were quite high. These results are obtained using Deep Learning. In particular, most successful approaches were (1) combining the latest word embeddings with LSTMs and CRFs and (2) incorporating language models (BERT-like). These approaches are data-hungry. Then, if we want to successfully apply our methods to the clinical world, we need to continue generating datasets and resources in Spanish and with the characteristics of clinical texts.
Indeed, when we explored the difficult annotations, we observed that they corresponded to highly specific mentions, complex mentions that are not even included in the terminology, mentions not frequent in the training data, etc. We need to continue generating these resources.
In the subset of missed annotations, 8% of the codes contain an “H”. This percentage is as low as 2% in the entire test set.
13.2% of the missed annotations include the sixth differentiation digit in their code. In contrast, this percentage is 5.6% in the entire test set
The median of appearances of the missed codes in the training and development set is 1, whereas for the test set codes is 3
Finally, 20.8% of the missed annotations have the metastasis code (8000/6), while this code accounts for 34.6% of the complete test set
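The percentages above could be computed along these lines. This is a sketch under assumptions: the function names are hypothetical, codes are plain strings as in the examples in these notes, and the median is taken over distinct codes (the notes do not say whether it was computed per code or per annotation). The sample lists are made up for illustration and are not corpus data.

```python
from collections import Counter
from statistics import median
from typing import Dict, Iterable, List


def code_profile(codes: List[str]) -> Dict[str, float]:
    """Percentage of codes with /H, with a grade digit, and equal to 8000/6."""
    total = len(codes)
    if total == 0:
        raise ValueError("empty code list")
    with_h = sum(code.endswith("/H") for code in codes)
    # A grade digit shows up as a two-digit block after the first slash, e.g. 8140/31.
    with_grade = sum(len(code.split("/")[1]) == 2 for code in codes if "/" in code)
    metastasis = sum(code == "8000/6" for code in codes)
    return {
        "with_H": 100.0 * with_h / total,
        "with_grade_digit": 100.0 * with_grade / total,
        "metastasis_8000_6": 100.0 * metastasis / total,
    }


def median_training_frequency(codes: Iterable[str], train_codes: Iterable[str]) -> float:
    """Median number of training occurrences over the distinct codes given."""
    counts = Counter(train_codes)
    return median(counts[code] for code in set(codes))


# Made-up illustration data.
missed = ["8000/6", "8140/31", "8000/1/H", "8010/3"]
train = ["8000/6", "8000/6", "8000/6", "8010/3"]
print(code_profile(missed))
print(median_training_frequency(missed, train))
```

Running the same two functions on the missed annotations and on the full test set gives the pairs of figures that the notes compare.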
Conclusions -> we want people to collaborate on future projects -> add something about the use of computationally intensive AI approaches (deep learning) > many successful systems use DL methodology that needs heavy computation and one of the
One of the conclusions we can see is that successful methods rely on computationally intensive AI approaches. One of the future ... is to provide HPC support for participants, and we are open to exploring synergies and collaborations for supporting HPC resources
Take-home message -> successful methods would benefit from the use of HPC -> we are open to collaboration -> photo of the MN
As we are part of BSC and the Spanish Supercomputing Network, we are very open to collaborating