Named Entity Recognition, Concept Normalization and Clinical Coding: Overview of the Cantemist Track for Cancer Text Mining in Spanish, Corpus, Guidelines, Methods and Results (Cantemist task overview, talk at IberLEF workshop of SEPLN 2020)

Antonio Miranda-Escalada, Eulàlia Farré, Martin Krallinger, Barcelona Supercomputing Center
José Antonio López Martín, Hospital 12 Octubre
antonio.miranda@bsc.es
Cantemist task overview at IberLEF workshop (SEPLN 2020)
temu.bsc.es/cantemist
tinyurl.com/yxdazqfm
doi.org/10.5281/zenodo.3878178
Cantemist

Cantemist Scientific Committee
➢ Ashish Tendulkar, Google Research
➢ Tristan Naumann, Microsoft Research Healthcare NExT, USA
➢ Parminder Bhatia, Amazon Health AI, USA
➢ Kirk Roberts, School of Biomedical Informatics, University of Texas Health Science Center, USA
➢ Irene Spasic, School of Computer Science & Informatics, co-Director of the Data Innovation Research Institute, Cardiff University, UK
➢ Alfonso Valencia Herrera, Barcelona Supercomputing Center (BSC-CNS), Spain
➢ Hercules Dalianis, Department of Computer and Systems Sciences, Stockholm University, Sweden
➢ Kevin Bretonnel Cohen, Colorado School of Medicine, USA; LIMSI, CNRS, Université Paris-Saclay, France
➢ Karin Verspoor, School of Computing and Information Systems, Health and Biomedical Informatics Centre, University of Melbourne, Australia
➢ Aurélie Névéol, LIMSI-CNRS, Université Paris-Sud, France
➢ Goran Nenadic, Department of Computer Science, University of Manchester
➢ Zhiyong Lu, Deputy Director for Literature Search, National Center for Biotechnology Information (NCBI)
➢ Antonio Martinez, Head Pathology, Director National EQAS GCP, Spanish Society of Pathology, SEAP-IAP
➢ Mauro Oruezabal, Head of Medical Oncology Service, Hospital Universitario Rey Juan Carlos, Spain
➢ Carlos Luis Parra Calderón, Head of Technological Innovation, Virgen del Rocío University Hospital, Institute of Biomedicine of Seville, Spain
Biomedical Text Mining - Cantemist: tinyurl.com/yxdazqfm

Precision medicine

Past medical shared tasks in Spanish
Shared task | Conference | Year | Description | Links
BARR | IberEval | 2017 | Abbreviations in medical documents |
BARR2 | IberEval | 2018 | Abbreviations in medical documents | https://temu.bsc.es/BARR2/organization.html
DIANN | IberEval | 2018 | Disability annotation on biomedical domain documents | http://nlp.uned.es/diann/
eHealth-KD | IberLEF | 2018-20 | Semantic relations in health-related sentences | knowledge-learning.github.io/ehealthkd-2020/ ; knowledge-learning.github.io/ehealthkd-2019/
WMT19 | WMT19 - ACL | 2019 | Biomedical Translation Task | doi.org/10.5281/zenodo.3562535 ; statmt.org/wmt19/biomedical-translation-task.html
MEDDOCAN | IberLEF | 2019 | Anonymization of medical documents | temu.bsc.es/meddocan/
PharmaCoNER | BioNLP-EMNLP | 2019 | Recognition of drugs, medications and chemical substances in medical texts | temu.bsc.es/pharmaconer/
MESINESP | CLEF BioASQ | 2020 | Automatic indexing of medical literature summaries | temu.bsc.es/mesinesp/
CodiEsp | CLEF eHealth | 2020 | Clinical case coding | temu.bsc.es/codiesp/
Cantemist | IberLEF | 2020 | Tumor morphology named entity recognition, normalization and coding |

Current scenario
Cancer causes 1 in 6 deaths worldwide
There are many unstructured text sources in oncology:
➢ Scientific literature
➢ Patents
➢ Clinical case reports
➢ Biobanks free text metadata
➢ Pathology reports
➢ Oncology reports

Clinical NLP and cancer
1. Use cases
➢ Create databases from information in cancer literature
➢ Carry out population-level epidemiologic studies
➢ Identify treatment/diagnostic gaps
➢ Precision oncology
2. NLP systems (pioneers)
3. Needs
➢ Annotated corpora
➢ Controlled terminologies

International Classification of Diseases for Oncology
➢ Statistical Classification of tumor topography and morphology
➢ Domain-specific extension of the International Classification of Diseases (ICD)
➢ Originally created in 1976 (last updated in 2013); heavily used worldwide, including in Spain
➢ eCIE-O is the Spanish edition
➢ Lingua franca of pathologists with an extensive use within tumor registries
Code structure: histology / behaviour + degree
Example: adenocarcinoma, well differentiated = 8140/31
➢ 8140: tumor/cell type (adeno-)
➢ 3: behaviour (carcinoma)
➢ 1: differentiation (well differentiated)
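The code structure above can be sketched in a few lines. This is a minimal illustration, not part of the Cantemist materials: the name `parse_morphology_code` and the regular expression are assumptions, and it only handles the plain "four digits / behaviour digit / optional grade digit" pattern shown in the example.

```python
import re
from typing import NamedTuple, Optional


class MorphologyCode(NamedTuple):
    histology: str         # four digits: tumor/cell type, e.g. "8140"
    behaviour: str         # one digit after the slash, e.g. "3"
    grade: Optional[str]   # optional extra digit: differentiation, e.g. "1"


# Hypothetical pattern covering only codes shaped like "8140/31" or "8010/3".
_CODE_PATTERN = re.compile(r"^(\d{4})/(\d)(\d)?$")


def parse_morphology_code(code: str) -> MorphologyCode:
    """Split an ICD-O morphology code into its three parts."""
    match = _CODE_PATTERN.match(code.strip())
    if match is None:
        raise ValueError(f"not a recognised ICD-O morphology code: {code!r}")
    return MorphologyCode(*match.groups())


example = parse_morphology_code("8140/31")
```

With the example from the slide, `example.histology` is "8140", `example.behaviour` is "3" and `example.grade` is "1"; a code without a differentiation digit, such as "8010/3", yields `grade=None`.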

Cantemist subtasks
➢ Teams may submit up to 5 runs for each subtask
NER subtask (Named Entity Recognition): finding tumor morphology mentions
● Prediction example: “Carcinoma” (position 3332 - 3341)
Normalization subtask: finding and normalizing tumor morphology mentions to ICD-O
● Prediction example: “Carcinoma” (position 3332 - 3341) - 8010/3
ICD-O coding subtask (clinical coding: indexing documents): returning a ranked list of codes for each document
● Prediction example: 8010/3

Evaluation
Clinical case: "Antecedente de haber presentado un melanoma maligno en el muslo derecho" (History of malignant melanoma in the right thigh)
[Figure: Manual Gold Standard vs. Automatic Prediction for each subtask]
● NER: character offset 36 - 51. Incorrect, micro-average F1 = 0
● Normalization: CIE-O code 8720/3 vs. CIE-O code 8720/3. Incorrect, micro-average F1 = 0
● Coding: CIE-O code 8720/3 vs. CIE-O code 8720/3. Correct, MAP = 1
https://github.com/TeMU-BSC/cantemist-evaluation-library/
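The two metrics named on this slide can be sketched as follows. This is not the official evaluation library linked above. The function names are hypothetical, exact offset matching is assumed for the micro-averaged F1, and the example assumes that 36 - 51 is the predicted span and that the sentence starts at offset 0.

```python
from typing import Dict, List, Set, Tuple

Mention = Tuple[str, int, int]  # (document id, start offset, end offset)


def micro_f1(gold: Set[Mention], pred: Set[Mention]) -> float:
    """Micro-averaged F1 over mentions pooled across all documents."""
    true_positives = len(gold & pred)  # only exact span matches count
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


def average_precision(gold_codes: Set[str], ranked: List[str]) -> float:
    """Average precision of one ranked code list against the gold codes."""
    hits, total, seen = 0, 0.0, set()
    for rank, code in enumerate(ranked, start=1):
        if code in gold_codes and code not in seen:
            hits += 1
            total += hits / rank
        seen.add(code)
    return total / len(gold_codes) if gold_codes else 0.0


def mean_average_precision(gold: Dict[str, Set[str]],
                           pred: Dict[str, List[str]]) -> float:
    """Mean of the per-document average precision."""
    if not gold:
        return 0.0
    scores = [average_precision(codes, pred.get(doc, []))
              for doc, codes in gold.items()]
    return sum(scores) / len(scores)


# Example built from the sentence on the slide (document id is made up).
sentence = "Antecedente de haber presentado un melanoma maligno en el muslo derecho"
start = sentence.find("melanoma maligno")
gold_mentions = {("case1", start, start + len("melanoma maligno"))}
pred_mentions = {("case1", 36, 51)}  # assumed to be the predicted span
ner_score = micro_f1(gold_mentions, pred_mentions)
coding_score = mean_average_precision({"case1": {"8720/3"}},
                                      {"case1": ["8720/3"]})
```

Under these assumptions the predicted span misses the gold span by one character, so `ner_score` is 0, while the document-level code list is right and `coding_score` is 1. That matches the contrast the slide draws between the mention-level and document-level subtasks.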

Generated resources
https://doi.org/10.5281/zenodo.3878178
Gold Standard Cantemist Corpus
● Spanish oncology clinical cases
● Annotated by clinical experts
● Currently extending it to 1,900 documents
● Brat and TSV format
https://doi.org/10.5281/zenodo.3773228
Cantemist guidelines
● Annotating neoplasm morphology
● Mapping annotations to eCIE-O
https://doi.org/10.5281/zenodo.4010899
Cantemist Silver Standard
● Automatic predictions of Cantemist participants on a corpus of additional clinical case reports
● Documents: 1,301 (501 + 500 + 300)
● Tokens: 1,093,501
● Manual annotations: 16,030
● Unique codes: 850
Documents: 4,932

Gold Standard Cantemist corpus example
Brat: NER subtask & Normalization subtask
TSV: ICD-O coding subtask
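As an illustration of the Brat side, the sketch below reads one standoff annotation file into mentions with their codes. The layout is an assumption, not taken from the slides: text-bound lines are assumed to look like `T1<TAB>MORFOLOGIA_NEOPLASIA 3332 3341<TAB>Carcinoma`, and the code is assumed to sit in an annotator note such as `#1<TAB>AnnotatorNotes T1<TAB>8010/3`. The name `parse_ann` is hypothetical.

```python
from typing import Dict, List, NamedTuple, Optional, Tuple


class Annotation(NamedTuple):
    start: int
    end: int
    text: str
    code: Optional[str]  # ICD-O code from the note, if any


def parse_ann(content: str) -> List[Annotation]:
    """Parse the text of one Brat .ann file (assumed layout, see above)."""
    spans: Dict[str, Tuple[int, int, str]] = {}
    codes: Dict[str, str] = {}
    for line in content.splitlines():
        fields = line.split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        if fields[0].startswith("T"):
            parts = fields[1].split()
            # parts[0] is the label; first and last items after it give the
            # overall extent, which also covers discontinuous spans.
            spans[fields[0]] = (int(parts[1]), int(parts[-1]), fields[2])
        elif fields[0].startswith("#"):
            target = fields[1].split()[1]  # "AnnotatorNotes T1" -> "T1"
            codes[target] = fields[2].strip()
    return [Annotation(start, end, text, codes.get(tid))
            for tid, (start, end, text) in spans.items()]


sample = (
    "T1\tMORFOLOGIA_NEOPLASIA 3332 3341\tCarcinoma\n"
    "#1\tAnnotatorNotes T1\t8010/3\n"
)
annotations = parse_ann(sample)
```

The NER subtask only needs `start` and `end`; the normalization subtask additionally needs `code`. The TSV file for the coding subtask is not covered here.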

Participation
[Figure: participant survey charts. Panels: Cantemist track difficulty; Adaptation of previous system; Participation in future Cantemist; Software product/startup; Multilinguality of systems; Release of systems]
● 66 registrations
● 25 submissions
● 19 papers
● 121 novel systems
● 16 countries
Many teams, multilingual and diverse

Results
● NER subtask: 11 teams with F1 > 0.80
● Norm subtask: 6 teams with F1 > 0.75
● ICD-O coding subtask: highly competitive results
[Figure: top participants' runs]

Generated software - team systems
Team Code link
LasigeBioTM https://github.com/lasigeBioTM/CANTEMIST-Participation
Hulat-UC3M https://github.com/ssantamaria94/CANTEMIST-Participation
ICB-UMA https://github.com/guilopgar/CANTEMIST-2020
Tong Wang https://github.com/18720936539/CANTEMIST
Kathrync https://github.com/kathrynchapman/CANTEMIST2020
Recognai https://github.com/recognai/cantemist-ner
Generated NER, normalization and coding software for Spanish

Conclusions
Scientific committee: 9+3+3
9 research, 3 industry & 3 clinical authorities in the committee
● Cross-disciplinary community involvement
Participants: 16
countries registered in a shared task on clinical coding of Spanish documents
● Global community involvement
Data: 1900
documents in the post-workshop Gold Standard, with clinical cases joining oncology and Covid
● https://doi.org/10.5281/zenodo.3773228
Methods: 100%
of participants that used machine learning reported employing deep learning
● The shift in paradigm has settled down
● Data-hungry methods

Need for HPC in NLP
Joint efforts, synergies & collaborations: antonio.miranda@bsc.es ; martin.krallinger@gmail.com

Thank you!
All task participants
Cantemist organizers
● Martin Krallinger, Eulàlia Farré
IberLEF organizers
Jose Antonio Lopez-Martin (Hospital 12 de Octubre) and the Sociedad Española de Oncología Médica (SEOM)
Plan de Tecnologías del Lenguaje
BITAC
● Gloria González, Toni Mas
Cantemist Scientific Committee
● Kirk Roberts
● Parminder Bhatia
● Irene Spasic
● Tristan Naumann
● Carlos Luis Parra
● Ashish Tendulkar
● Antonio Martinez
● Alfonso Valencia
● Hercules Dalianis
● Kevin Bretonnel Cohen
● Karin Verspoor
● Aurélie Névéol
● Goran Nenadic
● Zhiyong Lu
● Mauro Oruezabal
antonio.miranda@bsc.es ; martin.krallinger@gmail.com

Cite: Antonio Miranda-Escalada, Eulàlia Farré and Martin Krallinger. Named Entity Recognition, Concept Normalization and Clinical Coding:
Overview of the Cantemist Track for Cancer Text Mining in Spanish, Corpus, Guidelines, Methods and Results, in: Proceedings of the Iberian
Languages Evaluation Forum (IberLEF 2020), CEUR Workshop Proceedings, 2020.
