Fine-Tuned Large Language Models for Symptom Recognition from Spanish Clinical Text
Mai A. Shaaban1, Abbas Akkasi2,*, Adnan Khan2, Majid Komeili2, Mohammad Yaqub1
1 Mohamed bin Zayed University of Artificial Intelligence, Abu Dhabi, UAE
2 School of Computer Science, Carleton University, Ottawa, Canada
*Corresponding author: +1-613-222-8134, E-mail: abbasakkasi@cunet.carleton.ca
Abstract
The accurate recognition of symptoms in clinical reports is of significant importance in healthcare and biomedical natural language processing. These entities serve as essential building blocks for clinical information extraction, enabling the retrieval of critical medical insights from vast amounts of textual data. Furthermore, the ability to identify and categorize these entities is fundamental to the development of advanced clinical decision support systems, aiding healthcare professionals in diagnosis and treatment planning. In this study, we participated in SympTEMIST, a shared task on the detection of symptoms, signs, and findings in Spanish medical documents, by combining a set of large language models fine-tuned on the data released by the task organizers.
Introduction
Biomedical Named Entity Recognition (BioNER) has emerged as a critical component in the
realm of biomedical and clinical informatics, with the primary goal of extracting meaningful
entities from vast repositories of textual clinical data. Among the diverse categories of entities in
the biomedical domain, recognizing the mention of symptoms, including signs and findings from
clinical reports, stands out as an indispensable and profoundly significant task [1]. The
recognition of symptoms, signs, and findings in clinical reports holds profound implications for
various facets of healthcare and research, making it an essential area of focus in the BioNER
domain. Clinical narratives, often recorded in the form of electronic health records (EHR),
radiology reports, pathology reports, and medical literature, are a treasure trove of information
regarding the health status of patients. These reports contain a wealth of information about
observed clinical phenomena, the manifestations of diseases, and diagnostic clues essential for
patient care, clinical decision-making, and medical research. Symptoms, signs, and findings are fundamental
components of diagnosing and managing a patient's health condition [2]. Timely and accurate
recognition of these entities can significantly enhance clinical decision support systems, enabling
healthcare providers to make more informed decisions and improve patient outcomes. Moreover,
the early identification of symptoms and signs, especially in the context of infectious diseases
and outbreaks, is critical for public health surveillance. In addition, recognizing symptoms and
findings empowers researchers to systematically extract structured information, supporting large-
scale analyses, epidemiological studies, and the development of novel medical insights [2].
Identifying relevant symptoms and findings is pivotal in identifying candidates for clinical trials,
monitoring drug safety, and evaluating the efficacy of pharmaceutical interventions. In an era of
burgeoning artificial intelligence (AI) applications in healthcare, accurate recognition of
symptoms and findings is foundational for the development of AI models for diagnostic and
decision support purposes [3].
The SympTEMIST [4] track is coordinated by the natural language processing (NLP) for
Biomedical Information Analysis team at the Barcelona Supercomputing Center and is endorsed
by Spanish and European initiatives, including DataTools4Heart, AI4HF, BARITONE, and
AI4ProfHealth. This event is scheduled to take place within the framework of the BioCreative
2023 conference. The SympTEMIST challenge encompasses three distinct subtasks, namely
SymptomNER, SymptomNorm, and SymptomMultiNorm. In this study, we focused our
participation exclusively on the initial subtask, which centers on the identification of symptoms
within clinical reports. We addressed the challenge by combining large language models
(LLMs) that had been fine-tuned on data specifically tailored to this task. The outcomes
achieved may not be comparable to NER systems designed for general text domains. However,
they offer valuable insights into the potential efficiency of ensemble models based on LLMs.
This is especially true when these models are equipped with a diverse set of classifiers that have
been finetuned for increased efficiency.
The subsequent sections of this paper are structured as follows: the next section provides a brief
overview of recently published research pertinent to the task. The "Material and Methods"
section elaborates on the approach and data employed to address the problem. The "Results and
Discussion" section discusses the results obtained. Finally, the paper concludes by proposing a
few future directions.
Related Work
Recent progress in NLP has led to significant breakthroughs in Biomedical Named Entity
Recognition (BioNER). Large Language Models (LLMs), including ChatGPT [5], BERT [6],
and its derivatives [7]–[10], have significantly impacted Named Entity Recognition (NER) tasks.
In [11], the authors introduced a deep language model that incorporates both syntactic and
semantic analysis to extract patient-reported symptoms from clinical text. A Transformer-based
[12] named entity recognition model was devised to discern neurological signs and symptoms
within text, subsequently mapping them to clinical concepts [13]. LLMs such as RoBERTa [14]
and DeBERTa [15] have pushed the state of the art in BioNER.
Despite the progress made in LLMs, there are still significant hurdles when it comes to handling
non-English languages. Although Spanish is the fourth most widely spoken language in the
world, with 559 million speakers, there is a shortage of pertinent NLP resources available in
Spanish [16]. Recent years have witnessed a surge in publications related to BioNER in Spanish
[17]. Significant advancements have primarily manifested through collaborative initiatives such
as the IberLEF [18] and BioNLP open shared tasks [19]. However, despite this growing interest,
handling the Spanish language, especially in clinical settings, poses numerous challenges.
Spanish, with its rich inflectional morphology indicating various syntactic, semantic, and
grammatical aspects (such as gender, number, etc.), presents unique complexities [17]. From a
syntactic perspective, Spanish texts feature more subordinate clauses and boast lengthy sentences
with flexible word order, allowing the subject to appear anywhere, not limited to preceding the
verb.
Material and Methods
Our approach entails fine-tuning six LLMs: XLM-RL1 and XLM-RB2 [20], BBES3 [21],
BBS4 [22], and E5-B5 and E5-L6 [23]. We then combine their predictions with a simple
ensemble technique called majority voting (MV) [24].
XLM-RL (561M parameters) and XLM-RB (279M parameters), both based on the Transformer
architecture [12], are masked language models trained on a vast dataset of over two terabytes
from filtered CommonCrawl [25]. We also leverage BBS and BBES, both pretrained on Spanish
clinical-domain data: BBS exclusively uses biomedical resources, while BBES incorporates both
biomedical and EHR data. Extensive finetuning and evaluation on clinical NER tasks
demonstrated their superior performance over other general-domain and domain-specific Spanish
models. Lastly, we leverage E5-B (278M parameters) and E5-L (560M parameters) which are
advanced text embedding models, trained through a contrastive approach using weak
supervision.
Using MV, we favor the entity labels that receive the most votes across the ensemble of models.
This approach mitigates the risks associated with relying on any single model.
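As an illustration, token-level majority voting over aligned IOB label sequences can be sketched as follows (a minimal sketch: the label names and the tie-breaking rule are illustrative assumptions, not the exact implementation used in our pipeline):

```python
from collections import Counter

def majority_vote(predictions):
    """Combine aligned per-token IOB label sequences from several models
    by majority vote; `predictions` holds one label sequence per model."""
    n_tokens = len(predictions[0])
    assert all(len(p) == n_tokens for p in predictions), "sequences must be aligned"
    voted = []
    for i in range(n_tokens):
        votes = Counter(p[i] for p in predictions)
        # On a tie, Counter preserves insertion order, so the earlier model wins.
        voted.append(votes.most_common(1)[0][0])
    return voted

# Three models disagree on the second and third tokens.
m1 = ["O", "B-SYM", "I-SYM"]
m2 = ["O", "B-SYM", "O"]
m3 = ["O", "O", "I-SYM"]
print(majority_vote([m1, m2, m3]))  # ['O', 'B-SYM', 'I-SYM']
```

One caveat of voting at the token level is that the merged sequence can contain invalid transitions (e.g. an I- tag with no preceding B- tag), which may require a post-processing repair step.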
Experiments
The SympTEMIST corpus [4] encompasses Spanish medical records, each uniquely identified by
a filename. Each record is meticulously annotated with an annotation ID, a label, start and end
positions within the text, and the corresponding text snippet. For preprocessing, we use the
SpaCy tool for tokenization, following the inside-outside-beginning (IOB) tagging scheme. The
dataset comprises a total of 744 examples, which we split into 95% training and 5% validation
for our experiments.
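To illustrate the preprocessing step, the mapping from character-offset annotations to IOB tags can be sketched as follows (a simplified sketch: a regex word tokenizer stands in for SpaCy, and the entity label and example sentence are hypothetical):

```python
import re

def to_iob(text, spans, label="SYMPTOM"):
    """Convert (start, end) character-offset annotations into IOB tags
    over a simple regex tokenization (SpaCy is used in the actual pipeline)."""
    tokens = [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", text)]
    tagged = []
    for tok, start, end in tokens:
        inside = [s for s in spans if s[0] <= start and end <= s[1]]
        if not inside:
            tagged.append((tok, "O"))
        elif any(s[0] == start for s in inside):
            tagged.append((tok, f"B-{label}"))  # token opens an annotated span
        else:
            tagged.append((tok, f"I-{label}"))  # token continues a span
    return tagged

# Hypothetical record: the span (18, 33) covers "dolor abdominal".
# dolor -> B-SYMPTOM, abdominal -> I-SYMPTOM, all other tokens -> O
print(to_iob("Paciente presenta dolor abdominal agudo", [(18, 33)]))
```

A real pipeline would additionally handle spans whose boundaries fall inside a token, which this sketch ignores for brevity.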
For testing, we were provided with 250 examples by the SympTEMIST organizers. For the final
submission, we fine-tuned the models on all 744 training examples, folding the 5% validation
split back into the training data. The hyperparameters of all six models were configured as
follows: a batch size of four, a total of 70 training epochs, and a learning rate of 5e-05 with a
linear scheduler. We then assessed the performance of these models on the distinct test set of
250 text samples released by [4].
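For reference, entity-level scores of the kind reported below can be computed with strict span matching, where a prediction counts only on an exact match; this is a sketch of the usual NER evaluation criterion, not necessarily the official SympTEMIST scorer:

```python
def span_prf(gold, pred):
    """Strict entity-level precision, recall, and F1-score: a predicted
    span is correct only on an exact (start, end, label) match."""
    gold, pred = set(gold), set(pred)
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# One of two predictions matches the gold spans exactly.
gold = [(18, 33, "SYMPTOM"), (40, 45, "SYMPTOM")]
pred = [(18, 33, "SYMPTOM"), (41, 45, "SYMPTOM")]
print(span_prf(gold, pred))  # (0.5, 0.5, 0.5)
```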
1 XLM-RL: XLM-RoBERTa Large
2 XLM-RB: XLM-RoBERTa Base
3 bsc-bio-ehr-es
4 bsc-bio-es
5 multilingual-e5-base
6 multilingual-e5-large
Results and Discussion
As depicted in Figure 1, a consistent upward trend in performance was observed for
the validation set when the number of training epochs increased for all six models. Notably, the
XLM-RL model stood out by achieving the highest F1-score of 0.7043. Although validation data
has a relatively smaller size than test data and may not fully capture the diversity and complexity
inherent in the test data, we chose the top three performing models for evaluation on the test
data (Table 1).
Insights from Table 1 suggest that if the ensemble of models in the majority voting method
generates diverse predictions, it can lead to some level of vote dilution. In other words, the
majority voting may average out correct predictions with incorrect ones, resulting in a less
accurate overall prediction. To mitigate this problem, a weighted majority voting [26] could be
considered, where higher weights are assigned to top-performing models and lower weights to
lower-performing models. To sum up, the observed results with domain-specific models on
Spanish clinical data highlight the pressing necessity for ongoing research and advancement in
this area, as they outperform general-domain models.
Model Variant   Precision   Recall   F1-score
BBES            0.70        0.57     0.63
BBS             0.69        0.62     0.65
XLM-RL          0.62        0.50     0.56
MV              0.68        0.60     0.64
Table 1: Model performance on the test data
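The weighted majority voting variant suggested above could be sketched as follows (an illustrative sketch: the weights, e.g. per-model validation F1-scores, and the label names are hypothetical):

```python
from collections import defaultdict

def weighted_vote(predictions, weights):
    """Weighted majority vote over aligned IOB label sequences: each
    model's vote is scaled by its weight, so a strong model can
    outvote several weaker ones."""
    voted = []
    for labels in zip(*predictions):
        scores = defaultdict(float)
        for label, weight in zip(labels, weights):
            scores[label] += weight
        voted.append(max(scores, key=scores.get))
    return voted

# On token 0 the strong model (0.70) overrules the other two (0.45 combined);
# on token 1 the combined weight for "O" (0.90) beats "I-SYM" (0.25).
preds = [["B-SYM", "O"], ["O", "O"], ["O", "I-SYM"]]
weights = [0.70, 0.20, 0.25]
print(weighted_vote(preds, weights))  # ['B-SYM', 'O']
```

With uniform weights this reduces to plain majority voting, so the scheme strictly generalizes the MV approach used in our submission.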
Conclusion
This study shows the imperative need for continued exploration and development of domain-
specific models in the context of Spanish clinical data. In future work, we plan to explore
weighted majority voting, assigning higher weights to the top-performing models in the
ensemble in order to reduce the vote dilution observed with uniform voting.
References
[1] T. Liang, C. Xia, Z. Zhao, Y. Jiang, Y. Yin, and P. S. Yu, “Transferring From Textual Entailment to
Biomedical Named Entity Recognition,” IEEE/ACM Trans Comput Biol Bioinform, vol. 20, no. 04,
pp. 2577–2586, Jul. 2023, doi: 10.1109/TCBB.2023.3236477.
[2] J. Li et al., “A comparative study of pre-trained language models for named entity recognition in
clinical trial eligibility criteria from multiple corpora,” BMC Med Inform Decis Mak, vol. 22, no. 3,
pp. 1–10, Sep. 2022, doi: 10.1186/S12911-022-01967-7/TABLES/8.
[3] X. Liu, G. L. Hersch, I. Khalil, and M. Devarakonda, “Clinical Trial Information Extraction with
BERT,” Proceedings - 2021 IEEE 9th International Conference on Healthcare Informatics, ISCHI
2021, pp. 505–506, Sep. 2021, doi: 10.1109/ICHI52183.2021.00092.
[4] “SympTEMIST Corpus: Gold Standard annotations for clinical symptoms, signs and findings
information extraction”, doi: 10.5281/ZENODO.8413866.
[5] T. B. Brown et al., “Language Models are Few-Shot Learners,” Adv Neural Inf Process Syst, vol.
2020-December, May 2020, Accessed: Oct. 18, 2023. [Online]. Available:
https://arxiv.org/abs/2005.14165v4
[6] J. Devlin, M. W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of Deep Bidirectional
Transformers for Language Understanding,” NAACL HLT 2019 - 2019 Conference of the North
American Chapter of the Association for Computational Linguistics: Human Language
Technologies - Proceedings of the Conference, vol. 1, pp. 4171–4186, Oct. 2018, Accessed: Oct.
18, 2023. [Online]. Available: https://arxiv.org/abs/1810.04805v2
[7] G. Wang, G. Yang, Z. Du, L. Fan, and X. Li, “ClinicalGPT: Large Language Models Finetuned with
Diverse Medical Data and Comprehensive Evaluation,” Jun. 2023, Accessed: Oct. 19, 2023.
[Online]. Available: https://arxiv.org/abs/2306.09968v1
[8] R. Luo et al., “BioGPT: Generative Pre-trained Transformer for Biomedical Text Generation and
Mining,” Brief Bioinform, vol. 23, no. 6, Oct. 2022, doi: 10.1093/bib/bbac409.
[9] Y. Gu et al., “Domain-Specific Language Model Pretraining for Biomedical Natural Language
Processing,” ACM Trans Comput Healthc, vol. 3, no. 1, p. 24, Jul. 2020, doi: 10.1145/3458754.
[10] Y. Labrak et al., “DrBERT: A Robust Pre-trained Model in French for Biomedical and Clinical
domains,” pp. 16207–16221, Apr. 2023, doi: 10.18653/v1/2023.acl-long.896.
[11] X. Luo, P. Gandhi, S. Storey, and K. Huang, “A Deep Language Model for Symptom Extraction from
Clinical Text and Its Application to Extract COVID-19 symptoms from Social Media,” IEEE J Biomed
Health Inform, vol. 26, no. 4, p. 1737, Apr. 2022, doi: 10.1109/JBHI.2021.3123192.
[12] A. Vaswani et al., “Attention is All you Need,” Adv Neural Inf Process Syst, vol. 30, 2017.
[13] S. Azizi, D. B. Hier, and D. C. Wunsch, “Enhanced neurologic concept recognition using a named
entity recognition model based on transformers,” Front Digit Health, vol. 4, Dec. 2022, doi:
10.3389/FDGTH.2022.1065581.
[14] Y. Liu et al., “RoBERTa: A Robustly Optimized BERT Pretraining Approach,” Jul. 2019, Accessed:
Oct. 18, 2023. [Online]. Available: https://arxiv.org/abs/1907.11692v1
[15] P. He, X. Liu, J. Gao, and W. Chen, “DeBERTa: Decoding-enhanced BERT with Disentangled
Attention,” ICLR 2021 - 9th International Conference on Learning Representations, Jun. 2020,
Accessed: Oct. 18, 2023. [Online]. Available: https://arxiv.org/abs/2006.03654v6
[16] M. P. Lewis and Summer Institute of Linguistics, “Ethnologue: Languages of the World,” p.
1248, 2009, Accessed: Oct. 19, 2023. [Online]. Available:
https://books.google.com/books/about/Ethnologue.html?id=FVVFPgAACAAJ
[17] G. G. Subies, Á. B. Jiménez, and P. M. Fernández, “A Survey of Spanish Clinical Language Models,”
Aug. 2023, Accessed: Oct. 19, 2023. [Online]. Available: https://arxiv.org/abs/2308.02199v1
[18] J. Gonzalo, M. Montes-Y-Gómez, and F. Rangel, “Overview of IberLEF 2022: Natural Language
Processing Challenges for Spanish and other Iberian Languages,” 2022, Accessed: Oct. 18, 2023.
[Online]. Available: https://nlp.uned.es/
[19] L. Akhtyamova, P. Martínez, K. Verspoor, and J. Cardiff, “Testing contextualized word
embeddings to improve NER in Spanish clinical case narratives,” IEEE Access, vol. 8, pp. 164717–
164726, 2020, doi: 10.1109/ACCESS.2020.3018688.
[20] A. Conneau et al., “Unsupervised Cross-lingual Representation Learning at Scale,” Proceedings of
the Annual Meeting of the Association for Computational Linguistics, pp. 8440–8451, Nov. 2019,
doi: 10.18653/v1/2020.acl-main.747.
[21] C. P. Carrino et al., “Pretrained Biomedical Language Models for Clinical NLP in Spanish,”
Proceedings of the Annual Meeting of the Association for Computational Linguistics, pp. 193–199,
2022, doi: 10.18653/V1/2022.BIONLP-1.19.
[22] P. Lewis, M. Ott, J. Du, and V. Stoyanov, “Pretrained Language Models for Biomedical and Clinical
Tasks: Understanding and Extending the State-of-the-Art,” pp. 146–157, Nov. 2020, doi:
10.18653/V1/2020.CLINICALNLP-1.17.
[23] L. Wang et al., “Text Embeddings by Weakly-Supervised Contrastive Pre-training,” Dec. 2022,
Accessed: Oct. 15, 2023. [Online]. Available: https://arxiv.org/abs/2212.03533v1
[24] R. Atallah and A. Al-Mousa, “Heart Disease Detection Using Machine Learning Majority Voting
Ensemble Method,” 2019 2nd International Conference on New Trends in Computing Sciences,
ICTCS 2019 - Proceedings, Oct. 2019, doi: 10.1109/ICTCS.2019.8923053.
[25] G. Wenzek et al., “CCNet: Extracting High Quality Monolingual Datasets from Web Crawl Data,”
LREC 2020 - 12th International Conference on Language Resources and Evaluation, Conference
Proceedings, pp. 4003–4012, Nov. 2019, Accessed: Oct. 18, 2023. [Online]. Available:
https://arxiv.org/abs/1911.00359v2
[26] A. Dogan and D. Birant, “A Weighted Majority Voting Ensemble Approach for Classification,”
UBMK 2019 - Proceedings, 4th International Conference on Computer Science and Engineering,
pp. 366–371, Sep. 2019, doi: 10.1109/UBMK.2019.8907028.