Fine Tuned Large Language Models for Symptom Recognition
from Spanish Clinical Text
Mai A. Shaaban1
, Abbas Akkasi2, *
, Adnan Khan2
, Majid Komeili2
, Mohammad Yaqub1
1
Mohamed bin Zayed University of Artificial Intelligence, Abu Dhabi, UAE, 2
School of
Computer Science, Carleton University, Ottawa, CA.
*Corresponding author: +1-613-222-8134, E-mail: abbasakkasi@cunet.carleton.ca
Abstract
The accurate recognition of symptoms in clinical reports is of critical importance in healthcare and biomedical natural language processing. These entities serve as essential building blocks for clinical information extraction, enabling the retrieval of critical medical insights from vast amounts of textual data. Furthermore, the ability to identify and categorize these entities is fundamental to the development of advanced clinical decision support systems, aiding healthcare professionals in diagnosis and treatment planning. In this study, we participated in SympTEMIST – a shared task on the detection of symptoms, signs and findings in Spanish medical documents – by combining a set of large language models finetuned on the data released by the task's organizers.
Introduction†
Biomedical Named Entity Recognition (BioNER) has emerged as a critical component in the
realm of biomedical and clinical informatics, with the primary goal of extracting meaningful
entities from vast repositories of textual clinical data. Among the diverse categories of entities in
the biomedical domain, recognizing mentions of symptoms, signs, and findings in clinical reports stands out as an indispensable and significant task [1]. The recognition of these entities holds profound implications for
various facets of healthcare and research, making it an essential area of focus in the BioNER
domain. Clinical narratives, often recorded in the form of electronic health records (EHR),
radiology reports, pathology reports, and medical literature, are a treasure trove of information
regarding the health status of patients. These reports contain a wealth of information about
observed clinical phenomena, the manifestations of diseases, and diagnostic clues essential for
patient care, clinical decision-making, and medical research. Several factors drive the importance of recognizing these entities. Symptoms, signs, and findings are fundamental components of diagnosing and managing a patient's health condition [2]. Timely and accurate recognition of these entities can significantly enhance clinical decision support systems, enabling healthcare providers to make more informed decisions and improve patient outcomes. Moreover, the early identification of symptoms and signs, especially in the context of infectious diseases and outbreaks, is critical for public health surveillance. In addition, recognizing symptoms and
findings empowers researchers to systematically extract structured information, supporting large-
scale analyses, epidemiological studies, and the development of novel medical insights [2].
Identifying relevant symptoms and findings is pivotal in identifying candidates for clinical trials,
monitoring drug safety, and evaluating the efficacy of pharmaceutical interventions. In an era of
burgeoning artificial intelligence (AI) applications in healthcare, accurate recognition of
symptoms and findings is foundational for the development of AI models for diagnostic and
decision support purposes [3].
The SympTEMIST [4] track is coordinated by the natural language processing (NLP) for
Biomedical Information Analysis team at the Barcelona Supercomputing Center and is endorsed
by Spanish and European initiatives, including DataTools4Heart, AI4HF, BARITONE, and
AI4ProfHealth. This event is scheduled to take place within the framework of the BioCreative
2023 conference. The SympTEMIST challenge encompasses three distinct subtasks, namely
SymptomNER, SymptomNorm, and SymptomMultiNorm. In this study, we focused our
participation exclusively on the initial subtask, which centers on the identification of symptoms
within clinical reports. We addressed the challenge by combining large language models (LLMs) finetuned on data specifically tailored to this task. The outcomes
achieved may not be comparable to NER systems designed for general text domains. However,
they offer valuable insights into the potential efficiency of ensemble models based on LLMs.
This is especially true when these models are equipped with a diverse set of classifiers that have
been finetuned for increased efficiency.
The subsequent sections of this paper are structured as follows: In the following section, we
provide a brief overview of recently published research pertinent to the task. Section "Material and Methods" elaborates in detail on the approach and data employed to address the problem. Following this, Section "Results and Discussion" is dedicated to the discussion of the results
obtained. Finally, the paper is concluded by proposing a few future directions.
Related Work †
Recent progress in NLP has led to significant breakthroughs in Biomedical Named Entity
Recognition (BioNER). Large language models (LLMs) including ChatGPT [5], BERT [6] and its derivatives [7]–[10] have significantly impacted Named Entity Recognition (NER) tasks. In [11], the authors introduce a deep language model that incorporates both syntactic and semantic analysis to extract patient-reported symptoms from clinical text. A Transformer-based [12] named entity recognition model was devised to discern neurological signs and symptoms within text, subsequently mapping them to clinical concepts [13]. LLMs such as RoBERTa [14] and DeBERTa [15] have pushed the state of the art in BioNER.
Despite the progress made in LLMs, there are still significant hurdles when it comes to handling
non-English languages. Despite Spanish being the fourth most widely spoken language in
the world with 559 million speakers, there is a shortage of pertinent NLP resources available in
Spanish [16]. Recent years have witnessed a surge in publications related to BioNER in Spanish
[17]. Significant advancements have primarily manifested through collaborative initiatives such
as the IberLEF [18] and BioNLP open shared tasks [19]. However, despite this growing interest,
handling the Spanish language, especially in clinical settings, poses numerous challenges.
Spanish, with its rich inflectional morphology indicating various syntactic, semantic, and
grammatical aspects (such as gender, number, etc.), presents unique complexities [17]. From a
syntactic perspective, Spanish texts feature more subordinate clauses and boast lengthy sentences
with flexible word order, allowing the subject to appear anywhere, not limited to preceding the
verb.
Material and Methods†
Our approach entails finetuning six LLMs: XLM-RL¹ and XLM-RB² [20], BBES³ [21], BBS⁴ [22], and E5-B⁵ and E5-L⁶ [23]. We combine their predictions with a simple ensemble technique, majority voting (MV) [24].
XLM-RL (561M parameters) and XLM-RB (279M parameters), both based on the Transformer
architecture [12], are masked language models trained on a vast dataset of over two terabytes
from filtered CommonCrawl [25]. We also leverage BBS and BBES, both pretrained on Spanish clinical data: BBS exclusively uses biomedical resources, while BBES incorporates both biomedical and EHR data. Extensive finetuning and evaluation on clinical NER tasks
demonstrated their superior performance over other general-domain and domain-specific Spanish
models. Lastly, we leverage E5-B (278M parameters) and E5-L (560M parameters) which are
advanced text embedding models, trained through a contrastive approach using weak
supervision.
Using MV, we favor the entity label that receives the most votes across the ensemble of models at each token. This approach mitigates the risks associated with relying on a single model.
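As an illustrative sketch (not the authors' released code), token-level majority voting over IOB label sequences produced by several models can be implemented as follows; the SYMPTOM label and the three toy predictions are hypothetical:

```python
from collections import Counter

def majority_vote(predictions):
    """Combine per-token IOB label sequences from several models.

    predictions: list of label sequences, one per model, all equal length.
    Returns, at each position, the label predicted by the most models.
    """
    n_tokens = len(predictions[0])
    assert all(len(p) == n_tokens for p in predictions)
    voted = []
    for i in range(n_tokens):
        votes = Counter(p[i] for p in predictions)
        voted.append(votes.most_common(1)[0][0])
    return voted

# Three hypothetical models disagree on the last two tokens:
models_out = [
    ["O", "B-SYMPTOM", "I-SYMPTOM"],
    ["O", "B-SYMPTOM", "O"],
    ["O", "O",         "I-SYMPTOM"],
]
print(majority_vote(models_out))  # ['O', 'B-SYMPTOM', 'I-SYMPTOM']
```

Note that `Counter.most_common` breaks ties by insertion order (CPython 3.7+), so on an even split the first model encountered with that label wins; a real system may want an explicit tie-breaking rule.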
Experiments †
The SympTEMIST corpus [4] encompasses Spanish medical records, each
uniquely identified by a filename. Each record is meticulously annotated, featuring an annotation
ID, label, start and end positions within the text, and the corresponding text snippet. For
preprocessing, we use the spaCy tool for tokenization, following the inside–outside–beginning (IOB) tagging scheme. This dataset comprises a total of 744 examples. We split the data into 95% training and 5% validation for our experiments.
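The projection of the character-offset annotations onto IOB token tags can be sketched as follows. This is a simplified, self-contained example: a regex tokenizer stands in for spaCy, and the sample sentence, offsets, and SINTOMA label are illustrative rather than taken from the corpus.

```python
import re

def to_iob(text, spans):
    """Convert character-offset annotations to IOB tags.

    spans: list of (start, end, label) tuples, in the style of the
    SympTEMIST annotation files (start/end are character offsets).
    """
    # Whitespace-delimited tokens with their character offsets.
    tokens = [(m.group(), m.start(), m.end())
              for m in re.finditer(r"\S+", text)]
    tags = []
    for tok, start, end in tokens:
        tag = "O"
        for s, e, label in spans:
            if start >= s and end <= e:
                # Token opens the span -> B-, otherwise it continues it -> I-
                tag = ("B-" if start == s else "I-") + label
                break
        tags.append(tag)
    return [t for t, _, _ in tokens], tags

text = "Paciente con dolor abdominal agudo"
spans = [(13, 28, "SINTOMA")]  # covers "dolor abdominal"
print(to_iob(text, spans))
# (['Paciente', 'con', 'dolor', 'abdominal', 'agudo'],
#  ['O', 'O', 'B-SINTOMA', 'I-SINTOMA', 'O'])
```

Tokens that only partially overlap a span fall back to "O" in this sketch; handling such misalignments (e.g. via spaCy's alignment utilities) is one of the practical details a production pipeline must address.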
For testing, we were provided with 250 examples by the SympTEMIST organizers. For the final submission, we finetuned the models on all 744 training examples, folding the 5% validation split back into the training data. The hyperparameters of all six models were configured as follows: a batch size of four, a total of 70 training epochs, and a learning rate of 5e-05 with a linear scheduler. We then assessed the performance of these models on the distinct test set of 250 text samples released by [4].
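The reported training configuration can be collected in one place. The mapping onto HuggingFace `TrainingArguments` shown in the comments is an assumed setup for illustration, since the paper does not specify the training framework.

```python
# Hyperparameters as reported in the paper, shared by all six models.
HPARAMS = {
    "per_device_train_batch_size": 4,
    "num_train_epochs": 70,
    "learning_rate": 5e-5,
    "lr_scheduler_type": "linear",
}

# Assumed usage with the transformers library (sketch, not the
# authors' script):
#
#   from transformers import TrainingArguments, Trainer
#   args = TrainingArguments(output_dir="out", **HPARAMS)
#   trainer = Trainer(model=model, args=args, train_dataset=train_ds)
#   trainer.train()

print(HPARAMS)
```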
¹ XLM-RL: XLM-RoBERTa Large
² XLM-RB: XLM-RoBERTa Base
³ BBES: bsc-bio-ehr-es
⁴ BBS: bsc-bio-es
⁵ E5-B: multilingual-e5-base
⁶ E5-L: multilingual-e5-large
Results and Discussion†
As depicted in Figure 1, a consistent upward trend in performance was observed for
the validation set when the number of training epochs increased for all six models. Notably, the
XLM-RL model stood out by achieving the highest F1-score of 0.7043. Although validation data
has a relatively smaller size than the test data and may not fully capture its diversity and complexity, we chose the top three performing models for evaluation on the test data in Table 1.
Insights from Table 1 suggest that if the ensemble of models in the majority voting method
generates diverse predictions, it can lead to some level of vote dilution. In other words, the
majority voting may average out correct predictions with incorrect ones, resulting in a less
accurate overall prediction. To mitigate this problem, a weighted majority voting [26] could be
considered, where higher weights are assigned to top-performing models and lower weights to
lower-performing models. To sum up, the observed results with domain-specific models on
Spanish clinical data highlight the pressing necessity for ongoing research and advancement in
this area, as they outperform general-domain models.
Model     Precision   Recall   F1-score
BBES      0.70        0.57     0.63
BBS       0.69        0.62     0.65
XLM-RL    0.62        0.50     0.56
MV        0.68        0.60     0.64
Table 1: Model performance (precision, recall, F1-score) on the test data
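A weighted variant of the voting scheme, as suggested above, might look like the following sketch. Using each model's validation F1-score as its weight is an assumption for illustration (the paper proposes but does not implement weighted voting [26]), and the label sequences are toy data:

```python
from collections import defaultdict

def weighted_vote(predictions, weights):
    """Weighted majority voting over per-token label sequences.

    Each model's vote counts proportionally to its weight, e.g. its
    validation F1-score (an illustrative choice, not the paper's).
    """
    voted = []
    for position in zip(*predictions):          # labels at one token
        scores = defaultdict(float)
        for label, w in zip(position, weights):
            scores[label] += w
        voted.append(max(scores, key=scores.get))
    return voted

preds = [
    ["B-SYMPTOM", "O"],
    ["O",         "O"],
    ["O",         "I-SYMPTOM"],
]
# A strong model (weight 0.65) outvotes two weaker ones (0.30 each)
# on the first token, where plain majority voting would pick "O":
print(weighted_vote(preds, [0.65, 0.30, 0.30]))  # ['B-SYMPTOM', 'O']
```

This illustrates the vote-dilution remedy discussed above: the correct prediction of a top-performing model is no longer averaged away by weaker ensemble members.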
Conclusion†
This study demonstrates the imperative need for continued exploration and development of domain-specific models for Spanish clinical data. In future work,
References
[1] T. Liang, C. Xia, Z. Zhao, Y. Jiang, Y. Yin, and P. S. Yu, “Transferring From Textual Entailment to
Biomedical Named Entity Recognition,” IEEE/ACM Trans Comput Biol Bioinform, vol. 20, no. 04,
pp. 2577–2586, Jul. 2023, doi: 10.1109/TCBB.2023.3236477.
[2] J. Li et al., “A comparative study of pre-trained language models for named entity recognition in
clinical trial eligibility criteria from multiple corpora,” BMC Med Inform Decis Mak, vol. 22, no. 3,
pp. 1–10, Sep. 2022, doi: 10.1186/S12911-022-01967-7/TABLES/8.
[3] X. Liu, G. L. Hersch, I. Khalil, and M. Devarakonda, “Clinical Trial Information Extraction with
BERT,” Proceedings - 2021 IEEE 9th International Conference on Healthcare Informatics, ICHI
2021, pp. 505–506, Sep. 2021, doi: 10.1109/ICHI52183.2021.00092.
[4] “SympTEMIST Corpus: Gold Standard annotations for clinical symptoms, signs and findings
information extraction”, doi: 10.5281/ZENODO.8413866.
[5] T. B. Brown et al., “Language Models are Few-Shot Learners,” Adv Neural Inf Process Syst, vol.
2020-December, May 2020, Accessed: Oct. 18, 2023. [Online]. Available:
https://arxiv.org/abs/2005.14165v4
[6] J. Devlin, M. W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of Deep Bidirectional
Transformers for Language Understanding,” NAACL HLT 2019 - 2019 Conference of the North
American Chapter of the Association for Computational Linguistics: Human Language
Technologies - Proceedings of the Conference, vol. 1, pp. 4171–4186, Oct. 2018, Accessed: Oct.
18, 2023. [Online]. Available: https://arxiv.org/abs/1810.04805v2
[7] G. Wang, G. Yang, Z. Du, L. Fan, and X. Li, “ClinicalGPT: Large Language Models Finetuned with
Diverse Medical Data and Comprehensive Evaluation,” Jun. 2023, Accessed: Oct. 19, 2023.
[Online]. Available: https://arxiv.org/abs/2306.09968v1
[8] R. Luo et al., “BioGPT: Generative Pre-trained Transformer for Biomedical Text Generation and
Mining,” Brief Bioinform, vol. 23, no. 6, Oct. 2022, doi: 10.1093/bib/bbac409.
[9] Y. Gu et al., “Domain-Specific Language Model Pretraining for Biomedical Natural Language
Processing,” ACM Trans Comput Healthc, vol. 3, no. 1, p. 24, Jul. 2020, doi: 10.1145/3458754.
[10] Y. Labrak et al., “DrBERT: A Robust Pre-trained Model in French for Biomedical and Clinical
domains,” pp. 16207–16221, Apr. 2023, doi: 10.18653/v1/2023.acl-long.896.
[11] X. Luo, P. Gandhi, S. Storey, and K. Huang, “A Deep Language Model for Symptom Extraction from
Clinical Text and Its Application to Extract COVID-19 symptoms from Social Media,” IEEE J Biomed
Health Inform, vol. 26, no. 4, p. 1737, Apr. 2022, doi: 10.1109/JBHI.2021.3123192.
[12] A. Vaswani et al., “Attention is All you Need,” Adv Neural Inf Process Syst, vol. 30, 2017.
[13] S. Azizi, D. B. Hier, and D. C. Wunsch, “Enhanced neurologic concept recognition using a named
entity recognition model based on transformers,” Front Digit Health, vol. 4, Dec. 2022, doi:
10.3389/FDGTH.2022.1065581.
[14] Y. Liu et al., “RoBERTa: A Robustly Optimized BERT Pretraining Approach,” Jul. 2019, Accessed:
Oct. 18, 2023. [Online]. Available: https://arxiv.org/abs/1907.11692v1
[15] P. He, X. Liu, J. Gao, and W. Chen, “DeBERTa: Decoding-enhanced BERT with Disentangled
Attention,” ICLR 2021 - 9th International Conference on Learning Representations, Jun. 2020,
Accessed: Oct. 18, 2023. [Online]. Available: https://arxiv.org/abs/2006.03654v6
[16] M. P. Lewis and Summer Institute of Linguistics, “Ethnologue: Languages of the World,” p.
1248, 2009, Accessed: Oct. 19, 2023. [Online]. Available:
https://books.google.com/books/about/Ethnologue.html?id=FVVFPgAACAAJ
[17] G. G. Subies, Á. B. Jiménez, and P. M. Fernández, “A Survey of Spanish Clinical Language Models,”
Aug. 2023, Accessed: Oct. 19, 2023. [Online]. Available: https://arxiv.org/abs/2308.02199v1
[18] J. Gonzalo, M. Montes-Y-Gómez, and F. Rangel, “Overview of IberLEF 2022: Natural Language
Processing Challenges for Spanish and other Iberian Languages,” 2022, Accessed: Oct. 18, 2023.
[Online]. Available: https://nlp.uned.es/
[19] L. Akhtyamova, P. Martínez, K. Verspoor, and J. Cardiff, “Testing contextualized word
embeddings to improve NER in Spanish clinical case narratives,” IEEE Access, vol. 8, pp. 164717–
164726, 2020, doi: 10.1109/ACCESS.2020.3018688.
[20] A. Conneau et al., “Unsupervised Cross-lingual Representation Learning at Scale,” Proceedings of
the Annual Meeting of the Association for Computational Linguistics, pp. 8440–8451, Nov. 2019,
doi: 10.18653/v1/2020.acl-main.747.
[21] C. P. Carrino et al., “Pretrained Biomedical Language Models for Clinical NLP in Spanish,”
Proceedings of the Annual Meeting of the Association for Computational Linguistics, pp. 193–199,
2022, doi: 10.18653/V1/2022.BIONLP-1.19.
[22] P. Lewis, M. Ott, J. Du, and V. Stoyanov, “Pretrained Language Models for Biomedical and Clinical
Tasks: Understanding and Extending the State-of-the-Art,” pp. 146–157, Nov. 2020, doi:
10.18653/V1/2020.CLINICALNLP-1.17.
[23] L. Wang et al., “Text Embeddings by Weakly-Supervised Contrastive Pre-training,” Dec. 2022,
Accessed: Oct. 15, 2023. [Online]. Available: https://arxiv.org/abs/2212.03533v1
[24] R. Atallah and A. Al-Mousa, “Heart Disease Detection Using Machine Learning Majority Voting
Ensemble Method,” 2019 2nd International Conference on New Trends in Computing Sciences,
ICTCS 2019 - Proceedings, Oct. 2019, doi: 10.1109/ICTCS.2019.8923053.
[25] G. Wenzek et al., “CCNet: Extracting High Quality Monolingual Datasets from Web Crawl Data,”
LREC 2020 - 12th International Conference on Language Resources and Evaluation, Conference
Proceedings, pp. 4003–4012, Nov. 2019, Accessed: Oct. 18, 2023. [Online]. Available:
https://arxiv.org/abs/1911.00359v2
[26] A. Dogan and D. Birant, “A Weighted Majority Voting Ensemble Approach for Classification,”
UBMK 2019 - Proceedings, 4th International Conference on Computer Science and Engineering,
pp. 366–371, Sep. 2019, doi: 10.1109/UBMK.2019.8907028.

More Related Content

Similar to BioCreative2023_proceedings_instructions_authors_template.pdf

PharmaCoNER: Pharmacological Substances, Compounds and proteins Named Entity ...
PharmaCoNER: Pharmacological Substances, Compounds and proteins Named Entity ...PharmaCoNER: Pharmacological Substances, Compounds and proteins Named Entity ...
PharmaCoNER: Pharmacological Substances, Compounds and proteins Named Entity ...
Martin Krallinger
 
KCI_NLP_OHSUResearchWeek2016-NLPatOHSU-final
KCI_NLP_OHSUResearchWeek2016-NLPatOHSU-finalKCI_NLP_OHSUResearchWeek2016-NLPatOHSU-final
KCI_NLP_OHSUResearchWeek2016-NLPatOHSU-final
Deborah Woodcock
 

Similar to BioCreative2023_proceedings_instructions_authors_template.pdf (20)

SBML FOR OPTIMIZING DECISION SUPPORT'S TOOLS
SBML FOR OPTIMIZING DECISION SUPPORT'S TOOLS SBML FOR OPTIMIZING DECISION SUPPORT'S TOOLS
SBML FOR OPTIMIZING DECISION SUPPORT'S TOOLS
 
Biomedical indexing and retrieval system based on language modeling approach
Biomedical indexing and retrieval system based on language modeling approachBiomedical indexing and retrieval system based on language modeling approach
Biomedical indexing and retrieval system based on language modeling approach
 
Clinical Trials Powered By Electronic Health Records
Clinical Trials Powered By Electronic Health RecordsClinical Trials Powered By Electronic Health Records
Clinical Trials Powered By Electronic Health Records
 
Presentation "Spanish Resources in Trendminer Project"
Presentation "Spanish Resources in Trendminer Project"Presentation "Spanish Resources in Trendminer Project"
Presentation "Spanish Resources in Trendminer Project"
 
PharmaCoNER: Pharmacological Substances, Compounds and proteins Named Entity ...
PharmaCoNER: Pharmacological Substances, Compounds and proteins Named Entity ...PharmaCoNER: Pharmacological Substances, Compounds and proteins Named Entity ...
PharmaCoNER: Pharmacological Substances, Compounds and proteins Named Entity ...
 
Tteh.000542
Tteh.000542Tteh.000542
Tteh.000542
 
Instant Answering By Machine Learning Approach
Instant Answering By Machine Learning ApproachInstant Answering By Machine Learning Approach
Instant Answering By Machine Learning Approach
 
Case Retrieval using Bhattacharya Coefficient with Particle Swarm Optimization
Case Retrieval using Bhattacharya Coefficient with Particle Swarm OptimizationCase Retrieval using Bhattacharya Coefficient with Particle Swarm Optimization
Case Retrieval using Bhattacharya Coefficient with Particle Swarm Optimization
 
STANDARDIZATION OF CLINICAL DOCUMENTS THROUGH HL7 - FHIR FOR COLOMBIA
STANDARDIZATION OF CLINICAL DOCUMENTS THROUGH HL7 - FHIR FOR COLOMBIASTANDARDIZATION OF CLINICAL DOCUMENTS THROUGH HL7 - FHIR FOR COLOMBIA
STANDARDIZATION OF CLINICAL DOCUMENTS THROUGH HL7 - FHIR FOR COLOMBIA
 
Standardization of Clinical Documents Through HL7 - FHIR for Colombia
Standardization of Clinical Documents Through HL7 - FHIR for ColombiaStandardization of Clinical Documents Through HL7 - FHIR for Colombia
Standardization of Clinical Documents Through HL7 - FHIR for Colombia
 
Standardization of Clinical Documents Through HL7 - FHIR for Colombia
Standardization of Clinical Documents Through HL7 - FHIR for ColombiaStandardization of Clinical Documents Through HL7 - FHIR for Colombia
Standardization of Clinical Documents Through HL7 - FHIR for Colombia
 
KCI_NLP_OHSUResearchWeek2016-NLPatOHSU-final
KCI_NLP_OHSUResearchWeek2016-NLPatOHSU-finalKCI_NLP_OHSUResearchWeek2016-NLPatOHSU-final
KCI_NLP_OHSUResearchWeek2016-NLPatOHSU-final
 
WP-2013-03-KOLMAT-GL-L
WP-2013-03-KOLMAT-GL-LWP-2013-03-KOLMAT-GL-L
WP-2013-03-KOLMAT-GL-L
 
ext mining for the Vaccine Adverse Event Reporting System: medical text class...
ext mining for the Vaccine Adverse Event Reporting System: medical text class...ext mining for the Vaccine Adverse Event Reporting System: medical text class...
ext mining for the Vaccine Adverse Event Reporting System: medical text class...
 
Bio informatics
Bio informaticsBio informatics
Bio informatics
 
Bio informatics
Bio informaticsBio informatics
Bio informatics
 
Enhancing COVID-19 forecasting through deep learning techniques and fine-tuning
Enhancing COVID-19 forecasting through deep learning techniques and fine-tuningEnhancing COVID-19 forecasting through deep learning techniques and fine-tuning
Enhancing COVID-19 forecasting through deep learning techniques and fine-tuning
 
T-BioInfo Methods and Approaches
T-BioInfo Methods and ApproachesT-BioInfo Methods and Approaches
T-BioInfo Methods and Approaches
 
T-bioinfo overview
T-bioinfo overviewT-bioinfo overview
T-bioinfo overview
 
E-Symptom Analysis System to Improve Medical Diagnosis and Treatment Recommen...
E-Symptom Analysis System to Improve Medical Diagnosis and Treatment Recommen...E-Symptom Analysis System to Improve Medical Diagnosis and Treatment Recommen...
E-Symptom Analysis System to Improve Medical Diagnosis and Treatment Recommen...
 

Recently uploaded

Recently uploaded (20)

Python Notes for mca i year students osmania university.docx
Python Notes for mca i year students osmania university.docxPython Notes for mca i year students osmania university.docx
Python Notes for mca i year students osmania university.docx
 
Google Gemini An AI Revolution in Education.pptx
Google Gemini An AI Revolution in Education.pptxGoogle Gemini An AI Revolution in Education.pptx
Google Gemini An AI Revolution in Education.pptx
 
Philosophy of china and it's charactistics
Philosophy of china and it's charactisticsPhilosophy of china and it's charactistics
Philosophy of china and it's charactistics
 
FICTIONAL SALESMAN/SALESMAN SNSW 2024.pdf
FICTIONAL SALESMAN/SALESMAN SNSW 2024.pdfFICTIONAL SALESMAN/SALESMAN SNSW 2024.pdf
FICTIONAL SALESMAN/SALESMAN SNSW 2024.pdf
 
How to Add a Tool Tip to a Field in Odoo 17
How to Add a Tool Tip to a Field in Odoo 17How to Add a Tool Tip to a Field in Odoo 17
How to Add a Tool Tip to a Field in Odoo 17
 
Basic Intentional Injuries Health Education
Basic Intentional Injuries Health EducationBasic Intentional Injuries Health Education
Basic Intentional Injuries Health Education
 
On National Teacher Day, meet the 2024-25 Kenan Fellows
On National Teacher Day, meet the 2024-25 Kenan FellowsOn National Teacher Day, meet the 2024-25 Kenan Fellows
On National Teacher Day, meet the 2024-25 Kenan Fellows
 
Tatlong Kwento ni Lola basyang-1.pdf arts
Tatlong Kwento ni Lola basyang-1.pdf artsTatlong Kwento ni Lola basyang-1.pdf arts
Tatlong Kwento ni Lola basyang-1.pdf arts
 
HMCS Vancouver Pre-Deployment Brief - May 2024 (Web Version).pptx
HMCS Vancouver Pre-Deployment Brief - May 2024 (Web Version).pptxHMCS Vancouver Pre-Deployment Brief - May 2024 (Web Version).pptx
HMCS Vancouver Pre-Deployment Brief - May 2024 (Web Version).pptx
 
COMMUNICATING NEGATIVE NEWS - APPROACHES .pptx
COMMUNICATING NEGATIVE NEWS - APPROACHES .pptxCOMMUNICATING NEGATIVE NEWS - APPROACHES .pptx
COMMUNICATING NEGATIVE NEWS - APPROACHES .pptx
 
Towards a code of practice for AI in AT.pptx
Towards a code of practice for AI in AT.pptxTowards a code of practice for AI in AT.pptx
Towards a code of practice for AI in AT.pptx
 
Food safety_Challenges food safety laboratories_.pdf
Food safety_Challenges food safety laboratories_.pdfFood safety_Challenges food safety laboratories_.pdf
Food safety_Challenges food safety laboratories_.pdf
 
On_Translating_a_Tamil_Poem_by_A_K_Ramanujan.pptx
On_Translating_a_Tamil_Poem_by_A_K_Ramanujan.pptxOn_Translating_a_Tamil_Poem_by_A_K_Ramanujan.pptx
On_Translating_a_Tamil_Poem_by_A_K_Ramanujan.pptx
 
This PowerPoint helps students to consider the concept of infinity.
This PowerPoint helps students to consider the concept of infinity.This PowerPoint helps students to consider the concept of infinity.
This PowerPoint helps students to consider the concept of infinity.
 
Mehran University Newsletter Vol-X, Issue-I, 2024
Mehran University Newsletter Vol-X, Issue-I, 2024Mehran University Newsletter Vol-X, Issue-I, 2024
Mehran University Newsletter Vol-X, Issue-I, 2024
 
Basic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptxBasic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptx
 
80 ĐỀ THI THỬ TUYỂN SINH TIẾNG ANH VÀO 10 SỞ GD – ĐT THÀNH PHỐ HỒ CHÍ MINH NĂ...
80 ĐỀ THI THỬ TUYỂN SINH TIẾNG ANH VÀO 10 SỞ GD – ĐT THÀNH PHỐ HỒ CHÍ MINH NĂ...80 ĐỀ THI THỬ TUYỂN SINH TIẾNG ANH VÀO 10 SỞ GD – ĐT THÀNH PHỐ HỒ CHÍ MINH NĂ...
80 ĐỀ THI THỬ TUYỂN SINH TIẾNG ANH VÀO 10 SỞ GD – ĐT THÀNH PHỐ HỒ CHÍ MINH NĂ...
 
Unit 3 Emotional Intelligence and Spiritual Intelligence.pdf
Unit 3 Emotional Intelligence and Spiritual Intelligence.pdfUnit 3 Emotional Intelligence and Spiritual Intelligence.pdf
Unit 3 Emotional Intelligence and Spiritual Intelligence.pdf
 
Graduate Outcomes Presentation Slides - English
Graduate Outcomes Presentation Slides - EnglishGraduate Outcomes Presentation Slides - English
Graduate Outcomes Presentation Slides - English
 
FSB Advising Checklist - Orientation 2024
FSB Advising Checklist - Orientation 2024FSB Advising Checklist - Orientation 2024
FSB Advising Checklist - Orientation 2024
 

BioCreative2023_proceedings_instructions_authors_template.pdf

  • 1. Fine Tuned Large Language Models for Symptom Recognition from Spanish Clinical Text Mai A. Shaaban1 , Abbas Akkasi2, * , Adnan Khan2 , Majid Komeili2 , Mohammad Yaqub1 1 Mohamed bin Zayed University of Artificial Intelligence, Abu Dhabi, UAE, 2 School of Computer Science, Carleton University, Ottawa, CA. *Corresponding author: +1- 613 222 8134, E-mail: abbasakkasi@cunet.carleton.ca Abstract The accurate recognition of symptoms in clinical reports is significantly important in the fields of healthcare and biomedical natural language processing. These entities serve as essential building blocks for clinical information extraction, enabling retrieval of critical medical insights from vast amounts of textual data. Furthermore, the ability to identify and categorize these entities is fundamental to development of advanced clinical decision support systems, aiding healthcare professionals in diagnosis and treatment planning. In this study, we participated in SympTEMIST – a shared task on detection of symptoms, signs and findings in medical documents in Spanish- by combining a set of large language models finetuned with the data released by the task’s organizers. Introduction† Biomedical Named Entity Recognition (BioNER) has emerged as a critical component in the realm of biomedical and clinical informatics, with the primary goal of extracting meaningful entities from vast repositories of textual clinical data. Among the diverse categories of entities in the biomedical domain, recognizing the mention of symptoms, including signs and findings from clinical reports, stands out as an indispensable and profoundly significant task [1].The recognition of symptoms, signs, and findings in clinical reports holds profound implications for various facets of healthcare and research, making it an essential area of focus in the BioNER domain. 
Clinical narratives, often recorded in the form of electronic health records (EHR), radiology reports, pathology reports, and medical literature, are a treasure trove of information regarding the health status of patients. These reports contain a wealth of information about observed clinical phenomena, the manifestations of diseases, and diagnostic clues essential for patient care, clinical decision-making, and medical research. Key drivers for the importance of recognizing these entities include symptoms, signs, and findings that are fundamental components of diagnosing and managing a patient's health condition [2]. Timely and accurate recognition of these entities can significantly enhance clinical decision support systems, enabling healthcare providers to make more informed decisions and improve patient outcomes. However, the early identification of symptoms and signs, especially in the context of infectious diseases and outbreaks, is critical for public health surveillance. In addition, recognizing symptoms and findings empowers researchers to systematically extract structured information, supporting large- scale analyses, epidemiological studies, and the development of novel medical insights [2]. Identifying relevant symptoms and findings is pivotal in identifying candidates for clinical trials, monitoring drug safety, and evaluating the efficacy of pharmaceutical interventions. In an era of
  • 2. burgeoning artificial intelligence (AI) applications in healthcare, accurate recognition of symptoms and findings is foundational for the development of AI models for diagnostic and decision support purposes [3]. The SympTEMIST [4] track is coordinated by the natural language processing (NLP) for Biomedical Information Analysis team at the Barcelona Supercomputing Center and is endorsed by Spanish and European initiatives, including DataTools4Heart, AI4HF, BARITONE, and AI4ProfHealth. This event is scheduled to take place within the framework of the BioCreative 2023 conference. The SympTEMIST challenge encompasses three distinct subtasks, namely SymptomNER, SymptomNorm, and SymptomMultiNorm. In this study, we focused our participation exclusively on the initial subtask, which centers on the identification of symptoms within clinical reports. We addressed the challenge by combining the Language Model Models (LLMs) that had been finetuned using data specifically tailored to this task. The outcomes achieved may not be comparable to NER systems designed for general text domains. However, they offer valuable insights into the potential efficiency of ensemble models based on LLMs. This is especially true when these models are equipped with a diverse set of classifiers that have been finetuned for increased efficiency. The subsequent sections of this paper are structured as follows: In the following section, we provide a brief overview of recently published research pertinent to the task. Section ‘Material and Methods” elaborates on the approach and data employed to address the problem in detail. Following this, Section "Results and Discussions” is dedicated to the discussion of the results obtained. Finally, the paper is concluded by proposing a few future directions. Related Work † Recent progress in NLP has led to significant breakthroughs in Biomedical Named Entity Recognition (BioNER). 
Large Language Models (LLMs), including ChatGPT [5], BERT [6], and its derivatives [7]–[10], have significantly impacted Named Entity Recognition (NER) tasks. In [11], the authors introduce an innovative deep language model that incorporates both syntactic and semantic analysis to extract patient-reported symptoms from clinical text. A Transformer-based [12] named entity recognition model was devised to discern neurological signs and symptoms within text, subsequently mapping them to clinical concepts [13]. LLMs such as RoBERTa [14] and DeBERTa [15] have pushed the state of the art in BioNER. Despite the progress made in LLMs, there are still significant hurdles when it comes to handling non-English languages. Although Spanish is the fourth most widely spoken language in the world, with 559 million speakers, there is a shortage of pertinent NLP resources available in Spanish [16]. Recent years have witnessed a surge in publications related to BioNER in Spanish [17]. Significant advancements have primarily manifested through collaborative initiatives such as IberLEF [18] and the BioNLP open shared tasks [19]. However, despite this growing interest, handling the Spanish language, especially in clinical settings, poses numerous challenges. Spanish, with its rich inflectional morphology indicating various syntactic, semantic, and grammatical aspects (such as gender, number, etc.), presents unique complexities [17]. From a syntactic perspective, Spanish texts feature more subordinate clauses and boast lengthy sentences with flexible word order, allowing the subject to appear anywhere, not limited to preceding the verb.
Material and Methods†

Our approach entails finetuning six LLMs: XLM-RL and XLM-RB [20], BBES [21], BBS [22], and E5-B and E5-L [23]. We combine their predictions with a simple ensemble technique called majority voting (MV) [24]. XLM-RL (561M parameters) and XLM-RB (279M parameters), both based on the Transformer architecture [12], are masked language models trained on a vast dataset of over two terabytes of filtered CommonCrawl text [25]. BBS and BBES are both pretrained on Spanish clinical data: BBS exclusively leverages biomedical resources, whereas BBES incorporates both biomedical and EHR data. Extensive finetuning and evaluation on clinical NER tasks have demonstrated their superior performance over other general-domain and domain-specific Spanish models. Lastly, we leverage E5-B (278M parameters) and E5-L (560M parameters), advanced text embedding models trained through a contrastive approach using weak supervision. With MV, we favor the entity labels that receive the most frequent consensus among the ensemble of models; this mitigates the risks associated with relying on a single model.

Experiments†

The SympTEMIST corpus [4] encompasses Spanish medical records, each uniquely identified by a filename. Each record is meticulously annotated, featuring an annotation ID, a label, start and end positions within the text, and the corresponding text snippet. For preprocessing, we use the spaCy tool for tokenization, following the inside-outside-beginning (IOB) tagging scheme. The dataset comprises a total of 744 examples, which we split into 95% training and 5% validation for our experiments. For testing, we were provided with 250 examples by the SympTEMIST organizers. For the final submission, we finetuned the models on all 744 training examples, folding the 5% validation data back into the training set.
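As an illustration, token-level majority voting over aligned IOB label sequences can be sketched as follows. This is a minimal sketch, not our exact implementation; the entity label names and per-model predictions shown are hypothetical.

```python
from collections import Counter

def majority_vote(predictions):
    """Combine per-token IOB labels from several models by majority vote.

    predictions: one label sequence per model, all aligned to the same tokens.
    Ties are broken by whichever label was seen first (Counter insertion order).
    """
    assert len({len(p) for p in predictions}) == 1, "label sequences must align"
    return [Counter(labels).most_common(1)[0][0] for labels in zip(*predictions)]

# Hypothetical predictions from three finetuned models on a four-token sentence
model_a = ["O", "B-SYMPTOM", "I-SYMPTOM", "O"]
model_b = ["O", "B-SYMPTOM", "O",         "O"]
model_c = ["O", "B-SYMPTOM", "I-SYMPTOM", "O"]

print(majority_vote([model_a, model_b, model_c]))
# → ['O', 'B-SYMPTOM', 'I-SYMPTOM', 'O']
```

Because each token is decided independently, the voted sequence can in principle contain an `I-` tag without a preceding `B-` tag; a post-processing pass can repair such inconsistencies.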
The hyperparameters of all six models were configured as follows: a batch size of four, a total of 70 training epochs, and a learning rate of 5e-05, employing a linear scheduler. Subsequently, we assessed the performance of these models using a distinct test set containing 250 text samples released by [4].

Model abbreviations: XLM-RL = XLM-RoBERTa Large; XLM-RB = XLM-RoBERTa Base; BBES = bsc-bio-ehr-es; BBS = bsc-bio-es; E5-B = multilingual-e5-base; E5-L = multilingual-e5-large.
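For reproducibility, the hyperparameter setting above could be expressed as a Hugging Face `TrainingArguments` configuration. This is a sketch under the assumption that the models were finetuned with the `transformers` Trainer API; the `output_dir` path is a placeholder, not from the original pipeline.

```python
from transformers import TrainingArguments

# Hyperparameters shared by all six models (see text); output_dir is a placeholder.
training_args = TrainingArguments(
    output_dir="./symptemist-ner",     # placeholder output path
    per_device_train_batch_size=4,     # batch size of four
    num_train_epochs=70,               # 70 training epochs
    learning_rate=5e-5,                # learning rate of 5e-05
    lr_scheduler_type="linear",        # linear learning-rate scheduler
)
```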
Results and Discussion†

As depicted in Figure 1, a consistent upward trend in performance was observed on the validation set as the number of training epochs increased for all six models. Notably, the XLM-RL model stood out by achieving the highest F1-score of 0.7043. Although the validation set is relatively small and may not fully capture the diversity and complexity inherent in the test data, we chose the top three performing models for evaluation on the test data (Table 1). Insights from Table 1 suggest that when the models in the majority voting ensemble generate diverse predictions, some degree of vote dilution can occur: the majority vote may average out correct predictions with incorrect ones, resulting in a less accurate overall prediction. To mitigate this problem, weighted majority voting [26] could be considered, where higher weights are assigned to top-performing models and lower weights to lower-performing ones. To sum up, the results observed with domain-specific models on Spanish clinical data, which outperform general-domain models, highlight the pressing necessity for ongoing research and advancement in this area.

Model     Precision   Recall   F1-score
BBES      0.70        0.57     0.63
BBS       0.69        0.62     0.65
XLM-RL    0.62        0.50     0.56
MV        0.68        0.60     0.64

Table 1: Model performance on the test data.

Conclusion†

This study shows the imperative need for continued exploration and development of domain-specific models in the context of Spanish clinical data. In future work,
References

[1] T. Liang, C. Xia, Z. Zhao, Y. Jiang, Y. Yin, and P. S. Yu, "Transferring From Textual Entailment to Biomedical Named Entity Recognition," IEEE/ACM Trans Comput Biol Bioinform, vol. 20, no. 4, pp. 2577–2586, Jul. 2023, doi: 10.1109/TCBB.2023.3236477.
[2] J. Li et al., "A comparative study of pre-trained language models for named entity recognition in clinical trial eligibility criteria from multiple corpora," BMC Med Inform Decis Mak, vol. 22, no. 3, pp. 1–10, Sep. 2022, doi: 10.1186/s12911-022-01967-7.
[3] X. Liu, G. L. Hersch, I. Khalil, and M. Devarakonda, "Clinical Trial Information Extraction with BERT," in Proc. 2021 IEEE 9th International Conference on Healthcare Informatics (ICHI 2021), pp. 505–506, Sep. 2021, doi: 10.1109/ICHI52183.2021.00092.
[4] "SympTEMIST Corpus: Gold Standard annotations for clinical symptoms, signs and findings information extraction," doi: 10.5281/zenodo.8413866.
[5] T. B. Brown et al., "Language Models are Few-Shot Learners," Adv Neural Inf Process Syst, 2020. [Online]. Available: https://arxiv.org/abs/2005.14165v4
[6] J. Devlin, M. W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding," in Proc. NAACL-HLT 2019, vol. 1, pp. 4171–4186, 2019. [Online]. Available: https://arxiv.org/abs/1810.04805v2
[7] G. Wang, G. Yang, Z. Du, L. Fan, and X. Li, "ClinicalGPT: Large Language Models Finetuned with Diverse Medical Data and Comprehensive Evaluation," Jun. 2023. [Online]. Available: https://arxiv.org/abs/2306.09968v1
[8] R. Luo et al., "BioGPT: Generative Pre-trained Transformer for Biomedical Text Generation and Mining," Brief Bioinform, vol. 23, no. 6, Oct. 2022, doi: 10.1093/bib/bbac409.
[9] Y. Gu et al., "Domain-Specific Language Model Pretraining for Biomedical Natural Language Processing," ACM Trans Comput Healthc, vol. 3, no. 1, p. 24, Jul. 2020, doi: 10.1145/3458754.
[10] Y. Labrak et al., "DrBERT: A Robust Pre-trained Model in French for Biomedical and Clinical domains," pp. 16207–16221, Apr. 2023, doi: 10.18653/v1/2023.acl-long.896.
[11] X. Luo, P. Gandhi, S. Storey, and K. Huang, "A Deep Language Model for Symptom Extraction from Clinical Text and Its Application to Extract COVID-19 symptoms from Social Media," IEEE J Biomed Health Inform, vol. 26, no. 4, p. 1737, Apr. 2022, doi: 10.1109/JBHI.2021.3123192.
[12] A. Vaswani et al., "Attention is All you Need," Adv Neural Inf Process Syst, vol. 30, 2017.
[13] S. Azizi, D. B. Hier, and D. C. Wunsch, "Enhanced neurologic concept recognition using a named entity recognition model based on transformers," Front Digit Health, vol. 4, Dec. 2022, doi: 10.3389/fdgth.2022.1065581.
[14] Y. Liu et al., "RoBERTa: A Robustly Optimized BERT Pretraining Approach," Jul. 2019. [Online]. Available: https://arxiv.org/abs/1907.11692v1
[15] P. He, X. Liu, J. Gao, and W. Chen, "DeBERTa: Decoding-enhanced BERT with Disentangled Attention," in Proc. ICLR 2021, Jun. 2020. [Online]. Available: https://arxiv.org/abs/2006.03654v6
[16] M. P. Lewis and Summer Institute of Linguistics, Ethnologue: Languages of the World, 2009. [Online]. Available: https://books.google.com/books/about/Ethnologue.html?id=FVVFPgAACAAJ
[17] G. G. Subies, Á. B. Jiménez, and P. M. Fernández, "A Survey of Spanish Clinical Language Models," Aug. 2023. [Online]. Available: https://arxiv.org/abs/2308.02199v1
[18] J. Gonzalo, M. Montes-y-Gómez, and F. Rangel, "Overview of IberLEF 2022: Natural Language Processing Challenges for Spanish and other Iberian Languages," 2022. [Online]. Available: https://nlp.uned.es/
[19] L. Akhtyamova, P. Martínez, K. Verspoor, and J. Cardiff, "Testing contextualized word embeddings to improve NER in Spanish clinical case narratives," IEEE Access, vol. 8, pp. 164717–164726, 2020, doi: 10.1109/ACCESS.2020.3018688.
[20] A. Conneau et al., "Unsupervised Cross-lingual Representation Learning at Scale," in Proc. ACL 2020, pp. 8440–8451, doi: 10.18653/v1/2020.acl-main.747.
[21] C. P. Carrino et al., "Pretrained Biomedical Language Models for Clinical NLP in Spanish," in Proc. BioNLP 2022, pp. 193–199, 2022, doi: 10.18653/v1/2022.bionlp-1.19.
[22] P. Lewis, M. Ott, J. Du, and V. Stoyanov, "Pretrained Language Models for Biomedical and Clinical Tasks: Understanding and Extending the State-of-the-Art," pp. 146–157, Nov. 2020, doi: 10.18653/v1/2020.clinicalnlp-1.17.
[23] L. Wang et al., "Text Embeddings by Weakly-Supervised Contrastive Pre-training," Dec. 2022. [Online]. Available: https://arxiv.org/abs/2212.03533v1
[24] R. Atallah and A. Al-Mousa, "Heart Disease Detection Using Machine Learning Majority Voting Ensemble Method," in Proc. 2nd International Conference on New Trends in Computing Sciences (ICTCS 2019), Oct. 2019, doi: 10.1109/ICTCS.2019.8923053.
[25] G. Wenzek et al., "CCNet: Extracting High Quality Monolingual Datasets from Web Crawl Data," in Proc. LREC 2020, pp. 4003–4012. [Online]. Available: https://arxiv.org/abs/1911.00359v2
[26] A. Dogan and D. Birant, "A Weighted Majority Voting Ensemble Approach for Classification," in Proc. 4th International Conference on Computer Science and Engineering (UBMK 2019), pp. 366–371, Sep. 2019, doi: 10.1109/UBMK.2019.8907028.