The document describes using data programming and weak supervision to extract PICO information from medical documents without labeled training data. It involves developing labeling functions from sources like UMLS, ontologies, dictionaries and regular expressions to label text as participant, intervention, comparator or outcome. A label matrix is generated and labels are combined using majority voting or a label model. Experiments on the EBM-PICO dataset show this approach can outperform fully supervised methods and precision can be improved over recall-focused techniques. Future work includes enhancing labeling sources and functions.
1. Data Programming for PICO
information extraction in
absence of labelled data
Presenter – Anjani Dhrangadhariya
Group Meeting: 05.04.23
2. PICO Information
…A semi-structured interview was used to obtain qualitative information on the effect of the daily aerobics intervention vs. conventional exercise. The convenience sample included 15 adult Oncology outpatients, 13 female and 2 male, ranging in age from 20 to 87. Quality of life was measured using SF-36 QOLS…
Participant: 15 adult Oncology outpatients, 13 female and 2 male, ranging in age from 20 to 87.
Intervention: daily aerobics intervention
Comparator: conventional exercise
Outcome: Quality of life was measured using SF-36 QOLS…
Coarse-grained
information
Clinical trial (study)
3. Significance
Prescribe treatment
Diagnosis decisions
Government’s health policies
Health economic evaluation
…
P, I, C, O
1,139 expert hours (avg.)
A quarter of a million dollars
4. PICO Information
P
• Sample size: 15
• Age: Adult, age from 20 to 87
• Condition: Oncology outpatients
• Gender: 13 female and 2 male
I
• Intervention name: aerobics intervention
• Intervention frequency: daily
C
• Control: conventional exercise
O
• Outcome: Quality of life
• Outcome method / Measurement scale: SF-36 QOLS
Spans ↔ Entities
Fine-grained
information
5. PICO extraction: Automation
• Challenging – fuzzy spans and entities
1. Nested
2. Overlapping
3. Highly contextual
• Low resource
• EBM-PICO (Nye et al., 2018)
• Ever-extensible subunits
• Annotation task – tough to define
• Annotation personnel – tough to train
P
• Sample size → Overall sample size, Subgroup sample size
• Age
• Condition → Disease, Disorder, Signs and symptoms
• Gender
• Ethnicity
• Social status → Education
• …
7. PICO: Manual annotation
1. Errors in the annotated corpus
2. Extract new classes
3. Re-define existing classes with new labels
Re-annotation?
8. Weak Supervision via Data programming
• Data-programming-based weak supervision uses “programmatic labelling” – it relies on programmatic labelling sources to obtain training data.
• Programmatic labelling is quick and allows efficient modification of the training-data labels when
• the downstream application changes
• errors need correcting
• more entities are added
9. Programmatic labelling
• Uses a set of labelling sources* S = {s₁, s₂, …, sₙ} and a set of weak labelling functions* Λ = {λ₁, λ₂, …, λₙ} that sample from the unlabelled data and label a subset of it.
*Labelling functions (LF): pattern matching (.*), Boolean search, DB lookup, heuristics, legacy systems, third-party models, crowd-sourced labels
*Labelling sources (LS): ontologies, regular expressions, linguistic grammar, dictionaries, manual annotation, terminologies from KBs
10. Labelling functions
def labelling_function_1(tokens, dictionary):
    return [1 if t in dictionary else 0 for t in tokens]
def labelling_function_2(tokens, pattern):
    return [1 if pattern.match(t) else 0 for t in tokens]
def labelling_function_3(tokens, pos_tags, allowed_pos, pattern):
    return [1 if pos in allowed_pos and pattern.match(t) else 0 for t, pos in zip(tokens, pos_tags)]
def labelling_function_n(tokens, labelling_source):
    return …
Each labelling function maps the n text tokens to token labels, drawing on one labelling source (dictionary, pattern, POS tags, …).
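As a concrete sketch, labelling functions like the ones above can be wired together to produce a label matrix. The dictionary, regular expression, and tokens below are illustrative only, not the sources used in the actual system:

```python
import re

# Hypothetical labelling sources (illustrative only)
participant_terms = {"outpatients", "adult", "patients"}
sample_size_re = re.compile(r"^\d+$")

def lf_dictionary(tokens):
    # Emit 1 for tokens found in the dictionary, 0 (abstain) otherwise
    return [1 if t.lower() in participant_terms else 0 for t in tokens]

def lf_regex(tokens):
    # Emit 1 for bare numbers (candidate sample-size mentions)
    return [1 if sample_size_re.match(t) else 0 for t in tokens]

tokens = ["15", "adult", "Oncology", "outpatients"]
# Label matrix: one row per token, one column per labelling function
label_matrix = [[lf(tokens)[i] for lf in (lf_dictionary, lf_regex)]
                for i in range(len(tokens))]
print(label_matrix)  # [[0, 1], [1, 0], [0, 0], [1, 0]]
```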
12. Labelling functions: aggregation
• Aggregate labels from multiple labelling functions 𝜆1, 𝜆2, … , 𝜆𝑛 to
obtain a consensus label for your tokens.
Pipeline: data programming → weakly-supervised transformer → predictions
13. Data Programming: Applications
Biomedical text and image classification
Biomedical entity recognition
✘Clinical information extraction, especially PICO
• Highly compositional
• Difficult to define
• Fuzzy boundaries
• Meng, Yu, et al. "Weakly-supervised neural text classification." Proceedings of the 27th ACM International Conference on Information and Knowledge Management. 2018.
• Wang, Yanshan, et al. "A clinical text classification paradigm using weak supervision and deep representation." BMC medical informatics and decision making 19.1 (2019): 1-13.
• Mintz, Mike, et al. "Distant supervision for relation extraction without labeled data." Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th
International Joint Conference on Natural Language Processing of the AFNLP. 2009.
Example entity types: Gene, Protein, Drug, Chemical, Disease. “IGF1” can denote both the gene and the protein, which is where nomenclature and standardizations come in.
14. Objective
• Adapting data programming and weak supervision to extract PICO
information.
• Use as little automatically labelled training data as possible.
• Use freely available resources.
15. Method: Dataset
• EBM-PICO
• 5000 abstracts
• Training and test set
• Coarse-grained PICO (spans or sentences)
• Fine-grained PICO (entities or phrases/words)
16. Method: Dataset
• EBM-PICO training set
• Error rectification on 2,960 (~1%) of 1,303,169 training tokens
• EBM-PICO test set
1. Error rectification
2. Re-annotation
17. Method: Dataset preprocessing
• Comes pre-tokenized
• No preprocessing was applied to the EBM-PICO dataset
• Enrichment
• POS tags
• Token Lemma
19. Task: binary labelling
• Input sequence: X = (x₁, x₂, …, xₙ)
• Output sequence: Y = (y₁, y₂, …, yₙ), where yᵢ ∈ {1, 0}
• Develop m LFs to label the n text tokens as
1. P vs. OOS (1 vs. 0)
2. I vs. OOS (1 vs. 0)
3. O vs. OOS (1 vs. 0)
• Without ground truth Y, we estimate Ŷ by aggregating several labelling functions.
OOS = out-of-span tokens
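For illustration, the binary P-vs-OOS encoding over a short (hypothetical) token sequence looks like this:

```python
# Hypothetical token sequence and its binary P vs. OOS labels
X = ["The", "sample", "included", "15", "adult", "Oncology", "outpatients"]
# y_i = 1 for tokens inside a Participant span, 0 for out-of-span (OOS)
Y_P = [0, 0, 0, 1, 1, 1, 1]
assert len(X) == len(Y_P)  # one label per token
```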
20. Method I: Labelling sources
• Labelling sources to label P, I (C) and O
• UMLS
• Non-UMLS ontologies (Table 2)
• Distant supervision dictionaries – clinicaltrials.gov
• Hand-crafted dictionaries
• Regular expressions
• Heuristics
21. Method II: Source preprocessing
• UMLS – 223 vocabularies
• Non-English vocabularies removed
• Zoonotic vocabularies removed
• Vocabularies with fewer than 500 terms removed
• Smart lowercasing to preserve abbreviations
• Removal of
• numbers,
• punctuation, and
• trailing spaces
• Any term shorter than 3 characters removed
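A minimal sketch of the smart lowercasing and term-filtering steps; the all-caps heuristic and length thresholds here are assumptions for illustration, not the exact rules used:

```python
def smart_lower(term):
    # Lowercase ordinary words; keep short all-caps tokens (likely
    # abbreviations such as "QOLS" or "SF-36") untouched.
    return " ".join(w if (w.isupper() and len(w) <= 5) else w.lower()
                    for w in term.split())

def keep_term(term):
    # Drop purely numeric terms and terms shorter than 3 characters.
    t = term.strip()
    return len(t) >= 3 and not t.isdigit()

print(smart_lower("Quality of Life Scale SF-36"))  # quality of life scale SF-36
print(keep_term("QT"))                             # False
```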
22. Method III: Source to target mapping
• Sources S: UMLS ontology, non-UMLS ontologies, distant supervision, ReGeX, heuristics, and dictionaries
• UMLS vocabularies (Vocabulary 1 … Vocabulary n) contribute concepts (Concept 1 … Concept n), each carrying a semantic type, e.g. Disease, Age group, Pharmaceutical Drug, Sign and Symptoms, …, Mental dysfunction
• Task-specific rules map the semantic types to the targets T: P, I, O
https://bioportal.bioontology.org/
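One way to realise the semantic-type-to-target mapping is a simple lookup table. The entries below are only the example types visible on this slide, not the full rule set:

```python
# Illustrative semantic type -> PICO target lookup (incomplete by design)
SEMTYPE_TO_PICO = {
    "Disease": "P",
    "Age group": "P",
    "Sign and Symptoms": "P",
    "Mental dysfunction": "P",
    "Pharmaceutical Drug": "I",
}

def target_for(semantic_type):
    # Returns "P", "I", or "O" for mapped types, None otherwise
    return SEMTYPE_TO_PICO.get(semantic_type)

print(target_for("Pharmaceutical Drug"))  # I
```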
23. Method IV: Labelling functions
The m labelling functions λ₁, …, λₘ label the n text tokens. Each LF type emits labels from a restricted set:
• Terminology LF (dictionary sources): {−1, 0, +1}
• ReGeX LF (ReGeX sources): {0, +1}
• Heuristic LF (heuristics): {0, +1}
• Stop Words LF (negative LF): {0, −1}
−1 = negative class label, +1 = positive class label, 0 = abstain
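A negative labelling function of the Stop Words kind could look like the following sketch (the stop-word list is an illustrative subset):

```python
STOP_WORDS = {"the", "of", "and", "was", "using"}  # illustrative subset

def stop_words_lf(tokens):
    # Negative LF: emit -1 (negative class) on stop words, 0 (abstain) otherwise
    return [-1 if t.lower() in STOP_WORDS else 0 for t in tokens]

print(stop_words_lf(["Quality", "of", "life"]))  # [0, -1, 0]
```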
25. Method VI: Combine noisy labels
1. Majority vote (MV): choose the label chosen by the majority of LFs.
Ŷ_MV = argmax over y ∈ {0, 1} of Σᵢ₌₁..ₘ 1[λᵢ = y]
In case of ties or abstains, choose label 0
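The majority-vote rule, including the tie/abstain fallback to label 0, can be sketched as follows (votes and matrix are illustrative):

```python
def majority_vote(row):
    # row holds one token's votes from the m LFs:
    # +1 = positive, -1 = negative, 0 = abstain
    pos = sum(1 for v in row if v == +1)
    neg = sum(1 for v in row if v == -1)
    if pos > neg:
        return 1
    return 0  # ties, all-abstain rows, and negative majorities -> label 0

label_matrix = [[0, 0, 0, 0, 0, 0],
                [1, 1, 0, 1, 1, 1],
                [1, -1, 1, 1, -1, 1]]
print([majority_vote(r) for r in label_matrix])  # [0, 1, 1]
```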
26. Method VI: Combine noisy labels
Example label matrix Λ for n tokens (rows 1–6) and m = 6 labelling functions LF1–LF6:
0 0 0 0 0 0
0 0 0 0 0 1
0 0 0 0 0 0
1 1 0 1 1 1
1 0 1 1 1 1
1 1 1 1 0 1
• Label model (LM): a generative model over the label matrix Λ
• The LM emits probabilistic labels for the words, which are then used to train a weakly-supervised discriminative model that predicts
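The label model fits a generative model over Λ (e.g. with Snorkel). As a rough intuition only, the toy code below reweights each LF by how often it agrees with the others and then takes a weighted vote; this is a simplification, not the actual likelihood-based training:

```python
def agreement_weights(matrix, m):
    # Estimate each LF's reliability from its agreement with the other
    # LFs on tokens where both voted (non-zero). A crude stand-in for
    # the accuracies the generative label model learns.
    weights = []
    for j in range(m):
        agree = total = 0
        for row in matrix:
            for k in range(m):
                if k != j and row[j] != 0 and row[k] != 0:
                    total += 1
                    agree += row[j] == row[k]
        weights.append(agree / total if total else 0.5)
    return weights

def weighted_vote(row, weights):
    # Accuracy-weighted consensus: positive score -> label 1, else 0
    score = sum(w * v for v, w in zip(row, weights) if v != 0)
    return 1 if score > 0 else 0

matrix = [[1, 1, -1, 1], [1, 1, 1, 1], [-1, 1, -1, -1]]
w = agreement_weights(matrix, 4)
labels = [weighted_vote(r, w) for r in matrix]
print(labels)  # [1, 1, 0]
```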
31. Results 1: Error rectification
• 2,960 tokens (~1% of training tokens) analysed for errors
Class – Total errors
• Participant – 23.39%
• Intervention – 18.30%
• Outcome – 20.21%
32. Results 1: Error rectification
• Error correction on the EBM-PICO gold test set
• Cohen's κ_new > Cohen's κ
Cohen's κ = agreement between different annotator pairs for the EBM-PICO test set
Cohen's κ_new = agreement between the original EBM-PICO test set and the error-rectified EBM-PICO test set
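Cohen's κ compares observed agreement with chance agreement, κ = (p_o − p_e) / (1 − p_e). A small self-contained computation over illustrative labels:

```python
from collections import Counter

def cohens_kappa(a, b):
    # a, b: label sequences from two annotation rounds over the same tokens
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n                  # observed agreement
    ca, cb = Counter(a), Counter(b)
    p_e = sum(ca[l] * cb[l] for l in set(a) | set(b)) / (n * n)  # chance agreement
    return (p_o - p_e) / (1 - p_e)

print(cohens_kappa([1, 0, 1, 1], [1, 0, 0, 1]))  # 0.5
```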
39. Results 4: Precision vs. Recall
• LM consistently improves performance in comparison to recall-
oriented MV
40. Conclusion
• Weak supervision via data programming could be adapted to PICO
information extraction.
• Utilize freely-available sources
• Training set ~5000 documents
• It can outperform full supervision but requires careful selection and
design of labelling sources and functions.
• The label model (LM) is better than majority voting.
• Pretrained transformers bring performance improvement, but not
always.
41. Future work
• Gathering more labelling sources for the PICO classes
• The existing sources were not representative
• Incorporating class weights into the labelling functions
• LF × n (the label model removes any redundant labelling functions)
42. References
1. Dhrangadhariya, Anjani, and Henning Müller. "Not so weak PICO: leveraging weak supervision for participants,
interventions, and outcomes recognition for systematic review automation." JAMIA open 6.1 (2023): ooac107.
2. Nye B, Jessy Li J, Patel R, et al. A Corpus with Multi-Level Annotations of Patients, Interventions and Outcomes to Support
Language Processing for Medical Literature. Proc Conf Assoc Comput Linguist Meet. 2018;2018:197-207.
3. Lee, Grace E., and Aixin Sun. "A study on agreement in PICO span annotations." Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval. 2019.
4. Abaho, Micheal, et al. "Correcting crowdsourced annotations to improve detection of outcome types in evidence-based
medicine." CEUR Workshop Proceedings. Vol. 2429. 2019.
5. Ratner, Alexander, et al. "Snorkel: Rapid training data creation with weak supervision." Proceedings of the VLDB
Endowment. International Conference on Very Large Data Bases. Vol. 11. No. 3. NIH Public Access, 2017.
6. Fries, Jason A., et al. "Ontology-driven weak supervision for clinical entity classification in electronic health
records." Nature communications 12.1 (2021): 1-11.
Majority voting chooses the label agreed by the majority of the labelling functions.
Note that majority voting does not take into account the variable accuracies of the individual labelling sources and functions; it treats them all as equal.
The label model, on the contrary, learns a generative model on top of the label matrix to estimate probabilistic labels for each data point.
It uses the agreements and disagreements between labelling functions on the same data points across the label matrix to estimate how accurate each LF is.
The labelling functions are then reweighted according to their accuracies and aggregated to predict a consensus label for the tokens.
These consensus labels are transformed into probabilistic labels, which can then be used to train a weakly supervised discriminative model.
Concretely, the label model estimates the probability distribution of the true labels given the observed noisy labels by maximising the likelihood of the label matrix; during this process it also learns the accuracies of the different labelling functions from their agreements and disagreements.
The model is optimised with stochastic gradient descent until the accuracies converge, after which the LFs are reweighted.
Cohen's κ_new between the original and the error-corrected annotations >> the average Cohen's κ between the test-set annotators.
- The label model outperforms majority voting in every experiment.
- The weakly supervised transformer model generally outperforms the label model, but degrades performance in specific experiments.
- The weakly supervised model does not bring much improvement for the Outcomes class.