2. Foundation Model vs. Task-Specific Models
• What are foundation models?
• (Extremely) vast (and diverse) training data
• Unlabelled data (mostly)
• Self-supervised learning (SSL) (usually)
• Often based on large language models (LLM, e.g. BERT, GPT)
• Fine-tuned for specific tasks after initial (emergent) learning
• Generative AI? (potentially, depends on task)
• Contrast with task-specific models
• Single model intended to perform specific task
• Relatively “fragile” – can’t usually be effectively repurposed to other tasks, breaks easily with different data sources
3. Task-specific Models
[Diagram: three separate task-specific models: a Chest X-ray Model (chest X-ray → Atelectasis, Effusion, Pneumonia, Fibrosis), an Abdominal CT Scan Model (abdominal CT scan → Ascites, Cyst, Tumour, Stomach Cancer), and a Retinal Image Model (retinal fundus photo → Diabetic Retinopathy, Age-related Macular Degeneration, Glaucoma), each also drawing on medical notes]
• Learns the relation between a single input/modality and one (or more) targets
• Task generally prospectively defined (inputs known to be correlated with labels)
• (Where we started)
4. Foundation AI Model (FAI)
[Diagram: a single Foundation Model takes all inputs (chest X-ray, abdominal CT scan, retinal fundus photo, medical notes, other inputs) and produces all predictions (Atelectasis, Effusion, Pneumonia, Fibrosis, Ascites, Cyst, Tumour, Stomach Cancer, Diabetic Retinopathy, Age-related Macular Degeneration, Glaucoma, plus other predictions, e.g. age, gender, Alzheimer’s risk, various cancer risks)]
• Model has an underlying “foundation” of “general knowledge”
• Initial “general/foundational” Self-Supervised Learning (SSL) is performed on a vast amount of image/textual data (possibly related)
• Main disadvantage of a Foundation AI model compared to task-specific models would be its computational requirements
5. Why Build A Foundation?
• When humans “know” something, they can:
• Check it against diverse pieces of knowledge (discriminative/deductive)
• Derive new concepts from diverse pieces of knowledge (generative, often inductive/probabilistic, sometimes called “creativity”)
• E.g. whether Ivory Coast and Mali are adjacent may never be explicitly stated in the dataset
• But from multiple statements of “crossing the border from Ivory Coast to Mali” (or vice-versa), adjacency can (in theory) be inferred
• Since there is often no easy way to figure out whether some piece of knowledge is useful, just learn as much as possible (reflected in the ever-increasing size of GPT/LLM models)!
6. First FAI Model for Ophthalmology
Source: “A foundation model for generalizable disease detection from retinal images”, Zhou et al., Nature 2023
7. RetFound FAI
• Main distinction from previous (multitask) models appears to be the initial large-scale self-supervised (foundation) learning
• 904,170 colour fundus photos, 736,442 OCT scans, all rescaled to 256x256
• A number of SSL models were tried; masked autoencoder (with ViT-large encoder) found to be the best
8. RetFound FAI vs. Supervised Learning
• RetFound was compared against a supervised learning (SL) model with the same transformer architecture, and other SSL pre-training combinations
• SL-ImageNet actually performs pretty closely to RetFound on internal validation
• Value of SSL appears more with external validation (but generally still not overwhelming)
9. Masked Autoencoder (Generative SSL)
[Diagram: FAI task adaptation (standard classification): a test CFP is passed through the trained ViT encoder; the encoder’s high-level features feed a multilayer perceptron that produces the prediction]
• RetFound uses a masked autoencoder, with the training objective being to reconstruct input images from a randomly-masked version of the image
• Once the autoencoder is trained, only the (ViT) encoder is used to generate the high-level features for task-specific classification
• Comparison against established SL DCNN models (trained directly on the [augmented] images, instead of high-level features) would have been interesting
• Main value of FAI here lies in producing good high-level features?
• Self-supervised, since no labels are required in training the autoencoder, only the (randomly masked) image itself
• Note that the task labels are actually still needed, to train the task-specific classifiers!
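The masking-and-reconstruction objective above can be sketched in a few lines. This is a minimal illustration, not RetFound’s implementation: the patch size, 75% mask ratio, and random stand-in image are assumptions, and the ViT encoder/decoder is omitted entirely so only the data flow of the SSL objective is shown.

```python
import numpy as np

rng = np.random.default_rng(0)

def mask_patches(image, patch=32, mask_ratio=0.75):
    """Split a square image into patches and randomly zero out most of them.

    The masked-autoencoder objective is to reconstruct the original image
    from only the visible patches; no labels are needed for this step.
    """
    h, w = image.shape
    ph, pw = h // patch, w // patch
    n = ph * pw
    masked_idx = rng.choice(n, size=int(n * mask_ratio), replace=False)
    out = image.copy()
    for idx in masked_idx:
        r, c = divmod(idx, pw)
        out[r * patch:(r + 1) * patch, c * patch:(c + 1) * patch] = 0.0
    return out, masked_idx

def reconstruction_loss(pred, target, masked_idx, patch=32):
    """MSE computed only over the masked patches (as in MAE)."""
    pw = target.shape[1] // patch
    errs = []
    for idx in masked_idx:
        r, c = divmod(idx, pw)
        sl = (slice(r * patch, (r + 1) * patch), slice(c * patch, (c + 1) * patch))
        errs.append(np.mean((pred[sl] - target[sl]) ** 2))
    return float(np.mean(errs))

img = rng.random((256, 256))   # stand-in for a 256x256 fundus photo
masked, idx = mask_patches(img)
print(len(idx))                # 48 of the 64 patches are masked at 75%
```

In the real pipeline, the loss would be back-propagated through a ViT decoder/encoder; afterwards only the trained encoder is kept to produce the high-level features for downstream classifiers.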
10. ELIXR – X-ray LLM+Vision FAI
• Embeddings for Language/Image-aligned X-Rays (thus ELIXR)
• Language-aligned image encoder+Fixed LLM (PaLM 2)
• ELIXR-C is first trained using Contrastive Language-Image Pre-training (CLIP)
• This aligns a vision-only SupCon image encoder with a T5 language encoder (i.e. learns to bring representations of an image and its associated text closer, in a shared high-dimensional space)
Source: “ELIXR: Towards a general purpose X-ray artificial intelligence system through alignment of large language models and radiology vision encoders”, Xu et al., arXiv 2023
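The CLIP-style alignment above can be sketched as a symmetric contrastive (InfoNCE) loss over a batch of paired image/text embeddings. Everything below is illustrative, not ELIXR’s actual code: the embeddings are random toys, and the temperature value is just the common CLIP default.

```python
import numpy as np

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss used in CLIP-style pre-training.

    Matched image/text pairs (row i of each matrix) are pulled together
    in the shared embedding space; mismatched pairs are pushed apart.
    """
    # L2-normalise so similarity is the cosine between embeddings
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature          # (batch, batch) similarities
    labels = np.arange(len(logits))             # pair i matches pair i

    def cross_entropy(lg):
        lg = lg - lg.max(axis=1, keepdims=True)  # numerical stability
        log_probs = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -log_probs[labels, labels].mean()

    # average over image->text and text->image directions
    return (cross_entropy(logits) + cross_entropy(logits.T)) / 2

rng = np.random.default_rng(0)
emb = rng.random((4, 8))
aligned = clip_contrastive_loss(emb, emb)        # perfectly matched pairs
shuffled = clip_contrastive_loss(emb, emb[::-1])  # deliberately mismatched
print(aligned < shuffled)                        # matched pairs score lower loss
```

The same loss structure underlies the ITC component of the Q-Former training in ELIXR-B’s Phase 1.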
11. ELIXR – X-ray LLM+Vision FAI
• ELIXR-B then uses the trained ELIXR-C image encoder and a fixed PaLM2-S LLM; only the adapter between the image encoder and the LLM is trained, with an attention mechanism
• Phase 1: A vision-language model (Q-Former) is trained to understand and represent both images and text reports in a shared embedding space, by:
• Image-text contrastive learning (ITC)
• Image-grounded text generation (ITG)
• Image-text matching (ITM)
• Phase 2: The Q-Former + extra MLP to the LLM is then trained to generate the impressions section of the text report, from the image embeddings (as image-based LLM token inputs)
12. Hugging Face IDEFICS
• IDEFICS is adapted from the Flamingo architecture, which combines two frozen models: LLaMA (text, main backbone) and OpenClip (vision)
• Major contribution is the preparation of a (very large) OBELICS multimodal (text & image) web dataset
Sources: “OBELICS: An Open Web-Scale Filtered Dataset of Interleaved Image-Text Documents”, Laurencon et al., arXiv 2023;
“Flamingo: a Visual Language Model for Few-Shot Learning”, Alayrac et al. arXiv 2022;
https://huggingface.co/docs/transformers/model_doc/idefics
13. General FAI for Medicine (MedFAI)
• Obvious extension: from “ophthalmology FAI” to “medical FAI”
• Several desirable attributes going “beyond human physicians”:
• Holistic
Modern medicine is necessarily fragmented into specialties (too much for any single physician to know/learn)
Specialty boundaries are difficult to cross (e.g. eye [images] as a window to heart/brain/cardiovascular health, etc.)
• Comprehensive
Can in theory query implications of Variable(s) A → Condition/Outcome B, for
any A and B, with evidence-based justification
• Predictive
Physicians generally can only diagnose current conditions, not predict future ones
14. Holistic MedFAI
• Previously, AI models in medicine have generally been designed to replicate existing capabilities, or at least with prospectively-defined tasks
• For example, it is known that retinal fundus photographs can be used
to diagnose diabetic retinopathy (DR)
• So we plan to train an AI model to classify DR from retinal photos
• Then just a matter of collecting sufficient labelled data
(both for model development and [external] validation)
• Often encounter delays with data acquisition, model robustness
(if insufficient data)
15. Holistic MedFAI
• For a general foundation model, the idea is instead to
(retrospectively) throw in all available (reasonably valid) data
• Then, gaps, missing labels and (minor) inaccuracies in data can be
addressed by the vast foundational base of knowledge (possibly from
other specialities, or even outside medicine proper)
• Might expect general knowledge (e.g. “Is this a retinal photograph?”,
“Is this a blurred photograph?”) to be answerable by an FAI with
minimal/no specific training
16. Comprehensive MedFAI
• For task-specific AI models, a single model is trained to perform a single, narrowly-scoped task (relate one set of inputs to one output)
• For multitask AI models, the single model can perform multiple such tasks, but typically the tasks are still all predefined during development
[Diagram: a task-specific Retinal Image Model maps a retinal fundus photo to Diabetic Retinopathy alone; a multitask Retinal Image Model maps RFP and OCT images to Diabetic Retinopathy, Age-related Macular Degeneration, Glaucoma, Heart Attack, Stroke and Parkinson’s (from the Ophthalmology FAI)]
17. Comprehensive MedFAI
• For a (comprehensive) MedFAI, there are multiple (very many)
possible inputs (images/medical notes/clinical variables), and also
multiple (very many) possible outputs (conditions/diseases/etc.)
• Consider a very conservative model of 100 inputs (with one set of
clinical variables as just one input) and 100 outputs: there are already
10,000 combinations (of course, some more important than others)!
• Then note that the usual task-specific AI (or major journal paper)
covers just one (or a few) of these combinations
• MedFAI Application: in theory, the FAI can systematically go through all possible combinations, and flag (discover) promising novel correlations for further investigation if necessary
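A minimal sketch of this combinatorial screening, on synthetic data: Pearson correlation stands in for whatever association measure an FAI would actually use, and every variable name below is invented. One input/output association is deliberately planted so the screen has something to find.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(42)

# Toy stand-in data: rows are patients. In a real MedFAI screen the inputs
# would be embeddings/clinical variables and the outputs condition labels.
n_patients = 500
inputs = {f"input_{i}": rng.normal(size=n_patients) for i in range(5)}
outputs = {f"outcome_{j}": rng.normal(size=n_patients) for j in range(4)}
# Plant one real association for the screen to discover
outputs["outcome_0"] = inputs["input_2"] * 0.8 + rng.normal(scale=0.5, size=n_patients)

def screen_pairs(inputs, outputs, threshold=0.3):
    """Exhaustively score every input/output pair; flag strong associations.

    Each flagged pair would then need proper follow-up study - the screen
    only surfaces candidates, it does not establish causation.
    """
    flagged = []
    for x_name, y_name in product(inputs, outputs):
        r = np.corrcoef(inputs[x_name], outputs[y_name])[0, 1]
        if abs(r) >= threshold:
            flagged.append((x_name, y_name, round(float(r), 2)))
    return flagged

hits = screen_pairs(inputs, outputs)
print(hits)  # the planted input_2 -> outcome_0 association is flagged
```

With 100 inputs and 100 outputs the same loop simply runs over 10,000 pairs; the point is that the exhaustive sweep is mechanical once the data and association measure are in place.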
18. Comprehensive MedFAI – Sparsity
• MedFAI Application: from the available (limited) patient data, is it
possible to diagnose for the condition(s) of interest with (reasonable)
accuracy?
• In practice, patient data is limited (tests are
inconvenient/expensive/invasive/painful)
• Thus, physicians do not have complete data
• Often, it is not even known whether the available data is relevant to the condition of interest
19. Comprehensive MedFAI – Test Optimization
• MedFAI Application: if the available patient data is insufficient, what
data (i.e. medical tests) would be needed, to diagnose the condition
to the desired level of accuracy?
• FAI should in theory be able to present various plausible medical test
options, with different advantages/disadvantages (availability,
accuracy, cost, comfort, reduced side-effects, etc.)
• Both patient needs and organizational/national needs can be taken
into account
• On the organizational side, tests can be planned/administered taking
into account utility vs. costs, with evidential backing
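One simple way to frame this test-planning problem is a greedy utility-versus-cost selection. The test menu, accuracy gains, and costs below are entirely hypothetical numbers for illustration; an FAI would estimate such gains from data, and gains would not really be additive.

```python
# Hypothetical test menu: (name, expected accuracy gain, cost).
tests = [
    ("fundus photo",  0.10, 20),
    ("OCT scan",      0.12, 80),
    ("blood panel",   0.08, 30),
    ("genetic panel", 0.15, 300),
]

def plan_tests(tests, base_accuracy, target, budget):
    """Greedily add the test with the best gain-per-cost ratio until the
    target accuracy is reached or the budget is exhausted."""
    chosen, acc, spent = [], base_accuracy, 0
    for name, gain, cost in sorted(tests, key=lambda t: t[1] / t[2], reverse=True):
        if acc >= target:
            break                      # desired accuracy already reached
        if spent + cost <= budget:     # skip tests that blow the budget
            chosen.append(name)
            acc += gain
            spent += cost
    return chosen, acc, spent

chosen, acc, spent = plan_tests(tests, base_accuracy=0.70, target=0.85, budget=150)
print(chosen, round(acc, 2), spent)   # cheap, high-yield tests chosen first
```

Patient-side preferences (comfort, invasiveness) or organizational constraints could be folded into the cost term, which is where the evidential backing mentioned above would matter.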
20. Comprehensive MedFAI – (Automatic) Imputation
• MedFAI Application: from the available (limited) patient data, is it
possible to impute (synthesize/predict) the rest of the patient data?
• In this case, what is predicted is not the ultimate (desired) outcome
itself, but the (unknown) input
• For example, if HbA1c is unknown for a patient, perhaps it might be
imputed to high accuracy given other data such as age, gender, blood
pressure, various imaging scans, etc. in an individualized manner
• The updated patient profile (with imputed data) might then improve
accuracy on the actual desired outcomes
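A minimal sketch of the HbA1c example: the cohort and the linear relationship between HbA1c, age, and blood pressure are purely synthetic assumptions, and a least-squares fit stands in for the far richer, individualized imputation an FAI would perform.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic cohort: age, systolic BP, and HbA1c (the column to impute).
# The coefficients below are invented for illustration only.
n = 200
age = rng.uniform(30, 80, n)
sbp = rng.uniform(100, 160, n)
hba1c = 3.0 + 0.04 * age + 0.01 * sbp + rng.normal(scale=0.2, size=n)

missing = rng.random(n) < 0.2          # ~20% of patients lack an HbA1c result
observed = ~missing

def impute_linear(X_obs, y_obs, X_missing):
    """Least-squares imputation of a missing variable from observed ones."""
    A = np.column_stack([np.ones(len(X_obs)), X_obs])
    coef, *_ = np.linalg.lstsq(A, y_obs, rcond=None)
    A_miss = np.column_stack([np.ones(len(X_missing)), X_missing])
    return A_miss @ coef

X = np.column_stack([age, sbp])
pred = impute_linear(X[observed], hba1c[observed], X[missing])
mae = np.mean(np.abs(pred - hba1c[missing]))
print(round(mae, 3))  # imputation error sits near the injected noise level
```

The imputed values would then be fed back into the patient profile, which is exactly the “updated profile” step described above.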
21. Predictive MedFAI
• Existing AI models largely try to replicate existing physician/grader
abilities/workflow, i.e. diagnose an existing condition
• However, for an (F)AI model, future projection vs. current diagnosis is
“just another task”, that may be possible with sufficient data/labels
• Future prediction generally has lower accuracy than current
diagnosis, which is expected to an extent since patient
agency/external circumstances are not fixed in the intervening period
• Therefore, the potential for performance improvement via additional
data (and FAI) may be relatively greater, than for diagnosis tasks
• Especially relevant for mass preventive programs/interventions
22. Local Advantages Towards MedFAI Application
• Patient data is (relatively):
• Centralized (only a few integrated healthcare clusters)
• Comprehensive (developed public health system)
• Digitized (readily available for MedFAI development)
• Diverse (multiple ethnicities)
• Unbiased (high and broad coverage)
• Available computing resources
• Chroma @ Alice @ SGH Campus
• Note that current projects are relatively “deep”, i.e. still prospectively define a
small set of inputs, towards some output (with specific engineering)
• FAI would in contrast be relatively “broad”, i.e. from all available inputs,
discover connections with all available outputs
23. Towards Rapid MedFAI Development
• Data acquisition has often been the major factor delaying past
medical projects (models etc. often standard)
• FAI has no strict data requirements
(can start working with what is available/easily obtained)
• Initial prototype can go forward without complete coverage of all specialties, with more incrementally added when available
• Linking (anonymized) patient data from various sources would
probably be the major issue (when validating)