1. Hidden Markov Models for Detecting
Changes in Health Outcomes and
Comparing Groups of Subjects
ZHANG ZEYANG
SEPTEMBER 2015
2. Acknowledgement
Final project for the degree
MSc. Computational Statistics and Machine Learning at UCL.
Many thanks to my supervisors:
Dr. David Barber (UCL)
Dr. Steven Barrett (GSK) and Dr. Maria Costa (GSK)
3. Background: COPD
Chronic Obstructive Pulmonary Disease (COPD) can be summarized as:
• A collection of lung diseases, including chronic bronchitis and emphysema.
• A long-term condition that causes inflammation in the lungs, damaged lung tissue and
a narrowing of the airways, making breathing difficult.
• A life-threatening respiratory disease that is commonly seen both in the UK and
worldwide.
4. Background: COPD Exacerbations
An (acute) exacerbation of COPD is characterized as:
• Worsening of COPD symptoms (dyspnea, cough, and/or sputum) beyond day-to-day
variations that usually last for a few days.
• Lack of a standardized, consistent and commonly accepted definition.
• Studies of the efficacy of new COPD therapies have been hampered by the
difficulty in identifying and quantifying exacerbations.
• New approaches are being sought to better recognize and understand exacerbations.
5. Introduction
A patient-reported instrument has been employed to monitor the health of COPD
patients. The participating patients are divided into two treatment groups
(Drug A and Drug B), and all are required to answer 14 questions in an electronic
survey, reflecting their
• Chest symptoms
• Cough and sputum symptoms
• Breathlessness symptoms
• General well-being
on a daily basis during the clinical trials (around 6 months). For each question, a patient
has to assign a score where a higher score indicates a more severe symptom.
6. Dataset
The sum of the 14 scores on each study day forms a time series for each patient.
The clinical exacerbations of each patient over the same period are also
recorded, together with other individual information such as the number of historical
exacerbations and the treatment group.
7. Objectives
In a nutshell, this project aims to
1. Construct an accurate yet computationally efficient model for detecting COPD
exacerbations based on patients’ self-reported health scores.
2. Develop a systematic method to evaluate the model and benchmark the
detected results against the clinical exacerbations.
3. Based on the detected exacerbations, compare the treatment efficacies of
Drug A and Drug B at cohort level.
8. Hidden Markov Model (HMM)
The Hidden Markov Model (HMM),
which can be represented by a Directed
Acyclic Graph, is an unsupervised
machine learning model used in this
project to find exacerbations.
The HMM in our model consists of:
• Hidden variable h, which represents the
exacerbation status of each individual.
• Observed variable v, which represents the
reported health scores of each individual.
The most likely exacerbation status of an
individual can be inferred via the Viterbi
algorithm.
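As a minimal sketch of this inference step, the Python below runs the Viterbi algorithm in log space on a two-state chain (0 = no exacerbation, 1 = exacerbation). The Gaussian emission model, the state means, the standard deviation, the transition probabilities and the toy scores are all illustrative assumptions, not the values or the emission model used in the project.

```python
import math


def gaussian_logpdf(x, mean, sd):
    """Log density of a normal distribution (assumed emission model)."""
    return -0.5 * math.log(2 * math.pi * sd * sd) - (x - mean) ** 2 / (2 * sd * sd)


def viterbi(log_init, log_trans, log_emit):
    """Return the most likely hidden state path.

    log_init[i]     : log P(h_1 = i)
    log_trans[i][j] : log P(h_{t+1} = j | h_t = i)
    log_emit[t][i]  : log P(v_t | h_t = i)
    """
    n_states = len(log_init)
    # delta[i]: best log-probability of any path that ends in state i
    delta = [log_init[i] + log_emit[0][i] for i in range(n_states)]
    backpointers = []
    for t in range(1, len(log_emit)):
        prev, delta, ptr = delta, [], []
        for j in range(n_states):
            best_i = max(range(n_states), key=lambda i: prev[i] + log_trans[i][j])
            ptr.append(best_i)
            delta.append(prev[best_i] + log_trans[best_i][j] + log_emit[t][j])
        backpointers.append(ptr)
    # Trace the best path backwards from the best final state
    state = max(range(n_states), key=lambda i: delta[i])
    path = [state]
    for ptr in reversed(backpointers):
        state = ptr[state]
        path.append(state)
    path.reverse()
    return path


# Toy daily total scores for one hypothetical patient
scores = [10, 11, 9, 25, 27, 26, 10]
means, sd = [10.0, 25.0], 3.0            # assumed per-state score distributions
log_init = [math.log(0.9), math.log(0.1)]
log_trans = [[math.log(0.95), math.log(0.05)],   # assumed P(enter exacerbation) = 0.05
             [math.log(0.20), math.log(0.80)]]   # assumed P(stay in exacerbation) = 0.80
log_emit = [[gaussian_logpdf(s, m, sd) for m in means] for s in scores]

path = viterbi(log_init, log_trans, log_emit)
print(path)
```

Working in log space avoids numerical underflow on time series that span several months of daily scores.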
9. Evolution of Models & Results
By restructuring and manipulating the HMM, we can adapt the model to
accommodate various assumptions about exacerbations and generate more satisfactory results.
11. Evaluation: Precision, Recall Measures
An instrument from Information Retrieval is borrowed to evaluate the performance of the
HMM in detecting the exacerbations.
                      Clinical Exacerbation   Not Clinical Exacerbation   Total
Detected by HMM       a (true positives)      b (false positives)         a + b
Not Detected by HMM   c (false negatives)     d (true negatives)          c + d
Total                 a + c                   b + d                       a + b + c + d
Measured in days, we have
$$\text{Recall} = \frac{a}{a + c} = P(\text{detected by model} \mid \text{clinical exacerbation})$$
$$\text{Precision} = \frac{a}{a + b} = P(\text{clinical exacerbation} \mid \text{detected by model})$$
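A minimal sketch of the day-level counts, assuming each patient's detected and clinical exacerbation days are stored as equal-length 0/1 sequences. The function name and the toy sequences are hypothetical.

```python
def precision_recall(detected, clinical):
    """Day-level Recall and Precision from two equal-length 0/1 sequences."""
    if len(detected) != len(clinical):
        raise ValueError("sequences must cover the same study days")
    pairs = list(zip(detected, clinical))
    a = sum(1 for d, c in pairs if d and c)        # true positive days
    b = sum(1 for d, c in pairs if d and not c)    # false positive days
    c_miss = sum(1 for d, c in pairs if not d and c)  # false negative days
    # Undefined ratios (no positive days at all) are reported as NaN
    recall = a / (a + c_miss) if (a + c_miss) else float("nan")
    precision = a / (a + b) if (a + b) else float("nan")
    return recall, precision


# Toy example: 8 study days for one hypothetical patient
detected = [0, 1, 1, 1, 0, 0, 1, 0]
clinical = [0, 0, 1, 1, 1, 0, 0, 0]
recall, precision = precision_recall(detected, clinical)
print(recall, precision)
```

Here a = 2, b = 2 and c = 1, so Recall is 2/3 and Precision is 1/2.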
12. Evaluation: Composite Measure - F
Both high Recall and high Precision values are desired; however, it is easy to see that
there is generally a trade-off between Recall and Precision.
A sensitive model tends to over-identify exacerbations, giving a high Recall
but a low Precision, and vice versa.
Therefore, based on Recall and Precision values, a composite measure is created
$$\text{F-measure} = \frac{1}{\pi \frac{1}{R} + (1 - \pi) \frac{1}{P}},$$
to strike a balance between these two indices, where π represents the weight
assigned to Recall. A higher F-measure indicates better performance of the
model in detecting exacerbations.
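The weighted measure can be sketched in a few lines of Python. The function name is hypothetical, and returning 0 when either input is 0 is an assumed convention for the undefined case.

```python
def f_measure(recall, precision, pi):
    """Weighted harmonic mean of Recall and Precision; pi is the weight on Recall."""
    if not 0.0 <= pi <= 1.0:
        raise ValueError("pi must lie in [0, 1]")
    if recall == 0 or precision == 0:
        return 0.0  # assumed convention when a reciprocal is undefined
    return 1.0 / (pi / recall + (1.0 - pi) / precision)


# With equal weights the measure reduces to the usual F1 score, 2PR / (P + R)
f_balanced = f_measure(2 / 3, 0.5, pi=0.5)
# A larger pi rewards the higher Recall more strongly
f_recall_heavy = f_measure(2 / 3, 0.5, pi=0.8)
print(f_balanced, f_recall_heavy)
```

Setting π = 1 returns Recall alone and π = 0 returns Precision alone, so π moves the measure between the two indices.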
13. Evaluation: Parameters
When restructuring and designing the HMM, a series of
parameters were created.
The parameters govern the sensitivity of the model in
detecting the exacerbations and embody some basic
assumptions of an exacerbation.
For instance, α represents our belief about how likely a
patient is to enter an exacerbation from a
non-exacerbation day in general. A Recall-Precision Curve
can be plotted for various values of α.
14. Evaluation: Parameters Tuning
In this project, we believe that the clinically identified exacerbations are not the
‘universal truth’. A significant number of exacerbations could have been missed by the
clinicians, as patients may not always approach them in time when their symptoms
deteriorate.
Therefore, we are more interested in creating a model that can successfully identify
most clinically found exacerbations (high Recall), while being more tolerant of
over-detection of exacerbations (lower Precision).
We thus adjust our HMM to the ‘ideal sensitivity’ by setting the parameters to the
values that generate the highest F-measure for π=0.80 and π=0.85 respectively.
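A minimal sketch of this tuning step, assuming Recall and Precision have already been computed for each candidate value of α. The (α, Recall, Precision) triples below are made-up numbers for illustration only, not results from the project.

```python
def f_measure(recall, precision, pi):
    """Weighted harmonic mean of Recall and Precision; pi is the weight on Recall."""
    if recall == 0 or precision == 0:
        return 0.0  # assumed convention when a reciprocal is undefined
    return 1.0 / (pi / recall + (1.0 - pi) / precision)


def pick_best(candidates, pi):
    """Return the (alpha, recall, precision) triple with the highest F-measure."""
    return max(candidates, key=lambda c: f_measure(c[1], c[2], pi))


# Hypothetical points on a Recall-Precision curve: (alpha, Recall, Precision)
candidates = [
    (0.01, 0.55, 0.70),
    (0.05, 0.80, 0.50),
    (0.20, 0.95, 0.25),
]

best_080 = pick_best(candidates, pi=0.80)  # Recall weighted heavily
best_050 = pick_best(candidates, pi=0.50)  # Recall and Precision weighted equally
print(best_080, best_050)
```

On these toy numbers the preferred α shifts from 0.01 to 0.05 as π rises from 0.50 to 0.80, which illustrates how a larger π favours a more sensitive model.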
15. Evaluation: Comparison of Performances
Compared with an existing method (Method X), the performance of our HMM
seems encouraging.
16. Cohort Level Analysis: Regression
Based on the exacerbations detected by our HMM (at π=0.80 and π=0.85 respectively), we can
analyze the impacts different factors (treatment groups, historical exacerbations) have on the
exacerbation frequencies of the COPD patients.
It is natural to assume that a patient’s frequency of exacerbations follows a Poisson distribution,
Poisson(t·λ). A Poisson regression is therefore fitted to the data. Using a log link, we have
θ'β = log(t) + log(λ)
where θ' represents the predictor variables (or independent variables, regressors), such as
historical exacerbations and treatment groups.
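One common way to read the equation above is sketched below. It assumes that t is a patient's follow-up time and that log(t) enters the linear predictor as an offset with its coefficient fixed at 1. The subscript i, the treatment indicator x_{i1} and the historical exacerbation count x_{i2} are notation introduced here for illustration.

```latex
Y_i \sim \mathrm{Poisson}(t_i \lambda_i), \qquad
\log \mathbb{E}[Y_i] = \log(t_i) + \log(\lambda_i), \qquad
\log(\lambda_i) = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2}
\quad\Longrightarrow\quad
\frac{\lambda(x_{i1} = 1)}{\lambda(x_{i1} = 0)} = e^{\beta_1}
```

Under these assumptions, exp(β₁) is the ratio of exacerbation rates between the two treatment groups for patients with the same exacerbation history.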
19. Further Improvements
• Overcome the constraints of the data.
• Improve the design of the HMM (third-order HMM, heterogeneous transition matrix, etc.).
• When evaluating, measure in exacerbation events instead of
days for better accuracy.
People are normally diagnosed in their 40s or 50s.
Points 2 and 3 make it difficult to test new drugs or monitor the progression of COPD. We need to better identify and better quantify exacerbations.
I will avoid the details of the algorithms. The Viterbi algorithm considers both the transition and emission distributions.
However, the diagram above is just a basic HMM. In the context of this project, we adapt and redesign the HMM.
Bear in mind, these results are generated based on the assumed values of the parameters in the Viterbi algorithm.
To what extent are they in agreement with the clinical exacerbations?
It is up to us how sensitive we want the model to be. A composite measure is then desired.