In today’s age of big data, automatic processing techniques are more important than ever, especially in biology and medicine. Many studies focus on genomic data, following the rise of high-throughput sequencing; this project instead analyzes blood parameters taken from rhesus macaques housed at the Yerkes Primate Research Center at Emory University. The initial impetus for this study was the paper “Plasmodium cynomolgi infections in rhesus macaques display clinical and parasitological features pertinent to modelling vivax malaria pathology and relapse infections” (Joyner et al., 2016). Joyner and colleagues follow infection with the malaria parasite P. cynomolgi in monkeys, collecting blood data and other biological information daily. While the paper discusses possible points of difference between monkeys of varying disease severity, we endeavored to find an automatic way to use these “clinical and parasitological features” to characterize and predict aspects of the infection, including the severity and stage of malaria. We propose to replicate existing analyses and to add new insights via various computational techniques. Machine learning is traditionally used for very large datasets; this thesis, limited to a small dataset by the high cost of studying monkeys, intends to provide a proof of concept for automatically analyzing these types of data. The flow of computation is as follows: normalization of data, creation of mathematical models, residual calculation, formation of residual matrices, and lastly the generation of regression models. The same procedure is then applied to shifted data for comparison, using Bayesian optimization. This study therefore provides a comprehensive framework for automatic analysis of medical data, which can be applied to other datasets.
2. Motivation
★ Why Malaria?
○ 3.2 billion people at risk globally
★ Why begin at E04?
○ To automate analysis in
Joyner 2016, foundation for
other experiments
○ Compare Human vs. Machine
Analysis
3. Motivation: Why Automation?
★ High-dimensional data
★ Precision of data analysis
★ Time: can re-apply same
analysis to new data
★ Discover something new
★ Non-experts can still
make conclusions about
the data
4. Purpose
★ Automatic analysis of E04
○ Come up with a framework to study small
datasets (differently shaped)
★ Apply this framework to other
experiments
○ Generalizable to other parasite strains? How
and why do the results change?
5. Overview of E04 Experiment
★ P. cynomolgi || P. vivax
★ Rhesus macaques ⇒ humans
★ Provide data for better
understanding of the infection =
better treatments!
6. Dataset
★ 5 monkeys: 2 non-severe, 1
severe, 1 very severe, 1 lethal
★ Clinical parameters taken daily
over the course of the experiment -
looked at specifically blood-related
parameters
7. Techniques: Data preprocessing / Normalization
★ Scaling, missing data?
★ Normalize data to equalize computations for
distance metrics
[Figure panels: unit vector normalization; min-max normalization]
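The two normalizations named on this slide can be sketched as follows. This is a minimal illustration, not the code used in the thesis: the function names and the `hgb` values are hypothetical, and missing measurements are assumed to be stored as NaN and simply left in place.

```python
import numpy as np

def min_max_normalize(x):
    """Rescale a 1-D series to [0, 1], ignoring NaNs; a constant series maps to zeros."""
    x = np.asarray(x, dtype=float)
    lo, hi = np.nanmin(x), np.nanmax(x)
    if hi == lo:
        return np.zeros_like(x)
    return (x - lo) / (hi - lo)

def unit_vector_normalize(x):
    """Divide a 1-D series by its Euclidean norm, ignoring NaNs; an all-zero series stays zero."""
    x = np.asarray(x, dtype=float)
    norm = np.sqrt(np.nansum(x ** 2))
    if norm == 0:
        return np.zeros_like(x)
    return x / norm

# Hypothetical daily hemoglobin values with one missing day (assumption, not real data).
hgb = np.array([12.0, 11.5, np.nan, 9.0, 7.5])
hgb_min_max = min_max_normalize(hgb)
hgb_unit = unit_vector_normalize(hgb)
```

Both functions put every parameter on a comparable scale for distance computations, but they also flatten differences in absolute magnitude between series.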
8. Normalization reduced important variation
★ Used to reduce noise, but in this case, removed important malaria phase
information
9. Techniques: Iterations on Curve-fitting
★ Two ways of thinking:
○ Guess x # of Gaussians based on x # peaks, then divide that window into x
chunks for fitting (e.g. 3 peaks, so divide window evenly and fit to those 3
ranges) - original logic to code an “n_gaussians” function
○ Guess x # Gaussians based on x # peaks, then specifically divide windows
based on peak ranges instead - code a fitting function over the whole
interval
★ Parameter search space?
10. Techniques: Iterations on Curve-fitting
★ Two approaches: minimizing residual between fit and data
vs. relying on built-in function
○ Curve-fitting based on reduction of user-defined loss function: scipy
minimize, scipy leastsq, scipy fmin_slsqp
○ Curve-fitting based on built-in loss function: scipy curve_fit
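The two fitting approaches can be compared on a toy series. The sketch below is an assumption-laden illustration: the `n_gaussians` name comes from the slides, but its signature, the synthetic two-peak data, and the initial guess are all hypothetical. `scipy.optimize.minimize` with `method="SLSQP"` stands in for the user-defined-loss route, and `scipy.optimize.curve_fit` for the built-in-loss route.

```python
import numpy as np
from scipy.optimize import curve_fit, minimize

def n_gaussians(x, *params):
    """Sum of Gaussians; params is a flat sequence of (amplitude, center, width) per peak."""
    x = np.asarray(x, dtype=float)
    y = np.zeros_like(x)
    for amp, mu, sigma in zip(params[0::3], params[1::3], params[2::3]):
        y += amp * np.exp(-0.5 * ((x - mu) / sigma) ** 2)
    return y

# Synthetic two-peak series standing in for one clinical parameter (not real data).
days = np.linspace(0, 100, 201)
true_params = [5.0, 20.0, 4.0, 3.0, 60.0, 6.0]
rng = np.random.default_rng(0)
signal = n_gaussians(days, *true_params) + rng.normal(0, 0.05, days.size)

# Initial guess, e.g. taken from detected peak positions (hypothetical values).
guess = [4.0, 18.0, 5.0, 2.0, 63.0, 5.0]

# Approach 1: built-in least-squares loss.
popt_cf, _ = curve_fit(n_gaussians, days, signal, p0=guess)

# Approach 2: user-defined loss (sum of squared residuals) passed to a general minimizer.
def sse(params):
    return float(np.sum((signal - n_gaussians(days, *params)) ** 2))

res = minimize(sse, guess, method="SLSQP")
popt_min = res.x
```

The user-defined loss makes it easy to swap in a different error measure, while `curve_fit` needs less code and also returns a covariance estimate.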
12. Result: Curve fitting / peak finding
★ Fitting window? Peakutils, scipy peak-finding...
★ My peak function - includes plateaux!
★ Peakutils results in poor fit
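A peak finder that also reports plateaux can be sketched with `scipy.signal.find_peaks`, which returns the middle index of a flat peak and, when `plateau_size` is given, the plateau edges. This is a stand-in for the custom peak function mentioned on the slide, whose implementation is not shown; the `fitting_windows` helper and the toy series are hypothetical.

```python
import numpy as np
from scipy.signal import find_peaks

def fitting_windows(left_edges, right_edges, n_samples):
    """Split [0, n_samples) into one window per peak, cutting halfway between adjacent peaks."""
    cuts = [0]
    for right, left in zip(right_edges[:-1], left_edges[1:]):
        cuts.append(int(right + left) // 2)
    cuts.append(n_samples)
    return list(zip(cuts[:-1], cuts[1:]))

# Toy series with one plateau (indices 2-4) and one sharp peak (index 8); not real data.
series = np.array([0, 1, 3, 3, 3, 1, 0, 2, 5, 2, 0], dtype=float)

# plateau_size=1 keeps all peaks and makes find_peaks return the plateau edges.
peaks, props = find_peaks(series, plateau_size=1)
windows = fitting_windows(props["left_edges"], props["right_edges"], len(series))
```

Each window can then be handed to the Gaussian fit as the range for one peak.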
13. Data is cleaned, scaled and normalized;
mathematical representation via
concatenated Gaussian functions; now
onto analysis...
14. Analysis Roadmap: Goal + Technique
Goal | Technique
Relationship among clinical parameters | Regression modeling
Representation of monkeys in vector space | Residual matrices
Automatic grouping of clinical parameters in this vector space | Clustering
Minimizing biological noise to increase similarity among monkeys of similar phenotype | Bayesian optimization
15. Analyses we can find automatically
Joyner et al. 2016 | Automated analyses
# Reticulocytes - possible indicator of lethal phenotype | ✅
Anemic phenotype worsens with severity | ✅
Relationship between hemoglobin and: parasitemia kinetics, mean corpuscular volume (red blood cell size) | ✅
Role of thrombocytopenia (platelet deficiency) not well understood | ✅ + insight?
Lower parasitemia in non-severe phenotype | ✅
16. Techniques: Regression
★ Ridge method
★ Combined model: Stochastic Gradient Descent Regressor - weights for each monkey as measure of predictor significance
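Ridge regression can be sketched in closed form with NumPy. This is an illustration under stated assumptions, not the thesis code: the choice of target and predictors, the `alpha` values, and the synthetic data are hypothetical, and the closed-form solution replaces whatever library implementation was actually used. The SGD-based combined model is not covered here.

```python
import numpy as np

def ridge_fit(X, y, alpha=1.0):
    """Closed-form ridge on centered data: w = (Xc^T Xc + alpha * I)^-1 Xc^T yc."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    w = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ yc)
    intercept = y_mean - x_mean @ w
    return w, intercept

# Synthetic example: 60 daily samples, 3 predictors (stand-ins for clinical parameters).
rng = np.random.default_rng(1)
X = rng.normal(size=(60, 3))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(0, 0.1, 60)

w_weak, b_weak = ridge_fit(X, y, alpha=1e-3)     # nearly ordinary least squares
w_strong, b_strong = ridge_fit(X, y, alpha=100)  # heavily shrunk coefficients
```

Fitting one such model per monkey yields one coefficient vector per animal, and those vectors can then be compared across monkeys.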
17. Results: Regression
★ Coefficients for both
non-severe monkeys
are much more similar,
in comparison to the
other monkeys
★ Suggests some ‘normal’
phenotype vs. sick =
anomaly
18. Results: Regression
★ # Reticulocytes largest
positive coefficient for
lethal phenotype -
possible indicator
★ MCV - red blood cell
size - as another
distinguishing factor?
# Reticulocytes - possible indicator of lethal phenotype ✅
Anemic phenotype worsens with severity ✅
19. Results: Regression
★ Hgb shown to have the
same relationship
among two non-severe
monkeys and among
the other group
★ Hgb is negative as
compared to positive in
non-severe monkeys
★ Hgb unrelated to mcv,
as found in paper
Relationship between hgb and: parasitemia, mcv ✅
21. Results: Phased Regression
★ Did not improve mean squared error over whole interval - peak-finding worked
well for Gaussian fitting, but not finding phases automatically
★ Also because of evaluation pre-shifting, some of the phases may not have
aligned (resulting in larger errors for combined models, shown in table)
Target monkey | Sum of MSEs of all phases
RIc14 (non-severe) | 34.288
RSb14 (non-severe) | 29.817
22. Techniques: Clustering
★ Clustering, e.g. kmeans
(tried Gaussian means,
agglomerative
hierarchical, spectral, and
birch methods)
★ Evaluation via Silhouette
Score
b = avg dissimilarity with nearest neighboring cluster
a = avg dissimilarity within own cluster
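In the standard definition, the two quantities combine into a per-point score s = (b - a) / max(a, b), averaged over all points. The sketch below computes it directly with NumPy and Euclidean distance; the function name, the toy points, and the convention of scoring singleton clusters as 0 are assumptions for illustration, and at least two clusters are required.

```python
import numpy as np

def mean_silhouette(points, labels):
    """Average of s = (b - a) / max(a, b) over all points, using Euclidean distance."""
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    scores = []
    for i in range(len(points)):
        same = labels == labels[i]
        if same.sum() == 1:
            scores.append(0.0)  # singleton cluster: score defined as 0
            continue
        # a: mean distance to the other members of the point's own cluster
        a = dist[i, same].sum() / (same.sum() - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(dist[i, labels == other].mean()
                for other in np.unique(labels) if other != labels[i])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

# Toy example: two tight, well-separated groups (not real data).
pts = [[0, 0], [0, 1], [10, 0], [10, 1]]
good = mean_silhouette(pts, [0, 0, 1, 1])  # labels match the geometry
bad = mean_silhouette(pts, [0, 1, 0, 1])   # labels cut across the groups
```

Scores near 1 indicate compact, well-separated clusters; negative scores indicate points that sit closer to another cluster than to their own.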
23. Techniques: Residual Matrices- representing
Monkeys in Vector Space
★ Construct residual matrices where
each clinical parameter is
characterized by the residual
between two monkeys
[Table layout: columns = monkey pairwise residuals / sign match (m1 vs. m2, ..., m4 vs. m5); rows = clinical parameters (gran, lymph, ..., wbc)]
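A residual matrix with clinical parameters as rows and monkey pairs as columns can be sketched as below. The residual is assumed here to be the sum of squared differences between two monkeys' curves for the same parameter; the slides do not give the exact definition, and the "sign match" variant is not covered. The function name, input layout, and toy values are hypothetical.

```python
from itertools import combinations

import numpy as np

def residual_matrix(curves):
    """curves: {monkey: {parameter: 1-D series}}, all series of equal length.

    Returns (matrix, parameters, pairs), where matrix[i, j] is the sum of squared
    differences of parameter i between the two monkeys in pair j.
    """
    monkeys = sorted(curves)
    parameters = sorted(curves[monkeys[0]])
    pairs = list(combinations(monkeys, 2))
    matrix = np.zeros((len(parameters), len(pairs)))
    for j, (first, second) in enumerate(pairs):
        for i, name in enumerate(parameters):
            diff = (np.asarray(curves[first][name], dtype=float)
                    - np.asarray(curves[second][name], dtype=float))
            matrix[i, j] = np.sum(diff ** 2)
    return matrix, parameters, pairs

# Toy input with three monkeys and two parameters (not real data).
toy = {
    "m1": {"gran": [1, 2, 3], "wbc": [0, 0, 0]},
    "m2": {"gran": [1, 2, 3], "wbc": [1, 1, 1]},
    "m3": {"gran": [2, 3, 4], "wbc": [0, 0, 0]},
}
matrix, parameters, pairs = residual_matrix(toy)
```

Each row is then a vector describing one clinical parameter, which is what the clustering step groups.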
24. Techniques: Bayesian Optimization
★ Bayesian Optimization
○ Motivation: derive insights from
very complex functions
○ E.g. 7^20 * 7^20 = 7^40 ≈ 6.4 × 10^33 (!)
■ extremely computationally
heavy
○ Optimality of guessed result
based on loss function (in our
case, residual between two
monkeys)
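The loss being optimized can be illustrated with a single series: the residual between two monkeys as a function of a time shift. The sketch below finds the best shift by exhaustive search, which is a deliberate swap-in for Bayesian optimization and is only feasible because the toy search space has 15 candidates. The function name, the shift range, and the synthetic curves are hypothetical.

```python
import numpy as np

def shifted_residual(reference, target, shift):
    """Mean squared residual over the overlap after delaying `target` by `shift` days."""
    reference = np.asarray(reference, dtype=float)
    target = np.asarray(target, dtype=float)
    if shift >= 0:
        ref, tgt = reference[shift:], target[:len(target) - shift]
    else:
        ref, tgt = reference[:shift], target[-shift:]
    n = min(len(ref), len(tgt))
    return float(np.mean((ref[:n] - tgt[:n]) ** 2))

# Synthetic curves: the target monkey peaks 3 days earlier than the reference (not real data).
days = np.arange(40)
reference = np.exp(-0.5 * ((days - 20) / 3.0) ** 2)
target = np.exp(-0.5 * ((days - 17) / 3.0) ** 2)

# Exhaustive search over a small range of candidate shifts.
candidates = range(-7, 8)
best_shift = min(candidates, key=lambda s: shifted_residual(reference, target, s))
```

With many parameters and monkeys shifted jointly, the number of combinations grows too fast for this kind of enumeration, which is the motivation for a sample-efficient optimizer.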
26. Results: Residual Matrices + Shifting
★ Shifting helped
elucidate the trend
between
non-severe
monkeys
Reduced residual from ~ 57 to ~ 12
Lower parasitemia in non-severe phenotype ✅
27. Results: Residual Matrices + Shifting
★ Shifting helped elucidate the trend between
non-severe monkeys
Relationship between hemoglobin and: parasitemia kinetics ✅
[Figure panels: pre-shifting; post-shifting]
28. Results: Residual Matrices + Shifting
★ Monocytes - role in adaptive immune system, so important in first phase?
(prognostic of long-term survival?)
29. Results: Clustering + Shifting
Parasites / uL clustered with monocytes and reticulocytes, as previously
mentioned, post-shifting (up to day 23)
# Reticulocytes - possible indicator of lethal phenotype ✅
30. Results: Clustering + Shifting
★ Parameters clustered together over all days, k = 4: how are they related?
○ Immune response? Granulocytes, lymphocytes, monocytes, platelets, # reticulocytes, reticulocytes concentration, white blood cell total count
○ Parasites / uL
○ Red blood cell-related parameters? Red blood cells / volume, hemoglobin, mean corpuscular hemoglobin conc, mean corpuscular hemoglobin, red blood cell volume, mean platelet volume, total red blood cell count, red blood cell distribution width
○ % reticulocytes (proportional to total red blood cell count)
31. Results: Normalization in clustering - tradeoff
Role of thrombocytopenia (platelet deficiency) not well understood ✅ + insight?
32. Can we apply this methodology to NEW
experiments?
33. E03: P. coatneyi Hackeri (i.e. different parasite)
★ Different
representation in
vector space
★ Reticulocytes even
further from the
other clinical data
(more significant in
this parasite?)
★ More clusters =
white blood cells /
red blood cells
again separated
34. E23: Iterative P. cynomolgi, new monkeys
★ Similar
representation in
vector space as
E04
★ Thus - a way to
characterize the
malaria parasite?
35. Conclusions & Contributions & Future Work!
★ Comprehensive framework to analyze malaria experiments - automatically
characterizing relationships among monkeys (severity phenotype?), among
clinical parameters (which are similarly important?), and malaria parasite
○ Such that non-experts can analyze data
○ Applicable to other experiments
○ Reduce TIME spent studying these results and provide more precise analysis
★ Future
○ Expand existing framework with new methods
○ Application of FULL framework to other experiments - build a more generalized model
○ More in-depth consideration of biological profiles
○ More comprehensive data storage to help automate entire process from start to finish
(especially Bayesian optimization)
★ I’ve learned a lot!
36. Thank you ...
★ Dr. Galinski and co. for running the experiments!
★ Thesis committee members - Dr. Eisen, Dr. Fossati, Dr. Prinz
★ My friends for being here!!
★ Dr. Choi for guiding me through the process and pushing me throughout my
CS career!