Monkeying Around: Automatically Analyzing Malaria Infections in Rhesus Macaques

Lindsay Hexter

Motivation
★ Why Malaria?
○ 3.2 billion people at risk globally
★ Why begin at E04?
○ To automate the analysis in Joyner et al. 2016, the foundation for other experiments
○ Compare Human vs. Machine Analysis

Motivation: Why Automation?
★ High-dimensional data
★ Precision of data analysis
★ Time: can re-apply the same analysis to new data
★ Discover something new
★ Non-experts can still draw conclusions about the data

Purpose
★ Automatic analysis of E04
○ Come up with a framework to study small datasets (differently shaped)
★ Apply this framework to other experiments
○ Generalizable to other parasite strains? How and why do the results change?

Overview of E04 Experiment
★ P. cynomolgi || P. vivax
★ Rhesus macaques ⇒ humans
★ Provide data for better understanding of the infection = better treatments!

Dataset
★ 5 monkeys: 2 non-severe, 1 severe, 1 very severe, 1 lethal
★ Clinical parameters taken daily over the course of the experiment; looked specifically at blood-related parameters

Techniques: Data preprocessing / Normalization
★ Scaling, missing data?
★ Normalize data to equalize computations for distance metrics
○ Unit vector normalization
○ Min-max normalization
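The two normalizations named above can be sketched as follows. This is a minimal illustration, not the code used in the thesis: the function names `unit_vector` and `min_max` and the daily values are hypothetical.

```python
import numpy as np

def unit_vector(x):
    """Scale a series to unit Euclidean length."""
    x = np.asarray(x, dtype=float)
    norm = np.linalg.norm(x)
    return x / norm if norm > 0 else x

def min_max(x):
    """Rescale a series linearly onto [0, 1]."""
    x = np.asarray(x, dtype=float)
    span = x.max() - x.min()
    return (x - x.min()) / span if span > 0 else np.zeros_like(x)

# Hypothetical daily values for two parameters on very different scales.
hgb = [13.0, 12.5, 9.0, 7.5, 10.0]
platelets = [350.0, 300.0, 120.0, 90.0, 200.0]

hgb_unit, plt_unit = unit_vector(hgb), unit_vector(platelets)
hgb_mm, plt_mm = min_max(hgb), min_max(platelets)
```

After either transform the two series live on comparable scales, so a distance metric is no longer dominated by the parameter with the largest raw values.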

Normalization reduced important variation
★ Normalization is used to reduce noise, but in this case it removed important malaria phase information

Techniques: Iterations on Curve-fitting
★ Two ways of thinking:
○ Guess the number of Gaussians from the number of peaks, then divide the window evenly into that many chunks for fitting (e.g. 3 peaks, so divide the window evenly and fit to those 3 ranges): original logic to code an “n_gaussians” function
○ Guess the number of Gaussians from the number of peaks, then divide the window based on the peak ranges instead: code a fitting function over the whole interval
★ Parameter search space?

★ Two approaches: minimizing residual between fit and data vs. relying on built-in function
○ Curve-fitting based on reduction of user-defined loss function: scipy minimize, scipy leastsq, scipy fmin_slsqp
○ Curve-fitting based on built-in loss function: scipy curve_fit
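The two approaches can be compared side by side on a sum of Gaussians. The sketch below is illustrative only: the model function `gaussian_sum`, the synthetic two-peak series, and the initial guess are assumptions, and Nelder-Mead is just one of the methods `scipy.optimize.minimize` offers.

```python
import numpy as np
from scipy.optimize import curve_fit, minimize

def gaussian_sum(t, *params):
    """Sum of Gaussians; params = (amplitude, center, width) per peak."""
    t = np.asarray(t, dtype=float)
    y = np.zeros_like(t)
    for amp, mu, sigma in zip(params[0::3], params[1::3], params[2::3]):
        y += amp * np.exp(-0.5 * ((t - mu) / sigma) ** 2)
    return y

# Synthetic stand-in for a daily series with two peaks (not real data).
days = np.arange(0, 60, dtype=float)
true_params = [5.0, 15.0, 3.0, 3.0, 40.0, 4.0]
rng = np.random.default_rng(0)
observed = gaussian_sum(days, *true_params) + rng.normal(0, 0.05, days.size)

# Initial guess: one Gaussian per expected peak.
guess = [4.0, 14.0, 2.0, 2.0, 42.0, 5.0]

# Approach 1: user-defined loss (sum of squared residuals) with minimize.
def loss(params):
    return np.sum((gaussian_sum(days, *params) - observed) ** 2)

fit_min = minimize(loss, guess, method="Nelder-Mead",
                   options={"maxiter": 5000})

# Approach 2: built-in least-squares loss via curve_fit.
fit_cf, _ = curve_fit(gaussian_sum, days, observed, p0=guess)
```

With a user-defined loss the objective can be changed freely (for example, to weight the peak region more heavily), while `curve_fit` fixes the objective to least squares and needs only the model and a starting point.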

★ Fitting window? Peakutils, scipy peak-finding...

Result: Curve fitting / peak finding
★ My peak function: includes plateaux!
★ Peakutils: results in poor fit
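The custom peak function is not shown on the slide. As a stand-in, the sketch below uses `scipy.signal.find_peaks`, which can also report flat-topped peaks when `plateau_size` is given; the series is hypothetical.

```python
import numpy as np
from scipy.signal import find_peaks

# Hypothetical series: a sharp peak at index 3 and a flat-topped peak
# spanning indices 8 to 10.
series = np.array([0, 1, 3, 6, 3, 1, 0, 2, 5, 5, 5, 2, 0], dtype=float)

# plateau_size=1 accepts any flat top of at least one sample and makes
# find_peaks return the edges of each plateau in the properties dict.
peaks, props = find_peaks(series, plateau_size=1)
left, right = props["left_edges"], props["right_edges"]
```

The plateau edges give a natural fitting window for each Gaussian, instead of dividing the interval evenly.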

Data is cleaned, scaled and normalized; mathematical representation via concatenated Gaussian functions; now onto analysis...

Analysis Roadmap: Goal + Technique
★ Relationship among clinical parameters ⇒ Regression modeling
★ Representation of monkeys in vector space ⇒ Residual matrices
★ Automatic grouping of clinical parameters in this vector space ⇒ Clustering
★ Minimizing biological noise to increase similarity among monkeys of similar phenotype ⇒ Bayesian optimization

Analyses we can find automatically
Joyner et al. 2016 finding ⇒ found by automated analyses?
★ # Reticulocytes: possible indicator of lethal phenotype ✅
★ Anemic phenotype worsens with severity ✅
★ Relationship between hemoglobin and: parasitemia kinetics, mean corpuscular volume (red blood cell size) ✅
★ Role of thrombocytopenia (platelet deficiency) not well understood ✅ + insight?
★ Lower parasitemia in non-severe phenotype ✅

Techniques: Regression
★ Ridge method
★ Combined model: Stochastic Gradient Descent Regressor, with weights for each monkey as a measure of predictor significance
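As a rough illustration of the ridge method, the sketch below solves the ridge problem in closed form with NumPy rather than calling a library regressor. The data is synthetic, there is no intercept term, and `ridge_fit` is a hypothetical helper, not the thesis code.

```python
import numpy as np

def ridge_fit(X, y, alpha=1.0):
    """Closed-form ridge solution: w = (X^T X + alpha * I)^-1 X^T y."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

# Synthetic stand-in: 30 days x 3 predictors, target built from known weights.
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 3))
true_w = np.array([2.0, -1.0, 0.0])
y = X @ true_w + rng.normal(0, 0.01, 30)

w_small = ridge_fit(X, y, alpha=1e-6)   # close to ordinary least squares
w_large = ridge_fit(X, y, alpha=100.0)  # coefficients shrunk toward zero
```

The fitted coefficients are what the result slides compare across monkeys: sign and magnitude indicate how each clinical parameter relates to the target.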

Results: Regression
★ Coefficients for the two non-severe monkeys are much more similar to each other than to those of the other monkeys
★ Suggests some ‘normal’ phenotype vs. sick = anomaly

Results: Regression
★ # Reticulocytes has the largest positive coefficient for the lethal phenotype: possible indicator
★ MCV (red blood cell size) as another distinguishing factor?
# Reticulocytes - possible indicator of lethal phenotype ✅
Anemic phenotype worsens with severity ✅

Results: Regression
★ Hgb shown to have the same relationship among the two non-severe monkeys and among the other group
★ Hgb coefficient is negative in the other group, as compared to positive in non-severe monkeys
★ Hgb unrelated to MCV, as found in paper
Relationship between hgb and: parasitemia, mcv ✅

Results: Combined Regression
★ Possible representation of non-severe phenotype with low regression weights

Results: Phased Regression
★ Did not improve mean squared error over whole interval: peak-finding worked well for Gaussian fitting, but not for finding phases automatically
★ Also, because evaluation was done pre-shifting, some of the phases may not have aligned (resulting in larger errors for combined models, shown in table)

Target monkey         Sum of MSEs of all phases
RIc14 (non-severe)    34.288
RSb14 (non-severe)    29.817

Techniques: Clustering
★ Clustering, e.g. k-means (also tried Gaussian means, agglomerative hierarchical, spectral, and BIRCH methods)
★ Evaluation via Silhouette Score
○ a = avg dissimilarity (distance) within own cluster
○ b = avg dissimilarity with nearest neighboring cluster
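For reference, the silhouette value of one sample is s = (b - a) / max(a, b), where a and b are both average distances, and the score is the mean of s over all samples. The sketch below computes it directly with NumPy on hypothetical 2-D points. It assumes at least two clusters and Euclidean distance, and `silhouette_score` here is a local helper rather than a library call.

```python
import numpy as np

def silhouette_score(points, labels):
    """Mean of s = (b - a) / max(a, b) over all samples."""
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    # Pairwise Euclidean distances between all samples.
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    scores = []
    for i, own in enumerate(labels):
        same = labels == own
        if same.sum() == 1:
            scores.append(0.0)  # common convention for singleton clusters
            continue
        # a: mean distance to the other members of the sample's own cluster.
        a = dist[i, same].sum() / (same.sum() - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(dist[i, labels == other].mean()
                for other in np.unique(labels) if other != own)
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

# Two tight, well-separated groups of hypothetical points.
points = [[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]]
good = silhouette_score(points, [0, 0, 1, 1])  # labels follow the groups
bad = silhouette_score(points, [0, 1, 0, 1])   # labels cut across the groups
```

Scores near 1 indicate compact, well-separated clusters; scores below 0 indicate samples that sit closer to another cluster than to their own.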

Techniques: Residual Matrices - representing Monkeys in Vector Space
★ Construct residual matrices where each clinical parameter is characterized by the residual between two monkeys
○ Rows ⇒ clinical parameters: gran, lymph, ..., wbc
○ Columns ⇒ monkey pairwise residuals / sign match: m1 vs. m2, ..., m4 vs. m5
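One way to build such a matrix is sketched below. The residual is assumed here to be the sum of squared day-by-day differences between two monkeys, which the slides do not specify, and the "sign match" variant is left out. The names `residual_matrix`, `m1` to `m3`, and the values are hypothetical.

```python
import numpy as np
from itertools import combinations

def residual_matrix(data):
    """Rows = clinical parameters, columns = monkey pairs."""
    monkeys = sorted(data)
    params = sorted(next(iter(data.values())))
    pairs = list(combinations(monkeys, 2))
    matrix = np.zeros((len(params), len(pairs)))
    for j, (m1, m2) in enumerate(pairs):
        for i, p in enumerate(params):
            diff = (np.asarray(data[m1][p], dtype=float)
                    - np.asarray(data[m2][p], dtype=float))
            # Assumed residual: sum of squared daily differences.
            matrix[i, j] = np.sum(diff ** 2)
    return params, pairs, matrix

# Hypothetical daily series for three monkeys and two parameters.
data = {
    "m1": {"gran": [1, 2, 3], "wbc": [5, 5, 5]},
    "m2": {"gran": [1, 2, 4], "wbc": [5, 6, 5]},
    "m3": {"gran": [3, 2, 1], "wbc": [5, 5, 5]},
}
params, pairs, matrix = residual_matrix(data)
```

Each row is then a vector describing one clinical parameter by how much it differs across monkey pairs, which is the representation the clustering step operates on. With 5 monkeys there are 10 pairs, so each parameter becomes a 10-dimensional vector.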

Techniques: Bayesian Optimization
★ Bayesian Optimization
○ Motivation: derive insights from very complex functions
○ E.g. 7^20 * 7^20 = !!!!!!
■ extremely computationally heavy
○ Optimality of guessed result based on loss function (in our case, residual between two monkeys)
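To give a sense of scale, 7^20 * 7^20 = 7^40, roughly 6 × 10^33 combinations, which rules out exhaustive search. The sketch below computes that number and shows the kind of loss being minimized, under the assumption (not stated on the slide) that each candidate is a time shift applied to one monkey's series. It searches 7 shifts for a single synthetic parameter by brute force; it is not Bayesian optimization itself, which would instead choose which candidates to evaluate using a surrogate model of the loss.

```python
import numpy as np

# Size of the example search space quoted above.
space = 7 ** 20 * 7 ** 20

def shifted_residual(a, b, shift):
    """Mean squared difference after delaying series b by `shift` days.

    Assumption: only overlapping days are compared, and the mean is used
    so that short overlaps are not favored.
    """
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    if shift >= 0:
        x, y = a[shift:], b[:len(b) - shift]
    else:
        x, y = a[:shift], b[-shift:]
    return float(np.mean((x - y) ** 2))

# Synthetic series: the same peak, occurring 3 days earlier in monkey b.
days = np.arange(20, dtype=float)
a = np.exp(-0.5 * ((days - 10) / 2) ** 2)
b = np.exp(-0.5 * ((days - 7) / 2) ** 2)

# Brute force over 7 candidate shifts for this one parameter.
candidates = range(-3, 4)
best = min(candidates, key=lambda s: shifted_residual(a, b, s))
```

Brute force is feasible for one parameter with 7 candidates; the combinatorial blow-up comes from choosing shifts for many parameters jointly.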

RSb14 (non-severe) and RIc14 (non-severe)

Results: Residual Matrices + Shifting
★ Shifting helped elucidate the trend between non-severe monkeys
○ Reduced residual from ~57 to ~12
Lower parasitemia in non-severe phenotype ✅

★ Shifting helped elucidate the trend between non-severe monkeys
Relationship between hemoglobin and: parasitemia kinetics ✅
[Figure panels: pre-shifting, post-shifting]

★ Monocytes: role in the innate immune system (first line of defense), so important in first phase? (prognostic of long-term survival?)

Results: Clustering + Shifting
★ Parasites / uL clustered with monocytes and reticulocytes, as previously mentioned, post-shifting (up to day 23)
# Reticulocytes - possible indicator of lethal phenotype ✅

Results: Clustering + Shifting
★ Parameters clustered together over all days, k = 4: how are they related?
○ Immune response?: granulocytes, lymphocytes, monocytes, platelets, # reticulocytes, reticulocytes concentration, white blood cell total count
○ Parasites / uL
○ Red blood cell-related parameters?: red blood cells / volume, hemoglobin, mean corpuscular hemoglobin conc, mean corpuscular hemoglobin, red blood cell volume, mean platelet volume, total red blood cell count, red blood cell distribution width
○ % reticulocytes (proportional to total red blood cell count)

Results: Normalization in clustering - tradeoff
Role of thrombocytopenia (platelet deficiency) not well understood ✅ + insight?

Can we apply this methodology to NEW experiments?

E03: P. coatneyi Hackeri (i.e. different parasite)
★ Different representation in vector space
★ Reticulocytes even further from the other clinical data (more significant in this parasite?)
★ More clusters = white blood cells / red blood cells again separated

E23: Iterative P. cynomolgi, new monkeys
★ Similar representation in vector space as E04
★ Thus - a way to characterize the malaria parasite?

Conclusions & Contributions & Future Work!
★ Comprehensive framework to analyze malaria experiments: automatically characterizing relationships among monkeys (severity phenotype?), among clinical parameters (which are similarly important?), and the malaria parasite
○ Such that non-experts can analyze data
○ Applicable to other experiments
○ Reduce TIME spent studying these results and provide more precise analysis
★ Future
○ Expand existing framework with new methods
○ Application of FULL framework to other experiments - build a more generalized model
○ More in-depth consideration of biological profiles
○ More comprehensive data storage to help automate entire process from start to finish (especially Bayesian optimization)
★ I’ve learned a lot!

Thank you ...
★ Dr. Galinski and co. for running the experiments!
★ Thesis committee members - Dr. Eisen, Dr. Fossati, Dr. Prinz
★ My friends for being here!!
★ Dr. Choi for guiding me through the process and pushing me throughout my CS career!

E23 combined model weights?
Monkey    Coefficient
RBg14     0.456473
ROh14     0.431664
RAd14     0.085915
RJn13     0.080195
ROc14     0.010533
RIb13     0.044158
