8/12/14, 1:58 PM Microbiome Species Data Exploration
Page 1 of 23 file:///Users/myazdaniUCSD/Documents/microbiome/july29/index.html#1
Microbiome Species Data Exploration
Unsupervised and Supervised Approach
Mehrdad Yazdani
August 12, 2014
Outline
1. Properties of data set
2. Unsupervised Analysis
3. Supervised Analysis
Where does this data come from?
The data originates from stool samples from the NIH Human Microbiome Project (HMP) and from Professor
Larry Smarr. The NIH HMP includes both healthy and sick subjects.
Here we focus on the populations of different species.
Properties of Species Data Set
The data gives the relative composition of each species for each subject. Hence, it has
the properties of a compositional data set:
1. For each subject, the composition of a specific species is greater than 0.0 and less than 1.0
2. The composition of all species for a single subject must sum to 1.0
In our data, the number of species is:
## [1] 2572
The number of subjects is:
## [1] 249
Note that we have far more species than subjects in this data set (roughly ten times as many).
Species Compositions for Each Subject
The composition of the species for each subject must sum to 1.0; however, this is not the case
for this data set.
Possible reason: numerical "round-off" errors introduce this discrepancy.
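A minimal sketch of how the sum-to-one property could be checked and restored by rescaling each row (the closure operation). The toy table, the tolerance, and the function names are assumptions for illustration, not the code behind these slides.

```python
import math

def row_sums(compositions):
    """Return the total recorded for each subject (one row per subject)."""
    return [sum(row) for row in compositions]

def close(compositions):
    """Rescale each subject's row so that its parts sum to 1.0."""
    closed = []
    for row in compositions:
        total = sum(row)
        if total <= 0:
            raise ValueError("cannot close a row with a non-positive total")
        closed.append([value / total for value in row])
    return closed

# Hypothetical toy data: 3 subjects x 4 species, two rows slightly short of 1.0.
toy = [
    [0.50, 0.30, 0.15, 0.04],
    [0.25, 0.25, 0.25, 0.25],
    [0.70, 0.20, 0.00, 0.09],
]

# Indices of subjects whose parts do not sum to 1.0 within a small tolerance.
off = [i for i, s in enumerate(row_sums(toy)) if not math.isclose(s, 1.0, abs_tol=1e-9)]
closed = close(toy)
```

Rescaling keeps the ratios between species unchanged, which is the information a composition carries.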
Zeros
Zeros must be handled carefully. There are two classes of zeros in compositional data sets:
1. Absolute zeros: indicate that the species should be removed
2. Round-off zeros: indicate that the amount of the species was below the threshold of detection
Absolute zeros are dealt with by removing them. Round-off zeros are trickier and are typically
replaced with "small" values (imputation tricks).
Number of Absolute Zeros
The number of species that are always zero for all subjects is:
## [1] 29
We will treat these species as being absolute zeros and remove them from the data:
## [1] "Marinilabilia sp. AK2"
## [2] "Desulfovibrio piezophilus"
## [3] "Streptomyces bottropensis"
## [4] "Novosphingobium sp. AP12"
## [5] "Acinetobacter sp. NCTC 7422"
## [6] "Caldisphaera lagunensis"
## [7] "Streptomyces auratus"
## [8] "Candidatus Arthromitus sp. SFB-1"
## [9] "Gillisia sp. CBA3202"
## [10] "Thielavia terrestris"
## [11] "Synechococcus sp. PCC 6312"
## [12] "Alcaligenes faecalis"
## [13] "Aspergillus fumigatus"
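The removal step can be sketched as follows: find the species whose value is zero for every subject and drop those columns. The function name and the toy abundances are hypothetical; only the species names are taken from these slides.

```python
def split_absolute_zeros(species_names, compositions):
    """Separate species that are zero for every subject from the rest.

    compositions is a list of rows (one per subject); columns follow species_names.
    """
    n_species = len(species_names)
    always_zero = [
        j for j in range(n_species)
        if all(row[j] == 0 for row in compositions)
    ]
    dropped = set(always_zero)
    keep = [j for j in range(n_species) if j not in dropped]
    kept_names = [species_names[j] for j in keep]
    kept_rows = [[row[j] for j in keep] for row in compositions]
    removed_names = [species_names[j] for j in always_zero]
    return kept_names, kept_rows, removed_names

# Hypothetical toy table; the abundances are made up for illustration.
names = ["Bacteroides dorei", "Alcaligenes faecalis", "Escherichia coli"]
table = [
    [0.6, 0.0, 0.4],
    [0.9, 0.0, 0.1],
    [1.0, 0.0, 0.0],
]
kept_names, kept_rows, removed_names = split_absolute_zeros(names, table)
```

Note that the third subject still has a zero after this step: a species that is zero for only some subjects stays in the table and is handled as a round-off zero.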
Number of Round-off Zeros
After removing absolute zeros, we observe that there are also a large number of zeros from
round-off errors.
Since the compositions do not sum to 1.0, we replace these round-off zeros with small values so that
our data is a true compositional data set.
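The slides do not say which imputation was used. One common choice is multiplicative replacement, sketched below under that assumption: each zero becomes a small value `delta`, and the non-zero parts are shrunk proportionally so the row sums to 1.0. The value of `delta` and the example row are hypothetical.

```python
def replace_zeros(row, delta=1e-6):
    """Multiplicative zero replacement for one subject's composition.

    Zeros become `delta`; non-zero parts are shrunk proportionally so the
    row sums to 1.0 afterwards. `delta` is an assumed detection threshold.
    """
    total = sum(row)
    if total <= 0:
        raise ValueError("row has no positive parts")
    closed = [value / total for value in row]       # rescale to sum to 1.0 first
    n_zeros = sum(1 for value in closed if value == 0)
    shrink = 1.0 - n_zeros * delta                  # mass left for the non-zero parts
    return [delta if value == 0 else value * shrink for value in closed]

row = [0.70, 0.20, 0.00, 0.09]   # hypothetical subject: sums to 0.99, one zero
fixed = replace_zeros(row)
```

Because every non-zero part is multiplied by the same factor, the ratios between the observed species are preserved.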
Recall that we are dealing with compositions
1. For each subject, the composition of a specific species is greater than 0.0 and less than 1.0
2. The composition of all species for a single subject must sum to 1.0
Because of these constraints, the usual operations of addition and multiplication do not behave
as they do in ordinary Euclidean space. A transformation is therefore typically applied to the
compositions first, so that standard Euclidean methods can be used. Many such transformations
exist.
Here we apply the log transformation on compositions.
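A short sketch of the element-wise log transformation. The centered log-ratio (clr) is shown next to it as one of the other transformations in common use; it is included only for comparison and is not claimed to be what was applied here. The example composition is hypothetical.

```python
import math

def log_transform(row):
    """Element-wise natural log of a strictly positive composition."""
    if any(value <= 0 for value in row):
        raise ValueError("log transform needs strictly positive parts")
    return [math.log(value) for value in row]

def clr_transform(row):
    """Centered log-ratio: log parts minus their mean (log of the geometric mean)."""
    logs = log_transform(row)
    mean_log = sum(logs) / len(logs)
    return [value - mean_log for value in logs]

composition = [0.5, 0.3, 0.2]    # hypothetical subject, already zero-free
logged = log_transform(composition)
centered = clr_transform(composition)
```

The check for non-positive parts is the reason zeros have to be removed or replaced before this step.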
Unsupervised Approach
PCA
The top 3 PCs explain 80% of the variance.
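For reference, the explained-variance figure is the sum of the largest component variances divided by the total variance. The sketch below uses made-up component variances, chosen only so that the top three give 80%; they are not the values from this analysis.

```python
def explained_variance(variances, k):
    """Fraction of total variance captured by the k largest components."""
    ordered = sorted(variances, reverse=True)
    total = sum(ordered)
    if total <= 0:
        raise ValueError("total variance must be positive")
    return sum(ordered[:k]) / total

# Hypothetical component variances (eigenvalues of the covariance matrix).
variances = [50.0, 20.0, 10.0, 8.0, 7.0, 5.0]
top3 = explained_variance(variances, 3)
```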
Species Projected onto Top PCs
Hypothesis: PC2 is the most useful component for discriminating healthy vs. sick subjects.
What does PC-1 look like?
What does PC-2 look like?
Many of the loadings are close to zero, so PC2 can be approximated by a sparse vector.
A sparse vector is easier to interpret, because it points to the few species that "matter."
This is in contrast to PC1.
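One simple way to obtain such a sparse approximation is hard thresholding: set every loading below a cutoff to zero and rescale the rest to unit length. This is a sketch of that idea only; the loading values and the cutoff are hypothetical, and dedicated sparse PCA methods exist as alternatives.

```python
import math

def sparsify(loadings, threshold):
    """Zero out loadings with magnitude below `threshold`, then rescale to unit length."""
    kept = [w if abs(w) >= threshold else 0.0 for w in loadings]
    norm = math.sqrt(sum(w * w for w in kept))
    if norm == 0:
        raise ValueError("threshold removed every loading")
    return [w / norm for w in kept]

# Hypothetical loading vector: two dominant species, the rest near zero.
loadings = [0.80, -0.59, 0.02, -0.01, 0.03, 0.00]
sparse = sparsify(loadings, threshold=0.1)

# Indices of the species that survive the threshold.
support = [i for i, w in enumerate(sparse) if w != 0.0]
```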
Supervised Approach
Logistic Regression
We build classifiers to determine which species are important for discriminating healthy from
sick subjects. In our approach, we pool all LS, CD, and UC subjects into one group labeled as
"sick," and all HE subjects are labeled as "healthy."
The classifier that we use is a logistic regression model, and we score each classifier with
the Akaike information criterion (AIC), where lower is better.
Note that since we have an order of magnitude fewer subjects than species, this is an
underdetermined system (more unknowns than equations) and it is not meaningful to use "all" the
data. To mitigate this issue, we take subsets of the features that we have. We first take subsets
of the PCs, followed by subsets of the species.
Logistic Regression on PCs
We build logistic regression models on the top 3 PCs to measure how well these
components separate sick from healthy subjects. Recall that our PCA plots from before
appeared to show PC2 to be the most useful for this task. The AIC for the logistic regression
model that uses only PC1 is:
## [1] 163.8
The AIC for the logistic regression model that uses only PC2 is:
## [1] 61.6
The AIC for the logistic regression model that uses only PC3 is:
## [1] 190.9
A lower AIC indicates a better model. Because the three models have the same number of
parameters, the lowest AIC here also means the highest likelihood. These results therefore
support our earlier hypothesis that PC2 is more discriminative than the other PCs.
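To make the comparison concrete, here is a self-contained sketch of a one-predictor logistic regression fitted by Newton's method, with AIC computed as 2k minus twice the log-likelihood (k = 2: intercept and slope). The data are hypothetical and the fitting routine is an illustration, not the code that produced the numbers above.

```python
import math

def fit_logistic_aic(x, y, iterations=25):
    """Fit P(y=1) = sigmoid(b0 + b1*x) by Newton's method and return its AIC."""
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            w = p * (1.0 - p)
            g0 += yi - p              # gradient w.r.t. intercept
            g1 += (yi - p) * xi       # gradient w.r.t. slope
            h00 += w                  # entries of the 2x2 information matrix
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        if det < 1e-12:               # stop if the matrix is nearly singular
            break
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    log_lik = 0.0
    for xi, yi in zip(x, y):
        p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
        log_lik += math.log(p) if yi == 1 else math.log(1.0 - p)
    return 2 * 2 - 2 * log_lik        # AIC = 2k - 2 log L with k = 2

# Hypothetical scores for 10 subjects (0 = healthy, 1 = sick).
labels = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
informative = [-2.0, -1.5, -1.0, -0.5, 0.5, -0.2, 1.0, 1.5, 2.0, 2.5]
noise = [0.3, -1.2, 0.8, -0.4, 1.1, 0.2, -0.9, 1.0, -0.5, 0.6]

aic_informative = fit_logistic_aic(informative, labels)
aic_noise = fit_logistic_aic(noise, labels)
```

The feature that separates the two groups better yields the lower AIC, which is the same reading applied to PC1, PC2, and PC3 above. The toy classes overlap on purpose: with perfectly separated classes the maximum-likelihood fit does not converge.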
Logistic Regression on Individual Species
We now build classifiers on each individual species, computing the AIC of a logistic regression
model for every single species.
We select the two species with the lowest individual AIC. Since each AIC was computed for a model that uses
a single species, the pair selected this way may be sub-optimal as a pair.
Two species with lowest AIC
The two species with the lowest individual AIC are:
## [1] "Bacteroides.dorei" "Bacteroides.oleiciplenus"
Their respective individual AICs are:
## [1] 24.47 49.38
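The selection step amounts to taking the smallest entries of a table of per-species AIC values. A minimal sketch, where the first two values are the ones reported on this slide and the third entry is a made-up placeholder standing in for the remaining species:

```python
import heapq

def lowest_aic(aic_by_species, n=2):
    """Return the n (species, AIC) pairs with the smallest AIC, best first."""
    return heapq.nsmallest(n, aic_by_species.items(), key=lambda item: item[1])

aic_by_species = {
    "Bacteroides.dorei": 24.47,
    "Bacteroides.oleiciplenus": 49.38,
    "hypothetical.species": 120.0,   # placeholder, not a value from the data
}
best_two = lowest_aic(aic_by_species)
```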
Plot of the Two Species with Lowest AIC
This plot shows that the two species with the lowest AIC separate the subjects better than the PCA plot
from before. However, much of the interesting structure that the PCA plot revealed is lost (for
example, the sub-cluster within healthy subjects).
Plot of the Species with Lowest AIC against E. coli
While E. coli does not have the lowest AIC, plotting it against the species with the lowest AIC reveals good
discrimination as well as the interesting structure that PCA had revealed.
Future Work
1. Instead of selecting a single species to build a logistic regression model, select pairs of
species. This will involve fitting over 3 million logistic regression models. Fitting all triplets would
require over 2.8 billion! (Dell resources??)
2. Incorporate Ayasdi features
3. Apply similar analysis to the other data sets.
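The model counts in item 1 follow from counting combinations of the 2572 species reported earlier. A quick check (variable names are illustrative):

```python
import math

n_species = 2572                      # species count reported for this data set
pairs = math.comb(n_species, 2)       # models needed to try every pair
triplets = math.comb(n_species, 3)    # models needed to try every triplet
print(f"{pairs:,} pairs, {triplets:,} triplets")
# prints "3,306,306 pairs, 2,832,402,140 triplets"
```

Dropping the 29 always-zero species first lowers the pair count only slightly, to 3,232,153.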