Each case is typically represented by 4 spots 3 malignant lesions 1 matched normal tissue Each spot has a spot grade Sometime through H&E staining Multiple spots approach Reduce the impact of possible missing spots Control for the heterogeneity of the tumor Several “scores” per spot Max Maximum intensity Pos Percent of cells staining PosMax Percent of cells staining with the maximum intensity Spot grade NL,1,2,.. -- 4-10 spots of different grades per patient
One of the by-products of random forest is intrinsic proximity. Since……
Random forest not only works for supervised problem, but also can be used in unsupervised problem, where the class outcome is not specified. The key idea was again formulated by Breiman. It goes like this …
Breiman also proposed the two standard ways of generating synthetic observations. First, … Second,
Based on these ideas, we formulate the general Random Forest clustering framework as show here.
Here we show a MDS plot of all patients who have clear cell RCC, the most common histological subtype of RCC. RF clustering group the patients into three clusters.
We also tried hierarchical clustering using Euclidean distance. The result is not as nice as we got using RF clustering. It has more patients misclassified.
Here we compare the Euclidean distance vs. RF distance. We find that they are usually correlated. But since RF distance
We can also use the clustering result to identify “we-called” “irregular” patients. Let’s focus our attention to this table first. Using RF clustering, we group all 366 RCC patients into 3 cluster. The cluster 1 and 2 contain mainly clear cell patients and cluster 3 contains mainly non-clear cell patients. We find that the cluster 1 and 2 share similar survivorship function and they do much worse, while the cluster 3 patients have a different survival profile but they do much better. This is consistent with the fact that clear cell patients usually do worse than the non-clear cell patients. We notice that there are 9 clear cell patients in cluster 3, the better surviving cluster. When we plot the Kaplan-Meir estimate of these we-called “irregular” clear cell patients, other clear cell patients and non-clear cell patients, we find that these 9 patients behave very much like those non-clear cell patients. This is why we call them “irregular”.
Here we just show the boxplot of protein expression of the 8 tumor markers across the three clusters. And this is a way of interpreting the clusters biologically.
Tree Based Methods for Analyzing Tissue Microarray Data Steve Horvath Human Genetics and Biostatistics University of California, Los Angeles
Goal: For the patients admitted into ER, to predict who is at higher risk of heart attack
Training data set:
# of subjects = 215
Outcome variable = High/Low Risk determined
19 noninvasive clinical and lab variables were used as the predictors
High 12% Low 88% High 17% Low 83% Is BP <= 91? High 70% Low 30% High 11% Low 89% High 50% Low 50% High 2% Low 98% High 23% Low 77% Is age <= 62.5? Classified as high risk! Classified as low risk! Classified as high risk! Classified as low risk! Is ST present? CART construction Yes No No No Yes Yes