Gene selection via significant subset using silhouette index
Juan Ignacio Pastore (1,2), Guillermo Abras (1), Diego Sebastián Comas (1), Marcel Brun (1), Virginia Ballarin (1)
(1) Laboratorio de Procesos y Medición de Señales, Facultad de Ingeniería, UNMdP
(2) Comisión Nacional de Investigaciones Científicas y Técnicas (CONICET)
Gene selection is an important task in bioinformatics, where significant genes are chosen according to some criterion of significance. In classification problems (disease vs. normal, tissue type, etc.), the criterion is the ability to provide good features for the classification task. In other cases it is of interest to select large groups of genes with similar behavior, regardless of the class. This task is usually carried out by a clustering algorithm, where the whole family of genes, or a subset of them, is grouped into significant clusters. These techniques provide insight into possible co-regulation between genes, but they usually produce large, sometimes enormous, sets, depending on the number of clusters required. In this work we present a new algorithm that provides sets of genes with very similar expression. This is made possible by using the complete clustering tree provided by the hierarchical clustering algorithm, and the Silhouette index to rank the subsets.
[Figure: processing pipeline. Microarray Data -> Hierarchical Clustering -> Silhouette Index -> Selected Sets]
Hierarchical Clustering
Hierarchical clustering is an agglomerative partitioning algorithm that identifies compact subsets of the data in an iterative procedure. The result of the algorithm is a dendrogram, a tree structure that records every step of the grouping process.

Silhouette Index
The Silhouette index measures not only the compactness of the clusters but also the distance between them. The higher the index, the more compact and the better separated from each other the clusters are.

\[ S = \frac{1}{K} \sum_{k=1}^{K} \left[ \frac{1}{n_k} \sum_{x \in C_k} S(x) \right] \]
\[ S(x) = \frac{b(x) - a(x)}{\max\left[ a(x), b(x) \right]} \]
\[ a(x) = \frac{1}{n_k - 1} \sum_{y \in C_k,\, y \neq x} d(x, y) \]
\[ b(x) = \min_{h = 1, \ldots, K,\; h \neq k} \left[ \frac{1}{n_h} \sum_{y \in C_h} d(x, y) \right] \]

Here K is the number of clusters, C_k is the cluster containing x, n_k is the number of elements in C_k, and d(x, y) is the distance between x and y.

[Figure: three plots with both axes running from 0 to 1; label: RAS1]
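As an illustration of the Silhouette index formulas above, the following is a minimal sketch in plain Python. It assumes that d(x, y) is the Euclidean distance and that a point in a single-element cluster receives S(x) = 0; neither choice is stated on the poster. The function name `silhouette_index` is hypothetical.

```python
from math import dist  # Euclidean distance (assumed choice for d)


def silhouette_index(clusters):
    """Global index S for a partition given as a list of clusters.

    Each cluster is a list of points (tuples of floats).
    At least two clusters are required, otherwise b(x) is undefined.
    """
    per_cluster = []
    for k, Ck in enumerate(clusters):
        scores = []
        for i, x in enumerate(Ck):
            if len(Ck) == 1:
                scores.append(0.0)  # assumption: singleton gets S(x) = 0
                continue
            # a(x): mean distance to the other members of the same cluster
            a = sum(dist(x, y) for j, y in enumerate(Ck) if j != i) / (len(Ck) - 1)
            # b(x): smallest mean distance to any other cluster
            b = min(
                sum(dist(x, y) for y in Ch) / len(Ch)
                for h, Ch in enumerate(clusters)
                if h != k
            )
            denom = max(a, b)
            scores.append((b - a) / denom if denom > 0 else 0.0)
        per_cluster.append(sum(scores) / len(scores))  # mean of S(x) over C_k
    return sum(per_cluster) / len(clusters)            # mean over the K clusters


# Two tight, well-separated groups give an index close to 1.
good = [[(0.0, 0.0), (0.0, 1.0)], [(10.0, 0.0), (10.0, 1.0)]]
# The same points grouped badly give a negative index.
bad = [[(0.0, 0.0), (10.0, 0.0)], [(0.0, 1.0), (10.0, 1.0)]]
print(silhouette_index(good), silhouette_index(bad))
```

For the first partition the index is about 0.90; for the second it is negative, because every point lies closer to the other cluster than to its own.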
Experiment : E-GEOD-15653 Submitter(s) : Patti Lab : Joslin Diabetes Center Mary Elizabeth Patti.
(Generated description): Experiment with 18 hybridizations, using 18 samples of species [Homo sapiens], using 18 arrays of array design [Affymetrix GeneChip® Human Genome HG-U133A [HG-U133A]], producing 18 raw data files and 18 transformed and/or normalized data files.
(Submitter's description 1): Hepatic lipid accumulation is an important complication of obesity linked to risk for type 2 diabetes. To identify novel transcriptional changes in human liver which could
contribute to hepatic lipid accumulation and associated insulin resistance and type 2 diabetes (DM2), we evaluated gene expression and gene set enrichment in surgical liver biopsies from 13
obese (9 with DM2) and 5 control subjects, obtained in the fasting state at the time of elective abdominal surgery for obesity or cholecystectomy. RNA was isolated for cRNA preparation and
hybridized to Affymetrix U133A microarrays. Experiment Overall Design: Human liver samples were obtained from 5 lean control subjects undergoing elective cholecystectomy and 13 obese
subjects (with or without Type 2 diabetes) undergoing gastric bypass surgery. Subjects with diabetes were classified as either well-controlled or poorly-controlled.
Ideally, we would choose compact and well-separated clusters of genes by computing the Silhouette Index of Compactness [1,2,3] on every possible subset of the N genes. This approach may take an impractical amount of time, since there are 2^N such sets; therefore we propose a sub-optimal search that restricts the computation of the index to the sets provided by the hierarchical clustering algorithm, not only at the final stage but at every intermediate step. With N genes there are N such groupings, the first with N clusters (subsets) and the last with a single large cluster, for a total of N(N+1)/2 candidate subsets. Because the groupings overlap, only about 2N of these subsets are distinct (2N - 1: the N single genes plus one new cluster per merge), and because of the way the clustering algorithm works, most of them will be compact. The Silhouette index ensures that the selected groups are also well separated from the others.
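The search described above can be sketched as follows. This is only an illustration under several assumptions that the poster does not state: Euclidean distance, average linkage for the agglomerative step, and a score for each candidate subset equal to the mean silhouette width of its members measured against all remaining points taken as a single group. Single-element subsets and the full set are given a score of 0. All function names are hypothetical.

```python
from itertools import combinations
from math import dist  # Euclidean distance (assumed)


def mean_dist(A, B, X):
    """Average pairwise distance between index sets A and B (average linkage)."""
    return sum(dist(X[i], X[j]) for i in A for j in B) / (len(A) * len(B))


def dendrogram_subsets(X):
    """Naive agglomerative clustering; returns every cluster ever formed."""
    active = [frozenset([i]) for i in range(len(X))]
    seen = list(active)                      # the N single-element subsets
    while len(active) > 1:
        A, B = min(combinations(active, 2),
                   key=lambda pair: mean_dist(pair[0], pair[1], X))
        active = [c for c in active if c not in (A, B)] + [A | B]
        seen.append(A | B)                   # one new subset per merge
    return seen                              # 2N - 1 distinct subsets


def subset_score(C, X):
    """Mean silhouette width of subset C against the remaining points."""
    rest = [i for i in range(len(X)) if i not in C]
    if len(C) < 2 or not rest:
        return 0.0                           # assumption for degenerate cases
    total = 0.0
    for i in C:
        a = sum(dist(X[i], X[j]) for j in C if j != i) / (len(C) - 1)
        b = sum(dist(X[i], X[j]) for j in rest) / len(rest)
        total += (b - a) / max(a, b) if max(a, b) > 0 else 0.0
    return total / len(C)


def rank_subsets(X):
    """Score every dendrogram subset and sort from best to worst."""
    scored = [(subset_score(C, X), sorted(C)) for C in dendrogram_subsets(X)]
    return sorted(scored, reverse=True)


# Toy data: rows are expression profiles (two conditions each).
X = [(0.0, 0.1), (0.1, 0.0), (0.05, 0.05), (5.0, 5.0), (5.2, 5.1), (9.0, 0.0)]
ranked = rank_subsets(X)
for score, members in ranked[:3]:
    print(round(score, 3), members)
```

With N = 6 profiles the sketch evaluates 2N - 1 = 11 subsets instead of 2^6 = 64. The naive clustering loop is cubic or worse in N and is meant only to make the enumeration explicit; a real data set with thousands of probes would need an efficient hierarchical clustering implementation.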
[Figure: Program's interface for gene selection]

Performance results
In the sample data used for testing, the top selected sets were consistent, and many of the genes within each group were related by function. The table below shows one of the top sets, with a Silhouette index of 0.94. It consists of two probes for the same gene (GSTM1) and one probe for gene GSTM2; both genes belong to the mu class of glutathione S-transferase enzymes, which function in the detoxification of electrophilic compounds.

The proposed tool may be powerful for biologists or computational biology researchers interested in generating new hypotheses on co-expressed genes that are not provided by more standard analysis tools.
Probe Set   | Gene Title                               | Gene Symbol | Biological Process Term | Molecular Function Term                               | Cellular Component Term
------------|------------------------------------------|-------------|-------------------------|-------------------------------------------------------|------------------------
            | glutathione S-transferase mu 1           | GSTM1       | metabolic process       | glutathione transferase activity; transferase activity | cytoplasm
204418_x_at | glutathione S-transferase mu 2 (muscle)  | GSTM2       | metabolic process       | glutathione transferase activity                      | cytoplasm
            | glutathione S-transferase mu 1           | GSTM1       | metabolic process       | glutathione transferase activity                      | cytoplasm

[1] Peter J. Rousseeuw, "Silhouettes: A graphical aid to the interpretation and validation of cluster analysis", Journal of Computational and Applied Mathematics, 20(1), pp. 53-65, 1987.
[2] John V. Pearson et al., "Identification of the Genetic Basis for Complex Disorders by Use of Pooling-Based Genomewide Single-Nucleotide-Polymorphism Association Studies", The American Journal of Human Genetics, 80, pp. 126-139, 2007.
[3] Jianping Hua, David W. Craig, Marcel Brun, Jennifer Webster, Victoria Zismann, Waibhav Tembe, Keta Joshipura, Matthew J. Huentelman, Edward R. Dougherty, Dietrich A. Stephan, "SNiPer-HD: improved genotype calling accuracy by an expectation-maximization algorithm for high-density SNP arrays", Bioinformatics, 23(1), pp. 57-63, 2007.