Gene Selection via Significant Subset using Silhouette Index
Published in: Technology
Juan Ignacio Pastore (1,2), Guillermo Abras (1), Diego Sebastián Comas (1), Marcel Brun (1), Virginia Ballarin (1)

(1) Laboratorio de Procesos y Medición de Señales, Facultad de Ingeniería, UNMdP
(2) Comisión Nacional de Investigaciones Científicas y Técnicas, CONICET

Introduction

Gene selection is an important task in bioinformatics, where significant genes are chosen using some criterion of significance. In the case of classification (disease vs. normal, tissue type, etc.), the criterion is the ability to provide good features for the classification task. In other cases it is interesting to select large groups of genes with similar behavior, regardless of the class. This task is usually carried out by a clustering algorithm, where the whole family of genes, or a subset of them, is grouped into significant clusters. These techniques provide insight into possible co-regulation between genes, but usually produce large, sometimes enormous, sets, depending on the number of clusters required. In this work we present a new algorithm that provides sets of genes with very similar expression. This is possible by using the complete clustering tree provided by the hierarchical clustering algorithm, and the Silhouette index for ranking the subsets.

Algorithm

Microarray Data -> Hierarchical Clustering -> Silhouette Index -> Selected Sets

Hierarchical Clustering

Hierarchical clustering is an agglomerative partitioning algorithm that identifies compact subsets of the data in an iterative procedure. The result of the algorithm is a dendrogram, a tree structure that records every step of the grouping process.

Silhouette Index

The Silhouette index measures not only the compactness of the clusters, but also the distance between them. The higher the index, the more compact the clusters are and the more separated they are from each other. For a partition into K clusters C_1, ..., C_K of sizes n_1, ..., n_K, a distance d, and a point x in cluster C_k:

\[ S = \frac{1}{K} \sum_{k=1}^{K} \left[ \frac{1}{n_k} \sum_{x \in C_k} S(x) \right] \]

\[ S(x) = \frac{b(x) - a(x)}{\max\left[ a(x),\, b(x) \right]} \]

\[ a(x) = \frac{1}{n_k - 1} \sum_{y \in C_k,\, y \neq x} d(x, y) \]

\[ b(x) = \min_{h = 1, \ldots, K,\; h \neq k} \left[ \frac{1}{n_h} \sum_{y \in C_h} d(x, y) \right] \]

[Figure: three scatter plots of RAS2 against RAS1, titled QI: 0.29, QI: 0.43 and QI: 0.69.]

Experimental Data

Experiment: E-GEOD-15653
Submitter(s): Patti (Mary Elizabeth Patti)
Lab: Joslin Diabetes Center

(Generated description): Experiment with 18 hybridizations, using 18 samples of species [Homo sapiens], using 18 arrays of array design [Affymetrix GeneChip® Human Genome HG-U133A [HG-U133A]], producing 18 raw data files and 18 transformed and/or normalized data files.

(Submitter's description 1): Hepatic lipid accumulation is an important complication of obesity linked to risk for type 2 diabetes. To identify novel transcriptional changes in human liver which could contribute to hepatic lipid accumulation and associated insulin resistance and type 2 diabetes (DM2), we evaluated gene expression and gene set enrichment in surgical liver biopsies from 13 obese (9 with DM2) and 5 control subjects, obtained in the fasting state at the time of elective abdominal surgery for obesity or cholecystectomy. RNA was isolated for cRNA preparation and hybridized to Affymetrix U133A microarrays.

Experiment Overall Design: Human liver samples were obtained from 5 lean control subjects undergoing elective cholecystectomy and 13 obese subjects (with or without type 2 diabetes) undergoing gastric bypass surgery. Subjects with diabetes were classified as either well-controlled or poorly-controlled.

Experiments

Compact and well-separated clusters of genes can be chosen by computing the Silhouette Index of Compactness [1,2,3] on every possible subset of the N genes.
This approach would take an impractical amount of time, since there are 2^N such sets; we therefore propose a sub-optimal search that computes the index only on the sets produced by the Hierarchical Clustering algorithm, not only at the final stage but at every intermediate step. With N genes there are N such groupings, the first with N clusters (subsets) and the last with a single large cluster, for a total of N(N+1)/2 candidate subsets. Because the groupings overlap, only about 2N distinct subsets need to be processed (the N single genes plus one new subset per merge), and because of the way the clustering algorithm works, most of them will be compact. The Silhouette index then ensures that the selected groups are also well separated from the others.

Software Implementation

[Figures: Microarray Data; Program's interface for gene selection; Performance results]

Results

In the sample data used for testing, the top selected sets were consistent, and many of the genes within each group were related by function. The table below shows one of the top sets, with a Silhouette index of 0.94. It consists of two probes for the same gene (GSTM1) and one probe for gene GSTM2, both of which belong to the mu class of enzymes, which function in the detoxification of electrophilic compounds.

| Probe Set ID | Gene Title | Gene Symbol | Biological Process Term | Molecular Function Term | Cellular Component Term |
|---|---|---|---|---|---|
| 204550_x_at | glutathione S-transferase mu 1 | GSTM1 | metabolic process | glutathione transferase activity; transferase activity | cytoplasm |
| 204418_x_at | glutathione S-transferase mu 2 (muscle) | GSTM2 | metabolic process | glutathione transferase activity; transferase activity | cytoplasm |
| 215333_x_at | glutathione S-transferase mu 1 | GSTM1 | metabolic process | glutathione transferase activity; transferase activity | cytoplasm |

Conclusion

The proposed tool may be a powerful aid for biologists and computational biology researchers interested in generating new hypotheses about co-expressed genes that are not provided by more standard analysis tools.

References

[1] Rousseeuw, Peter J., "Silhouettes: A graphical aid to the interpretation and validation of cluster analysis", Journal of Computational and Applied Mathematics, 20(1), pp. 53-65, 1987.
[2] Pearson, John V., "Identification of the Genetic Basis for Complex Disorders by Use of Pooling-Based Genomewide Single-Nucleotide-Polymorphism Association Studies", The American Journal of Human Genetics, 80, pp. 126-139, 2007.
[3] Jianping Hua, David W. Craig, Marcel Brun, Jennifer Webster, Victoria Zismann, Waibhav Tembe, Keta Joshipura, Matthew J. Huentelman, Edward R. Dougherty, Dietrich A. Stephan, "SNiPer-HD: improved genotype calling accuracy by an expectation-maximization algorithm for high-density SNP arrays", Bioinformatics, 23(1), pp. 57-63, 2007.
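
The sub-optimal search described under Experiments can be sketched in Python as follows. This is only an illustration under several assumptions that the poster does not state: Euclidean distance between expression profiles, average linkage for the agglomerative step, and scoring each newly merged subset by the mean S(x) of its members against the partition in which it first appears. The function names (`mean_dist`, `cluster_silhouette`, `ranked_subsets`) are hypothetical, and S(x) is set to 0 wherever a(x) or b(x) is undefined.

```python
from itertools import combinations
from math import dist  # Euclidean distance (assumed metric), Python 3.8+


def mean_dist(x, cluster, points):
    """Average distance from point index x to the members of cluster other than x."""
    others = [j for j in cluster if j != x]
    return sum(dist(points[x], points[j]) for j in others) / len(others)


def cluster_silhouette(k, partition, points):
    """Mean S(x) over the members of partition[k], given the whole partition."""
    members = partition[k]
    if len(members) < 2 or len(partition) < 2:
        return 0.0  # assumed convention: a(x) or b(x) is undefined here
    total = 0.0
    for x in members:
        a = mean_dist(x, members, points)            # compactness term a(x)
        b = min(mean_dist(x, c, points)              # separation term b(x)
                for h, c in enumerate(partition) if h != k)
        total += (b - a) / max(a, b) if max(a, b) > 0 else 0.0
    return total / len(members)


def average_linkage(c1, c2, points):
    """Mean pairwise distance between two clusters (assumed linkage rule)."""
    return sum(dist(points[i], points[j]) for i in c1 for j in c2) / (len(c1) * len(c2))


def ranked_subsets(points):
    """Score every subset created by a merge and return them best first.

    The last merge (one single cluster) is skipped because b(x) is undefined.
    """
    partition = [[i] for i in range(len(points))]
    scored = []
    while len(partition) > 2:
        # Naive search for the closest pair of clusters; fine for a small sketch.
        i, j = min(combinations(range(len(partition)), 2),
                   key=lambda p: average_linkage(partition[p[0]], partition[p[1]], points))
        merged = partition[i] + partition[j]
        partition = [c for h, c in enumerate(partition) if h not in (i, j)] + [merged]
        score = cluster_silhouette(len(partition) - 1, partition, points)
        scored.append((score, sorted(merged)))
    return sorted(scored, reverse=True)


if __name__ == "__main__":
    # Toy "expression profiles": two tight pairs and one isolated point.
    profiles = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.0, 5.2), (9.0, 0.0)]
    for score, genes in ranked_subsets(profiles):
        print(f"{score:.2f}", genes)
```

On the toy data the two tight pairs rank above the looser three-point subset, which is the behavior the text describes: the clustering tree supplies compact candidates and the index favors those that are also well separated. A practical implementation would use an optimized clustering library rather than this cubic-time loop.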