Integrative analysis of transcriptomics and proteomics data with ArrayMining and TopoGSA


These slides are part of a presentation I gave in March 2010 at the BioInformatics and Genome Research Open Club at the Weizmann Institute of Science, Israel.

In these slides my student and I describe two web-applications for microarray and gene/protein set analysis, ArrayMining and TopoGSA. These use ensemble and consensus methods, together with modular combinations of different analysis techniques, to provide an integrative view of (microarray-based) gene sets, interlinking transcriptomics with proteomics data sources. This integrative process uses tools from different fields, e.g. statistics, optimisation and network topology analysis. As an example of these integrative techniques, we use a microarray consensus-clustering approach based on Simulated Annealing, which is part of the Class Discovery Analysis module, and show how this approach can be combined in a modular fashion with a prior gene set analysis. The results show that merging the two methods improves cluster validity indices, and they point to distinct sub-classes within pre-defined tumour categories for a breast cancer dataset from the Queen's Medical Centre, Nottingham.
In the second part of the talk, I show how results from a supervised microarray feature selection analysis can be investigated in further detail with TopoGSA, a new web-tool for network topological analysis of gene/protein sets mapped onto a comprehensive human protein-protein interaction network. I discuss results from a TopoGSA analysis of the complete set of genes currently known to be mutated in cancer.

Published in: Education
  • Now we combine the Class Discovery analysis with the Gene Set Analysis module, discussed on the next slides.

    1. Integrative analysis of transcriptomics and proteomics data: implications for cancer biology (ArrayMining and TopoGSA)
       • Authors: Enrico Glaab & Natalio Krasnogor
       • Affiliations: ASAP, Interdisciplinary Optimisation Laboratory; School of Computer Science; Centre for Integrative Plant Biology; Centre for Healthcare Associated Infections; Institute of Infection, Immunity and Inflammation; University of Nottingham
    2. Outline
       • Introduction: goals and data sets
       • ArrayMining: tool set for microarray analysis
       • TopoGSA: network topological analysis of genes/proteins
       • (time permitting) Network-based pathway extension
       • Reference: Gibson G (2003) Microarray Analysis. PLoS Biol 1(1): e15. doi:10.1371/journal.pbio.0000015
    3. Introduction
       • Typical problem in biosciences: How to make effective use of multiple, large-scale data sources?
       • Typical problem in computer science: How to exploit the strengths of different algorithms?
       • GOAL: Develop new (& existing) methods combining diverse data sources and algorithms
    4. Reference data set: Armstrong et al. Leukemia data set
       • Platform: Affymetrix U95A oligonucleotide array
       • Normalisation: Variance Stabilizing Normalisation (Huber et al., 2002)
       • 72 samples and 12,626 genes
       • 3 leukemia sub-types: ALL (24), AML (28), MLL (20)
       • Thresholding/filtering steps: see Armstrong et al. (2001, Nat. Genet.)
       • The data set is publicly accessible.
       • Figure: heat map of the 30 most differentially expressed genes (rows) vs. samples (columns)
    5. Main data set: QMC breast cancer microarray data set
       • Platform: Illumina Sentrix Human-6 BeadChips
       • Pre-normalized data (log-scale, min: 4.9, max: 13.3)
       • 128 samples and 47,293 genes
       • 3 tumour grades: 1 (33), 2 (52), 3 (43)
       • Probe level data analysis: Bioconductor beadarray package
       • Public access to data set: accession number E-TABM-576
       • Figure: heat map of the 30 most differentially expressed genes (rows) vs. samples (columns; grade 1 and grade 3)
    6. Breast cancer data: difficulties
       • Breast cancer outcome is hard to predict: there is a large degree of class overlap in breast cancer microarray data, whereas leukemia decision boundaries are easy to find (Blazadonakis, 2009).
       • Figure panels: van 't Veer et al., Alon et al., Golub et al.
    7. Data Fusion: other biological data sources used
       • Breast cancer microarray data: additional public data sets (Huang et al., van 't Veer et al.); pre-processing: GC-RMA
       • Protein interaction data: unweighted binary interactions (MIPS, DIP, BIND, HPRD, IntAct; human only); 9392 nodes, 38857 edges
       • Cellular pathway data: obtained from GO, BioCarta, Reactome, KEGG and InterPro; total: approx. 3000 pathways (size > 10)
       • Cancer gene sets: mutated genes in different human cancer types (Breast, Liver, ...); 30 gene sets of size > 10 genes
    8. Methods overview: ArrayMining & TopoGSA
    9. Web-tool: What is ArrayMining?
       • ArrayMining is an online microarray analysis tool set integrating multiple data sources and algorithms.
       • 6 analysis modules (labelled on the slide as classical or new): 1. Gene selection; 2. Sample clustering; 3. Sample classification; 4. Gene Set Analysis; 5. Gene Network Analysis; 6. Cross-Study Normalization
       • Goal: a "Swiss army knife" for microarray analysis tasks
    10. Methods overview: ArrayMining & TopoGSA
    11. Gene selection module
       • Applies supervised feature selection algorithms (CFS, eBayes, SAM, etc.)
       • Compares multiple algorithms or combines them into an ensemble
       • Example: ENSEMBLE feature selection for the Armstrong et al. (2001) dataset
       • Legend: each gene is marked as either previously identified by Armstrong et al. or newly identified
       Affymetrix ID | Gene symbol | Gene description | F-statistic
       32847_at | MYLK | myosin, light polypeptide kinase | 159.59
       1389_at | MME | membrane metallo-endopeptidase (neutral endopeptidase, enkephalinase) | 137.53
       35164_at | WFS1 | wolfram syndrome 1 (wolframin) | 128
       36239_at | POU2AF1 | pou domain, class 2, associating factor 1 | 116.75
       1325_at | SMAD1 | smad, mothers against dpp homolog 1 (drosophila) | 110.37
       963_at | LIG4 | ligase iv, dna, atp-dependent | 89.77
       34168_at | DNTT | deoxynucleotidyltransferase, terminal | 89.31
       40570_at | FOXO1 | forkhead box o1a (rhabdomyosarcoma) | 86.89
       33412_at | LGALS1 | lectin, galactoside-binding, soluble, 1 (galectin 1) | 81.31
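The ranking table above scores each gene by an F-statistic. The slide does not say how the ensemble derives that score, so the following is only a minimal sketch that assumes a plain one-way ANOVA F-statistic per gene across the sample classes. The function names and the toy expression values are hypothetical and are not taken from ArrayMining.

```python
from statistics import mean


def f_statistic(groups):
    """One-way ANOVA F-statistic for one gene.

    `groups` is a list of lists, one list of expression values per class.
    """
    n_total = sum(len(g) for g in groups)
    k = len(groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    # Variation of the class means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of the samples around their own class mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n_total - k))


def rank_genes(expression, labels):
    """Rank genes by decreasing F-statistic.

    expression: dict mapping gene name -> list of values (one per sample)
    labels:     list of class labels, aligned with the samples
    """
    classes = sorted(set(labels))
    scores = {}
    for gene, values in expression.items():
        groups = [[v for v, lab in zip(values, labels) if lab == c] for c in classes]
        scores[gene] = f_statistic(groups)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy data (hypothetical): 3 samples per leukemia sub-type.
labels = ["ALL"] * 3 + ["AML"] * 3 + ["MLL"] * 3
expression = {
    "gene_a": [1, 2, 3, 2, 3, 4, 5, 6, 7],  # class means differ clearly
    "gene_b": [1, 2, 3, 1, 2, 3, 1, 2, 4],  # class means nearly identical
}
ranking = rank_genes(expression, labels)
print(ranking)
```

An ensemble in the sense of the slide would combine several such rankings (for example from CFS, eBayes and SAM); how ArrayMining merges them is not stated here.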
    12. Gene selection module (2): Armstrong et al. dataset
       • Automatic generation of box plots with gene and sample class annotations
       • The first row shows the box plots for the two best-ranked newly identified genes in the Armstrong et al. dataset
       • The second row shows two top-ranked previously identified genes
       • The user can easily compare and combine the results from different selection methods
    13. Further examples: Gene selection and Clustering module
       • Automatic generation of heat maps (genes vs. samples) and PCA cluster plots (Armstrong et al. dataset)
    14. Further examples: 3D-ICA and Co-Expression analysis
       • Sample space: 3D Independent Component Analysis plot (left) for the Armstrong et al. dataset, with classes ALL, AML and MLL
       • Gene space: the largest connected components from a gene co-expression network (right) for the Armstrong et al. dataset
    15. In-house data: apply the tools to new data (QMC breast cancer data)
       • Heat map: 50 most significant genes
       • Box plot: 4 most significant genes; expression levels across 3 tumour grades for STK6, MYBL2, KIF2C and AURKB
    16. QMC breast cancer data set: selected genes
       • All top-ranked genes are known or likely to be involved in breast cancer
       • The selection is robust with regard to cross-validation cycles and algorithms
       Gene name | PC (gene vs. outcome) | Fold change | Q-value (rank)
       ESTROGEN RECEPTOR 1 | -0.75 | 0.16 | 1.6e-20 (1.)
       RAS-LIKE, ESTROGEN-REGULATED, GROWTH INHIBITOR | -0.66 | 0.46 | 5.3e-14 (2.)
       WD REPEAT DOMAIN 19 | -0.66 | 0.73 | 1.2e-13 (3.)
       CARBONIC ANHYDRASE XII | -0.65 | 0.28 | 2.7e-13 (4.)
       ARP3 ACTIN-RELATED PROTEIN 3 HOMOLOG (YEAST) | 0.64 | 1.37 | 9.6e-13 (5.)
       TETRATRICOPEPTIDE REPEAT DOMAIN 8 | -0.63 | 0.82 | 2.2e-12 (6.)
       BREAST CANCER MEMBRANE PROTEIN 11 | -0.62 | 0.24 | 7.1e-12 (7.)
    17. Methods overview: ArrayMining & TopoGSA
    18. ArrayMining: Class Discovery Analysis module
       • Motivation: exploiting the synergies between partition-based and hierarchical clustering algorithms
       • Approach: consensus clustering based on the agreement of clustering results for pairs of objects (details on next slide)
         - equivalent to the median partition problem (NP-complete)
         - Simulated Annealing (SA) has been shown to provide good solutions
       • Our solution:
         - Compare SA (Aarts et al. cooling scheme) with thermodynamic SA (TSA) and fast SA (FSA); FSA provides the fastest convergence
         - Initialization: the input clustering with the highest agreement to the other inputs
    19. Consensus clustering: ArrayMining's approach
       • Clustering agreement := number of times a pair of samples (Sample 1, Sample 2) is assigned to the same cluster across all input clusterings
       • Idea: reward objects in the same cluster if they have a high agreement
       • Agreement matrix: A_ij := number of agreements across all clusterings for samples i and j
       • Fitness function threshold := (max(A) + min(A)) / 2
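The agreement matrix and threshold described on this slide can be sketched as follows. The full fitness formula is not preserved in the transcript, so the `fitness` function below is an assumption: it sums, over all pairs placed in the same consensus cluster, the agreement minus the threshold, which rewards co-clustering pairs with high agreement. Taking max and min over off-diagonal entries only is also an assumption, and all names are hypothetical.

```python
from itertools import combinations


def agreement_matrix(clusterings):
    """A[i][j] = number of input clusterings that put samples i and j together."""
    n = len(clusterings[0])
    A = [[0] * n for _ in range(n)]
    for labels in clusterings:
        for i, j in combinations(range(n), 2):
            if labels[i] == labels[j]:
                A[i][j] += 1
                A[j][i] += 1
    return A


def fitness(consensus, A):
    """Assumed fitness: sum of (A[i][j] - threshold) over co-clustered pairs."""
    n = len(A)
    pairs = list(combinations(range(n), 2))
    off_diagonal = [A[i][j] for i, j in pairs]
    # Threshold from the slide: (max(A) + min(A)) / 2.
    threshold = (max(off_diagonal) + min(off_diagonal)) / 2
    return sum(A[i][j] - threshold for i, j in pairs if consensus[i] == consensus[j])


# Three toy input clusterings of four samples (hypothetical labels).
inputs = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 1, 1, 1],
]
A = agreement_matrix(inputs)
for candidate in ([0, 0, 1, 1], [0, 1, 1, 1], [0, 0, 0, 0]):
    print(candidate, fitness(candidate, A))
```

A Simulated Annealing search would then propose small changes to a candidate partition (for example moving one sample to another cluster) and accept or reject them based on this fitness.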
    20. Simulated Annealing variants
       • FSA (Fast SA; Szu, Hartley; 1987): uses Cauchy-distributed random numbers and a slightly modified cooling scheme
       • ASA (Adaptive SA; Ingber; 1993): temperature-dependent pseudo-random numbers, quenching, "re-annealing"
       • TSA (Thermodynamic SA; Vicente et al.; 2003): automatically adjusts the temperature based on the laws of thermodynamics
       • Figure: Cauchy vs. Gaussian distribution
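To illustrate why Cauchy-distributed steps matter for FSA, here is a small sketch. The cooling formula shown on the slide is not preserved in the transcript, so the two schedules below are the commonly cited textbook forms (logarithmic for classical Boltzmann annealing, inverse-linear for fast annealing) and should be read as assumptions rather than the exact scheme used in ArrayMining. All function names are hypothetical.

```python
import math
import random


def boltzmann_temperature(t0, k):
    """Classical logarithmic schedule: T_k = T0 / ln(1 + k), for k >= 1."""
    return t0 / math.log(1 + k)


def fast_temperature(t0, k):
    """Inverse-linear schedule usually associated with fast annealing: T_k = T0 / (1 + k)."""
    return t0 / (1 + k)


def cauchy_step(scale, rng):
    """Draw a Cauchy-distributed step by inverting the Cauchy CDF."""
    return scale * math.tan(math.pi * (rng.random() - 0.5))


def count_large_steps(draw, n, threshold):
    """Count how many of n draws exceed the threshold in absolute value."""
    return sum(1 for _ in range(n) if abs(draw()) > threshold)


rng = random.Random(0)
# The heavy Cauchy tails produce occasional long jumps; a unit Gaussian practically never does.
cauchy_jumps = count_large_steps(lambda: cauchy_step(1.0, rng), 10_000, 10.0)
gauss_jumps = count_large_steps(lambda: rng.gauss(0.0, 1.0), 10_000, 10.0)
print("steps larger than 10:", cauchy_jumps, "(Cauchy) vs", gauss_jumps, "(Gaussian)")

for k in (1, 10, 100):
    print(k, round(boltzmann_temperature(10.0, k), 3), round(fast_temperature(10.0, k), 3))
```

The occasional long jumps let the search escape local optima even at low temperature, which is what permits the faster cooling schedule.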
    21. Clustering methods and validity indices
       • Sample clustering methods: 8 different methods considered
         - partition-based: k-Means, PAM, CLARA, SOM, SOTA
         - hierarchical: AL-HCL, DIANA, AGNES
       • Scoring & selection of the number of clusters: 5 validity indices / splitting rules used
         - Silhouette width, Calinski-Harabasz, Dunn, C-index, knn-Connectivity
         - good validity indices should have: no multivariate normality assumptions (Gower, 1981), small or no bias (Milligan & Cooper, 1985)
       • Standardization: classical (mean 0, std. dev. 1) or median absolute deviation
       • Example: Silhouette width
         - a(i) = avg. distance of obj(i) to all others in the same cluster
         - b(i) = avg. distance of obj(i) to all others in the closest distinct cluster
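The silhouette width combines the two quantities defined above into s(i) = (b(i) - a(i)) / max(a(i), b(i)), which lies between -1 and 1. The following is a minimal sketch working on a precomputed distance matrix; the function name, the convention of scoring singleton clusters as 0, and the toy points are assumptions made for illustration.

```python
def silhouette_widths(dist, labels):
    """Silhouette width s(i) for every object, given a full distance matrix."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)

    widths = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            widths.append(0.0)  # convention assumed here for singleton clusters
            continue
        # a(i): average distance to the other members of the same cluster.
        a = sum(dist[i][j] for j in own) / len(own)
        # b(i): smallest average distance to the members of any other cluster.
        b = min(
            sum(dist[i][j] for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        widths.append(0.0 if max(a, b) == 0 else (b - a) / max(a, b))
    return widths


# Toy example: four points on a line forming two well-separated clusters.
points = [0.0, 1.0, 10.0, 11.0]
labels = [0, 0, 1, 1]
dist = [[abs(p - q) for q in points] for p in points]
widths = silhouette_widths(dist, labels)
print(widths, sum(widths) / len(widths))
```

The average of these widths over all objects is the score used to compare clusterings and to choose the number of clusters; the sketch requires at least two clusters.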
    22. Consensus clustering: example application (QMC breast cancer dataset)
       • Separate sub-classes in 84 luminal samples with consensus clustering
       • Input algorithms: k-Means, SOM, SOTA, PAM, HCL, DIANA, HYBRID-HCL
       • Figure annotations: low confidence (silhouette widths); best separation for two clusters
    23. External validation: clustering results vs. tumour grades
       • Measure the similarity of clusterings with the Rand index R = (a + b) / (a + b + c + d), where a, b, c and d are the numbers of pairs of objects assigned to:
         - the same cluster in both clusterings (a)
         - different clusters in both clusterings (b)
         - the same cluster in clustering 1/2 and different clusters in clustering 2/1 (c/d)
       • Corrected for chance: adjusted Rand index
       • Reference clustering: 3 tumour grades (low, medium, high)
       • Figure: results for the random model (10000 random clusterings), single clusterings and consensus clustering
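The pair counts a, b, c and d translate directly into code. The sketch below computes the Rand index by pair counting and the adjusted Rand index in the widely used Hubert and Arabie form via the contingency table; whether the slides used exactly this correction is an assumption, and the function names are hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import comb


def rand_index(x, y):
    """Fraction of object pairs on which the two clusterings agree: (a + b) / all pairs."""
    agree = 0
    total = 0
    for i, j in combinations(range(len(x)), 2):
        same_x = x[i] == x[j]
        same_y = y[i] == y[j]
        agree += same_x == same_y  # counts pairs of type a (both same) and b (both different)
        total += 1
    return agree / total


def adjusted_rand_index(x, y):
    """Rand index corrected for chance (Hubert & Arabie form)."""
    n = len(x)
    sum_cells = sum(comb(c, 2) for c in Counter(zip(x, y)).values())
    sum_rows = sum(comb(c, 2) for c in Counter(x).values())
    sum_cols = sum(comb(c, 2) for c in Counter(y).values())
    expected = sum_rows * sum_cols / comb(n, 2)
    maximum = (sum_rows + sum_cols) / 2
    # Undefined (division by zero) if both clusterings are trivial.
    return (sum_cells - expected) / (maximum - expected)


# Same partition with permuted labels: perfect agreement.
print(rand_index([0, 0, 1, 1, 2, 2], [1, 1, 0, 0, 2, 2]),
      adjusted_rand_index([0, 0, 1, 1, 2, 2], [1, 1, 0, 0, 2, 2]))
# Two "crossed" partitions: below chance level after correction.
print(rand_index([0, 0, 1, 1], [0, 1, 0, 1]),
      adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1]))
```

Comparing against many random clusterings, as on the slide, gives an empirical baseline for how high the index can get by chance alone.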
    24. Methods overview: ArrayMining & TopoGSA
    25. Extension: Gene set analysis
       • Expression levels for a single gene are often unreliable
       • Similar genes might contain complementary information
       • We want to integrate functional annotation data
       • Gene Set Analysis (GSA):
         1) Identify sets of functionally similar genes (GO, KEGG, etc.)
         2) Summarize gene sets to "meta-genes" (PCA, MDS, etc.)
         3) Apply statistical analysis
       • Figure: heat map of pathways vs. samples (example: Van Andel Institute cancer gene sets)
       • Reference: Subramanian et al. Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. PNAS, October 25, 2005, vol. 102, no. 43, 15545–15550
    26. Gene Set Analysis module: example analysis
       • We apply the Gene Set Analysis module to the Armstrong et al. dataset
       • With known cancer gene sets the class separation is better than for single genes
       • Figure: heat map for the Armstrong et al. dataset based on pathway meta-genes
    27. Consensus clustering: example (2), combining consensus clustering with gene set analysis
       • Map genes onto Gene Ontology (GO), reduce dimensionality (MDS)
       • Apply the same consensus clustering as before to GO-based "meta-genes"
       • Figure annotations: ~3 times higher confidence; better separation
    28. External validation
       • Figure: single clustering, consensus clustering and consensus (PAM+SOTA), compared with 10000 random clusterings
    29. Interim summary: ArrayMining integrative clustering
       • Consensus clustering (CC) results tend to be similar to or slightly better than the best single clusterings in terms of adj. Rand index and validity indices (but with longer runtime)
       • The input clusterings should include diverse methods and exclude similar methods
       • Using gene sets (GS) representing cellular pathways instead of single genes results in better cluster separation, adj. Rand indices and validity indices (annotation data required)
       • Overall: GS & CC provide improved results, but at the cost of longer runtimes and the need for annotation data
    30. Methods overview: ArrayMining & TopoGSA
    31. TopoGSA: network topological analysis of gene sets
       • What is TopoGSA? TopoGSA is a web-application that maps gene sets onto a comprehensive human protein interaction network and analyses their network topological properties.
       • Two types of analysis:
         1. Compare genes within a gene set, e.g. up- vs. down-regulated genes
         2. Compare a gene set against a database of known gene sets (e.g. KEGG, BioCarta, GO)
    32. TopoGSA: methods
       TopoGSA computes the following topological properties for an uploaded gene set and for matched-size random gene sets:
       • the degree of each node in the gene set
       • the local clustering coefficient C_i for each node v_i in the gene set: C_i = 2 |{e_jk}| / (k_i (k_i - 1)), taken over neighbours v_j, v_k of v_i, where k_i is the degree of v_i and e_jk is the edge between v_j and v_k
       • the shortest path length between pairs of nodes v_i and v_j in the gene set
       • the node betweenness B(v) for each node v in the gene set: B(v) = sum over s ≠ v ≠ t of σ_st(v) / σ_st, where σ_st is the number of shortest paths from s to t and σ_st(v) is the number of those passing through v
       • the eigenvector centrality for each node in the gene set
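Most of the topological properties listed on this slide can be computed with a few lines of standard graph code. The sketch below covers degree, local clustering coefficient, shortest path length (breadth-first search) and node betweenness (Brandes' algorithm) for an unweighted, undirected network; eigenvector centrality is left out for brevity. It is not TopoGSA's implementation, and the function names and the four-node toy network are hypothetical.

```python
from collections import deque


def build_adjacency(edges):
    """Undirected adjacency sets from a list of (u, v) pairs."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def local_clustering(adj, v):
    """Fraction of neighbour pairs of v that are themselves connected."""
    nbrs = list(adj[v])
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(1 for i in range(k) for j in range(i + 1, k) if nbrs[j] in adj[nbrs[i]])
    return 2 * links / (k * (k - 1))


def shortest_path_length(adj, source, target):
    """Number of edges on a shortest path, or None if target is unreachable."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        if v == target:
            return dist[v]
        for w in adj[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return None


def betweenness(adj):
    """Unnormalised node betweenness via Brandes' algorithm (unordered pairs)."""
    score = dict.fromkeys(adj, 0.0)
    for s in adj:
        order, preds = [], {v: [] for v in adj}
        sigma = dict.fromkeys(adj, 0)
        dist = dict.fromkeys(adj, -1)
        sigma[s], dist[s] = 1, 0
        queue = deque([s])
        while queue:  # BFS counting shortest paths from s
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = dict.fromkeys(adj, 0.0)
        for w in reversed(order):  # accumulate dependencies back towards s
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    return {v: b / 2 for v, b in score.items()}  # each pair was counted twice


# Toy network: a triangle A-B-C with a pendant node D attached to C.
adj = build_adjacency([("A", "B"), ("B", "C"), ("A", "C"), ("C", "D")])
bc = betweenness(adj)
for node in sorted(adj):
    print(node, len(adj[node]), local_clustering(adj, node), bc[node])
```

Averaging such values over the nodes of a gene set, and over matched-size random node sets, gives the kind of comparison described on the slide.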
    33. KEGG-BRITE pathway colouring
       • Legend: Cellular processes; Environmental information processing; Genetic information processing; Human diseases; Metabolism; Cancer genes
       • Plot axes: mean node betweenness, mean clustering coefficient, mean shortest path length
       • General results:
         - Metabolic pathways have high shortest path lengths and low betweenness
         - Disease pathways and cancer gene sets tend to have high betweenness and small shortest path lengths
    34. ArrayMining → TopoGSA: send selected genes from ArrayMining to TopoGSA
       • Results of the within-gene-set comparison: the Estrogen receptor 1 gene and the apoptosis regulator Bcl2, both up-regulated in luminal samples, have outstanding network topological properties (higher betweenness, higher degree, higher centrality) in comparison to other genes.
       • Results of the comparison against reference databases:
         - Metabolic KEGG pathways are most similar to the uploaded gene set in terms of network topological properties.
         - Most similar BioCarta pathways: cytokine, differentiation and inflammatory pathways.
    35. Real-world application of the tool sets: ArrayMining identifies RERG as a tumour marker
       • RERG (Ras-related and oestrogen-regulated growth-inhibitor) was identified as a new candidate marker of the ER-positive luminal-like breast cancer subtype
       • Validation using immunohistochemistry on Tissue Microarrays (TMAs) containing 1,140 invasive breast cancers confirmed RERG's utility as a marker gene
       • Figure: TMAs of invasive breast cancer show strong RERG expression
    36. RERG protein expression vs. BCSS & DMFI
       • Kaplan-Meier plot of RERG protein expression with respect to BCSS in the combined ER+ and ER- cohort (without adjuvant treatment)
       • Kaplan-Meier plot of RERG protein expression with respect to BCSS in ER+ only (without Tamoxifen treatment)
    37. Conclusions (1): feature comparison with similar tools
       • ArrayMining & TopoGSA
         - Pre-processing: image analysis, single- and dimensionality reduction, gene name normalization, cross-study normalization, covariance-based filtering
         - Analysis: classification, clustering, gene selection, GSEA, PCA, ICA, co-expression analysis, PPI-topology analysis, ensembles/consensus
         - Usability/features: PDF reports, sortable ranking tables, data annotation, 2D/3D plots, e-mail notification, video tutorials
       • GEPAS (Tarraga et al.)
         - Pre-processing: image analysis, missing value imputation, multiple single study normalization methods, dimensionality reduction, ID converter
         - Analysis: classification, clustering, gene selection, GSEA, PCA, CGH arrays, tissue mining, text mining, TF-binding site prediction
         - Usability/features: special tree visualization (Caat, SotaTree, Newick Trees), 2D plots, data annotation (Babelomics)
       • Expression Profiler (Kapushesky et al.)
         - Pre-processing: image analysis, single study normalization, missing value imputation, dimensionality reduction, advanced data selection
         - Analysis: clustering, gene selection, PCA, co-expression analysis (different from ArrayMining), COA, similarity search
         - Usability/features: Excel export, XML queries, 2D plots, data annotation (GO, chromosome location)
    38. Conclusions (2)
       • Combining algorithms in a sequential and/or parallel fashion can provide performance improvements and new biological insights
       • Microarray and gene set analysis tasks can be interlinked flexibly in an (almost) completely automated process
       • New analysis types like network-based topology analysis and co-expression analysis complement existing tools
       • In the case of breast cancer (BC), this allowed us to identify candidate genes to characterise ER+ luminal-like BC:
         - The RERG gene is a key marker of the luminal BC class
         - It can be used to separate distinct prognostic subgroups
       • Accessible through
    39. Outlook: PPI-based pathway enlargement
       • Idea: enlarge pathways by adding genes that are "strongly connected" to the pathway nodes or that increase the pathway "compactness"
       • Pathway extension criteria:
         - degree(v) > 1; and
         - #pathway-links(v,p) / #outside-links(v,p) > T1; or
         - #triangle-links(v,p) / #possible_triangles(v,p) > T2; or
         - #pathway-links(v,p) / #pathway-nodes(p) > T3
       • Figure legend: black = pathway nodes; red, blue, green = nodes added based on different criteria
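The extension criteria above can be sketched as a filter over the non-pathway nodes of the interaction network. Several details are not defined on the slide and are assumptions here: the criteria are read as "degree > 1 and (ratio 1 or ratio 2 or ratio 3)", a "triangle link" is taken to be a pair of pathway neighbours of v that are directly connected to each other, and the threshold values T1, T2 and T3 are arbitrary. All names and the toy network are hypothetical.

```python
def build_adjacency(edges):
    """Undirected adjacency sets from a list of (u, v) pairs."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def extension_candidates(adj, pathway, t1, t2, t3):
    """Return non-pathway nodes that satisfy the (assumed) extension criteria."""
    pathway = set(pathway)
    added = []
    for v in adj:
        if v in pathway or len(adj[v]) <= 1:  # criterion: degree(v) > 1
            continue
        inside = sorted(adj[v] & pathway)     # links into the pathway
        outside = adj[v] - pathway            # links to the rest of the network

        # Ratio 1: pathway links relative to outside links.
        link_ratio = len(inside) / len(outside) if outside else float("inf")

        # Ratio 2 (assumed definition): connected pairs among the pathway neighbours.
        possible = len(inside) * (len(inside) - 1) // 2
        triangles = sum(1 for i, a in enumerate(inside) for b in inside[i + 1:] if b in adj[a])
        triangle_ratio = triangles / possible if possible else 0.0

        # Ratio 3: share of the pathway nodes that v is linked to.
        coverage = len(inside) / len(pathway)

        if link_ratio > t1 or triangle_ratio > t2 or coverage > t3:
            added.append(v)
    return sorted(added)


# Toy network: pathway P1-P2-P3 plus surrounding proteins.
edges = [
    ("P1", "P2"), ("P2", "P3"),
    ("X", "P1"), ("X", "P2"), ("X", "Q"),               # X: mostly pathway links
    ("Y", "P3"), ("Y", "Q"), ("Y", "R"), ("Y", "S"),    # Y: mostly outside links
    ("Z", "P1"),                                        # Z: degree 1, excluded
]
adj = build_adjacency(edges)
print(extension_candidates(adj, {"P1", "P2", "P3"}, t1=1.0, t2=0.5, t3=0.5))
```

In this toy case only X passes: it has two pathway links against one outside link, and its two pathway neighbours are themselves connected.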
    40. Pathway enlargement: added genes
       • Example case: BioCarta "BTG family proteins and cell cycle regulation"
       • Figure legend: black = original pathway nodes; green = nodes added based on connectivity; one added node is a cancer gene
    41. Pathway enlargement: example 1 (Alzheimer disease pathway)
       • More than 20 proteins annotated in our PPIN
       • 5 proteins added by the extension process
       • 3 of them are known to be disease-associated
       • 2 candidates: METTL2B, TMED10
    42. Pathway enlargement: example 2 (Interleukin signaling pathways)
       • Complex signaling system, sharing intracellular cascades
       • New regulators
       • New crosstalk proteins
    43. Pathway enlargement: conclusion
       • The method integrates two sources of information, extending canonical pathways using large-scale protein interaction data
       • It identifies new regulators and new candidates for disease pathways
       • Future: investigate extended pathways as input for enrichment/classification methods
       • This work is based on the following papers:
         - ArrayMining: a modular web-application for microarray analysis combining ensemble and consensus methods with cross-study normalization. E. Glaab, J. Garibaldi and N. Krasnogor. BMC Bioinformatics, 10(1):358, 2009.
         - TopoGSA: network topological gene set analysis. E. Glaab, A. Baudot, N. Krasnogor and A. Valencia. Bioinformatics.
         - RERG (Ras-related and oestrogen-regulated growth-inhibitor) expression in breast cancer as a marker of ER-positive luminal-like subtype. H.O. Habashy, D.G. Powe, E. Glaab, N. Krasnogor, J.M. Garibaldi, E.A. Rakha, G. Ball, A.R. Green and I.O. Ellis (to be submitted).
         - Extending biological pathway definitions using molecular interaction networks. E. Glaab, A. Baudot, N. Krasnogor and A. Valencia (to be submitted).
    44. Acknowledgements
       • QMC: Hany Onsy Habashy, Desmond G Powe, Emad A Rakha, Graham Ball, Andrew R Green, Ian O Ellis
       • CS: Jon M. Garibaldi
       • CNIO: A. Valencia, A. Baudot
       • BBSRC for grants BB/F01855X/1, BB/D0196131
       • EPSRC for grant EP/E017215/1
       • The EC for the Marie-Curie Early-Stage-Training programme (grant MEST-CT-2004-007597)