
CDAC 2018 Merico making sense of cancer somatic snv


Presentation at the CDAC 2018 Workshop and School on Cancer Development and Complexity
http://cdac2018.lakecomoschool.org

Published in: Science

  1. 1. Making sense of cancer somatic SNVs and indels: from variant effects to pathways Thu 24 May Daniele Merico, PhD Director of Molecular Genetics, Deep Genomics Inc. Visiting Scientist, The Hospital for Sick Children (Toronto, Canada)
  2. 2. Outline 1. Functional interpretation of somatic variants: overview [5 min] 2. From variants to genes [10 min] 1. Variant gene product effect 2. Missense impact prediction and beyond 3. Genes with significant somatic burden 3. From genes to functions, pathways & networks [40 min] 1. Gene-set analysis [30 min] 1. Overview 2. Gene-set types, Gene Ontology & pathway resources 3. Gene-set results visualization: Cytoscape Enrichment Map 4. Types of gene-set analysis tests 5. Competitive tests: GSEA for gene expression data 6. Self-contained tests: gene-set somatic burden 7. General tips 2. Network analysis [10 min] 1. Network visualization and gene network types 2. GeneMANIA 3. Reactome FI 4. Q&A [5 min]
  3. 3. 1. Functional Interpretation of Somatic Variants: Overview
  4. 4. Criteria to Interpret Somatic Variants • What’s the effect on the gene product? • Stop-gain, frameshift, splice site alteration, missense, splicing consensus sequence, synonymous, 5’UTR, 3’UTR, intronic, upstream, downstream, ncRNA exon, ncRNA intron • Truncating loss-of-function, missense (loss-of-function or gain-of-function?) • Is a missense variant recurrent, or overlapping a known mutation hotspot? • Is a missense variant predicted damaging by impact predictors? • Is the gene an established oncogene or tumour suppressor? • Is the gene significantly mutated → could act as a novel oncogene or tumour suppressor? • Otherwise, is the gene under negative selection for truncating loss-of-function or missense variants? • Does the gene belong to a pathway or subnetwork with other cancer driver genes or enriched in somatic mutations → could act as a novel oncogene or tumour suppressor?
  5. 5. [Flowchart; box labels:] Cancer somatic mutation data; Established cancer genes; Novel cancer genes; Tumour suppressor; Oncogene; Significant burden (or genetic constraint); Truncating LOF; Missense LOF?; Missense GOF?; Established hotspot; Recurrent; Impact prediction; Gene-set and network analysis
  6. 6. 2. From Variants to Genes
  7. 7. 2.1. Variant gene product effect
  8. 8. SNV and Indel Variant Annotations • Variant database mapping • Germline allele frequencies, dbSNP • COSMIC (somatic variant database) • Gene mapping • Gene product effect type • Stop-gain, frameshift, splice site alteration, missense, splicing consensus sequence, synonymous, 5’UTR, 3’UTR, intronic, upstream, downstream, ncRNA exon, ncRNA intron • Stop-gain, frameshift, splice site alteration → expected to cause complete loss-of-function (LOF) • Missense, other → can act as gain-of-function • Missense impact prediction • SIFT, PolyPhen2, MutationAssessor, … • Other impact predictions • Splicing (e.g. MaxEntScan, dbscSNV, SPIDEX, …) • Genomic conservation (e.g. phyloP, PhastCons, …) • Omnibus meta-predictors (CADD, Eigen, …)
  9. 9. 2.2. Missense impact prediction and beyond
  10. 10. SIFT • Broadly used, relatively old (2001) • Based uniquely on protein sequence (amino acid) conservation 1. Start from query protein sequence 2. Identify similar protein sequences (PSI-BLAST) 3. Multiple alignment of protein sequences (orthologs and paralogs) 4. Amino acid x residue probability matrix (PSSM) 5. For every residue, amino acid probability reweighted by amino acid diversity at the position (sum of frequency rank * frequency) → Score: probability of observing amino acid normalized by residue conservation; cut-off: 0.05 (based on case studies) Predicting deleterious amino acid substitutions. Ng PC, Henikoff S. Genome Res. 2001 May;11(5):863-74.
  11. 11. PolyPhen2 • Integrates multiple features • 8 sequence-based, 3 structure-based (nucleotide and amino acid level) (e.g. side chain volume change, overlap with PFAM domain, multiple alignment metrics) • Supervised machine learning method (Naïve Bayes) → Requires training set • Set 1: HumDiv • Positive: damaging alleles for known Mendelian disorders (Uniprot) • Negative: nondamaging differences between human proteins and related mammalian homologs • Performance (5-fold cross-validation): (TP ~ 80%, FP ~ 10%), (TP ~ 90%, FP ~ 20%) • Set 2: HumVar • Positive: all human disease causing mutations (Uniprot) • Negative: non-synonymous SNPs without disease association → Richer model than SIFT → More biased towards training set(s) than SIFT A method and server for predicting damaging missense mutations. Adzhubei IA, Schmidt S, Peshkin L, […], Bork P, Kondrashov AS, Sunyaev SR. Nat Methods. 2010 Apr;7(4):248-9.
  12. 12. CADD • Intended as a measure of “deleteriousness” for coding and non-coding sequence, not biased to known disease variation • However, not particularly effective for non-coding regulatory sequence (see lecture) • Supervised machine learning model (Linear SVM) • Negative training set: nearly fixed human alleles, variant if compared to inferred human-chimp ancestral genome • Positive training set: simulated variants based on mutation model aware of sequence context and primate substitution rates • Predictive features (63): VEP (Variant Effect Predictor) output, UCSC tracks, Encode tracks → includes missense predictions and nucleotide-level conservation • Performance assessment: using pathogenic variants from ClinVar, performs a bit better than PhyloP for all sites and than PolyPhen/SIFT for missense coding variants A general framework for estimating the relative pathogenicity of human genetic variants. Kircher M, Witten DM, Jain P, O'Roak BJ, Cooper GM, Shendure J. Nat Genet. 2014 Mar;46(3):310-5.
  13. 13. Example of Mutation Hotspots L858R G12D, V, C, A, S, R G13D EGFR KRAS
  14. 14. 2.3. Genes with significant somatic burden
  15. 15. MutSigCV • Goal: identify significantly mutated genes → Important to model the mutational background • Tumour-specific global mutation rate • Trinucleotide context and substitution • Expression level (impacting transcription-coupled repair) • Replication timing (later-replicating regions have higher mutation rates) • Residual local genomic region mutation rate Lawrence MS, ..., Getz G. Mutational heterogeneity in cancer and the search for new cancer-associated genes. Nature 2013. PMID: 23770567
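To make the role of the background model concrete, here is a deliberately simplified sketch. It is not MutSigCV: it assumes a single, uniform per-base background mutation rate and a Poisson model, whereas the slide lists several covariates (trinucleotide context, expression, replication timing, local rate) that a real background model adjusts for. The function names and the parameter values in the usage note are hypothetical.

```python
from math import exp


def poisson_upper_tail(k, lam):
    """Return P(X >= k) for X ~ Poisson(lam)."""
    term = exp(-lam)          # P(X = 0)
    cdf_below_k = 0.0
    for i in range(k):
        cdf_below_k += term   # add P(X = i)
        term *= lam / (i + 1)  # move on to P(X = i + 1)
    return max(0.0, 1.0 - cdf_below_k)


def gene_burden_pvalue(n_mutations, gene_length_bp, n_samples, rate_per_bp):
    """Toy burden test: observed mutation count vs. a uniform background.

    Assumption (not from the slides): mutations arrive independently at
    the same rate per base in every sample.
    """
    expected = rate_per_bp * gene_length_bp * n_samples
    return poisson_upper_tail(n_mutations, expected)
```

For example, `gene_burden_pvalue(12, 3000, 200, 2e-6)` asks how surprising 12 mutations are in a 3 kb gene across 200 tumours when 1.2 are expected. A longer gene or a later-replicating region would need a larger expected count, which is exactly what the background covariates above are meant to capture.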
  16. 16. 3. From genes to functions, pathways & networks
  17. 17. [Diagram; labels:] Activity Profiles / Somatic Mutations; Prior Knowledge about genes: GENE SETS (Spindle, Apoptosis; Gene.A, Gene.B, Gene.C, Gene.D, Gene.E, Gene.F), NETWORKS, PATHWAYS (Ca++ Channels, MAPK; Gene.G, Gene.H, Gene.I, Gene.L, Gene.M, Gene.N); Informatics: Scoring models, Search algorithms; Activity Maps
  18. 18. 3.1. Gene-set Analysis
  19. 19. 3.1.1. Gene-set analysis overview
  20. 20. Gene-set Analysis Overview [Diagram; labels:] Experiment: experimentally “positive” genes (e.g. UP-regulated) and experimentally “detectable” genes (aka background set); Gene-set Databases; ENRICHMENT TEST; Enrichment Table (Set, p-value): Spindle 0.00001, Apoptosis 0.00025
  21. 21. Gene-sets for Gene-set Analysis Nuclear Pore Cell Cycle Gene.AAA Gene.ABA Gene.ABC Gene.CC1 Gene.CC2 Gene.CC3 Gene.CC4 Gene.CC5 Ribosome P53 signaling Gene.RP1 Gene.RP2 Gene.RP3 Gene.RP4 Gene.CC1 Gene.CK1 Gene.PPP From cell biology to gene-sets Gene-set Databases
  22. 22. Gene-set Analysis: Overview Spindle 0.00001 Apoptosis 0.00025 Enrichment Table FADD TRADD CYTC1 BAX BAXL CASP9 CASP10 …. SPP1 SPP2 CCCP MTC1 … Gene-sets Experimental data (e.g. gene expression table)
  23. 23. Gene-set Enrichment Test • The output of an enrichment test is a P-value • The P-value assesses the probability that, by randomly sampling the “detectable” genes, the overlap is at least as large as observed • Most used statistical model: Fisher’s Exact Test • Fisher’s Exact Test does not require actually performing the random sampling; it is based on a theoretical null-hypothesis distribution (hypergeometric distribution) [Figure: random samples of array genes]
  24. 24. Fisher’s Exact Test (FET) 2 x 2 contingency table:
                        Exp_positive=yes   Exp_positive=no
        Gene-Set=yes           a                  b
        Gene-Set=no            c                  d
      Probability of one table occurring by random sampling: hypergeometric distribution formula. Test p-value: sum of the random sampling probabilities for tables as extreme or more extreme than the real table
  25. 25. The Background is Important! • Inappropriate modeling of the background will lead to biased results – What genes are detectable by the experiment? E.g.: in a kinase phosphorylation assay, only kinases can be detected – Fisher’s Exact Test, GSEA and other tests assume all genes have the same “prior” probability of being experimentally positive → they can be used only in the absence of systematic selection biases (example of bias: if you select genes with at least one mutation, then longer genes are systematically more likely to be selected)
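As a worked illustration of the test just described: with the cell labels a, b, c, d of the contingency table and n = a + b + c + d, the hypergeometric probability of one table with fixed margins is C(a+b, a) · C(c+d, c) / C(n, a+c). The sketch below sums that probability over all tables with an overlap at least as large as the observed one (a one-sided enrichment test). The function names are illustrative and not taken from any package.

```python
from math import comb


def hypergeom_table_prob(a, b, c, d):
    """Probability of one 2x2 table under random sampling with fixed margins."""
    n = a + b + c + d
    return comb(a + b, a) * comb(c + d, c) / comb(n, a + c)


def fisher_enrichment_pvalue(a, b, c, d):
    """One-sided p-value: probability of an overlap >= a.

    a: in gene-set and experimentally positive
    b: in gene-set, not positive
    c: not in gene-set, positive
    d: neither (rest of the background)
    """
    row1, col1 = a + b, a + c          # gene-set size, number of positives
    n = a + b + c + d                  # background size
    p = 0.0
    for k in range(a, min(row1, col1) + 1):
        # Keep the margins fixed while the overlap k grows.
        p += hypergeom_table_prob(k, row1 - k, col1 - k, n - row1 - col1 + k)
    return p
```

Note that d depends on the background: counting all annotated genes instead of the experimentally detectable ones inflates d and changes the p-value.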
  26. 26. Gene-set Enrichment Analysis: Multiple Test Correction by BH-FDR • FDR (false discovery rate) is the expected proportion of false positives among the tests passing the significance threshold • Benjamini-Hochberg (BH) FDR: for a given FDR q-value threshold alpha (e.g. 25%) and m total tests (e.g. 1,000 gene-sets), find the largest number of tests k such that: P-value(k) <= k / m * alpha, i.e. alpha >= P-value(k) * m / k (e.g. 0.0125 * 1,000 / 50 = 0.25 <= 0.25)
  27. 27. Gene-set Enrichment Analysis: Multiple Test Correction by BH-FDR
      Rank | Category                   | P-value | P-value * m / k        | FDR q-value
      1    | Transcriptional regulation | 0.001   | 0.001 x 53/1 = 0.053   | 0.041
      2    | Transcription factor       | 0.002   | 0.002 x 53/2 = 0.053   | 0.041
      3    | Initiation of transcription| 0.003   | 0.003 x 53/3 = 0.053   | 0.041
      4    | Nuclear localization       | 0.0031  | 0.0031 x 53/4 = 0.041  | 0.041
      5    | Chromatin modification     | 0.005   | 0.005 x 53/5 = 0.053   | 0.053
      …    | …                          | …       | …                      | …
      52   | Cytoplasmic localization   | 0.985   | 0.985 x 53/52 = 1.004  | 0.99
      53   | Translation                | 0.99    | 0.99 x 53/53 = 0.99    | 0.99
      In other words: (1) walk the list of tests from most significant, (2) estimate how many tests would pass at each p-value if they were random draws, (3) compute the fraction of false positives, transform to monotonic 0 <= q-value <= 1. Colour coding on the slide: green = significant at FDR < 0.05, red = non-significant; the P-value threshold for FDR < 0.05 falls at rank 4 (P-value 0.0031).
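The walk through the ranked list can be sketched in a few lines. This is a minimal illustration of the standard Benjamini-Hochberg adjustment, not code from the presentation. The optional `m` argument is an assumption added here so that the top rows of the worked table can be reproduced without typing all 53 p-values; it is only valid when the omitted p-values are larger than those supplied and their P-value * m / k values do not fall below them.

```python
def bh_qvalues(pvalues, m=None):
    """Benjamini-Hochberg q-values, returned in the input order.

    m: total number of tests; defaults to len(pvalues).
    """
    m = len(pvalues) if m is None else m
    order = sorted(range(len(pvalues)), key=lambda i: pvalues[i])
    qvalues = [0.0] * len(pvalues)
    running_min = 1.0
    # Walk from the least to the most significant test so that each
    # q-value is the minimum of P * m / k over its own and all larger ranks.
    for rank in range(len(order), 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        qvalues[i] = running_min
    return qvalues


# First five p-values of the worked table, with m = 53 tests in total.
top_rows = bh_qvalues([0.001, 0.002, 0.003, 0.0031, 0.005], m=53)
```

The running minimum is what makes the q-values monotonic: ranks 1 to 3 inherit the smaller value computed at rank 4.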
  28. 28. 3.1.2. Gene-set types, Gene Ontology & pathway resources
  29. 29. Gene-set Types • Functions (e.g. Gene Ontology) • Pathways (e.g. KEGG, Reactome) • Genotype-phenotype/disease association (e.g. HPO) • Protein Families / Domains (e.g. PFAM) • Genomic position (e.g. cytobands) • Gene expression signatures (e.g. MSigDB Cancer Hallmarks) • Up/down after treatment or in relation to disease • Targets of regulators • Transcription factor targets • miRNA targets • Network-derived modules, e.g. protein-protein interactions • Drug targets
  30. 30. Gene Ontology (GO) / 1 • Effort to standardize functional description of eukaryotic gene products • Launched in 1998 • Many organism species supported • Normal function (e.g. cell cycle), not disorder / disease (e.g. metastasis formation) • Ontology defined by core team of curators who receive input from domain experts • Corpus of gene annotations based on expert curation of the literature (> 140,000 published papers in 2018), review of high-throughput data, or annotations in existing databases; performed by curators at specific organism genome databases (human: UniProtKB)
  31. 31. Gene Ontology (GO) / 2 • Ontology, intended as controlled structured vocabulary • Terms = functional concepts (e.g. cell cycle, proteasome) • Three main ontologies: molecular function (i.e. biochemical activity), cellular component, biological process (pathways and other processes) • Relations between terms: is-a, part-of / has-part, regulates, occurs-in → DAG (directed acyclic graph), supports logical inference • Most of the relations are within each main ontology; there is an ongoing effort to link processes and molecular functions to components using occurs-in
  32. 32. Biological Process: DNA repair Cellular Component: Replication fork Molecular Function: Single-strand break-containing DNA binding
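One practical consequence of the DAG structure is that a gene annotated to a term can also be counted as a member of that term's ancestors when gene-sets are built. The sketch below shows this propagation on a toy graph. The term names and parent links are invented for illustration and do not reproduce the real GO structure, and following only is-a / part-of style parent links is an assumption of this sketch.

```python
def ancestors(term, parents):
    """All terms reachable from `term` by following parent links in a DAG."""
    seen = set()
    stack = [term]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def propagate_annotations(direct_annotations, parents):
    """Extend each gene's direct terms with all ancestor terms."""
    full = {}
    for gene, terms in direct_annotations.items():
        closure = set(terms)
        for term in terms:
            closure |= ancestors(term, parents)
        full[gene] = closure
    return full


# Hypothetical toy ontology: child term -> list of parent terms.
toy_parents = {
    "spindle checkpoint": ["cell cycle checkpoint"],
    "cell cycle checkpoint": ["cell cycle"],
    "cell cycle": ["biological process"],
}
toy_direct = {"Gene.A": {"spindle checkpoint"}, "Gene.B": {"cell cycle"}}
toy_full = propagate_annotations(toy_direct, toy_parents)
```

After propagation, the "cell cycle" gene-set contains both Gene.A and Gene.B, which is why parent terms are always at least as large as their children.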
  33. 33. CHILD PARENT ABB1 ACAP3 TRAC1 LUC2 POF5 ZUMM C5A75 DUCZ
  34. 34. Pathways • Depict mechanistic details of metabolic, signaling and other biological processes • Can be computationally exported as complex graph, but often just analyzed as gene-sets • Advantages: • Curated, accurate, cause and effect captured • Human-interpretable visualizations • Disadvantages: • More sparse coverage of genome than functional sets • More complex models are required to score pathways • Static model of dynamic systems • Main resources: KEGG, Reactome
  35. 35. KEGG Cell cycle
  36. 36. Reactome Cell cycle, G1/S transition
  37. 37. Resources to Download Gene-sets BaderLab (University of Toronto) http://baderlab.org/GeneSets • Gene Ontology; Reactome, Panther, NetPath, NCI, MSigDB C2 (Biocarta, ...), HumanCyc pathways; MSigDB cancer hallmarks; MSigDB C3 (miRNA and TF targets) • updated on a monthly basis MSigDB (Broad Institute) https://software.broadinstitute.org/gsea/msigdb/ • Gene Ontology; KEGG, Reactome, Biocarta, other pathways; cancer hallmarks; expression signatures; miRNA and TF targets; interaction modules; Cytobands (positional) • last update Oct 2017, several gene-set collections are derived from old research works (2004-2005) Bioconductor org.Hs.eg.db http://bioconductor.org/packages/release/data/annotation/html/org.Hs.eg.db.html • Gene Ontology; KEGG pathways; PFAM (protein domains); Cytobands (positional) • updated every 4 months Notes: • KEGG stopped being freely available in 2011, so freely-available resources have largely outdated KEGG gene-sets • Carefully check how GO annotations are exported (e.g. all evidence codes, or excluding IEA)
  38. 38. 3.1.3. Gene-set results visualization: Cytoscape Enrichment Map
  39. 39. Need visualization solution..!
      GO.id | GO.name | p.value | cover | cover.rat | Deg.mdn | Deg.iqr
      GO:0042330 | taxis | 2.18E-06 | 23 | 0.056930693 | 54.94499375 | 9.139238998
      GO:0006935 | chemotaxis | 2.18E-06 | 23 | 0.060209424 | 54.94499375 | 9.139238998
      GO:0002460 | adaptive immune response based on somatic recombination | 7.10E-05 | 25 | 0.111111111 | 57.32306955 | 16.97054864
      GO:0002250 | adaptive immune response | 7.10E-05 | 25 | 0.111111111 | 57.32306955 | 16.97054864
      GO:0002443 | leukocyte mediated immunity | 0.000419328 | 23 | 0.097046414 | 58.27890582 | 15.58333739
      GO:0019724 | B cell mediated immunity | 0.000683758 | 20 | 0.114285714 | 57.84161096 | 15.03496347
      GO:0030099 | myeloid cell differentiation | 0.000691589 | 24 | 0.089219331 | 62.22171598 | 10.35284833
      GO:0002252 | immune effector process | 0.000775626 | 31 | 0.090116279 | 58.27890582 | 23.86214773
      GO:0050764 | regulation of phagocytosis | 0.000792138 | 8 | 0.2 | 53.54786293 | 5.742849971
      GO:0050766 | positive regulation of phagocytosis | 0.000792138 | 8 | 0.216216216 | 53.54786293 | 5.742849971
      GO:0002449 | lymphocyte mediated immunity | 0.00087216 | 22 | 0.101851852 | 57.84161096 | 16.13171132
      GO:0019838 | growth factor binding | 0.000913285 | 15 | 0.068181818 | 83.0405088 | 10.58734852
      GO:0051258 | protein polymerization | 0.00108876 | 17 | 0.080952381 | 57.97543252 | 17.31639968
      GO:0005789 | endoplasmic reticulum membrane | 0.001178198 | 18 | 0.036072144 | 64.02284752 | 12.05209158
      GO:0016064 | immunoglobulin mediated immune response | 0.001444464 | 19 | 0.113095238 | 58.27890582 | 15.58333739
      GO:0007507 | heart development | 0.001991562 | 26 | 0.052313883 | 84.02538284 | 18.60761304
      GO:0009617 | response to bacterium | 0.002552999 | 10 | 0.027173913 | 52.75249873 | 23.23104637
      GO:0030100 | regulation of endocytosis | 0.002658555 | 11 | 0.099099099 | 56.38041132 | 16.02486889
      GO:0002526 | acute inflammatory response | 0.002660742 | 24 | 0.103004292 | 57.80098769 | 24.94311116
      GO:0045807 | positive regulation of endocytosis | 0.002903401 | 9 | 0.147540984 | 54.94499375 | 6.769909171
      GO:0002274 | myeloid leukocyte activation | 0.002969661 | 7 | 0.077777778 | 54.94499375 | 16.07042339
      GO:0008652 | amino acid biosynthetic process | 0.003502921 | 7 | 0.017241379 | 45.19797271 | 31.18248579
      GO:0050727 | regulation of inflammatory response | 0.004999055 | 7 | 0.084337349 | 54.94499375 | 7.737346076
      GO:0002253 | activation of immune response | 0.00500146 | 23 | 0.116161616 | 60.29679989 | 18.41103376
      GO:0002684 | positive regulation of immune system process | 0.006581245 | 27 | 0.111570248 | 60.29679989 | 22.05051447
      GO:0050778 | positive regulation of immune response | 0.006581245 | 27 | 0.113924051 | 60.29679989 | 22.05051447
      GO:0019882 | antigen processing and presentation | 0.007244488 | 7 | 0.029661017 | 54.94499375 | 16.58797889
      GO:0002682 | regulation of immune system process | 0.007252134 | 29 | 0.099656357 | 61.05645008 | 22.65935206
      GO:0050776 | regulation of immune response | 0.007252134 | 29 | 0.102112676 | 61.05645008 | 22.65935206
      GO:0043086 | negative regulation of enzyme activity | 0.008017022 | 9 | 0.040723982 | 53.28031076 | 17.48904224
      GO:0006909 | phagocytosis | 0.008106069 | 10 | 0.080645161 | 55.66270253 | 12.47536747
      GO:0002573 | myeloid leukocyte differentiation | 0.008174948 | 10 | 0.092592593 | 62.86577216 | 9.401887596
      GO:0006959 | humoral immune response | 0.008396095 | 16 | 0.044568245 | 55.05654091 | 18.94209565
      GO:0046649 | lymphocyte activation | 0.009044401 | 29 | 0.059917355 | 61.92213317 | 21.03553355
      GO:0030595 | leukocyte chemotaxis | 0.009707319 | 7 | 0.101449275 | 56.33116709 | 6.945510559
      GO:0006469 | negative regulation of protein kinase activity | 0.010782155 | 7 | 0.046357616 | 52.22863516 | 12.58524145
      GO:0051348 | negative regulation of transferase activity | 0.010782155 | 7 | 0.04516129 | 52.22863516 | 12.58524145
      GO:0007179 | transforming growth factor beta receptor signaling pathw | 0.012630825 | 13 | 0.071038251 | 83.49440788 | 12.63256309
      GO:0005520 | insulin-like growth factor binding | 0.012950071 | 9 | 0.097826087 | 81.41963394 | 7.528247832
      GO:0042110 | T cell activation | 0.013410548 | 20 | 0.064516129 | 59.77891783 | 26.06174863
      GO:0002455 | humoral immune response mediated by circulating immunogl | 0.016780163 | 10 | 0.125 | 54.70766244 | 14.2572143
      GO:0005830 | cytosolic ribosome (sensu Eukaryota) | 0.016907351 | 8 | 0.01843318 | 61.68933284 | 7.814673781
      GO:0006487 | protein amino acid N-linked glycosylation | 0.01791078 | 7 | 0.044585987 | 56.50635337 | 6.780726553
      GO:0051240 | positive regulation of multicellular organismal process | 0.017931228 | 31 | 0.096573209 | 62.2953212 | 23.86214773
      GO:0042379 | chemokine receptor binding | 0.018849666 | 12 | 0.095238095 | 55.13915015 | 19.08254406
      GO:0008009 | chemokine activity | 0.018849666 | 12 | 0.096774194 | 55.13915015 | 19.08254406
      GO:0016055 | Wnt receptor signaling pathway | 0.020088086 | 18 | 0.04400978 | 85.47935979 | 20.92435897
  40. 40. Visualization: Cytoscape Enrichment Map • Visualization framework for gene-set analysis results • Cytoscape network: nodes correspond to gene-sets, edges correspond to gene-set overlaps (i.e. share a fraction of their genes) • Intuitive clustering of gene-sets that converge on the same functional themes • Determined by automatic network layout algorithm, based on edge weights • Overlaps < threshold are pruned, otherwise network layout would work poorly • Important: don’t confuse with gene networks • Nodes do not represent genes, they represent gene-sets/pathways • Edges do not represent physical interactions, they represent overlaps between gene-sets A B Edges represent gene-set overlap Merico D, Isserlin R, Stueker O, Emili A, Bader GD. Enrichment map: a network-based method for gene-set enrichment visualization and interpretation. PLoS One 2010. PMID: 21085593
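The edge construction can be sketched as follows. The Jaccard and overlap coefficients are common choices for scoring gene-set overlap; the default threshold of 0.5, the function names and the toy gene-set memberships are assumptions made here for illustration, and gene-sets are assumed to be non-empty.

```python
from itertools import combinations


def jaccard(a, b):
    """Shared genes divided by the union of the two gene-sets."""
    return len(a & b) / len(a | b)


def overlap_coefficient(a, b):
    """Shared genes divided by the size of the smaller gene-set."""
    return len(a & b) / min(len(a), len(b))


def enrichment_map_edges(gene_sets, threshold=0.5, similarity=overlap_coefficient):
    """Return (set1, set2, weight) for every pair at or above the threshold."""
    edges = []
    for (name1, genes1), (name2, genes2) in combinations(sorted(gene_sets.items()), 2):
        weight = similarity(genes1, genes2)
        if weight >= threshold:   # weaker overlaps are pruned
            edges.append((name1, name2, weight))
    return edges


# Hypothetical memberships; nodes are gene-sets, not genes.
toy_sets = {
    "gs1": {"TP53", "NTRK1", "MAPK3"},
    "gs2": {"NTRK1", "MAPK3", "PRKCA"},
    "gs3": {"TAZ", "CAZ1"},
}
toy_edges = enrichment_map_edges(toy_sets)
```

The overlap coefficient links a small gene-set to a large one that contains it, which suits hierarchical collections such as Gene Ontology, whereas Jaccard penalizes size differences.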
  41. 41. Visualization: Cytoscape Enrichment Map ABB1 ACAP3 TRAC1 LUC2 POF5 ZUMM C5A75 DUCZ TP53 NTRK1 MAPK3 ANAAT PIK1 PRKCA gs1 gs2 gs3 gs4 gs5 PIRL2 TAZ CAZ1 gs1 gs3 gs4 gs2 gs5
  42. 42. Example: Differential expression after estrogen treatment of breast cancer cells, GSEA competitive gene-set analysis
  43. 43. • Using the native Gene Ontology relations results in a more disconnected graph
  44. 44. 3.1.4. Types of gene-set analysis tests
  45. 45. Competitive vs Self-contained • Competitive (aka enrichment, aka over-representation) → gene-set genes “compete” with all other genes (for enrichment) • Self-contained → gene-set scored independently of other genes [Diagram, competitive:] CASES vs CONTROLS → GENE TEST → GENE SCORE (e.g. differential expression) → GENE-SETS ENRICHED IN SCORE (e.g. gene-sets enriched in up-regulated genes) [Diagram, self-contained:] CASES vs CONTROLS → GENE-SET TEST → GENE-SET SCORE (e.g. significant mutation burden difference) → SUPPORTING GENES Nam D, Kim SY. Gene-set approach for expression pattern analysis. Brief Bioinform 2008. PMID: 18202032
  46. 46. Competitive Test Types • Threshold-dependent enrichment test (e.g. FET, g:Profiler *), applied to UP / DOWN gene lists: more suitable for significantly mutated genes • Threshold-independent enrichment test (e.g. GSEA), applied to the full UP-to-DOWN ranking: more suitable for differential gene expression * g:Profiler also contains a “hybrid” approach that selects the optimal cutoff for gene-set analysis
  47. 47. 3.1.5. Competitive tests: GSEA for gene expression data
  48. 48. Gene Expression Analysis Workflow Generate the expression data Collect the biological samples Identify the Differential Genes Identify the Functional Groups Define the experimental design
  49. 49. GSEA: Gene-Set Enrichment Analysis • Popular threshold-free gene-set test • Identifies gene-sets enriched in top- or bottom-ranking genes • Typically used as a competitive test (see permutation settings), which takes a ranked gene list as input • Statistical test: empirical test based on permutations; includes permutation-based FDR • The NES (normalized enrichment score) is a particularly valuable measure of enrichment effect size for visualization
  50. 50. GSEA: Gene-Set Enrichment Analysis • ES score calculation: high ES score <--> high local enrichment • [Histogram] Distribution of ES from N permutations (e.g. 2000); x-axis: ES score, y-axis: number of instances; the real ES score value is marked • Randomized with ES ≥ real: 4 / 2000 ==> Empirical p-value = 0.002
  51. 51. GSEA Permutation Settings • The permutation setting completely changes the nature of the GSEA test • Gene-set permutations (aka pre-ranked) • Takes a ranked gene list as input and permutes the genes in the gene-sets • → competitive • Recommended in presence of differential gene expression data for small or medium-scale experiments (2-4 biological replicates per condition) with modest expression heterogeneity • Phenotype permutation • Permutes the phenotype labels (e.g. treated, untreated), then repeats gene scoring; gene scoring is performed within GSEA • → competitive / self-contained hybrid • Recommended for larger scale gene expression data (> 10 biological replicates per condition) with high expression heterogeneity • As an alternative, consider a pure self-contained test, or a self-contained test with a different competitive correction
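The permutation logic can be sketched with a simplified enrichment score. This is not the GSEA implementation: the real ES weights hits by the gene-level score, whereas the running sum below is unweighted, and the gene-set permutation is approximated by drawing random gene-sets of the same size from the ranked list. All function names are illustrative.

```python
import random


def enrichment_score(ranked_genes, gene_set):
    """Unweighted running-sum ES: largest deviation from zero along the ranking."""
    hits = [gene in gene_set for gene in ranked_genes]
    n_hit = sum(hits)
    n_miss = len(ranked_genes) - n_hit
    if n_hit == 0 or n_miss == 0:
        raise ValueError("gene-set must cover some, but not all, ranked genes")
    running, best = 0.0, 0.0
    for is_hit in hits:
        running += 1.0 / n_hit if is_hit else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best


def empirical_pvalue(real_es, permuted_es):
    """Fraction of permutations with ES at least as large as the real ES."""
    return sum(es >= real_es for es in permuted_es) / len(permuted_es)


def gene_set_permutation_test(ranked_genes, gene_set, n_perm=2000, seed=0):
    """Return (real ES, empirical p-value) using random same-size gene-sets."""
    rng = random.Random(seed)
    real = enrichment_score(ranked_genes, gene_set)
    size = len(set(ranked_genes) & set(gene_set))
    permuted = [
        enrichment_score(ranked_genes, set(rng.sample(ranked_genes, size)))
        for _ in range(n_perm)
    ]
    return real, empirical_pvalue(real, permuted)
```

With 4 of 2000 permuted scores at or above the real one, `empirical_pvalue` returns 0.002, as on the slide. The one-sided comparison shown here tests enrichment at the top of the ranking; testing the bottom would compare against permutations with ES less than or equal to the real value.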
  52. 52. 3.1.6. Self-contained tests: gene-set somatic burden
  53. 53. OICR PanCuRx: Dataset Summary • 200 primary tumours and 41 metastases (pancreatic cancer) • Whole genome sequencing → detection of SNVs, indels, SVs, copy number gains and losses • Mutation load outlier removal criterion: median + 2 IQR → Samples retained: 190/200 primaries and 41/41 metastases [Figure: three panels comparing Met vs Pri: SNV count, Indel count, SV count; y-axes: Log10(SNV count), Log10(indel count), Log10(SV count)] Unpublished data
  54. 54. OICR PanCuRx: Gene-set Analysis Strategy 1. Perform gene-set burden test, primaries vs metastases • Logistic regression (metastases vs. primary), separating each variant type: M0 = y ~ ns_tot + ms_tot + ss_tot + sv_tot + cL_tot + cG_tot; M1 = y ~ ns_tot + ms_tot + ss_tot + sv_tot + cL_tot + cG_tot + ns_gs + ms_gs + ss_gs + sv_gs + cL_gs + cG_gs • Multiple test correction by BH-FDR (significant when BH-FDR < 27.5%) 2. For significant gene-sets, categorize driver variant type(s) and extract genes more often mutated in metastases for such variant types (“leading edge” genes) 3. Cluster pathways based on leading gene overlaps, visualize using Cytoscape enrichment map plugin 4. Overlay key genes (even more stringent filter: mutation rate met/pri > 4.5) 5. Formulate hypotheses → correlation with other tumour properties • RNA-seq based proliferation index (CCP) and missense mutations in cell cycle genes Unpublished results; Gallinger, PanCuRx TRI, Toronto
  55. 55. [Enrichment map; gene-set labels:] REACT:TELOMERE MAINTENANCE REACT:ION CHANNEL TRANSPORT KEGG:BASE EXCISION REPAIR REACT:RESOLUTION OF ABASIC SITES (AP SITES) KEGG:MINERAL ABSORPTION REACT:CHROMOSOME MAINTENANCE REACT:BASE EXCISION REPAIR REACT:TRANSMEMBRANE TRANSPORT OF SMALL MOLECULES REACT:NUCLEOSOME ASSEMBLY REACT:HDACS DEACETYLATE HISTONES REACT:DEPOSITION OF NEW CENPA-CONTAINING NUCLEOSOMES AT THE CENTROMERE REACT:DNA REPLICATION PRE-INITIATION REACT:FORMATION OF THE BETA-CATENIN:TCF TRANSACTIVATING COMPLEX REACT:G2/M CHECKPOINTS KEGG:ECM-RECEPTOR INTERACTION REACT:CELL CYCLE, MITOTIC REACT:M/G1 TRANSITION REACT:G1/S TRANSITION REACT:MITOTIC METAPHASE AND ANAPHASE REACT:TRANSCRIPTION-COUPLED NUCLEOTIDE EXCISION REPAIR (TC-NER) REACT:GAP-FILLING DNA REPAIR SYNTHESIS AND LIGATION IN TC-NER KEGG:SEROTONERGIC SYNAPSE KEGG:GNRH SIGNALING PATHWAY KEGG:CIRCADIAN ENTRAINMENT [Legend, driver variants:] Missense (gain and loss of function?); Nonsense + missense (loss of function?); Nonsense; Nonsense + copy number loss; Copy number gain; Missense + SV (loss and gain of function?); Other combination. For all clusters, only variants driving corresponding gene-sets and with counts met >= pri are reported; considering the number of met and pri, this corresponds to an enrichment ratio > 4.5 Unpublished results; Gallinger, PanCuRx TRI, Toronto
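The outlier criterion can be sketched as below. The slide only states "median + 2 IQR"; whether it was applied to raw or log10 counts, per variant type or overall, and with which quartile definition is not stated, so those choices (and the function names) are assumptions of this sketch.

```python
from statistics import median, quantiles


def outlier_threshold(values, k=2.0):
    """Upper cut-off: median + k * interquartile range."""
    q1, _, q3 = quantiles(values, n=4)   # quartiles, default 'exclusive' method
    return median(values) + k * (q3 - q1)


def remove_high_outliers(values, k=2.0):
    """Keep only values at or below the cut-off."""
    cutoff = outlier_threshold(values, k)
    return [v for v in values if v <= cutoff]
```

For example, `remove_high_outliers([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])` drops the value 100 and keeps the rest. Only the upper tail is trimmed because the concern here is hypermutated samples dominating the burden comparison.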
  56. 56. REACT:TELOMERE MAINTENANCE REACT:ION CHANNEL TRANSPORT KEGG:BASE EXCISION REPAIR REACT:RESOLUTION OF ABASIC SITES (AP SITES) KEGG:MINERAL ABSORPTION REACT:CHROMOSOME MAINTENANCE REACT:BASE EXCISION REPAIR REACT:TRANSMEMBRANE TRANSPORT OF SMALL MOLECULES REACT:NUCLEOSOME ASSEMBLY REACT:HDACS DEACETYLATE HISTONES REACT:DEPOSITION OF NEW CENPA-CONTAINING NUCLEOSOMES AT THE CENTROMERE REACT:DNA REPLICATION PRE-INITIATION REACT:FORMATION OF THE BETA-CATENIN:TCF TRANSACTIVATING COMPLEX REACT:G2/M CHECKPOINTS KEGG:ECM-RECEPTOR INTERACTION REACT:CELL CYCLE, MITOTIC REACT:M/G1 TRANSITION REACT:G1/S TRANSITION REACT:MITOTIC METAPHASE AND ANAPHASE REACT:TRANSCRIPTION-COUPLED NUCLEOTIDE EXCISION REPAIR (TC-NER) REACT:GAP-FILLING DNA REPAIR SYNTHESIS AND LIGATION IN TC-NER KEGG:SEROTONERGIC SYNAPSE KEGG:GNRH SIGNALING PATHWAY KEGG:CIRCADIAN ENTRAINMENT Missense (gain and loss of function?) Nonsense + missense (loss of function?) Nonsense Nonsense + copy number loss Other combination Driver variants Cell cycle (cell cycle progression and checkpoints), DNA replication (polymerase, replication initiation, replication fork complexes), chromosome maintenance and segregation (centromere components, centrosome components, spindle checkpoint) – missense, sometimes also sv [labelled] CDT1 (4,0): prevents initiation of replication when DNA replication is ongoing POLA1 (1,0) : DNA polymerases [POLD1, POLD3 and other DNA polymerases listed only in repair cluster] MCM8 (2,0), MCM3 (1,0), MCM10 (1,1), MCM7 (1,1): replication fork complex – [MCM10 in CCP] CENPA (1,0), CENPL (1,0), CENPJ (1,1), : centromere (chromosome segregation) – [CENPM, CENPF in CCP] NCAPD3 (1,0), NIPBL (1,1): chromosome condensation and/or segregation CEP57 (2,0), CEP152 (2,1), CNTRL (1,1): microtubule centrosome (chromosome segregation) – [CEP55 in CCP] ERCC6L (2,1): spindle checkpoint; CKAP5 (2,2): spindle formation; CASC5/KNL1 (sv 1,0): kinetochore E2F1 (1,0), E2F4 (1,0), TFDP1 (1,0; sv 1,1): TFs regulating 
cell cycle progression ANAPC11 (1,0), ANAPC2 (1,0): anaphase promoting complex (cell cycle progression); FBXO5 (1,0; sv 1,0): anaphase promoting complex inhibitor ATM (sv 1,1), TP53BP1 (1,1): TP53 pathway and DNA damage response; HMG20B (1,0): DNA damage response [histone and histone (de)acetylation listed for the separate subcluster] Other: AHCTF1 (2,2; sv 2,1), B9D2 (1,0), BARD1 (1,0), GORASP1 (1,0), LEMD2 (1,0), NEDD1 (1,0), NUP205 (1,0), NUP88 (1,1), NUP133 (1,1), PPP1R12A (1,0), PSMA3 (1,1), PSMD1 (1,1), SDCCAG8 (1,1), SGOL2 (1,1), TUBGCP5 (1,0), UBB (1,0), YWHAH (1,0), XPO1 (1,0), WRAP53 (sv 1,0), ZW10 (1,0) DNA base excision repair – missense, sv PARP1 (sv 1,0), PARP2 (ms 1,0), PARP4 (ms 1,0), POLD3 (sv 1,0), MPG (ms 1,0), RPA1 (sv 2,0), RPA2 (1,0), TDG (ms 1,0) Transcription-coupled nucleotide excision repair – only missense COPS2 (ms 1,0), EP300 (ms 2,0), ERCC3 (ms 2,0), POLK (ms 1,0), UBB (ms 1,0) Both – missense, sv LIG3 (ms 1,0), POLD1 (ms 1,1; sv 1,0), XRCC1 (ms 1,0; sv 1,0) Beta catenin pathway – only missense CTNNB1 (2,2): beta catenin TCF7L2 (2,0): TF that partners with CTNNB1 and activates target genes Extracellular matrix–receptor interactions – only missense LAMB4 (1,1), LAMC1 (1,1), LAMC2 (1,0) COL4A2 (1,1), COL6A3 (2,2), COL9A2 (1,1), COL6A5 (4,2), HSPG2 (2,0) COMP (1,0), TNR (1,1) ITGA1 (2,0), ITGB4 (2,0), ITGB3 (2,0), ITGA2B (1,0), ITGA11 (1,1), ITGAV (1,1) CD47 (1,0), CD36 (1,0) Histones and histone (de)acetylation – only missense HIST1H2BB (2,1), HIST1H2BD (1,0), HIST1H2BL (1,0), HIST1H2BO (sv 1,0),: transcriptional activation, response to DNA damage and other processes H2AFB1 (sv 2,1) CHD4 (1,0): nucleosome remodeling and histone deacetylase complex EP300 (2,0): histone acetyltransferase recognizing enhancers, involved in cell cycle, DNA damage response, … KAT5 (1,0): histone acetyltransferase ARID4B (1,1): histone deacetylase WHSC1 (1,1; sv 1,0): histone methyltransferase NCOR1 (1,1), TBL1XR1 (1,1): nuclear receptor corepressor (N-CoR) 
and histone deacetylase 3 (HDAC 3) complexes Misc. signalling – only cnGain ITPR2 (2,0) ALOX12 (1,0) GNAS (1,0) MAP3K3 (1,0) PRKCG (1,0) Misc. signalling – only nonsense ADCY2 (1,0) ADCY10 (1,1) GUCY1A3 (1,0) RYR3 (2,1) Copy number gain Missense + SV (loss and gain of function?) For all clusters, only variants driving corresponding gene-sets and with counts met >= pri are reported; considering the number of met and pri, this corresponds to an enrichment ratio > 4.5 Unpublished results; Gallinger, PanCuRx TRI, Toronto
  57. 57. Cell cycle missense x CDKN2A LOF (ns, sv, cL)
      All samples
      #                       Estimate  Std. Error  t value  Pr(>|t|)
      # (Intercept)            0.42557     0.08620    4.937  2.39e-05 ***
      # gsCC_ms_bin_stdz       0.14522     0.08739    1.662  0.1063
      # gCDKN2ALOF_bin_stdz    0.16066     0.09449    1.700  0.0988 .
      # vc_ms_tot_stdz         0.13934     0.08962    1.555  0.1298
      Samples with <= 60 missense
      #                       Estimate  Std. Error  t value  Pr(>|t|)
      # (Intercept)            0.31231     0.10020    3.117  0.00455 **
      # gsCC_ms_bin_stdz       0.14051     0.09719    1.446  0.16068
      # gCDKN2ALOF_bin_stdz    0.25673     0.11289    2.274  0.03181 *
      # vc_ms_tot_stdz        -0.09684     0.11187   -0.866  0.39489
      [Figure, shown twice: “Met/pri x CDKN2A y/n x Cell Cycle ms y/n: ccp”; y-axis: ccpRNAindex (-3 to 2); groups: Met_CDKN2Ay_CCMSy, Met_CDKN2Ay_CCMSn, Met_CDKN2An_CCMSy, Met_CDKN2An_CCMSn, Pri_CDKN2Ay_CCMSy, Pri_CDKN2Ay_CCMSn, Pri_CDKN2An_CCMSy, Pri_CDKN2An_CCMSn]
      Unpublished results; Gallinger, PanCuRx TRI, Toronto
  58. 58. General Applications of Self-Contained Tests • Compare different tumour subtypes • Compare tumours by survival or other properties (e.g. clinical grade, response to therapy) • Important to address systematic differences between tumours from different groups (mutation load, mutation signatures, etc.) • Relatively minor differences can be corrected for, whereas large differences will likely prevent the analysis from working properly • Correcting for total number of variants is typically recommended, and it can be considered a “competitive” correction of the self-contained test (i.e. the gene-set is more predictive of the difference between sample groups than all genes)
  59. 59. 3.1.7. General tips
  60. 60. General Tips for Gene-set Analysis / 1 • Carefully design your experiment • Flaws in experimental design, like presence of hidden confounders or insufficient number of replicates, will result in confounded or negative gene-set results • For gene expression experiments, perform exploratory analysis (PCA, MDS, hierarchical clustering) to check relations among samples and validate the experimental design • Choose gene-set types and filter gene-sets by size • Start from most informative gene-sets: Gene Ontology, KEGG and Reactome pathways, MSigDB cancer hallmarks • Remove small gene-sets to improve power after multiple test correction (e.g. < 15 genes for competitive tests applied to differential gene expression) • For Gene Ontology, remove large gene-sets (e.g. > 500 genes) as they tend to be uninformative
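The size filter described above can be sketched in a few lines. The thresholds follow the examples on the slide (fewer than 15 genes is too small, more than 500 is too large); the function name and the optional `universe` argument (the genes actually measured in the experiment) are assumptions added for illustration.

```python
def filter_gene_sets(gene_sets, min_size=15, max_size=500, universe=None):
    """Keep gene-sets whose size lies within [min_size, max_size].

    gene_sets: dict mapping gene-set name -> iterable of gene symbols
    universe:  optional set of measured genes; if given, each gene-set is
               first restricted to it, so size reflects testable genes only
    """
    kept = {}
    for name, genes in gene_sets.items():
        members = set(genes)
        if universe is not None:
            members &= set(universe)
        if min_size <= len(members) <= max_size:
            kept[name] = members
    return kept
```

Restricting to the measured universe before filtering matters in practice: a gene-set with 40 annotated genes may have only a handful represented in a targeted panel.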
  61. 61. General Tips for Gene-set Analysis / 2 • Choose a competitive or self-contained test Competitive: • requires a meaningful gene selection or ranking → typically suitable for differential gene expression or genes with significant mutation burden • if analyzing other -omics, carefully model the background distribution; do not simply assume Fisher’s Exact Test or GSEA will be suitable (e.g. use GREAT for ChIP-seq, etc.) Self-contained: • typically suitable for sparser mutations, when differences are significant at the gene-set level only • ensure that different sample groups are comparable, correct for confounders • Proper visualization is important to interpret results and to identify issues • Use a visualization solution like Enrichment Map • Visualize the full gene-set results, do not cherry-pick based on prior expectation • Unexpected results can suggest issues (e.g. contamination, statistical bias)
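To make the competitive case concrete, here is a sketch of the one-sided Fisher's exact test for over-representation, written as a hypergeometric upper-tail probability. The function and argument names are hypothetical; the key assumption is that every gene in the background universe was equally likely to be selected, which is exactly what may fail for other -omics data.

```python
from math import comb


def hypergeom_enrichment_p(universe_size, set_size, selected_size, overlap):
    """P(X >= overlap) when selected_size genes are drawn without
    replacement from a universe of universe_size genes, of which set_size
    belong to the gene-set. Equivalent to a one-sided Fisher's exact test
    on the 2x2 table (in gene-set or not) x (selected or not).
    """
    max_overlap = min(set_size, selected_size)
    total = comb(universe_size, selected_size)
    tail = sum(
        comb(set_size, k) * comb(universe_size - set_size, selected_size - k)
        for k in range(overlap, max_overlap + 1)
    )
    return tail / total
```

For example, with a universe of 10 genes, a gene-set of 3 and a selection of 3 genes that all fall in the gene-set, the p-value is 1/120. The result is only as meaningful as the universe: it should contain the genes that could have been selected, not the whole genome by default.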
  62. 62. General Tips for Gene-set Analysis / 3 • Do not forget to carefully evaluate genes with limited or no gene-set annotations and network interactions…!
  63. 63. [Figure: gene-set analysis workflow. Extract a gene list (e.g. PIK3CA, TP53) or ranked list from an 'omics experiment → perform pathway enrichment analysis (output: pathway, p-value, q-value table) → visualize and interpret in Cytoscape with EnrichmentMap, clusterMaker, WordCloud and AutoAnnotate] • Published on bioRxiv Jan 2017, provisionally accepted by Nature Protocols • General concepts and resources • Step-by-step instructions for gene-set analysis of gene expression data
  64. 64. 3.2. Network Analysis
  65. 65. 3.2.1. Network visualization and gene network types
  66. 66. Network Representation and Visualization Merico D, Gfeller D, Bader GD. How to visually interpret biological data using networks. Nature Biotechnology 2009. PMID: 19816451
  67. 67. Network Visualization: Automatic Layout Before layout After layout • Yeast proteins annotated to GO cellular component “chromosome” • Colored based on sub-component (nucleosome, kinetochore, replication fork) • The layout (force directed) meaningfully arranges nodes (genes/proteins) and edges (interactions) Merico D, Gfeller D, Bader GD. How to visually interpret biological data using networks. Nature Biotechnology 2009. PMID: 19816451
  68. 68. Network Visualization: Cytoscape • Rich GUI to map visual markup to data • Imports tabular data (computational biologist friendly) • Default functions for visualization, search, layout • Lots of “apps” implementing specific algorithms and functionalities (e.g. Enrichment Map)
  69. 69. Gene Network Types • Protein-protein (physical) interactions • Biochemical reaction adjacency (mainly shared output /input in metabolic pathways) • Regulator-target interactions (e.g. TF/miRNA-target) • Co-expression • Genetic interactions (e.g. synthetic lethality in double KO) • Semantic similarity (e.g. similarity of Gene Ontology annotations) • Publication co-citation • Aggregate functional similarity (based on multi-omics)
  70. 70. Networks vs Pathways Pathways • Hand-curated → more accurate • Represent biochemical reactions, or molecular events, or regulatory relations among proteins, protein complexes, metabolites and other bio-entities Networks • Derived from experimental high-throughput methods or text mining → noisier • Represent simple relations among genes (e.g. binds, is similar to, is co-expressed with, regulates) • Cover a larger number of genes
  71. 71. Gene Network Resources iRefWeb/iRefIndex wodaklab.org/iRefWeb • Resource integrating different databases • Mainly protein interactions • Useful to explore specific interactions, or bulk download GeneMANIA www.genemania.org • Multiple networks available (including iRefIndex protein interactions) • Useful to construct, visualize, and evaluate networks from “seed” genes (network propagation algorithm) STRING string-db.org • Integrated network, based on algorithm for function prediction • Protein interactions, pathway interactions, co-expression, etc.
  72. 72. Network Analysis Overview Most common analysis types: • Subnetwork construction from seed genes → GeneMANIA • Network clustering / module finding → ClusterMaker2 (MCODE, MCL, …) • Enriched sub-network identification → Reactome FI, HyperModules, HotNet Other types of analysis: • Network inference from expression data → ARACNE • Pathway/network activity inference → SPIA, PARADIGM • Overall analysis of network topology • Motif identification, motif content analysis
  73. 73. Gene-set vs Network Analysis • Gene-set pros • Better coverage of genes and known biological processes / components • Simple algorithms, a few well-established analysis options • Gene-set cons • Simple and flat structure, does not represent mechanistic details • Pre-constructed based on “general biology” • Network pros • More structured, more insight into mechanistic details • Can reveal new gene-gene associations • Network cons • More limited coverage of genes and known biological processes / components • More complex algorithms, more analysis options
  74. 74. 3.2.2. GeneMANIA
  75. 75. GeneMANIA Component 1: Weighted network combination • Gene Ontology prediction • Input gene connectivity Component 2: Label propagation algorithm INPUT = Query gene list (e.g. DLG1, SHANK) OUTPUT = Query genes + interaction neighbour network
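The idea behind label propagation from query genes can be sketched as follows. This is a simplified iterative scheme written for illustration, not GeneMANIA's actual implementation: each gene's score is repeatedly set to a mix of its own seed label and the weighted average score of its neighbours. The mixing parameter `alpha`, the iteration count, the helper names and the toy gene names `GENE_A` to `GENE_C` are all assumptions.

```python
def build_network(edges):
    """Build a symmetric weighted adjacency dict from (gene, gene, weight) triples."""
    network = {}
    for a, b, w in edges:
        network.setdefault(a, {})[b] = w
        network.setdefault(b, {})[a] = w
    return network


def propagate_labels(network, query_genes, alpha=0.5, n_iter=50):
    """Spread query-gene labels over a weighted network.

    At each iteration:
        score(g) = alpha * weighted mean of neighbour scores
                   + (1 - alpha) * seed(g)
    where seed(g) is 1 for query genes and 0 otherwise.
    Genes closely connected to the query end up with higher scores.
    """
    seed = {g: (1.0 if g in query_genes else 0.0) for g in network}
    scores = dict(seed)
    for _ in range(n_iter):
        updated = {}
        for gene, neighbours in network.items():
            total_weight = sum(neighbours.values())
            if total_weight > 0:
                spread = sum(w * scores[n] for n, w in neighbours.items()) / total_weight
            else:
                spread = 0.0
            updated[gene] = alpha * spread + (1 - alpha) * seed[gene]
        scores = updated
    return scores


# Toy network: the edges are invented for illustration only.
toy = build_network([
    ("DLG1", "GENE_A", 1.0),
    ("SHANK", "GENE_A", 1.0),
    ("GENE_A", "GENE_B", 1.0),
    ("GENE_B", "GENE_C", 1.0),
])
toy_scores = propagate_labels(toy, {"DLG1", "SHANK"})
ranked_neighbours = sorted(
    (g for g in toy_scores if g not in {"DLG1", "SHANK"}),
    key=toy_scores.get,
    reverse=True,
)
```

In this toy case the non-query genes are ranked by how directly they connect to the query, which is the behaviour the "query genes + interaction neighbour network" output relies on.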
  76. 76. 3.2.3. Reactome FI
  77. 77. Reactome FIViz Components: • Functional Interaction (FI) Network • Uses experimental protein interactions in human, protein interactions in model organisms, and gene expression to predict “functional interactions” • Positive set: pathway-based interactions from Reactome • Subnetwork construction algorithm • Classical: only direct connections, or additionally linkers • HotNet: heat kernel • Clustering Algorithm • Edge-betweenness used to find “local interaction communities” in the sub-network
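The "classical" subnetwork construction can be illustrated with a short sketch: keep the direct interactions among the experimental (seed) genes, and optionally add linker genes. Here a linker is assumed to be a non-seed gene that interacts with at least two seed genes; the actual linker selection in Reactome FIViz may differ, and the function name and toy edges below are hypothetical.

```python
def build_subnetwork(fi_edges, seed_genes, add_linkers=False):
    """Extract a subnetwork around seed genes from an interaction list.

    fi_edges:    iterable of (gene_a, gene_b) undirected interactions
    seed_genes:  genes from the experiment (e.g. somatically mutated genes)
    add_linkers: if True, also keep non-seed genes adjacent to >= 2 seeds
                 (this linker definition is an assumption of the sketch)
    Returns (nodes, edges), with edges as a set of frozenset pairs.
    """
    fi_edges = list(fi_edges)
    seeds = set(seed_genes)

    # Adjacency map over the full interaction network.
    neighbours = {}
    for a, b in fi_edges:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)

    nodes = set(seeds)
    if add_linkers:
        for gene, adjacent in neighbours.items():
            if gene not in seeds and len(adjacent & seeds) >= 2:
                nodes.add(gene)

    # Keep only interactions with both endpoints in the selected node set.
    edges = {
        frozenset((a, b))
        for a, b in fi_edges
        if a in nodes and b in nodes and a != b
    }
    return nodes, edges
```

The resulting subnetwork is what a clustering step, such as the edge-betweenness approach named above, would then split into modules.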
  78. 78. Cell cycle checkpoints, DNA damage response Adhesion molecules NOTCH pathway Glioblastoma Subnetwork a. DNA copy number detection for 206 glioblastomas b. detection of somatic mutations in 601 selected genes for 91 matched tumor-normal pairs Growth factor signaling Wu G, Feng X, Stein L. A human functional protein interaction network and its application to cancer data analysis. Genome Biol 2010. PMID: 20482850
  79. 79. GeneMANIA or Reactome FIViz? • GeneMANIA: start from experimental genes, construct a larger network of related genes (without further using the same experimental data); typically works well when the initial genes form one cluster; when the genes are too diverse, it tends to connect them through less specific hubs • Reactome FIViz: start from experimental genes, inter-connect them using functional interactions (potentially including some linker genes), then cluster them into modules
  80. 80. For More Reading… Nature Methods 2015
  81. 81. Thanks for your attention! Baked by Ruth Isserlin
