Building a Global Map of (Human) Gene Expression. Misha Kapushesky, European Bioinformatics Institute, EMBL. St. Petersburg, Russia, May, 2010
From one genome to many biological states: While there is only one genome sequence, different genes are expressed in many different cell types and tissues, and in different developmental or disease states. The size and structure of this “expression space” is still largely unknown. Most individual experiments look at small regions. We would like to build a map of the global human gene expression space.
Mapping the human transcriptome Traditional research A microarray experiment Everest Lhasa Kathmandu The map we want to build
How to build such a global map: This space is huge. There are thousands of potentially different states: cell types, tissue types, developmental stages, disease states, systems under various treatments (drugs, radiation, stress, …). It is not feasible to study them all in a single laboratory experiment (costs, rare samples, …). However, thousands of gene expression experiments are performed every year (microarrays, next-generation sequencing). Can we use the published data to build the global expression map?
 
 
 
 
ArrayExpress (www.ebi.ac.uk/arrayexpress): Data from over 280,000 assays and over 10,000 independent studies (microarrays, sequencing, …). Gene expression and other functional genomics assays. Over 200 species. Data collection and exchange from GEO.
Can we integrate these data to answer questions that go beyond what was done in the individual studies? On a quantitative level, only data from the same microarray platform can be integrated.
A global map of human gene expression: Angela Gonzales (EBI), Misha Kapushesky (EBI), Janne Nikkila (Helsinki University of Technology), Helen Parkinson (EBI), Wolfgang Huber (EMBL), Esko Ukkonen (University of Helsinki). Margus Lukk et al, Nature Biotechnology, 28, p322-324 (April, 2010)
We collected over 9000 raw data files from Affymetrix U133A, the most popular gene expression microarray platform, from GEO and ArrayExpress. After applying strict quality controls and removing duplicates, data on 5372 samples remained, from 206 different studies generated in 163 different laboratories, grouped in 369 different biological ‘conditions’ (tissue types, diseases, various cell lines, etc.). The 369 conditions were grouped in different larger ‘metagroups’.
Different metagroupings (4 and 15):
After RMA normalisation we obtain: 5372 samples (369 different conditions) by ~18,000 genes
Principal Component Analysis: each dot is one of the 5372 samples. Axes: 1st and 2nd principal components.
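The PCA step can be illustrated with a minimal sketch. It assumes the normalised data are held as a samples-by-genes NumPy array; the toy matrix, the group labels and the function name `pca_scores` are made up for illustration and are not taken from the talk or the paper.

```python
import numpy as np

def pca_scores(expr, n_components=3):
    """Project samples onto the top principal components.

    expr: array of shape (n_samples, n_genes), e.g. log-scale normalised signal.
    Returns (scores, explained_variance_ratio) for the first n_components.
    """
    centred = expr - expr.mean(axis=0)               # centre each gene
    u, s, _ = np.linalg.svd(centred, full_matrices=False)
    scores = u[:, :n_components] * s[:n_components]  # sample coordinates
    var_ratio = (s ** 2) / np.sum(s ** 2)            # variance share per component
    return scores, var_ratio[:n_components]

# Toy data (hypothetical): 60 samples x 200 genes, two synthetic sample groups.
rng = np.random.default_rng(0)
group = np.repeat([0, 1], 30)
expr = rng.normal(size=(60, 200))
expr[group == 1, :50] += 3.0                         # shift 50 genes in group 1

scores, ratio = pca_scores(expr)
print(scores[:3, :2])   # first three samples on the 1st and 2nd components
print(ratio)
```

Plotting `scores[:, 0]` against `scores[:, 1]` and colouring by group would give a toy version of the scatter plots on the following slides.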
Human gene expression map 17/08/10. Axes: 1st and 2nd principal components.
Axes: hematopoietic axis and 2nd principal component.
Axes: hematopoietic axis and malignancy axis.
Hematopoietic and malignancy axes Lukk et al, Nature Biotechnology, 28: 322
Axes: 1st, 2nd and 3rd principal components.
Coloured by tissues of origin; 3rd PC
Tissues of origin Neurological axis
First 3 (5) principal components: Hematopoietic axis: blood, ‘solid tissues’, ‘incompletely differentiated cells and connective tissues’. Malignancy axis: cell lines, cancer, normals and other diseases. Neurological axis: nervous system / the rest. RNA degradation. Samples seem to ‘cluster’ by the tissues of origin.
 
Hierarchical clustering of 97 groups with at least 10 replicates each
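A small sketch of clustering sample groups by their mean expression profiles. The slide does not state the distance or linkage used, so the choices here (1 minus Pearson correlation, average linkage) are assumptions, and the group names and toy data are hypothetical.

```python
import numpy as np

def group_means(expr, labels):
    """Average the samples of each group into one profile per group."""
    labels = np.asarray(labels)
    names = sorted(set(labels.tolist()))
    means = np.vstack([expr[labels == n].mean(axis=0) for n in names])
    return names, means

def average_linkage(profiles, names):
    """Agglomerative clustering on 1 - Pearson correlation (assumed metric).

    Returns the merges in order as (members_a, members_b, distance).
    """
    dist = 1.0 - np.corrcoef(profiles)
    clusters = {i: [i] for i in range(len(names))}
    merges = []
    while len(clusters) > 1:
        keys = list(clusters)
        best = None
        for pos, a in enumerate(keys):
            for b in keys[pos + 1:]:
                d = np.mean([dist[i, j] for i in clusters[a] for j in clusters[b]])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((sorted(names[i] for i in clusters[a]),
                       sorted(names[i] for i in clusters[b]), float(d)))
        clusters[a] = clusters[a] + clusters.pop(b)
    return merges

# Toy data (hypothetical): four groups of 10 replicates, two underlying tissues.
rng = np.random.default_rng(1)
base_blood, base_brain = rng.normal(size=100), rng.normal(size=100)
rows, labels = [], []
for name, base in [("blood A", base_blood), ("blood B", base_blood),
                   ("brain A", base_brain), ("brain B", base_brain)]:
    for _ in range(10):
        rows.append(base + rng.normal(scale=0.3, size=100))
        labels.append(name)

names, means = group_means(np.vstack(rows), labels)
merges = average_linkage(means, names)
for left, right, d in merges:
    print(left, "+", right, round(d, 3))
```

In the toy data the two blood groups and the two brain groups merge first, and the two tissues are joined last.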
Comparison of the 97 larger sample groups to the rest Incompletely differentiated cell type and connective tissue group
Conclusions so far: We have identified 6 major transcription profile classes in these data: cell lines; incompletely differentiated cells and connective tissues; neoplasms; blood; brain; muscle. Cell lines cluster together!
 
 
Gene expression across the 5372 samples: The expression of most genes is relatively constant. There are only 1034 probesets (mapping to fewer than 900 genes) whose normalised signal has a standard deviation > 2.
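Selecting the most variable probesets by a standard deviation cut-off could look like the sketch below. The cut-off of 2 comes from the slide; the probeset identifiers, the toy matrix and the function name are illustrative assumptions.

```python
import numpy as np

def variable_probesets(expr, probe_ids, sd_cutoff=2.0):
    """Return probesets whose signal SD across samples exceeds sd_cutoff.

    expr: array of shape (n_samples, n_probesets).
    The result is ordered from most to least variable.
    """
    sd = expr.std(axis=0, ddof=1)
    idx = np.flatnonzero(sd > sd_cutoff)
    idx = idx[np.argsort(-sd[idx])]
    return [probe_ids[i] for i in idx], sd

# Toy data (hypothetical): 100 samples, 50 probesets, three of them highly variable.
rng = np.random.default_rng(2)
expr = rng.normal(scale=0.5, size=(100, 50))
expr[:, [3, 17, 42]] = rng.normal(scale=3.0, size=(100, 3))
probe_ids = [f"probe_{i}" for i in range(50)]

selected, sd = variable_probesets(expr, probe_ids)
print(selected)
```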
Clustering of 97 sample groups and 1000 most variable probesets (about 900 genes). Cluster labels (clusters numbered 1 to 19 in the figure): Immune response; Nervous system development; Lipid raft; Mitosis; Neurotransmitter uptake; Cytoskeletal protein binding; Extracellular matrix; Extracellular region; Extracellular matrix; Extracellular region; Mitosis; Defence response; Nervous system development; Actin cytoskeleton organisation and biogenesis; Protein carrier activity; No significant result; Antigen presentation, exogenous antigen; Trans-1,2-dihydrobenzene-1,2-diol dehydrogenase activity; S100 alpha binding
Clustering based on subsets of these genes produces similar results. Clustering based on the 350 most variable probesets gives almost the same result. Even clustering based on the 30 most variable probesets is very close.
 
 
24 most variable genes
www.ebi.ac.uk/gxa/human/U133a
Can we go beyond the 6 major classes?
Hierarchical clustering of all 369 sample groups. Some finer groups: Cancer: sarcomas, carcinomas, neuroblastomas. Normal: liver and gut.
Leukemia Normal blood  and  blood non-neoplastic disease Other blood neoplasm Blood cell lines
Identifying condition-specific genes by supervised analysis: Using linear models to find condition-specific genes, multiple testing correction, differential expression cut-offs. Example: 174 leukemia-specific genes include most well-known markers (e.g., BCR, ETV6, FLT3, HOXA9, MYST3, PRDM2, RUNX1, and TAL1). Many are confirmed as associated with leukemia.
Beyond the major 6 classes the ‘signal’ becomes weak. The problem may be lab effects. The large biological effects are stronger than the lab effects. However, when we zoom into particular subclasses, the lab effects may take precedence.
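The slide names multiple testing correction and differential expression cut-offs without giving details. As a hedged illustration, the sketch below applies the Benjamini-Hochberg adjustment (an assumed choice, not confirmed by the source) to per-gene p-values and then applies an assumed log fold change cut-off. All numbers and thresholds are made up.

```python
import numpy as np

def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values (assumed correction method)."""
    p = np.asarray(pvalues, dtype=float)
    n = p.size
    order = np.argsort(p)
    ranked = p[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity, working down from the largest p-value.
    ranked = np.minimum.accumulate(ranked[::-1])[::-1]
    adjusted = np.empty(n)
    adjusted[order] = np.minimum(ranked, 1.0)
    return adjusted

def condition_specific(pvalues, log_fold_changes, fdr=0.05, min_abs_lfc=1.0):
    """Flag genes passing both the FDR and the fold change cut-off (both hypothetical)."""
    adjusted = benjamini_hochberg(pvalues)
    lfc = np.asarray(log_fold_changes, dtype=float)
    return (adjusted <= fdr) & (np.abs(lfc) >= min_abs_lfc)

# Made-up per-gene results for one condition versus the rest.
pvals = [0.001, 0.008, 0.039, 0.041, 0.6]
lfcs = [2.0, 0.3, 1.5, -2.0, 0.1]
adj = benjamini_hochberg(pvals)
selected = condition_specific(pvals, lfcs)
print(adj, selected)
```

Only the first gene passes both filters here: the second has a small adjusted p-value but a small fold change, and the third and fourth miss the FDR threshold after adjustment.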
Mapping the human transcriptome Traditional research A microarray experiment Everest Lhasa Kathmandu The map we want to build Our current view on global transcriptome
97 groups – colours recycled. Labelled groups: Frontal cortex; Muscular dystrophy; Skeletal muscle; Brain; Heart and heart parts; Cerebellum; Caudate nucleus; Hippocampal tissue; Nervous system tumors; Mononuclear cells; AML
Second approach  Integrating data on statistics level
Gene Expression Atlas: Ele Holloway, Ibrahim Emam, Pavel Kurnosov, Helen Parkinson, Andrey Zorin, Tony Burdett, Gabriella Rustici, Eleanor Williams, Andrew Tikhonov
Global Differential Expression Analysis: Selected ~10% of the data from ArrayExpress (including GEO imports), manually curated for quality and mapped to a custom-built ontology of experimental factors, EFO (http://www.ebi.ac.uk/efo). Data on differential expression of genes in 1000+ studies, comprising ~30000 assays, in over 5000 conditions. For each experiment, differentially expressed genes have been identified computationally via moderated t-tests and statistical meta-analysis. Meta-Analysis Approaches: Vote counting: number of independent studies supporting an observation for a particular gene. Effect size integration: compute effect size statistics in each study, assess relevant statistical model and compute combined z-score, for each gene/condition/study combination (extension of Choi et al, 2003).
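Vote counting, as described above, amounts to tallying how many independent studies report a gene as over- or under-expressed in a condition. A minimal sketch follows; the record layout and study identifiers are hypothetical, and the counts are chosen to mirror the THY1/cerebellum example in the editor's notes (5 up, 2 down).

```python
from collections import Counter

def vote_count(calls):
    """Tally up/down calls per (gene, condition), counting each study once.

    calls: iterable of (study_id, gene, condition, direction),
           where direction is "up" or "down".
    """
    seen = set()
    votes = {}
    for study, gene, condition, direction in calls:
        key = (study, gene, condition, direction)
        if key in seen:          # ignore repeated records from the same study
            continue
        seen.add(key)
        votes.setdefault((gene, condition), Counter())[direction] += 1
    return votes

# Hypothetical differential expression calls.
calls = [(f"study_{i}", "THY1", "cerebellum", "up") for i in range(5)]
calls += [(f"study_{i}", "THY1", "cerebellum", "down") for i in (5, 6)]
calls.append(("study_0", "THY1", "cerebellum", "up"))   # duplicate record

votes = vote_count(calls)
tally = votes[("THY1", "cerebellum")]
print(f"{tally['up']}/{tally['down']}")
```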
Analysing each contributing dataset separately: one-way ANOVA AML CML normal genes
Combining the datasets … Experiments 1, 2, 3, …, m
Effect size-based meta-analysis: We have for each gene in each experiment/condition: p-value for significance; simultaneous t-statistics and confidence intervals; d.e. label (“up” or “down”). However, we would like to have: a measure of the strength of d.e. (effect size); the ability to combine d.e. findings statistically. Effect size: standardized mean difference or similar (e.g., correlation coef.)
Meta-analysis Procedure: For each gene-experiment-condition combination: compute effect size from simultaneous d.e. t-statistics. Combine effect sizes across multiple studies, using fixed-effects or random-effects models. Obtain for each gene-condition combination: mean effect size estimate; combined z-score; overall p-value.
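The procedure above can be sketched for one gene-condition pair as follows. This is an illustration under stated assumptions, not the Atlas implementation: it assumes each study contributes a two-group t-statistic with known group sizes, uses the standardized mean difference as the effect size, and shows only the fixed-effects (inverse-variance) model. The input numbers are made up.

```python
import math

def effect_size_from_t(t, n1, n2):
    """Standardized mean difference d and its approximate variance from a two-group t."""
    d = t * math.sqrt(1.0 / n1 + 1.0 / n2)
    var = (n1 + n2) / (n1 * n2) + d * d / (2.0 * (n1 + n2))
    return d, var

def fixed_effects(effects):
    """Inverse-variance weighted combination of (d, var) pairs.

    Returns (mean effect size, combined z-score, two-sided p-value).
    """
    weights = [1.0 / var for _, var in effects]
    total = sum(weights)
    mean = sum(w * d for w, (d, _) in zip(weights, effects)) / total
    se = math.sqrt(1.0 / total)
    z = mean / se
    p = math.erfc(abs(z) / math.sqrt(2.0))   # two-sided normal tail probability
    return mean, z, p

# Made-up (t, n1, n2) for one gene and condition in three studies.
studies = [(3.0, 10, 10), (2.5, 8, 12), (1.0, 15, 15)]
effects = [effect_size_from_t(t, n1, n2) for t, n1, n2 in studies]
mean, z, p = fixed_effects(effects)
print(round(mean, 3), round(z, 2), p)
```

A random-effects model would add a between-study variance term to each study's variance before weighting; that step is omitted here.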
Long tail of annotations…
Annotating data with ontologies: Diverse nature of annotations on data. Need to support complex queries which contain semantic information, e.g. which genes are under-expressed in brain samples in human or mouse? If we annotate with … do we get this data? (figure labels: cancer, adenocarcinoma) James Malone
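One reading of this slide is that a query for a general term should also return data annotated with a more specific term. The sketch below shows that idea with a tiny hand-written hierarchy; the hierarchy, sample names and function names are hypothetical and do not reproduce the real EFO or the Atlas query code.

```python
# Hypothetical toy hierarchy (child -> parents); NOT the real EFO.
PARENTS = {
    "adenocarcinoma": ["carcinoma"],
    "carcinoma": ["cancer"],
    "glioblastoma": ["cancer"],
    "cancer": ["disease"],
}

def ancestors(term, parents=PARENTS):
    """All terms reachable by following parent links from term."""
    found, stack = set(), [term]
    while stack:
        for parent in parents.get(stack.pop(), []):
            if parent not in found:
                found.add(parent)
                stack.append(parent)
    return found

def query(term, annotations, parents=PARENTS):
    """Samples annotated with term or with any of its descendants."""
    return sorted(sample for sample, annotated in annotations.items()
                  if annotated == term or term in ancestors(annotated, parents))

# Hypothetical sample annotations.
annotations = {
    "sample1": "adenocarcinoma",
    "sample2": "glioblastoma",
    "sample3": "normal",
    "sample4": "cancer",
}

print(query("cancer", annotations))
print(query("carcinoma", annotations))
```

With a plain string match, the query for "cancer" would return only `sample4`; following the hierarchy also returns the adenocarcinoma and glioblastoma samples.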
Decoupling knowledge from data Atlas/AE James Malone
Semantically-enriched Queries with EFO
We can use the ontology structure We can perform effect size  meta-analysis on a hierarchy, if we follow several rules:
Increased statistical power
Condition-specificity through EFO
Condition-specific Gene Expression
Query for genes; query for conditions; species. The ‘advanced query’ option allows building more complex queries. http://www.ebi.ac.uk/gxa
Query results for gene ASPM (ArrayExpress): ASPM is downregulated in the ‘normal’ condition in comparison to a disease in 9 studies out of 10. Upregulated in ‘Glioblastoma’ in 3 independent studies. Zoom into one of the ‘Glioblastoma’ studies: each bar represents an expression level in a particular sample.
‘Wnt pathway’ genes in various cancers (ArrayExpress)
Integrating both approaches: The first approach gives the global view, but obscures the detail. The second approach gives detail, but does not easily allow integrating everything in one map. Can we combine both approaches?
Other data: RNAseq data. Proteomics data: Human Protein Atlas from KTH in Stockholm (collaboration with Mathias Uhlen). Time series: what states does a cell go through on the way from an ESC to a mature cell?
 
 
Two ways of integrating the data: On a quantitative level: normalise all data together. Advantages: results easier to interpret. Disadvantages: lab effects. On a statistics level: analyse each dataset separately first. Advantages: fewer lab effects. Disadvantages: combined data difficult to interpret (in each experiment each condition is compared to something else). How to combine the two approaches?
Acknowledgements: Margus Lukk, Misha Kapushesky, Angela Gonzales, Helen Parkinson, Gabriella Rustici, Ugis Sarkans, Ele Holloway, Roby Mani, Mohammadreza Shojatalab, Nikolay Kolesnikov, Niran Abeygunawardena, Anjan Sharma, Miroslaw Dylag, Ekaterina Pilicheva, Ibrahim Emam, Pavel Kurnosov, Andrew Tikhonov, Andrey Zorin, Anna Farne, Eleanor Williams, Tony Burdett, James Malone, Holly Zheng, Tomasz Adamusiak, Susanna-Assunta Sansone, Philippe Rocca-Serra, Natalija Sklyar, Marco Brandizi, Chris Taylor, Eamonn Maguire, Maria Krestyaninova, Mikhail Gostev, Johan Rung, Natalja Kurbatova, Katherine Lawler, Nils Gehlenborg, Lynn French. Collaborators: Audrey Kaufman (EBI), Wolfgang Huber (EBI), Sami Kaski (Helsinki), Morris Swertz (Groningen), … Funding: European Commission, FELICS, MolPAGE, ENGAGE, MuGEN, SLING, DIAMONDS, EMERALD, NIH (NHGRI), EMBL.


Editor's Notes

  • #48 Picture shows how many times gene THY1 was observed significantly over- or under-expressed (red/blue) in each tissue. For instance, 5/2 under cerebellum means that in 5 independent experiments THY1 was over-expressed in cerebellum vs. background, and in 2 experiments it was under-expressed. Most experiments contain reference samples, though some do not.
  • #56 Querying via the ontology, displaying ontology-enriched results (tree in the display aggregates samples under haemopoietic system, for example).