20100509 bioinformatics kapushesky_lecture05_0


  • Picture shows how many times gene THY1 was observed significantly over- or under-expressed (red/blue) in each tissue. For instance, 5/2 under cerebellum means that in 5 independent experiments THY1 was over-expressed in cerebellum vs. background, and in 2 experiments it was under-expressed. Most experiments contain reference samples, though some do not.
  • Querying via the ontology, displaying ontology-enriched results (tree in the display aggregates samples under haemopoietic system, for example).

    1. 1. Building a Global Map of (Human) Gene Expression Misha Kapushesky European Bioinformatics Institute, EMBL St. Petersburg, Russia May, 2010
    2. 2. From one genome to many biological states <ul><li>While there is only one genome sequence, different genes are expressed in many different cell types and tissues, different developmental or disease states </li></ul><ul><li>The size and structure of this “expression space” is still largely unknown </li></ul><ul><li>Most individual experiments are looking at small regions </li></ul><ul><li>We would like to build a map of the global human gene expression space </li></ul>
    3. 3. Mapping the human transcriptome Traditional research A microarray experiment Everest Lhasa Kathmandu The map we want to build
    4. 4. How to build such a global map <ul><li>This space is huge: there are thousands of potentially different states (cell types, tissue types, developmental stages, disease states, systems under various treatments such as drugs, radiation, stress, …) </li></ul><ul><li>It is not feasible to study them all in a single laboratory experiment (costs, rare samples, …) </li></ul><ul><li>However, thousands of gene expression experiments are performed every year (microarrays, next-generation sequencing) </li></ul><ul><li>Can we use the published data to build the global expression map? </li></ul>
    5. 9. ArrayExpress <ul><li>www.ebi.ac.uk/arrayexpress </li></ul><ul><li>Data from over 280,000 assays and over 10,000 independent studies (microarrays, sequencing, …) </li></ul><ul><li>Gene expression and other functional genomics assays </li></ul><ul><li>Over 200 species </li></ul><ul><li>Data collection and exchange from GEO </li></ul>
    6. 10. Can we integrate these data to answer questions that go beyond what was done in the individual studies? <ul><li>On a quantitative level - data on only the same microarray platform can be integrated </li></ul>
    7. 11. A global map of human gene expression <ul><li>Angela Gonzales (EBI) </li></ul><ul><li>Misha Kapushesky (EBI) </li></ul><ul><li>Janne Nikkila (Helsinki University of Technology) </li></ul><ul><li>Helen Parkinson (EBI) </li></ul><ul><li>Wolfgang Huber (EMBL) </li></ul><ul><li>Esko Ukkonen (University of Helsinki) </li></ul>Margus Lukk et al., Nature Biotechnology, 28, pp. 322-324 (April 2010)
    8. 12. <ul><li>We collected over 9000 raw data files from Affymetrix U133A from GEO and ArrayExpress </li></ul><ul><li>Applying strict quality controls, removing the duplicates </li></ul><ul><li>Data on 5372 samples remained </li></ul><ul><ul><li>from 206 different studies generated </li></ul></ul><ul><ul><li>in 163 different laboratories </li></ul></ul><ul><ul><li>grouped in 369 different biological ‘conditions’ (tissue types, diseases, various cell lines, etc) </li></ul></ul><ul><li>The 369 conditions grouped in different larger ‘metagroups’ </li></ul>The most popular gene expression microarray platform: Affymetrix U133A
    9. 13. Different metagroupings (4 and 15):
    10. 14. After RMA normalisation we obtain a matrix of 5372 samples (369 different conditions) by ~18,000 genes
    11. 15. Principal Component Analysis – each dot is one of the 5372 samples (axes: 1st and 2nd principal components)
    12. 16. Human gene expression map 17/08/10 (axes: 1st and 2nd principal components)
    13. 17. Human gene expression map (axes: hematopoietic axis and 2nd principal component)
    14. 18. Human gene expression map (axes: hematopoietic axis and 2nd principal component)
    15. 19. Human gene expression map (axes: hematopoietic axis and malignancy axis)
    16. 20. Hematopoietic and malignancy axes Lukk et al, Nature Biotechnology, 28: 322
    17. 21. Axes: 1st, 2nd and 3rd principal components
    18. 22. Coloured by tissues of origin (3rd PC)
    19. 23. Tissues of origin Neurological axis
    20. 24. First 3 (5) principal components <ul><li>Hematopoietic axis – blood, ‘solid tissues’, ‘incompletely differentiated cells and connective tissues’ </li></ul><ul><li>Malignancy axis - Cell lines – cancer – normals and other diseases </li></ul><ul><li>Neurological axis – nervous system / the rest </li></ul><ul><li>RNA degradation </li></ul><ul><li>Samples seem to ‘cluster’ by the tissues of origin </li></ul>
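The principal component step can be illustrated with a minimal sketch. It assumes a tiny made-up samples-by-genes matrix (the sample names and values are hypothetical, not data from the lecture) and finds only the first component by power iteration on the covariance matrix; a real analysis of 5372 samples would use a numerical library rather than this pure-Python loop.

```python
import math

def first_principal_component(rows, iterations=200):
    """Return (direction, scores) of the first PC of a samples x genes matrix."""
    n_samples, n_genes = len(rows), len(rows[0])
    # Centre every gene (column) at zero.
    means = [sum(row[j] for row in rows) / n_samples for j in range(n_genes)]
    centered = [[row[j] - means[j] for j in range(n_genes)] for row in rows]
    # Gene-by-gene sample covariance matrix.
    cov = [[sum(c[i] * c[j] for c in centered) / (n_samples - 1)
            for j in range(n_genes)] for i in range(n_genes)]
    # Power iteration from an arbitrary (assumed non-orthogonal) start vector.
    direction = [1.0 / (j + 1) for j in range(n_genes)]
    for _ in range(iterations):
        product = [sum(cov[i][j] * direction[j] for j in range(n_genes))
                   for i in range(n_genes)]
        norm = math.sqrt(sum(x * x for x in product))
        direction = [x / norm for x in product]
    # Project each centred sample onto the component.
    scores = [sum(c[j] * direction[j] for j in range(n_genes)) for c in centered]
    return direction, scores

# Hypothetical toy data: two "blood" and two "brain" samples, three genes.
sample_names = ["blood_1", "blood_2", "brain_1", "brain_2"]
matrix = [
    [10.0, 2.0, 5.0],
    [11.0, 3.0, 5.5],
    [3.0, 9.0, 5.2],
    [2.0, 10.0, 4.9],
]
direction, scores = first_principal_component(matrix)
for name, score in zip(sample_names, scores):
    print(name, round(score, 2))
```

In this toy case the two groups land on opposite sides of the first component, which is the kind of separation the hematopoietic and neurological axes describe at full scale.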
    21. 26. Human gene expression map: hierarchical clustering of 97 groups with at least 10 replicates each
    22. 27. Comparison of the 97 larger sample groups to the rest Incompletely differentiated cell type and connective tissue group
    23. 28. Conclusions so far <ul><li>We have identified 6 major transcription profile classes in these data: </li></ul><ul><ul><li>cell lines </li></ul></ul><ul><ul><li>incompletely differentiated cells and connective tissues </li></ul></ul><ul><ul><li>neoplasms </li></ul></ul><ul><ul><li>blood </li></ul></ul><ul><ul><li>brain </li></ul></ul><ul><ul><li>muscle </li></ul></ul><ul><li>Cell lines cluster together! </li></ul>
    24. 31. Gene expression across the 5372 samples <ul><li>The expression of most genes is relatively constant </li></ul><ul><li>There are only 1034 probesets (mapping to fewer than 900 genes) where the normalised signal has a standard deviation > 2 </li></ul>
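The variability filter described above can be sketched as follows. The probeset identifiers and signal values are hypothetical, and the sketch assumes the sample standard deviation of the normalised signal is the statistic being thresholded; `top_variable` is an illustrative helper for picking the N most variable probesets.

```python
from statistics import stdev

def variable_probesets(expression, sd_cutoff=2.0):
    """Return probesets whose signal standard deviation exceeds the cutoff.

    expression: dict mapping probeset id -> list of normalised signals,
    one value per sample.
    """
    return sorted(p for p, values in expression.items()
                  if stdev(values) > sd_cutoff)

def top_variable(expression, n):
    """Return the n most variable probesets, most variable first."""
    ranked = sorted(expression, key=lambda p: stdev(expression[p]), reverse=True)
    return ranked[:n]

# Hypothetical toy data: three probesets measured in four samples.
expression = {
    "probe_a": [4.0, 4.2, 3.9, 4.1],   # nearly constant
    "probe_b": [2.0, 8.0, 3.0, 9.0],   # highly variable
    "probe_c": [6.0, 6.0, 6.5, 5.5],   # nearly constant
}
selected = variable_probesets(expression)
most_variable = top_variable(expression, 2)
print(selected, most_variable)
```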
    25. 32. Clustering of 97 sample groups and 1000 most variable probesets (about 900 genes) <ul><li>Immune response </li></ul><ul><li>Nervous system development </li></ul><ul><li>Lipid raft </li></ul><ul><li>Mitosis </li></ul><ul><li>Neurotransmitter uptake </li></ul><ul><li>Cytoskeletal protein binding </li></ul><ul><li>Extracellular matrix </li></ul><ul><li>Extracellular regions </li></ul><ul><li>Extracellular matrix </li></ul><ul><li>Extracellular region </li></ul><ul><li>Mitosis </li></ul><ul><li>Defence response </li></ul><ul><li>Nervous system development </li></ul><ul><li>Actin cytoskeleton organisation and biogenesis </li></ul><ul><li>Protein carrier activity </li></ul><ul><li>No significant result </li></ul><ul><li>Antigen presentation, exogenous antigen </li></ul><ul><li>Trans-1,2-dihydrobenzene-1,2-diol dehydrogenase activity </li></ul><ul><li>S100 alpha binding </li></ul>1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
    26. 33. Clustering based on subsets of these genes produces similar results <ul><li>Clustering based on the 350 most variable probesets gives almost the same result </li></ul><ul><li>Even clustering based on the 30 most variable probesets is very close </li></ul>
    27. 36. 24 most variable genes
    28. 37. www.ebi.ac.uk/gxa/human/U133a
    29. 38. Can we go beyond the 6 major classes?
    30. 39. Human gene expression map: hierarchical clustering of all 369 sample groups <ul><li>Some finer groups: </li></ul><ul><li>Cancer: </li></ul><ul><li>Sarcomas </li></ul><ul><li>Carcinomas </li></ul><ul><li>Neuroblastomas </li></ul><ul><li>Normal: </li></ul><ul><li>Liver and gut </li></ul>
    31. 40. Figure labels: Leukemia; Normal blood and blood non-neoplastic disease; Other blood neoplasm; Blood cell lines
    32. 41. Identifying condition-specific genes by supervised analysis <ul><li>Using linear models to find condition-specific genes, multiple testing correction, differential expression cut-offs </li></ul><ul><li>Example: 174 leukemia-specific genes </li></ul><ul><ul><li>include most well-known markers (e.g., BCR, ETV6, FLT3, HOXA9, MUST3, PRDM2, RUNX1, and TAL1) </li></ul></ul><ul><ul><li>Many confirmed as associated with leukemia </li></ul></ul>
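The slide does not say which multiple testing correction or cut-offs were used. As a sketch only, the following assumes the Benjamini-Hochberg procedure and an illustrative log2 fold-change threshold; the gene names, p-values and thresholds are hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone adjusted values.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, pvalues[index] * m / rank)
        adjusted[index] = running_min
    return adjusted

def condition_specific(genes, pvalues, log2_fold_changes,
                       alpha=0.05, min_abs_lfc=1.0):
    """Keep genes passing both the adjusted p-value and fold-change cut-offs."""
    adjusted = benjamini_hochberg(pvalues)
    return [g for g, q, lfc in zip(genes, adjusted, log2_fold_changes)
            if q < alpha and abs(lfc) >= min_abs_lfc]

# Hypothetical per-gene results from a linear model fit.
genes = ["gene_1", "gene_2", "gene_3", "gene_4"]
pvalues = [0.01, 0.04, 0.03, 0.005]
log2_fold_changes = [2.1, 0.3, -1.5, 1.2]

adjusted = benjamini_hochberg(pvalues)
hits = condition_specific(genes, pvalues, log2_fold_changes)
print(adjusted, hits)
```

Here `gene_2` passes the significance threshold but is dropped by the fold-change cut-off, which is the role of the "differential expression cut-offs" mentioned on the slide.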
    33. 42. <ul><li>Beyond the major 6 classes the ‘signal’ becomes weak </li></ul><ul><li>The problem may be lab effects </li></ul><ul><ul><li>The large biological effects are stronger than the lab effects </li></ul></ul><ul><ul><li>However, when we zoom into particular subclasses, the lab effects may be taking precedence </li></ul></ul>
    34. 43. Mapping the human transcriptome Traditional research A microarray experiment Everest Lhasa Kathmandu The map we want to build Our current view on global transcriptome
    35. 44. 97 groups – colours recycled. Labelled groups: Frontal cortex; Muscular dystrophy; Skeletal muscle; Brain; Heart and heart parts; Cerebellum; Caudate nucleus; Hippocampal tissue; Nervous system tumors; Mononuclear cells; AML
    36. 45. Second approach <ul><li>Integrating data on statistics level </li></ul>
    37. 46. Gene Expression Atlas <ul><li>Ele Holloway </li></ul><ul><li>Ibrahim Emam </li></ul><ul><li>Pavel Kurnosov </li></ul><ul><li>Helen Parkinson </li></ul><ul><li>Andrey Zorin </li></ul><ul><li>Tony Burdett </li></ul><ul><li>Gabriella Rustici </li></ul><ul><li>Eleanor Williams </li></ul><ul><li>Andrew Tikhonov </li></ul>
    38. 47. Global Differential Expression Analysis <ul><ul><li>Selected ~10% of the data from ArrayExpress (including GEO imports), manually curated for quality and mapped to a custom-built ontology of experimental factors, EFO: http://www.ebi.ac.uk/efo </li></ul></ul><ul><li>Data on differential expression of genes in 1000+ studies, comprising ~30000 assays, in over 5000 conditions </li></ul><ul><li>For each experiment, differentially expressed genes have been identified computationally via moderated t-tests and statistical meta-analysis </li></ul><ul><li>Meta-Analysis Approaches </li></ul><ul><li>Vote counting: </li></ul><ul><ul><li>number of independent studies supporting an observation for a particular gene </li></ul></ul><ul><li>Effect size integration: </li></ul><ul><ul><li>compute effect size statistics in each study, assess relevant statistical model and compute combined z-score, for each gene/condition/study combination (extension of Choi et al, 2003) </li></ul></ul>
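Vote counting, as described above, amounts to tallying independent studies per gene and condition, which yields summaries like the "5/2" notation in the slide notes. A minimal sketch, assuming each study contributes one "up" or "down" call; the experiment identifiers and calls below are hypothetical.

```python
from collections import Counter

def vote_count(calls):
    """Tally independent studies per (gene, condition) and direction.

    calls: iterable of (gene, condition, experiment_id, direction) tuples,
    where direction is "up" or "down". Repeated calls from the same
    experiment are counted once.
    """
    seen = set()
    counts = {}
    for call in calls:
        if call in seen:
            continue
        seen.add(call)
        gene, condition, _experiment, direction = call
        counts.setdefault((gene, condition), Counter())[direction] += 1
    return counts

# Hypothetical differential expression calls.
calls = [
    ("THY1", "cerebellum", "exp_1", "up"),
    ("THY1", "cerebellum", "exp_2", "up"),
    ("THY1", "cerebellum", "exp_2", "up"),    # duplicate, ignored
    ("THY1", "cerebellum", "exp_3", "up"),
    ("THY1", "cerebellum", "exp_4", "down"),
    ("THY1", "liver", "exp_1", "down"),
]
counts = vote_count(calls)
cerebellum = counts[("THY1", "cerebellum")]
summary = f"{cerebellum['up']}/{cerebellum['down']}"
print(summary)
```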
    39. 48. Analysing each contributing dataset separately: one-way ANOVA AML CML normal genes
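For a single gene in a single dataset, the one-way ANOVA named on this slide compares between-group and within-group variance. The sketch below computes only the F statistic and its degrees of freedom from hypothetical expression values; turning F into a p-value requires the F distribution, which is left to a statistics library.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for one gene.

    groups: dict mapping condition name -> list of expression values.
    """
    all_values = [v for values in groups.values() for v in values]
    grand_mean = sum(all_values) / len(all_values)
    ss_between = 0.0
    ss_within = 0.0
    for values in groups.values():
        group_mean = sum(values) / len(values)
        ss_between += len(values) * (group_mean - grand_mean) ** 2
        ss_within += sum((v - group_mean) ** 2 for v in values)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

# Hypothetical expression values for one gene in three conditions.
gene_values = {
    "AML": [8.0, 9.0, 10.0],
    "CML": [5.0, 6.0, 7.0],
    "normal": [2.0, 3.0, 4.0],
}
f_stat, df_between, df_within = one_way_anova_f(gene_values)
print(f_stat, df_between, df_within)
```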
    40. 49. Combining the datasets … Experiments 1, 2, 3, …, m
    41. 50. Effect size-based meta-analysis <ul><li>We have for each gene in each experiment/condition: </li></ul><ul><ul><li>p-value for significance </li></ul></ul><ul><ul><li>simultaneous t-statistics & confidence intervals </li></ul></ul><ul><ul><li>d.e. label (“up” or “down”) </li></ul></ul><ul><li>However, we would also like: </li></ul><ul><ul><li>A measure of the strength of d.e. (the effect size) </li></ul></ul><ul><ul><li>The ability to combine d.e. findings statistically </li></ul></ul><ul><li>Effect Size </li></ul><ul><ul><li>Standardized mean difference or similar (e.g., correlation coef.) </li></ul></ul>
    42. 51. Meta-analysis Procedure <ul><li>For each gene-experiment-condition combination </li></ul><ul><ul><li>Compute effect size from simultaneous d.e. t-statistics </li></ul></ul><ul><li>Combine effect sizes across multiple studies </li></ul><ul><ul><li>Using fixed-effects or random-effects models </li></ul></ul><ul><ul><li>Obtain for each gene-condition combination: </li></ul></ul><ul><ul><ul><li>Mean effect size estimate </li></ul></ul></ul><ul><ul><ul><li>Combined z-score </li></ul></ul></ul><ul><ul><ul><li>Overall p-value </li></ul></ul></ul>
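The procedure above can be sketched for one gene-condition pair. This is an illustration under stated assumptions, not the Atlas implementation: the effect size is taken to be a standardized mean difference derived from a two-sample t-statistic (without small-sample correction), studies are combined by inverse-variance weighting, and the random-effects variant uses the DerSimonian-Laird estimate of between-study variance. The t-statistics and sample sizes are hypothetical.

```python
import math

def effect_size_from_t(t, n1, n2):
    """Standardized mean difference and its variance from a t-statistic."""
    d = t * math.sqrt(1.0 / n1 + 1.0 / n2)
    variance = (n1 + n2) / (n1 * n2) + d * d / (2.0 * (n1 + n2))
    return d, variance

def combine_effect_sizes(effects, random_effects=False):
    """Combine (effect, variance) pairs; return (mean, z_score, p_value)."""
    weights = [1.0 / var for _, var in effects]
    mean = sum(w * d for w, (d, _) in zip(weights, effects)) / sum(weights)
    if random_effects and len(effects) > 1:
        # DerSimonian-Laird between-study variance (assumed estimator).
        q = sum(w * (d - mean) ** 2 for w, (d, _) in zip(weights, effects))
        c = sum(weights) - sum(w * w for w in weights) / sum(weights)
        tau2 = max(0.0, (q - (len(effects) - 1)) / c)
        weights = [1.0 / (var + tau2) for _, var in effects]
        mean = sum(w * d for w, (d, _) in zip(weights, effects)) / sum(weights)
    z_score = mean / math.sqrt(1.0 / sum(weights))
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z_score) / math.sqrt(2.0))
    return mean, z_score, p_value

# Hypothetical per-study results for one gene: (t, n_condition, n_reference).
studies = [(2.5, 5, 5), (3.1, 6, 6), (1.8, 4, 5)]
effects = [effect_size_from_t(t, n1, n2) for t, n1, n2 in studies]
fixed = combine_effect_sizes(effects)
random_fx = combine_effect_sizes(effects, random_effects=True)
print(fixed, random_fx)
```

When the studies agree closely, as in this toy input, the estimated between-study variance is zero and the random-effects result coincides with the fixed-effects one.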
    43. 52. Long tail of annotations…
    44. 53. Annotating data with ontologies <ul><li>Diverse nature of annotations on data </li></ul><ul><li>Need to support complex queries which contain semantic information </li></ul><ul><ul><li>E.g. which genes are under-expressed in brain samples in human or mouse </li></ul></ul><ul><li>If we annotate with do we get this data? </li></ul>cancer adenocarcinoma James Malone
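The point about semantic queries can be illustrated with a toy example: a query for a broad term should also return samples annotated with more specific terms beneath it. The hierarchy below is a hypothetical fragment, not the actual EFO structure, and the sample names are made up.

```python
def descendants(term, children):
    """Return every term below `term` in a parent -> children mapping."""
    found = set()
    stack = [term]
    while stack:
        for child in children.get(stack.pop(), ()):
            if child not in found:
                found.add(child)
                stack.append(child)
    return found

def query_samples(term, children, annotations):
    """Return samples annotated with `term` or any of its descendants."""
    terms = {term} | descendants(term, children)
    return sorted(s for s, annotation in annotations.items()
                  if annotation in terms)

# Hypothetical ontology fragment (parent -> list of child terms).
children = {
    "cancer": ["carcinoma", "sarcoma"],
    "carcinoma": ["adenocarcinoma"],
}
# Hypothetical sample annotations.
annotations = {
    "sample_1": "adenocarcinoma",
    "sample_2": "sarcoma",
    "sample_3": "normal",
    "sample_4": "cancer",
}
cancer_samples = query_samples("cancer", children, annotations)
exact_only = sorted(s for s, a in annotations.items() if a == "cancer")
print(cancer_samples, exact_only)
```

A plain string match on "cancer" finds only `sample_4`, while the ontology-aware query also returns the adenocarcinoma and sarcoma samples.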
    45. 54. Decoupling knowledge from data Atlas/AE James Malone
    46. 55. Semantically-enriched Queries with EFO
    47. 56. We can use the ontology structure We can perform effect size meta-analysis on a hierarchy, if we follow several rules:
    48. 57. Increased statistical power
    49. 58. Condition-specificity through EFO
    50. 59. Condition-specific Gene Expression
    51. 60. Query for genes; query for conditions; species. The ‘advanced query’ option allows building more complex queries: http://www.ebi.ac.uk/gxa
    52. 61. Query results for gene ASPM (ArrayExpress). ASPM is downregulated in the ‘normal’ condition in comparison to a disease in 9 studies out of 10. Upregulated in ‘Glioblastoma’ in 3 independent studies. Zoom into one of the ‘Glioblastoma’ studies: each bar represents the expression level in a particular sample
    53. 62. ‘Wnt pathway’ genes in various cancers (ArrayExpress)
    54. 63. Integrating both approaches <ul><li>The first approach gives the global view, but obscures the detail </li></ul><ul><li>The second approach gives detail, but does not easily allow integrating everything into one map </li></ul><ul><li>Can we combine both approaches? </li></ul>
    55. 64. Other data <ul><li>RNAseq data </li></ul><ul><li>Proteomics data – Human Protein Atlas from KTH in Stockholm (collaboration with Mathias Uhlen) </li></ul><ul><li>Time series – what states does a cell go through on its way from an ESC to a mature cell? </li></ul>
    56. 67. Two ways of integrating the data <ul><li>On a quantitative level – normalise all data together </li></ul><ul><ul><li>Advantages – results easier to interpret </li></ul></ul><ul><ul><li>Disadvantages – lab effects </li></ul></ul><ul><li>On a statistics level – analyse each dataset separately first </li></ul><ul><ul><li>Advantages – fewer lab effects </li></ul></ul><ul><ul><li>Disadvantages – combined data difficult to interpret (in each experiment each condition is compared to something else) </li></ul></ul><ul><li>How to combine the two approaches? </li></ul>
    57. 68. Acknowledgements <ul><li>Margus Lukk </li></ul><ul><li>Misha Kapushesky </li></ul><ul><li>Angela Gonzales </li></ul><ul><li>Helen Parkinson </li></ul><ul><li>Gabriella Rustici </li></ul><ul><li>Ugis Sarkans </li></ul><ul><li>Ele Holloway </li></ul><ul><li>Roby Mani </li></ul><ul><li>Mohammadreza Shojatalab </li></ul><ul><li>Nikolay Kolesnikov </li></ul><ul><li>Niran Abeygunawardena </li></ul><ul><li>Anjan Sharma </li></ul><ul><li>Miroslaw Dylag </li></ul><ul><li>Ekaterina Pilicheva </li></ul><ul><li>Ibrahim Emam </li></ul><ul><li>Pavel Kurnosov </li></ul><ul><li>Andrew Tikhonov </li></ul><ul><li>Andrey Zorin </li></ul><ul><li>Collaborators </li></ul><ul><ul><li>Audrey Kaufman (EBI) </li></ul></ul><ul><ul><li>Wolfgang Huber (EBI) </li></ul></ul><ul><ul><li>Sami Kaski (Helsinki) </li></ul></ul><ul><ul><li>Morris Swertz (Groningen) </li></ul></ul><ul><ul><li>… </li></ul></ul><ul><li>Funding </li></ul><ul><ul><li>European Commission </li></ul></ul><ul><ul><ul><li>FELICS </li></ul></ul></ul><ul><ul><ul><li>MolPAGE </li></ul></ul></ul><ul><ul><ul><li>ENGAGE </li></ul></ul></ul><ul><ul><ul><li>MuGEN </li></ul></ul></ul><ul><ul><ul><li>SLING </li></ul></ul></ul><ul><ul><ul><li>DIAMONDS </li></ul></ul></ul><ul><ul><ul><li>EMERALD </li></ul></ul></ul><ul><ul><li>NIH (NHGRI) </li></ul></ul><ul><ul><li>EMBL </li></ul></ul><ul><li>Anna Farne </li></ul><ul><li>Eleanor Williams </li></ul><ul><li>Tony Burdett </li></ul><ul><li>James Malone </li></ul><ul><li>Holly Zheng </li></ul><ul><li>Tomasz Adamusiak </li></ul><ul><li>Susanna-Assunta Sansone </li></ul><ul><li>Philippe Rocca-Serra </li></ul><ul><li>Natalija Sklyar </li></ul><ul><li>Marco Brandizi </li></ul><ul><li>Chris Taylor </li></ul><ul><li>Eamonn Maguire </li></ul><ul><li>Maria Krestyaninova </li></ul><ul><li>Mikhail Gostev </li></ul><ul><li>Johan Rung </li></ul><ul><li>Natalja Kurbatova </li></ul><ul><li>Katherine Lawler </li></ul><ul><li>Nils Gehlenborg </li></ul><ul><li>Lynn French </li></ul>