20100509 bioinformatics kapushesky_lecture05_0
  • Picture shows how many times gene THY1 was observed significantly over- or under-expressed (red/blue) in each tissue. For instance, 5/2 under cerebellum means that in 5 independent experiments THY1 was over-expressed in cerebellum vs. background, and in 2 experiments it was under-expressed. Most experiments contain reference samples, though some do not.
  • Querying via the ontology, displaying ontology-enriched results (tree in the display aggregates samples under haemopoietic system, for example).

Presentation Transcript

  • Building a Global Map of (Human) Gene Expression. Misha Kapushesky, European Bioinformatics Institute, EMBL. St. Petersburg, Russia, May 2010
  • From one genome to many biological states
    • While there is only one genome sequence, different genes are expressed in different cell types and tissues, and in different developmental or disease states
    • The size and structure of this “expression space” is still largely unknown
    • Most individual experiments are looking at small regions
    • We would like to build a map of the global human gene expression space
  • Mapping the human transcriptome (figure labels: Traditional research; A microarray experiment; Everest; Lhasa; Kathmandu; The map we want to build)
  • How to build such a global map
    • This space is huge: there are thousands of potentially different states (cell types, tissue types, developmental stages, disease states, systems under various treatments such as drugs, radiation, stress, …)
    • It is not feasible to study them all in a single laboratory experiment (costs, rare samples, …)
    • However, thousands of gene expression experiments are performed every year (microarrays, next-generation sequencing)
    • Can we use the published data to build the global expression map?
  • ArrayExpress
    • www.ebi.ac.uk/arrayexpress
    • Data from over 280,000 assays and over 10,000 independent studies (microarrays, sequencing, …)
    • Gene expression and other functional genomics assays
    • Over 200 species
    • Data collection and exchange from GEO
  • Can we integrate these data to answer questions that go beyond what was done in the individual studies?
    • On a quantitative level, only data from the same microarray platform can be integrated
  • A global map of human gene expression
    • Angela Gonzales (EBI)
    • Misha Kapushesky (EBI)
    • Janne Nikkila (Helsinki University of Technology)
    • Helen Parkinson (EBI)
    • Wolfgang Huber (EMBL)
    • Esko Ukkonen (University of Helsinki)
    Margus Lukk et al., Nature Biotechnology, 28, pp. 322-324 (April 2010)
    • We collected over 9000 raw data files from Affymetrix U133A from GEO and ArrayExpress
    • Applying strict quality controls, removing the duplicates
    • Data on 5372 samples remained
      • from 206 different studies
      • generated in 163 different laboratories
      • grouped in 369 different biological ‘conditions’ (tissue types, diseases, various cell lines, etc)
    • The 369 conditions grouped in different larger ‘metagroups’
    The most popular gene expression microarray platform: Affymetrix U133A
  • Different metagroupings (4 and 15):
  • After RMA normalisation we obtain: 5372 samples (369 different conditions) × ~18,000 genes
  • Principal Component Analysis: each dot is one of the 5372 samples (axes: 1st and 2nd principal components)
  • Human gene expression map (axes: 1st and 2nd principal components)
  • Human gene expression map (axes: hematopoietic axis and 2nd principal component)
  • Human gene expression map (axes: hematopoietic axis and malignancy axis)
  • Hematopoietic and malignancy axes Lukk et al, Nature Biotechnology, 28: 322
  • 1st, 2nd and 3rd principal components
  • Coloured by tissues of origin (3rd PC)
  • Tissues of origin: neurological axis
  • First 3 (5) principal components
    • Hematopoietic axis – blood, ‘solid tissues’, ‘incompletely differentiated cells and connective tissues’
    • Malignancy axis - Cell lines – cancer – normals and other diseases
    • Neurological axis – nervous system / the rest
    • RNA degradation
    • Samples seem to ‘cluster’ by the tissues of origin
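The principal component analysis described on these slides can be illustrated with a minimal sketch. It assumes a small samples × genes matrix of normalised expression values and finds only the leading component by power iteration, in pure Python. The toy matrix and the function names are hypothetical, and a real analysis of 5372 samples would use a linear algebra library rather than this loop.

```python
import math
import random


def center_columns(matrix):
    """Subtract each gene's (column's) mean across all samples."""
    n = len(matrix)
    means = [sum(row[j] for row in matrix) / n for j in range(len(matrix[0]))]
    return [[row[j] - means[j] for j in range(len(row))] for row in matrix]


def first_principal_component(matrix, iterations=200, seed=0):
    """Return (unit loading vector, per-sample scores) for the leading PC.

    Power iteration on X^T X, computed as X^T (X v) so the gene-by-gene
    covariance matrix is never formed explicitly.
    """
    x = center_columns(matrix)
    n_genes = len(x[0])
    rng = random.Random(seed)
    v = [rng.gauss(0, 1) for _ in range(n_genes)]
    for _ in range(iterations):
        scores = [sum(a * b for a, b in zip(row, v)) for row in x]
        w = [sum(scores[i] * x[i][j] for i in range(len(x)))
             for j in range(n_genes)]
        norm = math.sqrt(sum(c * c for c in w))
        v = [c / norm for c in w]
    scores = [sum(a * b for a, b in zip(row, v)) for row in x]
    return v, scores


# Hypothetical toy data: rows are samples, columns are genes.
# The first three samples stand in for one tissue class, the last three for another.
samples = [
    [9.0, 8.5, 2.0, 2.2],
    [8.8, 9.1, 2.1, 1.9],
    [9.2, 8.7, 1.8, 2.0],
    [2.1, 2.0, 8.9, 9.0],
    [1.9, 2.2, 9.1, 8.8],
    [2.0, 1.8, 9.0, 9.2],
]
loadings, scores = first_principal_component(samples)
for label, score in zip(["A1", "A2", "A3", "B1", "B2", "B3"], scores):
    print(label, round(score, 2))
```

In this toy case the two sample classes land on opposite sides of zero along the first component, which is the kind of separation the slides describe for the hematopoietic axis. The sign of a component is arbitrary, so only the relative positions matter.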
  • Hierarchical clustering of 97 groups with at least 10 replicates each
  • Comparison of the 97 larger sample groups to the rest: incompletely differentiated cell type and connective tissue group
  • Conclusions so far
    • We have identified 6 major transcription profile classes in these data:
      • cell lines
      • incompletely differentiated cells and connective tissues
      • neoplasms
      • blood
      • brain
      • muscle
    • Cell lines cluster together!
  • Gene expression across the 5372 samples
    • The expression of most genes is relatively constant
    • There are only 1034 probesets (mapping to fewer than 900 genes) whose normalised signal has a standard deviation > 2
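As a minimal sketch of the variability filter described above, the following keeps probesets whose standard deviation across samples exceeds 2. The probeset names and values are hypothetical, and the choice of the sample standard deviation (rather than the population one) is an assumption, since the slide does not say which was used.

```python
import statistics


def variable_probesets(expression, threshold=2.0):
    """Return ids of probesets whose standard deviation across samples
    exceeds the threshold. `expression` maps probeset id -> list of
    normalised signal values, one per sample."""
    return [
        probeset
        for probeset, values in expression.items()
        if statistics.stdev(values) > threshold
    ]


# Hypothetical normalised values for three probesets across five samples.
expression = {
    "probeset_a": [7.1, 7.0, 7.2, 6.9, 7.0],   # nearly constant
    "probeset_b": [3.0, 9.5, 4.1, 10.2, 3.3],  # highly variable
    "probeset_c": [5.0, 5.4, 4.8, 5.1, 5.2],   # nearly constant
}
selected = variable_probesets(expression)
print(selected)
```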
  • Clustering of 97 sample groups and 1000 most variable probesets (about 900 genes)
    • Immune response
    • Nervous system development
    • Lipid raft
    • Mitosis
    • Neurotransmitter uptake
    • Cytoskeletal protein binding
    • Extracellular matrix
    • Extracellular regions
    • Extracellular matrix
    • Extracellular region
    • Mitosis
    • Defence response
    • Nervous system development
    • Actin cytoskeleton organisation and biogenesis
    • Protein carrier activity
    • No significant result
    • Antigen presentation, exogenous antigen
    • trans-1,2-dihydrobenzene-1,2-diol dehydrogenase activity
    • S100 alpha binding
    (cluster labels 1 to 19 in the figure)
  • Clustering based on subsets of these genes produces similar results
    • Clustering based on 350 most variable probesets gives almost the same result
    • Even clustering based on 30 most variable probesets is very close
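The observation that a small subset of the most variable probesets reproduces the clustering can be sketched as follows. This is an illustration under assumptions, not the method used in the study: it uses average linkage with distance 1 − Pearson correlation, and the group names and mean profiles are hypothetical toy values.

```python
import math
import statistics


def pearson(a, b):
    """Pearson correlation between two equal-length profiles."""
    ma, mb = statistics.fmean(a), statistics.fmean(b)
    num = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    den = math.sqrt(sum((x - ma) ** 2 for x in a) * sum((y - mb) ** 2 for y in b))
    return num / den


def top_variable(profiles, k):
    """Indices of the k probesets with the largest variance across groups."""
    n_probes = len(next(iter(profiles.values())))
    variances = [
        statistics.pvariance([p[j] for p in profiles.values()])
        for j in range(n_probes)
    ]
    return sorted(range(n_probes), key=lambda j: variances[j], reverse=True)[:k]


def average_linkage(profiles, n_clusters):
    """Agglomerative clustering, merging the closest pair until n_clusters remain."""
    clusters = [[name] for name in profiles]

    def dist(c1, c2):
        pairs = [(a, b) for a in c1 for b in c2]
        return sum(1 - pearson(profiles[a], profiles[b]) for a, b in pairs) / len(pairs)

    while len(clusters) > n_clusters:
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: dist(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [sorted(c) for c in clusters]


# Hypothetical mean expression profiles of four sample groups over six probesets.
profiles = {
    "blood_a": [9.0, 8.0, 2.0, 2.5, 5.0, 5.1],
    "blood_b": [8.7, 8.3, 2.2, 2.1, 5.1, 5.0],
    "brain_a": [2.0, 2.4, 9.1, 8.6, 5.0, 4.9],
    "brain_b": [2.3, 2.1, 8.8, 9.0, 4.9, 5.0],
}
keep = top_variable(profiles, 3)
reduced_profiles = {name: [values[j] for j in keep] for name, values in profiles.items()}

full_clusters = average_linkage(profiles, 2)
reduced_clusters = average_linkage(reduced_profiles, 2)
print(full_clusters)
print(reduced_clusters)
```

With these toy profiles, clustering on the three most variable probesets gives the same two groups as clustering on all six, mirroring the claim on the slide at a very small scale.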
  • 24 most variable genes
  • www.ebi.ac.uk/gxa/human/U133a
  • Can we go beyond the 6 major classes?
  • Hierarchical clustering of all 369 sample groups
    • Some finer groups:
    • Cancer:
    • Sarcomas
    • Carcinomas
    • Neuroblastomas
    • Normal:
    • Liver and gut
  • Leukemia; Normal blood and blood non-neoplastic disease; Other blood neoplasm; Blood cell lines
  • Identifying condition specific genes by supervised analysis
    • Using linear models to find condition specific genes, multiple testing correction, differential expression cut-offs
    • Example - 174 leukemia specific genes
      • include most well-known markers (e.g., BCR, ETV6, FLT3, HOXA9, MUST3, PRDM2, RUNX1, and TAL1)
      • Many confirmed as associated with leukemia
    • Beyond the major 6 classes the ‘signal’ becomes weak
    • The problem may be lab effects
      • The large biological effects are stronger than the lab effects
      • However, when we zoom into particular subclasses, the lab effects may be taking precedence
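The multiple testing correction mentioned for the supervised analysis can be sketched with the Benjamini-Hochberg procedure. The slides do not say which correction was applied, so the choice of Benjamini-Hochberg is an assumption, and the gene names and p-values below are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical per-gene p-values from a linear model fit.
genes = ["gene_a", "gene_b", "gene_c", "gene_d"]
raw = [0.001, 0.04, 0.03, 0.5]
adjusted = benjamini_hochberg(raw)

# Keep genes whose adjusted p-value passes a 5% false discovery rate cut-off.
significant = [g for g, q in zip(genes, adjusted) if q < 0.05]
print(dict(zip(genes, adjusted)))
print(significant)
```

A differential expression cut-off, such as a minimum effect size, would typically be applied on top of this filter.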
  • Mapping the human transcriptome (figure labels: Traditional research; A microarray experiment; Everest; Lhasa; Kathmandu; The map we want to build; Our current view on global transcriptome)
  • 97 groups, colours recycled (figure labels: Frontal cortex; Muscular dystrophy; Skeletal muscle; Brain; Heart and heart parts; Cerebellum; Caudate nucleus; Hippocampal tissue; Nervous system tumors; Mononuclear cells; AML)
  • Second approach
    • Integrating data on a statistics level
  • Gene Expression Atlas
    • Ele Holloway
    • Ibrahim Emam
    • Pavel Kurnosov
    • Helen Parkinson
    • Andrey Zorin
    • Tony Burdett
    • Gabriella Rustici
    • Eleanor Williams
    • Andrew Tikhonov
  • Global Differential Expression Analysis
      • Selected ~10% of the data from ArrayExpress (including GEO imports), manually curated for quality and mapped to a custom-built ontology of experimental factors, EFO: http://www.ebi.ac.uk/efo
    • Data on differential expression of genes in 1000+ studies, comprising ~30000 assays, in over 5000 conditions
    • For each experiment, differentially expressed genes have been identified computationally via moderated t-tests and statistical meta-analysis
    • Meta-Analysis Approaches
    • Vote counting:
      • number of independent studies supporting an observation for a particular gene
    • Effect size integration:
      • compute effect size statistics in each study, assess relevant statistical model and compute combined z-score, for each gene/condition/study combination (extension of Choi et al, 2003)
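Vote counting, the simpler of the two approaches listed above, amounts to tallying how many independent studies report a gene as significantly up or down in a condition. The sketch below assumes one call per study; the gene name and the calls are hypothetical.

```python
from collections import Counter


def vote_count(calls):
    """Tally significant calls. `calls` is an iterable of
    (gene, condition, direction) tuples, one per study in which the
    gene was significantly differentially expressed."""
    counts = Counter()
    for gene, condition, direction in calls:
        counts[(gene, condition, direction)] += 1
    return counts


# Hypothetical calls collected from eight independent studies.
calls = (
    [("GENE_X", "cerebellum", "up")] * 5
    + [("GENE_X", "cerebellum", "down")] * 2
    + [("GENE_X", "liver", "down")]
)
votes = vote_count(calls)

# Summarise as "up/down" counts per condition.
summary = "{}/{}".format(
    votes[("GENE_X", "cerebellum", "up")],
    votes[("GENE_X", "cerebellum", "down")],
)
print(summary)
```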
  • Analysing each contributing dataset separately: one-way ANOVA (figure labels: AML, CML, normal, genes)
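A per-gene one-way ANOVA of the kind named on this slide can be sketched by computing the F statistic from between-group and within-group sums of squares. The expression values are hypothetical, and turning F into a p-value (which needs the F distribution) is left out of this sketch.

```python
import statistics


def one_way_anova_f(groups):
    """F statistic for one gene. `groups` maps a condition name to the
    list of expression values measured in that condition."""
    all_values = [v for values in groups.values() for v in values]
    grand_mean = statistics.fmean(all_values)
    k, n = len(groups), len(all_values)
    ss_between = sum(
        len(values) * (statistics.fmean(values) - grand_mean) ** 2
        for values in groups.values()
    )
    ss_within = sum(
        (x - statistics.fmean(values)) ** 2
        for values in groups.values()
        for x in values
    )
    return (ss_between / (k - 1)) / (ss_within / (n - k))


# Hypothetical expression of two genes in three conditions of one dataset.
gene_1 = {"AML": [8.0, 9.0, 10.0], "CML": [5.0, 6.0, 7.0], "normal": [2.0, 3.0, 4.0]}
gene_2 = {"AML": [5.0, 6.0, 7.0], "CML": [5.0, 6.0, 7.0], "normal": [5.0, 6.0, 7.0]}

f_gene_1 = one_way_anova_f(gene_1)  # group means differ strongly
f_gene_2 = one_way_anova_f(gene_2)  # group means are identical
print(f_gene_1, f_gene_2)
```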
  • Combining the datasets … Experiments 1, 2, 3, …, m
  • Effect size-based meta-analysis
    • We have for each gene in each experiment/condition:
      • p-value for significance
      • simultaneous t-statistics & confidence intervals
      • d.e. label (“up” or “down”)
    • However, we would like to:
      • Measure the strength of d.e.: effect size
      • Be able to combine d.e. findings statistically
    • Effect Size
      • Standardized mean difference or similar (e.g., correlation coef.)
  • Meta-analysis Procedure
    • For each gene-experiment-condition combination
      • Compute effect size from simultaneous d.e. t-statistics
    • Combine effect sizes across multiple studies
      • Using fixed-effects or random-effects models
      • Obtain for each gene-condition combination:
        • Mean effect size estimate
        • Combined z-score
        • Overall p-value
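The procedure above can be sketched for a single gene-condition combination. This is an illustration under assumptions, not the Atlas implementation: it converts two-sample t-statistics to standardized mean differences with the usual large-sample variance approximation, then combines them with a fixed-effects (inverse-variance) model. The study values are hypothetical, and no small-sample bias correction or random-effects term is included.

```python
import math


def effect_size_from_t(t, n1, n2):
    """Standardized mean difference and its approximate variance,
    derived from a two-sample t-statistic with group sizes n1 and n2."""
    d = t * math.sqrt((n1 + n2) / (n1 * n2))
    var = (n1 + n2) / (n1 * n2) + d * d / (2 * (n1 + n2))
    return d, var


def combine_fixed_effects(effects):
    """Combine (effect size, variance) pairs by inverse-variance weighting.
    Returns (mean effect size, combined z-score, two-sided p-value)."""
    weights = [1.0 / var for _, var in effects]
    mean = sum(w * d for w, (d, _) in zip(weights, effects)) / sum(weights)
    se = math.sqrt(1.0 / sum(weights))
    z = mean / se
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail probability
    return mean, z, p


# Hypothetical (t-statistic, n in condition, n in reference) from three studies.
studies = [(3.2, 10, 10), (2.5, 8, 12), (4.0, 15, 15)]
effects = [effect_size_from_t(t, n1, n2) for t, n1, n2 in studies]
mean_effect, z_score, p_value = combine_fixed_effects(effects)
print(mean_effect, z_score, p_value)
```

A random-effects model would add a between-study variance term to each study's variance before weighting, which matters when the studies disagree more than sampling error alone would explain.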
  • Long tail of annotations…
  • Annotating data with ontologies
    • Diverse nature of annotations on data
    • Need to support complex queries which contain semantic information
      • E.g. which genes are under-expressed in brain samples in human or mouse
    • If we annotate with ‘adenocarcinoma’, do we get this data when querying for ‘cancer’? (slide credit: James Malone)
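The need for semantic queries can be illustrated with a minimal sketch of ontology-based query expansion: a query for a term also matches samples annotated with any term below it in the hierarchy. The hierarchy and sample annotations here are hypothetical toy values and do not reproduce the actual structure of EFO.

```python
def descendants(term, children):
    """Return the term itself plus every term below it in the hierarchy.
    `children` maps a term to the list of its direct child terms."""
    found, stack = set(), [term]
    while stack:
        current = stack.pop()
        if current not in found:
            found.add(current)
            stack.extend(children.get(current, []))
    return found


def query(term, annotations, children):
    """Return sample ids annotated with the term or any of its descendants."""
    wanted = descendants(term, children)
    return sorted(s for s, label in annotations.items() if label in wanted)


# Hypothetical toy hierarchy (parent -> direct children).
children = {
    "disease": ["cancer"],
    "cancer": ["carcinoma", "sarcoma"],
    "carcinoma": ["adenocarcinoma"],
}
# Hypothetical sample annotations.
annotations = {
    "sample_1": "adenocarcinoma",
    "sample_2": "sarcoma",
    "sample_3": "normal",
}

cancer_samples = query("cancer", annotations, children)
adeno_samples = query("adenocarcinoma", annotations, children)
print(cancer_samples, adeno_samples)
```

Without the hierarchy, a plain text match on "cancer" would miss both annotated samples, which is the gap the ontology is meant to close.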
  • Decoupling knowledge from data (figure label: Atlas/AE; slide credit: James Malone)
  • Semantically-enriched Queries with EFO
  • We can use the ontology structure. We can perform effect size meta-analysis on a hierarchy, if we follow several rules:
  • Increased statistical power
  • Condition-specificity through EFO
  • Condition-specific Gene Expression
  • Query for genes; query for conditions; species. The ‘advanced query’ option allows building more complex queries: http://www.ebi.ac.uk/gxa
  • Query results for gene ASPM (ArrayExpress): ASPM is downregulated in the ‘normal’ condition in comparison to a disease in 9 studies out of 10. Upregulated in ‘Glioblastoma’ in 3 independent studies. Zoom into one of the ‘Glioblastoma’ studies: each bar represents an expression level in a particular sample
  • ‘wnt pathway’ genes in various cancers (ArrayExpress)
  • Integrating both approaches
    • The first approach gives the global view, but obscures the detail
    • The second approach gives detail, but does not make it easy to integrate everything into one map
    • Can we combine both approaches?
  • Other data
    • RNAseq data
    • Proteomics data – Human Protein Atlas from KTH in Stockholm (collaboration with Mathias Uhlen)
    • Time series – which states does a cell go through on the way from an ESC to a mature cell?
  • Two ways of integrating the data
    • On a quantitative level – normalise all data together
      • Advantages – results easier to interpret
      • Disadvantages – lab effects
    • On a statistics level – analyse each dataset separately first
      • Advantages – fewer lab effects
      • Disadvantages – combined data difficult to interpret (in each experiment each condition is compared to something else)
    • How to combine the two approaches?
  • Acknowledgements
    • Margus Lukk
    • Misha Kapushesky
    • Angela Gonzales
    • Helen Parkinson
    • Gabriella Rustici
    • Ugis Sarkans
    • Ele Holloway
    • Roby Mani
    • Mohammadreza Shojatalab
    • Nikolay Kolesnikov
    • Niran Abeygunawardena
    • Anjan Sharma
    • Miroslaw Dylag
    • Ekaterina Pilicheva
    • Ibrahim Emam
    • Pavel Kurnosov
    • Andrew Tikhonov
    • Andrey Zorin
    • Collaborators
      • Audrey Kaufman (EBI)
      • Wolfgang Huber (EBI)
      • Sami Kaski (Helsinki)
      • Morris Swertz (Groningen)
    • Funding
      • European Commission
        • FELICS
        • MolPAGE
        • ENGAGE
        • MuGEN
        • SLING
        • DIAMONDS
        • EMERALD
      • NIH (NHGRI)
      • EMBL
    • Anna Farne
    • Eleanor Williams
    • Tony Burdett
    • James Malone
    • Holly Zheng
    • Tomasz Adamusiak
    • Susanna-Assunta Sansone
    • Philippe Rocca-Serra
    • Natalija Sklyar
    • Marco Brandizi
    • Chris Taylor
    • Eamonn Maguire
    • Maria Krestyaninova
    • Mikhail Gostev
    • Johan Rung
    • Natalja Kurbatova
    • Katherine Lawler
    • Nils Gehlenborg
    • Lynn French