Kishor Presentation


The presentation lists various approaches for analysing microarray data using R/Bioconductor.

  1. 1. Design and Analysis Strategies for DNA microarray data: hits to targets
  2. 2. Organization of the presentation <ul><li>DNA Microarray data analysis </li></ul><ul><ul><li>gene based </li></ul></ul><ul><ul><li>gene sets based </li></ul></ul><ul><ul><li>functional groups based </li></ul></ul><ul><li>Clone ID </li></ul><ul><li>Lead toxicity investigation using genetic algorithms </li></ul>
  3. 3. Molecular-based discovery <ul><li>The completion of the Human Genome Project, which used sequencing to characterize and map the entire human genome, turned the attention of many researchers to investigating diseases and biological mechanisms at the level of molecules, chiefly DNA, RNA and proteins. </li></ul><ul><li>Once a few disease-related genes had been pinpointed, comparative genomics, which uses principles of evolutionary biology to find similar genes in model organisms, gave researchers extra degrees of freedom to study the underlying biological mechanisms and gain thorough insight into them. </li></ul><ul><li>This ultimately drove the discovery approach towards functional genomics, which quantitatively elicits the patterns associated with diseases or biological mechanisms. </li></ul>
  4. 4. <ul><li>DNA microarrays became popular and useful functional genomics tools. </li></ul><ul><li>The availability of gene sequences for most sequenced organisms made it feasible to design GeneChips for genome-wide surveys that inform target discovery. </li></ul>
  5. 5. Microarrays <ul><li>DNA microarrays simultaneously measure the expression levels of thousands of genes using hybridization and sequence complementarity </li></ul><ul><li>useful tools for detecting biological mechanisms involved in pathogenesis, disease and other phenotypes using comparative methods. </li></ul><ul><li>two types </li></ul><ul><ul><li>Two-channel (spotted arrays) </li></ul></ul><ul><ul><li>Single-channel (oligonucleotide arrays) </li></ul></ul>
  6. 6. Applications of microarrays <ul><li>Biomarker discovery </li></ul><ul><li>Clinical outcome (survival, response to treatment) </li></ul><ul><li>Diagnostic and prognostic inferences </li></ul><ul><li>Regulatory networks (guilt by association) </li></ul><ul><li>Personalized medicine </li></ul>
  7. 7. Microarray <ul><li>Platforms </li></ul><ul><ul><li>Agilent </li></ul></ul><ul><ul><li>Affymetrix </li></ul></ul><ul><ul><li>ABI 1700 </li></ul></ul><ul><li>Common gene-based data analysis methods </li></ul><ul><ul><li>fold change </li></ul></ul><ul><ul><li>t-test (two groups) </li></ul></ul><ul><ul><li>factorial methods (multiple groups) </li></ul></ul><ul><ul><li>time course experiments </li></ul></ul><ul><li>Gene-set-based analysis </li></ul><ul><ul><li>GSEA </li></ul></ul><ul><ul><li>Gene Ontology (GOstats, topGO) </li></ul></ul>
  8. 8. Affymetrix <ul><li>Commonly referred to as GeneChips </li></ul><ul><ul><li>Each gene is represented by 16-20 oligonucleotides, each made of 25 nucleotides (A, C, T, G) </li></ul></ul><ul><ul><li>probe pair : PM/MM (perfect match / mismatch) </li></ul></ul><ul><ul><li>probe set : vector of all probe pairs for a gene </li></ul></ul><ul><ul><li>MM indicates non-specific binding. </li></ul></ul><ul><ul><li>MA plots can be used to understand probe-specific and intensity-specific non-specific binding. </li></ul></ul>
  9. 9. Preprocessing methods (BMC Bioinformatics 2006, 7 :105)
  10. 10. Preprocessing <ul><li>Normalization </li></ul><ul><ul><li>Global </li></ul></ul><ul><ul><ul><li>Mean centering </li></ul></ul></ul><ul><ul><ul><li>MA – plots (two channel) </li></ul></ul></ul><ul><ul><ul><li>Quantile normalization </li></ul></ul></ul><ul><ul><li>Local </li></ul></ul><ul><ul><ul><li>Loess (intensity dependent) </li></ul></ul></ul><ul><ul><ul><li>Lowess (remove dye effects) </li></ul></ul></ul>
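Quantile normalization, listed above among the global methods, can be illustrated with a minimal base-R sketch. The function name `quantile_normalize` and the intensity values are made up for illustration, and ties are broken by order of appearance here, which is a simplification and not necessarily how Bioconductor packages treat them.

```r
# Quantile normalization sketch in base R. Rows are probes, columns are arrays.
quantile_normalize <- function(x) {
  ranks <- apply(x, 2, rank, ties.method = "first")  # rank of each probe within its array
  sorted <- apply(x, 2, sort)                         # each array sorted independently
  reference <- rowMeans(sorted)                       # mean of the k-th smallest values
  normalized <- apply(ranks, 2, function(r) reference[r])
  dimnames(normalized) <- dimnames(x)
  normalized
}

# Made-up intensities for four probes on three arrays (matrix is filled column by column).
expr <- matrix(c(5, 2, 3, 4,
                 4, 1, 4, 2,
                 3, 4, 6, 8),
               nrow = 4,
               dimnames = list(paste0("probe", 1:4), paste0("array", 1:3)))

normalized <- quantile_normalize(expr)
print(normalized)
```

After normalization every array contains the same set of values, so the arrays share one empirical distribution while each probe keeps its within-array rank.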
  11. 11. Common experimental inquiries <ul><li>gene knock-out </li></ul><ul><li>time-series </li></ul><ul><li>phenotypic differences </li></ul><ul><li>drug effects </li></ul><ul><li>disease associated pathways and biological mechanisms </li></ul>
  12. 12. LIMMA: Linear Models for Microarray Data (Smyth, G. K. (2004), Statistical Applications in Genetics and Molecular Biology 3, No. 1, Article 3) <ul><li>fits a linear model to each gene, based on the RNA sources and the contrasts of interest, to test for differential expression </li></ul><ul><li>the underlying statistics borrow information across genes/probes when assessing differential expression under the experimental design </li></ul><ul><li>works efficiently even for experiments with small sample sizes. </li></ul><ul><li>some contrasts may not require replicates (depending on the variability between the sources being compared). </li></ul><ul><li>supports factorial designs </li></ul>
  13. 13. Examples of comparisons
  14. 14. Mock experimental design <ul><li>Notation (factors: drug treatment, age) </li></ul><ul><ul><li>DG 1-10 : treated with drug A </li></ul></ul><ul><ul><li>PL 1-10 : placebo </li></ul></ul><ul><ul><li>DG.Y 1-4 : young patients treated with drug A </li></ul></ul><ul><ul><li>DG.O 5-10 : old patients treated with drug A </li></ul></ul><ul><ul><li>PL.Y 1-6 : young patients on placebo </li></ul></ul><ul><ul><li>PL.O 7-10 : old patients on placebo </li></ul></ul>
  15. 15. <ul><li>> cont.matrix <- makeContrasts ( </li></ul><ul><li>PL.YvsO = PL.Y - PL.O , </li></ul><ul><li>DG.YvsO = DG.Y - DG.O , </li></ul><ul><li>Diff = (DG.Y - DG.O) - (PL.Y - PL.O) , </li></ul><ul><li>levels = design </li></ul><ul><li>) </li></ul><ul><li>> fit2 <- contrasts.fit(fit, cont.matrix) # fit is assumed to be the lmFit() result for this design </li></ul><ul><li>> fit2 <- eBayes(fit2) </li></ul><ul><li>topTable(fit2, coef = "Diff") # combined (interaction) effect </li></ul><ul><li>topTable(fit2, coef = "PL.YvsO") # age effect in placebo </li></ul><ul><li>topTable(fit2, coef = "DG.YvsO", adjust.method = "BH") # age effect in drug treated </li></ul>
  16. 16. Steps involved <ul><li>construct a design matrix from the targets file </li></ul><ul><li>fit a linear model to each gene (lmFit) </li></ul><ul><li>specify the contrasts of interest (makeContrasts) and apply them to the fit (contrasts.fit) </li></ul><ul><li>assess differential expression using the eBayes method </li></ul>
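To make the design-matrix and contrast steps concrete, here is a minimal base-R sketch for a single gene. It is not limma itself: the expression values are simulated, the fit is ordinary least squares via `lm.fit`, and there is no empirical Bayes moderation. The group sizes follow the mock design above, and the group labels are assumed to match those used in the contrast code.

```r
# Single-gene sketch of the design-matrix and contrast steps in base R.
set.seed(1)

# Group sizes follow the mock design: 4 young / 6 old on drug, 6 young / 4 old on placebo.
groups <- c("DG.Y", "DG.O", "PL.Y", "PL.O")
group <- factor(rep(groups, times = c(4, 6, 6, 4)), levels = groups)

# Simulated log2 expression for one hypothetical gene (not real data).
y <- rnorm(length(group), mean = 8, sd = 0.5)

# Cell-means design matrix: one column per group, no intercept.
design <- model.matrix(~ 0 + group)
colnames(design) <- groups

# Ordinary least-squares fit; the coefficients are the four group means.
fit <- lm.fit(design, y)
group_means <- fit$coefficients

# Contrast weights for Diff = (DG.Y - DG.O) - (PL.Y - PL.O).
diff_contrast <- c(DG.Y = 1, DG.O = -1, PL.Y = -1, PL.O = 1)
diff_estimate <- sum(diff_contrast * group_means[names(diff_contrast)])
print(diff_estimate)
```

With a cell-means design each contrast is simply a weighted sum of group means, which is why the interaction contrast can be written directly in terms of the four group names.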
  17. 17. Interpretation of results <ul><li>Statistics to assess differential expression using LIMMA </li></ul><ul><ul><li>Moderated t-statistic </li></ul></ul><ul><ul><ul><li>Similar to the ordinary t-statistic, except that the standard errors are moderated using information from the expression values of all genes. </li></ul></ul></ul><ul><ul><li>B-statistic </li></ul></ul><ul><ul><ul><li>log-odds that a gene is differentially expressed </li></ul></ul></ul><ul><ul><li>F-statistic </li></ul></ul><ul><ul><ul><li>assesses whether a gene is differentially expressed on any of the contrasts, by testing the coefficients of all contrasts together. </li></ul></ul></ul>
  18. 18. Significance Analysis of Microarrays (SAM) <ul><li>measures differential expression, including in time-course experiments. </li></ul><ul><li>assesses the significance of differential expression of genes using repeated permutations of the sample labels </li></ul><ul><li>supports several experimental designs </li></ul><ul><li>works efficiently even for small sample sizes </li></ul>
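The label-permutation idea behind SAM can be sketched in base R for a two-class design. This is a simplified illustration, not the SAM software: the data are simulated, the fudge factor `s0` and threshold `delta` are fixed by assumption (SAM estimates `s0` from the data), and the calling rule is reduced to a plain comparison of observed and expected order statistics.

```r
# SAM-style relative difference with a label-permutation null, in base R.
set.seed(42)

n_genes <- 200
labels <- rep(c(0, 1), each = 5)                  # 5 control and 5 treated arrays
expr <- matrix(rnorm(n_genes * length(labels)), nrow = n_genes)
expr[1:10, labels == 1] <- expr[1:10, labels == 1] + 3   # spike in 10 changed genes

# d = (mean difference) / (pooled standard error + s0)
sam_stat <- function(x, labels, s0) {
  g1 <- x[, labels == 1, drop = FALSE]
  g0 <- x[, labels == 0, drop = FALSE]
  n1 <- ncol(g1)
  n0 <- ncol(g0)
  ss <- rowSums((g1 - rowMeans(g1))^2) + rowSums((g0 - rowMeans(g0))^2)
  se <- sqrt(ss / (n1 + n0 - 2) * (1 / n1 + 1 / n0))
  (rowMeans(g1) - rowMeans(g0)) / (se + s0)
}

s0 <- 0.1      # assumed fudge factor; SAM estimates it from the data
delta <- 1.0   # assumed calling threshold

observed <- sam_stat(expr, labels, s0)

# Expected order statistics: average the sorted statistics over permuted labels.
perm_sorted <- replicate(200, sort(sam_stat(expr, sample(labels), s0)))
expected <- rowMeans(perm_sorted)

# Simplified call: genes whose ordered statistic departs from expectation by more than delta.
gene_order <- order(observed)
called <- gene_order[abs(observed[gene_order] - expected) > delta]
print(sort(called))
```

Plotting `sort(observed)` against `expected` gives the kind of display usually called a SAM plot, where genes far from the diagonal are candidates for differential expression.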
  19. 19. Experimental designs supported by SAM (Chu, G., Narasimhan, B., Tibshirani, R. & Tusher, V. (2002), Significance analysis of microarrays (SAM) software)
  20. 20. Sample input format (Chu, G., Narasimhan, B., Tibshirani, R. & Tusher, V. (2002), Significance analysis of microarrays (SAM) software)
  21. 21. SAM statistics (Chu, G., Narasimhan, B., Tibshirani, R. & Tusher, V. (2002), Significance analysis of microarrays (SAM) software)
  22. 23. SAM plot
  23. 24. <ul><li>Limitations of gene-based approaches </li></ul><ul><ul><li>arbitrary cutoffs </li></ul></ul><ul><ul><li>overly stringent criteria (an effect of multiple hypothesis testing) </li></ul></ul><ul><ul><li>speculative selection </li></ul></ul><ul><ul><li>no efficient way to distinguish differential expression caused by experimental noise from a true biological signal. </li></ul></ul><ul><ul><li>incoherence between the results of multiple microarray experiments </li></ul></ul>
  24. 25. Gene set enrichment analysis (GSEA) <ul><li>cross comparison and validation of multiple experiments with relevant biological motives </li></ul><ul><li>gene set based interrogation of microarray data </li></ul><ul><li>infer pathway enrichment / analysis and gene regulatory networks </li></ul><ul><li>biomarker detection </li></ul><ul><li>refinement or drilling down gene lists </li></ul>
  25. 26. Methodology <ul><li>Choose a ranking metric for sorting genes based on their correlation with the phenotype </li></ul><ul><li>Compute a running-sum statistic (enrichment score) that reflects how strongly the genes of a set are overrepresented at the extremes of the rank-ordered list. </li></ul><ul><li>Estimate the significance of the enrichment score relative to a null distribution (empirical phenotype-based permutation test). </li></ul><ul><li>Adjust for multiple hypothesis testing using the normalized enrichment score (which takes gene set size into account) and control the FDR, the estimated probability that a gene set with a given normalized enrichment score is a false positive. </li></ul>
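The running-sum statistic described above can be sketched in a few lines of base R. The gene names, ranking metric and gene set are made up, `enrichment_score` is a hypothetical helper and not part of the GSEA software, and the null distribution uses random gene sets instead of the phenotype permutation named on the slide, purely to keep the example short.

```r
# Running-sum enrichment score (weighted Kolmogorov-Smirnov-like statistic).
enrichment_score <- function(ranked_genes, ranking_metric, gene_set, p = 1) {
  in_set <- ranked_genes %in% gene_set
  hit_weight <- abs(ranking_metric)^p * in_set
  step_up <- hit_weight / sum(hit_weight)   # increments at gene-set members
  step_down <- (!in_set) / sum(!in_set)     # decrements at all other genes
  running <- cumsum(step_up - step_down)
  running[which.max(abs(running))]          # maximum deviation from zero
}

set.seed(7)
genes <- paste0("g", 1:100)
metric <- sort(rnorm(100), decreasing = TRUE)   # hypothetical metric, already rank-ordered
gene_set <- genes[c(1, 2, 3, 5, 8)]             # hypothetical set near the top of the list

es <- enrichment_score(genes, metric, gene_set)

# Simplified null: random gene sets of the same size (an assumption; the slide
# describes permuting phenotype labels and re-ranking the genes instead).
null_es <- replicate(500, enrichment_score(genes, metric,
                                           sample(genes, length(gene_set))))
p_value <- mean(abs(null_es) >= abs(es))
print(c(ES = es, p = p_value))
```

A gene set concentrated at the top of the ranked list drives the running sum up early, so its enrichment score is close to 1; a set scattered through the list stays near 0.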
  26. 27. Subramanian, Tamayo, et al. ( 2005, PNAS 102, 15545-15550 )
  27. 28. Leading edge subset Subramanian, Tamayo, et al. ( 2005, PNAS 102, 15545-15550 )
  28. 29. Subramanian, Tamayo, et al. ( 2005, PNAS 102, 15545-15550 )
  29. 30. Novel methodology based on gene set enrichment <ul><li>gives the option of preserving gene-gene correlations while computing enrichment statistics. </li></ul><ul><li>user-friendly tool with a programmatic interface (API). </li></ul><ul><li>availability of a curated gene set database, the Molecular Signatures Database (MSigDB) </li></ul>
  30. 31. Caveats <ul><li>requires pre-defined gene sets, so it depends on their availability. </li></ul><ul><li>more knowledge-based than discovery-based when inferring biological mechanisms; this is mitigated to some extent by the extensive collection of gene sets provided through MSigDB. </li></ul>
  31. 32. Enriched gene sets Phenotype (
  32. 33. Enriched gene sets in mutant (
  33. 34. RNA interference <ul><li>“ RNA interference (RNAi), a form of post-transcriptional gene silencing induced by introduction of double-stranded RNA (dsRNA), has become a powerful experimental tool for studying gene function.” [7] </li></ul><ul><li>“ For drug developers, RNAi phenotypes can provide clues about what to assay to screen antagonist drug candidates” [7]. </li></ul>
  34. 35. <ul><li>Uses the principle of reverse genetics to understand changes in biological pathways by simultaneously knocking down (silencing) multiple genes. </li></ul><ul><li>depends on siRNA libraries built to target specific genes and proteins. </li></ul><ul><li>A carefully designed RNAi screen is equivalent to performing multiple gene knock-out microarray experiments. </li></ul><ul><li>Can be performed using siRNAs (better specificity) and miRNAs </li></ul>
  35. 36. Endocytic pathways <ul><li>Endocytosis is a process in which molecules (cargos) are transported into the cytoplasm using membrane proteins. </li></ul><ul><ul><li>cell surface selection </li></ul></ul><ul><ul><li>budding and pinching off </li></ul></ul><ul><ul><li>recruited to target protein </li></ul></ul><ul><li>Pathways can be inferred using high-resolution microscopy, which, combined with image processing tools, provides quantitative and qualitative information about endocytosed complexes. </li></ul><ul><li>Useful for understanding cell growth, development and pathogenesis. </li></ul>
  36. 37. Gene Ontology (Description) <ul><li>Since the completion of the Human Genome Project, a major challenge has been the annotation and standardized dissemination of information related to genes and gene products. </li></ul><ul><li>The Gene Ontology (GO) Consortium derived an ontology that captures and represents gene features and their relationships using directed acyclic graphs. </li></ul><ul><li>Accordingly, gene attributes are broadly classified into 3 categories </li></ul><ul><ul><li>Biological Process </li></ul></ul><ul><ul><li>Molecular Function </li></ul></ul><ul><ul><li>Cellular Component (subcellular localization) </li></ul></ul>
  37. 39. biomaRt (Bioconductor interface to the BioMart software suite) (The biomaRt user’s guide, Steffen Durinck, Wolfgang Huber)
  38. 42. GO based functional characterization of gene sets using topGO <ul><li>Biological interpretation of gene lists obtained from microarray or high throughput screening platforms using gene ontology based on overlap statistics. </li></ul><ul><li>Not only useful for functional based characterization of gene lists but can also provide clues of co-expressed genes. </li></ul><ul><li>Along with providing built-in statistical methodologies, features customizable incorporation of user chosen statistics for assessing the differential expression and enrichment of GO terms. </li></ul>
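The overlap statistics mentioned above usually reduce, in the simplest case, to a one-sided Fisher's exact test on a 2x2 table for each GO term. The sketch below uses base R only and hypothetical counts; it illustrates the classic test and does not reproduce the topGO API or its elim and weight algorithms.

```r
# Over-representation test for one GO term; all counts are hypothetical.
n_universe  <- 10000   # genes measured on the array
n_annotated <- 200     # genes annotated to the GO term
n_selected  <- 500     # differentially expressed genes
n_overlap   <- 30      # selected genes that carry the annotation

# 2x2 table, filled column by column: rows = annotated, columns = selected.
counts <- matrix(
  c(n_overlap,               n_selected - n_overlap,
    n_annotated - n_overlap, n_universe - n_selected - n_annotated + n_overlap),
  nrow = 2,
  dimnames = list(annotated = c("yes", "no"), selected = c("yes", "no"))
)

# One-sided Fisher's exact test for enrichment of the term among selected genes.
fisher_p <- fisher.test(counts, alternative = "greater")$p.value

# The same tail probability from the hypergeometric distribution.
hyper_p <- phyper(n_overlap - 1, n_annotated, n_universe - n_annotated,
                  n_selected, lower.tail = FALSE)

print(c(fisher = fisher_p, hypergeometric = hyper_p))
```

With these counts about 10 annotated genes would be expected among the selected genes by chance, so an overlap of 30 gives a very small p-value. Testing many GO terms this way still requires a multiple-testing correction, for example `p.adjust(p_values, method = "BH")`.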
  39. 43. Alexa et al., Bioinformatics 22(13), 1600-1607, 2006
  40. 44. <ul><li>Elim </li></ul><ul><ul><li>reduces overlap by iteratively removing genes from the ancestral nodes of a significantly enriched node (GO term). </li></ul></ul><ul><ul><li>more stringent than the weight algorithm in reducing false positives. </li></ul></ul><ul><ul><li>works better with small values of k (the number of differentially expressed genes) </li></ul></ul><ul><li>Weight </li></ul><ul><ul><li>the score of a significant node is computed by down-weighting the scores of the genes it shares with its children. </li></ul></ul><ul><ul><li>significant nodes and the vector of weights are recursively updated. </li></ul></ul>
  41. 45. Alexa et al., Bioinformatics 22(13), 1600-1607, 2006
  42. 52. Clone ID <ul><li>Bergey’s vs. Phylotypes </li></ul><ul><li>The classification tools used in our analysis, each followed by the classification schema it uses: </li></ul><ul><ul><li>NCBI’s MegaBLAST: NCBI’s Taxonomy hierarchy browser </li></ul></ul><ul><ul><li>RDP II: Bergey’s Manual </li></ul></ul><ul><ul><li>RDPquery: Bergey’s Manual </li></ul></ul><ul><ul><li>SIMO: Bergey’s Manual </li></ul></ul><ul><ul><li>Clone ID MegaBLAST: Phylotypes </li></ul></ul><ul><li>Bergey’s Manual is based on polyphasic numerical taxonomy and provides information about multiple phenotypic traits. Classification based on Bergey’s Manual is complicated, expensive, and time consuming. In contrast, classification using 16S rRNA phylotypes is more objective, faster, and less expensive. </li></ul>
  43. 54. Relational Database Development <ul><li>Normalization </li></ul><ul><ul><li>1st Normal Form </li></ul></ul><ul><ul><li>2nd Normal Form </li></ul></ul><ul><ul><li>3rd Normal Form </li></ul></ul><ul><ul><li>BCNF </li></ul></ul><ul><li>E-R Diagrams </li></ul><ul><li>Joins (outer, inner, self) </li></ul><ul><li>Aggregate functions (sum, count, min, ...) </li></ul><ul><li>Miscellaneous (decode, nvl, instr, ...) </li></ul>
  44. 56. References <ul><li> </li></ul><ul><li>Subramanian, Tamayo, et al. (2005, PNAS 102, 15545-15550) and Mootha, Lindgren, et al. (2003, Nat Genet 34, 267-273). </li></ul><ul><li>Chu, G., Narasimhan, B., Tibshirani, R. & Tusher, V. (2002), Significance analysis of microarrays (SAM) software </li></ul><ul><li>Alexa, A., Rahnenführer, J. & Lengauer, T. (2006). Improved scoring of functional groups from gene expression data by decorrelating GO graph structure. Bioinformatics 22(13), 1600-1607. </li></ul><ul><li> </li></ul><ul><li> </li></ul><ul><li>Schmitz, C., Kinge, P. & Hutter, H. Axon guidance genes identified in a large-scale RNAi screen using the RNAi-hypersensitive Caenorhabditis elegans strain nre-1(hd20) lin-15b(hd126). </li></ul><ul><li>Smyth, G. K. (2004). Linear models and empirical Bayes methods for assessing differential expression in microarray experiments. Statistical Applications in Genetics and Molecular Biology 3 , No. 1, Article 3. </li></ul><ul><li>Smyth, G. K. (2005). Limma: linear models for microarray data. In: Bioinformatics and Computational Biology Solutions using R and Bioconductor , R. Gentleman, V. Carey, S. Dudoit, R. Irizarry, W. Huber (eds.), Springer, New York, pages 397-420 </li></ul>
  45. 57. Acknowledgements <ul><li>George Mason University </li></ul><ul><ul><li>Glenda Wilson (MS advisor) </li></ul></ul><ul><ul><li>Dr. Patrick Gillevet (thesis advisor) </li></ul></ul><ul><ul><li>Prof. James Willett </li></ul></ul><ul><li>GSK </li></ul><ul><ul><li>Amy Creech (Supervisor) and Workbench team </li></ul></ul><ul><li>Vanderbilt University </li></ul><ul><ul><li>Prof. Frank Harrell (supervisor) </li></ul></ul><ul><ul><li>Dr. Christine Konradi </li></ul></ul><ul><ul><li>Dr. Jay Snoddy </li></ul></ul><ul><ul><li>Dr. Karoly Mirnics </li></ul></ul><ul><ul><li>Dr. Lily Wang </li></ul></ul><ul><ul><li>Dr. Jeff Franklin </li></ul></ul><ul><li>NCBS </li></ul><ul><ul><li>Prof. Satyajit Mayor </li></ul></ul><ul><ul><li>Dr. Gagan Gupta </li></ul></ul><ul><ul><li>Mr. Gautam Dey </li></ul></ul><ul><li>BITS, Pilani </li></ul><ul><ul><li>Dr. V.S Rao </li></ul></ul><ul><ul><li>Dr. N.V.Muralidhara Rao </li></ul></ul><ul><ul><li>Dr. A.P.Koley </li></ul></ul>