Michael Reich, Broad Institute of Harvard and MIT, July 14, 2012
M Reich - GenomeSpace

Presentation at BOSC2012 by M Reich - GenomeSpace

  • The key observation is that making the transitions between tools easier greatly accelerates the use of those tools in concert.
  • The key challenge is the growing gap between the need to use a variety of different analysis and visualization tools and the difficulty of getting tools from different sources to work together. There are 7-10K bioinformatics tools. The Broad alone lists ~60 tools on its external website, which doesn’t include internal-use tools, LIMS, or data processing/management pipelines. There are also ~5K public databases.
  • Data storage in the cloud is used in a huge number of applications for its cost, scalability and accessibility. Web-based APIs have likewise been used to connect disparate Web resources very easily. These were the principles we used when we developed GenomeSpace.
  • GenomeSpace is a project to build an online community to share, find, and interoperate diverse computational tools. Tools retain their identity and use as stand-alone software, and GenomeSpace maintains their native look and feel. Our goal is to bring the ever-changing wealth of genomic analysis methods, and whatever data is required, to the fingertips of any biologist. Seeded with 6 popular genomics tools representing diverse architectures (Cytoscape, Galaxy, GenePattern, Genomica, IGV, UCSC Browser). Supports interoperability through frictionless data transfer, with reproducibility, analytic workflows, and comprehensive documentation. Development driven by 3 driving biological projects (cancer, lincRNAs, stem cell circuits, and patient stratification). Lives in the cloud. Next phase: starting to engage new tools and new biomedical projects. Current participating institutions.
  • Seed tools span a wide variety of genomic analysis areas: network analysis and visualization; sequence analysis; functional genomics analysis; module map analysis; integrated genomics and sequence visualization.
  • View GenomeSpace files from within GenePattern and use a file from GenomeSpace in a GenePattern analysis. Go to the GenomeSpace UI.
  • This is a way to get GEO datasets into any tool that will accept gene expression files. You also don’t need to convert from MAGE-TAB. We are also working with the developers of ArrayExpress to GenomeSpace-enable their repository.
  • Let’s show you how it works. One of the scientific collaborators, Aviv Regev, recently published work on the regulatory control of human hematopoiesis. For this GenomeSpace demonstration, we will show a simplified version of part of this work.
  • This is an overview of the analytic workflow of the work done by Aviv and colleagues. This is months of research condensed to one slide. The analysis can be divided into 5 major parts. The first part comprises data preprocessing and quality control, the second part is a basic analysis, the third part comprises studying the transcriptional program, part four is a cis-regulatory analysis, and part five aims to find new transcription factor regulators. Each circle represents work done in one of the seed tools, which you can see here, colored by the identity of the tool. Circles colored in grey represent manual steps. They are manual because the functionality doesn't exist in any of the tools. During this demo, we will mention when we have to do such a manual step. The number inside each circle represents the number of steps that are performed within the tool. The arrows represent the different paths that can be followed. At certain points, indicated by the blue boxes, there are alternative optional paths that can be followed. We will now zoom in on part 5 in this demo, which is on finding new transcription factor regulators.
  • What we’re actually doing in this demo is the following. We start from a very broad set of transcription factors. We then find those transcription factors that are significantly differentially expressed in a particular lineage as compared to the other lineages. These we’ll call our lineage-dependent transcription factors. We then infer regulatory programs using Module Networks from Genomica. Regulatory programs consist of modules of coexpressed genes that themselves are coregulated. These regulators we then visualize and validate using previously published GWAS SNPs and linkage regions. We’re now going to walk you through the steps in this demo. We first start in Genomica and there we load the full expression data containing more than 200 samples and about 8000 genes together with a gene set that contains the GO transcription factors. The blue squares next to the genes indicate which genes are GO transcription factors. In Genomica we then can save the expression data from only the GO transcription factors into a text file. In GenePattern we want to refine the GO transcription factors to a set of lineage-specific transcription factors. We do this using the ComparativeMarkerSelection and ExtractComparativeMarkerResults modules in GenePattern. So we basically use a t-test to assess differential expression of transcription factors in each lineage versus the rest of the lineages, and we can use criteria such as an FDR < 0.05 to actually select significantly differentially expressed transcription factors. As an example, we continue with the transcription factors that are specific to the hematopoietic stem cell (HSC) lineage. Back in Genomica, we run Module Networks on the full expression data, while also loading the lineage-specific transcription factors we just generated in GenePattern. As running Module Networks usually takes a while, we now immediately load our previously saved results. In this case, we have 80 modules of coexpressed genes for which we can generate a list of the potential regulators. This list allows us to browse very easily through the results. Here you see a module that has SMAD4 as its top regulator. SMAD4 divides the sample space into 2 partitions, while other regulators further subdivide the samples. As a last step, we want to visualize and validate our list of potential regulators. To do that, we follow 2 approaches in parallel. Both approaches require us to manually create a .bed annotation track that contains the coordinates of the regulators. To validate our regulators, we want to find overlaps between our regulators and previously published GWAS SNPs and linkage regions. We download these SNPs and linkage regions and we manually create .bed annotation tracks for both of them. The first approach is to visualize our regulators in IGV. This requires us to upload the 3 .bed annotation tracks in IGV. In parallel, we submit the same 3 .bed annotation tracks to Galaxy, where we can do overlap analysis between regulators, SNPs and linkage regions in several steps. From this analysis, we get an Excel file that displays the overlaps in table format. Interesting also is that this table contains the PubMed IDs of the papers in which a particular GWAS SNP has been published in relation to a particular disease.
  • To launch desktop-based tools, we first get the prompt that we’re downloading the application in jnlp format
  • NOTE: The file sent to GenePattern is still a Genomica-formatted file. We specify in the URL what format we want to convert the file to
  • We send this file directly from the GenePattern interface to Genomica, now converting it back to a Genomica-formatted file
  • We now see Genomica open with the data file, and we perform a module map analysis on this data to determine regulators of module networks. Since we’ll need some other data for the next step, we save the results to GenomeSpace
  • From the GenomeSpace UI, we’re now going to send these files to IGV
  • We now see IGV open with our tracks. What’s under the covers…
  • The beta release is available for use! A few brave men and women (explorers)
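
The lineage-specific selection described in the notes above (a t-test per transcription factor, one lineage versus the rest, followed by an FDR < 0.05 cutoff) can be sketched in plain Python. This is only an illustration under stated assumptions, not the ComparativeMarkerSelection implementation: it assumes a Welch t-statistic, p-values from label permutation, and Benjamini-Hochberg adjustment. The function names, gene names (`TF_A`, `TF_B`) and expression values are all hypothetical.

```python
import random
from statistics import mean, variance


def welch_t(a, b):
    """Welch t-statistic for two samples (each needs at least 2 values)."""
    se = (variance(a) / len(a) + variance(b) / len(b)) ** 0.5
    return (mean(a) - mean(b)) / se if se > 0 else 0.0


def permutation_pvalue(a, b, n_perm=1000, seed=0):
    """Two-sided p-value from shuffling the lineage labels (assumed test)."""
    rng = random.Random(seed)
    observed = abs(welch_t(a, b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if abs(welch_t(pooled[:len(a)], pooled[len(a):])) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)  # add-one keeps p above zero


def benjamini_hochberg(pvals):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p to the smallest
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def select_markers(expression, in_lineage, fdr_cutoff=0.05):
    """expression: {gene: [values per sample]}; in_lineage: [bool per sample]."""
    genes = list(expression)
    pvals = []
    for gene in genes:
        values = expression[gene]
        inside = [v for v, flag in zip(values, in_lineage) if flag]
        outside = [v for v, flag in zip(values, in_lineage) if not flag]
        pvals.append(permutation_pvalue(inside, outside))
    fdr = benjamini_hochberg(pvals)
    return [g for g, q in zip(genes, fdr) if q < fdr_cutoff]


# Made-up data: 5 samples in the lineage of interest, 5 in the rest.
in_lineage = [True] * 5 + [False] * 5
expression = {
    "TF_A": [9.1, 9.4, 8.9, 9.6, 9.2, 5.0, 5.3, 4.8, 5.1, 4.9],
    "TF_B": [5.0, 5.2, 4.9, 5.1, 5.0, 5.1, 4.9, 5.0, 5.2, 5.0],
}
print(select_markers(expression, in_lineage))
```

With only five samples per group the smallest achievable permutation p-value is limited by the number of distinct label assignments, so a real analysis over 200+ samples would behave differently from this toy case.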
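
The overlap analysis in step 4 of the demo (regulators against GWAS SNPs and linkage regions, each supplied as a .bed track) comes down to an interval-intersection test. The sketch below is a standard-library illustration, not the Galaxy workflow used in the demo. It assumes the BED layout of chrom, start, end and an optional name column, with 0-based, half-open coordinates. Every name and coordinate in it is invented.

```python
from collections import namedtuple

Interval = namedtuple("Interval", "chrom start end name")


def parse_bed(text):
    """Parse BED lines: chrom, start, end and an optional name column."""
    intervals = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blanks, comments and UCSC-style header lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split()
        name = fields[3] if len(fields) > 3 else "."
        intervals.append(Interval(fields[0], int(fields[1]), int(fields[2]), name))
    return intervals


def overlaps(a, b):
    """True when two half-open intervals on the same chromosome intersect."""
    return a.chrom == b.chrom and a.start < b.end and b.start < a.end


def find_overlaps(regulators, features):
    """Return (regulator name, feature name) pairs that overlap."""
    return [(r.name, f.name) for r in regulators for f in features if overlaps(r, f)]


# Hypothetical tracks; real ones would be read from the three .bed files.
regulators = parse_bed("""
track name=regulators
chr1\t1000\t5000\tTF_A
chr2\t200\t900\tTF_B
""")
snps = parse_bed("""
chr1\t4999\t5000\tsnp_1
chr1\t5000\t5001\tsnp_2
chr2\t100\t150\tsnp_3
""")
linkage = parse_bed("chr2\t0\t10000\tregion_1")

print(find_overlaps(regulators, snps))
print(find_overlaps(regulators, linkage))
```

The all-pairs comparison is fine for a short regulator list; for genome-scale tracks, sorting by chromosome and start position or using an interval tree would be the usual choice.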

    1. Michael Reich, Broad Institute of Harvard and MIT, July 14, 2012
    2. Need: Insights through integrative studies • Gene mutation causing Leigh Syndrome French Canadian Type (LSFC) and 8 other mitochondrial diseases. Integrate: candidate genomic region, mitochondrial proteomic data, and cancer expression compendium. Authors: Mootha et al. 2003, Calvo et al. 2006 • Discovery of 3 new genes involved in Glioblastoma Multiforme (NF1, ERBB2, PIK3R1); confirmation of TP53, PTEN, EGFR, RB1, PIK3CA. Integrate: DNA sequence, copy number, methylation aberrations and expression profiles in 206 glioblastomas. Authors: TCGA Research Network 2008 • Subtle repression of oxphos genes in diabetic muscle: role for mitochondrial dysregulation in diabetes pathology; new computational approach, Gene Set Enrichment Analysis. Integrate: gene sets/pathways & processes with expression profiles. Authors: Mootha et al. 2003, Subramanian & Tamayo et al. 2005 • ~3000 novel, large non-coding RNAs with functions in development, the immune response and cancer. Integrate: genome sequences from 21 mammals, epigenomic maps, and expression profiles. Authors: Guttman, Rinn et al. 2009 • IKBKE as a new breast cancer oncogene. Integrate: RNAi screens, transformation of activated kinases, and copy number from SNP arrays of cell lines. Authors: Boehm et al. 2007 • Characterization of disease subtypes and improved risk stratification for medulloblastoma patients. Integrate: copy number, expression, clinical data for 96 medulloblastoma patients. Authors: Tamayo et al., Cho et al. 2011
    3. Translational Research Example [workflow diagram across GenePattern, Cytoscape, IGV/UCSC, Genomica and CMAP, with numbered analysis steps 1-8 and conclusions i-vi; file-format pairs shown at the tool transitions include GCT/GXP, ODF/GMT, GXC/GRP and GFF/GXA]
    4. 12 steps, 6 tools, 7 transitions [diagram of the same workflow across GenePattern, Cytoscape, IGV, Genomica, CMAP and UCSC Browser; legend: analysis step, analysis conclusion, within tool, across tools]
    5. The Need • A lightweight “connection layer” for a wide variety of integrative genomic analyses o Support for all types of resource: Web-based, desktop, etc. o Automatic conversion of data formats between tools o Easy access to data from any location o Any tool that joins is automatically connected to the community of tools o Ease of entry into the environment
    6. Cloud-based storage; API connectivity layer
    7. Online community to share diverse computational tools • 6 Seed Tools: Cytoscape, Galaxy, GenePattern, Genomica, IGV, UCSC Browser • 3 Driving Biological Projects: lincRNAs, Cancer stem cells, Patient Stratification • New tools • New Biological Projects • www.genomespace.org
    8. GenomeSpace Principles • Aimed towards non-programming users • Support interoperability through automatic cross-tool data transfer • Requires minimal changes to tools • www.genomespace.org
    9. GenomeSpace Components [diagram: GS-enabled tools (GenePattern, Galaxy, Cytoscape, Integrative Genomics Viewer); GenomeSpace Server with 1 Authentication and Authorization, 2 Analysis and Tool Manager, 3 Data Manager; GenomeSpace project data; external data sources and tools (geWorkbench)]
    10. Seed Tools: Cytoscape, Galaxy, GenePattern, Genomica, IGV, UCSC Genome Browser
    11. New Tools • Recently added: InSilicoDB (University of Brussels), Cistrome (Dana-Farber Cancer Institute) • In development: geWorkbench (Columbia University), Reactome (Ontario Institute of Cancer Research), ArrayExpress (EBI)
    12. Using GenomeSpace [screenshot with callouts: Tools and Data Sources; Actions; Cloud-based filesystem]
    13. GenomeSpace Actions
    14. GenomeSpace Tool Enablement: IGV
    15. GenomeSpace Tool Enablement: GenePattern
    16. GenomeSpace Tool Enablement: GenePattern
    17. GenomeSpace Data Source Integration: InSilico DB
    18. Other collaborating projects • Taverna/MyExperiment (University of Manchester) • National Center for Biomedical Ontology (Stanford University)
    19. DBP3: Studying the regulatory control of human hematopoiesis
    20. DBP3: Studying the regulatory control of human hematopoiesis: Overview [workflow diagram. Part 1: Data pre-processing and quality control; Part 2: Basic analysis; Part 3: Studying the transcriptional program; Part 4: cis-regulatory site analysis; Part 5: Finding new transcription factor regulators. Circles are colored by tool: Genomica, GenePattern, IGV, Cytoscape, Galaxy, or manual step. Legend: analysis section; analysis step (# steps); analysis conclusion; optional choices; currently integrated; not yet integrated]
    21. Part 5: Finding new transcription factor regulators [workflow diagram: Steps 1-4, continuing from part 4]
    22. Step 1: Create transcription factor dataset in Genomica and save to GenomeSpace
    23. Step 2: Send transcription factor datasets into GenePattern
    24. Step 2: Perform differential expression analysis in GenePattern
    25. Step 3: Send differentially expressed genes to Genomica
    26. Step 3: Perform module network analysis in Genomica
    27. Step 4: Visualize regulators with known SNPs and linkage regions
    28. Step 4: Visualize regulators with known SNPs and linkage regions
    29. Deployment Architecture and APIs [deployment architecture diagram. GS clients: GS UI, IGV, Genomica, GenePattern, Galaxy, Cytoscape, UCSC, each connecting via REST or the CDK. Services hosted on Amazon: Identity Service (OpenID), Analysis Task Manager (ATM), Data Manager (DM), Provenance, Simple DB, S3 file transfers. External data sources (e.g., ArrayExpress) connect via REST]
    30. Join the GenomeSpace community • Researchers with biological projects • Developers o Add your tools o Contribute format converters o Build new infrastructure • Data portals and repositories o Link your resources
    31. Acknowledgements • GenomeSpace Collaborators: Cytoscape: Trey Ideker Lab, UCSD; Galaxy: Anton Nekrutenko Lab, Penn State University; Genomica: Eran Segal Lab, Weizmann Institute; UCSC Browser Team; GenePattern Team; IGV Team • Broad Institute: Ted Liefeld, Helga Thorvaldsdottir, Jim Robinson, Marco Ocana, Eliot Polk, Jill Mesirov (PI) • Driving Biological Projects: Howard Chang Lab, Stanford University; Aviv Regev Lab, Broad Institute • Funding • gs-help@broadinstitute.org • www.genomespace.org
