GenePattern: Ted Liefeld

  1. GenePattern Overview for MAGE-TAB Workshop. Ted Liefeld, January 24, 2007
  2. a platform for integrative genomics
     [Diagram: Client User Interfaces (Desktop, Programming, Web); Pipeline Environment; Module Repository (KNN, SVM, SOM, GSEA, NMF, PCA); Module Integrator. Example pipeline on the all_aml_train and all_aml_test datasets (Golub and Slonim et al. 1999): Preprocess, Class Neighbors, Weighted Voting Cross-Val, SOM Clustering, Preprocess, Weighted Voting Train/Test, with SOM Cluster Viewer, Marker Selection Viewer, and Prediction Results Viewer.]
  3. Features
     • Automatic Module Integration
         • Add new modules without writing code
         • Supports any command line callable code (language independent)
     • Multiple user interfaces
         • Desktop client
         • Web client
         • Programmatic interfaces to Java, MATLAB, R
     • Local and Distributed Computing
         • Laptop
         • Client/Server
         • Compute farm
         • Public server (1/2008)
     • Interoperability
         • caBIG
             • caArray
             • caGrid
         • geWorkbench
         • Cytoscape
     • Analytic Reproducibility
         • Easy, rapid sharing of methodologies via pipelines
         • Versioning using Life Sciences Identifier (LSID)
         • Executable history of all sessions
         • Automatic pipeline generation from result files
         • Executable research documents
     • Comprehensive Module Repository
         • ~90 modules: analysis, visualization, pipelines
         • Expression, proteomic, sequence, variation (SNP), and whole genome association data
         • Construction of context-sensitive, flexible analytic workflows
         • Module suites
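To illustrate the LSID versioning item above: an LSID is a URN of the form `urn:lsid:authority:namespace:object[:revision]`, so two versions of one module differ only in the trailing revision. The following is a minimal sketch of that idea, not GenePattern's own code; the identifiers, the `LSID` type, and the helper names are made up for illustration.

```python
from typing import NamedTuple, Optional


class LSID(NamedTuple):
    authority: str
    namespace: str
    object_id: str
    revision: Optional[str]


def parse_lsid(text: str) -> LSID:
    """Split 'urn:lsid:authority:namespace:object[:revision]' into fields."""
    parts = text.split(":")
    if (len(parts) not in (5, 6)
            or parts[0].lower() != "urn"
            or parts[1].lower() != "lsid"):
        raise ValueError(f"not an LSID: {text!r}")
    revision = parts[5] if len(parts) == 6 else None
    return LSID(parts[2], parts[3], parts[4], revision)


def same_module(a: LSID, b: LSID) -> bool:
    """True when two LSIDs name the same object, ignoring the revision."""
    return a[:3] == b[:3]


# Hypothetical identifiers, for illustration only.
v1 = parse_lsid("urn:lsid:example.org:module.analysis:00042:1")
v2 = parse_lsid("urn:lsid:example.org:module.analysis:00042:2")
```

Here `same_module(v1, v2)` holds while `v1.revision` and `v2.revision` differ, which is what lets every version of a module be retained and addressed separately.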
  4. Gene Expression Analysis
     • Differential Marker Analysis
     • Gene Neighbors
     • caArray Retriever
     • GEO Download
     • Expression File Creator
     • Threshold
     • Variation Filter
     • MAGE-ML Import
     • MAGE-TAB Import…
  5. SNP Analysis
     • Copy Number Estimation
     • Smoothing
     • LOH determination
     • Batch Correction
     • SNPViewer
     • SNPFileCreator
     • X Chromosome Correction
     • GISTIC pipeline (soon…)
  6. Statistical Methods & Machine Learning Analyses
     • Prediction
         • K-Nearest Neighbors (KNN)
         • Weighted Voting (WV)
         • Support Vector Machines (SVM)
         • Probabilistic Neural Networks (PNN)
         • Classification and Regression Trees (CART)
     • Clustering
         • Hierarchical
         • k-Means
         • SOM
         • Consensus
     • Pathway Analysis
         • GSEA
         • ARACNE
         • Cytoscape
     • Other Statistical Methods
         • Missing value imputation
         • Kolmogorov-Smirnov score
         • Non-negative Matrix Factorization (NMF)
         • Principal Components Analysis (PCA)
  7. Module Integrator
     • Add modules and visualizers without writing code
     • Share custom analysis tasks
     • Integrate your own or "third-party" tools easily
     • Add tools to a common repository
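One way to picture wrapping a command-line tool without writing code is a command-line template whose placeholders are filled from user-supplied parameters. The sketch below assumes a `<name>` placeholder syntax and an invented `filter.pl` tool; it illustrates the general idea only and is not taken from the GenePattern implementation.

```python
import re
import shlex
from typing import Dict, List

# Assumed placeholder syntax for this sketch: <parameter.name>
PLACEHOLDER = re.compile(r"<([A-Za-z0-9_.]+)>")


def build_argv(template: str, params: Dict[str, str]) -> List[str]:
    """Fill <name> placeholders in a command-line template and return argv."""

    def substitute(match: "re.Match[str]") -> str:
        name = match.group(1)
        if name not in params:
            raise KeyError(f"missing parameter: {name}")
        return params[name]

    # Tokenize first, then substitute, so a value containing spaces
    # stays a single argument.
    return [PLACEHOLDER.sub(substitute, token) for token in shlex.split(template)]


# Hypothetical module definition and parameter values.
argv = build_argv(
    "perl filter.pl --in <input.file> --min <min.fold>",
    {"input.file": "my data.gct", "min.fold": "3"},
)
```

The resulting list can be handed to `subprocess.run(argv)`, which is why the wrapped tool can be written in any language.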
  8. Pipelines for reproducible research
     [Diagram: example pipeline on the all_aml_train and all_aml_test datasets (Golub and Slonim et al. 1999): Preprocess, Class Neighbors, Weighted Voting Cross-Val, SOM Clustering, Preprocess, Weighted Voting Train/Test, with SOM Cluster Viewer, Marker Selection Viewer, and Prediction Results Viewer.]
     • Users can design workflows where the input to any module is the output of any previous module
     • Users can start with a result and automatically generate the workflow that created it
     • Input data, parameters, and code (optionally) are packaged with a pipeline
     • Every version of a module or pipeline is retained and uniquely identified
     • Pipelines and modules are exportable/importable and can be shared among GenePattern users
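The second bullet above, generating a workflow from a result, amounts to walking a provenance graph backwards. A minimal sketch follows, assuming each result simply remembers the job that produced it; the `Job` and `Result` classes, the `run` helper, and the parameter names are hypothetical, and only the dataset and module names come from the slide.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Job:
    """One module execution: which module ran, on what, with which settings."""
    module: str
    params: Dict[str, str] = field(default_factory=dict)
    inputs: List["Result"] = field(default_factory=list)


@dataclass
class Result:
    """A file, plus the job that produced it (None for uploaded data)."""
    name: str
    produced_by: Optional[Job] = None


def run(module: str, inputs: List[Result], **params: str) -> Result:
    """Record a job and return a placeholder for its output file."""
    job = Job(module, dict(params), list(inputs))
    return Result(f"{module}.out", produced_by=job)


def pipeline_from(result: Result) -> List[Job]:
    """Walk back from a result and list the jobs that led to it, in run order."""
    ordered: List[Job] = []
    seen = set()

    def visit(res: Result) -> None:
        job = res.produced_by
        if job is None or id(job) in seen:
            return
        seen.add(id(job))
        for upstream in job.inputs:
            visit(upstream)
        ordered.append(job)  # appended after its upstream jobs

    visit(result)
    return ordered


train = Result("all_aml_train")
pre = run("Preprocess", [train], threshold="20")
som = run("SOM Clustering", [pre], clusters="2")
steps = [job.module for job in pipeline_from(som)]
```

Because each `Job` keeps its parameters and inputs, the recovered list is enough to re-run the same analysis on new data.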
  9. as a Visualization & Analysis Engine
     http://www.broad.mit.edu/mmgp
     [Diagram: Portal; GenePattern; GenePattern SNPViewer visualizer (running as applet); Run GenePattern Analyses; LSF Worker Nodes]
  10. Using MAGE-ML today
  11. MAGE-TAB use tomorrow
      • Ideally
          • Be able to automatically find raw/derived bioassay data when parsing MAGE-TAB files
              • Use MAGE-TAB like our native (tab-delimited) data formats, GCT and RES, in (almost) any GenePattern analysis module
              • Not require user interaction to specify Assays or quantitation types
              • ? MGED-Ontology for common data transform protocols (e.g. RMA, MAS5) in addition to free text
      • Sub-optimal but still good
          • Have an interactive viewer to convert from MAGE-TAB to a native format (e.g. MAGE-ML import viewer)
              • Human interaction required…
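For context on the native formats mentioned above, GCT is a plain tab-delimited matrix: a `#1.2` version line, a line with the row and column counts, a header line starting with `Name` and `Description`, then one row per gene. The sketch below writes and reads that layout as I understand it; the gene, description, and sample values are invented, and this is not GenePattern's own reader.

```python
import io
from typing import Dict, List, TextIO, Tuple

Matrix = Dict[str, Tuple[str, List[float]]]  # name -> (description, values)


def write_gct(out: TextIO, samples: List[str], rows: Matrix) -> None:
    """Write an expression matrix in the tab-delimited GCT #1.2 layout."""
    out.write("#1.2\n")
    out.write(f"{len(rows)}\t{len(samples)}\n")
    out.write("\t".join(["Name", "Description"] + samples) + "\n")
    for name, (description, values) in rows.items():
        if len(values) != len(samples):
            raise ValueError(f"{name}: expected {len(samples)} values")
        out.write("\t".join([name, description] + [str(v) for v in values]) + "\n")


def read_gct(src: TextIO) -> Tuple[List[str], Matrix]:
    """Read a GCT file back into (sample names, rows)."""
    if src.readline().strip() != "#1.2":
        raise ValueError("missing #1.2 version line")
    n_rows, n_cols = (int(x) for x in src.readline().split())
    samples = src.readline().rstrip("\n").split("\t")[2:]
    if len(samples) != n_cols:
        raise ValueError("sample count does not match the dimensions line")
    rows: Matrix = {}
    for _ in range(n_rows):
        fields = src.readline().rstrip("\n").split("\t")
        rows[fields[0]] = (fields[1], [float(x) for x in fields[2:]])
    return samples, rows


# Made-up data, round-tripped through an in-memory file.
buffer = io.StringIO()
write_gct(buffer, ["s1", "s2"], {"geneA": ("hypothetical gene", [10.5, 200.0])})
buffer.seek(0)
samples, rows = read_gct(buffer)
```

A MAGE-TAB data matrix that can be mapped onto this shape without asking the user which assays or quantitation types to use is the "ideal" case described on the slide.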
  12. More MAGE-TAB thoughts
      • Define structure/format for keeping multiple MAGE-TAB files together
          • IDF, ADF, SDRF, raw data files -> package together as ZIP? tgz?
              • Sub directories in the zip? (defined)
      • Does MAGE-TAB support multiple arrays in one file?
          • Useful, and MAGE-ML allows this now (but I don't like it for automated processing)
              • E.g. E-GEOD-995.mageml.tgz from ArrayExpress
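The packaging question above can be made concrete with a small sketch. It assumes one possible layout, which is not a defined standard: IDF, SDRF, and ADF files (recognized here by `.idf.txt`, `.sdrf.txt`, and `.adf.txt` suffixes) at the archive root and everything else under a `data/` subdirectory. File names and contents are placeholders.

```python
import io
import zipfile
from typing import Dict, List

ANNOTATION_SUFFIXES = (".idf.txt", ".sdrf.txt", ".adf.txt")


def archive_path(filename: str) -> str:
    """Annotation files stay at the root; data files go under data/ (assumed layout)."""
    if filename.lower().endswith(ANNOTATION_SUFFIXES):
        return filename
    return "data/" + filename


def package_magetab(files: Dict[str, bytes]) -> bytes:
    """Bundle a set of MAGE-TAB files into a single in-memory ZIP archive."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for filename in sorted(files):
            archive.writestr(archive_path(filename), files[filename])
    return buffer.getvalue()


def list_package(blob: bytes) -> List[str]:
    """Return the member paths of a packaged archive."""
    with zipfile.ZipFile(io.BytesIO(blob)) as archive:
        return archive.namelist()


# Placeholder file names and contents.
blob = package_magetab({
    "exp.idf.txt": b"idf placeholder",
    "exp.sdrf.txt": b"sdrf placeholder",
    "sample1.CEL": b"raw placeholder",
})
members = list_package(blob)
```

With a fixed layout like this, a tool can locate the IDF and SDRF by position and resolve data file references relative to `data/` without user interaction.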
  13. More MAGE-TAB thoughts
      • Persistent identifiers
          • For protocols, samples, etc.
              • Allow use of SDRF, data matrix (e.g. in GP with persistent references to external entities)
                  • Array details, experiment design, etc.
      • Question?
          • Should we consider MAGE-TAB DAG to record data processing pipelines (provenance - HLA)?
              • e.g. a protocol for each module execution added to MAGE-TAB file outputs
                  • File growth issues…
              • Record all analysis for a publication
              • Add additional SDRF file at each step
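To make the provenance question above concrete, the sketch below appends one `Protocol REF` / `Derived Array Data File` column pair to an SDRF table per module execution. Those two column headings are standard SDRF headings; the function, the protocol identifiers, and the file names are assumptions for illustration, not a proposal from the slides.

```python
import csv
import io
from typing import List


def add_processing_step(sdrf_text: str, protocol: str, outputs: List[str]) -> str:
    """Append a 'Protocol REF' / 'Derived Array Data File' column pair to each row."""
    rows = list(csv.reader(io.StringIO(sdrf_text), delimiter="\t"))
    header, body = rows[0], rows[1:]
    if len(outputs) != len(body):
        raise ValueError("need one output file per SDRF row")
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(header + ["Protocol REF", "Derived Array Data File"])
    for row, derived in zip(body, outputs):
        writer.writerow(row + [protocol, derived])
    return out.getvalue()


# Hypothetical one-sample SDRF and two processing steps.
sdrf = "Source Name\tArray Data File\nsample1\tsample1.CEL\n"
step1 = add_processing_step(sdrf, "urn:lsid:example.org:module.analysis:00042:1", ["sample1.gct"])
step2 = add_processing_step(step1, "urn:lsid:example.org:module.analysis:00043:1", ["sample1.filtered.gct"])
width = len(step2.splitlines()[0].split("\t"))
```

Each step widens the table by two columns (here from 2 to 6 after two steps), which is the file growth concern noted on the slide; writing a separate SDRF per step trades that for more files to keep together.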
  14. • Collaborations
          • caBIG
          • MAGNet NCBC
          • NCIBI NCBC
      • Release Information
          • Initially released in March 2004
          • Current version 3.0, released April 2007
              • 3.1 due Feb 08
          • Currently 5900+ users, 500+ organizations, ~90 countries
      • Availability
          • Freely available
          • Windows, Mac OS, and Unix platforms
      • Resources
          • http://www.genepattern.org
          • User workshops, documentation, email help desk, online user forum
          • Reich et al. (2006) Nature Genetics
      GenePattern is a winner of the 2005 BioIT World Best Practices Award
