j.liu Current status of trans-mart development (1)


1 Comment
  • Jinlei,
    Do you happen to know of any tools or scripts that would convert raw data to tranSMART's standard format by normalising it, i.e. process the Level 1 data in your slide 11? Affymetrix Gene Expression Console seems to be limited to Windows, i.e. it does not appear to work on Linux. I am considering using Bioconductor and preparing a script of R commands, but thought I should check with you first. Many thanks, Hurriyet


  1. Current Status of tranSMART Development
     Jinlei Liu, Deloitte Consulting LLP
  2. Objectives
     • Core problem to solve
     • Current development status and challenges
     • tranSMART platform revisit and enhancement ideas
     • Community development
  3. Core problem: Scalable Analyses of Integrated Scientific Data
     Collaborative analysis of the medical research data sets needed to make data-driven decisions for translational research is not scalable today. This is because groups lack the standard integration needed within and between data sets across disparate domains, including 'omics, clinical research, and outcomes, linked with scientifically meaningful semantics.
     A platform that enables scientists to share high-quality data across experimental data sets with standardized storage, query, analytics, and visualization models is needed to enable integrative, informatics-driven analyses.
  4. tranSMART – Knowledge Management Platform
  5. tranSMART - Adoption and Emerging Community
     [Figure panels: Adoption; Emerging Community; GitHub Activity since Jan 2012]
  6. Features in the Open Source Releases
     Timeline: Q4 2011, Feb 2012, July 2012, Dec 2012, Feb 2013, eTRIKS review
     Versions: 0.9 GPL, 1.0 RC1, 1.0 RC2, 1.0 GA, 1.1 Alpha, 1.1 Beta RC1
     Initial Release
     • Search, Dataset Explorer, Sample Explorer and Gene Signature
     • Gene Expression, RBM and Clinical Trial Data
     • GenePattern integration
     • ETL scripts based on Oracle technology
     • Legacy i2b2
     GA Release
     • i2b2 upgrade to 1.6
     • R analytical plugin with 8+ pipelines
     • R native interface
     • Advanced data export
     • SNP data support
     • Updated ETL scripts, some in Kettle
     • Documentation
     • Data Curation Tool
     Postgres Migration
     • i2b2 Postgres support
     • tranSMART Postgres migration
     • Integration tests
     • Community build tools
     • Updated ETL scripts, more Kettle jobs
  7. Yet More Features on Private or Forked Versions of tranSMART
     • Faceted Search (3 versions!)
     • Gene Signature UI enhancements
     • New data visualization in search
     • Integrated DSE and Faceted Search API
     • GWAS, eQTL, Genetic Variation (VCF) data
     • New analytic pipelines in R
     • Across Study pilot
     • Study data and metadata tagging
     • Data upload UI and tools
     • Enrichment Analysis and Metacore integration
     • NCIBI tool integration
     • Installation scripts
     • New ETL pipelines and BioPortal integration
     • New grid view
     • Saved Reports
     • …
  8. Knowledge Sharing Requires Collaborative Development Effort
     [Diagram labels: Master Branch; Forked Development Branch; Feature left on branch; Private Repo 1; Private Repo 2]
  9. Feedback From the Community Requires Platform Revamp
     Users
     • Intuitive UI to visualize data
     • Powerful data export tool
     • Support for NGS and other new data types
     • Better performance
     • Self data management capability
     • Meta-analysis
     • More analytic pipeline integration
     • Integration with other systems
     Developers
     • Better architecture: extension and customization currently require significant core code changes
     • UI and code clean-up: ExtJS and jQuery are mixed
     • Better system integration via a service API
     • Better data curation and ETL, ideally automated pipelines
     • Better packaging
     • Better code management and testing
  10. tranSMART Platform Revisit – Architecture Overview
      [Architecture diagram; recoverable label: Internal Application]
  11. tranSMART Platform Revisit – Data Categories and Storage
      • Level 1 (Raw)
        Description: Raw data from the source platform; not normalized
        Example: Affymetrix CEL files
        Usage: Data processing pipeline
        Storage: File system
      • Level 2 (Processed)
        Description: Normalized data; processed through curation or data processing pipelines
        Example: Clinical trial data; RMA or MAS5 normalized gene expression data; SNP data with calls and CNV
        Usage: Dataset Explorer
        Storage: Database: DeApp, i2b2DemoData
      • Level 3 (Interpreted)
        Description: Interpreted or aggregated data derived from processed data
        Example: Z-scores for gene expression data; ANOVA analysis results
        Usage: Dataset Explorer; Search
        Storage: Database: DeApp, BioMart
      • Level 4 (Summary and Findings)
        Description: Quantified association and analysis across multiple samples
        Example: Across-trial analysis; data association results from publications; published results
        Usage: Search
        Storage: Database: BioMart
      • Master Data (Slow changing data)
        Description: Data about key business entities in the system; may come from internal or external data sources
        Example: Study design, platform specification, subject demographics, ontology trees, user-defined gene lists
        Usage: Dataset Explorer; Search
        Storage: Database: i2b2Metadata, i2b2DemoData, BioMart, SearchApp
      • Reference Data (Slow changing data used as reference to other systems)
        Description: Data from another system that is used as an identifier or reference
        Example: Affymetrix annotation files, GeneID from Entrez
        Usage: Dataset Explorer; Search
        Storage: Database: DeApp, BioMart
      • MetaData - Structural (Metadata)
        Description: Data that describes data structure
        Example: Data dictionary, schema guide
        Usage: Documentation
        Storage: File
      • MetaData - Administrative (Operational) (Metadata)
        Description: Data associated with application/data access and operation
        Example: ETL auditing and QC results, application access results
        Usage: Search
        Storage: Database: searchApp, rdc_cz
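The step from Level 2 (normalized values) to Level 3 (interpreted values such as Z-scores) can be illustrated with a small worked example. This is only a sketch under assumptions: the slide does not say how tranSMART computes Z-scores, so the code uses the textbook definition z = (x - mean) / sd per gene across samples, with the sample standard deviation. The gene names and values are hypothetical.

```python
from statistics import mean, stdev
from typing import Dict, List


def z_scores(values: List[float]) -> List[float]:
    """Return (x - mean) / sd for each value, using the sample standard deviation.

    Assumption: textbook z-score; not taken from the tranSMART code base.
    """
    mu = mean(values)
    sd = stdev(values)  # raises StatisticsError for fewer than two values
    if sd == 0:
        # A gene with identical values in every sample carries no signal.
        return [0.0 for _ in values]
    return [(x - mu) / sd for x in values]


# Hypothetical Level 2 input: normalized expression per gene across three samples.
level2: Dict[str, List[float]] = {
    "GENE_A": [1.0, 2.0, 3.0],
    "GENE_B": [5.0, 5.0, 5.0],
}

# Level 3 output: per-gene Z-scores derived from the Level 2 values.
level3 = {gene: z_scores(vals) for gene, vals in level2.items()}
```

In this sketch the Level 2 values would live in the processed-data store and the derived Level 3 values would be written alongside them, matching the storage column on the slide.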
  12. tranSMART Platform Revisit - Data Storage
      • BIOMART_USER: Single access point for the tranSMART app. Contains database SYNONYMS.
      • I2B2DEMODATA: Clinical, subject and low-dimension data in a STAR schema.
      • I2B2METADATA: Clinical trial ontology and security.
      • SEARCHAPP: Application user data such as user accounts, the queries they've run, gene signatures and the study permissions.
      • BIOMART: Core data warehouse and datamart with master data (study, platform etc.), analyzed and curated summary data.
      • DEAPP: Omic mart that stores high-dimension data (Gex/SNP/Proteomics), subject and sample association, and the security extension for clinical trials.
      • I2B2HIVE: i2b2 project and user database.
      • TM_LZ: Landing zone where data is stored in its original format.
      • TM_WZ: Working zone that contains intermediate ETL results.
      • TM_CZ: ETL job control, QC and auditing zone.
      [Link labels shown between schemas: concept_cd, Biomarker UIDs; ontology UID, subject, study; subject, sample, concept_cd, trial; Projects/ontology]
  13. Data Store Redesign
      • Transactional: User and application data, in an RDBMS
      • Clinical and Finding: Level 3, Level 4 and clinical data, in an RDBMS
      • High Dimension / Big Data: Level 2 and 3 data, in a NoSQL DB
      • Files and External Links: Documentation and indexing, on the file system
      • Spanning the stores: Metadata and Master Data; Reference and Operational Data
  14. tranSMART Platform Revisit – Data Curation and ETL
      The stages run left to right, with a feedback loop back to earlier stages.
      • Define study/data to be loaded (Principal Scientist): Determine which study to load into the system. This is decided by the Principal Scientist / a System Product Manager.
      • Data Source (Data Steward): Original source research data is copied as a preliminary process step.
      • Common Data Format (Analyst): The curation process begins by converting the data from original sources into a common format.
      • Common Ontology (Analyst): Data is then organized into a common structure and a common ontology or vocabulary.
      • Metadata Tagging Process (Analyst): Data is tagged for future referencing and searching, at the record level, by concepts (disease, tissue, platform).
      • ETL Process (ETL Engineer): Quality-approved data is sent through the ETL pipeline.
      • Quality Control (Quality control): Data is analyzed and compared against similarly tagged data, and any unusual features are noted.
      • tranSMART: Data is available in tranSMART for analysis by end users.
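The curation stages above can be sketched as a chain of small functions applied to one record. This is a hypothetical illustration, not tranSMART's ETL code: the stage functions, the `ONTOLOGY` vocabulary and the record fields are all assumptions made for the example, and real curation works on whole studies rather than single dictionaries.

```python
from typing import Callable, Dict, List

Record = Dict[str, object]
Stage = Callable[[Record], Record]

# Hypothetical controlled vocabulary, used only for this illustration.
ONTOLOGY = {"breast ca": "Breast Cancer", "lung ca": "Lung Cancer"}
CONCEPTS = ("disease", "tissue", "platform")


def to_common_format(rec: Record) -> Record:
    # Stand-in for format conversion: normalise all field names to lower case.
    return {str(k).lower(): v for k, v in rec.items()}


def map_to_ontology(rec: Record) -> Record:
    # Replace a free-text disease label with its vocabulary term when one is known.
    if "disease" not in rec:
        return rec
    raw = str(rec["disease"])
    return {**rec, "disease": ONTOLOGY.get(raw.lower(), raw)}


def tag_metadata(rec: Record) -> Record:
    # Record-level tags by concept (disease, tissue, platform).
    tags = {k: rec[k] for k in CONCEPTS if k in rec}
    return {**rec, "tags": tags}


def quality_control(rec: Record) -> Record:
    # Flag records with missing concept tags; failures would feed the feedback loop.
    missing = [k for k in CONCEPTS if k not in rec["tags"]]
    return {**rec, "qc_passed": not missing, "qc_missing": missing}


STAGES: List[Stage] = [to_common_format, map_to_ontology, tag_metadata, quality_control]


def run_pipeline(rec: Record, stages: List[Stage] = STAGES) -> Record:
    # Apply each stage in order, passing the output of one into the next.
    for stage in stages:
        rec = stage(rec)
    return rec
```

For example, `run_pipeline({"Disease": "Breast CA", "Tissue": "breast", "Platform": "GPL570"})` yields a record whose `disease` is the vocabulary term and whose `qc_passed` flag is true, while a record without a platform is flagged with `qc_missing == ["platform"]` so it can be sent back for curation.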
  15. Curation and ETL Enhancement
      • Data ingestion templates and services
      • Curation tool with metadata integration
      • Data upload and services
      • Automated data processing pipelines
      • Data security
      • Data sharing API and services
  16. tranSMART Platform Revisit - N-tier Architecture
      • Presentation tier: Web-based user interface.
        Components: Ajax, JavaScript framework, GSP/JSP, JSON/XML
      • Business tier: Data processing and business logic evaluation. Moves and transforms data between the presentation and data tiers.
        Components: Security (with plugins), Controller, Model, Container, Programming API, Web Services (SOAP, RESTful), Plugins (RModule), External Systems, Web Service
        Services: Analysis, i2b2 CRC, Plugin Reg, Search, Data Integration, Filter, i2b2 PM, Async Job, Data Import, Doc Index, i2b2 Ontology, Data Retrieve, Data Export
      • Data tier: Data is stored in and retrieved from the database or file system.
        Components: GORM/Hibernate, Knowledge Inventory, Oracle/Postgres, File Storage
  17. tranSMART Platform Revisit - Analytic Integration via R Plugin
      [Diagram spanning three servers: Data Server, App Server, Analytic Server]
      • Data Server labels: Biomart; tranSMART Clinical Mart (i2b2)
      • App Server labels: Plugin Reg; Store Plugin /manage; RModule; Plugin Modules; Async Job; Retrieve Data; Data Export; RInterface
      • Analytic Server labels: Rserve; Packages; R backend; ROracle
      • Flow labels: Register module; Input; Doc; Render module response; SendData; Output; Analytic job; Render; Data Request and Retrieval; Direct access to Data Store via OCI
  18. Service and Plugin Based Architecture
      Ideas
      • Leverage the Grails plugin framework
      • tranSMART core as a Grails plugin
      • Service and plugin registration in Core
      • Extension as a Grails plugin
      [Diagram: PLUGINS and SERVICES around a CORE; KEY PLUGINS: Data Ingestion and Export; Data Visualization and Explorer; Data Integration and Storage; Data Analysis]
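The idea of registering services and plugins in a core can be sketched in a few lines. This is a language-neutral illustration written in Python, not the Grails plugin API and not tranSMART code: the `Core` class, its methods, and the `data_export` plugin are hypothetical names chosen for the example.

```python
from typing import Callable, Dict, List


class Core:
    """Hypothetical core that keeps a registry of named services."""

    def __init__(self) -> None:
        self._services: Dict[str, Callable[..., object]] = {}

    def register(self, name: str, handler: Callable[..., object]) -> None:
        # Refuse duplicates so two plugins cannot silently shadow each other.
        if name in self._services:
            raise ValueError(f"service already registered: {name}")
        self._services[name] = handler

    def call(self, name: str, *args: object, **kwargs: object) -> object:
        # Look the service up by name and delegate to the plugin's handler.
        return self._services[name](*args, **kwargs)

    def registered(self) -> List[str]:
        return sorted(self._services)


def to_csv(rows: List[List[object]]) -> str:
    # Minimal export handler: one comma-separated line per row.
    return "\n".join(",".join(str(cell) for cell in row) for row in rows)


def data_export_plugin(core: Core) -> None:
    # A plugin extends the platform by registering its services with the core.
    core.register("data_export", to_csv)


core = Core()
data_export_plugin(core)
```

With this shape, adding a feature means shipping a new plugin function rather than changing the core, which is the design goal the slide states.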
  19. Great Opportunity - Knowledge Sharing and Community Development
      [Diagram labels: team stages Forming, Storming, Norming, Performing; trust levels No Trust, Limited Trust, Collaboration, Synergy; knowledge states Knowledge Unknown, Knowledge Silo, Knowledge Sharing, Knowledge Creation]
  20. Another Popular Knowledge Management Community! tranSMART
  21. Thank You