tranSMART Community Meeting 5-7 Nov 13 - Session 3: transmart-data
Management of tranSMART's Environment
Gustavo Lopes
The Hyve B.V.

Transcript

  • 1. transmart-data: Management of tranSMART’s Environment
       Gustavo Lopes, The Hyve B.V.
       November 6, 2013
  • 2. Outline
       1. Problems: Reproducibility; Version Control; Automation; Why?!
       2. tranSMART Foundation’s Version 1.1
       3. Solution: transmart-data: General Description; Configuration; Database Schema Management; Seed Data; ETL; RModules’ Analyses; Rserve; Solr; transmartApp Configuration; Limitations
  • 3. Typical Branch Distribution
       Grails code:
       - transmartApp (without full repo history, always with wrong ancestry information ⇒ merging quite difficult)
       - RModules (if you’re lucky), but analyses’ definitions in DB not provided
       Database:
       - SQL scripts on top of GPL 1.0 dump or later; probably insufficient / won’t apply
       - Stored procedures for ETL; overlapping definitions with yours, but no history ⇒ merging quite difficult
       - Manual fixups always required (even if just permissions/synonyms)
  • 4. Typical Branch Distribution (II)
       ETL:
       - High variability in strategies
       - Instructions/sample data rarely provided
       - Kettle scripts are problematic
       Solr/Rserve/Configuration:
       - Solr schemas/dataimport.xml perpetually forgotten
       - Idem for information on R packages
       - Sample configuration rarely provided
  • 5. Version Control
       Version control used ONLY for Grails code… but often squashed and with wrong ancestor information. Forget about database, Solr, most of ETL.
       Result:
       - Merges are very difficult
       - Changes cannot easily be tracked
       - Changes’ wherefores are unknown
       - Regressions are introduced (no conflicts)
       - Collaboration is based on e-mail attachments
  • 6. Automation
       Even with all the pieces… setting up a new branch takes days; weeks for non-basic functionality. No reproducibility in the process!
       Result:
       - Devs driven away from a fully local environment (too much work)
       - Robust environment for CI passed over (too much work)
       - Bugs cannot be reliably reproduced (see also: no consistent usage of VCS)
       - Time wasted on deployment-specific mistakes/inconsistencies
  • 7. Why?!
       “The ‘source code’ for a work means the preferred form of the work for making modifications to it.” (GPL v3, section 1)
       Is everyone holding back “source code”? More likely explanation: no appropriate tooling being used.
       (Image: Guillaume Duchenne, public domain)
  • 8. Situation for tranSMART 1.1
       The situation is much better! Some problems remain, though.
       The Good:
       - Creating/populating the DB is easy
       - Most stuff is versioned
       - CI for builds
       - Image available
       - Public issue tracking
       The Bad:
       - No Oracle support
       - Changes to DB scripts/seed data are ad hoc (lax structure)
       - No mechanism to support/compare schemas with other branches
       - R analyses are JSON blobs in TSVs
       - No VCS for Solr or Rserve/images’ setup
       - Setting up Solr/Rserve is time-consuming
       - Populating the DB with sample data is still time-consuming
       - Config changes required for dev
  • 9. Description of transmart-data
       We developed transmart-data to address most of these problems. transmart-data is a set of scripts for managing tranSMART’s environment and certain application data (e.g. Solr schemas, DDL, seed data), which is used by the scripts and sometimes generated by them. It has a Makefile-based interface.
  • 10. transmart-data: Purposes
       1. Allow setting up a complete dev environment quickly (< 30 min)
       2. Bring versioning to the database schema and Solr files
       3. Set up the Solr runtime
       4. Invoke ETL pipelines
       5. Set up Rserve
       Target audience: programmers
  • 11. transmart-data: Non-purposes
       1. Setting up a production environment (some components can be used)
       2. New users evaluating tranSMART (use a pre-built image)
       3. Building transmartApp or its plugin dependencies (build them yourself or use artifacts from Bamboo/Nexus)
  • 12. Configuration
       Environment-variable-based configuration:
         cp vars.sample vars
         vim vars    # edit file
         source vars
       Sample vars contents:
         PGHOST=/tmp
         PGPORT=5432
         PGDATABASE=transmart
         PGUSER=$USER
         PGPASSWORD=
         TABLESPACES=$HOME/pg/tablespaces/
         PGSQL_BIN=$HOME/pg/bin/
         ORAHOST=localhost
         ORAPORT=1521
         ORASID=orcl
         ORAUSER="sys as sysdba"
         ORAPASSWORD=mypassword
         ORACLE_MANAGE_TABLESPACES=0
         # continues...
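The vars file on this slide is plain shell that gets sourced into the environment; the Makefiles then read those variables. As a minimal sketch (the validation loop is hypothetical, not part of transmart-data, and the file contents here are a shortened copy of the slide's example with PGUSER given a portable default), this shows how such a file can be sourced and sanity-checked before running any make target:

```shell
# Write an example vars file (excerpt of the slide's sample; the
# ${USER:-postgres} default is an assumption for portability).
cat > /tmp/vars.example <<'EOF'
PGHOST=/tmp
PGPORT=5432
PGDATABASE=transmart
PGUSER=${USER:-postgres}
EOF

# Source it, as `source vars` does in transmart-data.
. /tmp/vars.example

# Hypothetical sanity check: fail early if a required variable is empty.
for v in PGHOST PGPORT PGDATABASE PGUSER; do
  eval val=\$$v
  [ -n "$val" ] || { echo "missing: $v" >&2; exit 1; }
done
echo "vars OK"
```

Because the file is ordinary shell, anything sourced this way is visible to every subsequent `make` invocation in the same session.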
  • 13. Database Schema Management
       Support for Oracle and Postgres.
       Postgres: uses pg_dump(all); parses the dump files.
         # Dump
         make -C postgres/ddl dump
         make -C postgres/ddl/GLOBAL extensions.sql roles.sql
         # Load
         make -C postgres/ddl load
       Oracle: queries dba_* tables; dumps DDL with DBMS_METADATA.
         # Dump
         make -C oracle/ddl dump
         # Load
         make oracle
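On the Postgres side the slide says the dump files are parsed. A toy sketch of the idea (hypothetical and much simpler than what transmart-data actually does) is to split schema-only `pg_dump` output into one file per object, keying on the `-- Name: ...; Type: ...` comment headers that pg_dump emits before each object:

```shell
# Fake two-object excerpt in pg_dump's comment-header style.
cat > /tmp/dump.sql <<'EOF'
-- Name: patient; Type: TABLE
CREATE TABLE patient (id integer);
-- Name: patient_idx; Type: INDEX
CREATE INDEX patient_idx ON patient (id);
EOF

# Split on each "-- Name:" header into numbered per-object files,
# so every table/index gets its own versionable DDL file.
awk '/^-- Name:/ { n++ } { print > ("/tmp/obj_" n ".sql") }' /tmp/dump.sql

ls /tmp/obj_*.sql
```

Per-object files are what makes the schema diffable and mergeable under version control, which is the whole point of slide 10's purpose 2.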
  • 14. Seed Data
       Only Postgres for now.
         # Dump
         # Tables to dump listed in postgres/data/<schema>/lst
         make -C postgres/data dump
         make -C postgres/common minimize_diffs
         # Load
         make -C postgres/data load
         # Load DDL and data
         make postgres
       Only for basic stuff with no ETL! Pretty fast (DDL + data loaded in 10 s).
  • 15. ETL (I)
       Unified interface for ETL.
       Prepare dataset:
       1. Prepare ETL-specific source files
       2. Prepare file with ETL-specific params
       3. Upload dataset to CDN (optional)
       Load dataset:
         make -C samples/{oracle,postgres} load_<type>_<study_id>
         # Example:
         make -C samples/postgres load_clinical_GSE8581
       For each new ETL pipeline, support must be added. Everything is automated!
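The `load_<type>_<study_id>` targets encode both the pipeline type and the study accession in the target name itself. A small sketch (a hypothetical helper, not code from transmart-data) of how such a name splits back into its two parameters using plain parameter expansion:

```shell
# Split a make target like `load_clinical_GSE8581` into its ETL
# parameters: the pipeline type and the study id.
# (Assumes the type itself contains no underscore.)
parse_load_target() {
  t=${1#load_}        # strip the "load_" prefix -> "clinical_GSE8581"
  type=${t%%_*}       # text before the first "_" -> "clinical"
  study=${t#*_}       # text after the first "_"  -> "GSE8581"
  echo "$type $study"
}

parse_load_target load_clinical_GSE8581   # prints: clinical GSE8581
```

Encoding parameters in the target name keeps the interface uniform across pipelines: adding a new `<type>` only requires teaching the makefiles how to dispatch it.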
  • 16. ETL (II)
       Show TM_CZ logs:
         $ make -C samples/postgres showdblog
         make: Entering directory `/home/gustavo/repos/transmart-data/samples/postgres'
         groovy -cp postgresql-9.2-1003.jdbc4.jar ../common/dump_audit.groovy postgres `tput cols`

         Procedure        | Description                      | Stat | Recs   | Date                 | Time spent
         -----------------+----------------------------------+------+--------+----------------------+-----------
         …alysis_data.kjb | GSE8581                          | DONE |      1 | 2013-10-15 13:23:22. | 0.0
         .load_ext_files  | Drop null samples rows           | Done |      0 | 2013-10-15 13:23:23. | 0.450529
         .load_ext_files  | Drop null cohorts rows           | Done |      0 | 2013-10-15 13:23:23. | 0.043125
         .load_ext_files  | Drop null analysis rows          | Done |      0 | 2013-10-15 13:23:23. | 0.066097
         .load_ext_files  | Read analysis file               | Done |      1 | 2013-10-15 13:23:23. | 0.048055
         .load_ext_files  | Read cohort file                 | Done |      3 | 2013-10-15 13:23:23. | 0.085535
         .load_ext_files  | Read samples file                | Done |     57 | 2013-10-15 13:23:23. | 0.049993
         .load_ext_files  | Write rwg_cohorts_ext            | Done |      3 | 2013-10-15 13:23:23. | 0.099452
         .load_ext_files  | Write rwg_analysis_ext           | Done |      1 | 2013-10-15 13:23:23. | 0.047331
         .load_ext_files  | Write rwg_samples_ext            | Done |     57 | 2013-10-15 13:23:23. | 0.044567
         .load_ext_files  | Read analysis data file          | Done | 436898 | 2013-10-15 13:23:27. | 3.911089
         .load_ext_files  | Drop null analysis_data rows     | Done | 382223 | 2013-10-15 13:23:27. | 0.067765
         .load_ext_files  | Write rwg_analysis_data_ext      | Done |  54675 | 2013-10-15 13:23:28. | 1.332746
         IMPORT_FROM_EXT  | Start FUNCTION                   | Done |      0 | 2013-10-15 13:23:29. | 0.117319
         IMPORT_FROM_EXT  | Delete existing records from TM_ | Done |      0 | 2013-10-15 13:23:29. | 0.035825
         IMPORT_FROM_EXT  | Delete existing records from TM_ | Done |      0 | 2013-10-15 13:23:29. | 6.26E-4
         IMPORT_FROM_EXT  | Delete existing records from TM_ | Done |      0 | 2013-10-15 13:23:29. | 4.84E-4
         IMPORT_FROM_EXT  | Insert records from TM_LZ.Rwg_An | Done |      1 | 2013-10-15 13:23:29. | 0.001079
         IMPORT_FROM_EXT  | Update bio_assay_analysis_id on  | Done |      0 | 2013-10-15 13:23:29. | 0.030793
         IMPORT_FROM_EXT  | Insert records from TM_LZ.Rwg_Co | Done |      3 | 2013-10-15 13:23:29. | 8.28E-4
         ... (continues)
       Errors are also shown (if any).
  • 17. RModules’ Analyses (transmartApp-DB)
       Situation in transmartApp-DB:
         update searchapp.plugin_module
         set params = '{"id":"survivalAnalysis","converter":{"R":[
             "source(''||PLUGINSCRIPTDIRECTORY||Common/dataBuilders.R'')",
             "source(''||PLUGINSCRIPTDIRECTORY||Common/ExtractConcepts.R'')",
             "source(''||PLUGINSCRIPTDIRECTORY||Common/collapsingData.R'')",
             "source(''||PLUGINSCRIPTDIRECTORY||Common/BinData.R'')",
             "source(''||PLUGINSCRIPTDIRECTORY||Survival/BuildSurvivalData.R'')",
             "SurvivalData.build(\n\tinput.dataFile=''||TEMPFOLDERDIRECTORY||Clinical/clinical.i2b2trans'',\n\tconcept.time=''||TIME||'',\n\tconcept.category=''||CATEGORY||'',\n\tconcept.eventYes=''||EVENTYES||'',\n\tbinning.enabled=''||BINNING||'',\n\tbinning.bins=''||NUMBERBINS||'',\n\tbinning.type=''||BINNINGTYPE||'',\n\tbinning.manual=''||BINNINGMANUAL||'',\n\tbinning.binrangestring=''||BINNINGRANGESTRING||'',\n\tbinning.variabletype=''||BINNINGVARIABLETYPE||'',\n\tinput.gexFile=''||TEMPFOLDERDIRECTORY||mRNA/Processed_Data/mRNA.trans'',\n\tinput.snpFile=''||TEMPFOLDERDIRECTORY||SNP/snp.trans'',\n\tconcept.category.type=''||TYPEDEP||'',\n\tgenes.category=''||GENESDEP||'',\n\tgenes.category.aggregate=''||AGGREGATEDEP||'',\n\tsample.category=''||SAMPLEDEP||'',\n\ttime.category=''||TIMEPOINTSDEP||'',\n\tsnptype.category=''||SNPTYPEDEP||'')\n\t"]},
           "name":"Survival Analysis",
           "dataFileInputMapping":{"CLINICAL.TXT":"TRUE","SNP.TXT":"snpData","MRNA_DETAILED.TXT":"mrnaData"},
           "dataTypes":{"subset1":["CLINICAL.TXT"]},
           "pivotData":false,
           "view":"SurvivalAnalysis",
           "processor":{"R":[
             "source(''||PLUGINSCRIPTDIRECTORY||Survival/CoxRegressionLoader.r'')",
             "CoxRegression.loader(input.filename=''outputfile'')",
             "source(''||PLUGINSCRIPTDIRECTORY||Survival/SurvivalCurveLoader.r'')",
             "SurvivalCurve.loader(input.filename=''outputfile'', concept.time=''||TIME||'')"]},
           "renderer":{"GSP":"/survivalAnalysis/survivalAnalysisOutput"}, ... (goes on)'
         where module_name = 'pgsurvivalAnalysis';
       Not very nice…
  • 18. RModules’ Analyses (transmart-data)
       In transmart-data:
       - One file per analysis
       - Files can be generated from DB data
       - Sanely formatted
       - But we really want to remove this from the DB!
         array(
           'id' => 'heatmap',
           'name' => 'Heatmap',
           'dataTypes' => array(
             'subset1' => array(0 => 'CLINICAL.TXT'),
           ),
           'dataFileInputMapping' => array(
             'CLINICAL.TXT' => 'FALSE',
             'SNP.TXT' => 'snpData',
             'MRNA_DETAILED.TXT' => 'TRUE',
           ),
           'pivotData' => false,
           ...
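The gain over the previous slide is mostly mechanical: the same params blob, once pulled out of `searchapp.plugin_module`, becomes a formatted, per-analysis file that can be diffed and reviewed. transmart-data renders these as PHP arrays; as a rough illustration of the same step (not transmart-data's actual generator; assumes `python3` is available, and the blob is a toy excerpt), a one-line JSON blob can be pretty-printed into a file like so:

```shell
# A one-line JSON blob, as it might come out of the params column.
blob='{"id":"heatmap","name":"Heatmap","pivotData":false}'

# Pretty-print it into a per-analysis file (one file per analysis).
echo "$blob" | python3 -m json.tool > /tmp/heatmap.params.json
cat /tmp/heatmap.params.json
```

Once each analysis lives in its own formatted file, changes to a single analysis show up as a small, reviewable diff instead of an edit buried inside a kilobyte-long string literal.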
  • 19. Rserve
       Targets for Rserve: download/build R; install R packages; start Rserve; install a System V init script for Rserve; idem for systemd.
         cd R
         make -j8 bin/root/R    # some packages don't support concurrent builds
         make install_packages
         make start_Rserve
         make start_Rserve.dbg
         TRANSMART_USER=tomcat7 sudo -E make install_rserve_init
         TRANSMART_USER=tomcat7 sudo -E make install_rserve_unit
  • 20. Solr
       Solr (4.5.0) automatically downloaded and configured; Solr cores automatically created. The user only needs to create a schema file and dataconfig.xml.
         # set up & start Solr (psql)
         make start
         # just configure
         make solr_home
         make <core>_full_import
         make <core>_delta_import
         make clean_cores
         ORACLE=1 make start
  • 21. transmartApp Configuration
       Out-of-tree config management:
       - Targets for installing files
       - Zero configuration for dev!
       - Customization allowed without touching the target files
       - Only supports our branches
       - But a lot of configuration should be in-tree instead!
         # install everything
         # previous files are backed up
         make install
         # just one file:
         make install_Config.groovy
         make install_BuildConfig.groovy
         make install_DataSource.groovy
         # customizations in:
         #   Config-extra.php
         #   BuildConfig.groovy (limited)
  • 22. Current Limitations
       - DB upgrades not handled
       - Only a few ETL pipelines supported
       - Oracle support is behind PostgreSQL
       - Tooling shares a repository with application data
       (Image: © Joost J. Bakker, CC BY 2.0)
