Probabilistic algorithms for fun and pseudorandom profit - Tyler Treat
There's an increasing demand for real-time data ingestion and processing. Systems like Apache Kafka, Samza, and Storm have become popular for this reason. This type of high-volume, online data processing presents an interesting set of new challenges, namely, how do we drink from the firehose without getting drenched? Explore some of the fundamental primitives used in stream processing and, specifically, how we can use probabilistic methods to solve the problem.
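The abstract doesn't say which probabilistic structures the talk covers, so as one hedged illustration of the general idea, here is a minimal Bloom filter in Python: a fixed-size bit array that answers set-membership queries with no false negatives and a tunable false-positive rate, which is exactly the memory-for-exactness trade that makes these structures attractive for high-volume streams. The parameters and names below are illustrative only.

```python
import hashlib

class BloomFilter:
    """Approximate set membership: no false negatives, tunable false-positive rate."""

    def __init__(self, m=1 << 20, k=5):
        self.m = m                      # number of bits in the filter
        self.k = k                      # number of hash positions per item
        self.bits = bytearray(m // 8)

    def _positions(self, item):
        # Derive k bit positions from a single SHA-256 digest of the item.
        digest = hashlib.sha256(item.encode()).digest()
        for i in range(self.k):
            chunk = digest[4 * i:4 * i + 4]
            yield int.from_bytes(chunk, "big") % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))

seen = BloomFilter()
seen.add("user:42")
print("user:42" in seen, "user:99" in seen)  # True, (almost certainly) False
```

Counting analogues such as the count-min sketch and HyperLogLog follow the same pattern: bounded memory, bounded and quantifiable error.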
The document describes running a sample analysis using the CloudBurst tool on Hadoop to map 100,000 Illumina reads against a Streptococcus suis reference genome, allowing up to 3 mismatches. It provides details of the sample input data formats and locations, the commands used to format the data and run CloudBurst, and snippets of the output showing the mapping job's progress and its successful completion in under 20 minutes.
This document summarizes lecture slides from a university computer science class on operating system processes and virtual memory. The slides covered:
- Last week's discussion of process creation in Rust and the fork system call.
- The plan for this week's lectures, including how the kernel makes processes and diving into the fork.c source code.
- An overview of virtual memory and how it provides memory isolation between processes using paging and segmentation.
- Details of how x86 processors implement virtual memory using segmentation and paging tables, address translation, and handling of page faults.
- The history and evolution of virtual memory from early mainframe systems to modern desktop processors.
Putting a Fork in Fork (Linux Process and Memory Management) - David Evans
The document discusses several topics related to the computer science class cs4414 at the University of Virginia:
- Updates were due Sunday at 11:59pm including progress updates and scheduling design reviews.
- Tuesday's class will feature a guest lecture on authentication using single sign-on.
- The last class covered translation lookaside buffers and paging/segmentation concepts.
- A code sample is shown and analyzed that causes a segmentation fault due to accessing memory outside the allocated space.
- Details are provided on limiting resources and viewing process limits.
The document discusses PIG on Storm. It begins with an example of using PIG to perform tokenization, grouping, and counting of sentences. It then shows how the same operations can be done in Storm directly and in PIG running on Storm. The document outlines different Storm execution modes and how PIG can provide state management and sliding windows. It also introduces the concept of a hybrid mode where parts of a PIG script can run on Storm or MapReduce automatically.
The document discusses the process for context switching between tasks in Rust. It explains that the current task is grabbed from thread-local storage and its ability to sleep is checked. The next task and cleanup function are prepared. Unsafe transmutes are used to get mutable references to tasks. The task contexts are swapped using a raw operation, placing the scheduler and next task in the proper locations. On the return swap, the cleanup function is immediately run.
Kurator is an open-source workflow platform for data curation tools. It aims to detect and flag data quality issues, repair issues when possible with human curation as needed, and track provenance of automatic and human edits. Kurator uses scientific workflow systems like Kepler to automate computational aspects of curation. It also employs script-based approaches and YesWorkflow annotations to provide workflow views and capture provenance from scripts. This allows leveraging existing tools and programming expertise while providing workflow benefits such as automation, scaling, and provenance tracking.
InfluxDB IOx Tech Talks: The Impossible Dream: Easy-to-Use, Super Fast Softw... - InfluxData
The document discusses how an easy-to-use and fast database can have a complicated implementation for developers. It outlines four key areas: 1) Flexible writing schema requires schema merging at read time. 2) Fast reads prune non-covered data chunks through predicate push-down. 3) Loading duplicated data necessitates data deduplication and compaction operations. 4) Quick data deletion still needs data elimination at read time or in the background. The document provides examples to illustrate the tradeoffs between user and developer requirements.
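As a rough sketch of the deduplication point (an illustration of the idea only, not IOx's actual implementation), merging overlapping chunks can keep the last-written value for each (series key, timestamp) pair:

```python
# Hypothetical rows: (series_key, timestamp, value). When chunks overlap,
# the most recently written value for a (series_key, timestamp) pair wins.
def deduplicate(chunks):
    latest = {}
    for chunk in chunks:                      # later chunks override earlier ones
        for series_key, ts, value in chunk:
            latest[(series_key, ts)] = value
    # Return the surviving rows sorted by series key and time.
    return sorted((k, ts, v) for (k, ts), v in latest.items())

chunk_a = [("cpu,host=a", 1, 0.5), ("cpu,host=a", 2, 0.7)]
chunk_b = [("cpu,host=a", 2, 0.9)]            # duplicate point, newer write
print(deduplicate([chunk_a, chunk_b]))
# [('cpu,host=a', 1, 0.5), ('cpu,host=a', 2, 0.9)]
```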
This document provides an overview of troubleshooting streaming replication in PostgreSQL. It begins with introductions to write-ahead logging and replication internals. Common troubleshooting tools are then described, including built-in views and functions as well as third-party tools. Finally, specific troubleshooting cases are discussed such as replication lag, WAL bloat, recovery conflicts, and high CPU recovery usage. Throughout, examples are provided of how to detect and diagnose issues using the various tools.
R can be used for large scale data analysis on Hadoop through tools like Rhadoop, R + Hadoop Streaming, and Rhipe. The document demonstrates using Rhadoop to analyze a large mortality dataset on Hadoop. It shows how to install Rhadoop, run a word count example, and analyze causes of death from the dataset using MapReduce jobs with R scripts. While Rhadoop allows scaling R to big data on Hadoop, its development is ongoing so backward compatibility must be considered. Other options like Pig with R may provide better integration than Rhadoop alone.
This document discusses using PostgreSQL statistics to optimize performance. It describes various statistics sources like pg_stat_database, pg_stat_bgwriter, and pg_stat_replication that provide information on operations, caching, and replication lag. It also provides examples of using these sources to identify issues like long transactions, temporary file growth, and replication delays.
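For example, pg_stat_database exposes per-database commit counts, temp-file activity, and buffer hits/reads, from which a cache hit ratio can be derived. A small sketch using the psycopg2 driver (the connection string is a placeholder):

```python
import psycopg2

# Placeholder DSN; adjust to your environment.
conn = psycopg2.connect("dbname=postgres user=postgres host=localhost")
with conn, conn.cursor() as cur:
    cur.execute("""
        SELECT datname,
               xact_commit,
               temp_files,
               round(100.0 * blks_hit / nullif(blks_hit + blks_read, 0), 2) AS cache_hit_pct
        FROM pg_stat_database
        ORDER BY cache_hit_pct NULLS LAST
    """)
    for datname, commits, temp_files, hit_pct in cur.fetchall():
        print(f"{datname}: commits={commits} temp_files={temp_files} cache_hit={hit_pct}%")
```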
This document provides an overview of pgCenter, an open source tool for monitoring and managing PostgreSQL databases. It summarizes pgCenter's main features, which include displaying statistics on databases, tables, indexes and functions; monitoring long running queries and statements; managing connections to multiple PostgreSQL instances; and performing administrative tasks like viewing logs, editing configuration files, and canceling queries. Use cases and examples of how pgCenter can help optimize PostgreSQL performance are also provided.
MapReduce - Basics | Big Data Hadoop Spark Tutorial | CloudxLab - CloudxLab
Big Data with Hadoop & Spark Training: http://bit.ly/2skCodH
This CloudxLab Understanding MapReduce tutorial helps you to understand MapReduce in detail. Below are the topics covered in this tutorial:
1) Thinking in Map / Reduce
2) Understanding Unix Pipeline
3) Examples to understand MapReduce
4) Merging
5) Mappers & Reducers
6) Mapper Example
7) Input Split
8) mapper() & reducer() Code
9) Example - Count number of words in a file using MapReduce
10) Example - Compute Max Temperature using MapReduce
11) Hands-on - Count number of words in a file using MapReduce on CloudxLab (a Python sketch of the word-count example follows this list)
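The tutorial's own hands-on examples run on CloudxLab; as a hedged sketch of the word-count idea from items 9 and 11, a mapper and reducer written as Hadoop Streaming scripts in Python could look like the following (file names and the invocation are assumptions, not taken from the tutorial):

```python
#!/usr/bin/env python3
# mapper.py - emit "word<TAB>1" for every word on stdin
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")
```

```python
#!/usr/bin/env python3
# reducer.py - sum counts per word; Hadoop delivers input sorted by key
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    if word != current_word:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, 0
    current_count += int(count)
if current_word is not None:
    print(f"{current_word}\t{current_count}")
```

Locally the two scripts compose exactly like the Unix pipeline of item 2: cat input.txt | ./mapper.py | sort | ./reducer.py.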
The document describes how to use Gawk to perform data aggregation from log files on Hadoop by having Gawk act as both the mapper and reducer to incrementally count user actions and output the results. Specific user actions are matched and counted using operations like incrby and hincrby and the results are grouped by user ID and output to be consumed by another system. Gawk is able to perform the entire MapReduce job internally without requiring Hadoop.
Hadoop is an open-source framework for distributed storage and processing of large datasets across clusters of computers. It consists of HDFS for distributed storage and MapReduce for distributed processing. HDFS stores large files across multiple machines and provides high throughput access to application data. MapReduce allows processing of large datasets in parallel by splitting the work into independent tasks called maps and reduces. Companies use Hadoop for applications like log analysis, data warehousing, machine learning, and scientific computing on large datasets.
The webinar covered new features and updates to the Nephele 2.0 bioinformatics analysis platform. Key updates included a new website interface, improved performance through a new infrastructure framework, the ability to resubmit jobs by ID, and interactive mapping file submission. New pipelines for 16S analysis using DADA2 and quality control preprocessing were introduced, and the existing 16S mothur pipeline was updated. The quality control pipeline provides tools to assess data quality before running microbiome analyses through FastQC, primer/adapter trimming with cutadapt, and additional quality filtering options. The webinar emphasized the importance of data quality checks and highlighted troubleshooting tips such as examining the log file for error messages when jobs fail.
So you want to get started with Hadoop, but how? This session will show you how to get started with Hadoop development using Pig. Prior Hadoop experience is not needed.
Thursday, May 8th, 02:00pm-02:50pm
A PhD student in the Graduate Institute of Electrical Engineering at National Taiwan University who devotes his time to promoting the R language. He has organized numerous R workshops and regularly shares his R experience with the Taiwan R User Group. He has extensive hands-on experience with R, from data collection and cleaning through analysis to report production, and specializes in building tailored R data-analysis systems for project needs and in writing high-performance algorithms with R and C++.
This document provides an introduction to using RHadoop to interface R with Hadoop. It recommends downloading a Cloudera VM with CentOS, CDH5.3, R 3.x, and Java 1.7 installed. It then recommends downloading RHadoop packages and installing rhdfs, rhbase, rmr2, and plyrmr packages in R. It provides guidance on getting started with RHadoop, including ensuring required packages are installed, enabling HDFS, and potentially configuring the JAVA_HOME environment variable. It also provides pointers for debugging RHadoop programs when run with different Hadoop backends and modes.
Data Modeling, Normalization, and De-Normalization | PostgresOpen 2019 | Dimi... - Citus Data
As a developer using PostgreSQL one of the most important tasks you have to deal with is modeling the database schema for your application. In order to achieve a solid design, it’s important to understand how the schema is then going to be used as well as the trade-offs it involves.
As Fred Brooks said: “Show me your flowcharts and conceal your tables, and I shall continue to be mystified. Show me your tables, and I won’t usually need your flowcharts; they’ll be obvious.”
In this talk we're going to see practical normalisation examples and their benefits, and also review some anti-patterns and their typical PostgreSQL solutions, including Denormalization techniques thanks to advanced Data Types.
The document provides interview questions and answers related to Hadoop. It discusses common InputFormats in Hadoop like TextInputFormat, KeyValueInputFormat, and SequenceFileInputFormat. It also describes concepts like InputSplit, RecordReader, partitioner, combiner, job tracker, task tracker, jobs and tasks relationship, debugging Hadoop code, and handling lopsided jobs. HDFS, its architecture, replication, and reading files from HDFS is also covered.
Spencer Christensen
There are many aspects to managing an RDBMS. Some of these are handled by an experienced DBA, but there are a good many things that any sys admin should be able to take care of if they know what to look for.
This presentation will cover basics of managing Postgres, including creating database clusters, overview of configuration, and logging. We will also look at tools to help monitor Postgres and keep an eye on what is going on. Some of the tools we will review are:
* pgtop
* pg_top
* pgfouine
* check_postgres.pl
Check_postgres.pl is a great tool that can plug into your Nagios or Cacti monitoring systems, giving you even better visibility into your databases.
The document provides an overview of various Apache Pig features including:
- The Grunt shell which allows interactive execution of Pig Latin scripts and access to HDFS.
- Advanced relational operators like SPLIT, ASSERT, CUBE, SAMPLE, and RANK for transforming data.
- Built-in functions and user defined functions (UDFs) for data processing. Macros can also be defined.
- Running Pig in local or MapReduce mode and accessing HDFS from within Pig scripts.
PostgreSQL has advanced in many ways but bloat remains a challenge. A solution for this in development is zheap, a new storage format in which only the latest version of the data is kept in main storage and the old version will be moved to an undo log. In this presentation delivered at Postgres Vision 2018, Robert Haas, a Major Contributor to the PostgreSQL project who is leading development of zheap at EnterpriseDB, where he is Vice President, Chief Database Architect, explains the project.
This document discusses PostgreSQL backups and disaster recovery. It covers the need for different types of backups like logical and physical backups. It discusses how to store backups and automate the backup process. The document also covers how to validate backups are working properly and tools that can be used. It emphasizes that both logical and physical backups are important to have for different recovery scenarios. Automation is recommended to manage the complex backup processes.
This document summarizes a presentation on a Stata module called "parallel" for parallel computing. It discusses the motivation for parallel computing given large administrative datasets and powerful computers. It describes how parallel works by splitting datasets and tasks across computer clusters to accelerate computations. Benchmarks show parallel processing provides near-linear speedups on tasks like Monte Carlo simulations and reshaping large databases. The syntax and usage focuses on applications well-suited for parallel like simulations and loops, while noting commands like regressions may not benefit as much.
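The split-and-recombine idea is easy to see outside Stata as well; here is a purely illustrative Python sketch (not the parallel module itself) that spreads a Monte Carlo estimate of pi across worker processes:

```python
import random
from multiprocessing import Pool

def count_hits(n):
    """Count random points that fall inside the unit quarter-circle."""
    hits = 0
    for _ in range(n):
        x, y = random.random(), random.random()
        if x * x + y * y <= 1.0:
            hits += 1
    return hits

if __name__ == "__main__":
    total, workers = 4_000_000, 4
    per_worker = total // workers
    with Pool(workers) as pool:
        # Each worker simulates an independent slice; results are summed at the end.
        hits = sum(pool.map(count_hits, [per_worker] * workers))
    print("pi ~", 4 * hits / total)
```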
Pig Latin is a data flow language for analyzing large datasets that provides an alternative to SQL and MapReduce. It allows programmers to write scripts as a sequence of steps and handles optimization and parallelization. Pig Latin supports user-defined functions, flexible nested data models, and interactive debugging. The language is implemented by compiling logical query plans into physical MapReduce jobs and allows for lazy execution. It is well suited for tasks like temporal analysis, session analysis, and rollups on large log and web crawl data.
GridSQL is an open source distributed database built on PostgreSQL that allows it to scale horizontally across multiple servers by partitioning and distributing data and queries. It provides significantly improved performance over a single PostgreSQL instance for large datasets and queries by parallelizing processing across nodes. However, it has some limitations compared to PostgreSQL such as lack of support for advanced SQL features, slower transactions, and need for downtime to add nodes.
Performance Tuning Cheat Sheet for MongoDB - Severalnines
Bart Oles - Severalnines AB
Database performance affects organizational performance, and we tend to look for quick fixes when under stress. But how can we better understand our database workload and factors that may cause harm to it? What are the limitations in MongoDB that could potentially impact cluster performance?
In this talk, we will show you how to identify the factors that limit database performance. We will start with the free MongoDB Cloud monitoring tools. Then we will move on to log files and queries. To be able to achieve optimal use of hardware resources, we will take a look into kernel optimization and other crucial OS settings. Finally, we will look into how to examine performance of MongoDB replication.
This document provides an overview and introduction to PostgreSQL for new users. It covers getting started with PostgreSQL, including installing it, configuring authentication and logging, upgrading to new versions, routine maintenance tasks, hardware recommendations, availability and scalability options, and query tuning and optimization. The document is presented as a slide deck with different sections labeled by letters (e.g. K-0, S-0, U-0).
Beyond Breakpoints: A Tour of Dynamic Analysis - C4Media
Video and slides synchronized, mp3 and slide download available at URL http://bit.ly/2dXUUTG.
Nathan Taylor provides an introduction to the dynamic analysis research space, suggesting integrating these techniques into various internal tools. Filmed at qconnewyork.com.
Nathan Taylor is a software developer currently employed at Fastly, where he works on making the Web faster through high performance content delivery. Previous gigs have included hacking on low-level systems software such as Java runtimes at Twitter and, prior to that, the Xen virtual machine monitor in grad school.
The document discusses GRelC, a project that aims to design and deploy the first Grid Database Management System (Grid-DBMS) for the Globus community. It describes how GRelC allows for dynamic and transparent access to distributed, heterogeneous databases in a grid environment. Key features of GRelC include authentication, authorization, access control policies, data encryption, and support for single and multi-query operations across multiple database management systems.
Leveraging Hadoop in your PostgreSQL Environment - Jim Mlodgenski
This talk will begin with a discussion of the strengths of PostgreSQL and Hadoop. We will then lead into a high level overview of Hadoop and its community of projects like Hive, Flume and Sqoop. Finally, we will dig down into various use cases detailing how you can leverage Hadoop technologies for your PostgreSQL databases today. The use cases will range from using HDFS for simple database backups to using PostgreSQL and Foreign Data Wrappers to do low latency analytics on your Big Data.
Non-Relational Databases: This hurts. I like it. - Onyxfish
The document discusses non-relational databases, providing an overview of their characteristics and comparing them to relational databases. It outlines some popular non-relational database platforms, and uses the example of an open government project to demonstrate how CouchDB could be used to store and query schema-less data in a scalable way.
Reproducible Computational Pipelines with Docker and Nextflow - inside-BigData.com
This document summarizes a presentation about using Docker and Nextflow to create reproducible computational pipelines. It discusses two major challenges in computational biology being reproducibility and complexity. Containers like Docker help address these challenges by creating portable and standardized environments. Nextflow is introduced as a workflow framework that allows pipelines to run across platforms and isolates dependencies using containers, enabling fast prototyping. Examples are given of using Nextflow with Docker to run pipelines on different systems like HPC clusters in a scalable and reproducible way.
This document provides an overview of Google's Bigtable distributed storage system. It describes Bigtable's data model as a sparse, multidimensional sorted map indexed by row, column, and timestamp. Bigtable stores data across many tablet servers, with a single master server coordinating metadata operations like tablet assignment and load balancing. The master uses Chubby, a distributed lock service, to track which tablet servers are available and reassign tablets if servers become unreachable.
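To make the data model concrete, here is a toy, in-memory rendering of a map keyed by (row, column, timestamp); this only illustrates the indexing scheme described above and has nothing to do with Bigtable's actual tablet storage:

```python
# (row_key, column_family:qualifier, timestamp) -> value
table = {}

def put(row, column, timestamp, value):
    table[(row, column, timestamp)] = value

def read_row(row):
    """Return all cells of a row, newest timestamp first within each column."""
    cells = [(col, ts, val) for (r, col, ts), val in table.items() if r == row]
    return sorted(cells, key=lambda c: (c[0], -c[1]))

put("com.example/index.html", "contents:", 3, "<html>v3</html>")
put("com.example/index.html", "contents:", 1, "<html>v1</html>")
put("com.example/index.html", "anchor:cnn.com", 2, "Example")
print(read_row("com.example/index.html"))
```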
Oracle GoldenGate is a software package for real-time data integration and replication between heterogeneous systems. It enables solutions for high availability, real-time data integration, transactional change data capture, data replication, transformations, and verification. Oracle GoldenGate provides core data movement capabilities as well as visual management and monitoring tools. It supports replication between various database technologies like Oracle, IBM DB2, Microsoft SQL Server, and others. GoldenGate allows organizations to access and deliver real-time enterprise data across different systems.
Logs: Can’t Hate Them, Won’t Love Them: Brief Log Management Class by Anton C... - Anton Chuvakin
Logging is essential for security, operations, and compliance. However, common mistakes in log management include not logging at all, not reviewing logs, retaining logs for too short a time, prioritizing log collection, ignoring application logs, and only searching for known bad events. Effective log management requires collecting all relevant logs and retaining them for appropriate time periods according to a well-defined strategy.
String Comparison Surprises: Did Postgres lose my data? - Jeremy Schneider
Comparisons are fundamental to computing - and comparing strings is not nearly as straightforward as you might think. Come learn about the history, nuance and surprises of “putting words in order” that you never knew existed in computer science, and how that nuance impacts both general programming and SQL programming. Next, walk through a few actual scenarios and demonstrations using PostgreSQL as a user and administrator, which you can re-run yourself later for further study, including one way you could easily corrupt your self-managed PostgreSQL database if you aren't prepared. Finally we’ll dive into an explanation of the surprising behaviors we saw in PostgreSQL, and learn more about user and administrative features PostgreSQL provides related to localized string comparison.
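A quick way to see why "putting words in order" depends on the collation rules in effect rather than on the bytes themselves is to sort the same list under raw codepoint order and under a linguistic locale; the Python sketch below assumes the en_US.UTF-8 locale is installed on the system:

```python
import locale

words = ["zebra", "Apple", "apple", "Zürich"]

# Codepoint order: every uppercase letter sorts before every lowercase one,
# so "Zürich" lands before "apple" and "zebra".
print(sorted(words))

# Linguistic order under a specific locale: case and accents are weighted
# differently, giving a dictionary-style ordering instead.
locale.setlocale(locale.LC_COLLATE, "en_US.UTF-8")   # assumed to be installed
print(sorted(words, key=locale.strxfrm))
```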
Whitepaper: Mining the AWR repository for Capacity Planning and Visualization - Kristofferson A
This scenario discusses how to mine Oracle AWR reports to better understand database performance. AWR provides detailed performance data that can be queried and visualized to analyze workload characteristics, notice trends, and identify bottlenecks over time. The document shows an example SQL query that retrieves disk I/O metrics from the AWR for specified time intervals. It emphasizes correlating performance across the database, operating system, and application to fully diagnose overall response time issues. Mining AWR effectively in a time-series format with relevant statistics side-by-side allows for quick trend analysis, statistical modeling, and faster performance optimization.
PostgreSQL is a free and open-source relational database management system that provides high performance and reliability. It supports replication through various methods including log-based asynchronous master-slave replication, which the presenter recommends as a first option. The upcoming PostgreSQL 9.4 release includes improvements to replication such as logical decoding and replication slots. Future releases may add features like logical replication consumers and SQL MERGE statements. The presenter took questions at the end and provided additional resources on PostgreSQL replication.
This document discusses TensorFlow, an open-source machine learning framework. It describes how TensorFlow works using graphs to represent computations and can be run on CPUs, GPUs, or in a distributed manner across multiple devices. It also introduces elearn, a TensorFlow as a Service platform that handles infrastructure concerns like distributed storage, GPU/CPU resource management, and model versioning to simplify machine learning development.
NLP-Focused Applied ML at Scale for Global Fleet Analytics at ExxonMobil - Databricks
ExxonMobil leveraged machine learning at scale using Databricks to extract insights from equipment maintenance logs and improve operations. The logs contained both structured and unstructured text data across a global fleet maintained in legacy systems, limiting traditional analysis. By ingesting and enriching over 60 million records using natural language processing, the system identified outliers, enabled capacity planning, and prioritized maintenance tasks, projected to save millions annually through more effective reliability and maintenance guidance.
This document discusses polyglot persistence using Spring Data. It describes how Spring Data provides a common programming model for data access across different data stores like SQL databases, NoSQL databases and more. It provides examples of defining entities, repository interfaces and queries using Spring Data's JPA, MongoDB and QueryDSL modules. Spring Data aims to improve developer productivity by simplifying data access code and enabling applications to use multiple data sources.
Similar to tranSMART Community Meeting 5-7 Nov 13 - Session 3: transmart-data (20)
tranSMART Community Meeting 5-7 Nov 13 - Session 3: The TraIT user stories fo... - David Peyruc
This document provides an overview of the TraIT project and existing demonstrators using tranSMART. It discusses the TraIT roadmap and user stories being implemented at the Netherlands Cancer Institute. Key points include:
- TraIT aims to support translational research through integrated data and tools across clinical, imaging, biobanking and experimental domains.
- Existing demonstrators using tranSMART include DeCoDe (colorectal cancer) and PCMM (prostate cancer).
- The roadmap involves enhancing tranSMART functionality based on user needs and integrating additional data sources.
- At NKI, tranSMART will provide an integrated research data warehouse with clinical and research data from various sources and departments.
tranSMART Community Meeting 5-7 Nov 13 - Session 3: Characterization of the c... - David Peyruc
tranSMART Community Meeting 5-7 Nov 13 - Session 3: Characterization of the cell phenotypes involved in metastasis
Characterization of the cell phenotypes involved in metastasis: Using tranSMART to enable high-throughput heterogeneous data integration and analysis
Brian Athey, University of Michigan
tranSMART Community Meeting 5-7 Nov 13 - Session 5: Advancing tranSMART Analy... - David Peyruc
tranSMART Community Meeting 5-7 Nov 13 - Session 5: Advancing tranSMART Analytical Capabilities with Knowledge Content
Sirimon Ocharoen, Thomson Reuters
To effectively analyze data in tranSMART, biological analysis/knowledge-based approach is needed. Through a case study, we will demonstrate how system biology content can be integrated in tranSMART to enable functional analysis and biological interpretation. We will also share our experience and user feedbacks from various projects.
tranSMART Community Meeting 5-7 Nov 13 - Session 5: Recent tranSMART Lessons ... - David Peyruc
tranSMART Community Meeting 5-7 Nov 13 - Session 5: Recent tranSMART Lessons Learned in Academic and Life Science Settings
Dan Housman, Recombinant by Deloitte
The Recombinant by Deloitte team has worked with organizations such as Kimmel Cancer Center as a model to adapt existing mature i2b2 implementations to meet business and scientific needs. Other organizations are increasingly focused on how to use cloud and high performance computing models to achieve different performance levels. Advanced initiatives are progressing to link commercial tools such as Qlikview to explore tranSMART data and to solve for key gaps in scientific pipelines. Dan will present recent lessons learned, new capabilities, and some of the impact on the path forwards for future tranSMART updates.
tranSMART Community Meeting 5-7 Nov 13 - Session 5: EMIF (European Medical In... - David Peyruc
The document discusses the European Medical Information Framework (EMIF) project. EMIF aims to create a platform and framework to integrate patient-level health data from across Europe to enable new research insights. Specifically, EMIF is developing tools and standards to pool data from various sources on over 48 million subjects from 7 EU countries. This will support research on predictors of metabolic diseases and Alzheimer's disease. EMIF is using the tranSMART platform to load clinical trial data and cohorts on over 33,000 subjects for analysis. The goal is for EMIF to become a trusted European hub for healthcare data to optimize clinical research.
tranSMART Community Meeting 5-7 Nov 13 - Session 5: The Accelerated Cure Proj... - David Peyruc
tranSMART Community Meeting 5-7 Nov 13 - Session 5: The Accelerated Cure Project MS Repository Dataset as a Case Study
Stephen Wicks, Rancho Biosciences
The Accelerated Cure Project for Multiple Sclerosis is a non-profit focused on accelerating research for a cure for MS. One of their major projects over the last decade has been the generation of the ACP Repository, a collection of biological samples and associated clinical data from approximately 3200 case or control participants. More than 75 studies are underway or have been completed, in both industry and academic settings, using samples from the ACP Repository. Rancho BioSciences has partnered with ACP through Orion Bionetworks to curate and load these datasets and associated clinical CRFs into tranSMART. In this talk, we will describe the rich ACP dataset and discuss our experiences in preparing the data for analysis in tranSMART.
tranSMART Community Meeting 5-7 Nov 13 - Session 3: Modularization (Plug‐Ins,... - David Peyruc
The document discusses the development of new plugins for the TranSMART platform to add genomic visualization capabilities. It describes requirements like adding an HTML5 genome browser and supporting visualization of genomic variants and copy number variation data. It then details the process of consulting the community to choose the Dalliance genome browser and MyDAS backend, and extending the core API to support these plugins. The plugins were implemented and added to TranSMART to provide the new genomic visualization features.
The document outlines the key roles and values of a foundation to support the tranSMART platform, including:
- Stimulating awareness of project activities, functionalities, and data standards through communications
- Coordinating data curation and identifying opportunities for collaboration or common interest data sets
- Providing an app store for translational research plugins with various pricing models
- Ensuring quality, education, and training
It proposes establishing working groups and hiring a full-time community manager to address issues like lack of data transparency, siloed development, and ineffective project communications. The manager would facilitate engagement, updates, and synergies across stakeholders.
tranSMART Community Meeting 5-7 Nov 13 - Session 3: Pfizer’s Recent Use of tr... - David Peyruc
The document summarizes Pfizer's use of the tranSMART platform for various genomics and clinical data analyses including genome-wide association studies (GWAS), supporting exploratory data types like metabolomics and FACS data, and large collaborative efforts like the Alzheimer's Disease Neuroimaging Initiative (ADNI) and Parkinson's Progression Markers Initiative (PPMI) datasets. It also discusses analytical integration with Genedata Expressionist and plans for future enhancements to tranSMART like improved GWAS support and additional genotype data. Contributors to these efforts are acknowledged.
tranSMART Community Meeting 5-7 Nov 13 - Session 3: tranSMART and the One Min... - David Peyruc
tranSMART Community Meeting 5-7 Nov 13 - Session 3: tranSMART and the One Mind for Research Data Exchange Portal
Jeff Grethe, One Mind for Research
One Mind for Research (http://1mind4research.org) is an independent, non-partisan, nonprofit organization dedicated to curing the diseases of the brain and eliminating the stigma and discrimination associated with mental illness and brain injuries. tranSMART will be a core application within the One Mind Brain Data Exchange Portal, scheduled to launch publicly in 2014. Traumatic Brain Injury (TBI) affects an estimated 10 million people worldwide, and tranSMART is one of the core applications within the portal used by researchers who are looking to improve diagnostics and discover more effective treatments for patients suffering from CNS- and TBI-related diseases.
tranSMART Community Meeting 5-7 Nov 13 - Session 3: tranSMART a Data Warehous... - David Peyruc
tranSMART Community Meeting 5-7 Nov 13 - Session 3: tranSMART a Data Warehouse for Translational Medicine at Takeda Pharmaceuticals International
Dave Marberg, Takeda
We have used the tranSMART platform to construct a warehouse containing data from several Takeda clinical trials, proprietary preclinical drug activity studies, 1600 Gene Expression Omnibus studies, and data from TCGA, CCLE, and other sources. All gene expression data has been globally normalized. We extended the tranSMART platform with a set of R function calls to enable cross-study queries and analysis via the rich toolset available in R. The utility of the data warehouse is exemplified by a study in which we built a predictive model for drug sensitivities. The model was trained on gene expression and IC50 data from cell lines and was found to correctly predict drug activity in oncology indications.
tranSMART Community Meeting 5-7 Nov 13 - Session 3: transmart’s application t... - David Peyruc
tranSMART Community Meeting 5-7 Nov 13 - Session 3: tranSMART’s Application to Clinical Biomarker Discovery Studies in Sanofi
Sherry Cao, Sanofi
This presentation will discuss challenges we are encountering in clinical biomarker discovery studies and how we are using tranSMART to help address them.
tranSMART Community Meeting 5-7 Nov 13 - Session 3: Simulation in tranSMART - David Peyruc
Dave King gave a presentation on November 6, 2013 about interactive visualization with tranSMART. The presentation explained that tranSMART supports modular, abstracted visualization through application programming interfaces, improving connectivity, and concluded that these features are important for interactive data visualization.
tranSMART Community Meeting 5-7 Nov 13 - Session 3: Clinical Biomarker Discovery - David Peyruc
tranSMART Community Meeting 5-7 Nov 13 - Session 3: tranSMART’s Application to Clinical Biomarker Discovery Studies in Sanofi
Sherry Cao, Sanofi
This presentation will discuss challenges we are encountering in clinical biomarker discovery studies and how we are using tranSMART to help address them.
tranSMART Community Meeting 5-7 Nov 13 - Session 2: Developing a TR Community... - David Peyruc
tranSMART Community Meeting 5-7 Nov 13 - Session 2: Developing a Translational Research Community around the tranSMART Platform
Keith Elliston, tranSMART Foundation
tranSMART Community Meeting 5-7 Nov 13 - Session 2: Herding Cat - David Peyruc
This document discusses managing open source communities and projects. It notes that open source communities involve not just developers but also users, installers, documentation writers, and support staff. Contributions come from new code, bug fixes, documentation, training materials, and feature requests. Projects need coordination, communication through mailing lists and meetings, and quality assurance through testing. Both incentives like acknowledging contributions and treats like involvement opportunities help encourage participation and "herd the cats" of an open source community.
tranSMART Community Meeting 5-7 Nov 13 - Session 2: MongoDB: What, Why And When - David Peyruc
tranSMART Community Meeting 5-7 Nov 13 - Session 2: MongoDB: What, Why And When
Massimo Brignoli, MongoDB Inc
The presentation will illustrate what MongoDB is, the advantages of the document based approach and some of the use cases where MongoDB is a perfect fit.
tranSMART Community Meeting 5-7 Nov 13 - Session 2: Creating a Comprehensive ... - David Peyruc
tranSMART Community Meeting 5-7 Nov 13 - Session 2: Creating a Comprehensive Clinical and 'Omics Information Commons on Autism
Paul Avillach, Harvard University
HCL Notes and Domino License Cost Reduction in the World of DLAU - panagenda
Webinar Recording: https://www.panagenda.com/webinars/hcl-notes-and-domino-license-cost-reduction-in-the-world-of-dlau/
The introduction of DLAU and the CCB & CCX licensing model caused quite a stir in the HCL community. As a Notes and Domino customer, you may have faced challenges with unexpected user counts and license costs. You probably have questions on how this new licensing approach works and how to benefit from it. Most importantly, you likely have budget constraints and want to save money where possible. Don’t worry, we can help with all of this!
We’ll show you how to fix common misconfigurations that cause higher-than-expected user counts, and how to identify accounts which you can deactivate to save money. There are also frequent patterns that can cause unnecessary cost, like using a person document instead of a mail-in for shared mailboxes. We’ll provide examples and solutions for those as well. And naturally we’ll explain the new licensing model.
Join HCL Ambassador Marc Thomas in this webinar with a special guest appearance from Franz Walder. It will give you the tools and know-how to stay on top of what is going on with Domino licensing. You will be able to lower your cost through an optimized configuration and keep it low going forward.
These topics will be covered
- Reducing license cost by finding and fixing misconfigurations and superfluous accounts
- How do CCB and CCX licenses really work?
- Understanding the DLAU tool and how to best utilize it
- Tips for common problem areas, like team mailboxes, functional/test users, etc.
- Practical examples and best practices to implement right away
Connector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectors - DianaGray10
Join us to learn how UiPath Apps can directly and easily interact with prebuilt connectors via Integration Service--including Salesforce, ServiceNow, Open GenAI, and more.
The best part is you can achieve this without building a custom workflow! Say goodbye to the hassle of using separate automations to call APIs. By seamlessly integrating within App Studio, you can now easily streamline your workflow, while gaining direct access to our Connector Catalog of popular applications.
We’ll discuss and demo the benefits of UiPath Apps and connectors including:
Creating a compelling user experience for any software, without the limitations of APIs.
Accelerating the app creation process, saving time and effort
Enjoying high-performance CRUD (create, read, update, delete) operations for seamless data management.
Speakers:
Russell Alfeche, Technology Leader, RPA at qBotic and UiPath MVP
Charlie Greenberg, host
5th LF Energy Power Grid Model Meet-up Slides - DanBrown980551
5th Power Grid Model Meet-up
It is with great pleasure that we extend to you an invitation to the 5th Power Grid Model Meet-up, scheduled for 6th June 2024. This event will adopt a hybrid format, allowing participants to join us either through an online Microsoft Teams session or in person at TU/e located at Den Dolech 2, Eindhoven, Netherlands. The meet-up will be hosted by Eindhoven University of Technology (TU/e), a research university specializing in engineering science & technology.
Power Grid Model
The global energy transition is placing new and unprecedented demands on Distribution System Operators (DSOs). Alongside upgrades to grid capacity, processes such as digitization, capacity optimization, and congestion management are becoming vital for delivering reliable services.
Power Grid Model is an open source project from Linux Foundation Energy and provides a calculation engine that is increasingly essential for DSOs. It offers a standards-based foundation enabling real-time power systems analysis, simulations of electrical power grids, and sophisticated what-if analysis. In addition, it enables in-depth studies and analysis of the electrical power grid’s behavior and performance. This comprehensive model incorporates essential factors such as power generation capacity, electrical losses, voltage levels, power flows, and system stability.
Power Grid Model is currently being applied in a wide variety of use cases, including grid planning, expansion, reliability, and congestion studies. It can also help in analyzing the impact of renewable energy integration, assessing the effects of disturbances or faults, and developing strategies for grid control and optimization.
What to expect
For the upcoming meetup we are organizing, we have an exciting lineup of activities planned:
- Insightful presentations covering two practical applications of the Power Grid Model.
- An update on the latest advancements in Power Grid Model technology during the first and second quarters of 2024.
- An interactive brainstorming session to discuss and propose new feature requests.
- An opportunity to connect with fellow Power Grid Model enthusiasts and users.
Introduction of Cybersecurity with OSS at Code Europe 2024 - Hiroshi SHIBATA
I develop the Ruby programming language, RubyGems, and Bundler, which are package managers for Ruby. Today, I will introduce how to enhance the security of your application using open-source software (OSS) examples from Ruby and RubyGems.
The first topic is CVE (Common Vulnerabilities and Exposures). I have published CVEs many times. But what exactly is a CVE? I'll provide a basic understanding of CVEs and explain how to detect and handle vulnerabilities in OSS.
Next, let's discuss package managers. Package managers play a critical role in the OSS ecosystem. I'll explain how to manage library dependencies in your application.
I'll share insights into how the Ruby and RubyGems core team works to keep our ecosystem safe. By the end of this talk, you'll have a better understanding of how to safeguard your code.
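As one concrete, hedged illustration of the dependency hygiene discussed here (the exact tooling covered in the talk may differ), a Ruby project's lockfile can be checked against known advisories with the bundler-audit gem:

# A minimal sketch using bundler-audit; this tool is an assumption, not necessarily the one from the talk
gem install bundler-audit
bundle-audit update    # fetch the latest ruby-advisory-db data
bundle-audit check     # scan Gemfile.lock for gems with known CVEs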
How information systems are built or acquired puts information, which is what they should be about, in a secondary place. Our language adapted accordingly, and we no longer talk about information systems but applications. Applications evolved in a way to break data into diverse fragments, tightly coupled with applications and expensive to integrate. The result is technical debt, which is re-paid by taking even bigger "loans", resulting in an ever-increasing technical debt. Software engineering and procurement practices work in sync with market forces to maintain this trend. This talk demonstrates how natural this situation is. The question is: can something be done to reverse the trend?
What is an RPA CoE? Session 1 – CoE VisionDianaGray10
In the first session, we will review the organization's vision and how this has an impact on the COE Structure.
Topics covered:
• The role of a steering committee
• How do the organization’s priorities determine CoE Structure?
Speaker:
Chris Bolin, Senior Intelligent Automation Architect Anika Systems
Your One-Stop Shop for Python Success: Top 10 US Python Development Providersakankshawande
Simplify your search for a reliable Python development partner! This list presents the top 10 trusted US providers offering comprehensive Python development services, ensuring your project's success from conception to completion.
Ivanti’s Patch Tuesday breakdown goes beyond patching your applications and brings you the intelligence and guidance needed to prioritize where to focus your attention first. Catch early analysis on our Ivanti blog, then join industry expert Chris Goettl for the Patch Tuesday Webinar Event. There we’ll do a deep dive into each of the bulletins and give guidance on the risks associated with the newly-identified vulnerabilities.
HCL Notes and Domino license cost reduction in the world of DLAUpanagenda
Webinar Recording: https://www.panagenda.com/webinars/hcl-notes-und-domino-lizenzkostenreduzierung-in-der-welt-von-dlau/
DLAU and licensing under the CCB and CCX models have been a hot topic in the HCL community since last year. As a Notes or Domino customer, you may be struggling with unexpectedly high user counts and license fees. You may be wondering how this new kind of licensing works and what benefit it brings you. Above all, you surely want to stay within your budget and save costs wherever possible. We understand that, and we want to help!
We explain how to resolve common configuration problems that can cause more users to be counted than necessary, and how to identify and remove superfluous or unused accounts to save money. There are also approaches that can lead to unnecessary expenses, for example when a person document is used instead of a mail-in for shared mailboxes. We show you such cases and their solutions. And of course we explain the new license model.
Join this webinar, in which HCL Ambassador Marc Thomas and guest speaker Franz Walder introduce you to this new world. It will give you the tools and know-how to keep track of everything. You will be able to reduce your costs through an optimized Domino configuration and keep them low in the future.
Topics covered:
- Reducing license costs by finding and fixing misconfigurations and superfluous accounts
- How do CCB and CCX licenses really work?
- Understanding the DLAU tool and how best to use it
- Tips for common problem areas, such as team mailboxes, functional/test users, etc.
- Practical examples and best practices to implement right away
[OReilly Superstream] Occupy the Space: A grassroots guide to engineering (an...Jason Yip
The typical problem in product engineering is not bad strategy, so much as “no strategy”. This leads to confusion, lack of motivation, and incoherent action. The next time you look for a strategy and find an empty space, instead of waiting for it to be filled, I will show you how to fill it in yourself. If you’re wrong, it forces a correction. If you’re right, it helps create focus. I’ll share how I’ve approached this in the past, both what works and lessons for what didn’t work so well.
Driving Business Innovation: Latest Generative AI Advancements & Success StorySafe Software
Are you ready to revolutionize how you handle data? Join us for a webinar where we’ll bring you up to speed with the latest advancements in Generative AI technology and discover how leveraging FME with tools from giants like Google Gemini, Amazon, and Microsoft OpenAI can supercharge your workflow efficiency.
During the hour, we’ll take you through:
Guest Speaker Segment with Hannah Barrington: Dive into the world of dynamic real estate marketing with Hannah, the Marketing Manager at Workspace Group. Hear firsthand how their team generates engaging descriptions for thousands of office units by integrating diverse data sources—from PDF floorplans to web pages—using FME transformers, like OpenAIVisionConnector and AnthropicVisionConnector. This use case will show you how GenAI can streamline content creation for marketing across the board.
Ollama Use Case: Learn how Scenario Specialist Dmitri Bagh has utilized Ollama within FME to input data, create custom models, and enhance security protocols. This segment will include demos to illustrate the full capabilities of FME in AI-driven processes.
Custom AI Models: Discover how to leverage FME to build personalized AI models using your data. Whether it’s populating a model with local data for added security or integrating public AI tools, find out how FME facilitates a versatile and secure approach to AI.
We’ll wrap up with a live Q&A session where you can engage with our experts on your specific use cases, and learn more about optimizing your data workflows with AI.
This webinar is ideal for professionals seeking to harness the power of AI within their data management systems while ensuring high levels of customization and security. Whether you're a novice or an expert, gain actionable insights and strategies to elevate your data processes. Join us to see how FME and AI can revolutionize how you work with data!
For the full video of this presentation, please visit: https://www.edge-ai-vision.com/2024/06/temporal-event-neural-networks-a-more-efficient-alternative-to-the-transformer-a-presentation-from-brainchip/
Chris Jones, Director of Product Management at BrainChip, presents the “Temporal Event Neural Networks: A More Efficient Alternative to the Transformer” tutorial at the May 2024 Embedded Vision Summit.
The expansion of AI services necessitates enhanced computational capabilities on edge devices. Temporal Event Neural Networks (TENNs), developed by BrainChip, represent a novel and highly efficient state-space network. TENNs demonstrate exceptional proficiency in handling multi-dimensional streaming data, facilitating advancements in object detection, action recognition, speech enhancement and language model/sequence generation. Through the utilization of polynomial-based continuous convolutions, TENNs streamline models, expedite training processes and significantly diminish memory requirements, achieving notable reductions of up to 50x in parameters and 5,000x in energy consumption compared to prevailing methodologies like transformers.
Integration with BrainChip’s Akida neuromorphic hardware IP further enhances TENNs’ capabilities, enabling the realization of highly capable, portable and passively cooled edge devices. This presentation delves into the technical innovations underlying TENNs, presents real-world benchmarks, and elucidates how this cutting-edge approach is positioned to revolutionize edge AI across diverse applications.
GraphRAG for Life Science to increase LLM accuracyTomaz Bratanic
GraphRAG for the life science domain, where you retrieve information from biomedical knowledge graphs using LLMs to increase the accuracy and performance of generated answers.
In the realm of cybersecurity, offensive security practices act as a critical shield. By simulating real-world attacks in a controlled environment, these techniques expose vulnerabilities before malicious actors can exploit them. This proactive approach allows manufacturers to identify and fix weaknesses, significantly enhancing system security.
This presentation delves into the development of a system designed to mimic Galileo's Open Service signal using software-defined radio (SDR) technology. We'll begin with a foundational overview of both Global Navigation Satellite Systems (GNSS) and the intricacies of digital signal processing.
The presentation culminates in a live demonstration. We'll showcase the manipulation of Galileo's Open Service pilot signal, simulating an attack on various software and hardware systems. This practical demonstration serves to highlight the potential consequences of unaddressed vulnerabilities, emphasizing the importance of offensive security practices in safeguarding critical infrastructure.
Taking AI to the Next Level in Manufacturing.pdfssuserfac0301
Read Taking AI to the Next Level in Manufacturing to gain insights on AI adoption in the manufacturing industry, such as:
1. How quickly AI is being implemented in manufacturing.
2. Which barriers stand in the way of AI adoption.
3. How data quality and governance form the backbone of AI.
4. Organizational processes and structures that may inhibit effective AI adoption.
6. Ideas and approaches to help build your organization's AI strategy.
Have you ever been confused by the myriad of choices offered by AWS for hosting a website or an API?
Lambda, Elastic Beanstalk, Lightsail, Amplify, S3 (and more!) can each host websites + APIs. But which one should we choose?
Which one is cheapest? Which one is fastest? Which one will scale to meet our needs?
Join me in this session as we dive into each AWS hosting service to determine which one is best for your scenario and explain why!
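To make the comparison concrete, here is a hedged sketch of the simplest of those options, static hosting on S3; the bucket name is illustrative, and newer AWS accounts block public access by default, so a bucket policy or a CloudFront distribution is usually needed on top:

# Hypothetical static-site deployment to S3 (bucket name made up for illustration)
aws s3 mb s3://my-example-site
aws s3 website s3://my-example-site --index-document index.html --error-document error.html
aws s3 sync ./build s3://my-example-site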
3. Typical Branch Distribution
Grails Code:
- transmartApp (without full repo history, always with wrong ancestry information ⇒ merging quite difficult)
- RModules (if you’re lucky), but analyses definitions in DB not provided
Database:
- SQL scripts on top of GPL 1.0 dump or later. Probably insufficient/won’t apply
- Stored procedures for ETL. Overlapping definitions with yours, but no history ⇒ merging quite difficult
- Manual fixups always required (even if just permissions/synonyms)
4. Typical Branch Distribution (II)
ETL:
- High variability in strategies
- Instructions/sample data rarely provided
- Kettle scripts are problematic
Solr/Rserve/Configuration:
- Solr schemas/dataimport.xml perpetually forgotten
- Idem for information on R packages
- Sample configuration rarely provided
5. Versioning Control
Version control used ONLY for Grails Code...
But often squashed and with wrong ancestor information.
Forget about database, Solr, most of ETL.
Result:
- Merges are very difficult
- Changes cannot easily be tracked
- Changes’ wherefores are unknown
- Regressions are introduced (no conflicts)
- Collaboration is based on e-mail attachments
6. Automation
Even with all the pieces...
- Setting up a new branch takes days; weeks for non-basic functionality
- No reproducibility in the process!
Result:
- Devs driven away from a fully local environment (too much work)
- Robust environment for CI passed over (too much work)
- Bugs cannot be reliably reproduced (see also: no consistent usage of VCS)
- Time wasted with deployment-specific mistakes/inconsistencies
7. Why?!
“The ‘source code’ for a work means the preferred form of the work for making modifications to it.” (GPL v3, section 1)
Is everyone holding back “source code”?
More likely explanation: no appropriate tooling being used.
8. Situation for tranSMART 1.1
The situation is much better! Some problems remain, though.
The Good:
- Create/populate DB is easy
- Most stuff is versioned
- CI for builds
- Image available
- Public issue tracking
The Bad:
- No Oracle support
- Changes to DB scripts/seed data are ad hoc (lax structure)
- No mechanism to support/compare schemas with other branches
- R analyses are JSON blobs in TSVs
- No VCS for Solr or Rserve/images’ setup
- Setting up Solr/Rserve is time-consuming
- Population of DB with sample data is still time-consuming
- Config changes required for dev
9. Description of transmart-data
We developed transmart-data to address most of these problems. transmart-data is a set of scripts for managing tranSMART’s environment and certain application data (e.g. Solr schemas, DDL, seed data), which is used by the scripts and sometimes generated by them. It has a makefile-based interface.
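To make that interface concrete, here is a minimal first-run sketch assembled from commands shown on the slides that follow (configuration on slide 12, DDL and seed data on slide 14); it assumes a local checkout of transmart-data and a PostgreSQL target:

# Minimal first-run sketch, assuming a checkout of transmart-data and PostgreSQL
cp vars.sample vars    # start from the sample configuration (slide 12)
vim vars               # edit the PG*/ORA* connection settings
source vars            # export the variables so make can pick them up
make postgres          # create the DDL and load the seed data (slide 14)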
10. transmart-data: Purposes
Purposes of transmart-data:
1. Allow setting up a complete dev environment quickly (< 30 min)
2. Bring versioning to the database schema and Solr files
3. Set up the Solr runtime
4. Invoke ETL pipelines
5. Set up Rserve
Target audience: Programmers
11. transmart-data: Non-purposes
Non-purposes of transmart-data:
1. Setting up a production environment (some components can be used)
2. New users evaluating tranSMART (use a pre-built image)
3. Building transmartApp or its plugin dependencies (build them yourself or use artifacts from Bamboo/Nexus)
12. Configuration
Environment variable based configuration:
cp vars.sample vars
vim vars      # edit file
source vars
# contents of vars:
PGHOST=/tmp
PGPORT=5432
PGDATABASE=transmart
PGUSER=$USER
PGPASSWORD=
TABLESPACES=$HOME/pg/tablespaces/
PGSQL_BIN=$HOME/pg/bin/
ORAHOST=localhost
ORAPORT=1521
ORASID=orcl
ORAUSER="sys as sysdba"
ORAPASSWORD=mypassword
ORACLE_MANAGE_TABLESPACES=0
# continues...
13. Database Schema Management
Support for Oracle and Postgres.
Postgres:
- Uses pg_dump(all)
- Parses the dump files
# Dump
make -C postgres/ddl dump
make -C postgres/ddl/GLOBAL extensions.sql roles.sql
# Load
make -C postgres/ddl load
Oracle:
- Queries dba_* tables
- Dumps DDL w/ DBMS_METADATA
# Dump
make -C oracle/ddl dump
# Load
make oracle
14. Seed Data
Only Postgres for now.
# Dump
# Tables to dump are listed in postgres/data/<schema>_list
make -C postgres/data dump
make -C postgres/common minimize_diffs
# Load
make -C postgres/data load
# Load DDL and data
make postgres
Only for basic stuff with no ETL!
Pretty fast (DDL + data loaded in 10s)
15. ETL (I)
Unified interface for ETL.
Prepare dataset:
1. Prepare ETL-specific source files
2. Prepare file with ETL-specific params
3. Upload dataset to CDN (optional)
For each new ETL pipeline, support must be added.
Load dataset:
make -C samples/{oracle,postgres} load_<type>_<studyid>
# Example:
make -C samples/postgres load_clinical_GSE8581
Everything is automated!
17. RModules Analyses (tsApp-DB)
Situation in transmartApp-DB:
update searchapp.plugin_module
set params = '{"id":"survivalAnalysis","converter":{"R":["source(''||PLUGINSCRIPTDIRECTORY||Common/dataBuilders.R'')",
  "source(''||PLUGINSCRIPTDIRECTORY||Common/ExtractConcepts.R'')","source(''||PLUGINSCRIPTDIRECTORY||Common/collapsingData.R'')",
  "source(''||PLUGINSCRIPTDIRECTORY||Common/BinData.R'')","source(''||PLUGINSCRIPTDIRECTORY||Survival/BuildSurvivalData.R'')",
  "\tSurvivalData.build(\n\tinput.dataFile=''||TEMPFOLDERDIRECTORY||Clinical/clinical.i2b2trans'',\n\tconcept.time=''||TIME||'',
  \n\tconcept.category=''||CATEGORY||'',\n\tconcept.eventYes=''||EVENTYES||'',\n\tbinning.enabled=''||BINNING||'',
  \n\tbinning.bins=''||NUMBERBINS||'',\n\tbinning.type=''||BINNINGTYPE||'',\n\tbinning.manual=''||BINNINGMANUAL||'',
  \n\tbinning.binrangestring=''||BINNINGRANGESTRING||'',\n\tbinning.variabletype=''||BINNINGVARIABLETYPE||'',
  \n\tinput.gexFile=''||TEMPFOLDERDIRECTORY||mRNA/Processed_Data/mRNA.trans'',\n\tinput.snpFile=''||TEMPFOLDERDIRECTORY||SNP/snp.trans'',
  \n\tconcept.category.type=''||TYPEDEP||'',\n\tgenes.category=''||GENESDEP||'',\n\tgenes.category.aggregate=''||AGGREGATEDEP||'',
  \n\tsample.category=''||SAMPLEDEP||'',\n\ttime.category=''||TIMEPOINTSDEP||'',\n\tsnptype.category=''||SNPTYPEDEP||'')\n\t"]},
  "name":"Survival Analysis","dataFileInputMapping":{"CLINICAL.TXT":"TRUE","SNP.TXT":"snpData","MRNA_DETAILED.TXT":"mrnaData"},
  "dataTypes":{"subset1":["CLINICAL.TXT"]},"pivotData":false,"view":"SurvivalAnalysis",
  "processor":{"R":["source(''||PLUGINSCRIPTDIRECTORY||Survival/CoxRegressionLoader.r'')","CoxRegression.loader(input.filename=''outputfile'')",
  "source(''||PLUGINSCRIPTDIRECTORY||Survival/SurvivalCurveLoader.r'')","SurvivalCurve.loader(input.filename=''outputfile'', concept.time=''||TIME||'')"]},
  "renderer":{"GSP":"/survivalAnalysis/survivalAnalysisOutput"}, ... (goes on)'
where module_name = 'pgsurvivalAnalysis';
Not very nice...
18. RModules Analyses (transmart-data)
In transmart-data:
- One file per analysis
- Files can be generated from DB data
- Sanely formatted
- But we really want to remove this from the DB!
array (
  'id' => 'heatmap',
  'name' => 'Heatmap',
  'dataTypes' =>
  array (
    'subset1' =>
    array (
      0 => 'CLINICAL.TXT',
    ),
  ),
  'dataFileInputMapping' =>
  array (
    'CLINICAL.TXT' => 'FALSE',
    'SNP.TXT' => 'snpData',
    'MRNA_DETAILED.TXT' => 'TRUE',
  ),
  'pivotData' => false,
  ...
19. Rserve
Targets for Rserve:
- Download/build R
- Install R packages
- Start Rserve
- Install System V init script for Rserve
- Idem for systemd
cd R
make -j8 bin/root/R
# some packages don't support concurrent builds
make install_packages
make start_Rserve
make start_Rserve.dbg
TRANSMART_USER=tomcat7 sudo -E make install_rserve_init
TRANSMART_USER=tomcat7 sudo -E make install_rserve_unit
20. Solr
- Solr (4.5.0) automatically downloaded and configured
- Solr cores automatically created
- User only needs to create a schema file and dataconfig.xml
# setup & start solr (psql)
make start
# just configure
make solr_home
make <core>_full_import
make <core>_delta_import
make clean_cores
ORACLE=1 make start
21. transmartApp Configuration
Out-of-tree config management:
- Targets for installing files
- Zero configuration for dev!
- Customization allowed without touching the target files
- Only supports our branches
- But a lot of configuration should be in-tree instead!
# install everything
# (previous files are backed up)
make install
# just one file:
make install_Config.groovy
make install_BuildConfig.groovy
make install_DataSource.groovy
# customizations in:
# Config-extra.php
# BuildConfig.groovy (limited)