Building ETL pipelines for tranSMART 17.X - New tools for the data loader

Alessia Peviani – Data Engineer

Tools overview: data flow and dependencies
[Diagram: two pipelines, with arrows marking data flow and code dependency.
Manual pipeline: generic data, generic ontology, Jupyter Notebook, TMTK, Arborist (plugin), Arborist (standalone), transmart-copy, TranSMART 17.X instance.
Automated ETL pipeline: CSR data, FHIR data, generic data, generic ontology, CLAML ontology, csr2transmart, fhir2transmart, ontology2transmart, claml2transmart, transmart-loader, transmart-copy, TranSMART 17.X instance, slicing tool, TranSMART 17.X instance (slice).]

Why Manual Pipelines
• One-time / infrequent data loading
• Changing data format
• Initial data exploration / modeling
• Tools developed internally for data scientists, and to facilitate collaboration with data owners (researchers, clinicians)

Why Automated ETL Pipelines
• Frequent updates
• Stable data format
• Typically large volumes of data
• Tools developed for specific projects, but with flexibility in mind to enable use in future ETL pipelines

1. Tools for manual data loading
TMTK & Arborist, transmart-copy

TMTK: The TranSMART Toolkit
• Easier collaboration with data owners:
   • Option to load tree structure, word mapping, and metadata from an Excel file (“Template file”)
   • Interactive tree structure modeling with the Arborist
• Updated to support TranSMART 17.X features:
   • Added sheets to the template file
   • Added export in transmart-copy format
https://github.com/thehyve/tmtk
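
To make the word mapping idea concrete, here is a minimal sketch in plain Python. It does not use the TMTK API: the column names, the layout of the mapping, and the `apply_word_mapping` helper are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical word mapping: (column name, raw value) -> display value.
WordMapping = Dict[Tuple[str, str], str]


def apply_word_mapping(rows: List[Dict[str, str]], mapping: WordMapping) -> List[Dict[str, str]]:
    """Return copies of the rows with raw values replaced by their mapped values."""
    mapped = []
    for row in rows:
        # Values without a mapping entry are kept unchanged.
        mapped.append({col: mapping.get((col, val), val) for col, val in row.items()})
    return mapped


# Illustrative data, not taken from a real study.
rows = [
    {"SUBJ_ID": "P1", "GENDER": "1"},
    {"SUBJ_ID": "P2", "GENDER": "2"},
]
mapping = {("GENDER", "1"): "Male", ("GENDER", "2"): "Female"}
mapped_rows = apply_word_mapping(rows, mapping)
```

Keeping the mapping in a separate table, rather than editing the data itself, is what lets a data owner review and change it in a spreadsheet.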

TMTK: The TranSMART Toolkit
Input template file

The Arborist
Two flavors of interactive tree modeling:
• Without leaving Jupyter Notebook (embedded)
• By sending the tree to a web service (standalone)
   • Share trees with data owners and let them edit
   • Re-import the final result into TMTK
Functionality (both versions):
• Add/remove tree nodes and metadata tags
• Edit node position and name
https://github.com/thehyve/arborist
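
The editing operations listed above (add or remove a node, attach metadata tags, change position and name) can be pictured with a small tree structure. This is only a sketch: the `TreeNode` class and its methods are hypothetical and are not the Arborist or TMTK data model.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class TreeNode:
    """Hypothetical tree node with a name, metadata tags, and ordered children."""
    name: str
    tags: Dict[str, str] = field(default_factory=dict)
    children: List["TreeNode"] = field(default_factory=list)

    def add_child(self, child: "TreeNode", position: Optional[int] = None) -> None:
        """Append the child, or insert it at the given position."""
        if position is None:
            self.children.append(child)
        else:
            self.children.insert(position, child)

    def remove_child(self, name: str) -> "TreeNode":
        """Remove and return the first direct child with this name."""
        for index, child in enumerate(self.children):
            if child.name == name:
                return self.children.pop(index)
        raise KeyError(name)

    def paths(self, prefix: str = "") -> List[str]:
        """List the backslash-separated paths of this node and its descendants."""
        path = prefix + "\\" + self.name
        result = [path]
        for child in self.children:
            result.extend(child.paths(path))
        return result


root = TreeNode("Study")
demographics = TreeNode("Demographics")
root.add_child(demographics)
demographics.add_child(TreeNode("Age"))
demographics.add_child(TreeNode("Gender"), position=0)  # edit position
demographics.children[1].name = "Age (years)"           # edit name
demographics.children[1].tags["Unit"] = "years"         # add a metadata tag
root.add_child(TreeNode("Scratch"))
root.remove_child("Scratch")                            # remove a node
tree_paths = root.paths()
```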

The Arborist: Embedded vs Standalone

The Arborist: Plug-in vs Standalone
https://arborist-test-trait.thehyve.net/

The Arborist: Plug-in vs Standalone
[Diagram labels: feedback, data owner]

Data loading with transmart-copy
• A simple tool with a simple function
• Batch and incremental data loading
• What you see (data tables) is what you get (transmart tables)
• Supports new TranSMART 17.X features
https://github.com/thehyve/transmart-core/tree/dev/transmart-copy
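
"What you see is what you get" can be illustrated by writing one tab-separated file per target table. The sketch below assumes a layout of one folder per database schema and one `.tsv` file per table; the schema name, table name, and column names are assumptions for illustration and should be checked against the transmart-copy documentation and the SURVEY0 example. The `write_table` helper is hypothetical.

```python
import csv
import tempfile
from pathlib import Path
from typing import Dict, List


def write_table(base_dir: Path, schema: str, table: str, rows: List[Dict[str, str]]) -> Path:
    """Write rows to <base_dir>/<schema>/<table>.tsv, with a header line."""
    target = base_dir / schema / f"{table}.tsv"
    target.parent.mkdir(parents=True, exist_ok=True)
    with target.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=list(rows[0].keys()), delimiter="\t")
        writer.writeheader()
        writer.writerows(rows)
    return target


# Assumed column names; each row is meant to end up as one row in the table.
concepts = [
    {"concept_cd": "AGE", "concept_path": "\\Demographics\\Age\\", "name_char": "Age"},
    {"concept_cd": "GENDER", "concept_path": "\\Demographics\\Gender\\", "name_char": "Gender"},
]

out_dir = Path(tempfile.mkdtemp())
tsv_path = write_table(out_dir, "i2b2demodata", "concept_dimension", concepts)
lines = tsv_path.read_text(encoding="utf-8").splitlines()
```

Because the files mirror the tables, inspecting a file before loading shows what the database will contain afterwards.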

transmart-copy
(TranSMART 17.X database schemas as seen in pgAdmin)
https://github.com/thehyve/transmart-core/tree/dev/transmart-copy/src/test/resources/examples/SURVEY0

transmart-copy vs transmart-batch
transmart-copy:
• A simple tool with a simple function
• Batch and incremental data loading
• What you see (data tables) is what you get (transmart tables)
• Supports new TranSMART 17.X features
transmart-batch:
• Complex, includes validation steps
• Batch loading only (inefficient in some cases)
• Does not make explicit what your data will look like once in transmart
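
One way to picture the difference between batch and incremental loading is with an in-memory table keyed by an identifier. This is a conceptual sketch only, and not how either tool is implemented; the function names and the `patient_id` key are hypothetical.

```python
from typing import Dict, List

Row = Dict[str, str]


def batch_load(rows: List[Row], key: str) -> Dict[str, Row]:
    """Build the table from scratch: any previous content is discarded."""
    return {row[key]: row for row in rows}


def incremental_load(table: Dict[str, Row], rows: List[Row], key: str) -> int:
    """Add only rows whose key is not in the table yet; return how many were added."""
    added = 0
    for row in rows:
        if row[key] not in table:
            table[row[key]] = row
            added += 1
    return added


table = batch_load([{"patient_id": "P1"}, {"patient_id": "P2"}], key="patient_id")
added = incremental_load(table, [{"patient_id": "P2"}, {"patient_id": "P3"}], key="patient_id")
```

With batch loading only, adding a single new patient means rebuilding the whole table, which is the kind of case where it becomes inefficient.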

2. Tools for automated ETL pipelines
transmart-loader & related tools, hyper-dicer

transmart-loader
• Python library encoding TranSMART entities as well-defined classes
• Ensures data export to transmart-copy compatible format
• Flexible tool: fixed output, but can be adapted to new input formats (various flavors)
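
The idea of encoding entities as well-defined classes with a fixed output can be sketched as follows. The class names, fields, and the `to_rows` helper are illustrative assumptions, not the actual transmart-loader API.

```python
from dataclasses import dataclass
from typing import Dict, List, Union


@dataclass(frozen=True)
class Concept:
    code: str
    name: str
    path: str


@dataclass(frozen=True)
class Patient:
    identifier: str
    sex: str


@dataclass(frozen=True)
class Observation:
    patient: Patient
    concept: Concept
    value: Union[str, float]


def to_rows(observations: List[Observation]) -> List[Dict[str, str]]:
    """Flatten typed observations into plain rows, one per observation."""
    rows = []
    for obs in observations:
        is_numeric = isinstance(obs.value, (int, float))
        rows.append({
            "patient": obs.patient.identifier,
            "concept_cd": obs.concept.code,
            "value_type": "N" if is_numeric else "T",  # numeric or text
            "value": str(obs.value),
        })
    return rows


age = Concept("AGE", "Age", "\\Demographics\\Age\\")
gender = Concept("GENDER", "Gender", "\\Demographics\\Gender\\")
patient = Patient("P1", "female")
observation_rows = to_rows([
    Observation(patient, age, 42.0),
    Observation(patient, gender, "Female"),
])
```

Any input format that can be mapped to these classes then shares the same export step, which is what keeps the output fixed while the input varies.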

transmart-loader based tools
transmart-loader-based mapping tools:
• csr2transmart
• fhir2transmart
• ontology2transmart
• claml2transmart
https://github.com/thehyve/transmart-loader
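
A mapping tool of this kind reads one source format and produces the common entities. The sketch below maps FHIR Patient resources (using the standard `resourceType`, `id`, `gender`, and `birthDate` fields) to plain dictionaries. The `map_fhir_patients` function and the output field names are hypothetical and do not reflect how fhir2transmart is implemented.

```python
from typing import Any, Dict, Iterable, List


def map_fhir_patients(resources: Iterable[Dict[str, Any]]) -> List[Dict[str, str]]:
    """Map FHIR Patient resources to simple patient records."""
    patients = []
    for resource in resources:
        if resource.get("resourceType") != "Patient":
            continue  # other resource types would need their own mapping
        patients.append({
            "identifier": resource["id"],
            "sex": resource.get("gender", "unknown"),
            "birth_date": resource.get("birthDate", ""),
        })
    return patients


# Illustrative input, not taken from a real data set.
resources = [
    {"resourceType": "Patient", "id": "P1", "gender": "female", "birthDate": "1980-05-17"},
    {"resourceType": "Condition", "id": "C1"},
    {"resourceType": "Patient", "id": "P2"},
]
patients = map_fhir_patients(resources)
```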

Slicing tool
• Developed for the DiFuture consortium to populate project-specific data warehouses
• Given a set of constraints, automatically extracts the corresponding data and loads them to another TranSMART instance
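
Extracting the data that corresponds to a set of constraints can be sketched as a recursive filter. The constraint shape used here (`type`, `args`, `studyId`, `conceptCode`) is loosely modelled on TranSMART query constraints and is an assumption, as are the `matches` and `slice_observations` helpers; the real tool queries a TranSMART instance rather than an in-memory list.

```python
from typing import Any, Dict, List

Observation = Dict[str, str]
Constraint = Dict[str, Any]


def matches(obs: Observation, constraint: Constraint) -> bool:
    """Check whether one observation satisfies a (possibly nested) constraint."""
    kind = constraint["type"]
    if kind == "true":
        return True
    if kind == "study_name":
        return obs["study"] == constraint["studyId"]
    if kind == "concept":
        return obs["concept"] == constraint["conceptCode"]
    if kind == "and":
        return all(matches(obs, arg) for arg in constraint["args"])
    if kind == "or":
        return any(matches(obs, arg) for arg in constraint["args"])
    raise ValueError(f"Unsupported constraint type: {kind}")


def slice_observations(observations: List[Observation], constraint: Constraint) -> List[Observation]:
    """Keep only the observations that satisfy the constraint."""
    return [obs for obs in observations if matches(obs, constraint)]


observations = [
    {"study": "SURVEY0", "concept": "AGE", "patient": "P1"},
    {"study": "SURVEY0", "concept": "GENDER", "patient": "P1"},
    {"study": "OTHER", "concept": "AGE", "patient": "P9"},
]
constraint = {
    "type": "and",
    "args": [
        {"type": "study_name", "studyId": "SURVEY0"},
        {"type": "concept", "conceptCode": "AGE"},
    ],
}
sliced = slice_observations(observations, constraint)
```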

Slicing tool
https://github.com/thehyve/transmart-hyper-dicer
• Official name: transmart-hyper-dicer (short for “TranSMART hypercube dicer”)
Functionality:
• Requires an input JSON file with query constraints
• Extracts the relevant part of the ontology for the given set of data
• Populates an EXISTING (empty) TranSMART instance
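
Extracting the relevant part of the ontology can be understood as keeping only the tree paths that lead to concepts present in the selected data. The following is a minimal sketch of that idea, assuming backslash-separated concept paths; `relevant_paths` is a hypothetical helper, not a function of transmart-hyper-dicer.

```python
from typing import Dict, List, Set


def relevant_paths(concept_paths: Dict[str, str], used_concepts: Set[str]) -> List[str]:
    """Return the paths of the used concepts, plus the paths of all their ancestors."""
    keep: Set[str] = set()
    for code in used_concepts:
        parts = concept_paths[code].strip("\\").split("\\")
        # Add every prefix of the path, from the top-level folder down to the concept.
        for depth in range(1, len(parts) + 1):
            keep.add("\\" + "\\".join(parts[:depth]) + "\\")
    return sorted(keep)


# Illustrative ontology: concept code -> tree path.
concept_paths = {
    "AGE": "\\Demographics\\Age\\",
    "GENDER": "\\Demographics\\Gender\\",
    "HB": "\\Lab\\Hemoglobin\\",
}
kept = relevant_paths(concept_paths, {"AGE"})
```

In this example only the `Demographics` folder and the `Age` node survive; the `Gender` and `Lab` branches are left out because no selected observation refers to them.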

Conclusions
• TranSMART 17.X comes with a range of data loading tools, both for simple manual pipelines and for complex automated ETL pipelines.
• The tools are increasingly modular to improve maintainability (separate functions in separate tools).
• Development is mostly driven by specific ETL projects, but with flexibility in mind, so that the tools are easily adaptable to new use cases.

Workshop this afternoon!
ROOM B4-221 (next to lunch area) at 14:30
Jupyter Notebook & tools for:
• Data Loading to TranSMART
• TranSMART API calls
[Diagram: manual pipeline, plus Python API client and Jupyter Notebook]

Acknowledgements
• Gijs Kant, Software Architect
• Ewelina Grudzień, Software Engineer
• Artur Faizullin, System Administrator
• Stefan Payralbe, Data Engineer
• Jochem Bijlard, Data Engineer
• Ward Weistra (former Hyver)
• Brenda Hijmans, NKI (TMTK contributor)
