Building ETL pipelines for tranSMART 17.X - New tools for the data loader

An overview of tools for loading data into tranSMART 17.X, covering Jupyter Notebook workflows and automated ETL pipelines

Published in: Data & Analytics

  1. 1. Building ETL pipelines for tranSMART 17.X: New tools for the data loader. Alessia Peviani, Data Engineer
  2. 2. Tools overview: data flow and dependencies (diagram; arrows mark either data flow or code dependency)
        • Manual pipeline (Jupyter Notebook): generic data, generic ontology, TMTK, Arborist (plugin), Arborist (standalone), transmart-copy, TranSMART 17.X instance
        • Automated ETL pipeline: CSR data, FHIR data, generic ontology, CLAML ontology, generic data, csr2transmart, fhir2transmart, ontology2transmart, claml2transmart, transmart-loader, transmart-copy, slicing tool, TranSMART 17.X instance, TranSMART 17.X instance (slice)
  3. 3. Why Manual Pipelines
        • One-time / infrequent data loading
        • Changing data format
        • Initial data exploration / modeling
        • Tools developed internally for data scientists, and to facilitate collaboration with data owners (researchers, clinicians)
  4. 4. Why Automated ETL Pipelines
        • Frequent updates
        • Stable data format
        • Typically large volumes of data
        • Tools developed for specific projects, but with flexibility in mind to enable use in future ETL pipelines
  5. 5. Part 1: Tools for manual data loading (TMTK & Arborist, transmart-copy)
  6. 6. TMTK: The TranSMART Toolkit
        • Easier collaboration with data owners:
            • Option to load tree structure, word mapping, and metadata from an Excel file ("template file")
            • Interactive tree structure modeling with the Arborist
        • Updated to support TranSMART 17.X features:
            • Added additional sheets in the template file
            • Added export in transmart-copy format
        https://github.com/thehyve/tmtk
  7. 7. TMTK: The TranSMART Toolkit Input template file
  8. 8. The Arborist
        • Two flavors of interactive tree modeling:
            • Without leaving Jupyter Notebook (embedded)
            • By sending the tree to a web service (standalone)
                • Share trees with data owners and let them edit
                • Re-import the final result into TMTK
        • Functionality (both versions):
            • Add/remove tree nodes and metadata tags
            • Edit node position and name
        https://github.com/thehyve/arborist
  9. 9. The Arborist: Embedded vs Standalone
  10. 10. The Arborist: Embedded vs Standalone
  11. 11. The Arborist: Plug-in vs Standalone https://arborist-test-trait.thehyve.net/
  12. 12. The Arborist: Plug-in vs Standalone (diagram labels: feedback, data owner)
  13. 13. Data loading with transmart-copy
        • A simple tool with a simple function
        • Batch and incremental data loading
        • What you see (data tables) is what you get (transmart tables)
        • Supports new TranSMART 17.X features
        https://github.com/thehyve/transmart-core/tree/dev/transmart-copy
  14. 14. transmart-copy (screenshot: TranSMART 17.X database schemas as seen in pgAdmin)
        https://github.com/thehyve/transmart-core/tree/dev/transmart-copy/src/test/resources/examples/SURVEY0
  15. 15. transmart-copy vs transmart-batch
  16. 16. transmart-copy vs transmart-batch
        transmart-copy:
        • A simple tool with a simple function
        • Batch and incremental data loading
        • What you see (data tables) is what you get (transmart tables)
        • Supports new TranSMART 17.X features
        transmart-batch:
        • Complex, includes validation steps
        • Batch loading only (inefficient in some cases)
        • Does not make explicit what your data will look like once in transmart
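
The "what you see is what you get" idea from the transmart-copy slides can be illustrated with a minimal Python sketch. It assumes that transmart-copy reads a directory containing one tab-separated file per database table, grouped into one folder per schema, as the SURVEY0 example linked above suggests. The column subsets, the sample rows, and the helper name `write_tables` are made up for illustration; real tranSMART tables have many more columns.

```python
import csv
import tempfile
from pathlib import Path

# Assumption: one folder per schema, one .tsv file per table.
# The columns below are an illustrative subset, not the full table definitions.
TABLES = {
    ("i2b2demodata", "patient_dimension"): [
        {"patient_num": 1, "sex_cd": "female"},
        {"patient_num": 2, "sex_cd": "male"},
    ],
    ("i2b2demodata", "concept_dimension"): [
        {"concept_cd": "AGE", "name_char": "Age"},
    ],
}


def write_tables(out_dir, tables):
    """Write one tab-separated file per (schema, table) pair under out_dir."""
    written = []
    for (schema, table), rows in tables.items():
        path = Path(out_dir) / schema / f"{table}.tsv"
        path.parent.mkdir(parents=True, exist_ok=True)
        with path.open("w", newline="") as handle:
            writer = csv.DictWriter(
                handle,
                fieldnames=list(rows[0]),
                delimiter="\t",
                lineterminator="\n",
            )
            writer.writeheader()   # header row = column names of the target table
            writer.writerows(rows)  # each data row maps to one database row
        written.append(path)
    return written


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        for path in write_tables(tmp, TABLES):
            print(path.relative_to(tmp))
```

Because each file mirrors one table, the staged files can be inspected before loading to see what the database will contain.
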
  17. 17. Part 2: Tools for automated ETL pipelines (transmart-loader & related tools, hyper-dicer)
  18. 18. transmart-loader
        • Python library encoding TranSMART entities as well-defined classes
        • Ensures data export to transmart-copy compatible format
        • Flexible tool: the output is fixed, but it can be adapted to new input formats (various flavors)
  19. 19. transmart-loader based tools
        transmart-loader-based mapping tools:
        • csr2transmart
        • fhir2transmart
        • ontology2transmart
        • claml2transmart
        https://github.com/thehyve/transmart-loader
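
The pattern described for transmart-loader (entities as well-defined classes, a fixed output, and a mapping step for each input format) might look roughly like the sketch below. The class and function names here are hypothetical and are NOT the transmart-loader API; see the repository above for the real classes. The source row format is also invented.

```python
from dataclasses import dataclass

# Hypothetical entity classes for illustration only; not the transmart-loader API.


@dataclass(frozen=True)
class Concept:
    concept_code: str
    name: str


@dataclass(frozen=True)
class Patient:
    identifier: str
    sex: str


@dataclass(frozen=True)
class Observation:
    patient: Patient
    concept: Concept
    value: float


def from_source_rows(rows):
    """Input-specific step: map rows of a made-up source format to entities."""
    age = Concept("AGE", "Age")
    return [
        Observation(Patient(row["id"], row["sex"]), age, float(row["age"]))
        for row in rows
    ]


def to_observation_rows(observations):
    """Format-independent step: flatten entities to one plain dict per observation."""
    return [
        {
            "patient": obs.patient.identifier,
            "concept_cd": obs.concept.concept_code,
            "value": obs.value,
        }
        for obs in observations
    ]


if __name__ == "__main__":
    source = [{"id": "P1", "sex": "f", "age": "42"}]
    print(to_observation_rows(from_source_rows(source)))
```

In this arrangement only `from_source_rows` would change for a new input format, which is one way to read the "fixed output, adaptable input" bullet.
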
  20. 20. Slicing tool
        • Developed for the DiFuture consortium to populate project-specific data warehouses
        • Given a set of constraints, automatically extracts the corresponding data and loads them into another TranSMART instance
  21. 21. Slicing tool
        https://github.com/thehyve/transmart-hyper-dicer
        • Official name: transmart-hyper-dicer (short for "TranSMART hypercube dicer")
        Functionality:
        • Requires an input JSON file with query constraints
        • Extracts the relevant part of the ontology for the given set of data
        • Populates an EXISTING (empty) TranSMART instance
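
As a rough illustration of the "input JSON file with query constraints", the sketch below builds a constraint in the style of the tranSMART REST API v2 (`study_name`, `concept`, `and`, `or`) and serializes it. Whether transmart-hyper-dicer expects exactly this layout in its input file is an assumption; check the repository above for the actual format. The helper name and the concept codes are made up, and the study identifier is borrowed from the SURVEY0 example mentioned earlier.

```python
import json


def study_and_concepts(study_id, concept_codes):
    """Build a constraint selecting the given concepts within one study."""
    concept_args = [
        {"type": "concept", "conceptCode": code} for code in concept_codes
    ]
    return {
        "type": "and",
        "args": [
            {"type": "study_name", "studyId": study_id},
            {"type": "or", "args": concept_args},
        ],
    }


if __name__ == "__main__":
    # Hypothetical concept codes; replace with codes from the source instance.
    constraint = study_and_concepts("SURVEY0", ["age", "gender"])
    print(json.dumps(constraint, indent=2))
```
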
  22. 22. Conclusions
        • TranSMART 17.X comes with a range of data loading tools, both for simple manual pipelines and for complex automated ETL pipelines
        • The tooling is increasingly modular to improve maintainability (separate functions in separate tools)
        • Development is mostly driven by specific ETL projects, while keeping flexibility in mind so that the tools adapt easily to new use cases
  23. 23. Workshop this afternoon! Room B4-221 (next to lunch area) at 14:30
        Jupyter Notebook & tools for:
        • Data loading to TranSMART
        • TranSMART API calls
        (diagram: the manual pipeline components plus a Python API client in Jupyter Notebook)
  24. 24. Acknowledgements
        • Gijs Kant, Software Architect
        • Ewelina Grudzień, Software Engineer
        • Artur Faizullin, System Administrator
        • Stefan Payralbe, Data Engineer
        • Jochem Bijlard, Data Engineer
        • Ward Weistra (former Hyver)
        • Brenda Hijmans, NKI (TMTK contributor)
