Redefining ETL Pipelines with Apache Technologies to Accelerate Decision-Making and Execution for Clinical Trials


Pharmaceutical and medical device makers spend over $130bn each year collecting and analyzing new data, mostly through clinical trials. It costs over $1.8bn to bring a new drug to market, and over $4bn when factoring in the cost of failures. By understanding and analyzing this data more efficiently, companies can bring new drugs to patients faster, more safely, and at a lower cost.

In this presentation, Eran will discuss how ETL pipelines can be built using Apache and other open source projects to improve clinical trial development. We will examine how the system is built, the challenges we faced, and how we were able to reduce cost, accelerate execution time, and improve results. We will also demonstrate how reliable resource allocation, scalable data ingestion adapters, on-demand and fault-tolerant job deployments, and monitoring benefit clinical trial decision-making and execution.

Published in: Engineering


  1. Redefining ETL Pipelines with Apache Technologies to Accelerate Decision Making for Clinical Trials. Eran Withana
  2. Overview (www.comprehend.com)
      • Clinical Trials – Lay of the land
      • Business and Technical Requirements
      • Technology Evaluation
      • High Level Architecture
      • Implementation
      • Managing Hardware
      • Deployments
      • Data Adapters: Implementation and Failure Modes
      • Distributed File System
      • Challenges
      • Future Work
  3. About me
      • Open Source
          ◦ Member, PMC member and committer of ASF
          ◦ Apache Axis2, Web Services, Synapse, Airavata
      • Education
          ◦ PhD in Computer Science from Indiana University
      • Software engineer at Comprehend Systems
  5. Clinical Trials – Lay of the land
      • [Figure: Number of Drugs in Development Worldwide (Source: CenterWatch Drugs in Clinical Trial Database 2014)]
      • Source: http://www.phrma.org/innovation/clinical-trials
  6. Clinical Trials – Lay of the Land
      • Multiple stakeholders: Study Managers, Program Managers, Monitors, Data Managers, Bio-statisticians, Executives, Medical Affairs, Regulatory, Vendors, CROs, CRAs
      • Diagram labels: Sites, Labs, Patients, Safety, EDC, Reports, Data, PV Data, Excel, Sponsor, Contract Research Organization (CRO), Sites and Investigators
      • Diagram callouts: Latent, Fragmented
  7. For decades, clinical development was primarily paper-based.
  8. Various Software and Practices Used in Each Layer
      • Diagram labels: Medidata, CROs and SIs, Technologies
  9. Clinical Trials with Centralized Monitoring
      • Diagram labels: Clinical Operations, Sites, Labs, Patients, Clinical Analytics & Collaboration, Data, Safety, EDC, PV Data, Excel
      • Diagram callouts: Consolidated, Real-time, Self-Service, Mobile
  10. Providing up-to-date answers
      • Diagram labels: Executives, Medical Review, CRAs, Data Management, Clinical Operations, EDC, CTMS, Safety, ePro, Other, Web, Ad-Hoc, Mobile, Collab
  12. Business Requirements
      • FDA, HIPAA compliance
      • Metadata/database structure synchronization
          ◦ Less frequent (once a day)
      • Data synchronization
          ◦ More frequent (multiple times a day)
      • Ability to plug in various data sources
          ◦ RAVE, MERGE, BioClinica, file imports, DB-to-DB syncs
      • Real-time event propagation
          ◦ Adverse events (AEs): the need for early identification
  13. Technical Requirements
      • Hardware agnostic for resiliency and better utilization
      • Repeatable deployments
      • Real-time processing and real-time events
      • Fault tolerance
      • In-flight and end-state metrics for alerting and monitoring
      • Flexible and pluggable adapter architecture
      • Time travel
      • Audit trails
      • Report generation
  14. Core Design Principles
      • Events all the way
      • Shared event bus for multiple consumers
      • Use of language agnostic data representations (via protobuf)
      • Automatic datacenter resources management (Mesos/Marathon/Docker)
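
A minimal sketch of the "shared event bus for multiple consumers" principle, under stated assumptions. The slides list Apache Kafka as the event queue and protobuf as the wire format; this sketch uses neither. The `EventBus` class, the topic name, and the event fields are hypothetical, and JSON stands in for protobuf so that the example needs no generated classes.

```python
import json
from typing import Callable, Dict, List

Handler = Callable[[dict], None]


class EventBus:
    """In-process stand-in for a shared event bus (hypothetical, not Kafka)."""

    def __init__(self) -> None:
        self._handlers: Dict[str, List[Handler]] = {}

    def subscribe(self, topic: str, handler: Handler) -> None:
        """Register one more consumer for a topic."""
        self._handlers.setdefault(topic, []).append(handler)

    def publish(self, topic: str, event: dict) -> int:
        """Deliver the event to every consumer; return how many received it."""
        # Serialize once to a language-neutral byte payload.
        # JSON is an assumption here, standing in for protobuf.
        payload = json.dumps(event, sort_keys=True).encode("utf-8")
        handlers = self._handlers.get(topic, [])
        for handler in handlers:
            # Each consumer decodes its own copy of the payload.
            handler(json.loads(payload.decode("utf-8")))
        return len(handlers)


if __name__ == "__main__":
    bus = EventBus()
    alerts: List[dict] = []
    audit_log: List[dict] = []
    bus.subscribe("adverse_event", alerts.append)     # alerting consumer
    bus.subscribe("adverse_event", audit_log.append)  # audit consumer
    bus.publish("adverse_event", {"study": "S-001", "severity": "serious"})
    print(len(alerts), len(audit_log))
```

Publishing once and letting every subscriber decode the same payload is what allows a new consumer (for example reporting) to be added without touching the producer.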
  16. Technologies Evaluated
      • Data processing: Apache Storm and Trident, Apache Spark and Spark Streaming, Samza, Summingbird, Scalding, Apache Falcon, Azkaban
      • Coordination and configuration management: Apache ZooKeeper, Redis, Apache Curator
      • Event queue: Apache Kafka
      • Scheduling: Chronos, Apache Mesos, Marathon, Apache Aurora
      • Database synchronization: Liquibase, Flyway DB
      • Data representations: Apache Thrift, protobuf, Avro
      • Deployments: Ansible
      • File management: Apache HDFS
      • Monitoring and alerting: Graphite, StatsD
      • Database: PostgreSQL, Apache Spark
      • Resource isolation: LXC, Docker
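  17. Data Processing Technology Evaluation

      | Criteria | Storm + Trident | Spark + Streaming | Samza | Summingbird | Scalding | Falcon | Chronos | Aurora | Azkaban |
      |---|---|---|---|---|---|---|---|---|---|
      | DAG Support | Y | DAGScheduler | Y | Y | Y | Y | Y | N | Y |
      | DAG Nodes Resiliency | Y | Y | Y | Y | Y | Y | Y | N | Y |
      | Event Driven | Y | Y | Y | Y | N | N | N | N | N |
      | DAG Extension | Y | Y | Y | Y | Y | Y | Y | Y | Y |
      | Inflight and end state metrics | Y | Y | Y | Y | Y | Y | Y | Y | Y |
      | Hardware Agnostic | Y | Y | Y | Y | Y | Y | Y | Y | Y |

      Timed Execution: Y Y Y Y Y Y Y Y (only eight values survive for the nine columns, so their column alignment is unclear)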
  19. High Level Architecture
  21. Managing Hardware
      • Bare metal boxes
      • Partitioned using LXC containers
      • Use of Mesos to do the resource allocations as needed for jobs
  23. Deployments
      • Ansible
          ◦ Repeatable deployments
          ◦ Password management
          ◦ Inventory management (nodes, dev/staging/production)
  25. Adapters – High Level
  26. Data Adapters
      • Syncher is for DB structural changes
          ◦ Syncher creates a database schema from the source information
          ◦ Runs a generic database diff and applies the resulting changes to the target database
      • Seeder is for data synchronization
          ◦ Uses the database schema created by Syncher
      • Seeders get jobs from
          ◦ Syncher or
          ◦ Timed scheduler
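
The "generic database diff" step of the Syncher can be illustrated with a small sketch. This is not the presenter's implementation: the `Schema` shape, the `diff_schema` function, and the table names are hypothetical, and the sketch only covers additions and type changes (dropped tables and columns are ignored). The `ALTER COLUMN ... TYPE` statement follows PostgreSQL syntax, since the slides list PostgreSQL as the database.

```python
from typing import Dict, List

# Hypothetical schema shape: table name -> {column name: SQL type}
Schema = Dict[str, Dict[str, str]]


def diff_schema(source: Schema, target: Schema) -> List[str]:
    """Return DDL statements that would bring `target` in line with `source`."""
    statements: List[str] = []
    for table in sorted(source):
        columns = source[table]
        if table not in target:
            # Whole table is missing in the target: create it.
            cols = ", ".join(f"{name} {sqltype}" for name, sqltype in columns.items())
            statements.append(f"CREATE TABLE {table} ({cols})")
            continue
        for name, sqltype in columns.items():
            if name not in target[table]:
                statements.append(f"ALTER TABLE {table} ADD COLUMN {name} {sqltype}")
            elif target[table][name] != sqltype:
                statements.append(
                    f"ALTER TABLE {table} ALTER COLUMN {name} TYPE {sqltype}"
                )
    return statements


if __name__ == "__main__":
    source = {"ae": {"id": "integer", "term": "text"}, "visit": {"id": "integer"}}
    target = {"ae": {"id": "integer"}}
    for statement in diff_schema(source, target):
        print(statement)
```

Producing statements instead of executing them keeps the diff testable on its own and lets the caller apply all of them inside one transaction.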
  27. Data Adapters – Coordination and Configuration
      • Coordination and configuration through ZooKeeper
          ◦ Job configuration
          ◦ Connection information
          ◦ Distributed locking and counters
          ◦ Metric maintenance
          ◦ Last successful run
  28. Data Adapters - Implementation
  29. Failure Modes
      • Syncher
          ◦ Connectivity to source/sink systems fails: retry job N times and alert, if needed
          ◦ Schema changes to the database fail in the middle: transaction rollback
      • Seeder
          ◦ Connectivity to source/sink systems fails: retry job N times and alert, if needed
          ◦ Seeding fails midway: Storm retries tuples; failing tuples are moved to an error queue
          ◦ Table and row level failures: option to skip the tables/rows but send a report at the end
          ◦ Effect on “live” tables during data synchronizations: option to use transactions, or use temporary tables and swap with the original upon completion
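
Two of the failure-handling ideas above, "retry job N times and alert" and "failing tuples are moved to an error queue", can be sketched in plain Python. In the talk these are handled by Storm and the adapter framework; the function names `run_with_retries` and `seed_rows`, the `alert` callback, and the list used as an error queue are all hypothetical stand-ins.

```python
import time
from typing import Callable, Iterable, List, Tuple, TypeVar

T = TypeVar("T")


def run_with_retries(job: Callable[[], T], attempts: int,
                     alert: Callable[[str], None], delay_s: float = 0.0) -> T:
    """Run `job` up to `attempts` times; alert and re-raise if every try fails."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return job()
        except Exception as exc:
            if attempt == attempts:
                alert(f"job failed after {attempts} attempts: {exc}")
                raise
            time.sleep(delay_s)  # fixed pause between tries
    raise AssertionError("unreachable")


def seed_rows(rows: Iterable[object], write: Callable[[object], None],
              attempts: int) -> Tuple[int, List[Tuple[object, str]]]:
    """Write rows one by one; rows that keep failing go to an error queue."""
    written = 0
    error_queue: List[Tuple[object, str]] = []
    for row in rows:
        try:
            run_with_retries(lambda: write(row), attempts, alert=lambda msg: None)
            written += 1
        except Exception as exc:
            # Skip the row but keep it, with the reason, for the end report.
            error_queue.append((row, str(exc)))
    return written, error_queue
```

Returning the error queue alongside the count matches the "skip the tables/rows but send a report at the end" option: the run completes, and the failures are still available for reporting.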
  30. What Have We Gained
      • Can bring in data from more data sources and more studies effectively
      • Run real time reports on studies and configure alerts (future)
      • Can configure refreshes as needed by each use case
      • Can throttle input and output sources at study/customer level
      • Ability to onboard new customers and deploy new studies with minimal human intervention
  31. What Have We Gained
      A generic framework which
      • eases integration with new data sources
          ◦ For each new source, implement a method to create a virtual schema and a method to get data for a given table
      • can scale and is fault tolerant
      • has generic monitoring and alerting
      • eases maintenance since it's mostly generic code
      • sends notifications of important events through messages
      • runs on any hardware
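
The two-method contract described above (create a virtual schema, get data for a given table) might look like the following sketch. The interface name `SourceAdapter`, its method names, the toy `InMemoryAdapter`, and the `count_rows` helper are assumptions made for illustration; the slides do not show the actual interface.

```python
from abc import ABC, abstractmethod
from typing import Dict, Iterator, List

Row = Dict[str, object]


class SourceAdapter(ABC):
    """Hypothetical contract every new data source would implement."""

    @abstractmethod
    def virtual_schema(self) -> Dict[str, Dict[str, str]]:
        """Describe the source as table name -> {column name: type name}."""

    @abstractmethod
    def read_table(self, table: str) -> Iterator[Row]:
        """Yield the rows of one table."""


class InMemoryAdapter(SourceAdapter):
    """Toy source backed by Python lists, used only to exercise the contract."""

    def __init__(self, tables: Dict[str, List[Row]]) -> None:
        self._tables = tables

    def virtual_schema(self) -> Dict[str, Dict[str, str]]:
        # Infer column types from the first row; empty tables have no columns.
        return {
            name: ({col: type(val).__name__ for col, val in rows[0].items()}
                   if rows else {})
            for name, rows in self._tables.items()
        }

    def read_table(self, table: str) -> Iterator[Row]:
        yield from self._tables[table]


def count_rows(adapter: SourceAdapter) -> Dict[str, int]:
    """Generic code path: works unchanged for any adapter."""
    return {
        table: sum(1 for _ in adapter.read_table(table))
        for table in adapter.virtual_schema()
    }
```

Because `count_rows` depends only on the abstract interface, the same generic code can serve every source, which is the maintenance benefit the slide points to.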
  33. Distributed File System - Requirements
      • Accessibility
          ◦ Customers must be able to drop files securely (SFTP like functionality)
          ◦ Ability to access resources through URLs
      • Data storage
      • Scalability and Redundancy
          ◦ Scale-out by adding nodes
          ◦ Resilience against loss of nodes, data centers and replication
      • Miscellaneous
          ◦ Access control over read/write
          ◦ Performance/usage/resource utilization monitoring
  34. HDFS with High Availability Mode
      • Two name nodes running in HA mode, co-located with two journal nodes
      • Third journal node on a separate node
      • Data nodes on all bare metal nodes
      • Mounting HDFS with FUSE and enabling SFTP through OS level features
      • Automatic failover through DNS and HAProxy
  36. Challenges
      • Regulatory requirements
          ◦ Data encryption requirements for clinical data
          ◦ Audit trails
      • Data quality
      • Source system constraints
      • Coordination between Synchers and Seeders
          ◦ Distributed locks and counters
      • Automatic failover when a name node fails in HDFS
          ◦ HDFS HA mode stores the active name node in ZK as a Java serialized object, yikes!
  38. Future Work
      • Time travel
          ◦ Ability to go back in time and run reports at any given point of time
          ◦ Trail of data
      • Containerization
      • In-memory query execution with Apache Spark
  39. Team
  40. Thank you! Questions?
