Redefining ETL Pipelines with Apache Technologies to Accelerate Decision Making for Clinical Trials
Eran Withana
www.comprehend.com
Clinical Trials – Lay of the land
Business and Technical Requirements
Technology Evaluation
High Level Architecture
Implementation
Managing Hardware
Deployments
Data Adapters: Implementation and Failure Modes
Distributed File System
Challenges
Future Work
Overview
Open Source
Member of the ASF; PMC member and committer
Apache Axis2, Web Services, Synapse, Airavata
Education
PhD in Computer Science from Indiana
University
Software engineer at Comprehend Systems
About me …
Clinical Trials – Lay of the land
Business and Technical Requirements
Technology Evaluation
High Level Architecture
Implementation
Managing Hardware
Deployments
Data Adapters: Implementation and Failure Modes
Distributed File System
Challenges
Future Work
Overview
Clinical Trials – Lay of the land
[Chart] Number of drugs in development worldwide (source: CenterWatch Drugs in Clinical Trial Database, 2014; http://www.phrma.org/innovation/clinical-trials)
Clinical Trials – Lay of the Land
Multiple Stakeholders
• Study Managers
• Program Managers
• Monitors
• Data Managers
• Bio-statisticians
• Executives
• Medical Affairs
• Regulatory
• Vendors
• CROs
• CRAs
[Diagram] Three layers: Sponsor, Contract Research Organization (CRO), and Sites and Investigators. Data from sites, labs, and patients flows up through Safety, EDC, PV data, and Excel sources; the resulting reports and data are latent and fragmented.
For decades, clinical development
was primarily paper-based.
Various Software and Practices Used in Each Layer
[Diagram] Example technologies (e.g., Medidata) and the CROs and SIs using them at each layer.
Clinical Trials with Centralized Monitoring
[Diagram] Sites, labs, and patients feed Safety, EDC, PV data, and Excel sources into a Clinical Analytics & Collaboration layer used by Clinical Operations, giving a consolidated, real-time, self-service, mobile view of the data.
Providing up-to-date answers
[Diagram] Source systems (EDC, CTMS, Safety, ePro, and others) feed web, ad-hoc, mobile, and collaboration interfaces used by executives, medical review, CRAs, data management, and clinical operations.
Clinical Trials – Lay of the land
Business and Technical Requirements
Technology Evaluation
High Level Architecture
Implementation
Managing Hardware
Deployments
Data Adapters: Implementation and Failure Modes
Distributed File System
Challenges
Future Work
Overview
FDA, HIPAA Compliance
Metadata/Database structure synchronization
Less frequent (once a day)
Data Synchronization
More frequent (multiple times a day)
Ability to plug in various data sources
RAVE, MERGE, BioClinica, file imports, DB-to-DB syncs
Real-time event propagation
Adverse events (AEs): the need for early identification
Business Requirements
Hardware-agnostic, for resiliency and better utilization
Repeatable deployments
Real-time processing and real-time events
Fault tolerance
In-flight and end-state metrics for alerting and monitoring
Flexible and pluggable adapter architecture
Time travel
Audit trails
Report generation
Technical Requirements
Events all the way
Shared event bus for multiple consumers
Use of language-agnostic data representations (via protobuf; see the sketch after this slide)
Automatic datacenter resource management (Mesos/Marathon/Docker)
Core Design Principles
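A minimal sketch (not from the deck) of the first two principles: an event serialized with a shared protobuf definition and published to a common Kafka topic that multiple consumers read from. The SyncEvent message type and its fields, the topic name, and the broker address are hypothetical placeholders.

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class EventPublisher {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka-1:9092");  // placeholder broker list
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.ByteArraySerializer");

        // SyncEvent stands in for a hypothetical class generated from a .proto file
        // that every producer and consumer shares, keeping the wire format language agnostic.
        byte[] payload = SyncEvent.newBuilder()
                .setStudyId("study-123")
                .setSource("EDC")
                .setCompletedAtMillis(System.currentTimeMillis())
                .build()
                .toByteArray();

        // Publish to the shared event bus; any consumer with the .proto definition can decode it.
        try (KafkaProducer<String, byte[]> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("sync-events", "study-123", payload));
        }
    }
}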
Clinical Trials – Lay of the land
Business and Technical Requirements
Technology Evaluation
High Level Architecture
Implementation
Managing Hardware
Deployments
Data Adapters: Implementation and Failure Modes
Distributed File System
Challenges
Future Work
Overview
• Data processing
 Apache Storm and Trident, Apache
Spark and Spark Streaming,
Samza, Summingbird, Scalding,
Apache Falcon, Azkaban
• Coordination and Configuration
Management
 Apache Zookeeper, Redis, Apache
Curator
• Event Queue
 Apache Kafka
• Scheduling
 Chronos, Apache Mesos, Marathon,
Apache Aurora
• Database Synchronization
 Liquibase, Flyway DB
• Data Representations
 Apache Thrift, protobuf, Avro
• Deployments
 Ansible
• File Management
 Apache HDFS
• Monitoring and alerting
 Graphite, StatsD
• Database
 PostgreSQL, Apache Spark
• Resource Isolation
 LXC, Docker
Technologies Evaluated
Data Processing Technology Evaluation
Criteria                       | Storm + Trident | Spark + Streaming | Samza | Summingbird | Scalding | Falcon | Chronos | Aurora | Azkaban
DAG Support                    | Y               | DAGScheduler      | Y     | Y           | Y        | Y      | Y       | N      | Y
DAG Nodes Resiliency           | Y               | Y                 | Y     | Y           | Y        | Y      | Y       | N      | Y
Event Driven                   | Y               | Y                 | Y     | Y           | N        | N      | N       | N      | N
Timed Execution                | Y               | Y                 | Y     | Y           | Y        | Y      | Y       | Y      |
DAG Extension                  | Y               | Y                 | Y     | Y           | Y        | Y      | Y       | Y      | Y
Inflight and end-state metrics | Y               | Y                 | Y     | Y           | Y        | Y      | Y       | Y      | Y
Hardware Agnostic              | Y               | Y                 | Y     | Y           | Y        | Y      | Y       | Y      | Y
Clinical Trials – Lay of the land
Business and Technical Requirements
Technology Evaluation
High Level Architecture
Implementation
Managing Hardware
Deployments
Data Adapters: Implementation and Failure Modes
Distributed File System
Challenges
Future Work
Overview
High Level Architecture
Clinical Trials – Lay of the land
Business and Technical Requirements
Technology Evaluation
High Level Architecture
Implementation
Managing Hardware
Deployments
Data Adapters: Implementation and Failure Modes
Distributed File System
Challenges
Future Work
Overview
Bare Metal Boxes
Partitioned using LXC containers
Use of Mesos to allocate resources to jobs as needed (a Marathon submission is sketched after this slide)
Managing Hardware
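As a rough illustration of how a job can be handed to the Mesos-managed pool, the sketch below posts an app definition to Marathon's REST API (Marathon is the scheduler named on the Core Design Principles slide). The Marathon URL, app id, command, and resource numbers are placeholders, not values from the deck.

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class SubmitToMarathon {
    public static void main(String[] args) throws Exception {
        // Placeholder app definition: Marathon asks Mesos for the CPU/memory and
        // keeps the requested number of instances running on the bare-metal pool.
        String appDefinition = "{"
                + "\"id\": \"/etl/seeder\","
                + "\"cmd\": \"java -jar seeder.jar\","
                + "\"cpus\": 1.0,"
                + "\"mem\": 2048,"
                + "\"instances\": 2"
                + "}";

        HttpURLConnection conn = (HttpURLConnection)
                new URL("http://marathon.example.com:8080/v2/apps").openConnection();
        conn.setRequestMethod("POST");
        conn.setRequestProperty("Content-Type", "application/json");
        conn.setDoOutput(true);
        try (OutputStream out = conn.getOutputStream()) {
            out.write(appDefinition.getBytes(StandardCharsets.UTF_8));
        }
        System.out.println("Marathon responded: " + conn.getResponseCode());
    }
}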
Clinical Trials – Lay of the land
Business and Technical Requirements
Technology Evaluation
High Level Architecture
Implementation
Managing Hardware
Deployments
Data Adapters: Implementation and Failure Modes
Distributed File System
Challenges
Future Work
Overview
Ansible
Repeatable deployments
Password management
Inventory management
(nodes, dev/staging/production)
Deployments
Clinical Trials – Lay of the land
Business and Technical Requirements
Technology Evaluation
High Level Architecture
Implementation
Managing Hardware
Deployments
Data Adapters: Implementation and Failure Modes
Distributed File System
Challenges
Future Work
Overview
Adapters – High Level
• Syncher is for DB structural changes
 Syncher creates a database schema from the source information
 Runs a generic database diff and applies the changes to the target database
• Seeder is for data synchronization
 Uses the database schema created by Syncher
• Seeders get jobs from
 Syncher or
 a timed scheduler
• Both build on a pluggable source adapter (sketched after this slide)
Data Adapters
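A hedged sketch of what the pluggable source adapter might look like: one method exposes a virtual schema (consumed by the Syncher) and another streams a table's rows (consumed by the Seeder). Every type and method name here is illustrative, not taken from the deck.

import java.util.Iterator;
import java.util.List;
import java.util.Map;

public interface SourceAdapter {

    // Describes the source (e.g., RAVE, BioClinica, a file drop) as tables and columns,
    // so the Syncher can diff it against the target database.
    VirtualSchema describeSchema();

    // Streams the rows of one table so the Seeder can load them into the target database.
    Iterator<Map<String, Object>> readTable(String tableName);

    // Minimal supporting types for the sketch.
    record VirtualSchema(List<TableDef> tables) {}
    record TableDef(String name, List<ColumnDef> columns) {}
    record ColumnDef(String name, String sqlType, boolean nullable) {}
}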
• Coordination and configuration through ZooKeeper
 Job configuration
 Connection information
 Distributed locking and counters (see the Curator sketch after this slide)
 Metric maintenance: last successful run
Data Adapters – Coordination and Configuration
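A hedged sketch of the ZooKeeper usage above via Apache Curator: a per-study distributed lock around a seeding run, plus a znode recording the last successful run. The connect string and paths are placeholders, not values from the deck.

import java.nio.charset.StandardCharsets;
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.recipes.locks.InterProcessMutex;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class SeederCoordination {
    public static void main(String[] args) throws Exception {
        CuratorFramework zk = CuratorFrameworkFactory.newClient(
                "zk-1:2181,zk-2:2181,zk-3:2181",            // placeholder ensemble
                new ExponentialBackoffRetry(1000, 3));
        zk.start();

        // Only one Seeder at a time may load a given study.
        InterProcessMutex lock = new InterProcessMutex(zk, "/locks/seeder/study-123");
        lock.acquire();
        try {
            runSeedingJob();                                 // application-specific work

            // Record the last successful run so monitoring can alert on staleness.
            byte[] ts = Long.toString(System.currentTimeMillis())
                    .getBytes(StandardCharsets.UTF_8);
            String path = "/metrics/seeder/study-123/last-success";
            if (zk.checkExists().forPath(path) == null) {
                zk.create().creatingParentsIfNeeded().forPath(path, ts);
            } else {
                zk.setData().forPath(path, ts);
            }
        } finally {
            lock.release();
            zk.close();
        }
    }

    private static void runSeedingJob() { /* data load elided */ }
}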
Data Adapters - Implementation
 Syncher
 Connectivity to source/sink systems fails
• Retry the job N times and alert, if needed
 Schema changes to the database fail midway
• Transaction rollback
 Seeder
 Connectivity to source/sink systems fails
• Retry the job N times and alert, if needed
 If seeding fails midway
• Storm retries tuples
• Failing tuples are moved to an error queue
 Table- and row-level failures
• Option to skip the tables/rows but send a report at the end
 Effect on “live” tables during data synchronization
• Option to use transactions, or
• Use temporary tables and swap with the original upon completion (sketched after this slide)
Failure Modes
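A minimal sketch of the "temporary table, then swap" option for protecting live tables, using PostgreSQL via JDBC. The table names, connection URL, and credentials are placeholders, not taken from the deck.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class SwapTableLoad {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://db-host/clinical", "etl", "secret")) {

            try (Statement st = conn.createStatement()) {
                // 1. Build the replacement table off to the side; readers keep using
                //    the current "subjects" table while this runs.
                st.execute("DROP TABLE IF EXISTS subjects_staging");
                st.execute("CREATE TABLE subjects_staging (LIKE subjects INCLUDING ALL)");
                // ... bulk load subjects_staging from the source system here ...
            }

            // 2. Swap atomically: both renames happen in one transaction, so readers
            //    see either the old data or the new data, never a half-loaded table.
            conn.setAutoCommit(false);
            try (Statement st = conn.createStatement()) {
                st.execute("ALTER TABLE subjects RENAME TO subjects_old");
                st.execute("ALTER TABLE subjects_staging RENAME TO subjects");
                st.execute("DROP TABLE subjects_old");
                conn.commit();
            } catch (Exception e) {
                conn.rollback();
                throw e;
            }
        }
    }
}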
Can bring in data from more data sources and
more studies effectively
Run real-time reports on studies and configure alerts (future)
Can configure refreshes as needed by each
use case
Can throttle input and output sources at
study/customer level
Ability to onboard new customers and deploy
new studies with minimal human intervention
What Have We Gained
A generic framework which
eases integration with new data sources
• For each new source, implement a method to create a
virtual schema and to get data for a given table
scales and is fault tolerant
has generic monitoring and alerting
eases maintenance, since it is mostly generic code
notifies of important events through messages
runs on any hardware
What Have We Gained
Clinical Trials – Lay of the land
Business and Technical Requirements
Technology Evaluation
High Level Architecture
Implementation
Managing Hardware
Deployments
Data Adapters: Implementation and Failure Modes
Distributed File System
Challenges
Future Work
Overview
Accessibility
Customers must be able to drop files securely (SFTP-like functionality)
Ability to access resources through URLs
Data storage
Scalability and Redundancy
Scale-out by adding nodes
Resilience against loss of nodes or data centers, via replication
Miscellaneous
Access control over read/write
Performance/usage/resource utilization monitoring
Distributed File System - Requirements
Two name nodes running
in HA mode, co-located
with two journal nodes
Third journal node on a
separate node
Data nodes on all bare
metal nodes
Mounting HDFS with
FUSE and enabling SFTP
through OS-level features
Automatic failover through
DNS and HAProxy (a client-side sketch follows)
HDFS with High Availability Mode
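A small client-side sketch of what the HA setup buys: code addresses the cluster through a logical nameservice URI instead of a specific NameNode, so failover stays transparent to readers and writers. The nameservice name, paths, and file contents are placeholders; the actual HA wiring lives in hdfs-site.xml.

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsDropBox {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
        // "clinicalfs" is a placeholder logical nameservice, not a hostname.
        FileSystem fs = FileSystem.get(URI.create("hdfs://clinicalfs"), conf);

        // Write a customer-dropped file under a per-study directory.
        Path target = new Path("/dropbox/study-123/labs.csv");
        try (FSDataOutputStream out = fs.create(target)) {
            out.writeBytes("subject_id,visit,value\n");
        }
        System.out.println("Exists: " + fs.exists(target));
        fs.close();
    }
}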
Clinical Trials – Lay of the land
Business and Technical Requirements
Technology Evaluation
High Level Architecture
Implementation
Managing Hardware
Deployments
Data Adapters: Implementation and Failure Modes
Distributed File System
Challenges
Future Work
Overview
Regulatory requirements
Data encryption requirements for clinical data
Audit trails
Data quality
Source system constraints
Coordination between Synchers and Seeders
Distributed locks and counters
Automatic failover when a name node fails in HDFS
HDFS HA mode stores the active name node in ZooKeeper as a Java-serialized object, yikes!
Challenges
Clinical Trials – Lay of the land
Business and Technical Requirements
Technology Evaluation
High Level Architecture
Implementation
Managing Hardware
Deployments
Data Adapters: Implementation and Failure Modes
Distributed File System
Challenges
Future Work
Overview
Time travel
Ability to go back in time and run reports as of any given point in time
Trail of data
Containerization
In-memory query execution with Apache
Spark
Future Work
Team
Thank You !!
Questions …
