Redefining ETL Pipelines with Apache Technologies to Accelerate Decision Making for Clinical Trials
Eran Withana
www.comprehend.com
Clinical Trials – Lay of the land
Business and Technical Requirements
Technology Evaluation
High Level Architecture
Implementation
Managing Hardware
Deployments
Data Adapters: Implementation and Failure Modes
Distributed File System
Challenges
Future Work
Overview
Open Source
Member of the ASF; PMC member and committer
Apache Axis2, Web Services, Synapse, Airavata
Education
PhD in Computer Science from Indiana
University
Software engineer at Comprehend Systems
About me …
Clinical Trials – Lay of the land
Business and Technical Requirements
Technology Evaluation
High Level Architecture
Implementation
Managing Hardware
Deployments
Data Adapters: Implementation and Failure Modes
Distributed File System
Challenges
Future Work
Overview
Clinical Trials – Lay of the land
[Chart] Number of drugs in development worldwide (source: CenterWatch Drugs in Clinical Trial Database, 2014; http://www.phrma.org/innovation/clinical-trials)
Clinical Trials – Lay of the Land
Multiple Stakeholders
• Study Managers
• Program Managers
• Monitors
• Data Managers
• Bio-statisticians
• Executives
• Medical Affairs
• Regulatory
• Vendors
• CROs
• CRAs
[Diagram] Three layers: Sponsor, Contract Research Organization (CRO), and Sites and Investigators. Data from sites, labs, and patients flows up through Safety, EDC, PV data, and Excel sources; the resulting reports and data are latent and fragmented.
For decades, clinical development
was primarily paper-based.
Various Software and Practices Used in Each Layer
[Diagram] Example technologies (e.g., Medidata) and the CROs and SIs using them at each layer.
Clinical Trials with Centralized Monitoring
[Diagram] Sites, labs, and patients feed Safety, EDC, PV data, and Excel sources into a Clinical Analytics & Collaboration layer used by Clinical Operations, giving a consolidated, real-time, self-service, mobile view of the data.
Providing up-to-date answers
[Diagram] Source systems (EDC, CTMS, Safety, ePro, and others) feed web, ad-hoc, mobile, and collaboration interfaces used by executives, medical review, CRAs, data management, and clinical operations.
Clinical Trials – Lay of the land
Business and Technical Requirements
Technology Evaluation
High Level Architecture
Implementation
Managing Hardware
Deployments
Data Adapters: Implementation and Failure Modes
Distributed File System
Challenges
Future Work
Overview
FDA, HIPAA Compliance
Metadata/Database structure synchronization
Less frequent (once a day)
Data Synchronization
More frequent (multiple times a day)
Ability to plug in various data sources
RAVE, MERGE, BioClinica, file imports, DB-to-DB syncs
Real-time event propagation
Adverse events (AEs): the need for early identification
Business Requirements
Hardware-agnostic, for resiliency and better utilization
Repeatable deployments
Real-time processing and real-time events
Fault tolerance
In-flight and end-state metrics for alerting and monitoring
Flexible and pluggable adapter architecture
Time travel
Audit trails
Report generation
Technical Requirements
Events all the way
Shared event bus for multiple consumers
Use of language-agnostic data representations (via protobuf; see the sketch after this slide)
Automatic datacenter resource management (Mesos/Marathon/Docker)
Core Design Principles
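A minimal sketch (not from the deck) of the first two principles: an event serialized with a shared protobuf definition and published to a common Kafka topic that multiple consumers read from. The SyncEvent message type and its fields, the topic name, and the broker address are hypothetical placeholders.

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class EventPublisher {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka-1:9092");  // placeholder broker list
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.ByteArraySerializer");

        // SyncEvent stands in for a hypothetical class generated from a .proto file
        // that every producer and consumer shares, keeping the wire format language agnostic.
        byte[] payload = SyncEvent.newBuilder()
                .setStudyId("study-123")
                .setSource("EDC")
                .setCompletedAtMillis(System.currentTimeMillis())
                .build()
                .toByteArray();

        // Publish to the shared event bus; any consumer with the .proto definition can decode it.
        try (KafkaProducer<String, byte[]> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("sync-events", "study-123", payload));
        }
    }
}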
Clinical Trials – Lay of the land
Business and Technical Requirements
Technology Evaluation
High Level Architecture
Implementation
Managing Hardware
Deployments
Data Adapters: Implementation and Failure Modes
Distributed File System
Challenges
Future Work
Overview
• Data processing
 Apache Storm and Trident, Apache
Spark and Spark Streaming,
Samza, Summingbird, Scalding,
Apache Falcon, Azkaban
• Coordination and Configuration
Management
 Apache Zookeeper, Redis, Apache
Curator
• Event Queue
 Apache Kafka
• Scheduling
 Chronos, Apache Mesos, Marathon,
Apache Aurora
• Database Synchronization
 Liquibase, Flyway DB
• Data Representations
 Apache Thrift, protobuf, Avro
• Deployments
 Ansible
• File Management
 Apache HDFS
• Monitoring and alerting
 Graphite, StatsD
• Database
 PostgreSQL, Apache Spark
• Resource Isolation
 LXC, Docker
Technologies Evaluated
Data Processing Technology Evaluation
Criteria                       | Storm + Trident | Spark + Streaming | Samza | Summingbird | Scalding | Falcon | Chronos | Aurora | Azkaban
DAG Support                    | Y               | DAGScheduler      | Y     | Y           | Y        | Y      | Y       | N      | Y
DAG Nodes Resiliency           | Y               | Y                 | Y     | Y           | Y        | Y      | Y       | N      | Y
Event Driven                   | Y               | Y                 | Y     | Y           | N        | N      | N       | N      | N
Timed Execution                | Y               | Y                 | Y     | Y           | Y        | Y      | Y       | Y      |
DAG Extension                  | Y               | Y                 | Y     | Y           | Y        | Y      | Y       | Y      | Y
Inflight and end-state metrics | Y               | Y                 | Y     | Y           | Y        | Y      | Y       | Y      | Y
Hardware Agnostic              | Y               | Y                 | Y     | Y           | Y        | Y      | Y       | Y      | Y
Clinical Trials – Lay of the land
Business and Technical Requirements
Technology Evaluation
High Level Architecture
Implementation
Managing Hardware
Deployments
Data Adapters: Implementation and Failure Modes
Distributed File System
Challenges
Future Work
Overview
High Level Architecture
Clinical Trials – Lay of the land
Business and Technical Requirements
Technology Evaluation
High Level Architecture
Implementation
Managing Hardware
Deployments
Data Adapters: Implementation and Failure Modes
Distributed File System
Challenges
Future Work
Overview
Bare Metal Boxes
Partitioned using LXC containers
Use of Mesos to allocate resources to jobs as needed (a Marathon submission is sketched after this slide)
Managing Hardware
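As a rough illustration of how a job can be handed to the Mesos-managed pool, the sketch below posts an app definition to Marathon's REST API (Marathon is the scheduler named on the Core Design Principles slide). The Marathon URL, app id, command, and resource numbers are placeholders, not values from the deck.

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class SubmitToMarathon {
    public static void main(String[] args) throws Exception {
        // Placeholder app definition: Marathon asks Mesos for the CPU/memory and
        // keeps the requested number of instances running on the bare-metal pool.
        String appDefinition = "{"
                + "\"id\": \"/etl/seeder\","
                + "\"cmd\": \"java -jar seeder.jar\","
                + "\"cpus\": 1.0,"
                + "\"mem\": 2048,"
                + "\"instances\": 2"
                + "}";

        HttpURLConnection conn = (HttpURLConnection)
                new URL("http://marathon.example.com:8080/v2/apps").openConnection();
        conn.setRequestMethod("POST");
        conn.setRequestProperty("Content-Type", "application/json");
        conn.setDoOutput(true);
        try (OutputStream out = conn.getOutputStream()) {
            out.write(appDefinition.getBytes(StandardCharsets.UTF_8));
        }
        System.out.println("Marathon responded: " + conn.getResponseCode());
    }
}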
Clinical Trials – Lay of the land
Business and Technical Requirements
Technology Evaluation
High Level Architecture
Implementation
Managing Hardware
Deployments
Data Adapters: Implementation and Failure Modes
Distributed File System
Challenges
Future Work
Overview
Ansible
Repeatable deployments
Password management
Inventory management
(nodes, dev/staging/production)
Deployments
Clinical Trials – Lay of the land
Business and Technical Requirements
Technology Evaluation
High Level Architecture
Implementation
Managing Hardware
Deployments
Data Adapters: Implementation and Failure Modes
Distributed File System
Challenges
Future Work
Overview
Adapters – High Level
• Syncher is for DB structural changes
 Syncher creates a database schema from the source information
 Runs a generic database diff and applies the changes to the target database
• Seeder is for data synchronization
 Uses the database schema created by Syncher
• Seeders get jobs from
 Syncher or
 a timed scheduler
• Both build on a pluggable source adapter (sketched after this slide)
Data Adapters
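A hedged sketch of what the pluggable source adapter might look like: one method exposes a virtual schema (consumed by the Syncher) and another streams a table's rows (consumed by the Seeder). Every type and method name here is illustrative, not taken from the deck.

import java.util.Iterator;
import java.util.List;
import java.util.Map;

public interface SourceAdapter {

    // Describes the source (e.g., RAVE, BioClinica, a file drop) as tables and columns,
    // so the Syncher can diff it against the target database.
    VirtualSchema describeSchema();

    // Streams the rows of one table so the Seeder can load them into the target database.
    Iterator<Map<String, Object>> readTable(String tableName);

    // Minimal supporting types for the sketch.
    record VirtualSchema(List<TableDef> tables) {}
    record TableDef(String name, List<ColumnDef> columns) {}
    record ColumnDef(String name, String sqlType, boolean nullable) {}
}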
• Coordination and configuration through ZooKeeper
 Job configuration
 Connection information
 Distributed locking and counters (see the Curator sketch after this slide)
 Metric maintenance: last successful run
Data Adapters – Coordination and Configuration
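A hedged sketch of the ZooKeeper usage above via Apache Curator: a per-study distributed lock around a seeding run, plus a znode recording the last successful run. The connect string and paths are placeholders, not values from the deck.

import java.nio.charset.StandardCharsets;
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.recipes.locks.InterProcessMutex;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class SeederCoordination {
    public static void main(String[] args) throws Exception {
        CuratorFramework zk = CuratorFrameworkFactory.newClient(
                "zk-1:2181,zk-2:2181,zk-3:2181",            // placeholder ensemble
                new ExponentialBackoffRetry(1000, 3));
        zk.start();

        // Only one Seeder at a time may load a given study.
        InterProcessMutex lock = new InterProcessMutex(zk, "/locks/seeder/study-123");
        lock.acquire();
        try {
            runSeedingJob();                                 // application-specific work

            // Record the last successful run so monitoring can alert on staleness.
            byte[] ts = Long.toString(System.currentTimeMillis())
                    .getBytes(StandardCharsets.UTF_8);
            String path = "/metrics/seeder/study-123/last-success";
            if (zk.checkExists().forPath(path) == null) {
                zk.create().creatingParentsIfNeeded().forPath(path, ts);
            } else {
                zk.setData().forPath(path, ts);
            }
        } finally {
            lock.release();
            zk.close();
        }
    }

    private static void runSeedingJob() { /* data load elided */ }
}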
Data Adapters - Implementation
 Syncher
 Connectivity to source/sink systems fails
• Retry the job N times and alert, if needed
 Schema changes to the database fail midway
• Transaction rollback
 Seeder
 Connectivity to source/sink systems fails
• Retry the job N times and alert, if needed
 If seeding fails midway
• Storm retries tuples
• Failing tuples are moved to an error queue
 Table- and row-level failures
• Option to skip the tables/rows but send a report at the end
 Effect on “live” tables during data synchronization
• Option to use transactions, or
• Use temporary tables and swap with the original upon completion (sketched after this slide)
Failure Modes
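A minimal sketch of the "temporary table, then swap" option for protecting live tables, using PostgreSQL via JDBC. The table names, connection URL, and credentials are placeholders, not taken from the deck.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class SwapTableLoad {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://db-host/clinical", "etl", "secret")) {

            try (Statement st = conn.createStatement()) {
                // 1. Build the replacement table off to the side; readers keep using
                //    the current "subjects" table while this runs.
                st.execute("DROP TABLE IF EXISTS subjects_staging");
                st.execute("CREATE TABLE subjects_staging (LIKE subjects INCLUDING ALL)");
                // ... bulk load subjects_staging from the source system here ...
            }

            // 2. Swap atomically: both renames happen in one transaction, so readers
            //    see either the old data or the new data, never a half-loaded table.
            conn.setAutoCommit(false);
            try (Statement st = conn.createStatement()) {
                st.execute("ALTER TABLE subjects RENAME TO subjects_old");
                st.execute("ALTER TABLE subjects_staging RENAME TO subjects");
                st.execute("DROP TABLE subjects_old");
                conn.commit();
            } catch (Exception e) {
                conn.rollback();
                throw e;
            }
        }
    }
}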
Can bring in data from more data sources and
more studies effectively
Run real-time reports on studies and configure alerts (future)
Can configure refreshes as needed by each
use case
Can throttle input and output sources at
study/customer level
Ability to onboard new customers and deploy
new studies with minimal human intervention
What Have We Gained
A generic framework which
eases integration with new data sources
• For each new source, implement a method to create a
virtual schema and to get data for a given table
scales and is fault tolerant
has generic monitoring and alerting
eases maintenance, since it is mostly generic code
notifies of important events through messages
runs on any hardware
What Have We Gained
Clinical Trials – Lay of the land
Business and Technical Requirements
Technology Evaluation
High Level Architecture
Implementation
Managing Hardware
Deployments
Data Adapters: Implementation and Failure Modes
Distributed File System
Challenges
Future Work
Overview
Accessibility
Customers must be able to drop files securely (SFTP-like functionality)
Ability to access resources through URLs
Data storage
Scalability and Redundancy
Scale-out by adding nodes
Resilience against loss of nodes or data centers, via replication
Miscellaneous
Access control over read/write
Performance/usage/resource utilization monitoring
Distributed File System - Requirements
Two name nodes running
in HA mode, co-located
with two journal nodes
Third journal node on a
separate node
Data nodes on all bare
metal nodes
Mounting HDFS with
FUSE and enabling SFTP
through OS-level features
Automatic failover through
DNS and HAProxy (a client-side sketch follows)
HDFS with High Availability Mode
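A small client-side sketch of what the HA setup buys: code addresses the cluster through a logical nameservice URI instead of a specific NameNode, so failover stays transparent to readers and writers. The nameservice name, paths, and file contents are placeholders; the actual HA wiring lives in hdfs-site.xml.

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsDropBox {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
        // "clinicalfs" is a placeholder logical nameservice, not a hostname.
        FileSystem fs = FileSystem.get(URI.create("hdfs://clinicalfs"), conf);

        // Write a customer-dropped file under a per-study directory.
        Path target = new Path("/dropbox/study-123/labs.csv");
        try (FSDataOutputStream out = fs.create(target)) {
            out.writeBytes("subject_id,visit,value\n");
        }
        System.out.println("Exists: " + fs.exists(target));
        fs.close();
    }
}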
Clinical Trials – Lay of the land
Business and Technical Requirements
Technology Evaluation
High Level Architecture
Implementation
Managing Hardware
Deployments
Data Adapters: Implementation and Failure Modes
Distributed File System
Challenges
Future Work
Overview
Regulatory requirements
Data encryption requirements for clinical data
Audit trails
Data quality
Source system constraints
Coordination between Synchers and Seeders
Distributed locks and counters
Automatic failover when a name node fails in HDFS
HDFS HA mode stores the active name node in ZooKeeper as a Java-serialized object, yikes!
Challenges
Clinical Trials – Lay of the land
Business and Technical Requirements
Technology Evaluation
High Level Architecture
Implementation
Managing Hardware
Deployments
Data Adapters: Implementation and Failure Modes
Distributed File System
Challenges
Future Work
Overview
Time travel
Ability to go back in time and run reports as of any given point in time
Trail of data
Containerization
In-memory query execution with Apache
Spark
Future Work
Team
Thank You !!
Questions …
