Pharmaceutical and medical device makers spend over $130bn each year collecting and analyzing new data, mostly through clinical trials. It costs over $1.8bn to bring a new drug to market, and over $4bn when factoring in the cost of failures. By understanding and analyzing this data more efficiently, new drugs can reach patients more quickly, more safely, and at lower cost.
In this presentation, Eran will discuss how ETL pipelines can be built with Apache and other open-source projects to improve clinical trial development. We will examine how the system is built, the challenges we faced, and how we were able to reduce cost, accelerate execution time, and improve results. We will also demonstrate how reliable resource allocation, scalable data-ingestion adapters, on-demand and fault-tolerant job deployments, and monitoring benefit clinical trial decision-making and execution.
Redefining ETL Pipelines with Apache Technologies to Accelerate Decision-Making and Execution for Clinical Trials
1. Redefining ETL Pipelines with Apache Technologies to Accelerate Decision-Making for Clinical Trials
Eran Withana
2. www.comprehend.com
Clinical Trials – Lay of the land
Business and Technical Requirements
Technology Evaluation
High Level Architecture
Implementation
Managing Hardware
Deployments
Data Adapters: Implementation and Failure Modes
Distributed File System
Challenges
Future Work
Overview
3. Open Source
Member, PMC member, and committer at the ASF
Apache Axis2, Web Services, Synapse, Airavata
Education
PhD in Computer Science from Indiana University
Software engineer at Comprehend Systems
About me …
4. Clinical Trials – Lay of the land
Business and Technical Requirements
Technology Evaluation
High Level Architecture
Implementation
Managing Hardware
Deployments
Data Adapters: Implementation and Failure Modes
Distributed File System
Challenges
Future Work
Overview
5. Clinical Trials – Lay of the Land
[Chart] Number of Drugs in Development Worldwide (Source: CenterWatch Drugs in Clinical Trial Database 2014; http://www.phrma.org/innovation/clinical-trials)
6. Clinical Trials – Lay of the Land
Multiple Stakeholders
• Study Managers
• Program Managers
• Monitors
• Data Managers
• Bio-statisticians
• Executives
• Medical Affairs
• Regulatory
• Vendors
• CROs
• CRAs
[Diagram] Data flow among the Sponsor, the Contract Research Organization (CRO), and Sites and Investigators: EDC, safety/PV data, labs, patients, Excel files, and reports; the data is latent and fragmented.
11. Clinical Trials – Lay of the land
Business and Technical Requirements
Technology Evaluation
High Level Architecture
Implementation
Managing Hardware
Deployments
Data Adapters: Implementation and Failure Modes
Distributed File System
Challenges
Future Work
Overview
12. FDA, HIPAA Compliance
Metadata/Database structure synchronization
Less frequent (once a day)
Data Synchronization
More frequent (multiple times a day)
Ability to plug in various data sources
RAVE, MERGE, BioClinica, file imports, DB-to-DB syncs
Real-time event propagation
Adverse events (AEs): the need for early identification
Business Requirements
13. Hardware agnostic for resiliency and better utilization
Repeatable deployments
Real-time processing and real-time events
Fault tolerance
In-flight and end-state metrics for alerting and monitoring
Flexible and pluggable adapter architecture
Time travel
Audit trails
Report generation
Technical Requirements
14. Events all the way
Shared event bus for multiple consumers
Use of language-agnostic data representations (via protobuf)
Automatic datacenter resource management (Mesos/Marathon/Docker)
Core Design Principles
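The "events all the way" principle can be sketched in miniature. The deck says protobuf-encoded messages travel over a shared bus with multiple consumers; the in-process bus and JSON serialization below are stand-ins to keep the sketch self-contained, and the topic and field names are made up.

```python
import json
from collections import defaultdict

# In-process stand-in for the shared event bus. The real system would put
# protobuf-encoded messages on an external bus; JSON keeps this sketch
# self-contained while still being a language-agnostic wire format.
class EventBus:
    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        self._subscribers[topic].append(handler)

    def publish(self, topic, event):
        # Encode once before fan-out, so consumers written in other
        # languages could decode the same bytes.
        payload = json.dumps(event).encode("utf-8")
        for handler in self._subscribers[topic]:
            handler(json.loads(payload.decode("utf-8")))

bus = EventBus()
seen = []
bus.subscribe("adverse_events", seen.append)      # e.g. early AE alerting
bus.subscribe("adverse_events", lambda e: None)   # second consumer, same bus
bus.publish("adverse_events", {"study": "S-001", "type": "AE", "severity": "mild"})
```

Decoupling producers from consumers this way is what lets adverse events propagate to alerting in real time without the adapters knowing who listens.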
15. Clinical Trials – Lay of the land
Business and Technical Requirements
Technology Evaluation
High Level Architecture
Implementation
Managing Hardware
Deployments
Data Adapters: Implementation and Failure Modes
Distributed File System
Challenges
Future Work
Overview
17. Data Processing Technology Evaluation

Criteria | Storm + Trident | Spark + Streaming | Samza | Summingbird | Scalding | Falcon | Chronos | Aurora | Azkaban
DAG support | Y | Y (DAGScheduler) | Y | Y | Y | Y | Y | N | Y
DAG node resiliency | Y | Y | Y | Y | Y | Y | Y | N | Y
Event driven | Y | Y | Y | Y | N | N | N | N | N
Timed execution | – | Y | Y | Y | Y | Y | Y | Y | Y
DAG extension | Y | Y | Y | Y | Y | Y | Y | Y | Y
In-flight and end-state metrics | Y | Y | Y | Y | Y | Y | Y | Y | Y
Hardware agnostic | Y | Y | Y | Y | Y | Y | Y | Y | Y
18. Clinical Trials – Lay of the land
Business and Technical Requirements
Technology Evaluation
High Level Architecture
Implementation
Managing Hardware
Deployments
Data Adapters: Implementation and Failure Modes
Distributed File System
Challenges
Future Work
Overview
20. Clinical Trials – Lay of the land
Business and Technical Requirements
Technology Evaluation
High Level Architecture
Implementation
Managing Hardware
Deployments
Data Adapters: Implementation and Failure Modes
Distributed File System
Challenges
Future Work
Overview
22. Clinical Trials – Lay of the land
Business and Technical Requirements
Technology Evaluation
High Level Architecture
Implementation
Managing Hardware
Deployments
Data Adapters: Implementation and Failure Modes
Distributed File System
Challenges
Future Work
Overview
24. Clinical Trials – Lay of the land
Business and Technical Requirements
Technology Evaluation
High Level Architecture
Implementation
Managing Hardware
Deployments
Data Adapters: Implementation and Failure Modes
Distributed File System
Challenges
Future Work
Overview
26. • Syncher is for DB structural changes
Syncher creates a database schema from the source information
Runs a generic database diff and applies the changes to the target database
• Seeder is for data synchronization
Uses the database schema created by Syncher
• Seeder gets jobs from the Syncher or from a timed scheduler
Data Adapters
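A minimal sketch of the "generic database diff" the Syncher runs, assuming a simple dict-of-dicts schema snapshot. The schema shape, table/column names, and type strings are hypothetical, and a real diff would also handle drops and type changes.

```python
# Compare a source schema snapshot against the target schema and emit the
# DDL needed to reconcile the target. {table: {column: sql_type}} is an
# illustrative representation, not the Syncher's real one.
def schema_diff(source, target):
    statements = []
    for table, cols in source.items():
        if table not in target:
            col_defs = ", ".join(f"{c} {t}" for c, t in cols.items())
            statements.append(f"CREATE TABLE {table} ({col_defs})")
            continue
        for col, sql_type in cols.items():
            if col not in target[table]:
                statements.append(f"ALTER TABLE {table} ADD COLUMN {col} {sql_type}")
    return statements

source = {"subjects": {"id": "INT", "arm": "TEXT"}, "visits": {"id": "INT"}}
target = {"subjects": {"id": "INT"}}
for ddl in schema_diff(source, target):
    print(ddl)
```

The Syncher would apply such statements to the target inside a single transaction, so a mid-way failure rolls back cleanly.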
27. • Coordination and Configuration through Zookeeper
Job configuration
Connection information
Distributed locking and counters
Metric maintenance
Last successful run
Data Adapters – Coordination and Configuration
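One way the Zookeeper layout might look. The deck says job config, connection information, locks/counters, and metrics such as the last successful run live in Zookeeper, but shows no paths; every path, helper, and field name below is hypothetical, and the kazoo calls in the comments only roughly indicate real client usage.

```python
import json

# Hypothetical znode layout for adapter coordination and configuration.
BASE = "/comprehend/adapters"

def job_config_path(study, job):
    return f"{BASE}/{study}/jobs/{job}/config"

def last_run_path(study, job):
    return f"{BASE}/{study}/jobs/{job}/metrics/last_successful_run"

def encode_config(cfg):
    # Zookeeper stores raw bytes; JSON keeps the payload language-agnostic.
    return json.dumps(cfg, sort_keys=True).encode("utf-8")

path = job_config_path("study-42", "seeder")
data = encode_config({"source": "rave", "interval_minutes": 60})
# With a real client (e.g. kazoo) this would be roughly:
#   zk.ensure_path(path); zk.set(path, data)
# and a distributed lock around one seeding run:
#   with zk.Lock(f"{BASE}/study-42/jobs/seeder/lock"):
#       run_job()
print(path)
```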
29. Syncher
Connectivity to source/sink systems fails
• Retry the job N times and alert, if needed
Schema changes to the database fail midway
• Transaction rollback
Seeder
Connectivity to source/sink systems fails
• Retry the job N times and alert, if needed
Seeding fails midway
• Storm retries tuples
• Failing tuples are moved to an error queue
Table- and row-level failures
• Option to skip the tables/rows but send a report at the end
Effect on “live” tables during data synchronization
• Option to use transactions, or
• Use temporary tables and swap with the original upon completion
Failure Modes
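The "retry the job N times and alert, if needed" behavior for connectivity failures can be sketched as a small wrapper. The `alert` hook and the flaky-job example are hypothetical; the real alerting integration is not shown in the deck.

```python
import time

# Retry a job up to `retries` times; if every attempt fails, alert and
# re-raise the last error. `alert` stands in for a paging/email hook.
def run_with_retries(job, retries=3, delay_seconds=0, alert=print):
    last_error = None
    for _attempt in range(retries):
        try:
            return job()
        except Exception as exc:  # e.g. source/sink connectivity failure
            last_error = exc
            time.sleep(delay_seconds)
    alert(f"job failed after {retries} attempts: {last_error}")
    raise last_error

calls = []
def flaky_seed_job():
    calls.append(1)
    if len(calls) < 2:
        raise ConnectionError("EDC source unreachable")
    return "ok"

print(run_with_retries(flaky_seed_job))  # succeeds on the second attempt
```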
30. Can bring in data from more data sources and more studies effectively
Run real-time reports on studies and configure alerts (future)
Can configure refreshes as needed by each use case
Can throttle input and output sources at the study/customer level
Ability to onboard new customers and deploy new studies with minimal human intervention
What Have We Gained
31. A generic framework which
eases integration with new data sources
• For each new source, implement a method to create a virtual schema and to get data for a given table
scales and is fault tolerant
has generic monitoring and alerting
eases maintenance, since it is mostly generic code
notifies of important events through messages
runs on any hardware
What Have We Gained
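The per-source contract described above ("create a virtual schema and get data for a given table") might look roughly like this. The class and method names are hypothetical, and the toy file-import adapter works on in-memory data purely for illustration.

```python
from abc import ABC, abstractmethod

# Hypothetical sketch of the pluggable adapter contract: each new data
# source only has to expose a virtual schema and per-table data access;
# everything downstream (sync, monitoring, retries) stays generic.
class SourceAdapter(ABC):
    @abstractmethod
    def virtual_schema(self):
        """Return {table_name: {column_name: sql_type}}."""

    @abstractmethod
    def fetch_rows(self, table):
        """Yield rows of one table as dicts keyed by column name."""

class FileImportAdapter(SourceAdapter):
    """Toy file-import adapter over in-memory tables."""
    def __init__(self, tables):
        self.tables = tables  # {table_name: list of row dicts}

    def virtual_schema(self):
        # Everything is TEXT in this toy; a real adapter would infer types.
        return {t: {col: "TEXT" for col in rows[0]}
                for t, rows in self.tables.items()}

    def fetch_rows(self, table):
        yield from self.tables[table]

adapter = FileImportAdapter({"labs": [{"subject": "001", "value": "5.4"}]})
print(adapter.virtual_schema())
```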
32. Clinical Trials – Lay of the land
Business and Technical Requirements
Technology Evaluation
High Level Architecture
Implementation
Managing Hardware
Deployments
Data Adapters: Implementation and Failure Modes
Distributed File System
Challenges
Future Work
Overview
33. Accessibility
Customers must be able to drop files securely (SFTP-like functionality)
Ability to access resources through URLs
Data storage
Scalability and Redundancy
Scale out by adding nodes
Resilience against loss of nodes and data centers; replication
Miscellaneous
Access control over read/write
Performance/usage/resource-utilization monitoring
Distributed File System - Requirements
34. Two name nodes running in HA mode, co-located with two journal nodes
Third journal node on a separate node
Data nodes on all bare-metal nodes
Mounting HDFS with FUSE and enabling SFTP through OS-level features
Automatic failover through DNS and HAProxy
HDFS with High Availability Mode
35. Clinical Trials – Lay of the land
Business and Technical Requirements
Technology Evaluation
High Level Architecture
Implementation
Managing Hardware
Deployments
Data Adapters: Implementation and Failure Modes
Distributed File System
Challenges
Future Work
Overview
36. Regulatory requirements
Data encryption requirements for clinical data
Audit trails
Data quality
Source system constraints
Coordination between Synchers and Seeders
Distributed locks and counters
Automatic failover when a name node fails in HDFS
HDFS HA mode stores the active name node in ZK as a Java-serialized object, yikes!
Challenges
37. Clinical Trials – Lay of the land
Business and Technical Requirements
Technology Evaluation
High Level Architecture
Implementation
Managing Hardware
Deployments
Data Adapters: Implementation and Failure Modes
Distributed File System
Challenges
Future Work
Overview
38. Time travel
Ability to go back in time and run reports at any given point in time
Trail of data
Containerization
In-memory query execution with Apache Spark
Future Work