Advancing life sciences with IBM reference architecture for genomics
1. IBM Systems and Technology
Solution Brief
Life Sciences
Advancing life sciences
with the IBM reference
architecture for genomics
A forward-looking, end-to-end and collaborative
solution for genomics research and medicine
Highlights
●● ● ●
End-to-end, unified solution for genomics
research, translational medicine and
personalized medicine
●● ● ●
Scalable, software-defined, and data-
centric architecture designed for
cutting-edge research and clinical use
●● ● ●
Developed in collaboration with leading
researchers and partners worldwide
Genomic medicine is revolutionizing medical research and clinical care,
enabling scientists and clinicians to identify individuals at risk of disease,
provide early diagnoses, and recommend better treatments. Prolific use
of next generation sequencers is flooding researchers and clinicians with
data. Infrastructures often cannot scale to process, store and distribute
this data in time. Many rely on third parties to process and store the raw
data, which slows access for analysis. In addition, context and mapping
of genomic, phenotypic and environmental data must be available.
IBM, in collaboration with key researchers and partners, created the
IBM reference architecture for genomics. The end-to-end reference
architecture defines the enterprise data management, workflow orches-
tration and global access capabilities across key genomics, translational
and personalized medicine platforms. It supports large-scale genomics
sequencing and downstream data analytics, providing:
●● ●
Data lifecycle management to support large scale data growth
●● ●
Software-based abstraction layers for compute, storage, big data
and cloud
●● ●
Workload and workflow orchestrator for applications
By investigating the human genome in the context of biological pathways
and environmental factors, genomic scientists and clinicians can now
identify individuals at risk of disease, provide early diagnoses based on
biomarkers and recommend effective treatments.
2. 2
Solution Brief
Life SciencesIBM Systems and Technology
Due to new technology and research methods, the field of
genomics is being flooded with data from next-generation
sequencers. Furthermore, this data must be quickly stored,
analyzed, shared and archived. Many genomics, cancer and
pharmaceutical research institutions are now generating so
much data that they can no longer process or even transmit
the information over standard communication lines in a timely
fashion. Often researchers must physically ship raw data to an
external computing center for processing and storage, slowing
down and inhibiting ready access to and analysis of data. In
addition to scale and speed, this information needs to be linked
based on data models and taxonomy, or to be curated with
machine or human knowledge. Only then can this “smart”
data be factored into the equation when dealing with genomic,
phenotypic and environmental data and made available to a
common analytical platform.
The IBM reference architecture for genomics is an end-to-end
reference architecture for genomic medicine. It defines the
enterprise capability of data management, workflow orchestra-
tion and global access across key platforms for genomics,
translational and personalized medicine. Based on this architec-
ture, IBM has built a data-centric, software-defined, and
application-ready infrastructure in support of large-scale
genomics sequencing and downstream data analytics:
●● ●
Data-centric: Helps you meet the challenge of managing the
explosive growth of genomics and clinical data with data
lifecycle management capabilities.
●● ●
Software-defined: Defines the architecture with software-
based abstraction layers for computation, storage, big data
and the cloud.
●● ●
Application-ready: Integrates a multitude of applications with
a workload and workflow orchestrator.
Data management using a data hub
Data lifecycle management is critical for genomics due to data
volume, velocity and variety. Genomic data volume is surging as
the cost of sequencing drops precipitously. The I/O throughput
on a genomic system can be extremely demanding due to data
volume and to the large number of file and directory objects.
Many data formats, with varying degrees of lifecycle manage-
ment requirements, ranging from transient files in scratch space
to VCF variant-calling files, must perpetually remain online.
Leveraging IBM Elastic Storage (see Figure 1), the data man-
agement layer, or a data hub, defines an enterprise capability to
meet these challenges based on its scalable, extensible and high-
performance architecture. First developed and optimized as a
high performance computing (HPC) file system, IBM Elastic
Storage serves large volumes of data at a high bandwidth and
in parallel to all the compute nodes in the computing system(s).
As genomic pipeline can consist of hundreds of applications
engaged in concurrent data processing on large number of files,
this capability is critical to feeding data to the computational
genomics workflow.
Figure 1. The data management layer or data hub of the IBM reference
architecture for genomics. The data hub provides a global name space
and high-performance access to data stored on SSD/Flash, fast disk, slow
disk and tape archive. The key functions are high data Input/Output (I/O),
policy-based Information Lifecycle Management (ILM), data sharing through
replication and caching, and big data
PACS
LIMS
NGS
Ref DB
Publication
Assembly & Alignment Variant Calling Annotation Bioinformatics
Datahub ILMI/O Sharing Big Data
SSD
Flash
Fast
Disk
Slow
Disk
Tape
3. 3
Solution Brief
Life SciencesIBM Systems and Technology
As the genomics pipeline can generate petabytes of metadata
and data, a system pool built upon solid state drive (SSD) and
flash disk with high-IOPS capability, can be dedicated to store
metadata for files and directories, and in some cases for storing
small size files directly. This feature drastically improves file
system performance and responsiveness to metadata-heavy
operations such as the listing of all files in any given directory.
As a file system with a connector to Hadoop MapReduce, the
data hub can also serve Hadoop MapReduce big data jobs on
the same set as compute nodes, thus eliminating the need for
and complexity of another file system—in this case the Hadoop
Distributed File System (HDFS). As the bioinformatics indus-
try adopts big data technologies including Hadoop MapReduce,
this data hub feature will enable firms to gain both economies
of scale and performance by the sharing of nodes for both
compute and big data analytics.
The policy-based data life cycle management capability allows
the data hub to move data from one storage pool to the others,
maximizing I/O performance, storage utilization and minimiz-
ing operational cost. These storage pools can range from the
high-I/O flash disk to high-capacity storage appliance (GPFS™
storage server) to low-cost tape media (through integration
with tape management solution including IBM TSM/HSM
and LTFS).
The increasingly distributed nature of genomics infrastructure
requires data management on a much larger and global scale.
Data not only needs to be moved or shared across different
sites, its movement or sharing needs to be coordinated with
the computational workload and workflow. To achieve this
coordination, the data hub leverages a key function of the
IBM Elastic Storage called the Advanced File Placement
(AFM). It enables IBM Elastic Storage to extend the global
name space to multiple sites, allowing them to share both a
common metadata catalogue and a cache copy of home data
thus allowing local access for remote client sites. For example,
a genomic center can own, operate, and version-control all ref-
erence databases or datasets, while the affiliated or partnering
sites or centers can access the reference dataset through AFM.
When the centralized copy of database gets updated, so are the
cache copies of the other sites.
With the data hub, a system-wide metadata engine can be built
to index and search all the genomic and clinical data, enabling
faster, more powerful downstream analytics and translational
research.
Workflow management with an
orchestrator
The workflow for genomics is complex yet important. A grow-
ing number of genomic applications have varying degrees of
maturity and types of programming models. Many are single-
threaded (R, for example) or embarrassingly parallel (BWA,
for example) while others are multi-threaded or MPI-enabled
(e.g. MPI BLAST). All of these applications need to work in
concert or tandem in a high throughput and high performance
mode in order to generate final results.
The IBM reference architecture for genomics includes a
workload management or orchestrator layer which defines the
capability to orchestrate applications and workflow. A unique
combination of the IBM® Platform™ LSF® workload man-
ager and Platform Process Manager workflow engine links,
coordinates and shepherds a spectrum of computational and
analytical jobs into fully automated pipelines that can be easily
built, customized, shared and run on a common platform.
This capability provides the necessary abstraction of applica-
tions from the underlying infrastructure including HPC clusters
with graphical unit processors (GPUs) and a big data cluster
in the cloud.
4. 4
Solution Brief
Life SciencesIBM Systems and Technology
The orchestrator distinguishes itself from scripted workflow
tools with its capability to handle complex workflows
dynamically—individual workloads or jobs can be defined
through a user-friendly interface, incorporating variables,
parameters and data definition using standard templates.
The workload manager transparently handles the submission,
placement, monitoring and completion of each job. The
workflow engine connects jobs in linear progressions, condi-
tional branches, or loops, based on user-defined criteria and
requirements for completing and advancing them.
To maximize the throughput of the workflow for genomics
sequencing analysis, a special type of workload can be defined
by using job arrays so data can be split and processed by many
jobs in parallel.
In another innovative use case for genomics processing, multi-
ple sub-flows can be defined as a parallel pipeline for variant
analysis following the alignment of the genome. The results
from each sub-flow can then be merged into a single output
and provide analysts with a comparative view of multiple tools
or settings.
The workflow can also be designed as a module and embedded
into larger workflows as a dynamic building block. Not
only will this approach enable the efficient building and reuse
of the pipelines, it will also encourage collaborative sharing of
genomic pipelines among a group of users or within larger
scientific communities.
Figure 3 shows a screenshot of an end-to-end workflow created
using the orchestrator to process raw sequence data (BCL) into
variants (VCF) using a combination of applications and tools.
As more institutions are deploying hybrid cloud solutions
with distributed resources, the orchestrator can coordinate
the distribution of workloads based on data localities, pre-
defined policies, thresholds and real-time inputs of resource
availabilities. For example, a workflow can be designed
for processing genomic raw data closer to sequencers, and fol-
lowed by sequence alignment and assembly using the Hadoop
MapReduce framework on a remote big data cluster. In another
use case, a workflow can be designed to launch a proxy event of
moving data from a satellite system to the central HPC cluster
when the genomic processing reaches 50 percent of the com-
pletion rate. The computation and data movement can happen
concurrently to save time and costs.
Figure 3. The orchestrator is implemented as a genomic workflow pipeline.
Starting from the left in the pipeline - Box 1: the arrival of data such as bcl
files will automatically trigger CASAVA as the first step of the workflow;
Box 2: a dynamic subflow will use BWA for sequence alignment;
Box 3: Samtool will perform post-processing in a job array;
Box 4: different variant analysis subflows can be triggered in parallel
Figure 2. The workload management or orchestrator layer in the
IBM genomic medicine reference architecture. The orchestrator provides
workload management, workflow engine and provenance capabilities to
orchestrate complex sets of applications and workflows for genomics pipe-
lines, and to abstract applications from underlying computational resources.
PACS
Assembly & Alignment
Orchestrator
LIMS
NGS
Ref DB
Publication
Variant Calling Annotation Bioinformatics
Workload Workflow Provenance
Local Cluster 1 Local Cluster 2 Remote Cluster
(Grid/Cloud)
Base Conversion Alignment Pre-processing
Variant Analysis
GATK
Mutect
CaVEMan
Pindel
BWA
BowtieCASAVA
Samtool
Picard
GATK
BCLBCL FASTQ SAM/BAM Recalibrated BAM SNP
Indels
Translocation
Low-pass copy number
Exome copy number
5. 5
Solution Brief
Life SciencesIBM Systems and Technology
Manage global access with an appcenter
In the IBM reference architecture for genomics, an appcenter is
the user interface into the genomics platform. It provides an
enterprise portal with role-based access and security controls
while allowing researchers and clinicians easy access to data and
workflow tools.
Built with Platform Application Center, this appcenter has
advanced logging capabilities for tracking activities including
jobs, workflow provenance and data access. This is a critical
feature for event reporting, performance analysis, or rerunning
analysis with prior settings.
To harvest all knowledge and information from users and
enable sharing, the appcenter can function as a catalogue of
pre-built and pre-tested workflow definitions and application
templates so users can easily launch them directly from the
portal after uploading the data. The portal can also be config-
ured with a metadata index and search engine enabling users
to browse, query or search pre-indexed data or reference
documents.
Extensible reference architecture
The IBM reference architecture for genomics extends beyond
genomics to cover translational and personalized medicine
platforms as well. These platforms also leverage the data hub,
orchestrator and appcenter as common enterprise capabilities.
The integrated architecture and platforms are shown in
figure 4.
Figure 4. IBM reference architecture for genomics. The blue section depicts the genomics platform. The green section depicts the translational platform.
The purple section highlights depicts the personalized medicine platform
Data Source Data Service Analytics Access
Personalized
Medicine
Patient Portal
Clinical Portal
EMR
Publication
EMR Workflow Disease Registry Surveys Cognitive Analytics Outcome Evaluation
Translational Clinical Knowledge
Orchestrator
Clinical
RWE
Ongtologies
MDM ETL NLP Predictive Analytics
Associative Analytics
Parallel Modeling
Cohort Query
Data Explore
Translational
Clinical DW Omics DW Translational DW
Datahub
Genomics Imaging Proteomics Cytometry
Orchestrator
Datahub
PACS
LIMS
NGS
Ref DB
Publication
Alignment & Assembly Variant Analysis Annotation Functional Genomics
AppCenter
Visualization
Monitoring
Genomics
fastq bam vcf