SlideShare a Scribd company logo
1 of 6
Download to read offline
IBM Systems and Technology
Solution Brief
Life Sciences
Advancing life sciences
with the IBM reference
architecture for genomics
A forward-looking, end-to-end and collaborative
solution for genomics research and medicine
Highlights
●● ● ●
End-to-end, unified solution for genomics
research, translational medicine and
personalized medicine
●● ● ●
Scalable, software-defined, and data-
centric architecture designed for
cutting-edge research and clinical use
●● ● ●
Developed in collaboration with leading
researchers and partners worldwide
Genomic medicine is revolutionizing medical research and clinical care,
enabling scientists and clinicians to identify individuals at risk of disease,
provide early diagnoses, and recommend better treatments. Prolific use
of next generation sequencers is flooding researchers and clinicians with
data. Infrastructures often cannot scale to process, store and distribute
this data in time. Many rely on third parties to process and store the raw
data, which slows access for analysis. In addition, context and mapping
of genomic, phenotypic and environmental data must be available.
IBM, in collaboration with key researchers and partners, created the
IBM reference architecture for genomics. The end-to-end reference
architecture defines the enterprise data management, workflow orches-
tration and global access capabilities across key genomics, translational
and personalized medicine platforms. It supports large-scale genomics
sequencing and downstream data analytics, providing:
●● ●
Data lifecycle management to support large scale data growth
●● ●
Software-based abstraction layers for compute, storage, big data
and cloud
●● ●
Workload and workflow orchestrator for applications
By investigating the human genome in the context of biological pathways
and environmental factors, genomic scientists and clinicians can now
identify individuals at risk of disease, provide early diagnoses based on
biomarkers and recommend effective treatments.
2
Solution Brief
Life SciencesIBM Systems and Technology
Due to new technology and research methods, the field of
genomics is being flooded with data from next-generation
sequencers. Furthermore, this data must be quickly stored,
analyzed, shared and archived. Many genomics, cancer and
pharmaceutical research institutions are now generating so
much data that they can no longer process or even transmit
the information over standard communication lines in a timely
fashion. Often researchers must physically ship raw data to an
external computing center for processing and storage, slowing
down and inhibiting ready access to and analysis of data. In
addition to scale and speed, this information needs to be linked
based on data models and taxonomy, or to be curated with
machine or human knowledge. Only then can this “smart”
data be factored into the equation when dealing with genomic,
phenotypic and environmental data and made available to a
common analytical platform.
The IBM reference architecture for genomics is an end-to-end
reference architecture for genomic medicine. It defines the
enterprise capability of data management, workflow orchestra-
tion and global access across key platforms for genomics,
translational and personalized medicine. Based on this architec-
ture, IBM has built a data-centric, software-defined, and
application-ready infrastructure in support of large-scale
genomics sequencing and downstream data analytics:
●● ●
Data-centric: Helps you meet the challenge of managing the
explosive growth of genomics and clinical data with data
lifecycle management capabilities.
●● ●
Software-defined: Defines the architecture with software-
based abstraction layers for computation, storage, big data
and the cloud.
●● ●
Application-ready: Integrates a multitude of applications with
a workload and workflow orchestrator.
Data management using a data hub
Data lifecycle management is critical for genomics due to data
volume, velocity and variety. Genomic data volume is surging as
the cost of sequencing drops precipitously. The I/O throughput
on a genomic system can be extremely demanding due to data
volume and to the large number of file and directory objects.
Many data formats, with varying degrees of lifecycle manage-
ment requirements, ranging from transient files in scratch space
to VCF variant-calling files, must perpetually remain online.
Leveraging IBM Elastic Storage (see Figure 1), the data man-
agement layer, or a data hub, defines an enterprise capability to
meet these challenges based on its scalable, extensible and high-
performance architecture. First developed and optimized as a
high performance computing (HPC) file system, IBM Elastic
Storage serves large volumes of data at a high bandwidth and
in parallel to all the compute nodes in the computing system(s).
As genomic pipeline can consist of hundreds of applications
engaged in concurrent data processing on large number of files,
this capability is critical to feeding data to the computational
genomics workflow.
Figure 1. The data management layer or data hub of the IBM reference
architecture for genomics. The data hub provides a global name space
and high-performance access to data stored on SSD/Flash, fast disk, slow
disk and tape archive. The key functions are high data Input/Output (I/O),
policy-based Information Lifecycle Management (ILM), data sharing through
replication and caching, and big data
PACS
LIMS
NGS
Ref DB
Publication
Assembly & Alignment Variant Calling Annotation Bioinformatics
Datahub ILMI/O Sharing Big Data
SSD
Flash
Fast
Disk
Slow
Disk
Tape
3
Solution Brief
Life SciencesIBM Systems and Technology
As the genomics pipeline can generate petabytes of metadata
and data, a system pool built upon solid state drive (SSD) and
flash disk with high-IOPS capability, can be dedicated to store
metadata for files and directories, and in some cases for storing
small size files directly. This feature drastically improves file
system performance and responsiveness to metadata-heavy
operations such as the listing of all files in any given directory.
As a file system with a connector to Hadoop MapReduce, the
data hub can also serve Hadoop MapReduce big data jobs on
the same set as compute nodes, thus eliminating the need for
and complexity of another file system—in this case the Hadoop
Distributed File System (HDFS). As the bioinformatics indus-
try adopts big data technologies including Hadoop MapReduce,
this data hub feature will enable firms to gain both economies
of scale and performance by the sharing of nodes for both
compute and big data analytics.
The policy-based data life cycle management capability allows
the data hub to move data from one storage pool to the others,
maximizing I/O performance, storage utilization and minimiz-
ing operational cost. These storage pools can range from the
high-I/O flash disk to high-capacity storage appliance (GPFS™
storage server) to low-cost tape media (through integration
with tape management solution including IBM TSM/HSM
and LTFS).
The increasingly distributed nature of genomics infrastructure
requires data management on a much larger and global scale.
Data not only needs to be moved or shared across different
sites, its movement or sharing needs to be coordinated with
the computational workload and workflow. To achieve this
coordination, the data hub leverages a key function of the
IBM Elastic Storage called the Advanced File Placement
(AFM). It enables IBM Elastic Storage to extend the global
name space to multiple sites, allowing them to share both a
common metadata catalogue and a cache copy of home data
thus allowing local access for remote client sites. For example,
a genomic center can own, operate, and version-control all ref-
erence databases or datasets, while the affiliated or partnering
sites or centers can access the reference dataset through AFM.
When the centralized copy of database gets updated, so are the
cache copies of the other sites.
With the data hub, a system-wide metadata engine can be built
to index and search all the genomic and clinical data, enabling
faster, more powerful downstream analytics and translational
research.
Workflow management with an
orchestrator
The workflow for genomics is complex yet important. A grow-
ing number of genomic applications have varying degrees of
maturity and types of programming models. Many are single-
threaded (R, for example) or embarrassingly parallel (BWA,
for example) while others are multi-threaded or MPI-enabled
(e.g. MPI BLAST). All of these applications need to work in
concert or tandem in a high throughput and high performance
mode in order to generate final results.
The IBM reference architecture for genomics includes a
workload management or orchestrator layer which defines the
capability to orchestrate applications and workflow. A unique
combination of the IBM® Platform™ LSF® workload man-
ager and Platform Process Manager workflow engine links,
coordinates and shepherds a spectrum of computational and
analytical jobs into fully automated pipelines that can be easily
built, customized, shared and run on a common platform.
This capability provides the necessary abstraction of applica-
tions from the underlying infrastructure including HPC clusters
with graphical unit processors (GPUs) and a big data cluster
in the cloud.
4
Solution Brief
Life SciencesIBM Systems and Technology
The orchestrator distinguishes itself from scripted workflow
tools with its capability to handle complex workflows
dynamically—individual workloads or jobs can be defined
through a user-friendly interface, incorporating variables,
parameters and data definition using standard templates.
The workload manager transparently handles the submission,
placement, monitoring and completion of each job. The
workflow engine connects jobs in linear progressions, condi-
tional branches, or loops, based on user-defined criteria and
requirements for completing and advancing them.
To maximize the throughput of the workflow for genomics
sequencing analysis, a special type of workload can be defined
by using job arrays so data can be split and processed by many
jobs in parallel.
In another innovative use case for genomics processing, multi-
ple sub-flows can be defined as a parallel pipeline for variant
analysis following the alignment of the genome. The results
from each sub-flow can then be merged into a single output
and provide analysts with a comparative view of multiple tools
or settings.
The workflow can also be designed as a module and embedded
into larger workflows as a dynamic building block. Not
only will this approach enable the efficient building and reuse
of the pipelines, it will also encourage collaborative sharing of
genomic pipelines among a group of users or within larger
scientific communities.
Figure 3 shows a screenshot of an end-to-end workflow created
using the orchestrator to process raw sequence data (BCL) into
variants (VCF) using a combination of applications and tools.
As more institutions are deploying hybrid cloud solutions
with distributed resources, the orchestrator can coordinate
the distribution of workloads based on data localities, pre-
defined policies, thresholds and real-time inputs of resource
availabilities. For example, a workflow can be designed
for processing genomic raw data closer to sequencers, and fol-
lowed by sequence alignment and assembly using the Hadoop
MapReduce framework on a remote big data cluster. In another
use case, a workflow can be designed to launch a proxy event of
moving data from a satellite system to the central HPC cluster
when the genomic processing reaches 50 percent of the com-
pletion rate. The computation and data movement can happen
concurrently to save time and costs.
Figure 3. The orchestrator is implemented as a genomic workflow pipeline.
Starting from the left in the pipeline - Box 1: the arrival of data such as bcl
files will automatically trigger CASAVA as the first step of the workflow;
Box 2: a dynamic subflow will use BWA for sequence alignment;
Box 3: Samtool will perform post-processing in a job array;
Box 4: different variant analysis subflows can be triggered in parallel
Figure 2. The workload management or orchestrator layer in the
IBM genomic medicine reference architecture. The orchestrator provides
workload management, workflow engine and provenance capabilities to
orchestrate complex sets of applications and workflows for genomics pipe-
lines, and to abstract applications from underlying computational resources.
PACS
Assembly & Alignment
Orchestrator
LIMS
NGS
Ref DB
Publication
Variant Calling Annotation Bioinformatics
Workload Workflow Provenance
Local Cluster 1 Local Cluster 2 Remote Cluster
(Grid/Cloud)
Base Conversion Alignment Pre-processing
Variant Analysis
GATK
Mutect
CaVEMan
Pindel
BWA
BowtieCASAVA
Samtool
Picard
GATK
BCLBCL FASTQ SAM/BAM Recalibrated BAM SNP
Indels
Translocation
Low-pass copy number
Exome copy number
5
Solution Brief
Life SciencesIBM Systems and Technology
Manage global access with an appcenter
In the IBM reference architecture for genomics, an appcenter is
the user interface into the genomics platform. It provides an
enterprise portal with role-based access and security controls
while allowing researchers and clinicians easy access to data and
workflow tools.
Built with Platform Application Center, this appcenter has
advanced logging capabilities for tracking activities including
jobs, workflow provenance and data access. This is a critical
feature for event reporting, performance analysis, or rerunning
analysis with prior settings.
To harvest all knowledge and information from users and
enable sharing, the appcenter can function as a catalogue of
pre-built and pre-tested workflow definitions and application
templates so users can easily launch them directly from the
portal after uploading the data. The portal can also be config-
ured with a metadata index and search engine enabling users
to browse, query or search pre-indexed data or reference
documents.
Extensible reference architecture
The IBM reference architecture for genomics extends beyond
genomics to cover translational and personalized medicine
platforms as well. These platforms also leverage the data hub,
orchestrator and appcenter as common enterprise capabilities.
The integrated architecture and platforms are shown in
figure 4.
Figure 4. IBM reference architecture for genomics. The blue section depicts the genomics platform. The green section depicts the translational platform.
The purple section highlights depicts the personalized medicine platform
Data Source Data Service Analytics Access
Personalized
Medicine
Patient Portal
Clinical Portal
EMR
Publication
EMR Workflow Disease Registry Surveys Cognitive Analytics Outcome Evaluation
Translational Clinical Knowledge
Orchestrator
Clinical
RWE
Ongtologies
MDM ETL NLP Predictive Analytics
Associative Analytics
Parallel Modeling
Cohort Query
Data Explore
Translational
Clinical DW Omics DW Translational DW
Datahub
Genomics Imaging Proteomics Cytometry
Orchestrator
Datahub
PACS
LIMS
NGS
Ref DB
Publication
Alignment & Assembly Variant Analysis Annotation Functional Genomics
AppCenter
Visualization
Monitoring
Genomics
fastq bam vcf
Please Recycle
Why IBM?
IBM Platform Computing software offerings enable IT scal-
ability, performance and agility that help firms work smarter,
faster and outperform their peers. Our comprehensive portfolio
of solutions speeds and scales large-scale simulations, predictive
analytics and visualization.
Platform Computing high-performance infrastructure
management software automatically optimizes IT resources
usage—on-premise and in the cloud—enabling 24x7, real-time
operating agility that can improve the bottom line faster.
For more information
To learn more about the IBM reference architecture from
genomics, please contact your IBM representative or
IBM Business Partner, or visit the following website:
ibm.com/platformcomputing
Additionally, IBM Global Financing can help you acquire the
IT solutions that your business needs in the most cost-effective
and strategic way possible. We’ll partner with credit-qualified
clients to customize an IT financing solution to suit your busi-
ness goals, enable effective cash management, and improve your
total cost of ownership. IBM Global Financing is your smartest
choice to fund critical IT investments and propel your business
forward. For more information, visit: ibm.com/financing
© Copyright IBM Corporation 2014
IBM Corporation
Systems and Technology
Route 100
Somers, NY 10589
Produced in the United States of America
August 2014
IBM, the IBM logo, ibm.com, GPFS, Platform, and LSF are trademarks
of International Business Machines Corp., registered in many jurisdictions
worldwide. Other product and service names might be trademarks
of IBM or other companies. A current list of IBM trademarks is
available on the web at “Copyright and trademark information” at
ibm.com/legal/copytrade.shtml
This document is current as of the initial date of publication and may be
changed by IBM at any time. Not all offerings are available in every country
in which IBM operates.
It is the user’s responsibility to evaluate and verify the operation of any
other products or programs with IBM products and programs.
THE INFORMATION IN THIS DOCUMENT IS PROVIDED
“AS IS” WITHOUT ANY WARRANTY, EXPRESS OR
IMPLIED, INCLUDING WITHOUT ANY WARRANTIES
OF MERCHANTABILITY, FITNESS FOR A PARTICULAR
PURPOSE AND ANY WARRANTY OR CONDITION OF
NON-INFRINGEMENT. IBM products are warranted according to the
terms and conditions of the agreements under which they are provided.
Statements regarding IBM’s future direction and intent are subject
to change or withdrawal without notice, and represent goals and
objectives only.
DCS03062-USEN-00

More Related Content

What's hot (6)

IJARCCE_49
IJARCCE_49IJARCCE_49
IJARCCE_49
 
Data warehouse presentaion
Data warehouse presentaionData warehouse presentaion
Data warehouse presentaion
 
[IJET-V1I5P5] Authors: T.Jalaja, M.Shailaja
[IJET-V1I5P5] Authors: T.Jalaja, M.Shailaja[IJET-V1I5P5] Authors: T.Jalaja, M.Shailaja
[IJET-V1I5P5] Authors: T.Jalaja, M.Shailaja
 
Distributed Databases Overview
Distributed Databases OverviewDistributed Databases Overview
Distributed Databases Overview
 
Elimination of data redundancy before persisting into dbms using svm classifi...
Elimination of data redundancy before persisting into dbms using svm classifi...Elimination of data redundancy before persisting into dbms using svm classifi...
Elimination of data redundancy before persisting into dbms using svm classifi...
 
B131626
B131626B131626
B131626
 

Viewers also liked

Magical french project
Magical french projectMagical french project
Magical french projectflamehair
 
La vida-anarquica-de-florencio-sanchez-sanchez1
La vida-anarquica-de-florencio-sanchez-sanchez1La vida-anarquica-de-florencio-sanchez-sanchez1
La vida-anarquica-de-florencio-sanchez-sanchez1Eloy Libertad
 
How To Use Pagemodo
How To Use PagemodoHow To Use Pagemodo
How To Use PagemodoJayson Ijalo
 
You Suck at Email Presentation by Julia Roy
You Suck at Email Presentation by Julia RoyYou Suck at Email Presentation by Julia Roy
You Suck at Email Presentation by Julia RoyJulia Roy
 
Sistemas integrados
Sistemas integradosSistemas integrados
Sistemas integradosTessie Alejo
 
Official flyer BOclassic 2016
Official flyer BOclassic 2016Official flyer BOclassic 2016
Official flyer BOclassic 2016Alberto Stretti
 
Requerimientos técnicos exposición "La lucidez de la luz"
Requerimientos técnicos exposición "La lucidez de la luz"Requerimientos técnicos exposición "La lucidez de la luz"
Requerimientos técnicos exposición "La lucidez de la luz"Fundación Montemadrid
 
Sri et al,2014_Two-Artificial-Diet-Formulations-For-Troides-Helena-Linne-Larv...
Sri et al,2014_Two-Artificial-Diet-Formulations-For-Troides-Helena-Linne-Larv...Sri et al,2014_Two-Artificial-Diet-Formulations-For-Troides-Helena-Linne-Larv...
Sri et al,2014_Two-Artificial-Diet-Formulations-For-Troides-Helena-Linne-Larv...Sri Ngatimin
 
Human-centric Software Development Tools
Human-centric Software Development ToolsHuman-centric Software Development Tools
Human-centric Software Development ToolsGail Murphy
 
Galápagos, el primer paso para innovar los censos en el Ecuador
Galápagos, el primer paso para innovar los censos en el EcuadorGalápagos, el primer paso para innovar los censos en el Ecuador
Galápagos, el primer paso para innovar los censos en el EcuadorCarlos Mena
 
HotmixPro Creative
HotmixPro CreativeHotmixPro Creative
HotmixPro CreativeHotmixPRO
 
Sabadell: nuestra ciudad
Sabadell: nuestra ciudadSabadell: nuestra ciudad
Sabadell: nuestra ciudadTurnUpTheMusic
 
Vitamina D en els nens. Més enllà del raquitisme
Vitamina D en els nens. Més enllà del raquitismeVitamina D en els nens. Més enllà del raquitisme
Vitamina D en els nens. Més enllà del raquitismePediatriadeponent
 
Kairos group apr. ingles
Kairos group   apr. inglesKairos group   apr. ingles
Kairos group apr. inglesEduardo Carità
 
Responsabilidad social1
Responsabilidad social1Responsabilidad social1
Responsabilidad social1fmichelle
 

Viewers also liked (20)

Magical french project
Magical french projectMagical french project
Magical french project
 
Ilona Mataradze
Ilona MataradzeIlona Mataradze
Ilona Mataradze
 
La vida-anarquica-de-florencio-sanchez-sanchez1
La vida-anarquica-de-florencio-sanchez-sanchez1La vida-anarquica-de-florencio-sanchez-sanchez1
La vida-anarquica-de-florencio-sanchez-sanchez1
 
How To Use Pagemodo
How To Use PagemodoHow To Use Pagemodo
How To Use Pagemodo
 
You Suck at Email Presentation by Julia Roy
You Suck at Email Presentation by Julia RoyYou Suck at Email Presentation by Julia Roy
You Suck at Email Presentation by Julia Roy
 
Sistemas integrados
Sistemas integradosSistemas integrados
Sistemas integrados
 
Official flyer BOclassic 2016
Official flyer BOclassic 2016Official flyer BOclassic 2016
Official flyer BOclassic 2016
 
Requerimientos técnicos exposición "La lucidez de la luz"
Requerimientos técnicos exposición "La lucidez de la luz"Requerimientos técnicos exposición "La lucidez de la luz"
Requerimientos técnicos exposición "La lucidez de la luz"
 
Sri et al,2014_Two-Artificial-Diet-Formulations-For-Troides-Helena-Linne-Larv...
Sri et al,2014_Two-Artificial-Diet-Formulations-For-Troides-Helena-Linne-Larv...Sri et al,2014_Two-Artificial-Diet-Formulations-For-Troides-Helena-Linne-Larv...
Sri et al,2014_Two-Artificial-Diet-Formulations-For-Troides-Helena-Linne-Larv...
 
Human-centric Software Development Tools
Human-centric Software Development ToolsHuman-centric Software Development Tools
Human-centric Software Development Tools
 
Galápagos, el primer paso para innovar los censos en el Ecuador
Galápagos, el primer paso para innovar los censos en el EcuadorGalápagos, el primer paso para innovar los censos en el Ecuador
Galápagos, el primer paso para innovar los censos en el Ecuador
 
HotmixPro Creative
HotmixPro CreativeHotmixPro Creative
HotmixPro Creative
 
Sabadell: nuestra ciudad
Sabadell: nuestra ciudadSabadell: nuestra ciudad
Sabadell: nuestra ciudad
 
Co incidir 26
Co incidir 26 Co incidir 26
Co incidir 26
 
PLE y EPA
PLE y EPAPLE y EPA
PLE y EPA
 
Vitamina D en els nens. Més enllà del raquitisme
Vitamina D en els nens. Més enllà del raquitismeVitamina D en els nens. Més enllà del raquitisme
Vitamina D en els nens. Més enllà del raquitisme
 
Kairos group apr. ingles
Kairos group   apr. inglesKairos group   apr. ingles
Kairos group apr. ingles
 
887w
887w887w
887w
 
Responsabilidad social1
Responsabilidad social1Responsabilidad social1
Responsabilidad social1
 
Samper del Salz
Samper del SalzSamper del Salz
Samper del Salz
 

Similar to Advancing life sciences with IBM reference architecture for genomics

Hitachi high-performance-accelerates-life-sciences-research
Hitachi high-performance-accelerates-life-sciences-researchHitachi high-performance-accelerates-life-sciences-research
Hitachi high-performance-accelerates-life-sciences-researchHitachi Vantara
 
Standardization of “Drug Safety” Reporting Applications-doc file
Standardization of “Drug Safety” Reporting Applications-doc fileStandardization of “Drug Safety” Reporting Applications-doc file
Standardization of “Drug Safety” Reporting Applications-doc filehalleyzand
 
Dataintensive
DataintensiveDataintensive
Dataintensivesulfath
 
Data-Intensive Technologies for Cloud Computing
Data-Intensive Technologies for CloudComputingData-Intensive Technologies for CloudComputing
Data-Intensive Technologies for Cloud Computinghuda2018
 
Big Data Technology Accelerate Genomics Precision Medicine
Big Data Technology Accelerate Genomics Precision MedicineBig Data Technology Accelerate Genomics Precision Medicine
Big Data Technology Accelerate Genomics Precision Medicinecscpconf
 
BIG DATA TECHNOLOGY ACCELERATE GENOMICS PRECISION MEDICINE
BIG DATA TECHNOLOGY ACCELERATE GENOMICS PRECISION MEDICINE BIG DATA TECHNOLOGY ACCELERATE GENOMICS PRECISION MEDICINE
BIG DATA TECHNOLOGY ACCELERATE GENOMICS PRECISION MEDICINE csandit
 
EMC Greenplum Database version 4.2
EMC Greenplum Database version 4.2 EMC Greenplum Database version 4.2
EMC Greenplum Database version 4.2 EMC
 
Healthcare payer - Big data integration
Healthcare payer - Big data integrationHealthcare payer - Big data integration
Healthcare payer - Big data integrationRajasekaran kandhasamy
 
Paralyzing Bioinformatics Applications Using Conducive Hadoop Cluster
Paralyzing Bioinformatics Applications Using Conducive Hadoop ClusterParalyzing Bioinformatics Applications Using Conducive Hadoop Cluster
Paralyzing Bioinformatics Applications Using Conducive Hadoop ClusterIOSR Journals
 
A Comprehensive Study on Big Data Applications and Challenges
A Comprehensive Study on Big Data Applications and ChallengesA Comprehensive Study on Big Data Applications and Challenges
A Comprehensive Study on Big Data Applications and Challengesijcisjournal
 
Memory Management in BigData: A Perpective View
Memory Management in BigData: A Perpective ViewMemory Management in BigData: A Perpective View
Memory Management in BigData: A Perpective Viewijtsrd
 
spectrum Storage Whitepaper
spectrum Storage Whitepaperspectrum Storage Whitepaper
spectrum Storage WhitepaperCarina Kordan
 
Achieving Cost-Effective, Vendor-Neutral Archiving
Achieving Cost-Effective, Vendor-Neutral ArchivingAchieving Cost-Effective, Vendor-Neutral Archiving
Achieving Cost-Effective, Vendor-Neutral ArchivingCarestream
 
4870 ibm-storage-solutions-final_nov26_18_34019934_usen
4870  ibm-storage-solutions-final_nov26_18_34019934_usen4870  ibm-storage-solutions-final_nov26_18_34019934_usen
4870 ibm-storage-solutions-final_nov26_18_34019934_usenduc_spt
 
Hadoop Integration with Microstrategy
Hadoop Integration with Microstrategy Hadoop Integration with Microstrategy
Hadoop Integration with Microstrategy snehal parikh
 
Query Evaluation Techniques for Large Databases.pdf
Query Evaluation Techniques for Large Databases.pdfQuery Evaluation Techniques for Large Databases.pdf
Query Evaluation Techniques for Large Databases.pdfRayWill4
 
A Framework for Geospatial Web Services for Public Health by Dr. Leslie Lenert
A Framework for Geospatial Web Services for Public Health by Dr. Leslie LenertA Framework for Geospatial Web Services for Public Health by Dr. Leslie Lenert
A Framework for Geospatial Web Services for Public Health by Dr. Leslie LenertWansoo Im
 

Similar to Advancing life sciences with IBM reference architecture for genomics (20)

Hitachi high-performance-accelerates-life-sciences-research
Hitachi high-performance-accelerates-life-sciences-researchHitachi high-performance-accelerates-life-sciences-research
Hitachi high-performance-accelerates-life-sciences-research
 
Standardization of “Drug Safety” Reporting Applications-doc file
Standardization of “Drug Safety” Reporting Applications-doc fileStandardization of “Drug Safety” Reporting Applications-doc file
Standardization of “Drug Safety” Reporting Applications-doc file
 
Dataintensive
DataintensiveDataintensive
Dataintensive
 
Data-Intensive Technologies for Cloud Computing
Data-Intensive Technologies for CloudComputingData-Intensive Technologies for CloudComputing
Data-Intensive Technologies for Cloud Computing
 
Big Data Technology Accelerate Genomics Precision Medicine
Big Data Technology Accelerate Genomics Precision MedicineBig Data Technology Accelerate Genomics Precision Medicine
Big Data Technology Accelerate Genomics Precision Medicine
 
BIG DATA TECHNOLOGY ACCELERATE GENOMICS PRECISION MEDICINE
BIG DATA TECHNOLOGY ACCELERATE GENOMICS PRECISION MEDICINE BIG DATA TECHNOLOGY ACCELERATE GENOMICS PRECISION MEDICINE
BIG DATA TECHNOLOGY ACCELERATE GENOMICS PRECISION MEDICINE
 
EMC Greenplum Database version 4.2
EMC Greenplum Database version 4.2 EMC Greenplum Database version 4.2
EMC Greenplum Database version 4.2
 
Connect July-Aug 2014
Connect July-Aug 2014Connect July-Aug 2014
Connect July-Aug 2014
 
Healthcare payer - Big data integration
Healthcare payer - Big data integrationHealthcare payer - Big data integration
Healthcare payer - Big data integration
 
Paralyzing Bioinformatics Applications Using Conducive Hadoop Cluster
Paralyzing Bioinformatics Applications Using Conducive Hadoop ClusterParalyzing Bioinformatics Applications Using Conducive Hadoop Cluster
Paralyzing Bioinformatics Applications Using Conducive Hadoop Cluster
 
A Comprehensive Study on Big Data Applications and Challenges
A Comprehensive Study on Big Data Applications and ChallengesA Comprehensive Study on Big Data Applications and Challenges
A Comprehensive Study on Big Data Applications and Challenges
 
Memory Management in BigData: A Perpective View
Memory Management in BigData: A Perpective ViewMemory Management in BigData: A Perpective View
Memory Management in BigData: A Perpective View
 
spectrum Storage Whitepaper
spectrum Storage Whitepaperspectrum Storage Whitepaper
spectrum Storage Whitepaper
 
FC Brochure & Insert
FC Brochure & InsertFC Brochure & Insert
FC Brochure & Insert
 
Achieving Cost-Effective, Vendor-Neutral Archiving
Achieving Cost-Effective, Vendor-Neutral ArchivingAchieving Cost-Effective, Vendor-Neutral Archiving
Achieving Cost-Effective, Vendor-Neutral Archiving
 
4870 ibm-storage-solutions-final_nov26_18_34019934_usen
4870  ibm-storage-solutions-final_nov26_18_34019934_usen4870  ibm-storage-solutions-final_nov26_18_34019934_usen
4870 ibm-storage-solutions-final_nov26_18_34019934_usen
 
Hadoop Integration with Microstrategy
Hadoop Integration with Microstrategy Hadoop Integration with Microstrategy
Hadoop Integration with Microstrategy
 
Query Evaluation Techniques for Large Databases.pdf
Query Evaluation Techniques for Large Databases.pdfQuery Evaluation Techniques for Large Databases.pdf
Query Evaluation Techniques for Large Databases.pdf
 
A Framework for Geospatial Web Services for Public Health by Dr. Leslie Lenert
A Framework for Geospatial Web Services for Public Health by Dr. Leslie LenertA Framework for Geospatial Web Services for Public Health by Dr. Leslie Lenert
A Framework for Geospatial Web Services for Public Health by Dr. Leslie Lenert
 
Ibm and zato health
Ibm and zato healthIbm and zato health
Ibm and zato health
 

Advancing life sciences with IBM reference architecture for genomics

  • 1. IBM Systems and Technology Solution Brief Life Sciences Advancing life sciences with the IBM reference architecture for genomics A forward-looking, end-to-end and collaborative solution for genomics research and medicine Highlights ●● ● ● End-to-end, unified solution for genomics research, translational medicine and personalized medicine ●● ● ● Scalable, software-defined, and data- centric architecture designed for cutting-edge research and clinical use ●● ● ● Developed in collaboration with leading researchers and partners worldwide Genomic medicine is revolutionizing medical research and clinical care, enabling scientists and clinicians to identify individuals at risk of disease, provide early diagnoses, and recommend better treatments. Prolific use of next generation sequencers is flooding researchers and clinicians with data. Infrastructures often cannot scale to process, store and distribute this data in time. Many rely on third parties to process and store the raw data, which slows access for analysis. In addition, context and mapping of genomic, phenotypic and environmental data must be available. IBM, in collaboration with key researchers and partners, created the IBM reference architecture for genomics. The end-to-end reference architecture defines the enterprise data management, workflow orches- tration and global access capabilities across key genomics, translational and personalized medicine platforms. It supports large-scale genomics sequencing and downstream data analytics, providing: ●● ● Data lifecycle management to support large scale data growth ●● ● Software-based abstraction layers for compute, storage, big data and cloud ●● ● Workload and workflow orchestrator for applications By investigating the human genome in the context of biological pathways and environmental factors, genomic scientists and clinicians can now identify individuals at risk of disease, provide early diagnoses based on biomarkers and recommend effective treatments.
  • 2. 2 Solution Brief Life SciencesIBM Systems and Technology Due to new technology and research methods, the field of genomics is being flooded with data from next-generation sequencers. Furthermore, this data must be quickly stored, analyzed, shared and archived. Many genomics, cancer and pharmaceutical research institutions are now generating so much data that they can no longer process or even transmit the information over standard communication lines in a timely fashion. Often researchers must physically ship raw data to an external computing center for processing and storage, slowing down and inhibiting ready access to and analysis of data. In addition to scale and speed, this information needs to be linked based on data models and taxonomy, or to be curated with machine or human knowledge. Only then can this “smart” data be factored into the equation when dealing with genomic, phenotypic and environmental data and made available to a common analytical platform. The IBM reference architecture for genomics is an end-to-end reference architecture for genomic medicine. It defines the enterprise capability of data management, workflow orchestra- tion and global access across key platforms for genomics, translational and personalized medicine. Based on this architec- ture, IBM has built a data-centric, software-defined, and application-ready infrastructure in support of large-scale genomics sequencing and downstream data analytics: ●● ● Data-centric: Helps you meet the challenge of managing the explosive growth of genomics and clinical data with data lifecycle management capabilities. ●● ● Software-defined: Defines the architecture with software- based abstraction layers for computation, storage, big data and the cloud. ●● ● Application-ready: Integrates a multitude of applications with a workload and workflow orchestrator. Data management using a data hub Data lifecycle management is critical for genomics due to data volume, velocity and variety. Genomic data volume is surging as the cost of sequencing drops precipitously. The I/O throughput on a genomic system can be extremely demanding due to data volume and to the large number of file and directory objects. Many data formats, with varying degrees of lifecycle manage- ment requirements, ranging from transient files in scratch space to VCF variant-calling files, must perpetually remain online. Leveraging IBM Elastic Storage (see Figure 1), the data man- agement layer, or a data hub, defines an enterprise capability to meet these challenges based on its scalable, extensible and high- performance architecture. First developed and optimized as a high performance computing (HPC) file system, IBM Elastic Storage serves large volumes of data at a high bandwidth and in parallel to all the compute nodes in the computing system(s). As genomic pipeline can consist of hundreds of applications engaged in concurrent data processing on large number of files, this capability is critical to feeding data to the computational genomics workflow. Figure 1. The data management layer or data hub of the IBM reference architecture for genomics. The data hub provides a global name space and high-performance access to data stored on SSD/Flash, fast disk, slow disk and tape archive. The key functions are high data Input/Output (I/O), policy-based Information Lifecycle Management (ILM), data sharing through replication and caching, and big data PACS LIMS NGS Ref DB Publication Assembly & Alignment Variant Calling Annotation Bioinformatics Datahub ILMI/O Sharing Big Data SSD Flash Fast Disk Slow Disk Tape
  • 3. 3 Solution Brief Life SciencesIBM Systems and Technology As the genomics pipeline can generate petabytes of metadata and data, a system pool built upon solid state drive (SSD) and flash disk with high-IOPS capability, can be dedicated to store metadata for files and directories, and in some cases for storing small size files directly. This feature drastically improves file system performance and responsiveness to metadata-heavy operations such as the listing of all files in any given directory. As a file system with a connector to Hadoop MapReduce, the data hub can also serve Hadoop MapReduce big data jobs on the same set as compute nodes, thus eliminating the need for and complexity of another file system—in this case the Hadoop Distributed File System (HDFS). As the bioinformatics indus- try adopts big data technologies including Hadoop MapReduce, this data hub feature will enable firms to gain both economies of scale and performance by the sharing of nodes for both compute and big data analytics. The policy-based data life cycle management capability allows the data hub to move data from one storage pool to the others, maximizing I/O performance, storage utilization and minimiz- ing operational cost. These storage pools can range from the high-I/O flash disk to high-capacity storage appliance (GPFS™ storage server) to low-cost tape media (through integration with tape management solution including IBM TSM/HSM and LTFS). The increasingly distributed nature of genomics infrastructure requires data management on a much larger and global scale. Data not only needs to be moved or shared across different sites, its movement or sharing needs to be coordinated with the computational workload and workflow. To achieve this coordination, the data hub leverages a key function of the IBM Elastic Storage called the Advanced File Placement (AFM). It enables IBM Elastic Storage to extend the global name space to multiple sites, allowing them to share both a common metadata catalogue and a cache copy of home data thus allowing local access for remote client sites. For example, a genomic center can own, operate, and version-control all ref- erence databases or datasets, while the affiliated or partnering sites or centers can access the reference dataset through AFM. When the centralized copy of database gets updated, so are the cache copies of the other sites. With the data hub, a system-wide metadata engine can be built to index and search all the genomic and clinical data, enabling faster, more powerful downstream analytics and translational research. Workflow management with an orchestrator The workflow for genomics is complex yet important. A grow- ing number of genomic applications have varying degrees of maturity and types of programming models. Many are single- threaded (R, for example) or embarrassingly parallel (BWA, for example) while others are multi-threaded or MPI-enabled (e.g. MPI BLAST). All of these applications need to work in concert or tandem in a high throughput and high performance mode in order to generate final results. The IBM reference architecture for genomics includes a workload management or orchestrator layer which defines the capability to orchestrate applications and workflow. A unique combination of the IBM® Platform™ LSF® workload man- ager and Platform Process Manager workflow engine links, coordinates and shepherds a spectrum of computational and analytical jobs into fully automated pipelines that can be easily built, customized, shared and run on a common platform. This capability provides the necessary abstraction of applica- tions from the underlying infrastructure including HPC clusters with graphical unit processors (GPUs) and a big data cluster in the cloud.
  • 4. 4 Solution Brief Life SciencesIBM Systems and Technology The orchestrator distinguishes itself from scripted workflow tools with its capability to handle complex workflows dynamically—individual workloads or jobs can be defined through a user-friendly interface, incorporating variables, parameters and data definition using standard templates. The workload manager transparently handles the submission, placement, monitoring and completion of each job. The workflow engine connects jobs in linear progressions, condi- tional branches, or loops, based on user-defined criteria and requirements for completing and advancing them. To maximize the throughput of the workflow for genomics sequencing analysis, a special type of workload can be defined by using job arrays so data can be split and processed by many jobs in parallel. In another innovative use case for genomics processing, multi- ple sub-flows can be defined as a parallel pipeline for variant analysis following the alignment of the genome. The results from each sub-flow can then be merged into a single output and provide analysts with a comparative view of multiple tools or settings. The workflow can also be designed as a module and embedded into larger workflows as a dynamic building block. Not only will this approach enable the efficient building and reuse of the pipelines, it will also encourage collaborative sharing of genomic pipelines among a group of users or within larger scientific communities. Figure 3 shows a screenshot of an end-to-end workflow created using the orchestrator to process raw sequence data (BCL) into variants (VCF) using a combination of applications and tools. As more institutions are deploying hybrid cloud solutions with distributed resources, the orchestrator can coordinate the distribution of workloads based on data localities, pre- defined policies, thresholds and real-time inputs of resource availabilities. For example, a workflow can be designed for processing genomic raw data closer to sequencers, and fol- lowed by sequence alignment and assembly using the Hadoop MapReduce framework on a remote big data cluster. In another use case, a workflow can be designed to launch a proxy event of moving data from a satellite system to the central HPC cluster when the genomic processing reaches 50 percent of the com- pletion rate. The computation and data movement can happen concurrently to save time and costs. Figure 3. The orchestrator is implemented as a genomic workflow pipeline. Starting from the left in the pipeline - Box 1: the arrival of data such as bcl files will automatically trigger CASAVA as the first step of the workflow; Box 2: a dynamic subflow will use BWA for sequence alignment; Box 3: Samtool will perform post-processing in a job array; Box 4: different variant analysis subflows can be triggered in parallel Figure 2. The workload management or orchestrator layer in the IBM genomic medicine reference architecture. The orchestrator provides workload management, workflow engine and provenance capabilities to orchestrate complex sets of applications and workflows for genomics pipe- lines, and to abstract applications from underlying computational resources. PACS Assembly & Alignment Orchestrator LIMS NGS Ref DB Publication Variant Calling Annotation Bioinformatics Workload Workflow Provenance Local Cluster 1 Local Cluster 2 Remote Cluster (Grid/Cloud) Base Conversion Alignment Pre-processing Variant Analysis GATK Mutect CaVEMan Pindel BWA BowtieCASAVA Samtool Picard GATK BCLBCL FASTQ SAM/BAM Recalibrated BAM SNP Indels Translocation Low-pass copy number Exome copy number
  • 5. 5 Solution Brief Life SciencesIBM Systems and Technology Manage global access with an appcenter In the IBM reference architecture for genomics, an appcenter is the user interface into the genomics platform. It provides an enterprise portal with role-based access and security controls while allowing researchers and clinicians easy access to data and workflow tools. Built with Platform Application Center, this appcenter has advanced logging capabilities for tracking activities including jobs, workflow provenance and data access. This is a critical feature for event reporting, performance analysis, or rerunning analysis with prior settings. To harvest all knowledge and information from users and enable sharing, the appcenter can function as a catalogue of pre-built and pre-tested workflow definitions and application templates so users can easily launch them directly from the portal after uploading the data. The portal can also be config- ured with a metadata index and search engine enabling users to browse, query or search pre-indexed data or reference documents. Extensible reference architecture The IBM reference architecture for genomics extends beyond genomics to cover translational and personalized medicine platforms as well. These platforms also leverage the data hub, orchestrator and appcenter as common enterprise capabilities. The integrated architecture and platforms are shown in figure 4. Figure 4. IBM reference architecture for genomics. The blue section depicts the genomics platform. The green section depicts the translational platform. The purple section highlights depicts the personalized medicine platform Data Source Data Service Analytics Access Personalized Medicine Patient Portal Clinical Portal EMR Publication EMR Workflow Disease Registry Surveys Cognitive Analytics Outcome Evaluation Translational Clinical Knowledge Orchestrator Clinical RWE Ongtologies MDM ETL NLP Predictive Analytics Associative Analytics Parallel Modeling Cohort Query Data Explore Translational Clinical DW Omics DW Translational DW Datahub Genomics Imaging Proteomics Cytometry Orchestrator Datahub PACS LIMS NGS Ref DB Publication Alignment & Assembly Variant Analysis Annotation Functional Genomics AppCenter Visualization Monitoring Genomics fastq bam vcf
  • 6. Please Recycle Why IBM? IBM Platform Computing software offerings enable IT scal- ability, performance and agility that help firms work smarter, faster and outperform their peers. Our comprehensive portfolio of solutions speeds and scales large-scale simulations, predictive analytics and visualization. Platform Computing high-performance infrastructure management software automatically optimizes IT resources usage—on-premise and in the cloud—enabling 24x7, real-time operating agility that can improve the bottom line faster. For more information To learn more about the IBM reference architecture from genomics, please contact your IBM representative or IBM Business Partner, or visit the following website: ibm.com/platformcomputing Additionally, IBM Global Financing can help you acquire the IT solutions that your business needs in the most cost-effective and strategic way possible. We’ll partner with credit-qualified clients to customize an IT financing solution to suit your busi- ness goals, enable effective cash management, and improve your total cost of ownership. IBM Global Financing is your smartest choice to fund critical IT investments and propel your business forward. For more information, visit: ibm.com/financing © Copyright IBM Corporation 2014 IBM Corporation Systems and Technology Route 100 Somers, NY 10589 Produced in the United States of America August 2014 IBM, the IBM logo, ibm.com, GPFS, Platform, and LSF are trademarks of International Business Machines Corp., registered in many jurisdictions worldwide. Other product and service names might be trademarks of IBM or other companies. A current list of IBM trademarks is available on the web at “Copyright and trademark information” at ibm.com/legal/copytrade.shtml This document is current as of the initial date of publication and may be changed by IBM at any time. Not all offerings are available in every country in which IBM operates. It is the user’s responsibility to evaluate and verify the operation of any other products or programs with IBM products and programs. THE INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS” WITHOUT ANY WARRANTY, EXPRESS OR IMPLIED, INCLUDING WITHOUT ANY WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND ANY WARRANTY OR CONDITION OF NON-INFRINGEMENT. IBM products are warranted according to the terms and conditions of the agreements under which they are provided. Statements regarding IBM’s future direction and intent are subject to change or withdrawal without notice, and represent goals and objectives only. DCS03062-USEN-00