Enabling next-generation sequencing applications with IBM Storwize V7000 Unified and SONAS Gateway solutions
Learn about enabling next-generation sequencing applications with IBM Storwize V7000 Unified and SONAS Gateway solutions. This paper offers recommendations and guidance to facilitate easy configuration and installation of the solution to ensure an efficient installation with good performance. For more information on IBM Storage Systems, visit http://ibm.co/LIg7gk.



Published in: Technology

Transcript

  • 1. Enabling next-generation sequencing applications with IBM Storwize V7000 Unified and SONAS Gateway solutions
    Dr. Tzy-Hwa Tzeng, Dr. Ruzhu Chen, Justin Morosi, Prashant Avashia
    IBM Systems and Technology Group ISV Enablement
    October 2012
    © Copyright IBM Corporation, 2012
  • 2. Table of Contents
    Abstract
    Introduction: DNA and RNA sequencing applications
      DNA, RNA, and next-generation sequencing (NGS) technologies
      Analysis tools
    Introduction: IBM Storwize V7000 Unified and SONAS systems
      IBM Storwize V7000 Unified system overview
      IBM SONAS Gateway system overview
      Differences: IBM Storwize V7000 Unified and SONAS Gateway as NAS systems
    Architectural assumptions
    IBM Storwize V7000 Unified: Configurations, tests, and results
    IBM SONAS Gateway: NAS configurations, tests, and results
    File systems layout: Best practice recommendations
    Solution benefits: IBM Storwize V7000 Unified and SONAS Gateway system
    Summary
    Acknowledgments
    Appendices
      Appendix A: Typical server and storage configuration sizing recommendations
      Appendix B: Resources
    About the authors
    Trademarks and special notices
  • 3. Abstract
    Next-generation genomic sequencing technologies have been instrumental in significantly accelerating biological research and the discovery of genomes for humans, mice, snakes, plants, bacteria, viruses, cancer cells, and so on. Researchers now process immense data sets, build analytical deoxyribonucleic acid (DNA) models for large genomes, use reference-based analytic methods, and further their understanding of genomic models for drug discovery, personalized medicine, toxicology, forensics, agriculture, nanotechnology, and other emerging use cases.
    IBM has partnered with CLC bio Inc. to bring validated and integrated smart computing solutions that combine intelligent Assembly Cell software and optimized open systems software from the public domain with IBM Smarter Storage. This joint solution incorporates IBM industry expertise, best practices, and IBM technologies to help research institutions and pharmaceutical companies manage, query, analyze, and better understand integrated genotypic and phenotypic data for medical research and patient treatment.
    This paper validates that Smarter Storage solutions based on IBM Storwize V7000 Unified and Scale Out Network Attached Storage (SONAS) Gateway offer good application performance and availability for de novo assembly and reference-based mapping algorithms under the following circumstances:
      • Access to genomic data from DNA and ribonucleic acid (RNA) sequences is configured on IBM Storwize V7000 Unified or SONAS Gateway solutions.
      • The CLC Assembly Cell or open systems software applications are configured on Red Hat Enterprise Linux (RHEL) servers.
      • The Network File System (NFS) v3 services are configured and delivered over the Internet Protocol (IP) network.
    This paper offers recommendations and guidance to facilitate easy configuration and installation of the solution, with the aim of an efficient installation with good performance.
  • 4. Introduction: DNA and RNA sequencing applications
    Genetic concepts and interesting facts
    All humans, animals, plants, and other living organisms are composed of cells. Inside each cell resides a nucleus. The nucleus is a self-contained unit that acts as a central entity, managing the functions and activity inside and outside the cell. The nucleus contains most of the cell's genetic information, organized as multiple long linear DNA molecules that, together with a large variety of proteins, form chromosomes. The genes within these chromosomes make up the cell's genome. The function of the nucleus is to maintain the integrity of these genes and control the cell activities. The nucleus is, therefore, the control center of the cell.
    Genes, DNA, and RNA
    Genes are made up of various lengths of DNA, which contains four chemicals: adenine (A), guanine (G), cytosine (C), and thymine (T). These chemicals line up like beads on a necklace to form strands of code. They also pair up with each other to form the double strands that are characteristic of DNA. The sequence of a nucleic acid is the composition of atoms that make up the nucleic acid and the chemical bonds that join those atoms.
    DNA is a nucleic acid containing the genetic instructions used in the development and functioning of all known living organisms (with the exception of RNA viruses). The DNA segments carrying this genetic information are called genes. Other DNA sequences have structural purposes or are involved in regulating the use of this genetic information. Along with RNA and proteins, DNA is one of the major macromolecules that are essential for all known forms of life.
    RNA is also a nucleic acid, and is one of the four major macromolecules (along with lipids, carbohydrates, and proteins) essential for all known forms of life. Similar to DNA, RNA is made up of a long chain of components called nucleotides.
    Each nucleotide consists of a nucleobase, a ribose sugar, and a phosphate group. The sequence of nucleotides allows RNA to encode genetic information. In addition, many viruses use RNA instead of DNA as their genetic material. The chemical structure of RNA is very similar to that of DNA, with two differences: (a) RNA contains the sugar ribose, while DNA contains the slightly different sugar deoxyribose (a type of ribose that lacks one oxygen atom), and (b) RNA has the nucleobase uracil while DNA contains thymine. The other three nucleobases, namely adenine (A), guanine (G), and cytosine (C), are the same in both DNA and RNA. Unlike DNA, most RNA molecules are single-stranded and can adopt very complex three-dimensional structures.
    Human genome
    The human genome includes a complete set of human genetic information stored as separate DNA sequences in the 23 chromosome pairs of the human cell nucleus, and a small amount of DNA in the mitochondria, which supply the chemical energy required for the cell to survive. The human genome is estimated to be about 3.2 billion base pairs long and it contains about 20,000 to 25,000 distinct genes. There are 23 chromosome pairs in each cell. The twenty-third chromosome pair determines sex. If it is a pair of X chromosomes, then in many animal species, it determines a female. If it is a combination of X and Y chromosomes, it determines a male.
  • 5. The X chromosome has more than 153 million base pairs and represents about 2000 of the 20,000 to 25,000 genes in the human genome (or about 10% of the total gene population). The Y chromosome has about 58 million base pairs and represents about 200 to 500 of the 20,000 to 25,000 genes in the human genome. The largest human chromosome is chromosome 1, and is approximately 220 million base pairs long. The smallest chromosome is the mitochondrial DNA, and is approximately 16,000 base pairs long.
    DNA, RNA, and next-generation sequencing (NGS) technologies
    In genetics, the sequencing processes determine the primary structure of an unbranched biopolymer. The sequencing process results in a symbolic linear depiction (known as a sequence), which summarizes much of the atomic-level structure of the sequenced molecule.
    DNA sequencing is the process of reading the nucleotide bases in a DNA molecule. It includes any method or technology that is used to determine the order of the four bases (adenine, guanine, cytosine, and thymine, or A, G, C, T) in a strand of DNA.
    RNA sequencing is the process of reading the nucleotide bases in an RNA molecule. It includes any method or technology that is used to determine the order of the four bases (adenine, guanine, cytosine, and uracil, or A, G, C, U) in a strand of RNA.
    Next-generation sequencing technologies parallelize the sequencing process, producing thousands or millions of sequences at a time. These technologies are intended to lower the cost of sequencing beyond what is possible with standard dye-terminator methods. High-throughput sequencing technologies generate millions of short reads from a library of nucleotide sequences; whether they come from DNA, RNA, or a mixture, the sequencing mechanism of each platform does not vary.
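The two four-letter alphabets described above can be illustrated with a minimal Python sketch. The names `is_valid` and `dna_to_rna` are hypothetical, and rewriting T as U is a simplification that treats the input as a coding-strand sequence; it is not how any of the tools discussed in this paper process reads.

```python
DNA_BASES = set("ACGT")
RNA_BASES = set("ACGU")

def is_valid(sequence, alphabet):
    """Return True if every character of the sequence belongs to the alphabet."""
    return all(base in alphabet for base in sequence.upper())

def dna_to_rna(sequence):
    """Rewrite a DNA coding-strand sequence in the RNA alphabet (T becomes U)."""
    seq = sequence.upper()
    if not is_valid(seq, DNA_BASES):
        raise ValueError("not a DNA sequence: " + sequence)
    return seq.replace("T", "U")

print(dna_to_rna("GATTACA"))
```

The only difference between the two alphabets is the thymine and uracil pair, which is why a single character replacement is enough for this illustration.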
    Analysis tools
    The next-generation sequencing technologies read the biological specimen or tissue sample and create hundreds of thousands (or even millions) of base pairs for analysis. In 2012, a typical sequencing run can take from a single day (24 hours) to a single week (168 hours) and can generate between 100 MB and 3 GB of data. In the next few years, sequencing is expected to generate significantly more precise results even sooner than is possible with current processes, methods, and technologies.
    There are several open-source, high-performance next-generation sequencing tools, such as Burrows-Wheeler Aligner (BWA) and Trinity, that can analyze the genomic DNA and RNA data from the sequencers. Under a commercial license, CLC bio offers the most-comprehensive, high-performance computing solution for the Life Sciences industry. This section explains all three applications.
  • 6. CLC Assembly Cell
    CLC Assembly Cell is available under a commercial license. It is a high-performance computing solution for read mapping and de novo assembly of next-generation sequencing data. It includes native color-space support. The command-line interface (CLI) of CLC Assembly Cell enables its functions to be easily included in scripts and other next-generation sequencing workflows. CLC Assembly Cell uses single-instruction, multiple-data (SIMD) compute instructions to parallelize and accelerate the assembly algorithms, making the program the fastest next-generation sequencing assembler in the market. For reference, visit the following URL: http://www.clcbio.com/wp-content/uploads/2012/09/CLCAssemblyCell12.pdf
    Burrows-Wheeler Aligner
    Burrows-Wheeler Aligner (BWA) is an open-source, high-performance tool, and is freely available with no software licensing restrictions. It is an efficient program that aligns relatively short nucleotide sequences against a long reference sequence, such as the human genome. It implements two algorithms, BWA-SHORT and BWA-SW. The former works for query sequences shorter than 200 base pairs and the latter for longer sequences up to around 100,000 base pairs. Both algorithms do gapped alignment. They are usually more accurate and faster on queries with low error rates.
    Trinity
    Trinity, developed at the Broad Institute (a collaboration of MIT and Harvard Universities), is also a widely used open-source, high-performance tool. It represents a novel method for the efficient and robust de novo reconstruction of transcriptomes from RNA-Seq data. Trinity combines three independent software modules (Inchworm, Chrysalis, and Butterfly), applied sequentially to process large volumes of RNA-Seq reads.
    Trinity partitions the sequence data into many individual de Bruijn graphs, each representing the transcriptional complexity at a given gene or locus, and then processes each graph independently to extract full-length splicing isoforms and to tease apart transcripts derived from paralogous genes.
    NGS solution benefits at a glance
    The NGS tools are enabled, tested, validated, and certified. They are then included in optimized solutions by IBM®. IBM has combined technology, industry expertise, best practices, and leading analytical partner applications into a tightly integrated solution. With this solution, research institutions and pharmaceutical companies can easily manage, query, analyze, and better understand integrated genotypic and phenotypic data for medical research and patient treatment. They can:
      • Organize, integrate, and manage different kinds of data to enable focused clinical research, including diagnostic, clinical, demographic, genomic, phenotypic, imaging, environmental, and more.
      • Enable secure, cross-department collection and sharing of clinical and research data.
      • Ensure flexibility and growth with an open and industry-standards-based architecture.
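The de Bruijn graphs mentioned in the Trinity description earlier in this section can be illustrated with a minimal Python sketch. This is the generic textbook construction, not Trinity's own implementation; the function name `de_bruijn_graph` and the toy reads are hypothetical.

```python
from collections import defaultdict

def de_bruijn_graph(reads, k):
    """Map each (k-1)-mer to the list of (k-1)-mers that follow it in some read."""
    graph = defaultdict(list)
    for read in reads:
        # Slide a window of length k over the read; each k-mer becomes one edge.
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)

for node, successors in sorted(de_bruijn_graph(["ACGTAC", "GTACG"], 3).items()):
    print(node, "->", successors)
```

Reads that share k-mers end up connected through the same nodes, which is what allows overlapping short reads to be stitched into longer sequences by walking the graph.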
  • 7. Introduction: IBM Storwize V7000 Unified and SONAS systems
    This section provides introductory details and highlights of IBM Storwize® V7000 Unified and IBM SONAS Gateway systems.
    IBM Storwize V7000 Unified system overview
    Many users have deployed storage area network (SAN) attached storage for their applications requiring the highest levels of performance while separately deploying network-attached storage (NAS) for its ease of use and lower cost networking. This divided approach adds complexity by introducing multiple management points and also creates islands of storage that reduce efficiency.
    The Storwize V7000 Unified system provides the ability to combine both block and file storage into a single system. By consolidating storage systems, multiple management points can be eliminated and storage capacity can be shared across both types of access, helping to improve overall storage utilization. The Storwize V7000 Unified system also presents a single, easy-to-use management interface that supports both block and file storage, helping to simplify administration further.
    The Storwize V7000 Unified system builds on the functions and high-performance design of the Storwize V7000 system and integrates proven IBM software capabilities to deliver new levels of efficiency. The Storwize V7000 Unified system provides the same software capabilities as the IBM SONAS system, as follows:
      • Massive scalability:
        - Supports billions of files (up to 21 petabytes of storage) in a single file system
        - Supports up to 256 file systems per single SONAS platform
      • Flexibility:
        - Allows access to data in a single global namespace, allowing all users a single, logical view of files through a single drive letter such as a Z drive
        - Provides efficient distribution of files, images, and application updates and fixes to multiple locations quickly and cost effectively
        - Provides multiple storage tiers for flexible, efficient management of petabytes of files
        - Supports industry-standard protocols: Common Internet File System (CIFS), Network File System (NFS), File Transfer Protocol (FTP), Hypertext Transfer Protocol Secure (HTTPS), and Secure Copy Protocol (SCP)
      • Performance: Leverages two dual-port (all ports active) 10 GbE interface cards offering high bandwidth and additional connectivity in each SONAS interface node to manage multiple data streams and functions (for example, backup, replication, antivirus).
      • Data protection: File system and fileset-level snapshots (up to 256 per file system) provide a way to partition the namespace into smaller, more manageable units.
      • Management: CLI and browser-based, simple, intuitive, and state-of-the-art administrative GUI provide icon-based navigation, informative graphics, and SONAS visualizations that streamline storage tasks and display real-time capacity, performance, and system health.
      • Antivirus: Integrates with McAfee and Symantec Antivirus, enabling users to secure data from malware and use the most commonly deployed ISV antivirus applications.
  • 8. (Clarification for the purposes of this paper: in Life Sciences, a virus is an ultramicroscopic (20 to 200 nm in diameter) infectious agent that replicates within host cells and is composed of a DNA or RNA core and a protein coat. The term antivirus in this paper does not refer to the Life Sciences definition.)
      • Cloud features: Self-managing, autonomic system enables capacity, provisioning, and other IT service management decisions to be made dynamically, without human intervention or increased administrative costs. IBM Active Cloud Engine™ enables ubiquitous access to files from across the globe quickly and cost effectively.
      • Operational savings and total cost of ownership (TCO):
        - Consolidates multiple individual filers and their management, thereby avoiding problems associated with administering an array of disparate NAS systems
        - Automates file placement by transparently moving files to another internal or external storage pool, optimizes your storage resources, and offers tremendous time and cost savings in administering petabytes of files
        - Helps conserve floor space (up to a petabyte of data in less than a square meter), is highly scalable, and can help reduce capital expenditure and enhance operational efficiency; its advanced architecture virtualizes and consolidates file space into a single, enterprise-wide file system, which can translate into reduced TCO
    IBM SONAS Gateway system overview
    The IBM SONAS Gateway system is designed to manage vast repositories of information in enterprise environments requiring very large capacities, high levels of performance, and high availability. SONAS Gateway uses a mature technology from the IBM high-performance computing (HPC) experience. It is based upon the IBM General Parallel File System (IBM GPFS™), a highly scalable clustered file system. SONAS Gateway is an easy-to-install, turnkey, modular, scale-out NAS solution. It provides the performance, clustered scalability, high availability, and functionality that are essential for meeting strategic multi-petabyte age and cloud storage requirements.
    SONAS Gateway currently offers the following features and capabilities:
      • Massive scalability:
        - Supports billions of files (up to 21 petabytes of storage) in a single file system
        - Supports up to 256 file systems per single SONAS platform
      • Flexibility:
        - Allows access to data in a single global namespace, allowing all users a single, logical view of files through a single drive letter such as a Z drive
        - Provides efficient distribution of files, images, and application updates and fixes to multiple locations quickly and cost effectively
        - Provides multiple storage tiers for flexible, efficient management of petabytes of files
        - Supports industry-standard protocols: CIFS, NFS, FTP, HTTPS, and SCP
  • 9.
      • Performance: Leverages two dual-port (all ports active) 10 GbE interface cards offering high bandwidth and additional connectivity in each SONAS interface node to manage multiple data streams and functions (for example, backup, replication, antivirus).
      • Data protection: File system and fileset-level snapshots (up to 256 per file system) provide a way to partition the namespace into smaller, more manageable units.
      • Management: CLI and browser-based, simple, intuitive, and state-of-the-art administrative GUI provide icon-based navigation, informative graphics, and SONAS visualizations that streamline storage tasks and display real-time capacity, performance, and system health.
      • Antivirus: Integrates with McAfee and Symantec Antivirus, enabling users to secure data from malware and use the most commonly deployed ISV antivirus applications. (Clarification for the purposes of this paper: in Life Sciences, a virus is an ultramicroscopic (20 to 200 nm in diameter) infectious agent that replicates within host cells and is composed of a DNA or RNA core and a protein coat. The term antivirus in this paper does not refer to the Life Sciences definition.)
      • Cloud features: Self-managing, autonomic system enables capacity, provisioning, and other IT service management decisions to be made dynamically, without human intervention or increased administrative costs. IBM Active Cloud Engine enables ubiquitous access to files from across the globe quickly and cost effectively.
      • Operational savings and TCO:
        - Consolidates multiple individual filers and their management, thereby avoiding problems associated with administering an array of disparate NAS systems
        - Automates file placement by transparently moving files to another internal or external storage pool, optimizes your storage resources, and offers tremendous time and cost savings in administering petabytes of files
        - Helps conserve floor space (up to a petabyte of data in less than a square meter), is highly scalable, and can help reduce capital expenditure and enhance operational efficiency; its advanced architecture virtualizes and consolidates file space into a single, enterprise-wide file system, which can translate into reduced TCO
    Differences: IBM Storwize V7000 Unified and SONAS Gateway as NAS systems
    The difference between the IBM Storwize V7000 Unified and SONAS Gateway systems lies in the workloads that each system can support. The Storwize V7000 Unified system can support small and medium-size workloads, while the SONAS Gateway system has the scalability to deliver high performance for extremely large application workloads and capacities, typically for the entire enterprise.
  • 10. Table 1 offers the comparative product positioning between the Storwize V7000 Unified and SONAS systems:
    No. | Attribute | Storwize V7000 Unified | SONAS
    1 | Maximum number of interface nodes | 2 | 30
    2 | Maximum number of storage nodes | N/A | 60
    3 | Maximum raw capacity of file storage | 360 TB (3 TB drives x 12 drives per expansion unit x 10 expansion units) | 21.6 PB (3 TB drives x 240 drives x 30 controllers)
    4 | Maximum size of single shared file system (GPFS) | 8 PB | 8 PB
    5 | Maximum number of file systems within a single system | 64 | 256
    6 | Maximum size of a single file | 8 PB | 8 PB
    7 | Maximum number of files per storage system | 4 billion | 4 billion
    8 | Maximum number of dependent file sets per file system | 256 | 3000
    9 | Maximum number of independent file sets | 256 | 1000
    10 | Maximum number of independent file sets | 256 | 1000
    Table 1: Comparative product positioning of Storwize V7000 Unified and SONAS Gateway systems
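The raw-capacity figures in row 3 of Table 1 follow from simple multiplication, which the following Python sketch reproduces. The function name `raw_capacity_tb` is hypothetical, and the conversion assumes decimal units (1 PB = 1000 TB).

```python
def raw_capacity_tb(drive_tb, drives_per_unit, units):
    """Raw capacity in TB: drive size x drives per unit x number of units."""
    return drive_tb * drives_per_unit * units

# Storwize V7000 Unified: 3 TB drives, 12 drives per expansion unit, 10 expansion units
storwize_tb = raw_capacity_tb(3, 12, 10)
# SONAS: 3 TB drives, 240 drives, 30 controllers
sonas_tb = raw_capacity_tb(3, 240, 30)

print(storwize_tb, "TB")
print(sonas_tb / 1000, "PB")
```

These are raw figures; usable capacity after RAID and file system overhead is lower and is not stated in the table.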
  • 11. Architectural assumptions
    Make a note of the following architectural assumptions and caveats in regard to the technical content of this paper.
    This paper does:
      • Offer information and recommendations for tuning adjustments to achieve good performance in normal NAS production environments.
      • Allow a non-technical customer or user to quickly tune their NAS environment by using recommendations, observations, tips, and best practices, as documented.
      • Provide information from a non-technical user point of view for fast implementations.
    This paper does not:
      • Explain the various technologies and solutions in order to establish or publish any benchmarks.
      • Guarantee a specific performance of any technical element.
      • Provide or offer any information to overcome previously established benchmarks.
      • Explain or explore newer technologies, standards, and concepts such as 40 GbE connections, NFS V4, cloud multi-tenancy, and so on.
      • Offer any guidance on how to determine hardware sizing or capacity planning for your installation.
    Caveats:
      • Use judgment in making your decisions. Do not take any published numbers literally.
      • For this paper, the tests were run on different IBM equipment located at different IBM data centers.
      • Note that the performance results might vary, depending on unique server and client conditions, architectural configurations, network behaviors, application dependencies, and operational environments. Your performance and mileage might vary from the test results.
      • Recommended best practices sometimes differ from the test configurations. The test configurations were set up to observe certain behavior in specific test situations. The best practices are recommended to run operations in production environments.
  • 12. IBM Storwize V7000 Unified: Configurations, tests, and results
    Configuration and tests
    An IBM Storwize V7000 Unified system was tested with the three NGS applications: CLC Assembly Cell, BWA, and Trinity. The connectivity between the Storwize V7000 Unified system and the single application server was configured as a NAS-attached configuration. This configuration was a typical use case for a small research facility, with minimal compute resources, as shown in Figure 1.
    Figure 1: NAS-attached Storwize V7000 Unified configuration with NGS applications for a small research facility
    Test results with CLC Assembly Cell
    The following tables summarize the results of successful testing of de novo assembly and reference assembly with the CLC Assembly Cell software, BWA application software, and Trinity application software using identical server and storage configurations, as demonstrated in Figure 1.
  • 13.
    Input | Cores (threads) | XFS-local (minutes) | GPFS (minutes) | SSD (minutes) | Storwize V7000 Unified (minutes)*
    gz-fastq | 16 (16) | 571 | 575 | 544 | 589
    gz-fastq | 32 (32) | 386 | 384 | 376 | 385
    gz-fastq | 32 (64) | 323 | 309 | 320 | 310
    fasta | 16 (16) | 534 | 520 | 525 | 560
    fasta | 32 (32) | 351 | 345 | 337 | 363
    fasta | 32 (64) | 288 | 286 | 267 | 276
    Table 2: CLC Assembly Cell performance results with de novo assembly using non paired-end option

    Input | Cores (threads) | XFS-local (minutes) | GPFS (minutes) | SSD (minutes) | Storwize V7000 Unified (minutes)*
    gz-fastq | 16 (16) | 582 | 570 | 571 | 610
    gz-fastq | 32 (32) | 396 | 374 | 380 | 387
    gz-fastq | 32 (64) | 333 | 329 | 323 | 317
    fasta | 16 (16) | 569 | 566 | 535 | 605
    fasta | 32 (32) | 380 | 365 | 369 | 365
    fasta | 32 (64) | 313 | 315 | 310 | 298
    Table 3: CLC Assembly Cell performance results with de novo assembly using paired-end information

    Input | Cores (threads) | XFS-local (minutes) | GPFS (minutes) | SSD (minutes) | Storwize V7000 Unified (minutes)*
    CLC Assembly Cell Version 4 | 32 (64) | 78.5 | 78 | 72 | 80
    Table 4: CLC Assembly Cell performance results with paired-end reference mapping information

    *Mount options: rw,bg,hard,rsize=1048576,wsize=1048576,proto=tcp,vers=3,noac,nocto,actimeo=0 0 0
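The mount options in the footnote above can be placed into an NFS entry in /etc/fstab. The following Python sketch formats such an entry. It assumes that the trailing "0 0" in the footnote are the fstab dump and fsck-pass fields, which the paper does not state; the server name, export path, and mount point are hypothetical placeholders, as is the function name `fstab_line`.

```python
# Options copied from the footnote, without the trailing "0 0"
NO_CACHE_OPTIONS = (
    "rw,bg,hard,rsize=1048576,wsize=1048576,"
    "proto=tcp,vers=3,noac,nocto,actimeo=0"
)

def fstab_line(server, export, mount_point, options, dump=0, fsck_pass=0):
    """Format one /etc/fstab entry for an NFS mount."""
    return f"{server}:{export} {mount_point} nfs {options} {dump} {fsck_pass}"

# Hypothetical server, export, and mount point, for illustration only
print(fstab_line("nas.example.com", "/ibm/gpfs0/ngs", "/ngs", NO_CACHE_OPTIONS))
```

The `noac` and `actimeo=0` options disable client-side attribute caching, so every attribute lookup goes to the server. This is consistent with the "no cache" label used for this option set in the later BWA results.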
  • 14. Test results with the BWA application
    When the BWA application was run on the same server with the same Storwize V7000 Unified system as the storage back-end, the following test results were obtained, as shown in Table 5.
    Input: Comparing BWA reads_100m.fq with humangenome.fa
    Threads | Storwize V7000 Unified (no cache) [1] | Storwize V7000 Unified (with cache) [2] | Local
    8 | 44 min 46 s | 44 min 57.483 s | 44 min 59 s
    16 | 26 min 29 s | 25 min 38.118 s | 26 min 38 s
    24 | 20 min 50 s | 20 min 9.843 s | 21 min 34 s
    32 | 18 min 40 s | 19 min 0.600 s | 18 min 40 s
    64 | 24 min 58 s | 26 min 47.676 s | 26 min 20 s
    Table 5: BWA performance results with various file system options
    [1] Mount options: rw,bg,hard,rsize=1048576,wsize=1048576,proto=tcp,vers=3,noac,nocto,actimeo=0 0 0
    [2] Mount options: rw,noatime,nodiratime,rsize=1048576,wsize=1048576,proto=tcp,vers=3,timeo=600,addr=9.11.82.103
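The thread scaling in Table 5 can be read more easily as a speedup relative to the 8-thread run. The following Python sketch does that for the Storwize V7000 Unified (no cache) column. The run times are copied from the table; the helper name `to_seconds` and the choice of the 8-thread run as the baseline are illustrative.

```python
def to_seconds(minutes, seconds):
    """Convert a 'min s' pair to seconds."""
    return minutes * 60 + seconds

# Storwize V7000 Unified (no cache) run times from Table 5, keyed by thread count
no_cache_runs = {
    8: to_seconds(44, 46),
    16: to_seconds(26, 29),
    24: to_seconds(20, 50),
    32: to_seconds(18, 40),
    64: to_seconds(24, 58),
}

baseline = no_cache_runs[8]
speedup = {threads: baseline / elapsed for threads, elapsed in no_cache_runs.items()}
best_threads = max(speedup, key=speedup.get)

for threads in sorted(speedup):
    print(f"{threads:>2} threads: {speedup[threads]:.2f}x relative to 8 threads")
print("fastest run:", best_threads, "threads")
```

In this column the run time improves up to 32 threads and gets worse at 64 threads. The same pattern appears in the cached and local columns.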
  • 15. Test results with the Trinity application
    When the Trinity application was run on the same server with the same Storwize V7000 Unified system as the storage back-end, the following test results were obtained, as shown in Table 6.
    Mount options | Duration
    fm1p1:/ibm/gpfs_15k/ngsfs on /gpfs0 type nfs (rw,addr=9.11.82.103) | 869 min 4.618 s
    fm1p1:/ibm/gpfs_15k/ngsfs on /gpfs0 type nfs (rw,noatime,nodiratime,rsize=1048576,wsize=1048576,proto=tcp,vers=3,timeo=600,addr=9.11.82.103) | 866 min 47.335 s
    rw,bg,hard,rsize=1048576,wsize=1048576,proto=tcp,vers=3,noac,nocto,actimeo=0 0 0 | More than 4 days
    /dev/sdb on /xfs type xfs (rw,nobarrier) | 787 min 54.544 s
    9.11.83.71:/ibm/gpfs1/Life_sciences_bak on /NGS type nfs (rw,noatime,nodiratime,rsize=1048576,wsize=1048576,proto=tcp,vers=3,timeo=600,addr=9.11.83.71) | 875 min 36.453 s
    rw,bg,hard,rsize=1048576,wsize=1048576,proto=tcp,vers=3,noac,nocto,actimeo=0 0 0 | More than 4 days
    Table 6: Trinity application performance results with various file system mount point options
    Note: Run times are long because the Trinity application creates millions of files, ranging from 0 MB to 10 MB in size. This is typical behavior for the Trinity application.
    IBM SONAS Gateway: NAS configurations, tests, and results
    Configurations and tests
    An IBM SONAS Gateway system was tested with the three NGS applications: CLC Assembly Cell, BWA, and Trinity. The connectivity between the SONAS Gateway system and 14 IBM BladeCenter® blade servers was configured as a NAS-attached configuration. The blade servers represented application services. This configuration was a typical use case for a medium to large research facility, with adequate compute and performance resources, as shown in Figure 2.
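The Trinity durations in Table 6 can be compared against the local XFS run with a short Python sketch. The durations are copied from the table; the labels, the helper name `to_minutes`, and the choice of local XFS as the baseline are illustrative.

```python
import re

def to_minutes(text):
    """Parse a duration such as '869 min 4.618 s' into minutes."""
    match = re.fullmatch(r"(\d+) min ([\d.]+) s", text)
    if match is None:
        raise ValueError("unrecognized duration: " + text)
    return int(match.group(1)) + float(match.group(2)) / 60

runs = {
    "local XFS (/dev/sdb)": "787 min 54.544 s",
    "NFS, default options (fm1p1)": "869 min 4.618 s",
    "NFS, noatime and 1 MB transfers (fm1p1)": "866 min 47.335 s",
    "NFS, noatime and 1 MB transfers (9.11.83.71)": "875 min 36.453 s",
}

baseline = to_minutes(runs["local XFS (/dev/sdb)"])
for label, duration in runs.items():
    overhead = (to_minutes(duration) / baseline - 1) * 100
    print(f"{label}: {overhead:.1f}% longer than local XFS")
```

The runs reported as "More than 4 days" are left out because the table gives no exact duration for them.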
  • 16. Figure 2: NAS-attached SONAS Gateway configuration with NGS applications for a medium to large research facility
  • 17. Test results with CLC Assembly Cell

The following tables summarize the results of successful testing of de novo assembly and reference assembly with the CLC Assembly Cell software, BWA application software, and Trinity application software using identical server and storage configurations, as shown in Figure 2.

Input      Cores (threads)   SONAS Gateway (minutes)*
gz-fastq   16 (16)           573
gz-fastq   16 (32)           439
fasta      16 (16)           547
fasta      16 (32)           406

Table 7: CLC Assembly Cell performance results with de novo assembly using the non paired-end option

Input      Cores (threads)   SONAS Gateway (minutes)*
gz-fastq   16 (16)           591
gz-fastq   16 (32)           449
fasta      16 (16)           588
fasta      16 (32)           437

Table 8: CLC Assembly Cell performance results with de novo assembly using paired-end information

CLC Assembly Cell   Cores (threads)   SONAS Gateway (minutes)*
Version 4           16 (32)           148

Table 9: CLC Assembly Cell performance results with paired-end reference mapping information

* Mount options: rw,bg,hard,rsize=1048576,wsize=1048576,proto=tcp,vers=3,noac,nocto,actimeo=0 0 0
  • 18. Test results with BWA application

Table 10 summarizes the results of successful testing with the BWA application software.

Input: BWA, reads_100m.fq vs humangenome.fa

Threads   SONAS Gateway Hx9203**   SONAS Gateway Hx9201**   SONAS Gateway Hx9201***
          (no cache)               (no cache)               (cache)
8         50 min 31 s              43 min 58.889 s          44 min 6.455 s
16        30 min 7 s               25 min 30.709 s          26 min 30.983 s
24        26 min 45 s              22 min 49.733 s          22 min 17.789 s
32        25 min 46 s              22 min 38.095 s          23 min 14.785 s

Server    Run time
Hx9202    26 min 34 s
Hx9205    30 min 14 s
Hx9206    30 min 50 s
Hx9207    30 min 1 s
Hx9208    31 min 12 s
Hx9210    30 min 32 s
Hx9211    32 min 13 s
Hx9212    30 min 54 s

Table 10: Results of successful testing of BWA applications on 14 servers attached to the SONAS Gateway system

The following mount options were used for the results listed in Table 10.

** rw,bg,hard,rsize=1048576,wsize=1048576,proto=tcp,vers=3,noac,nocto,actimeo=0 0 0
*** rw,noatime,nodiratime,rsize=1048576,wsize=1048576,proto=tcp,vers=3,timeo=600,addr=9.11.82.103
  • 19. Test results with Trinity application

When the Trinity application was run on the same server, with the same Storwize V7000 Unified system as a storage back-end, the test results shown in Table 11 were obtained.

Mount options: 172.26.39.180:/ibm/gpfs136gb_15k on /lifesci type nfs (rw,addr=172.26.39.180)
Duration: 804 min 2.373 s

Mount options: 172.26.39.180:/ibm/gpfs136gb_15k on /lifesci type nfs (rw,noatime,nodiratime,rsize=1048576,wsize=1048576,proto=tcp,vers=3,timeo=600,addr=172.26.39.180)
Duration: 812 min 7.432 s

Mount options: 172.26.39.180:/ibm/gpfs136gb_15k on /lifesci type nfs (rw,noatime,nodiratime,rsize=1048576,wsize=1048576,proto=tcp,vers=3,timeo=600,addr=172.26.39.180)
Duration: 760 min 50.722 s; 774 min 20.714 s

Table 11: Results of successful testing of Trinity Application on 14 servers attached to SONAS Gateway system

File systems layout: Best practice recommendations

To ensure good application performance and short run times when servers are connected over 10 GbE to IBM Storwize V7000 Unified or SONAS Gateway systems, note the following considerations:

• Proper sizing and stability of the application nodes and servers are essential to drive the workload levels required by the different types of algorithms, such as de novo assembly or reference-based mapping.
• The choice of mount options affects the performance and run time of NGS applications. An incorrect choice results in long-running jobs, because these jobs create millions of files ranging from 0 MB to 10 MB in size.
• None of these applications saturated the network, the IBM Storwize V7000 Unified system, or the IBM SONAS Gateway system.
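Because the Trinity runs in Table 6 that used noac and actimeo=0 took more than 4 days, it can be useful to check a mount option string before starting a long job. The following Python sketch is a hypothetical helper, not an IBM or Linux tool. It assumes the option string has been copied from the parentheses of the mount output, and it omits the trailing fields that follow actimeo=0 in the tables (assumed here not to be mount options).

```python
def parse_mount_options(option_string):
    """Split a comma-separated mount option string into a dictionary.

    Flags such as 'noac' map to True; 'key=value' items map to the value
    as a string.
    """
    options = {}
    for item in option_string.split(","):
        item = item.strip()
        if not item:
            continue
        key, sep, value = item.partition("=")
        options[key] = value if sep else True
    return options

def disables_attribute_caching(options):
    """Return True if the options turn off NFS attribute caching."""
    return options.get("noac") is True or options.get("actimeo") == "0"

if __name__ == "__main__":
    # Option strings as listed in the tables (trailing fields omitted).
    uncached = "rw,bg,hard,rsize=1048576,wsize=1048576,proto=tcp,vers=3,noac,nocto,actimeo=0"
    cached = "rw,noatime,nodiratime,rsize=1048576,wsize=1048576,proto=tcp,vers=3,timeo=600"
    for label, text in (("uncached", uncached), ("cached", cached)):
        opts = parse_mount_options(text)
        print(label, "disables attribute caching:", disables_attribute_caching(opts))
```

The check only reports what the option string says; whether a given application tolerates attribute caching is a separate decision for each workload.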
For improved performance in a typical production environment, lay out the file systems for NGS applications according to the following guidelines and best practice recommendations:

• Different NGS applications require different mount options for best performance and response time.
• Create the GPFS file system on the SONAS Gateway or Storwize V7000 Unified system by using the cluster method of creating the block allocation maps to achieve uniform disk performance across all storage capacities.
• Create the GPFS file system on the SONAS Gateway or Storwize V7000 Unified system by using logfileplacement value = striped to stripe the log file of the file system across all metadata disks.
• Use a block size of 256 K for both short-term and long-term storage.
  • 20.
• As a best practice, run all RHEL 6.2 servers with dual 10 GbE bonded network channel connections, with MTU=9000.
• To support various NGS application workloads, two interface nodes are recommended on the Storwize V7000 Unified system for increased availability.
• To support various NGS application workloads, at least two interface nodes are recommended on the SONAS Gateway system for increased availability.

Solution benefits: IBM Storwize V7000 Unified and SONAS Gateway system

Both SONAS and Storwize V7000 Unified systems offer the following significant benefits for clients running NGS applications for efficient analysis of genomic data from DNA and RNA sequences:

• Easily examine a large group of potential gene candidates by using typical applications such as blast, linkage analysis, and mascot, which can quickly search and rapidly screen targets in genomic databases, genomes, and assays.
• Efficiently create targeted drug treatments. Easily enable the scale-up development of new drug molecules developed through Drug Discovery (Research, Synthesis), PreClinical Development (Preparation, Formulation, Pre-dosage design), and Pre-FDA (new drug formulation, standards).
• Deliver tight integration between ERP and Pharma Supply Chains. SONAS easily supports pharmaceutical processes that scale up APIs (active pharmaceutical ingredients) from milligram to kilogram quantities for commercial manufacturing and distribution of drugs, with improved visibility into process optimization and consistent yield variability across batches.
• Lower TCO by efficiently reducing drug discovery costs through use and reuse of databases, common analytical data, processes, and standards throughout the pharmaceutical operational chain.
• Deliver on-demand cloud computing models to rapidly address changing levels of analytical computational capacity and to facilitate self-service of analytical tools; pooling of analytic, research development, and pharmaceutical manufacturing resources; and common, scalable transactional processes and standards.
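The file systems layout guidelines recommend MTU=9000 on the bonded 10 GbE connections of the RHEL servers. On Linux, the current MTU of an interface can be read from /sys/class/net/<interface>/mtu. The following Python sketch shows one way to verify the setting; the interface name bond0 and the function names are assumptions for illustration and do not come from the tested configuration.

```python
from pathlib import Path

RECOMMENDED_MTU = 9000  # value recommended in the best practice list

def read_mtu(interface, sysfs_root="/sys/class/net"):
    """Return the MTU reported by the kernel for the given interface."""
    return int((Path(sysfs_root) / interface / "mtu").read_text().strip())

def uses_jumbo_frames(interface, sysfs_root="/sys/class/net"):
    """Return True if the interface MTU matches the recommended value."""
    return read_mtu(interface, sysfs_root) == RECOMMENDED_MTU

if __name__ == "__main__":
    name = "bond0"  # hypothetical bonded interface name
    try:
        print(name, "MTU:", read_mtu(name))
    except FileNotFoundError:
        print(f"interface {name!r} not found on this host")
```

The sysfs_root parameter exists only so that the check can be exercised against a test directory; on a real server the default path applies.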
  • 21. Summary

This paper validates that IBM Storwize V7000 Unified and SONAS Gateway based solutions offer good application performance with excellent virtualization and availability under the following circumstances:

• Access to genomic data from DNA and RNA sequences is configured on the IBM Storwize V7000 Unified or SONAS Gateway system.
• The CLC Assembly Cell or open systems software applications are configured on RHEL servers.
• The NFS v3 services are configured and delivered over the IP network.

This paper offers recommendations and guidance to facilitate easy configuration and installation of the solution to ensure an efficient installation with good performance.

Acknowledgments

Special thanks to the teams from CLC bio in Denmark for loaning the software licenses of the CLC Assembly Cell software, which enabled the IBM test team to create a representative operational test environment in IBM data centers and run tests to document real-life results.

Many thanks to the IBM client executives, IBM Systems and Technology Group members, and other team members who contributed their recommendations during the test run and review process, and enabled successful completion and validation so that CLC bio software applications can run successfully over various environments facilitated by IBM Storwize V7000 Unified and SONAS Gateway systems.

The IBM team also gives special thanks to Connie Borton, Michael Nelson, Cathy Drews, Daniel Drinnon, and Larry Garibay for their invaluable help and assistance, without which the software validation of three independent software applications would not have been successful.
  • 22. Appendices

Appendix A: Typical server and storage configuration sizing recommendations

This section includes typical recommendations and guidelines for server and storage configuration sizing for small, medium, and large research facilities. While this information is typical, the authors understand that organizations differ in terms of the following criteria:

• Different number and types of sequencers in the facility
• Different types of genomes being worked on in the laboratory or organization
• Different processes being pursued within the organization, whether reference mapping, assembly or transcriptions, or downstream analytics
• The amount of data that is required to be kept active
• The amount of data that is required to be kept archived
• The response time required in terms of the number of genomes per day, per week, or per month
• And many other factors

Tier 1: 1 to 2 human size genomes per week, for both de novo and reference-based mapping

Single server and internal storage configuration

• IBM system x3750 with 2.4 GHz E5 4640, ½ TB RAM, 16 TB internal disks
• 4 sockets, 32 cores, 2.4 GHz Intel® Xeon® processor E5 4640
• 32 x 16 GB 1600 MHz DDR3 DIMMs
• 16 x 2.5-inch 1 TB SAS drive

Tier 2: 3 to 10 human size genomes per week or need more than 15 TB online for both de novo and reference-based mapping

Multiple server and external storage configuration

• IBM BladeCenter HS23 frame enclosed with 14 blade servers. Each blade server configured with 2.6 GHz Intel Xeon processor E5 2670, 128 GB RAM, and dual 10 GbE connection ports
• 2 sockets, 16 cores, 2.6 GHz Intel Xeon processor E5 2670
• 16 x 8 GB 1333 MHz DDR3 DIMMs
• 2 x 2.5-inch 300 GB SAS drive
• 96 x 2.5-inch 600 GB 10 K rpm drives in four enclosures of IBM Storwize V7000 Unified. The IBM Storwize V7000 Unified system can host up to 10 enclosures, and therefore, if more capacity is needed in the future, more disks can be added to the remaining six enclosures.
• 1 external switch (or customer supplied switch) to support 10 GbE connectivity

Tier 3: More than 10 human size genomes per week or need more than 100 TB online or need for downstream analysis.
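The tier thresholds in Appendix A can be read as a simple decision rule. The following Python sketch encodes that reading; the function name sizing_tier and its parameters are hypothetical, and the rule is only a rough first pass, since the appendix notes that many other factors affect sizing.

```python
def sizing_tier(genomes_per_week, online_tb=0.0, downstream_analysis=False):
    """Return the Appendix A tier (1, 2, or 3) for a facility.

    genomes_per_week: human size genomes processed per week
    online_tb: capacity that must stay online, in TB
    downstream_analysis: True if downstream analysis is required
    """
    # Tier 3: more than 10 genomes per week, more than 100 TB online,
    # or a need for downstream analysis.
    if genomes_per_week > 10 or online_tb > 100 or downstream_analysis:
        return 3
    # Tier 2: 3 to 10 genomes per week, or more than 15 TB online.
    if genomes_per_week >= 3 or online_tb > 15:
        return 2
    # Tier 1: 1 to 2 genomes per week on a single server.
    return 1

if __name__ == "__main__":
    print(sizing_tier(2))                 # small facility
    print(sizing_tier(5, online_tb=40))   # medium facility
    print(sizing_tier(12))                # large facility
```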
  • 23. This is a custom configuration. You can contact IBM.

Appendix B: Resources

The following websites provide useful references to supplement the information contained in this paper:

• Introduction to Genetics: en.wikipedia.org/wiki/Introduction_to_genetics
• DNA: en.wikipedia.org/wiki/DNA
• DNA Sequencing: en.wikipedia.org/wiki/DNA_sequencing
• Cell Nucleus: en.wikipedia.org/wiki/Cell_nucleus
• Human Genome: en.wikipedia.org/wiki/Human_genome_map
• RNA: en.wikipedia.org/wiki/RNA
• RNA-Seq: en.wikipedia.org/wiki/RNA-Seq
• X Chromosome: en.wikipedia.org/wiki/X_chromosome
• Y Chromosome: en.wikipedia.org/wiki/Y_chromosome
• Trinity: www.broadinstitute.org/scientific-community/software/trinity
• Burrows-Wheeler Aligner: bio-bwa.sourceforge.net/
• CLC bio Applications: www.clcbio.com/
• IBM Redbooks®: ibm.com/redbooks
  • 24.
• IBM Publications Center: www.elink.ibmlink.ibm.com/public/applications/publications/cgibin/pbi.cgi?CTY=US
• IBM Scale Out Network Attached Storage Architecture, Planning and Implementation Basics [SG24-7875-00]: ibm.com/redbooks/abstracts/sg247875.html?Open
• IBM Scale Out Network Attached Storage Concepts [SG24-7874-00]: ibm.com/redbooks/abstracts/sg247874.html?Open
• IBM Storwize V7000 Introduction and Implementation Guide [SG247938]: ibm.com/redbooks/redpieces/abstracts/sg247938.html?Open

About the authors

Dr. Tzy-Hwa K. (Kathy) Tzeng is a Senior Technical Staff Member (STSM) for IBM Systems and Technology Group ISV Strategy and Enablement Organization. She received her Ph.D. in Genetics and Plant Pathology from Iowa State University. Prior to IBM, she led drug discovery projects in bioinformatics, proteomics, and genomics. At IBM, she is responsible for the strategy and content of IBM Life Sciences application plans, portfolio, and product positioning. You can reach Dr. Kathy Tzeng at tzy@us.ibm.com.

Dr. Ruzhu Chen is an IBM Certified Expert IT Specialist for IBM Systems and Technology Group, focusing on computational chemistry and NGS applications. Over the last ten years, he has successfully tuned, benchmarked, and optimized solutions for IBM worldwide partners and customers. Ruzhu earned his master's degree in Biochemistry from University of Sciences and Technology of China, and a second master's degree in Computer Science and a Ph.D. in Molecular Biology, both from the University of Oklahoma. You can reach Dr. Ruzhu Chen at ruzhuchen@us.ibm.com.

Justin Morosi is a Consulting IT Specialist working for IBM Systems and Technology Group as a Worldwide Technical Architect focusing on HPC solutions. He has worked for IBM for over 14 years and has more than 20 years of consulting and solution design experience. He holds numerous industry-recognized certifications from Cisco, Microsoft®, VMware, Red Hat, and IBM.
His areas of expertise include high-performance computing/storage, high availability, disaster recovery, and virtualization. You can reach Justin Morosi at jmorosi@us.ibm.com.

Prashant Avashia is a software engineer in IBM Systems and Technology Group ISV Strategy and Enablement Organization. With more than 15 years of experience, he has successfully architected, engineered, and implemented enterprise infrastructure solutions for key global clients in healthcare, financial, and software industries. He earned his master's degree in Industrial Engineering from Kansas State University, and a bachelor's degree in Mechanical Engineering from Osmania University, India. You can reach Prashant Avashia at pavashia@us.ibm.com.
  • 25. Trademarks and special notices © Copyright IBM Corporation 2012. References in this document to IBM products or services do not imply that IBM intends to make them available in every country. IBM, the IBM logo, and ibm.com are trademarks or registered trademarks of International Business Machines Corporation in the United States, other countries, or both. If these and other IBM trademarked terms are marked on their first occurrence in this information with a trademark symbol (® or ™), these symbols indicate U.S. registered or common law trademarks owned by IBM at the time this information was published. Such trademarks may also be registered or common law trademarks in other countries. A current list of IBM trademarks is available on the Web at "Copyright and trademark information" at www.ibm.com/legal/copytrade.shtml. Java and all Java-based trademarks and logos are trademarks of registered trademarks of Oracle and/or its affiliates. Microsoft, Windows, Windows NT, and the Windows logo are trademarks of Microsoft Corporation in the United States, other countries, or both. Intel, Intel Inside (logos), MMX, and Pentium are trademarks of Intel Corporation in the United States, other countries, or both. UNIX is a registered trademark of The Open Group in the United States and other countries. Linux is a trademark of Linus Torvalds in the United States, other countries, or both. Other company, product, or service names may be trademarks or service marks of others. Information is provided "AS IS" without warranty of any kind. All customer examples described are presented as illustrations of how those customers have used IBM products and the results they may have achieved. Actual environmental costs and performance characteristics may vary by customer. 
Information concerning non-IBM products was obtained from a supplier of these products, published announcement material, or other publicly available sources and does not constitute an endorsement of such products by IBM. Sources for non-IBM list prices and performance numbers are taken from publicly available information, including vendor announcements and vendor worldwide homepages. IBM has not tested these products and cannot confirm the accuracy of performance, capability, or any other claims related to non-IBM products. Questions on the capability of non-IBM products should be addressed to the supplier of those products. All statements regarding IBM future direction and intent are subject to change or withdrawal without notice, and represent goals and objectives only. Contact your local IBM office or IBM authorized reseller for the full text of the specific Statement of Direction. Some information addresses anticipated future capabilities. Such information is not intended as a definitive statement of a commitment to specific levels of performance, function or delivery schedules with respect to any future products. Such commitments are only made in IBM product announcements. The information is
  • 26. presented here to communicate IBM's current investment and development activities as a good faith effort to help with our customers' future planning. Performance is based on measurements and projections using standard IBM benchmarks in a controlled environment. The actual throughput or performance that any user will experience will vary depending upon considerations such as the amount of multiprogramming in the user's job stream, the I/O configuration, the storage configuration, and the workload processed. Therefore, no assurance can be given that an individual user will achieve throughput or performance improvements equivalent to the ratios stated here. Photographs shown are of engineering prototypes. Changes may be incorporated in production models. Any references in this information to non-IBM websites are provided for convenience only and do not in any manner serve as an endorsement of those websites. The materials at those websites are not part of the materials for this IBM product and use of those websites is at your own risk.