
















![Total RAM size: NGS analysis requires large file processing, including functions
related to string processing, clustering of large files, and statistical quality measures,
and thus easily becomes memory-bound. As a result, a large DDR3-based RAM pool
is optimal.
Network infrastructure parameters
TCP MTU: The default Maximum Transmission Unit (MTU) (or frame size) of current
Ethernet systems is 1500 B. However, higher bandwidth network infrastructures can
handle a much higher MTU of 9000 B (called "jumbo frames") for efficient data transfer.
Please note that the jumbo frame setting needs to be completed both on the HPC server
node(s) and the switch(es).
Ethernet Bonding (LACP): Ethernet Bonding using the Link Aggregation Control
Protocol (LACP) is a method used to alleviate bandwidth limitations and port-cable-port
failure issues. By combining several Ethernet interfaces into a virtual "bond" interface,
the network bandwidth can be increased since LACP splits the communications and
sends frames among all the Ethernet links. Bonding 2x 1 GbE interfaces provides the
required bandwidth between HPC server nodes and NAS file storage.
Isilon storage configuration parameters
NFS Master OS: By default, the EMC Isilon OneFS operating system is the NFS server. It
is recommended that this default be maintained since SmartConnect and other OneFS
features may be affected if the HPC master node OS is chosen as the NFS server.
NFS V4: NFS V4 provides improved performance, security, and robustness vis-à-vis
NFS V3. These include support of multiple operations per RPC operation (vs. a single
operation per RPC in NFS V3), use of Kerberos and access control lists (ACLs) for
security (vs. UNIX file permissions in NFS V3), use of TCP transport (vs. UDP in NFS
V3), and integrated file locking (vs. use of the adjunct Network Lock Manager protocol
for NFS V3). As a result, it is recommended that sites utilize NFS V4 for the NGS
environments. Please note that the initial setup of NFSv4 can be cumbersome.
NFS async: The NFS async (asynchronous) mode allows the server to reply to client
requests as soon as it has processed the request and handed it off to the local file
system, without waiting for the data to be written to stable storage. However, write
performance is better when synchronous mode is used (also called ‘noasync’), especially
for smaller file sizes. This is the recommended mode, especially since NFSv4 uses
TCP connectivity.
NFS number of threads: This is the number of NFS server daemon threads that are
started when the system boots. The OneFS NFS server usually has 16 threads as its
default setting; this value can be changed via the Command Line Interface (CLI):
isi_sysctl_cluster sysctl vfs.nfsrv.rpc.[minthreads,maxthreads]
Increasing the number of NFS daemon threads improves response minimally; the
maximum number of NFS threads should be limited to 64.
NFS ACL: The NFS ACL (Access Control List) for NFSv4 is a list of permissions associated
with a set of files or directories; it contains one or more Access Control Entries (ACEs).
There are four types of ACEs: Allow, Deny, Audit, and Alarm; with three kinds of
flags: group, inheritance, and administrative. There are 13 file permissions and
Next-Generation Genome Sequencing Using EMC Isilon Scale-out NAS 18](https://image.slidesharecdn.com/h10961-wp-ngs-sizing-perf-120912232830-phpapp01/85/White-Paper-Next-Generation-Genome-Sequencing-Using-EMC-Isilon-Scale-Out-NAS-Sizing-and-Performance-Guidelines-18-320.jpg)
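
The effect of the TCP MTU and Ethernet bonding settings described above can be estimated with a small calculation. The following is a minimal sketch, not part of the white paper: the per-frame overhead constants assume standard Ethernet carrying IPv4/TCP without options, and the function names are hypothetical. The bonded figure is an upper bound, since a bond typically keeps each individual flow on a single link.

```python
# Assumed per-frame overheads (not taken from the white paper):
# standard Ethernet carrying IPv4/TCP with no header options.
ETH_WIRE_OVERHEAD = 38  # preamble+SFD 8, header 14, FCS 4, inter-frame gap 12
IP_TCP_HEADERS = 40     # IPv4 20 + TCP 20, both counted inside the MTU


def payload_efficiency(mtu: int) -> float:
    """Fraction of on-wire bytes that carry TCP payload for a full-size frame."""
    return (mtu - IP_TCP_HEADERS) / (mtu + ETH_WIRE_OVERHEAD)


def bonded_payload_mb_per_s(links: int, link_gbps: float, mtu: int) -> float:
    """Upper bound on aggregate payload throughput of a bond, in MB/s (10**6 bytes)."""
    raw_bytes_per_s = links * link_gbps * 1e9 / 8
    return raw_bytes_per_s * payload_efficiency(mtu) / 1e6


if __name__ == "__main__":
    # Compare the default MTU with jumbo frames on a 2x 1 GbE bond.
    for mtu in (1500, 9000):
        eff = payload_efficiency(mtu)
        rate = bonded_payload_mb_per_s(2, 1.0, mtu)
        print(f"MTU {mtu}: efficiency {eff:.4f}, bond upper bound {rate:.1f} MB/s")
```

Under these assumptions, jumbo frames raise per-frame payload efficiency from roughly 95 percent to roughly 99 percent; the larger practical benefit is usually the reduced number of frames the hosts and switches must process.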

![Conclusion
NGS production processes generate potentially millions of files with terabytes of
aggregate storage impacting the capacity and manageability limits of existing file
server structures. Raw instrument data typically consists of large image files (2-5 TB
per run is the norm), usually in TIFF format. The image file for the experiment is
usually the largest file in NGS.
Genomics is a data reduction process, from the raw instrument information (images or
voltages) to the variants, that follows the "Rule of One-Fifth." Intermediate or
secondary data, consisting of raw data files including files in BCL format for base
calling and conversion, has an aggregate ratio of approximately one-fifth compared to raw
instrument data.
Internal EMC testing has determined that the KPIs that affect the performance of
NGS applications the most are: total RAM size on HPC cluster nodes (recommended
at 3 Gb/core), RAM and SSD on the Isilon storage cluster [typically 1 percent of RAM
storage], and storage configuration parameters with NFS version V4, NFS async
enabled, TCP MTU (jumbo frames), LACP (2x 1 Gb/s or 4x 1 Gb/s) and a Grid
Engine package.
NGS environments require a file storage infrastructure that is purpose-built to address
the capacity and performance scalability, efficiency, availability, and manageability
challenges of NGS applications. Cumulative network bandwidth between
HPC and NAS increases with the total number of Isilon nodes on the storage cluster.
Isilon scale-out NAS presents a range of benefits optimal for NGS. The Isilon approach
of enabling storage I/O and capacity growth through addition of cluster nodes is optimal
since NGS requires storage performance and capacity scalability to be implemented
as seamlessly as possible. In addition, dynamic content balancing performed within
Isilon scale-out NAS as nodes are added or data capacity changes is ideal for an NGS
workflow due to its sustained throughput requirement.
Isilon scale-out NAS also offers an 80 percent efficiency ratio and "smart pooling" of
the data across multiple performance tiers, making dynamic, rule-based data transfer
between storage pools an integral piece of the NGS process. Flexible, multi-dimensional
data protection which occurs within Isilon scale-out NAS during power loss, node or disk
failures, loss of quorum, and storage rebuild enables non-stop data availability for NGS.
Next-Generation Genome Sequencing Using EMC Isilon Scale-out NAS 20](https://image.slidesharecdn.com/h10961-wp-ngs-sizing-perf-120912232830-phpapp01/85/White-Paper-Next-Generation-Genome-Sequencing-Using-EMC-Isilon-Scale-Out-NAS-Sizing-and-Performance-Guidelines-20-320.jpg)
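
The "Rule of One-Fifth" from the conclusion can be turned into a rough capacity estimate. This is a minimal sketch under stated assumptions: only the one-fifth ratio and the 2-5 TB raw size per run come from the text, while the function names and the choice to retain both raw and intermediate data for every run are hypothetical. Later analysis stages are not modeled.

```python
RULE_OF_ONE_FIFTH = 1 / 5  # intermediate-to-raw ratio stated in the white paper


def intermediate_size_tb(raw_tb: float) -> float:
    """Estimated intermediate (secondary) data size for a given raw data size."""
    if raw_tb < 0:
        raise ValueError("raw_tb must be non-negative")
    return raw_tb * RULE_OF_ONE_FIFTH


def capacity_for_runs_tb(raw_tb_per_run: float, runs: int) -> float:
    """Raw plus intermediate capacity for several runs.

    Hypothetical helper: assumes both data sets are kept for every run.
    """
    return runs * (raw_tb_per_run + intermediate_size_tb(raw_tb_per_run))


if __name__ == "__main__":
    # The white paper cites 2-5 TB of raw image data per run.
    for raw in (2.0, 5.0):
        print(f"{raw} TB raw: about {intermediate_size_tb(raw):.1f} TB intermediate")
    print(f"10 runs at 5 TB raw: about {capacity_for_runs_tb(5.0, 10):.0f} TB total")
```
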
This white paper outlines sizing and performance guidelines for next-generation sequencing (NGS) workflows using EMC Isilon scale-out NAS. It highlights the critical role of data storage capacity and I/O performance in handling large volumes of NGS data produced by sequencers and computational analysis methods. Key performance indicators affecting NGS performance include RAM allocation, storage configuration parameters, and network infrastructure, emphasizing the need for a balanced system to avoid bottlenecks.
















