RENCI uses Big Data IT to advance life sciences research

White Paper
Abstract
This white paper explains how the Renaissance Computing
Institute (RENCI) of the University of North Carolina uses
EMC Isilon scale-out NAS storage, Intel processor and system
technology, and iRODS-based data management to tackle Big
Data processing, Hadoop-based analytics, security and privacy
challenges in research and clinical genomics.
July 2013
LIFE SCIENCES AT RENCI
Big Data IT to manage, decipher, and inform

Copyright © 2013 EMC Corporation. All Rights Reserved.
EMC believes the information in this publication is accurate as
of its publication date. The information is subject to change
without notice.
The information in this publication is provided “as is.” EMC
Corporation makes no representations or warranties of any kind
with respect to the information in this publication, and specifically
disclaims implied warranties of merchantability or fitness for a
particular purpose.
Use, copying, and distribution of any EMC software described in
this publication requires an applicable software license.
For the most up-to-date listing of EMC product names, see
EMC Corporation Trademarks on EMC.com.
EMC², EMC, the EMC logo, Isilon, and OneFS are registered
trademarks or trademarks of EMC Corporation in the United
States and other countries.
All other trademarks used herein are the property of their
respective owners.
Part Number H11692.1

Table of Contents
Life sciences at RENCI: Big Data IT to manage, decipher, and inform
Tackling clinical and research genomics
Data analysis—Hadoop assists in variant calling
Data management—iRODS proving its value
Data security—overcoming UNIX limitations with iRODS
Protected insight into Big Data: the Secure Medical Workspace
Big Data’s persistent challenges
What IGS will deliver
EMC Isilon OneFS tames Big Data
Intel’s HPC leadership empowers life sciences
For more information

Life sciences at RENCI:
Big Data IT to manage, decipher, and inform
Turning Big Data into insight in the lab and therapy in the clinic is perhaps the
preeminent challenge of modern life sciences. Not only must massive datasets be
managed and analyzed, but the insights gleaned must also be delivered to healthcare
professionals and patients in a way they can understand and use. Kirk Wilhelmsen,
M.D./Ph.D., Charles Schmitt, Ph.D., and their colleagues at the Renaissance
Computing Institute (RENCI) of the University of North Carolina (UNC) are at the
forefront of efforts to create the necessary IT infrastructure and tools to advance
this ambitious goal.
RENCI’s Health & Bioscience initiatives span basic research, advanced genomics,
translational medicine, and clinical decision support. Some are tightly focused, such
as two “knowledge-based medicine” programs that are developing decision support
tools to enhance the way physicians treat epilepsy and prostate cancer. Another,
Secure Medical Workspace (SMW), is creating a platform for providing controlled
access to confidential medical records stored in the Carolina Data Warehouse for
Health (CDW-H). A fourth initiative, Informatics for Genetic Sequencing (IGS), is
the epitome of the Big Data challenge; it is developing the end-to-end IT
infrastructure necessary to support advanced DNA sequencing, genomics research,
and the delivery of genomics-informed healthcare.
Taken as a whole, RENCI’s bioscience efforts have a distinctly “translational” bent,
which fits naturally into the Institute’s broad mission to develop
technologies that boost North Carolina competitiveness. “Initially we looked at
traditional bioinformatics and systems biology, but genomics was really starting
to make the transition to medicine and there was a big gap in translational
capabilities. It was a natural place to focus,” said Schmitt, RENCI director
of data sciences and informatics.
The IGS project is an instructive use case in coping with Big Data. On the order of
30 human genomes are sequenced weekly for RENCI’s projects. Just one genome,
depending upon the type of sequencing and the coverage, can generate 100 GB
of data to manage. Capturing, analyzing, storing, and presenting the accumulating
data requires a hybrid HPC (high-performance computing) infrastructure that blends
traditional cluster computing with emerging tools such as iRODS (Integrated Rule-
Oriented Data System) and Hadoop. Unsurprisingly, the HPC infrastructure is always
a work in progress, noted Schmitt.
RENCI/UNC computing resources[i] are already significant. They include large internal
clusters, links to the Open Science Grid, more than 2 PB of spinning disk storage, and
roughly 3 PB of tape storage. The IGS pipeline and analysis use a substantial share of
the overall computing power: Dell blade-based Linux clusters at RENCI with more
than 1,400 cores, and UNC’s Dell and HP blade-based Linux clusters with nearly 1,000
nodes. Primary storage is handled by a 909 TB EMC® Isilon® system at UNC, a 1.7 PB
Lustre scratch space at RENCI, and PB-scale tape storage systems at UNC and RENCI.
Intel is another important contributor to RENCI’s computing power, supplying
processor, development, and systems technology used throughout the RENCI/UNC
HPC infrastructure. “Intel is doing much more than just processors in HPC. We bring
domain experts as well as hardware, platforms, software, and HPC leadership to life
sciences and healthcare,” says Ketan Paranjape, Global Director, Healthcare and
Life Sciences (see “Intel’s HPC leadership empowers life sciences”). The RENCI
Big Data infrastructure is shown in Figure 1 below.
Figure 1. RENCI Genomics Big Data infrastructure
Significant investments have also been made in the wet lab. UNC acquired 12
next-generation high-throughput sequencers (NGS) from Illumina, Pacific Biosciences, and
Life Technologies both to support the clinical-care mission of the UNC healthcare
system and to further basic genomic and biology research.
Tackling clinical and research genomics
“There are two primary projects we are working on now,” said Schmitt. One is NCGENES
(North Carolina Clinical Genomic Evaluation for NextGen Exome Sequencing). Its official
description is, “a multidisciplinary effort to create a bioinformatics infrastructure and a
systematic process for using whole-exome sequencing (WES) as a tool in diagnosing
disease, revealing genetic markers for disease, and helping people understand the
relationship between their genotype and diseases they have or are at risk of developing.”
Much of that infrastructure has been built and is in production use for NCGENES.
In whole-exome sequencing, only those regions of the genome coding for expressed
proteins—roughly 1.5 percent of the human genome—are sequenced. Patients of the
UNC health system are the subjects. The direct goal here is to identify known mutations
in those sequences that are associated with disease risk or health and provide that
information to clinicians and patients. More broadly, it’s also intended to explore ethical
and psychological issues of explaining risk—sometimes when there is no treatment—
to patients. NCGENES is a good example of efforts to deliver translational medicine.
The second project is for the National Institute on Drug Abuse (NIDA) and involves
whole-genome sequencing. Its purpose is to investigate the genetics of drug addiction.
It takes about 10–15 days to sequence a full genome and costs $5K–$10K per genome,
roughly 10 times the cost to sequence a whole exome. In terms of sequencing coverage[ii],
the NCGENES program is considered moderate at 50X whereas the NIDA project at
~10X is considered low coverage, but, of course, it is sequencing the entire 3-billion
base-pair human genome. The NIDA work seeks to discover low-frequency, novel
variants and relies heavily on statistical imputation.
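To make the coverage comparison concrete, the short sketch below estimates the raw
bases read per sample from the figures quoted above: a 3-billion base-pair genome, an
exome of roughly 1.5 percent of it, and 50X versus ~10X coverage. The function name is
hypothetical, and treating raw bases as target size times average coverage is a
simplifying assumption that ignores off-target reads and duplicates.

```python
GENOME_BP = 3_000_000_000   # 3-billion base-pair human genome (from the text)
EXOME_FRACTION = 0.015      # roughly 1.5 percent of the genome (from the text)


def bases_sequenced(target_bp: float, coverage: float) -> float:
    """Approximate raw bases read: target size times average coverage."""
    return target_bp * coverage


# NCGENES: whole exome at 50X. NIDA: whole genome at roughly 10X.
exome_bases = bases_sequenced(GENOME_BP * EXOME_FRACTION, 50)
genome_bases = bases_sequenced(GENOME_BP, 10)

print(f"Exome at 50X:  {exome_bases / 1e9:.2f} Gb")
print(f"Genome at 10X: {genome_bases / 1e9:.2f} Gb")
print(f"Ratio: {genome_bases / exome_bases:.1f}x")
```

Under these assumptions a low-coverage whole genome still involves roughly an order of
magnitude more raw sequence than a moderate-coverage exome.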
“Over a period of a year we’ve sequenced ~1,000 whole genomes for NIDA and are
now processing another round of 2,500 whole genomes to be completed by end of
2013,” noted Schmitt. “NCGENES has about 250 people in process right now, and
we’ll probably do another 750 over the grant period.” The size of the data per sample
(person) varies considerably between the projects. A typical whole-genome NIDA
sample averages 100 GB, versus 15 GB[1] per sample for NCGENES’ whole-exome
samples. Currently, RENCI has on the order of 400 TB of genomics data
stored on the EMC Isilon system and projects growth to 600 TB by the end of 2013.
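As a back-of-envelope check on these volumes, the sketch below multiplies the sample
counts Schmitt quotes by the average per-sample sizes. It is an illustration only: the
decimal unit conversion and the dictionary layout are assumptions, and the result is
not expected to match RENCI’s stored totals, which also depend on what is kept on disk
versus moved to tape.

```python
GB_PER_TB = 1000  # decimal units; an assumption, the paper does not say which it uses

# project -> (samples sequenced or planned, average GB per sample), from the text
samples = {
    "NIDA whole genome": (1_000 + 2_500, 100),
    "NCGENES whole exome": (250 + 750, 15),
}


def footprint_tb(count: int, gb_per_sample: float) -> float:
    """Total footprint in TB for `count` samples of a given average size."""
    return count * gb_per_sample / GB_PER_TB


totals = {name: footprint_tb(n, gb) for name, (n, gb) in samples.items()}
for name, tb in totals.items():
    print(f"{name}: {tb:.0f} TB")
print(f"Combined: {sum(totals.values()):.0f} TB")
```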
Here’s a snapshot of the three-stage analysis pipeline RENCI has developed:
• DNA sequencing. DNA extracted from tissue samples is run through the high-
throughput NGS instruments. These modern sequencers generate hundreds of
millions of short DNA sequences for each patient, which must then be “assembled”
into proper order to determine the genome. Researchers use parallelized computational
workflows to assemble the genome and perform quality control on the reassembly—
fixing errors in the reassembly.
• Variant calling. DNA variations (SNPs, haplotypes, indels, etc.) for an individual
are detected, often using large patient populations to help resolve ambiguities in
the individual’s sequence data. Data is organized into a hybrid solution that uses
a relational database to store canonical variations, high-performance file systems
to hold data, and a Hadoop-based approach for specialized data-intensive analysis.
Links to public and private databases help researchers identify the impact of
variations including, for example, whether variants have known associations
with clinically relevant conditions.
• Clinical binning. The final step in the NCGENES project is the report to the
physicians. Key to this stage is a process termed “clinical binning,” which is
performed using custom UNC-developed software. It assigns a clinical relevancy
to each variant, shown in Figure 2, allowing clinicians and patients to determine
which variants they care about. Once variants are “binned,” a website delivers the
information to physicians and patients (via the Secure Medical Workspace). The
overall process, from blood-draw to analysis to reporting, including several stages
that provide independent validation of the identified variants, is managed through
a custom workflow solution developed by RENCI.
[1] These are FASTQ, BAM, and VCF files with ancillary log and metric files.

Figure 2. Clinical Binning: assigning a clinical relevancy to each variant
• Bin 1: Genes, which when mutated, result in high risk of clinically actionable condition
  Criteria: Loci with clinical utility
  Examples: BRCA1/2; MLH1, MSH2; FBN1; NF1; loci with proven PGx clinical utility
  Estimated number of genes/loci: Dozen(s)
• Bin 2A: Low risk incidental information
  Criteria: Loci with clinical validity
  Examples: PGx variants and common risk SNPs with no proven clinical utility
  Estimated number of genes/loci: ~20 (eventually 100s-1000s)
• Bin 2B: Medium risk incidental information
  Criteria: Loci with clinical validity
  Examples: APOE; genes associated with Mendelian disease for which clinical
  recommendations exist
  Estimated number of genes/loci: 100s
• Bin 2C: High risk incidental information
  Criteria: Loci with clinical validity
  Examples: Huntington’s disease; Prion diseases; SCA, PS1, PS2, APP
  Estimated number of genes/loci: Dozen(s)
• Bin 3: All other loci
  Criteria: Loci with unknown clinical implications
  Estimated number of genes/loci: >20,000
• Bin R: Carrier status for severe AR disease
  Criteria: Loci with important reproductive implications
  Examples: Tay Sachs, Familial Dysautonomia, CF, etc.
  Estimated number of genes/loci: Hundreds
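The binning idea in Figure 2 can be illustrated with a small lookup. This is a minimal
sketch, not the custom UNC-developed software: the lookup table is seeded only with a
few gene names that appear as examples in Figure 2, the bin assigned to each is one
reading of that figure, and the class, function names, and coordinates are hypothetical.
Real clinical binning would weigh the specific variant, not just the gene it falls in.

```python
from dataclasses import dataclass

# Gene -> bin, seeded with example genes named in Figure 2 (illustrative only).
GENE_BINS = {
    "BRCA1": "1", "BRCA2": "1", "MLH1": "1", "MSH2": "1", "FBN1": "1", "NF1": "1",
    "APOE": "2B",
    "PS1": "2C", "PS2": "2C", "APP": "2C",
}
DEFAULT_BIN = "3"  # Figure 2 places all other loci in Bin 3


@dataclass(frozen=True)
class Variant:
    gene: str
    chrom: str
    pos: int
    ref: str
    alt: str


def assign_bin(variant: Variant) -> str:
    """Return the bin for the variant's gene, defaulting to Bin 3."""
    return GENE_BINS.get(variant.gene.upper(), DEFAULT_BIN)


def bin_variants(variants):
    """Group variants by assigned bin so a report can present each bin separately."""
    binned = {}
    for v in variants:
        binned.setdefault(assign_bin(v), []).append(v)
    return binned


# Placeholder coordinates and alleles, not real annotations.
calls = [
    Variant("BRCA1", "17", 1000, "A", "G"),
    Variant("APOE", "19", 2000, "T", "C"),
    Variant("GENE_X", "1", 3000, "G", "A"),
]
report = bin_variants(calls)
print({bin_name: [v.gene for v in vs] for bin_name, vs in report.items()})
```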
“Most of what we do is traditional HPC,” said Schmitt. “There’s the analytical pipeline
most people associate with genomics sequencing, which is stitching (assembly) the
genome back together up to the point of starting to call variations. This can be handled
by the type of HPC clusters we have in place. In terms of disk space, more is always
better and our usage will grow several hundred terabytes this year. At the same time,
our usage per sample has dropped as we focus what we store more precisely on the
needs of downstream analysis and leverage tape for archiving.”
Data analysis—Hadoop assists in variant calling
Calling variations is relatively straightforward for NCGENES because of the manageable
size of exome datasets, the ready availability of software analysis tools, and well-
characterized reference genomes. “For NCGENES, we call variants in a very traditional
way using the GATK[iii] software package from Broad Institute,” said Schmitt. Variant
calling is done in batches of 50 or 100, something easily handled by HPC clusters.
“It takes a week or less, depending upon the batch size.”
For the NIDA project, identifying meaningful variation is far more challenging. The
much larger datasets, the lower coverage, the search for novel variants across the
entire genome, and the need to characterize variations against a pool of genomes—
not just against a single reference—all combine to make variant calling for NIDA a
memory-intensive, computationally demanding task.
“NIDA is actually investigating new approaches to calling variants and finding haplotypes.
It’s doing something called imputing genotypes[iv], and the calculations can take up to
a month,” said Schmitt. “Of course, you don’t have to run it very often. You can run a
batch once every six months, basically keeping up with the flow of data. We are
looking at how to speed that up because clearly that’s not a very scalable solution.”
Schmitt said RENCI stays abreast of most computationally difficult genomics problems.
For example, RENCI is interested in de-novo sequencing once there are approaches
that can compete with or augment reference-based alignments. He added, “Developing
techniques to detect rare variations, as well as combinations of variations, are of high
interest to our group and we are doing research in this area. We currently aren’t
doing trio sequencing.”
One increasingly popular approach to accelerating data-intensive computing is Hadoop.
Essentially, Hadoop pairs a distributed file system (HDFS) with a processing framework
(MapReduce): the file system breaks large datasets into chunks and stores them across
the nodes of a cluster, Map tasks process those chunks in parallel, and Reduce tasks
gather and combine the results. Hadoop’s distinguishing feature is that, wherever
possible, it schedules each task on a node that already holds the chunk of data to be
processed. This strategy of co-locating data and processing power (proximity
computing) can significantly accelerate performance.
It also turns out that Hadoop architecture is a good choice for many life sciences
applications. This is largely because so much of life sciences data is semi- or
unstructured file-based data and ideally suited for “embarrassingly parallel”
computation. Moreover, the use of commodity hardware (e.g., Linux cluster)
keeps cost down, and little or no hardware modification is required.
“We’ve used a few Hadoop-specific applications. The main one is to process VCF files
(variant call format) when determining allele frequency on NIDA sequences. We
developed a set of tools called Hadoop VCF that lets us put a number of VCF files into
Hadoop and perform MapReduce jobs across VCF files,” said Schmitt. There are several
challenges in processing NIDA sequences, not the least of which is the size of the
databases against which NIDA sequences are compared—e.g., the 1000 Genomes,
plus other sources. “In one case we had 6,000 or so genomes,” said Schmitt. “Hadoop
was a convenient, existing technology to do those kinds of parallel calculations.”
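The kind of calculation Schmitt describes, allele frequency across many VCF files,
maps naturally onto the MapReduce pattern. The sketch below imitates that pattern in
plain Python; it is not RENCI’s Hadoop VCF tool and does not use Hadoop. The function
names are hypothetical, the input is a pair of tiny in-memory “files,” and the code
assumes biallelic sites with a GT field first in each sample column.

```python
from collections import defaultdict
from itertools import chain


def map_vcf_line(line):
    """Map step: emit ((chrom, pos, ref, alt), (alt_allele_count, called_alleles))."""
    if line.startswith("#"):
        return []                      # skip meta and header lines
    fields = line.rstrip("\n").split("\t")
    chrom, pos, _id, ref, alt = fields[:5]
    alt_count = total = 0
    for sample in fields[9:]:          # sample columns follow the FORMAT column
        genotype = sample.split(":")[0].replace("|", "/")
        for allele in genotype.split("/"):
            if allele == ".":
                continue               # missing call
            total += 1
            alt_count += allele == "1"
    return [((chrom, int(pos), ref, alt), (alt_count, total))]


def reduce_counts(pairs):
    """Reduce step: sum counts per variant key and convert to a frequency."""
    sums = defaultdict(lambda: [0, 0])
    for key, (alt_count, total) in pairs:
        sums[key][0] += alt_count
        sums[key][1] += total
    return {key: a / t for key, (a, t) in sums.items() if t}


def allele_frequencies(files):
    mapped = chain.from_iterable(map_vcf_line(line) for f in files for line in f)
    return reduce_counts(mapped)


file_a = [
    "##fileformat=VCFv4.1\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\n",
    "1\t100\t.\tA\tG\t.\tPASS\t.\tGT\t0/1\t1/1\n",
]
file_b = ["1\t100\t.\tA\tG\t.\tPASS\t.\tGT\t0/0\t./.\n"]

freqs = allele_frequencies([file_a, file_b])
print(freqs)
```

In a real cluster the map step would run independently on each chunk of input, which
is what makes this workload “embarrassingly parallel.”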
The EMC Isilon system provides native support for HDFS (Hadoop Distributed File
System). On Isilon, HDFS is a lightweight protocol layer between the Isilon OneFS® file
system and HDFS clients. “This makes it simple for organizations to utilize protocols
like NFS, REST, FTP, HTTP, etc., to ingest data for their Hadoop workflows,” says
Sanjay Joshi, CTO–Life Sciences, EMC Isilon Storage Division. “If the data is already
stored on the EMC Isilon scale-out NAS, then an organization simply points its Hadoop
compute farm at OneFS without having to perform a time- and resource-intensive load
operation of the Hadoop workflow” (see “EMC Isilon OneFS tames Big Data”).
This is the type of innovation that EMC Isilon brings that RENCI hopes to adopt in
order to leverage its investment in Hadoop and high-performance storage systems.
Nevertheless, Hadoop is only part of the answer. “We’ve looked at a number of uses
for Hadoop. We tried some BAM processing, developing our own file formats for some
of the sequencing data, but haven’t found it to be more valuable than using traditional
tools,” said Schmitt. “We’ve been able to get by so far in batch mode processing, doing
embarrassingly parallel calculations, but we don’t see that scaling as we move into
tens of thousands of sequences. Past that, we’re pretty sure we are going to have
to switch to a more data-intensive paradigm.”

Schmitt cites two concerns with Hadoop: 1) RENCI is increasingly emphasizing
algorithms that are either graph-based or Markov Model-oriented and, according
to Schmitt, “Hadoop isn’t necessarily the best way to scale those algorithms.”
2) The other big issue is that Hadoop does not work well in a shared HPC cluster
environment. “This keeps us from using Hadoop more. We just can’t take over a
shared cluster periodically and allocate it for Hadoop,” said Schmitt.
Data management—iRODS proving its value
Data management for RENCI’s health and biosciences initiatives is fairly complicated.
“Briefly, what happens is the sequencing facility puts out data (on disk), and all that
gets tracked through a laboratory information management system (LIMS). We pick
it up at that point,” said Schmitt. “We run all of our analysis pipelines on a single
HPC cluster and the large EMC Isilon system. We keep all the intermediate and
analyzed data products on the Isilon system, and our pipelines register the data
products associated with each pipeline stage into the LIMS.” Intermediate processed
sequencing data—FASTQ files—are moved to tape as part of the sequencing process.
UNC runs a LIMS called BSPLims that handles the processing of blood samples. RENCI
has developed a related LIMS called libLims that handles its sequencing workflows for
NIDA—libLims interacts with BSPLims, but is customized for the more specialized
NIDA workflow.
All of the canonical variant data are stored in a large database, VarDB[v], that also holds
reference genomic data: “Most importantly, in this regard, is that it holds several versions
of the NCBI reference genome and manages translating genomic locations between the
different versions,” said Schmitt. VarDB also holds variants from public data sources,
such as dbSNP and The 1000 Genomes Project, variants from UNC sequencing efforts,
as well as variants from HGMD, the Human Gene Mutation Database. Finally, it
holds annotations on data from public databases, such as OMIM and RefSeq, as well as
annotations derived from tools like PolyPhen. Altogether, VarDB currently stores the
data on the EMC Isilon system and this will steadily grow.
To help cope with its Big Data management challenge—storage, access, archiving, data
security, etc.—RENCI is making growing use of iRODS. In fact, RENCI is spearheading
an E-iRODS development effort in which Schmitt is the leader.
Broadly speaking, iRODS (the integrated Rule-Oriented Data System) is a data grid
technology that essentially puts a unified namespace on data files, regardless of where
those files are physically located. You may have files in four or five different storage
systems, but to the user it appears as one directory tree. iRODS also allows setting
enforcement rules on any access to the data or submission of data. For example, if
someone entered data into the system, that might trigger a rule to replicate the data
to another system and compress it at the same time. Access protection rules based on
metadata about a file can be set.
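The two ideas in this paragraph, a single logical namespace over several storage
systems and rules that fire when data is submitted, can be sketched in a few lines.
This is a toy model, not the iRODS API or its rule language: the class, the store
names, and the logical path are all hypothetical.

```python
import gzip


class DataGrid:
    """Toy data grid: one logical namespace over several named stores."""

    def __init__(self, stores):
        self.stores = stores      # store name -> {logical_path: bytes}
        self.catalog = {}         # logical_path -> list of stores holding a copy
        self.ingest_rules = []    # callables run after every ingest

    def on_ingest(self, rule):
        """Register a rule to run whenever data is submitted."""
        self.ingest_rules.append(rule)
        return rule

    def put(self, logical_path, data, store):
        self.stores[store][logical_path] = data
        self.catalog.setdefault(logical_path, []).append(store)
        for rule in self.ingest_rules:
            rule(self, logical_path, data, store)

    def get(self, logical_path):
        """Callers use only the logical path; the physical location stays hidden."""
        store = self.catalog[logical_path][0]
        return self.stores[store][logical_path]


grid = DataGrid({"isilon": {}, "archive": {}})


@grid.on_ingest
def replicate_compressed(g, logical_path, data, store):
    """Example rule: keep a compressed replica of new data on a second store."""
    if store != "archive":
        g.stores["archive"][logical_path] = gzip.compress(data)
        g.catalog[logical_path].append("archive")


path = "/projects/demo/sample1.vcf"
grid.put(path, b"example bytes", store="isilon")
print(grid.catalog[path])
```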
RENCI is already using iRODS with the analytical pipelines. “When our analytical
pipelines are processing the data, they also register that data into iRODS,” Schmitt
says. At the end of the pipeline, the data exists on disks and is registered into iRODS.
Anyone wanting to use the data must come in through iRODS to get the data; this
allows RENCI to set policies on access and data use.

“We originally did this as a way to let the clinical system access the raw research
data,” Schmitt continued. “Within the clinical system there is the ability for a clinician
looking at a patient to click on a button and download the BAM file, and we wanted a
way to separate that clinical system from where we store the BAM file.”
Here’s how it works. The clinical system sends the patient’s ID to iRODS, which
looks up the corresponding BAM file, pulls out just the section of that file that the
clinician actually wants, and returns it in compressed form. The Integrative Genomics
Viewer (IGV) from the Broad Institute is then launched to allow the clinician to view
the sequence reads associated with the variation of interest in context with the
reference genome and other relevant data (e.g., locations of exons and regulatory
regions). Because the clinical system never addresses storage directly, the data can
be moved elsewhere, maybe even to tape, and iRODS hides all of that from the
clinical side.
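The lookup-and-slice step can be sketched as follows. This is an illustration under
stated assumptions rather than RENCI’s implementation: the catalog, the patient ID,
the file path, and the in-memory list of reads are all hypothetical stand-ins, and a
production system would normally query an indexed BAM file rather than scan a list.

```python
from typing import NamedTuple


class Read(NamedTuple):
    chrom: str
    start: int   # 0-based, inclusive
    end: int     # 0-based, exclusive


# Hypothetical catalog: patient ID -> logical path of that patient's BAM file.
CATALOG = {"patient-001": "/grid/ncgenes/patient-001.bam"}

# Hypothetical stand-in for the aligned reads stored in each file.
READS = {
    "/grid/ncgenes/patient-001.bam": [
        Read("17", 100, 200),
        Read("17", 5_000, 5_100),
        Read("2", 150, 250),
    ],
}


def reads_near_variant(patient_id, chrom, pos, flank=50):
    """Return only the reads overlapping a window around the variant position."""
    path = CATALOG[patient_id]                 # the caller never sees this path
    lo, hi = max(0, pos - flank), pos + flank + 1
    return [
        r for r in READS[path]
        if r.chrom == chrom and r.start < hi and r.end > lo
    ]


subset = reads_near_variant("patient-001", "17", 180)
print(subset)
```

Because the caller supplies only a patient ID and a position, the file behind the
catalog entry can be relocated without any change on the clinical side.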
The IGS team is now investigating the use of iRODS to automate replication of the
raw data produced at UNC to storage at RENCI. “It’s not really a backup, just a
redundant store. We’re looking into the process of selectively copying some of the
data to put it onto tape,” noted Schmitt.
To some extent, RENCI/UNC’s archival strategy is still evolving. “FASTQs are all
archived to tape and that’s put on a copy at UNC and a copy off-site. That’s our
primary safety net. We are also starting to copy the BAM files to tape at RENCI.
That’s a little less secure than the FASTQs, but sufficient in that we can regenerate
those in the case of disaster. Those are the two main ones that we archive. The
phenotypic and demographic data are all stored in databases and those are
independently backed up and archived,” Schmitt said.
Data security—overcoming UNIX limitations
with iRODS
Because RENCI works on multiple, shared systems in different data centers,
implementing security is complex. Basic security is provided through IT groups
at UNC and RENCI that provide aspects such as anti-virus, network filtering, single
sign-on, and system-level logging. “Standard user ID/password is used on the
research side of our work for access to resources such as file systems or databases.
The number of people with such access is very limited and governed through UNC’s
IRB,” explained Schmitt.
“On the clinical side there are more people accessing the data, so access is through
websites that users have to authenticate against and are secured in standard ways
(e.g., SSL, database server/Web servers running on VMs behind locked doors). iRODS
is used to automate standard procedures, including archiving, replication, and access
to raw data from users on the clinical side—this allows us to use iRODS logging and
sign-on for security. We are moving to project-level access control as we bring iRODS
further into our overall solution,” said Schmitt.
One problem is that UNIX directories can only go so far in managing the project
orientation of data. “That becomes a real headache,” said Schmitt. “With iRODS,
we can assign protection based on metadata for that file. That’s important because
we have many different graduate students, medical students, and rotating
bioinformaticians coming in; otherwise we would have to devote whole directory
trees to them.”
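The contrast Schmitt draws, protection keyed to file metadata rather than to position
in a directory tree, can be sketched as below. The classes, the `project` metadata
key, and the example names are hypothetical; this shows the general idea and is not
how iRODS expresses access rules.

```python
from dataclasses import dataclass, field


@dataclass
class DataObject:
    path: str
    metadata: dict = field(default_factory=dict)


@dataclass
class User:
    name: str
    projects: set = field(default_factory=set)


def can_read(user: User, obj: DataObject) -> bool:
    """Grant access when the object's project tag matches one of the user's projects.

    The decision ignores where the file lives, so files for different projects
    can share a directory without sharing readers.
    """
    project = obj.metadata.get("project")
    return project is not None and project in user.projects


student = User("rotating_student", projects={"NCGENES"})
exome = DataObject("/shared/run42/a.bam", {"project": "NCGENES"})
genome = DataObject("/shared/run42/b.bam", {"project": "NIDA"})
untagged = DataObject("/shared/run42/c.bam")

print(can_read(student, exome), can_read(student, genome), can_read(student, untagged))
```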
Indeed, the use of iRODS is a growing trend in life sciences, according to Joshi.
“Isilon customers are turning to iRODS for its rule-based data management capabilities
to complement the OneFS system administration features. By leveraging both OneFS
capabilities and iRODS, storage administrators not only can implement data policies
for disaster recovery, archive, and replication, but can also empower research teams
with capabilities to manage data throughout the study (project) lifecycle. With iRODS,
investigators can take advantage of tools that allow them to automate annotation of
data sets with project information, move data based on the project lifecycle, and find
the data based on study attributes when they need it.”
Protected insight into Big Data: the Secure
Medical Workspace
At the end of the day, the goal is to be able to deliver important genomics information
to both clinicians and researchers. To accomplish this part of its broad genomics
infrastructure mission, RENCI, in collaboration with UNC TraCS, the School of Information
and Library Science (SILS), and UNC Hospitals, has developed the Secure Medical
Workspace (SMW) system to enable the CDW-H to provide researchers and healthcare
professionals secure access to patient records.
The SMW, shown in Figure 3, combines a secure centralized infrastructure with
virtualization and data leakage protection technologies to allow researchers to analyze
their research data, while ensuring sensitive patient information remains within the
SMW environment. “It’s a front-end to get to the data,” said Schmitt. “So for those
people who need direct access to sensitive data containing PHI, we’re using this secure
workspace as a way to give them access to data files.” Authorized researchers connect
to SMW from their local computing devices over a secure network connection to a
dedicated virtual workspace.
Figure 3. The Secure Medical Workspace

“It’s a virtualization solution where we can give a researcher a virtual server, and
once on that server the researcher can get access to data, either directly attached to
that server or remote somewhere else. But we include data leakage protection on the
server, which gives us protection and screens against any data being pulled outside of
the system,” explained Schmitt. “Yet, researchers can freely bring their own data and
tools onto the server.” There are commercial solutions that allow you to set policies
for who can take data out and what happens when someone tries to take data out.
“The way that we have favored doing this,” he continued, “is if someone tries to copy
data out, we allow it but throw up a warning screen saying you have to abide by your
data usage agreement. That agreement and the data removed from the server are
then stored for compliance audits.”
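The copy-out policy Schmitt describes (allow, warn, and record) can be sketched as
follows. Everything here is hypothetical, including the function name, the in-memory
audit log, and the `acknowledge` callback that stands in for the warning screen.
Returning nothing when the user declines is an assumption of this sketch, not
something the paper states.

```python
from datetime import datetime, timezone

AUDIT_LOG = []  # stand-in for a durable compliance store

WARNING = "You must abide by your data usage agreement."


def copy_out(user, file_name, data, acknowledge):
    """Permit an export after showing the warning, and record what left the server."""
    if not acknowledge(WARNING):
        return None  # assumption: declining the warning cancels the export
    AUDIT_LOG.append({
        "user": user,
        "file": file_name,
        "agreement": WARNING,
        "copy": data,  # the removed data is retained for audits
        "when": datetime.now(timezone.utc).isoformat(),
    })
    return data


exported = copy_out("alice", "cohort.csv", b"id,value\n1,2\n", lambda msg: True)
declined = copy_out("bob", "cohort.csv", b"id,value\n1,2\n", lambda msg: False)
print(len(AUDIT_LOG), exported is not None, declined is None)
```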
Big Data’s persistent challenges
Despite substantial progress in developing an infrastructure to handle life sciences’
Big Data, many thorny problems persist, noted Schmitt. Consider that a
database of sequenced and variant data associated with 10,000 patients would hold
roughly a petabyte of data. Working with such a massive data repository complicates
basically everything: storage, replication, ongoing analysis, traditional ETL database
functions, and so on.
Collaboration, for example, remains problematic, with data transmission the biggest
issue. RENCI’s current collaboration with UC San Diego and the Scripps Institute,
explained Schmitt, “has been done by sending BAM files in batches. The first batch
took a month to send. Then, talking back and forth by phone about issues regarding
the data takes more time. It’s not a great process.”
Schmitt continued: “We are looking at some of the advanced networking coming
out of NSF to get the bandwidth we want to move data. Of course that’s all kind
of experimental right now. We are exploring using some of the OpenStack and
Open Science Cloud offerings as a way to help collaborate.”
Large-scale computation on Big Data, particularly some of the so-called n-square[vi]
problems, remains challenging. “We continue to explore Hadoop as one answer down
the road, but we are looking at other approaches, including data flow solutions and
systems for computing over large-scale graphs,” said Schmitt.
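A quick calculation shows why such problems scale poorly. The sketch assumes that
“n-square” refers to work that compares every sample with every other sample, which
is an interpretation rather than a definition taken from the paper; the function name
is hypothetical.

```python
def pairwise_comparisons(n: int) -> int:
    """Number of distinct pairs among n samples: n(n - 1) / 2."""
    return n * (n - 1) // 2


for n in (1_000, 6_000, 10_000):
    print(f"{n:>6} samples -> {pairwise_comparisons(n):>12,} pairs")
```

Going from 1,000 to 10,000 samples multiplies the sample count by 10 but the number
of pairs by roughly 100.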
Archiving is another bottleneck. “Our goal for UNC and the UNC healthcare system
is to be able to manage storing a genome for every individual patient and using that
for research, but to get to that level cost-wise is going to be very difficult in terms of
data storage,” Schmitt continued. “We need a better idea of what data we can throw
away and when we can throw away data, and how to represent data at various levels
of hierarchy.”
Nevertheless, RENCI’s progress on all fronts has been substantial. UNC healthcare
professionals are able to look at patient genomic data for clinical care through the
NCGENES project, the last stage in RENCI’s analysis and data delivery pipeline. The
NIDA project is longer-term, and still in data collection and analysis mode, but many
of the kinks in collecting and processing the larger NIDA sample datasets have been
worked out. RENCI is poised to play a growing and important role in developing
the HPC infrastructure and necessary analysis pipelines to support life sciences
and healthcare.
What IGS will deliver
In addition to handling the basic processing of next-generation DNA sequencer (NGS)
output, the RENCI-built Informatics for Genetic Sequencing (IGS) infrastructure continues
to be enhanced in order to support:
• Improved population-oriented queries: Given a variant, the system will find
the frequency of that variant and related haplotypes in a large population to help
determine whether the variant is potentially deleterious.
• Automated annotation: The system will extract data from multiple different
source databases, extract annotation, and incorporate it back into the variant
database for use by researchers, thus providing an increasingly diverse range
of annotation sources.
• Reference rationalization: Data in the system could be used to redefine
the “reference” genome, the template used to compare genomes from
different individuals.
• Improved variant analysis: Enhanced data processing will help researchers
identify additional information about genetic variation between individuals
beyond that which is possible with current technologies.
• Visualization: Data visualization will help enable new insights and inspire new
research questions.
• Metadata grid: The system will enable automated generation and propagation of
metadata to enhance analysis and data management and to guide computational
and data workflows.
EMC Isilon OneFS tames Big Data
EMC Isilon OneFS 7.0 is designed to address the convergence of Big Data and enterprise
IT, and extend the benefits of Isilon scale-out NAS architecture to a wider range of
enterprise storage needs.
OneFS combines the three layers of traditional storage architectures—the file system,
volume manager, and RAID—into one unified software layer, creating a single intelligent
distributed file system that runs on one storage cluster. The advantages of OneFS for
NGS are many:
• Scalable: Scale out as needs grow. Linear scale with increasing capacity: from
18 TB to 20 PB in a single file system and a single global namespace.
• Predictable: Dynamic content balancing is performed as nodes are added,
upgraded, or as capacity changes. No added management time is required,
because this process is simple.
• Available: OneFS is “self-healing.” It protects your data from power loss, node or
disk failures, and loss of quorum and storage rebuild by distributing data, metadata,
and parity across all nodes.

• Efficient: Compared to the average 50 percent efficiency of traditional RAID
systems, OneFS provides over 80 percent efficiency, independent of CPU compute
or cache. This efficiency is achieved by tiering into three node types, as
shown in the accompanying figure, and by the pools within these node types.
• Enterprise-ready: Administration of the storage clusters is via an intuitive
Web-based UI. Connectivity to your process is through standard protocols: CIFS,
SMB, NFS, FTP/HTTP, Object, and HDFS. Standardized authentication and access
control are available at scale: AD, LDAP, and NIS.
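To put the efficiency figures above in concrete terms, the following is a minimal sketch of the capacity arithmetic. It assumes that "efficiency" means usable capacity divided by raw capacity, and it treats "over 80 percent" as exactly 80 percent, a conservative lower bound. The function names and the 1,000 TB cluster size are hypothetical and chosen only for illustration.

```python
# Assumption: efficiency = usable capacity / raw capacity.
RAID_EFFICIENCY = 0.50   # "average 50 percent" for traditional RAID (from the text)
ONEFS_EFFICIENCY = 0.80  # "over 80 percent" for OneFS, taken here as a lower bound


def usable_capacity_tb(raw_tb, efficiency):
    """Usable capacity (TB) that a given raw capacity yields."""
    if not 0.0 < efficiency <= 1.0:
        raise ValueError("efficiency must be in the interval (0, 1]")
    return raw_tb * efficiency


def raw_capacity_needed_tb(usable_tb, efficiency):
    """Raw capacity (TB) that must be purchased to obtain a usable capacity."""
    if not 0.0 < efficiency <= 1.0:
        raise ValueError("efficiency must be in the interval (0, 1]")
    return usable_tb / efficiency


if __name__ == "__main__":
    raw_tb = 1000.0  # hypothetical cluster size, not a figure from the paper
    for label, eff in (("RAID", RAID_EFFICIENCY), ("OneFS", ONEFS_EFFICIENCY)):
        print(label, "usable TB from", raw_tb, "raw TB:",
              usable_capacity_tb(raw_tb, eff))
```

Under these assumptions, the same raw capacity yields 1.6 times as much usable space at 80 percent efficiency as at 50 percent.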
Isilon is the only scale-out NAS offering that provides enterprise capabilities
at scale to manage rapidly growing unstructured data assets more effectively.
Isilon OneFS provides data protection through snapshots across the whole cluster,
and is the only scale-out NAS solution compliant with SEC 17a-4 standards. Isilon
is the world's fastest NAS platform, delivering over 100 GB/s system throughput,
and remains the world-record holder for scale-out NAS performance with 1.6 million
SPECsfs2008 CIFS operations per second. With OneFS 7.0, Isilon storage systems
now provide dramatically improved caching capability to reduce average latency by
60 percent for I/O-intensive applications.
Intel’s HPC leadership empowers life sciences
Intel technology is used throughout HPC and is particularly prevalent in life sciences,
where Big Data challenges are now the norm. For example, Intel Xeon processors,
both the E5 and Phi lines, are accelerating parallel computing and bringing greater
accuracy to genomics analysis. Similarly, Intel software, such as the Intel Distribution
for Hadoop and Intel Manager for Hadoop, helps administrators simplify configuring
hardware and tuning Hadoop performance.
In all aspects of HPC, Intel technology and products are at the forefront. Nowhere is
this leadership more important than in life sciences. Throughout the RENCI/UNC HPC
infrastructure, Intel products are widely embedded and help researchers
and clinicians manage and interpret the genomics data deluge.
Here’s a brief overview of just a few Intel enabling technologies:
• Xeon/E5. The E5 processor, a solid foundation for HPC, delivers 80 percent
greater performance, 70 percent better energy efficiency, and 30 percent less
network latency than earlier Xeon processors. Servers based on the E5 family
provide an optimum combination of performance, built-in capabilities, and
cost-effectiveness. From virtualization and cloud computing solutions to design
automation or real-time financial transactions, the E5 provides needed power.
• Xeon/Phi. Intel’s new line of Xeon Phi coprocessors is optimized for performance
and programmability for highly parallel workloads. The 5110P, the first member of
the line, has 60 cores at 1.053 GHz and handles 240 threads. Importantly, Intel Xeon
processors and Phi coprocessors support the same code, reducing the complexity
of development. The same techniques, such as scaling applications to many cores
and threads, can be used on both.
• Intel software. This extensive portfolio includes, for example, Intel Cluster
Studio XE, which features high-performance, standards-driven compilers, libraries,
analysis tools, OpenMP, and MPI. Intel Distribution for Hadoop and Intel Manager
for Hadoop are important products for life sciences. Other offerings include Intel
Data Center Manager (DCM) and Intel Node Manager (NM) for resource/power
management, and Intel Expressway Service Gateway for cloud usage models.

• Intel fabric. HPC workloads today are too large to be managed by unspecialized
tools. Intel offers several tools designed specifically for large and complex
workloads. Among them are Intel True Scale Fabric, designed from the ground up
for HPC, and QDR-40 and QDR-80, which deliver performance that scales. These tools
are optimized to support Xeon E5 and Xeon Phi processors.
• Intel storage. Intel storage technologies are used throughout industry at every
level (enterprise, small and medium business, home). Here are a few: Intel Xeon
processors and platforms enabled with storage optimizations; solid-state drives
(SSDs) and other NVM technologies that improve storage performance; Intel Cache
Acceleration Software (CAS); and Intel’s open source Lustre file-system support
and development, along with Chroma management and provisioning tools.
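The Xeon Phi figures above (60 cores, 240 threads) imply four hardware threads per core. As a rough illustration of "scaling applications to many cores and threads," the sketch below splits a fixed number of independent work items evenly across however many threads a device offers, so the same partitioning logic applies whether the target has 16 threads or 240. It is written in Python for readability and is not Intel code; the function name `partition`, the item count, and the 16-thread host are hypothetical.

```python
from typing import List

PHI_5110P_CORES = 60     # from the text
PHI_5110P_THREADS = 240  # from the text
THREADS_PER_CORE = PHI_5110P_THREADS // PHI_5110P_CORES


def partition(n_items: int, n_workers: int) -> List[range]:
    """Split n_items into n_workers contiguous chunks whose sizes differ by at most 1."""
    if n_items < 0 or n_workers < 1:
        raise ValueError("need n_items >= 0 and n_workers >= 1")
    base, extra = divmod(n_items, n_workers)
    chunks = []
    start = 0
    for worker in range(n_workers):
        # The first `extra` workers each take one additional item.
        size = base + (1 if worker < extra else 0)
        chunks.append(range(start, start + size))
        start += size
    return chunks


if __name__ == "__main__":
    n_items = 1_000_000  # hypothetical number of independent work items
    for n_threads in (16, PHI_5110P_THREADS):  # 16 is an assumed host thread count
        chunks = partition(n_items, n_threads)
        print(n_threads, "threads:", len(chunks[0]), "to", len(chunks[-1]),
              "items per thread")
```

Only the worker count changes between targets; the decomposition itself is identical, which is the property the text attributes to code shared between Xeon processors and Phi coprocessors.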
For more information
For more information about the exciting work done at the Renaissance Computing
Institute (RENCI), visit http://www.renci.org.
To learn more about how EMC products, services, and solutions help solve your life
sciences IT challenges, contact your local representative or authorized reseller—or
visit us at www.EMC.com/isilon.
To learn more about Intel technology, visit
http://www.intel.com/content/www/us/en/healthcare-it/big-data-in-healthcare.html.
To learn how e-IRODS can solve your enterprise data management needs, visit
http://e-irods.org/.
i. http://www.renci.org/resources/computing
ii. “…coverage is a measure of the number of times that a specific genomic site is sequenced during a
sequencing.” – JP Sulzberger Columbia Genome Center, http://genomecenter.columbia.edu/?q=node/77.
iii. The Genome Analysis Toolkit or GATK is a software package developed at the Broad Institute to analyze
next-generation resequencing data. The Toolkit offers a wide variety of tools, with a primary focus on
variant discovery and genotyping, as well as strong emphasis on data quality assurance.
http://www.broadinstitute.org/gatk/.
iv. University of Oxford backgrounder on imputing,
http://mathgen.stats.ox.ac.uk/impute/impute_v2.html#home
v. VarDB is a PostgreSQL relational database. “VarDB doesn’t directly integrate with RENCI usage of
Hadoop other than occasionally store results from certain Hadoop calculations, such as allele frequencies
from VCF files, in VarDB.” – Charles Schmitt.
vi. N-squared is shorthand for problems whose cost grows as O(n^2) or close to it, such as
O(n^2.8). These are polynomial-time problems and are distinct from true NP-hard problems.
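As a small illustration of the N-squared note above, the sketch below shows how cost grows when the input size doubles under O(n^2) and O(n^2.8), and counts the all-against-all comparisons among n items, which is one common source of quadratic cost. The function names are hypothetical, and the example is generic arithmetic rather than a description of any RENCI workload.

```python
def growth_factor(exponent, scale=2.0):
    """Factor by which an O(n^exponent) cost grows when n is multiplied by `scale`."""
    return scale ** exponent


def pairwise_comparisons(n):
    """Number of unordered pairs among n items: n * (n - 1) / 2, which is O(n^2)."""
    if n < 0:
        raise ValueError("n must be non-negative")
    return n * (n - 1) // 2


if __name__ == "__main__":
    for exponent in (2.0, 2.8):
        print("O(n^%.1f): doubling n multiplies cost by about %.2f"
              % (exponent, growth_factor(exponent)))
    print("pairs among 1,000 items:", pairwise_comparisons(1000))
```

Doubling the input multiplies the cost by 4 for O(n^2) and by roughly 7 for O(n^2.8). Both remain polynomial, which is why such problems are demanding at scale yet still fundamentally tractable.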
