White Paper: Life Sciences at RENCI, Big Data IT to Manage, Decipher and Inform



White Paper

LIFE SCIENCES AT RENCI
Big Data IT to manage, decipher, and inform

July 2013

Abstract

This white paper explains how the Renaissance Computing Institute (RENCI) of the University of North Carolina uses EMC Isilon scale-out NAS storage, Intel processor and system technology, and iRODS-based data management to tackle Big Data processing, Hadoop-based analytics, and security and privacy challenges in research and clinical genomics.
Copyright © 2013 EMC Corporation. All Rights Reserved.

EMC believes the information in this publication is accurate as of its publication date. The information is subject to change without notice.

The information in this publication is provided “as is.” EMC Corporation makes no representations or warranties of any kind with respect to the information in this publication, and specifically disclaims implied warranties of merchantability or fitness for a particular purpose.

Use, copying, and distribution of any EMC software described in this publication requires an applicable software license.

For the most up-to-date listing of EMC product names, see EMC Corporation Trademarks on EMC.com. EMC², EMC, the EMC logo, Isilon, and OneFS are registered trademarks or trademarks of EMC Corporation in the United States and other countries. All other trademarks used herein are the property of their respective owners.

Part Number H11692.1
Table of Contents

• Life sciences at RENCI: Big Data IT to manage, decipher, and inform
• Tackling clinical and research genomics
• Data analysis—Hadoop assists in variant calling
• Data management—iRODS proving its value
• Data security—overcoming UNIX limitations with iRODS
• Protected insight into Big Data: the Secure Medical Workspace
• Big Data’s persistent challenges
• What IGS will deliver
• EMC Isilon OneFS tames Big Data
• Intel’s HPC leadership empowers life sciences
• For more information
Life sciences at RENCI: Big Data IT to manage, decipher, and inform

Turning Big Data into insight in the lab and therapy in the clinic is perhaps the preeminent challenge of modern life sciences. Not only must massive datasets be managed and analyzed, but the insights gleaned must also be delivered to healthcare professionals and patients in a way they can understand and use. Kirk Wilhelmsen, M.D./Ph.D., Charles Schmitt, Ph.D., and their colleagues at the Renaissance Computing Institute (RENCI) of the University of North Carolina (UNC) are at the forefront of efforts to create the IT infrastructure and tools needed to advance this ambitious goal.

RENCI’s Health & Bioscience initiatives span basic research, advanced genomics, translational medicine, and clinical decision support. Some are tightly focused, such as two “knowledge-based medicine” programs that are developing decision support tools to enhance the way physicians treat epilepsy and prostate cancer. Another, the Secure Medical Workspace (SMW), is creating a platform for providing controlled access to confidential medical records stored in the Carolina Data Warehouse for Health (CDW-H). A fourth initiative, Informatics for Genetic Sequencing (IGS), is the epitome of the Big Data challenge: it is developing the end-to-end IT infrastructure necessary to support advanced DNA sequencing, genomics research, and the delivery of genomics-informed healthcare.

Taken as a whole, there is a distinct “translational” bent to RENCI’s bioscience efforts, which fits naturally into the Institute’s broad mission to develop technologies that boost North Carolina’s competitiveness. “Initially we looked at traditional bioinformatics and systems biology, but genomics was really starting to make the transition to medicine and there was a big gap in translational capabilities. It was a natural place to focus,” said Schmitt, RENCI director of data sciences and informatics.
The IGS project is an instructive use case in coping with Big Data. On the order of 30 human genomes are sequenced weekly for RENCI’s projects. Just one genome, depending upon the type of sequencing and the coverage, can generate 100 GB of data to manage. Capturing, analyzing, storing, and presenting the accumulating data requires a hybrid HPC (high-performance computing) infrastructure that blends traditional cluster computing with emerging tools such as iRODS (Integrated Rule-Oriented Data System) and Hadoop. Unsurprisingly, the HPC infrastructure is always a work in progress, noted Schmitt.

RENCI/UNC computing resources are already significant. They include large internal clusters, links to the Open Science Grid, more than 2 PB of spinning disk storage, and roughly 3 PB of tape storage. The IGS pipeline/analysis uses a substantial piece of the overall computing power: RENCI-based Dell blade-based Linux clusters with more than 1,400 cores, and UNC’s Dell and HP blade-based Linux clusters with nearly 1,000 nodes. Primary storage is handled by a 909 TB EMC® Isilon® system at UNC, a 1.7 PB Lustre scratch space at RENCI, and PB-scale tape storage systems at UNC and RENCI.

Intel is another important contributor to RENCI’s computing power, supplying processor, development, and systems technology used throughout the RENCI/UNC HPC infrastructure. “Intel is doing much more than just processors in HPC. We bring domain experts as well as hardware, platforms, software, and HPC leadership to life
sciences and healthcare,” says Ketan Paranjape, Global Director, Healthcare and Life Sciences (see Intel’s HPC leadership empowers life sciences, page 14).

The RENCI Big Data infrastructure is shown in Figure 1 below.

Figure 1. RENCI Genomics Big Data infrastructure

Significant investments have also been made in the wet lab. UNC acquired 12 next-generation high-throughput sequencers (NGS) from Illumina, Pacific Biosciences, and Life Technologies to support both the clinical-care mission of the UNC healthcare system and basic genomic and biological research.

Tackling clinical and research genomics

“There are two primary projects we are working on now,” said Schmitt. One is NCGENES (North Carolina Clinical Genomic Evaluation for NextGen Exome Sequencing). Its official description is “a multidisciplinary effort to create a bioinformatics infrastructure and a systematic process for using whole-exome sequencing (WES) as a tool in diagnosing disease, revealing genetic markers for disease, and helping people understand the relationship between their genotype and diseases they have or are at risk of developing.” Much of that infrastructure has been built and is in production use for NCGENES.

In whole-exome sequencing, only those regions of the genome coding for expressed proteins—roughly 1.5 percent of the human genome—are sequenced. Patients of the UNC health system are the subjects. The direct goal is to identify known mutations in those sequences that are associated with disease risk or health and provide that
information to clinicians and patients. More broadly, it’s also intended to explore the ethical and psychological issues of explaining risk—sometimes when there is no treatment—to patients. NCGENES is a good example of efforts to deliver translational medicine.

The second project is for the National Institute on Drug Abuse (NIDA) and involves whole-genome sequencing. Its purpose is to investigate the genetics of drug addiction. It takes about 10–15 days to sequence a full genome and costs $5–$10K per genome—roughly 10 times the cost to sequence a whole exome. In terms of sequencing coverage, the NCGENES program is considered moderate at 50X, whereas the NIDA project at ~10X is considered low coverage—but, of course, it is sequencing the entire 3-billion base-pair human genome. The NIDA work seeks to discover low-frequency, novel variants and relies heavily on statistical imputation.

“Over a period of a year we’ve sequenced ~1,000 whole genomes for NIDA and are now processing another round of 2,500 whole genomes to be completed by end of 2013,” noted Schmitt. “NCGENES has about 250 people in process right now, and we’ll probably do another 750 over the grant period.”

The size of the data per sample (person) varies considerably between the projects. A typical whole-genome sequenced NIDA sample averages 100 GB, versus 15 GB[1] per sample for NCGENES’ exome-sequenced samples. Currently, RENCI has on the order of 400 TB of genomics data stored on the EMC Isilon system and projects growing to 600 TB by the end of 2013.

Here’s a snapshot of the three-stage analysis pipeline RENCI has developed:

• DNA sequencing. DNA extracted from tissue samples is run through the high-throughput NGS instruments. These modern sequencers generate hundreds of millions of short DNA sequences for each patient, which must then be “assembled” into proper order to determine the genome.
Researchers use parallelized computational workflows to assemble the genome and perform quality control—fixing errors in the reassembly.

• Variant calling. DNA variations (SNPs, haplotypes, indels, etc.) for an individual are detected, often using large patient populations to help resolve ambiguities in the individual’s sequence data. Data is organized into a hybrid solution that uses a relational database to store canonical variations, high-performance file systems to hold data, and a Hadoop-based approach for specialized data-intensive analysis. Links to public and private databases help researchers identify the impact of variations—including, for example, whether variants have known associations with clinically relevant conditions.

• Clinical binning. The final step in the NCGENES project is the report to the physicians. Key to this stage is a process termed “clinical binning,” which is performed using custom UNC-developed software. It assigns a clinical relevancy to each variant, as shown in Figure 2, allowing clinicians and patients to determine which variants they care about. Once variants are “binned,” a website delivers the information to physicians and patients (via the Secure Medical Workspace).

The overall process, from blood draw to analysis to reporting, including several stages that provide independent validation of the identified variants, is managed through a custom workflow solution developed by RENCI.

[1] These are FASTQ, BAM, and VCF files with ancillary log and metric files.
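The variant-calling stage above is worth making concrete. The sketch below is a deliberately naive illustration of the idea only (RENCI uses GATK, not this): pile up the bases that aligned reads report at each reference position, and call a SNP wherever a non-reference base dominates at adequate depth.

```python
from collections import Counter

def call_variants(reference, pileup, min_depth=3, min_frac=0.8):
    """Naive SNP calling: at each position, compare the bases observed
    across aligned reads (the 'pileup') to the reference base and report
    a variant when a non-reference base dominates at adequate depth."""
    variants = []
    for pos, bases in enumerate(pileup):
        if len(bases) < min_depth:
            continue  # too little coverage to call confidently
        base, count = Counter(bases).most_common(1)[0]
        if base != reference[pos] and count / len(bases) >= min_frac:
            variants.append((pos, reference[pos], base))
    return variants

# Six reference positions; per-position bases observed in aligned reads.
ref = "ACGTAC"
pileup = [list("AAAA"), list("CCCC"), list("TTTT"),
          list("TTT"), list("AA"), list("CCCC")]
print(call_variants(ref, pileup))   # [(2, 'G', 'T')]: reads disagree with the reference at position 2
```

Real callers must additionally model sequencing error, base quality, and diploid genotypes, which is why population data helps resolve ambiguities as the text notes.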
Figure 2. Clinical binning: assigning a clinical relevancy to each variant

• Bin 1 (loci with clinical utility): genes which, when mutated, result in high risk of a clinically actionable condition. Examples: BRCA1/2, MLH1, MSH2, FBN1, NF1, and loci with proven PGx clinical utility. Estimated number of genes/loci: dozen(s).
• Bin 2A (loci with clinical validity; low-risk incidental information): PGx variants and common risk SNPs with no proven clinical utility. Estimated: ~20 (eventually 100s–1000s).
• Bin 2B (medium-risk incidental information): APOE, genes associated with Mendelian disease for which clinical recommendations exist. Estimated: 100s.
• Bin 2C (high-risk incidental information): Huntington’s disease, prion diseases, SCA, PS1, PS2, APP. Estimated: dozen(s).
• Bin 3 (all other loci; unknown clinical implications). Estimated: >20,000.
• Bin R (carrier status for severe AR disease; loci with important reproductive implications): Tay-Sachs, familial dysautonomia, CF, etc. Estimated: hundreds.

“Most of what we do is traditional HPC,” said Schmitt. “There’s the analytical pipeline most people associate with genomics sequencing, which is stitching (assembly) the genome back together up to the point of starting to call variations. This can be handled by the type of HPC clusters we have in place. In terms of disk space, more is always better, and our usage will grow several hundred terabytes this year. At the same time, our usage per sample has dropped as we focus what we store more precisely on the needs of downstream analysis and leverage tape for archiving.”

Data analysis—Hadoop assists in variant calling

Calling variations is relatively straightforward for NCGENES because of the manageable size of exome datasets, the ready availability of software analysis tools, and well-characterized reference genomes. “For NCGENES, we call variants in a very traditional way using the GATK software package from the Broad Institute,” said Schmitt. Variant calling is done in batches of 50 or 100, something easily handled by HPC clusters.
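The batch pattern Schmitt describes, submitting samples to the cluster in groups of 50 or 100, reduces at its core to chunking a sample list; a minimal sketch (sample names invented):

```python
def make_batches(samples, batch_size=50):
    """Split a sample list into fixed-size batches for cluster submission."""
    return [samples[i:i + batch_size] for i in range(0, len(samples), batch_size)]

# e.g., 230 exome samples submitted in batches of 50
samples = [f"sample_{n:04d}" for n in range(230)]
batches = make_batches(samples)
print(len(batches), len(batches[-1]))   # 5 batches; the last holds the remaining 30 samples
```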
“It takes a week or less, depending upon the batch size.”

For the NIDA project, identifying meaningful variation is far more challenging. The much larger datasets, the lower coverage, the search for novel variants across the entire genome, and the need to characterize variations against a pool of genomes—not just against a single reference—all combine to make variant calling for NIDA a memory-intensive, computationally demanding task. “NIDA is actually investigating new approaches to calling variants and finding haplotypes. It’s doing something called imputing genotypes, and the calculations can take up to a month,” said Schmitt. “Of course, you don’t have to run it very often. You can run a
batch once every six months, basically keeping up with the flow of data. We are looking at how to speed that up because clearly that’s not a very scalable solution.”

Schmitt said RENCI stays abreast of the most computationally difficult genomics problems. For example, RENCI is interested in de novo sequencing once there are approaches that can compete with or augment reference-based alignments. He added, “Developing techniques to detect rare variations, as well as combinations of variations, is of high interest to our group and we are doing research in this area. We currently aren’t doing trio sequencing.”

One increasingly popular approach to accelerating data-intensive computing is Hadoop. Essentially, Hadoop uses a distributed file system (HDFS) to break large datasets into chunks and store them across the nodes of a cluster, and a framework (MapReduce) to process those chunks in place (Map) and gather the combined results (Reduce) following computation. Hadoop’s distinguishing feature is that it automatically stores the chunks of data on the same nodes on which they will be processed. This strategy of co-locating data and processing power (proximity computing) significantly accelerates performance.

It also turns out that the Hadoop architecture is a good choice for many life sciences applications. This is largely because so much life sciences data is semi-structured or unstructured file-based data, ideally suited for “embarrassingly parallel” computation. Moreover, the use of commodity hardware (e.g., Linux clusters) keeps cost down, and little or no hardware modification is required.

“We’ve used a few Hadoop-specific applications. The main one is to process VCF files (variant call format) when determining allele frequency on NIDA sequences. We developed a set of tools called Hadoop VCF that lets us put a number of VCF files into Hadoop and perform MapReduce jobs across VCF files,” said Schmitt.
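Schmitt’s Hadoop VCF example maps naturally onto the MapReduce pattern just described. Here is a miniature stand-in (pure Python rather than Hadoop; the simplified record layout and numbers are invented for illustration): the map phase emits one (variant, allele-count) pair per sample record, and the reduce phase sums those pairs into allele frequencies.

```python
from collections import defaultdict

# Each "shard" stands in for one sample's VCF file; records are
# (chrom, pos, ref, alt, alt_allele_count) for a diploid sample.
vcf_shards = [
    [("chr1", 100, "A", "G", 1), ("chr1", 200, "C", "T", 2)],
    [("chr1", 100, "A", "G", 0), ("chr1", 200, "C", "T", 1)],
    [("chr1", 100, "A", "G", 2)],
]

def map_record(record):
    chrom, pos, ref, alt, alt_count = record
    # Emit: key identifies the variant; value is (alt alleles, total alleles).
    return ((chrom, pos, ref, alt), (alt_count, 2))

def reduce_counts(mapped):
    totals = defaultdict(lambda: [0, 0])
    for key, (alt, total) in mapped:
        totals[key][0] += alt
        totals[key][1] += total
    return {key: alt / total for key, (alt, total) in totals.items()}

mapped = [map_record(rec) for shard in vcf_shards for rec in shard]
freqs = reduce_counts(mapped)
print(freqs[("chr1", 100, "A", "G")])   # 3 alt alleles out of 6 -> 0.5
```

On a real cluster the shards live on different nodes and the map step runs where each shard is stored, which is exactly the data-locality property described above.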
There are several challenges in processing NIDA sequences, not the least of which is the size of the databases against which NIDA sequences are compared—e.g., the 1000 Genomes Project, plus other sources. “In one case we had 6,000 or so genomes,” said Schmitt. “Hadoop was a convenient, existing technology to do those kinds of parallel calculations.”

Native support for HDFS (Hadoop Distributed File System) is provided by the EMC Isilon system, where HDFS is implemented as a lightweight protocol layer between the Isilon OneFS® file system and HDFS clients. “This makes it simple for organizations to utilize protocols like NFS, REST, FTP, HTTP, etc., to ingest data for their Hadoop workflows,” says Sanjay Joshi, CTO–Life Sciences, EMC Isilon Storage Division. “If the data is already stored on the EMC Isilon scale-out NAS, then an organization simply points its Hadoop compute farm at OneFS without having to perform a time- and resource-intensive load operation for the Hadoop workflow” (see EMC Isilon OneFS tames Big Data, page 13). This is the type of innovation that EMC Isilon brings and that RENCI hopes to adopt in order to leverage its investment in Hadoop and high-performance storage systems.

Nevertheless, Hadoop is only part of the answer. “We’ve looked at a number of uses for Hadoop. We tried some BAM processing, developing our own file formats for some of the sequencing data, but haven’t found it to be more valuable than using traditional tools,” said Schmitt. “We’ve been able to get by so far in batch-mode processing, doing embarrassingly parallel calculations, but we don’t see that scaling as we move into tens of thousands of sequences. Past that, we’re pretty sure we are going to have to switch to a more data-intensive paradigm.”
Schmitt cites two concerns with Hadoop. First, RENCI is increasingly emphasizing algorithms that are either graph-based or Markov model-oriented and, according to Schmitt, “Hadoop isn’t necessarily the best way to scale those algorithms.” Second, Hadoop does not work well in a shared HPC cluster environment. “This keeps us from using Hadoop more. We just can’t take over a shared cluster periodically and allocate it for Hadoop,” said Schmitt.

Data management—iRODS proving its value

Data management for RENCI’s health and biosciences initiatives is fairly complicated. “Briefly, what happens is the sequencing facility puts out data (on disk), and all that gets tracked through a laboratory information management system (LIMS). We pick it up at that point,” said Schmitt. “We run all of our analysis pipelines on a single HPC cluster and the large EMC Isilon system. We keep all the intermediate and analyzed data products on the Isilon system, and our pipelines register the data products associated with each pipeline stage into the LIMS.” Intermediate processed sequencing data—FASTQ files—are moved to tape as part of the sequencing process.

UNC runs a LIMS called BSPLims that handles the processing of blood samples. RENCI has developed a related LIMS called libLims that handles its sequencing workflows for NIDA; libLims interacts with BSPLims but is customized for the more specialized NIDA workflow.

All of the canonical variant data are stored in a large database, VarDB, that also holds reference genomic data. “Most importantly, in this regard, is that it holds several versions of the NCBI reference genome and manages translating genomic locations between the different versions,” said Schmitt. VarDB also holds variants from public data sources, such as dbSNP and the 1000 Genomes Project; variants from UNC sequencing efforts; and variants from HGMD, the database of human gene mutation data.
Finally, it holds annotations on data from public databases, such as OMIM and RefSeq, as well as annotations derived from tools like PolyPhen. Altogether, VarDB currently stores its data on the EMC Isilon system, and this will steadily grow.

To help cope with its Big Data management challenge—storage, access, archiving, data security, etc.—RENCI is making growing use of iRODS. In fact, RENCI is spearheading an E-iRODS development effort that Schmitt leads. Broadly speaking, iRODS (the Integrated Rule-Oriented Data System) is a data grid technology that puts a unified namespace on data files, regardless of where those files are physically located. You may have files in four or five different storage systems, but to the user they appear as one directory tree. iRODS also allows setting enforcement rules on any access to, or submission of, data. For example, if someone entered data into the system, that might trigger a rule to replicate the data to another system and compress it at the same time. Access protection rules based on metadata about a file can also be set.

RENCI is already using iRODS with the analytical pipelines. “When our analytical pipelines are processing the data, they also register that data into iRODS,” Schmitt says. At the end of the pipeline, the data exists on disk and is registered into iRODS. Anyone wanting to use the data must come in through iRODS to get it; this allows RENCI to set policies on access and data use.
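The two iRODS behaviors described here, a rule fired on data submission and metadata-based access control, can be mimicked in a few lines. This is a toy model of the pattern, not the iRODS rule engine itself; all names and paths are invented.

```python
import gzip
import shutil
from pathlib import Path

CATALOG = {}   # logical path -> {"replicas": [...], "meta": {...}}

def register(logical_path, physical_path, meta, replica_dir):
    """Ingest rule: on registration, write a compressed replica to a
    second storage location and record both copies in the catalog."""
    replica = Path(replica_dir) / (Path(physical_path).name + ".gz")
    with open(physical_path, "rb") as src, gzip.open(replica, "wb") as dst:
        shutil.copyfileobj(src, dst)
    CATALOG[logical_path] = {"replicas": [physical_path, str(replica)],
                             "meta": meta}

def can_access(logical_path, user_projects):
    """Access rule: grant access from file metadata, not directory trees."""
    return CATALOG[logical_path]["meta"].get("project") in user_projects
```

Under a scheme like this, a rotating student is granted a project tag rather than a whole directory subtree, and every replica remains reachable through the one logical path.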
“We originally did this as a way to let the clinical system access the raw research data,” Schmitt continued. “Within the clinical system there is the ability for a clinician looking at a patient to click on a button and download the BAM file, and we wanted a way to separate that clinical system from where we store the BAM file.”

Here’s how it works. The clinical system takes the ID of the patient and sends it to iRODS, which does a look-up and returns the BAM file, compressing it and pulling out just the section of the BAM file that the clinician actually wants. The Integrative Genomics Viewer (IGV) from the Broad Institute is then launched to allow the clinician to view the sequence reads associated with the variation of interest in context with the reference genome and other relevant data (e.g., locations of exons and regulatory regions). In that way, data can be moved elsewhere, maybe even to tape, and iRODS hides all of that from the clinical side.

The IGS team is now investigating the use of iRODS to automate replication of the raw data produced at UNC to storage at RENCI. “It’s not really a backup, just a redundant store. We’re looking into the process of selectively copying some of the data to put it onto tape,” noted Schmitt.

To some extent, RENCI/UNC’s archival strategy is still evolving. “FASTQs are all archived to tape, with a copy at UNC and a copy off-site. That’s our primary safety net. We are also starting to copy the BAM files to tape at RENCI. That’s a little less secure than the FASTQs, but sufficient in that we can regenerate those in the case of disaster. Those are the two main ones that we archive. The phenotypic and demographic data are all stored in databases, and those are independently backed up and archived,” Schmitt said.

Data security—overcoming UNIX limitations with iRODS

Because RENCI works on multiple, shared systems in different data centers, implementing security is complex.
Basic security is provided through IT groups at UNC and RENCI, which provide aspects such as anti-virus, network filtering, single sign-on, and system-level logging. “Standard user ID/password is used on the research side of our work for access to resources such as file systems or databases. The number of people with such access is very limited and governed through UNC’s IRB,” explained Schmitt. “On the clinical side there are more people accessing the data, so access is through websites that users have to authenticate against and that are secured in standard ways (e.g., SSL, database servers/Web servers running on VMs behind locked doors). iRODS is used to automate standard procedures, including archiving, replication, and access to raw data from users on the clinical side—this allows us to use iRODS logging and sign-on for security. We are moving to project-level access control as we bring iRODS further into our overall solution,” said Schmitt.

One problem is that UNIX directories can only go so far in managing the project orientation of data. “That becomes a real headache,” said Schmitt. “With iRODS, we can assign protection based on metadata for that file. That’s important because we have many different graduate students, medical students, and rotating
bioinformaticians coming in; otherwise we would have to devote whole directory trees to them.”

Indeed, the use of iRODS is a growing trend in life sciences, according to Joshi. “Isilon customers are turning to iRODS for its rule-based data management capabilities to complement the OneFS system administration features. By leveraging both OneFS capabilities and iRODS, storage administrators not only can implement data policies for disaster recovery, archive, and replication, but can also empower research teams with capabilities to manage data throughout the study (project) lifecycle. With iRODS, investigators can take advantage of tools that allow them to automate annotation of datasets with project information, move data based on the project lifecycle, and find the data based on study attributes when they need it.”

Protected insight into Big Data: the Secure Medical Workspace

At the end of the day, the goal is to deliver important genomics information to both clinicians and researchers. To accomplish this part of its broad genomics infrastructure mission, RENCI, in collaboration with UNC TraCS, the School of Information and Library Science (SILS), and UNC Hospitals, has developed the Secure Medical Workspace (SMW) system to enable the CDW-H to provide researchers and healthcare professionals secure access to patient records.

The SMW, shown in Figure 3, combines a secure centralized infrastructure with virtualization and data leakage protection technologies to allow researchers to analyze their research data while ensuring sensitive patient information remains within the SMW environment. “It’s a front-end to get to the data,” said Schmitt. “So for those people who need direct access to sensitive data containing PHI, we’re using this secure workspace as a way to give them access to data files.” Authorized researchers connect to the SMW from their local computing devices over a secure network connection to a dedicated virtual workspace.

Figure 3.
The Secure Medical Workspace
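The SMW’s data-leakage control, as the project describes it, comes down to a policy gate plus an audit trail: a copy out of the workspace is permitted only after the user acknowledges the data-use agreement, and the event is recorded for compliance review. A toy sketch of that pattern (all names invented):

```python
from datetime import datetime, timezone

AUDIT_LOG = []

def copy_out(user, filename, acknowledged=False):
    """Allow data to leave the workspace only with an explicit
    acknowledgment of the data-usage agreement; log every export."""
    if not acknowledged:
        return "WARNING: you must abide by your data usage agreement."
    AUDIT_LOG.append({"user": user, "file": filename,
                      "time": datetime.now(timezone.utc).isoformat()})
    return "copy permitted"

print(copy_out("researcher1", "cohort.csv"))                     # warning shown first
print(copy_out("researcher1", "cohort.csv", acknowledged=True))  # export logged for audit
```

A production system would additionally retain a copy of the exported data itself alongside the log entry, as described in this section.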
“It’s a virtualization solution where we can give a researcher a virtual server, and once on that server the researcher can get access to data, either directly attached to that server or remote somewhere else. But we include data leakage protection on the server, which gives us protection and screens against any data being pulled outside of the system,” explained Schmitt. “Yet researchers can freely bring their own data and tools onto the server.”

There are commercial solutions that allow you to set policies for who can take data out and what happens when someone tries to take data out. “The way that we have favored doing this,” he continued, “is if someone tries to copy data out, we allow it but throw up a warning screen saying you have to abide by your data usage agreement. That agreement and the data removed from the server are then stored for compliance audits.”

Big Data’s persistent challenges

Amid the substantial progress in developing an infrastructure to handle life sciences’ Big Data challenge, many thorny problems persist, noted Schmitt. Consider that a database of sequence and variant data associated with 10,000 patients would hold roughly a petabyte of data. Working with such a massive data repository complicates basically everything—storage, replication, ongoing analysis, traditional ETL database functions, etc.

Collaboration, for example, remains problematic, with data transmission the biggest issue. RENCI’s current collaboration with UC San Diego and the Scripps Institute, explained Schmitt, “has been done by sending BAM files in batches. The first batch took a month to send. Then, talking back and forth by phone about issues regarding the data takes more time. It’s not a great process,” he says.

Schmitt continued: “We are looking at some of the advanced networking coming out of NSF to get the bandwidth we want to move data. Of course that’s all kind of experimental right now.
We are exploring using some of the OpenStack and Open Science Cloud offerings as a way to help collaborate.”

Large-scale computation on Big Data—particularly some of the so-called n-squared problems—remains challenging. “We continue to explore Hadoop as one answer down the road, but we are looking at other approaches, including data flow solutions and systems for computing over large-scale graphs,” said Schmitt.

Archiving is another bottleneck. “Our goal for UNC and the UNC healthcare system is to be able to manage storing a genome for every individual patient and using that for research, but to get to that level cost-wise is going to be very difficult in terms of data storage,” Schmitt continued. “We need a better idea of what data we can throw away, when we can throw it away, and how to represent data at various levels of hierarchy.”

Nevertheless, RENCI’s progress on all fronts has been substantial. UNC healthcare professionals are able to look at patient genomic data for clinical care through the NCGENES project—the last stage in RENCI’s analysis and data delivery pipeline. The NIDA project is longer-term, and still in data collection and analysis mode, but many of the kinks in collecting and processing the larger NIDA sample datasets have been
worked out. RENCI is poised to play a growing and important role in developing the HPC infrastructure and analysis pipelines needed to support life sciences and healthcare.

What IGS will deliver

In addition to handling the basic processing of next-generation DNA sequencer (NGS) output, the RENCI-built Informatics for Genetic Sequencing (IGS) infrastructure continues to be enhanced in order to support:

• Improved population-oriented queries: Given a variant, the system will find the frequency of that variant and related haplotypes in a large population to help determine whether the variant is potentially deleterious.

• Automated annotation: The system will extract data from multiple source databases, extract annotation, and incorporate it back into the variant database for use by researchers, thus providing an increasingly diverse range of annotation sources.

• Reference rationalization: Data in the system could be used to redefine the “reference” genome, the template used to compare genomes from different individuals.

• Improved variant analysis: Enhanced data processing will help researchers identify additional information about genetic variation between individuals beyond what is possible with current technologies.

• Visualization: Data visualization will help enable new insights and inspire new research questions.

• Metadata grid: The system will enable automated generation and propagation of metadata to enhance analysis and data management and to guide computational and data workflows.

EMC Isilon OneFS tames Big Data

EMC Isilon OneFS 7.0 is designed to address the convergence of Big Data and enterprise IT, and to extend the benefits of the Isilon scale-out NAS architecture to a wider range of enterprise storage needs.
OneFS combines the three layers of traditional storage architectures—the file system, volume manager, and RAID—into one unified software layer, creating a single intelligent distributed file system that runs on one storage cluster. The advantages of OneFS for NGS are many:

• Scalable: Scale out as needs grow. Performance scales linearly with capacity, from 18 TB to 20 PB in a single file system and a single global namespace.

• Predictable: Dynamic content balancing is performed as nodes are added, upgraded, or as capacity changes. The process is automatic, so no added management time is required.

• Available: OneFS is “self-healing.” By distributing data, metadata, and parity across all nodes, it protects data against power loss, node or disk failures, and loss of quorum, and it accelerates storage rebuilds.
• Efficient: Compared to the average 50 percent efficiency of traditional RAID systems, OneFS provides over 80 percent efficiency, independent of CPU compute or cache. This efficiency is achieved by tiering storage across three node types, and by the pools within those node types.

• Enterprise-ready: Administration of the storage clusters is via an intuitive Web-based UI. Connectivity to your process is through standard protocols: CIFS, SMB, NFS, FTP/HTTP, Object, and HDFS. Standardized authentication and access control is available at scale: AD, LDAP, and NIS.

Isilon is the only scale-out NAS offering that provides enterprise capabilities at scale to manage rapidly growing unstructured data assets more effectively. Isilon OneFS provides data protection through snapshots across the whole cluster, and is the only scale-out NAS solution compliant with SEC Rule 17a-4. Isilon is the world's fastest NAS platform, delivering over 100 GB/s system throughput, and remains the world-record holder for scale-out NAS performance with 1.6 million SPECsfs2008 CIFS operations per second. With OneFS 7.0, Isilon storage systems now provide dramatically improved caching capability to reduce average latency by 60 percent for I/O-intensive applications.

Intel’s HPC leadership empowers life sciences

Intel technology is used throughout HPC and is particularly prevalent in life sciences, where Big Data challenges are now the norm. For example, Intel Xeon E5 processors and Xeon Phi coprocessors are accelerating parallel computing and bringing greater accuracy to genomics analysis. Similarly, Intel software, such as the Intel Distribution for Hadoop and Intel Manager for Hadoop, helps administrators simplify configuring hardware and tuning Hadoop performance. In all aspects of HPC, Intel technology and products are at the forefront.
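Hadoop-style partitioning suits the n-squared workloads mentioned earlier because all-pairs computations decompose naturally into independent chunks. A minimal, illustrative sketch of such a workload follows; the sequences and distance function are hypothetical examples, not a RENCI pipeline.

```python
# Illustrative sketch of an "n-squared" workload: comparing every sample
# against every other sample costs O(n^2) comparisons, which is why such
# jobs are candidates for partitioning across a cluster.
from itertools import combinations

def hamming(a, b):
    """Count mismatched positions between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))

def all_pairs_distances(samples):
    """n samples yield n*(n-1)/2 pairwise distances."""
    return {(i, j): hamming(samples[i], samples[j])
            for i, j in combinations(range(len(samples)), 2)}

seqs = ["ACGT", "ACGA", "TCGA"]
dists = all_pairs_distances(seqs)
print(len(dists))        # 3 pairs for n = 3
print(dists[(0, 1)])     # 1 mismatch between ACGT and ACGA
```

Each (i, j) pair can be computed independently, so the work maps cleanly onto Hadoop-style map tasks; the quadratic growth in pair count, not the per-pair cost, is what makes these problems hard at population scale.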
Nowhere is this leadership more important than in life sciences and throughout the RENCI/UNC HPC infrastructure, where Intel products are widely embedded and are helping researchers and clinicians manage and interpret the genomics data deluge. Here’s a brief overview of just a few Intel enabling technologies:

• Xeon/E5. The E5 processor, a solid foundation for HPC, delivers 80 percent greater performance, 70 percent better energy efficiency, and 30 percent less network latency than earlier Xeon processors. Servers based on the E5 family provide an optimum combination of performance, built-in capabilities, and cost-effectiveness. From virtualization and cloud computing solutions to design automation and real-time financial transactions, the E5 provides needed power.

• Xeon/Phi. Intel’s new line of Xeon Phi coprocessors is optimized for performance and programmability for highly parallel workloads. The 5110P, first member of the line, has 60 cores at 1.053 GHz and handles 240 threads. Importantly, Intel Xeon processors and Xeon Phi coprocessors support the same code, reducing the complexity of development. The same techniques—such as scaling applications to many cores and threads—can be used on both.

• Intel software. This extensive portfolio includes, for example, Intel Cluster Studio XE, which features high-performance, standards-driven compilers, libraries, analysis tools, OpenMP, and MPI. Intel Distribution for Hadoop and Intel Manager for Hadoop are important products for life sciences. Other offerings include Intel Data Center Manager (DCM) and Intel Node Manager (NM) for resource/power management, and Intel Expressway Service Gateway for cloud usage models.
• Intel fabric. HPC workloads today are too large to be managed by unspecialized tools. Intel offers several tools specifically designed for large and complex workloads. Among them are Intel True Scale Fabric, designed from the ground up for HPC, and QDR-40 and QDR-80, which deliver performance that scales. These tools provide optimized support for Xeon E5 processors and Xeon Phi coprocessors.

• Intel storage. Intel storage technologies are used throughout industry at every level (enterprise, small and midsize business, home). Here are a few: Intel Xeon processors and platforms enabled with beneficial storage optimizations; solid-state drives (SSDs) and other NVM technologies that improve storage performance; Intel Cache Acceleration Software (CAS); and Intel’s open source Lustre file-system support/development and Chroma management/provisioning tools.

For more information

For more information about the exciting work done at the Renaissance Computing Institute (RENCI), visit http://www.renci.org. To learn more about how EMC products, services, and solutions help solve your life sciences IT challenges, contact your local representative or authorized reseller—or visit us at www.EMC.com/isilon. To learn more about Intel technology, visit http://www.intel.com/content/www/us/en/healthcare-it/big-data-in-healthcare.html. To learn how e-IRODS can solve your enterprise data management needs, visit http://e-irods.org/.

i http://www.renci.org/resources/computing
ii “…coverage is a measure of the number of times that a specific genomic site is sequenced during a sequencing.” – JP Sulzberger Columbia Genome Center, http://genomecenter.columbia.edu/?q=node/77.
iii The Genome Analysis Toolkit (GATK) is a software package developed at the Broad Institute to analyze next-generation resequencing data. The Toolkit offers a wide variety of tools, with a primary focus on variant discovery and genotyping, as well as a strong emphasis on data quality assurance. http://www.broadinstitute.org/gatk/.
iv University of Oxford backgrounder on imputing, http://mathgen.stats.ox.ac.uk/impute/impute_v2.html#home
v VarDB is a PostgreSQL relational database. “VarDB doesn’t directly integrate with RENCI usage of Hadoop other than occasionally store results from certain Hadoop calculations, such as allele frequencies from VCF files, in VarDB.” – Charles Schmitt.
vi N-squared is shorthand for problems that are O(n^2) or close to it, such as O(n^2.8), as distinct from true NP-hard problems.