White Paper




HADOOP IN THE LIFE SCIENCES:
An Introduction




                  Abstract
                  This introductory white paper reviews the Apache Hadoop™
                  technology, its components – MapReduce and Hadoop
                  Distributed File System (HDFS) – and its adoption in the Life
                  Sciences with an example in Genomics data analysis.

                  March 2012
Copyright © 2012 EMC Corporation. All Rights Reserved.

EMC believes the information in this publication is accurate as
of its publication date. The information is subject to change
without notice.

The information in this publication is provided “as is.” EMC
Corporation makes no representations or warranties of any kind
with respect to the information in this publication, and
specifically disclaims implied warranties of merchantability or
fitness for a particular purpose.

Use, copying, and distribution of any EMC software described in
this publication requires an applicable software license.

Part number h10574




                          Hadoop in the Life Sciences: An Introduction   2
Table of Contents
Audience
Executive Summary
Hadoop: an Introduction
Genomics example: CrossBow
Enterprise-Class Hadoop on EMC Isilon
Conclusion
References




Audience
This white paper introduces the new data processing and analysis paradigm,
Hadoop™, within the context of its usage in the life sciences, specifically Genomics
Sequencing. It is intended for audiences with basic knowledge of storage and
computing technology and a rudimentary understanding of DNA sequencing and the
bioinformatics analysis associated with it.




Executive Summary
Life Sciences data will reach the ExaByte (10^18 bytes, EB) scale soon. This is “Big
Data”. As a reference point, all words ever spoken by all human beings, when
transcribed, are about 5 EB of data. In a recent article titled “Will Computers Crash
Genomics?” [1], the analysis points to exponential growth of the total genomics
sequencing market capacity, as outlined in Figure 1 below: 10 Tera base-pairs
(10 × 10^12 bp) per day, with an astounding 5x (500%) year-on-year growth rate. The
human genome is approximately 3 billion base pairs long; a base pair (bp) is a pair
of complementary DNA nucleotides, G-C or A-T.




                           Figure 1: Genomics Growth

Each base-pair represents a total of about 100 bytes (of raw, analyzed and
interpreted data). Therefore the genomics market capacity in 2010 storage terms
(from Fig. 1) was about 200 PetaBytes (PB), with the capacity growing to about 1
ExaByte (EB) by late 2012. This capacity is overwhelming the technologies attempting
to handle the deluge of Big Data in the life sciences. Proteomics (the study of
proteins) and imaging data are in the early stages of this exponential rise. It is not
just the data storage volume, but also its velocity and variability that make this a
challenge requiring “scale-out” technologies: ones that grow simply and painlessly as
the data center and business needs grow. Within the past year, one computing and
storage framework has matured into a contender to handle this tsunami of Big Data:
Hadoop™.
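The arithmetic behind these figures can be sketched in a few lines of Python. The sketch assumes only the two numbers quoted above (10 Tera base-pairs sequenced per day and roughly 100 bytes per base pair); the function name and the decimal unit constants are illustrative. The result is a daily rate, not the cumulative capacity read from Figure 1.

```python
# Back-of-the-envelope storage estimate from the figures quoted in the text.
# Decimal (SI) units are assumed: 1 PB = 10**15 bytes, 1 EB = 10**18 bytes.
PB = 10**15
EB = 10**18

def bytes_per_day(base_pairs_per_day, bytes_per_base_pair=100):
    """Storage generated per day for a given sequencing throughput."""
    return base_pairs_per_day * bytes_per_base_pair

daily = bytes_per_day(10 * 10**12)      # 10 Tera base-pairs per day
print(daily / PB, "PB per day")         # 1.0 PB per day
print(EB / daily, "days to reach 1 EB at a constant rate")  # 1000.0 days ...
```

At a constant rate an exabyte would take about 1000 days; with the 5x yearly growth quoted above it arrives much sooner.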
Life Sciences workflows require a High Performance Computing (HPC) infrastructure to
process and analyze the data to determine the variations in the genome, and
storage at the proper scale to retain this data. With Next Generation (genome)
Sequencing (NGS) workflows generating up to 2 TeraBytes (TB) of data per run per
week per sequencer – not including the raw images – scale-out storage
that integrates easily with HPC is a “line item requirement”. EMC Isilon has provided
the scale-out storage for nearly all the workflows for all the DNA sequencer instrument
manufacturers in the market today, at more than 150 customers. Since 2008, the EMC
Isilon OneFS storage platform has built a Life Sciences installed base of more than 65
PetaBytes (PB).




As genomics has very large, semi-structured, file-based data and is modeled on
post-process streaming data access and I/O patterns that can be parallelized, it is
ideally suited for Hadoop. Hadoop consists of two main components: a file system and
a compute system – the Hadoop Distributed File System (HDFS) and the MapReduce
framework respectively. The Hadoop ecosystem consists of many open source tools, as
shown in Figure 2 below:




                          Figure 2: Hadoop Components

To make the Hadoop storage “scale-out” and truly distributed, the EMC Isilon
OneFS™ file system supports the Hadoop Distributed File System (HDFS) protocol
just like any other shared file system protocol: NFS, CIFS or SMB [3]. This allows
the data to be co-located with the compute nodes, which use the standard
higher-level Java application programming interface (API) to build MapReduce “jobs”.


Hadoop: an Introduction
Hadoop was created by Doug Cutting of the Apache Lucene project [4], initially as the
Nutch Distributed File System (NDFS), which was inspired by the Google File System
(GFS) data infrastructure and Google’s MapReduce [5] application layer in 2004.
Hadoop is an Apache™ Software Foundation project that comprises a MapReduce layer
for data analysis and a Hadoop Distributed File System (HDFS) layer, written in the
Java programming language, to distribute and scale the MapReduce data.
The Hadoop MapReduce framework runs on the compute cluster using the data
stored on the HDFS. MapReduce “jobs” provide key/value based processing
in a highly parallelized fashion. Since the data is distributed over the cluster, a
MapReduce job can be split up to run many parallel processes over the data stored
on the cluster. The Map parts of MapReduce only run on the data they can see – that
is, the data blocks on the particular machine each one is running on. The Reduce
brings together the output from the Maps. The result is a system that provides a
highly parallel batch processing capability. The system scales well, since you just
need to add more hardware to increase its storage capability or decrease the time a
MapReduce job takes to run.
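The division of labor described above can be illustrated with a minimal, single-machine Python sketch. It is not Hadoop code: the `blocks` list is a hypothetical stand-in for HDFS data blocks held on different nodes, and a thread pool stands in for the cluster. Each Map sees only its own block, and the Reduce merges the partial results.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical data blocks, standing in for HDFS blocks held on different nodes.
blocks = [
    ["GATTACA", "CCGG"],
    ["TTAA", "GGC"],
    ["ACGT"],
]

def map_block(block):
    """Map: count bases using only the records in this one block."""
    counts = Counter()
    for fragment in block:
        counts.update(fragment)
    return counts

def reduce_counts(partials):
    """Reduce: bring together the partial counts from every Map."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total

# Each block is mapped independently, so the Maps can run in parallel.
with ThreadPoolExecutor(max_workers=len(blocks)) as pool:
    partials = list(pool.map(map_block, blocks))

totals = reduce_counts(partials)
print(dict(sorted(totals.items())))  # {'A': 6, 'C': 5, 'G': 6, 'T': 5}
```

Adding a block to the list adds one more independent Map, which is the sense in which such a job scales by adding hardware.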
The partitioning of the storage and compute framework into master and worker node
types is outlined in Figure 3 below:




                             Figure 3: Hadoop Cluster

Hadoop is a Write Once Read Many (WORM) system with no random writes. Because
data is written once and then read back as sequential streams, Hadoop can be faster
than HPC and storage integrated separately. The life sciences have been at the
forefront of the technology adoption curve: one of the earliest use cases of the Sun
GridEngine [6] HPC was the DNA sequence comparison BLAST [16] search.
Standard Hadoop interfaces are available via Java, C, FUSE and WebDAV [7]. The R
(statistical language) Hadoop interface, RHIPE [8], is also popular in the life sciences
community.
The HDFS layer has a “Name Node”, the controller, which tracks where every data
block is stored and so provides “data locality”. HDFS uses a “shared nothing”
architecture, which is a distributed, independent-node based scheme [7].
From a platform perspective, the OneFS HDFS interface is compatible with Apache
Hadoop, EMC GreenPlum [3] and Cloudera. In a traditional Hadoop implementation, the
HDFS “Name Node” is a single point of failure since it is the sole keeper of all the
metadata for all the data that lives in the file system. The OneFS HDFS interface
resolves this by distributing the name node data [3]. HDFS keeps three replicas of
the data for redundancy; OneFS drastically reduces the need for a 3x copy.
A good example of the MapReduce “key-value” pair process, counting the occurrences
of specific words across documents [9], is shown in Figure 4 below:




Figure 4: Hadoop Example – word count across documents
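The word count process can also be sketched in plain Python. This is an in-memory illustration of the key/value flow only, not code written against the Hadoop API; the `documents` dictionary and the function names are made up for the example.

```python
from collections import defaultdict

# Hypothetical input: document name -> text.
documents = {
    "doc1": "the quick brown fox",
    "doc2": "the lazy dog",
    "doc3": "the quick dog",
}

def map_words(text):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word, 1) for word in text.lower().split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_word(key, values):
    """Reduce: sum the values collected for one key."""
    return key, sum(values)

pairs = [pair for text in documents.values() for pair in map_words(text)]
word_counts = dict(reduce_word(k, v) for k, v in shuffle(pairs).items())
print(word_counts["the"], word_counts["quick"], word_counts["fox"])  # 3 2 1
```

Every Map call touches a single document and every Reduce call touches a single key, so both stages can be spread across nodes.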

Hadoop is not suited for low-latency, “in process” use cases like real-time, spectral or
video analysis, or for large numbers of small files (<8 KB). When small files have to be
used, the Hadoop Archive (HAR) can pack them into larger archive files for processing.
Life sciences organizations have been among Hadoop’s earliest
adopters. Following the publication of the first Apache Hadoop project [10] in January
2008, the first large-scale MapReduce project was initiated by the Broad Institute,
resulting in the comprehensive Genome Analysis Tool Kit (GATK) [11]. The Hadoop
“CrossBow” project [12] from Johns Hopkins University came soon after. Other projects
are Cloud-based: they include CloudBurst, Contrail, Myrna and CloudBLAST [13]. An
interesting implementation is the NERSC (Department of Energy) Flash-based Hadoop
cluster within the Magellan Science Cloud [14].




Genomics example: CrossBow
The Hadoop “word count across documents” example in Fig. 4 can be extended to DNA
Sequencing: counting single-base changes across millions of short DNA fragments and
across hundreds of samples.
A Single Nucleotide Polymorphism (SNP) occurs when one nucleotide (A, T, C or G)
varies in the DNA sequence of members of the same biological species. Next
Generation Sequencers (NGS) like Illumina® HiSeq can produce data on the order of
200 Giga base pairs in a single one-week run for a 60x human genome “coverage”,
which means that each base is present in an average of 60 reads. The larger the
coverage, the more statistically significant the result. This data requires
specialized software algorithms called “short read aligners”.
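The coverage figure can be checked with the usual definition of average coverage: total sequenced bases divided by genome length. The short Python sketch below uses the rounded numbers from the text (about 200 Giga base pairs per run, about 3 billion base pairs in the human genome); the function and constant names are illustrative.

```python
def average_coverage(total_bases, genome_length):
    """Average number of reads covering each base of the genome."""
    return total_bases / genome_length

HUMAN_GENOME_BP = 3 * 10**9     # about 3 billion base pairs
RUN_OUTPUT_BP = 200 * 10**9     # about 200 Giga base pairs per run

coverage = average_coverage(RUN_OUTPUT_BP, HUMAN_GENOME_BP)
print(f"{coverage:.0f}x")       # 67x
```

With these rounded inputs the estimate comes out near 67x, in line with the approximate 60x quoted for such a run.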
CrossBow [12] is a combination of several algorithms that provide SNP calling and
short read alignment, which are common tasks in NGS. Figure 5 explains the steps
necessary to process genome data to look for SNPs.
The Map-Sort-Reduce process is ideally suited for a Hadoop framework. The cluster
shown in Figure 5 is a traditional N-node Hadoop cluster.

Figure 5: Crossbow example – SNP calls across DNA fragments

1. The Map step is the short read alignment algorithm, called Bowtie (based on the
Burrows-Wheeler Transform, BWT). Multiple instances of Bowtie are run in parallel in
Hadoop. The input tuples (a tuple is an ordered list of elements) are the sequence
reads and the output tuples are the alignments of the short reads.
2. The Sort step apportions the alignments according to a primary key (the genome
partition) and sorts based on a secondary key (which is the offset for that
partition). The data here are the sorted alignments.
3. The Reduce step calls SNPs for each reference genome partition. Many parallel
instances of the algorithm SOAPsnp (Short Oligonucleotide Analysis Package for SNP)
run in the Hadoop cluster. Input tuples are sorted alignments for a partition and the
output tuples are SNP calls.
Results are stored in HDFS and then archived in SOAPsnp format.
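The three steps above can be sketched as a toy Python pipeline. This is an illustration of the Map-Sort-Reduce data flow only: `align` is a naive mismatch-counting stand-in for a real short read aligner, the majority vote in `reduce_step` is a stand-in for a real SNP caller, and the reference, reads and partition size are invented. Emitting a read once for each partition it overlaps is a simplification made for this sketch.

```python
from collections import Counter, defaultdict

REFERENCE = "GATTACAGGCTA"   # hypothetical 12-base reference genome
PARTITION_SIZE = 6           # hypothetical genome partition size

def align(read):
    """Toy aligner: pick the offset with the fewest mismatches."""
    def mismatches(offset):
        return sum(a != b for a, b in zip(read, REFERENCE[offset:]))
    return min(range(len(REFERENCE) - len(read) + 1), key=mismatches)

def map_step(read):
    """Map: read -> (partition, offset, read) for each partition it overlaps."""
    offset = align(read)
    first = offset // PARTITION_SIZE
    last = (offset + len(read) - 1) // PARTITION_SIZE
    return [(p, offset, read) for p in range(first, last + 1)]

def sort_step(alignments):
    """Sort: bin by primary key (partition), order by secondary key (offset)."""
    bins = defaultdict(list)
    for partition, offset, read in sorted(alignments):
        bins[partition].append((offset, read))
    return bins

def reduce_step(partition, alignments):
    """Reduce: toy SNP caller, a majority vote per position in this partition."""
    pileup = defaultdict(Counter)
    for offset, read in alignments:
        for i, base in enumerate(read):
            if (offset + i) // PARTITION_SIZE == partition:
                pileup[offset + i][base] += 1
    calls = []
    for position in sorted(pileup):
        base, _ = pileup[position].most_common(1)[0]
        if base != REFERENCE[position]:
            calls.append((position, REFERENCE[position], base))
    return calls

# Reads from a hypothetical sample that carries a C -> T change at position 5.
reads = ["GATTA", "ATTAT", "TTATA", "TATAG", "ATAGG", "TAGGC", "AGGCT", "GGCTA"]

alignments = [a for read in reads for a in map_step(read)]
snps = [call for p, alns in sort_step(alignments).items()
        for call in reduce_step(p, alns)]
print(snps)  # [(5, 'C', 'T')]
```

Because each partition is reduced independently, the SNP-calling stage parallelizes in the same way as the alignment stage.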


Enterprise-Class Hadoop on EMC Isilon
As demonstrated by previous examples, the data and analysis scalability required for
Genomics is ideally suited for Hadoop. EMC Isilon’s OneFS distributes the Hadoop
Name Node to provide high availability and load balancing, thereby eliminating the
single point of failure. The Isilon NAS storage solution provides a highly efficient
single file system/single volume, scalable up to 15 PB. Data can be staged from other
protocols to HDFS using OneFS as a staging gateway. EMC Isilon provides Enterprise
Grade data services to the Hadoop infrastructure via SnapshotIQ and SyncIQ for
advanced backup and disaster recovery capabilities.
The equation for Hadoop scalability can be represented as:
                     Big(Data + Analytics) = Hadoop EMC:Isilon
These advantages are summarized in Fig. 6 below:




                Figure 6: Hadoop advantages with EMC Isilon

When combined with the EMC GreenPlum Analytics appliance and solution [17], the
Hadoop architecture becomes a complete Enterprise package.




Conclusion
What began as an internal project at Google in 2004 has now matured into a scalable
framework for two computing paradigms that are particularly suited for the life
sciences: parallelization and distribution. The post-processing streaming data
patterns for text strings, clustering and sorting – the core process patterns in the life
sciences – are ideal workflows for Hadoop. The CrossBow example discussed above
aligned Illumina NGS reads for SNP calling over a ‘35x’ coverage of the human
genome in under 3 hours using a 40-node Hadoop cluster, an order of magnitude
better than traditional HPC technology for parallel processes.
Even though Hadoop implementations on Public Cloud instances are popular, several
issues have led most large institutions to maintain their own data repositories
internally: large data transfers from on-premise storage to the Cloud; data
regulations and security; data availability; data redundancy; and HPC
throughput. This is especially true as genome sequencing moves into the Clinic for
diagnostic testing.
The convergence of these issues is evidenced by the mirroring of the Short Read
sequence Archive (SRA) at the National Center for Biotechnology Information (NCBI)
on DNAnexus’ SRA Cloud [15], whose business model is slowly evolving into a ‘full data
and analysis offsite’ model via Hadoop. The Hybrid Cloud model (a source data
mirror between Private Cloud and Community Cloud) with Hadoop as a Service (HaaS)
is the current state of the art.
Hadoop’s advantages far outweigh its challenges – it is ready to become the life
sciences analytics framework of the future. The EMC Isilon platform is bringing that
future to you today.


References
1. Pennisi, E; Science 11 February 2011: Vol. 331 no. 6018 pp. 666-668
2. Editorial, “Challenges and Opportunities”, Science 11 February 2011: Vol. 331 no.
   6018 pp 692.
3. Hadoop on EMC Isilon Scale Out NAS: EMC White Paper, Part Number h10528
4. Cafarella, M and Cutting D, “Building Nutch, Open Source Search”, ACM Queue
   vol. 2, no. 2, April 2004.
5. Dean J and Ghemawat S, "MapReduce: Simplified Data Processing on Large
   Clusters", OSDI conference proceedings, 2004.
6. Vasiliu B, “Integrating BLAST with Sun GridEngine”, July 2003,
   http://developers.sun.com/solaris/articles/integrating_blast.html, last visited
   Dec 2011.
7. White, Tom: “Hadoop -- The Definitive Guide” 2nd Edition, Published by O’Reilly,
   Oct 2010
8. RHIPE: http://ml.stat.purdue.edu/rhipe/, last visited Dec 2011



9. MapReduce example:
   http://markusklems.files.wordpress.com/2008/07/mapreduce.png , last visited
   Dec 2011.
10. “Hadoop wins Terabyte sort benchmark”, Apr 2008, Apr 2009,
   http://sortbenchmark.org/YahooHadoop.pdf,
   http://sortbenchmark.org/Yahoo2009.pdf last accessed Dec 2011
11. McKenna A, et al, "The Genome Analysis Toolkit: A MapReduce framework for
   analyzing next-generation DNA sequencing data", Genome Research, 20:1297–
   1303, July 2010.
12. Langmead B, Schatz MC, et al, “Human SNPs from short reads in hours using
   cloud computing” Poster Presentation, WABI Sep 2009,
   http://www.cbcb.umd.edu/~mschatz/Posters/Crossbow_WABI_Sept2009.pdf,
   last accessed Dec 2011.
13. Taylor RC, "An overview of the Hadoop/MapReduce/HBase framework and its
   current applications in bioinformatics" BMC Bioinformatics 2010, 11(Suppl
   12):S1, http://www.biomedcentral.com/1471-2105/11/S12/S1 , last accessed
   Dec 2011.
14. Ramakrishnan L, “Evaluating Cloud Computing for HPC Applications”, DoE
   NeRSC, http://www.nersc.gov/assets/Events/MagellanNERSCLunchTalk.pdf, last
   accessed Dec 2011.
15. “DNAnexus to mirror SRA database in Google Cloud”, BioIT World, Page 41,
   http://www.bio-itworld.com/uploadedFiles/Bio-
   IT_World/1111BITW_download.pdf , last visited Dec 2011.
16. Altschul SF, et al, "Basic local alignment search tool". J Mol Biol 215 (3): 403–
    410, October 1990.
17. Lockner J.,"EMC’s Enterprise Hadoop Solution: Isilon Scale-out NAS and
    GreenPlum HD", White Paper, The Enterprise Strategy Group, Inc (ESG), February
    2012




Hadoop An Introduction
 
The hadoop ecosystem table
The hadoop ecosystem tableThe hadoop ecosystem table
The hadoop ecosystem table
 
2.1-HADOOP.pdf
2.1-HADOOP.pdf2.1-HADOOP.pdf
2.1-HADOOP.pdf
 
Hadoop
HadoopHadoop
Hadoop
 
Hadoop info
Hadoop infoHadoop info
Hadoop info
 
Taylor bosc2010
Taylor bosc2010Taylor bosc2010
Taylor bosc2010
 
Overview of big data & hadoop v1
Overview of big data & hadoop   v1Overview of big data & hadoop   v1
Overview of big data & hadoop v1
 
Big Data - Hadoop Ecosystem
Big Data -  Hadoop Ecosystem Big Data -  Hadoop Ecosystem
Big Data - Hadoop Ecosystem
 
Hadoopppt.pptx
Hadoopppt.pptxHadoopppt.pptx
Hadoopppt.pptx
 
PERFORMANCE EVALUATION OF BIG DATA PROCESSING OF CLOAK-REDUCE
PERFORMANCE EVALUATION OF BIG DATA PROCESSING OF CLOAK-REDUCEPERFORMANCE EVALUATION OF BIG DATA PROCESSING OF CLOAK-REDUCE
PERFORMANCE EVALUATION OF BIG DATA PROCESSING OF CLOAK-REDUCE
 
International Journal of Distributed and Parallel systems (IJDPS)
International Journal of Distributed and Parallel systems (IJDPS)International Journal of Distributed and Parallel systems (IJDPS)
International Journal of Distributed and Parallel systems (IJDPS)
 

More from EMC

INDUSTRY-LEADING TECHNOLOGY FOR LONG TERM RETENTION OF BACKUPS IN THE CLOUD
INDUSTRY-LEADING  TECHNOLOGY FOR LONG TERM RETENTION OF BACKUPS IN THE CLOUDINDUSTRY-LEADING  TECHNOLOGY FOR LONG TERM RETENTION OF BACKUPS IN THE CLOUD
INDUSTRY-LEADING TECHNOLOGY FOR LONG TERM RETENTION OF BACKUPS IN THE CLOUD
EMC
 
Cloud Foundry Summit Berlin Keynote
Cloud Foundry Summit Berlin Keynote Cloud Foundry Summit Berlin Keynote
Cloud Foundry Summit Berlin Keynote
EMC
 
EMC GLOBAL DATA PROTECTION INDEX
EMC GLOBAL DATA PROTECTION INDEX EMC GLOBAL DATA PROTECTION INDEX
EMC GLOBAL DATA PROTECTION INDEX
EMC
 
Transforming Desktop Virtualization with Citrix XenDesktop and EMC XtremIO
Transforming Desktop Virtualization with Citrix XenDesktop and EMC XtremIOTransforming Desktop Virtualization with Citrix XenDesktop and EMC XtremIO
Transforming Desktop Virtualization with Citrix XenDesktop and EMC XtremIO
EMC
 
Citrix ready-webinar-xtremio
Citrix ready-webinar-xtremioCitrix ready-webinar-xtremio
Citrix ready-webinar-xtremio
EMC
 
EMC FORUM RESEARCH GLOBAL RESULTS - 10,451 RESPONSES ACROSS 33 COUNTRIES
EMC FORUM RESEARCH GLOBAL RESULTS - 10,451 RESPONSES ACROSS 33 COUNTRIES EMC FORUM RESEARCH GLOBAL RESULTS - 10,451 RESPONSES ACROSS 33 COUNTRIES
EMC FORUM RESEARCH GLOBAL RESULTS - 10,451 RESPONSES ACROSS 33 COUNTRIES
EMC
 
EMC with Mirantis Openstack
EMC with Mirantis OpenstackEMC with Mirantis Openstack
EMC with Mirantis Openstack
EMC
 
Modern infrastructure for business data lake
Modern infrastructure for business data lakeModern infrastructure for business data lake
Modern infrastructure for business data lake
EMC
 
Pivotal : Moments in Container History
Pivotal : Moments in Container History Pivotal : Moments in Container History
Pivotal : Moments in Container History
EMC
 
Data Lake Protection - A Technical Review
Data Lake Protection - A Technical ReviewData Lake Protection - A Technical Review
Data Lake Protection - A Technical Review
EMC
 
Mobile E-commerce: Friend or Foe
Mobile E-commerce: Friend or FoeMobile E-commerce: Friend or Foe
Mobile E-commerce: Friend or Foe
EMC
 
Virtualization Myths Infographic
Virtualization Myths Infographic Virtualization Myths Infographic
Virtualization Myths Infographic
EMC
 
Intelligence-Driven GRC for Security
Intelligence-Driven GRC for SecurityIntelligence-Driven GRC for Security
Intelligence-Driven GRC for Security
EMC
 
The Trust Paradox: Access Management and Trust in an Insecure Age
The Trust Paradox: Access Management and Trust in an Insecure AgeThe Trust Paradox: Access Management and Trust in an Insecure Age
The Trust Paradox: Access Management and Trust in an Insecure Age
EMC
 
EMC Technology Day - SRM University 2015
EMC Technology Day - SRM University 2015EMC Technology Day - SRM University 2015
EMC Technology Day - SRM University 2015
EMC
 
EMC Academic Summit 2015
EMC Academic Summit 2015EMC Academic Summit 2015
EMC Academic Summit 2015
EMC
 
Data Science and Big Data Analytics Book from EMC Education Services
Data Science and Big Data Analytics Book from EMC Education ServicesData Science and Big Data Analytics Book from EMC Education Services
Data Science and Big Data Analytics Book from EMC Education Services
EMC
 
Using EMC Symmetrix Storage in VMware vSphere Environments
Using EMC Symmetrix Storage in VMware vSphere EnvironmentsUsing EMC Symmetrix Storage in VMware vSphere Environments
Using EMC Symmetrix Storage in VMware vSphere Environments
EMC
 
Using EMC VNX storage with VMware vSphereTechBook
Using EMC VNX storage with VMware vSphereTechBookUsing EMC VNX storage with VMware vSphereTechBook
Using EMC VNX storage with VMware vSphereTechBook
EMC
 
2014 Cybercrime Roundup: The Year of the POS Breach
2014 Cybercrime Roundup: The Year of the POS Breach2014 Cybercrime Roundup: The Year of the POS Breach
2014 Cybercrime Roundup: The Year of the POS Breach
EMC
 

More from EMC (20)

INDUSTRY-LEADING TECHNOLOGY FOR LONG TERM RETENTION OF BACKUPS IN THE CLOUD
INDUSTRY-LEADING  TECHNOLOGY FOR LONG TERM RETENTION OF BACKUPS IN THE CLOUDINDUSTRY-LEADING  TECHNOLOGY FOR LONG TERM RETENTION OF BACKUPS IN THE CLOUD
INDUSTRY-LEADING TECHNOLOGY FOR LONG TERM RETENTION OF BACKUPS IN THE CLOUD
 
Cloud Foundry Summit Berlin Keynote
Cloud Foundry Summit Berlin Keynote Cloud Foundry Summit Berlin Keynote
Cloud Foundry Summit Berlin Keynote
 
EMC GLOBAL DATA PROTECTION INDEX
EMC GLOBAL DATA PROTECTION INDEX EMC GLOBAL DATA PROTECTION INDEX
EMC GLOBAL DATA PROTECTION INDEX
 
Transforming Desktop Virtualization with Citrix XenDesktop and EMC XtremIO
Transforming Desktop Virtualization with Citrix XenDesktop and EMC XtremIOTransforming Desktop Virtualization with Citrix XenDesktop and EMC XtremIO
Transforming Desktop Virtualization with Citrix XenDesktop and EMC XtremIO
 
Citrix ready-webinar-xtremio
Citrix ready-webinar-xtremioCitrix ready-webinar-xtremio
Citrix ready-webinar-xtremio
 
EMC FORUM RESEARCH GLOBAL RESULTS - 10,451 RESPONSES ACROSS 33 COUNTRIES
EMC FORUM RESEARCH GLOBAL RESULTS - 10,451 RESPONSES ACROSS 33 COUNTRIES EMC FORUM RESEARCH GLOBAL RESULTS - 10,451 RESPONSES ACROSS 33 COUNTRIES
EMC FORUM RESEARCH GLOBAL RESULTS - 10,451 RESPONSES ACROSS 33 COUNTRIES
 
EMC with Mirantis Openstack
EMC with Mirantis OpenstackEMC with Mirantis Openstack
EMC with Mirantis Openstack
 
Modern infrastructure for business data lake
Modern infrastructure for business data lakeModern infrastructure for business data lake
Modern infrastructure for business data lake
 
Pivotal : Moments in Container History
Pivotal : Moments in Container History Pivotal : Moments in Container History
Pivotal : Moments in Container History
 
Data Lake Protection - A Technical Review
Data Lake Protection - A Technical ReviewData Lake Protection - A Technical Review
Data Lake Protection - A Technical Review
 
Mobile E-commerce: Friend or Foe
Mobile E-commerce: Friend or FoeMobile E-commerce: Friend or Foe
Mobile E-commerce: Friend or Foe
 
Virtualization Myths Infographic
Virtualization Myths Infographic Virtualization Myths Infographic
Virtualization Myths Infographic
 
Intelligence-Driven GRC for Security
Intelligence-Driven GRC for SecurityIntelligence-Driven GRC for Security
Intelligence-Driven GRC for Security
 
The Trust Paradox: Access Management and Trust in an Insecure Age
The Trust Paradox: Access Management and Trust in an Insecure AgeThe Trust Paradox: Access Management and Trust in an Insecure Age
The Trust Paradox: Access Management and Trust in an Insecure Age
 
EMC Technology Day - SRM University 2015
EMC Technology Day - SRM University 2015EMC Technology Day - SRM University 2015
EMC Technology Day - SRM University 2015
 
EMC Academic Summit 2015
EMC Academic Summit 2015EMC Academic Summit 2015
EMC Academic Summit 2015
 
Data Science and Big Data Analytics Book from EMC Education Services
Data Science and Big Data Analytics Book from EMC Education ServicesData Science and Big Data Analytics Book from EMC Education Services
Data Science and Big Data Analytics Book from EMC Education Services
 
Using EMC Symmetrix Storage in VMware vSphere Environments
Using EMC Symmetrix Storage in VMware vSphere EnvironmentsUsing EMC Symmetrix Storage in VMware vSphere Environments
Using EMC Symmetrix Storage in VMware vSphere Environments
 
Using EMC VNX storage with VMware vSphereTechBook
Using EMC VNX storage with VMware vSphereTechBookUsing EMC VNX storage with VMware vSphereTechBook
Using EMC VNX storage with VMware vSphereTechBook
 
2014 Cybercrime Roundup: The Year of the POS Breach
2014 Cybercrime Roundup: The Year of the POS Breach2014 Cybercrime Roundup: The Year of the POS Breach
2014 Cybercrime Roundup: The Year of the POS Breach
 

Recently uploaded

Programming Foundation Models with DSPy - Meetup Slides
Programming Foundation Models with DSPy - Meetup SlidesProgramming Foundation Models with DSPy - Meetup Slides
Programming Foundation Models with DSPy - Meetup Slides
Zilliz
 
Columbus Data & Analytics Wednesdays - June 2024
Columbus Data & Analytics Wednesdays - June 2024Columbus Data & Analytics Wednesdays - June 2024
Columbus Data & Analytics Wednesdays - June 2024
Jason Packer
 
Public CyberSecurity Awareness Presentation 2024.pptx
Public CyberSecurity Awareness Presentation 2024.pptxPublic CyberSecurity Awareness Presentation 2024.pptx
Public CyberSecurity Awareness Presentation 2024.pptx
marufrahmanstratejm
 
Digital Marketing Trends in 2024 | Guide for Staying Ahead
Digital Marketing Trends in 2024 | Guide for Staying AheadDigital Marketing Trends in 2024 | Guide for Staying Ahead
Digital Marketing Trends in 2024 | Guide for Staying Ahead
Wask
 
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAUHCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
panagenda
 
Main news related to the CCS TSI 2023 (2023/1695)
Main news related to the CCS TSI 2023 (2023/1695)Main news related to the CCS TSI 2023 (2023/1695)
Main news related to the CCS TSI 2023 (2023/1695)
Jakub Marek
 
What is an RPA CoE? Session 1 – CoE Vision
What is an RPA CoE?  Session 1 – CoE VisionWhat is an RPA CoE?  Session 1 – CoE Vision
What is an RPA CoE? Session 1 – CoE Vision
DianaGray10
 
TrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy SurveyTrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy Survey
TrustArc
 
“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...
“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...
“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...
Edge AI and Vision Alliance
 
“How Axelera AI Uses Digital Compute-in-memory to Deliver Fast and Energy-eff...
“How Axelera AI Uses Digital Compute-in-memory to Deliver Fast and Energy-eff...“How Axelera AI Uses Digital Compute-in-memory to Deliver Fast and Energy-eff...
“How Axelera AI Uses Digital Compute-in-memory to Deliver Fast and Energy-eff...
Edge AI and Vision Alliance
 
Fueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte WebinarFueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte Webinar
Zilliz
 
Leveraging the Graph for Clinical Trials and Standards
Leveraging the Graph for Clinical Trials and StandardsLeveraging the Graph for Clinical Trials and Standards
Leveraging the Graph for Clinical Trials and Standards
Neo4j
 
JavaLand 2024: Application Development Green Masterplan
JavaLand 2024: Application Development Green MasterplanJavaLand 2024: Application Development Green Masterplan
JavaLand 2024: Application Development Green Masterplan
Miro Wengner
 
Nordic Marketo Engage User Group_June 13_ 2024.pptx
Nordic Marketo Engage User Group_June 13_ 2024.pptxNordic Marketo Engage User Group_June 13_ 2024.pptx
Nordic Marketo Engage User Group_June 13_ 2024.pptx
MichaelKnudsen27
 
WeTestAthens: Postman's AI & Automation Techniques
WeTestAthens: Postman's AI & Automation TechniquesWeTestAthens: Postman's AI & Automation Techniques
WeTestAthens: Postman's AI & Automation Techniques
Postman
 
Best 20 SEO Techniques To Improve Website Visibility In SERP
Best 20 SEO Techniques To Improve Website Visibility In SERPBest 20 SEO Techniques To Improve Website Visibility In SERP
Best 20 SEO Techniques To Improve Website Visibility In SERP
Pixlogix Infotech
 
Taking AI to the Next Level in Manufacturing.pdf
Taking AI to the Next Level in Manufacturing.pdfTaking AI to the Next Level in Manufacturing.pdf
Taking AI to the Next Level in Manufacturing.pdf
ssuserfac0301
 
GNSS spoofing via SDR (Criptored Talks 2024)
GNSS spoofing via SDR (Criptored Talks 2024)GNSS spoofing via SDR (Criptored Talks 2024)
GNSS spoofing via SDR (Criptored Talks 2024)
Javier Junquera
 
June Patch Tuesday
June Patch TuesdayJune Patch Tuesday
June Patch Tuesday
Ivanti
 
Apps Break Data
Apps Break DataApps Break Data
Apps Break Data
Ivo Velitchkov
 

Recently uploaded (20)

Programming Foundation Models with DSPy - Meetup Slides
Programming Foundation Models with DSPy - Meetup SlidesProgramming Foundation Models with DSPy - Meetup Slides
Programming Foundation Models with DSPy - Meetup Slides
 
Columbus Data & Analytics Wednesdays - June 2024
Columbus Data & Analytics Wednesdays - June 2024Columbus Data & Analytics Wednesdays - June 2024
Columbus Data & Analytics Wednesdays - June 2024
 
Public CyberSecurity Awareness Presentation 2024.pptx
Public CyberSecurity Awareness Presentation 2024.pptxPublic CyberSecurity Awareness Presentation 2024.pptx
Public CyberSecurity Awareness Presentation 2024.pptx
 
Digital Marketing Trends in 2024 | Guide for Staying Ahead
Digital Marketing Trends in 2024 | Guide for Staying AheadDigital Marketing Trends in 2024 | Guide for Staying Ahead
Digital Marketing Trends in 2024 | Guide for Staying Ahead
 
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAUHCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
 
Main news related to the CCS TSI 2023 (2023/1695)
Main news related to the CCS TSI 2023 (2023/1695)Main news related to the CCS TSI 2023 (2023/1695)
Main news related to the CCS TSI 2023 (2023/1695)
 
What is an RPA CoE? Session 1 – CoE Vision
What is an RPA CoE?  Session 1 – CoE VisionWhat is an RPA CoE?  Session 1 – CoE Vision
What is an RPA CoE? Session 1 – CoE Vision
 
TrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy SurveyTrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy Survey
 
“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...
“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...
“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...
 
“How Axelera AI Uses Digital Compute-in-memory to Deliver Fast and Energy-eff...
“How Axelera AI Uses Digital Compute-in-memory to Deliver Fast and Energy-eff...“How Axelera AI Uses Digital Compute-in-memory to Deliver Fast and Energy-eff...
“How Axelera AI Uses Digital Compute-in-memory to Deliver Fast and Energy-eff...
 
Fueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte WebinarFueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte Webinar
 
Leveraging the Graph for Clinical Trials and Standards
Leveraging the Graph for Clinical Trials and StandardsLeveraging the Graph for Clinical Trials and Standards
Leveraging the Graph for Clinical Trials and Standards
 
JavaLand 2024: Application Development Green Masterplan
JavaLand 2024: Application Development Green MasterplanJavaLand 2024: Application Development Green Masterplan
JavaLand 2024: Application Development Green Masterplan
 
Nordic Marketo Engage User Group_June 13_ 2024.pptx
Nordic Marketo Engage User Group_June 13_ 2024.pptxNordic Marketo Engage User Group_June 13_ 2024.pptx
Nordic Marketo Engage User Group_June 13_ 2024.pptx
 
WeTestAthens: Postman's AI & Automation Techniques
WeTestAthens: Postman's AI & Automation TechniquesWeTestAthens: Postman's AI & Automation Techniques
WeTestAthens: Postman's AI & Automation Techniques
 
Best 20 SEO Techniques To Improve Website Visibility In SERP
Best 20 SEO Techniques To Improve Website Visibility In SERPBest 20 SEO Techniques To Improve Website Visibility In SERP
Best 20 SEO Techniques To Improve Website Visibility In SERP
 
Taking AI to the Next Level in Manufacturing.pdf
Taking AI to the Next Level in Manufacturing.pdfTaking AI to the Next Level in Manufacturing.pdf
Taking AI to the Next Level in Manufacturing.pdf
 
GNSS spoofing via SDR (Criptored Talks 2024)
GNSS spoofing via SDR (Criptored Talks 2024)GNSS spoofing via SDR (Criptored Talks 2024)
GNSS spoofing via SDR (Criptored Talks 2024)
 
June Patch Tuesday
June Patch TuesdayJune Patch Tuesday
June Patch Tuesday
 
Apps Break Data
Apps Break DataApps Break Data
Apps Break Data
 

White Paper: Hadoop in Life Sciences — An Introduction

Audience

This white paper introduces the new data processing and analysis paradigm, Hadoop™, within the context of its usage in the life sciences, specifically genomics sequencing. It is intended for audiences with a basic knowledge of storage and computing technology and a rudimentary understanding of DNA sequencing and the bioinformatics analysis associated with it.
Executive Summary

Life Sciences data will soon reach the ExaByte (10^18 bytes, EB) scale. This is “Big Data”. As a reference point, all the words ever spoken by all human beings, when transcribed, amount to about 5 EB of data. A recent article titled “Will Computers Crash Genomics?” [1] points to exponential growth of the total genomics sequencing market capacity, as outlined in Figure 1 below: 10 Tera base-pairs (10^12 bp) per day, with an astounding 5x year-on-year growth rate (500%). The human genome is approximately 3 billion base pairs long; a base pair (bp) is a pair of complementary DNA nucleotides, G-C or A-T.

Figure 1: Genomics Growth

Each base-pair represents a total of about 100 bytes (of raw, analyzed and interpreted data). Therefore the genomics market capacity in 2010 storage terms (from Fig. 1) was about 200 PetaBytes (PB), with the capacity growing to about 1 ExaByte (EB) by late 2012. This capacity is overwhelming the technologies attempting to handle the deluge of Big Data in the life sciences. Proteomics (the study of proteins) and imaging data are in the early stages of this exponential rise. It is not just the data storage volume, but also its velocity and variability, that make this a challenge requiring “scale-out” technologies: technologies that grow simply and painlessly as the data center and business needs grow. Within the past year, one computing and storage framework has matured into a contender to handle this tsunami of Big Data: Hadoop™.

Life Sciences workflows require a High Performance Computing (HPC) infrastructure to process and analyze the data to determine the variations in the genome, and storage at the proper scale to retain this data. With Next Generation (genome) Sequencing (NGS) workflows generating up to 2 TeraBytes (TB) of data per run per week per sequencer, not including the raw images, scale-out storage that integrates easily with HPC is a “line item requirement”.

EMC Isilon has provided the scale-out storage for nearly all the workflows for all the DNA sequencer instrument manufacturers in the market today, at more than 150 customers. Since 2008, the EMC Isilon OneFS storage platform has built a Life Sciences installed base of more than 65 PetaBytes (PB).
As genomics has very large, semi-structured, file-based data and is modeled on post-process streaming data access and I/O patterns that can be parallelized, it is ideally suited for Hadoop. Hadoop consists of two main components, a file system and a compute system: the Hadoop Distributed File System (HDFS) and the MapReduce framework respectively. The Hadoop ecosystem consists of many open source tools, as shown in Figure 2 below:

Figure 2: Hadoop Components

To make the Hadoop storage “scale-out” and truly distributed, the EMC Isilon OneFS™ file system features connectivity to the Hadoop Distributed File System (HDFS) just like any other shared file system protocol: NFS, CIFS or SMB [3]. This allows the storage to be co-located with its compute nodes, using the standard higher-level Java application programming interface (API) to build MapReduce “jobs”.

Hadoop: an Introduction

Hadoop was created by Doug Cutting of the Apache Lucene project [4], initially as the Nutch Distributed File System (NDFS), which was inspired by Google’s distributed file system (the Google File System, GFS) and the MapReduce [5] application layer in 2004. Hadoop is an Apache™ Software Foundation project comprising a MapReduce layer for data analysis and a Hadoop Distributed File System (HDFS) layer, written in the Java programming language, to distribute and scale the MapReduce data.

The Hadoop MapReduce framework runs on the compute cluster using the data stored on the HDFS. MapReduce “jobs” aim to provide key/value-based processing in a highly parallelized fashion. Since the data is distributed over the cluster, a MapReduce job can be split up to run many parallel processes over the data stored on the cluster. The Map parts of MapReduce run only on the data they can see, that is, the data blocks on the particular machine each is running on. The Reduce brings together the output from the Maps. The result is a system that provides a highly parallel batch processing capability. The system scales well, since you just need to add more hardware to increase its storage capacity or decrease the time a MapReduce job takes to run. The partitioning of the storage and compute framework into master and worker node types is outlined in Figure 3 below:

Figure 3: Hadoop Cluster

Hadoop is a Write Once Read Many (WORM) system with no random writes. This can make Hadoop faster than HPC and storage that are integrated separately. The life sciences have been at the forefront of the technology adoption curve: one of the earliest use-cases of the Sun GridEngine [6] HPC was the DNA sequence comparison BLAST [16] search.

Standard Hadoop interfaces are available via Java, C, FUSE and WebDAV [7]. The R (statistical language) Hadoop interface, RHIPE [8], is also popular in the life sciences community.

The HDFS layer has a controller, the “Name Node”, which provides “data locality”, and it uses a “share nothing” architecture: a distributed scheme based on independent nodes [7]. From a platform perspective, the OneFS HDFS interface is compatible with Apache Hadoop, EMC GreenPlum [3] and Cloudera. In a traditional Hadoop implementation, the HDFS “Name Node” is a single point of failure, since it is the sole keeper of all the metadata for all the data that lives in the filesystem; the OneFS HDFS interface resolves this by distributing the name node data [3]. HDFS creates a 3x replica for redundancy; OneFS drastically reduces the need for a 3x copy.

A good example of the MapReduce “key-value” pair process, counting occurrences of specific words across documents [9], is shown in Figure 4 below:
Figure 4: Hadoop Example – word count across documents

Hadoop is not suited for low-latency, “in process” use-cases such as real-time, spectral or video analysis, or for large numbers of small files (<8 KB). When small files have to be used, they can be packed into a Hadoop Archive (HAR) for processing.

Life sciences organizations have been among Hadoop’s earliest adopters. Following the publication of the first Apache Hadoop project [10] in January 2008, the first large-scale MapReduce project was initiated by the Broad Institute, resulting in the comprehensive Genome Analysis Tool Kit (GATK) [11]. The Hadoop “CrossBow” project [12] from Johns Hopkins University came soon after. Other projects are Cloud-based: they include CloudBurst, Contrail, Myrna and CloudBLAST [13]. An interesting implementation is the NERSC (Department of Energy) Flash-based Hadoop cluster within the Magellan Science Cloud [14].
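The word-count example of Figure 4 can be sketched in a few lines of plain Python. This is only an illustration of the key/value flow (map, group by key, reduce) running in a single process; it does not use the Hadoop Java API, and the function names (`map_words`, `shuffle`, `reduce_counts`, `word_count`) are hypothetical names chosen for this sketch.

```python
from collections import defaultdict
from itertools import chain


def map_words(doc_id, text):
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in text.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group values by key, as the framework does between Map and Reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(word, counts):
    """Reduce: sum the partial counts collected for one word."""
    return word, sum(counts)


def word_count(documents):
    """Run map -> shuffle -> reduce over a dict of {doc_id: text}."""
    mapped = chain.from_iterable(
        map_words(doc_id, text) for doc_id, text in documents.items()
    )
    grouped = shuffle(mapped)
    return dict(reduce_counts(w, c) for w, c in sorted(grouped.items()))


if __name__ == "__main__":
    docs = {"doc1": "the quick brown fox", "doc2": "the lazy dog", "doc3": "the fox"}
    print(word_count(docs))
```

In a real cluster the map calls would run on the nodes holding each document's data blocks and the grouping would be done by the framework; here everything is in memory so that the data flow is easy to follow.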
Genomics example: CrossBow

The Hadoop “word count across documents” example in Fig. 4 can be extended to DNA sequencing: counting single base changes across millions of short DNA fragments and across hundreds of samples.

A Single Nucleotide Polymorphism (SNP) occurs when one nucleotide (A, T, C or G) varies in the DNA sequence of members of the same biological species. Next Generation Sequencers (NGS) like the Illumina® HiSeq can produce data in the order of 200 Giga base pairs in a single one-week run for a 60x human genome “coverage”, meaning that each base was present in an average of 60 reads. The larger the coverage, the more statistically significant the result. This data requires specialized software algorithms called “short read aligners”.

CrossBow [12] is a combination of several algorithms that provide SNP calling and short read alignment, which are common tasks in NGS. Figure 5 explains the steps necessary to process genome data to look for SNPs. The Map-Sort-Reduce process is ideally suited for a Hadoop framework. The cluster shown in Figure 5 is a traditional N-node Hadoop cluster.

Figure 5: Crossbow example – SNP calls across DNA fragments

1. The Map step is the short read alignment algorithm, called Bowtie (based on the Burrows-Wheeler Transform, BWT). Multiple instances of Bowtie are run in parallel in Hadoop. The input tuples (a tuple is an ordered list of elements) are the sequence reads and the output tuples are the alignments of the short reads.
2. The Sort step apportions the alignments according to a primary key (the genome partition) and sorts them based on a secondary key (the offset within that partition). The data here are the sorted alignments.
3. The Reduce step calls SNPs for each reference genome partition. Many parallel instances of the algorithm SOAPsnp (Short Oligonucleotide Analysis Package for SNP) run in the Hadoop cluster. Input tuples are sorted alignments for a partition and the output tuples are SNP calls.

Results are stored via HDFS, then archived in SOAPsnp format.

Enterprise-Class Hadoop on EMC Isilon

As demonstrated by the previous examples, the data and analysis scalability required for genomics is ideally suited for Hadoop. EMC Isilon’s OneFS distributes the Hadoop Name Node to provide high availability and load balancing, thereby eliminating the single point of failure. The Isilon NAS storage solution provides a highly efficient single file system/single volume, scalable up to 15 PB. Data can be staged from other protocols to HDFS using OneFS as a staging gateway. EMC Isilon provides Enterprise Grade data services to the Hadoop infrastructure via SnapshotIQ and SyncIQ for advanced backup and disaster recovery capabilities.

The equation for Hadoop scalability can be represented as:

Big(Data + Analytics) = Hadoop EMC:Isilon

These advantages are summarized in Fig. 6 below:

Figure 6: Hadoop advantages with EMC Isilon

When combined with the EMC GreenPlum Analytics appliance and solution [17], the Hadoop architecture becomes a complete Enterprise package.
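To make the Map-Sort-Reduce flow of the CrossBow example more concrete, the sketch below walks a handful of reads through the three steps in plain Python. It is a toy under stated assumptions, not CrossBow itself: a brute-force ungapped matcher stands in for Bowtie, a majority vote stands in for SOAPsnp, the Map step emits one tuple per aligned base (rather than one per alignment) so that partition boundaries need no special handling, and all names and parameters (`PARTITION_SIZE`, `map_align`, `sort_step`, `reduce_call_snps`, `crossbow_sketch`, `max_mismatches`, `min_depth`) are hypothetical.

```python
from collections import Counter, defaultdict

PARTITION_SIZE = 8  # hypothetical partition width, in bases


def map_align(read, reference, max_mismatches=1):
    """Map: place a read at its best ungapped position (toy stand-in for Bowtie).

    Emits one ((partition, position), base) tuple per aligned base.
    """
    best = None
    for offset in range(len(reference) - len(read) + 1):
        window = reference[offset:offset + len(read)]
        mismatches = sum(1 for a, b in zip(read, window) if a != b)
        if mismatches <= max_mismatches and (best is None or mismatches < best[0]):
            best = (mismatches, offset)
    if best is None:
        return []  # unaligned reads are dropped
    _, offset = best
    return [((pos // PARTITION_SIZE, pos), base)
            for pos, base in enumerate(read, start=offset)]


def sort_step(tuples):
    """Sort: order by primary key (partition), then secondary key (position)."""
    return sorted(tuples, key=lambda kv: kv[0])


def reduce_call_snps(sorted_tuples, reference, min_depth=2):
    """Reduce: majority-vote SNP call per position (toy stand-in for SOAPsnp)."""
    pileup = defaultdict(Counter)
    for key, base in sorted_tuples:
        pileup[key][base] += 1
    calls = []
    for (partition, pos), counts in sorted(pileup.items()):
        base, depth = counts.most_common(1)[0]
        if base != reference[pos] and depth >= min_depth:
            calls.append((partition, pos, reference[pos], base))
    return calls


def crossbow_sketch(reads, reference):
    """Chain the three steps over an in-memory list of reads."""
    mapped = [t for read in reads for t in map_align(read, reference)]
    return reduce_call_snps(sort_step(mapped), reference)


if __name__ == "__main__":
    reference = "ACGTTGCAAGCTTAGC"
    # Three reads carry an A in place of the reference G at position 5.
    reads = ["GTTACA", "TACAAG", "ACGTTA", "AGCTTA"]
    for partition, pos, ref_base, alt_base in crossbow_sketch(reads, reference):
        print(f"partition {partition}, position {pos}: {ref_base} -> {alt_base}")
```

On a Hadoop cluster each `map_align` call would run in parallel on a different node and each partition would go to its own reducer; the single-process version only shows how the keys route the data.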
  • 10. Conclusion
What began as an internal project at Google in 2004 has now matured into a scalable framework for two computing paradigms that are particularly suited for the life sciences: parallelization and distribution. The post-processing streaming data patterns for text strings, clustering and sorting (the core process patterns in the life sciences) are ideal workflows for Hadoop. The CrossBow example discussed above aligned Illumina NGS reads for SNP calling over a ‘35x’ coverage of the human genome in under 3 hours using a 40-node Hadoop cluster, an order of magnitude better than traditional HPC technology for parallel processes.

Although Hadoop implementations on Public Cloud instances are popular, several issues have led most large institutions to maintain their own data repositories internally: large data transfers from on-premise storage to the Cloud; data regulations and security; data availability; data redundancy; and HPC throughput. This is especially true as genome sequencing moves into the Clinic for diagnostic testing. The convergence of these issues is evidenced by the mirroring of the Sequence Read Archive (SRA) of the National Center for Biotechnology Information (NCBI) on the DNAnexus SRA Cloud [15]; its business model is slowly evolving into a ‘full data and analysis offsite’ model via Hadoop. The Hybrid Cloud model (a source data mirror between Private Cloud and Community Cloud) with Hadoop as a Service (HaaS) is the current state of the art.

Hadoop’s advantages far outweigh its challenges, and it is ready to become the life sciences analytics framework of the future. The EMC Isilon platform is bringing that future to you today.

References
1. Pennisi E, Science, 11 February 2011, Vol. 331 no. 6018, pp. 666-668.
2. Editorial, “Challenges and Opportunities”, Science, 11 February 2011, Vol. 331 no. 6018, p. 692.
3. Hadoop on EMC Isilon Scale Out NAS: EMC White Paper, Part Number h10528.
4. Cafarella M and Cutting D, “Building Nutch: Open Source Search”, ACM Queue vol. 2, no. 2, April 2004.
5. Dean J and Ghemawat S, “MapReduce: Simplified Data Processing on Large Clusters”, OSDI conference proceedings, 2004.
6. Vasiliu B, “Integrating BLAST with Sun GridEngine”, July 2003, http://developers.sun.com/solaris/articles/integrating_blast.html, last visited Dec 2011.
7. White T, “Hadoop: The Definitive Guide”, 2nd Edition, O’Reilly, Oct 2010.
8. RHIPE: http://ml.stat.purdue.edu/rhipe/, last visited Dec 2011.
  • 11. 9. MapReduce example: http://markusklems.files.wordpress.com/2008/07/mapreduce.png, last visited Dec 2011.
10. “Hadoop wins Terabyte sort benchmark”, Apr 2008, Apr 2009, http://sortbenchmark.org/YahooHadoop.pdf, http://sortbenchmark.org/Yahoo2009.pdf, last accessed Dec 2011.
11. McKenna A, et al, “The Genome Analysis Toolkit: A MapReduce framework for analyzing next-generation DNA sequencing data”, Genome Research, 20:1297–1303, July 2010.
12. Langmead B, Schatz MC, et al, “Human SNPs from short reads in hours using cloud computing”, Poster Presentation, WABI Sep 2009, http://www.cbcb.umd.edu/~mschatz/Posters/Crossbow_WABI_Sept2009.pdf, last accessed Dec 2011.
13. Taylor RC, “An overview of the Hadoop/MapReduce/HBase framework and its current applications in bioinformatics”, BMC Bioinformatics 2010, 11(Suppl 12):S1, http://www.biomedcentral.com/1471-2105/11/S12/S1, last accessed Dec 2011.
14. Ramakrishnan L, “Evaluating Cloud Computing for HPC Applications”, DoE NERSC, http://www.nersc.gov/assets/Events/MagellanNERSCLunchTalk.pdf, last accessed Dec 2011.
15. “DNAnexus to mirror SRA database in Google Cloud”, BioIT World, Page 41, http://www.bio-itworld.com/uploadedFiles/Bio-IT_World/1111BITW_download.pdf, last visited Dec 2011.
16. Altschul SF, et al, “Basic local alignment search tool”, J Mol Biol 215 (3): 403–410, October 1990.
17. Lockner J, “EMC’s Enterprise Hadoop Solution: Isilon Scale-out NAS and GreenPlum HD”, White Paper, The Enterprise Strategy Group, Inc (ESG), February 2012.