Taming Big Science Data Growth with
Converged Infrastructure
©2014 BioTeam, Inc. All Rights Reserved.
Real-world strategies and implementation details for building converged storage
infrastructure to support the performance, scalability and collaborative
requirements of today's NGS workflows.
Aaron D. Gardner
Senior Scientific Consultant, BioTeam, Inc.
aaron@bioteam.net
BIOTEAM
Enabling Science
1. Introduction
2. The State of NGS Data Analysis
3. Converged Infrastructure
4. Solutions to Support NGS Data Analysis
| About Myself
Who am I?
• A computer engineer who spent the last 14 years with biologists in situ
• Exposed to NGS in 2005
• Have worked (for better or worse) with most NGS platforms and data types
• Along the way learned bioinformatics, data management, HPC, storage, and general research cyberinfrastructure
• A desire to help the broader life sciences community led me to BioTeam
| About BioTeam
Who are we?
• Independent consulting shop
• Staffed by scientists forced to learn IT, SW & HPC to get our own research done
• 12+ years bridging the “gap” between science, IT & high performance computing
BioTeam @ Bio-IT World ’14
• Did you just come from Chris’s talk? Make sure to check out his slides…
• We have lots going on at the conference this year
• Come visit us at booth #324
| About This Talk
What are we going to talk about?
• A quick look at NGS analysis trends
• Challenges in performance, scalability, and collaboration
• Strategies that address these challenges
• The benefits of pairing converged infrastructure with NGS
• Example topologies and implementations
Approach
• Topics are discussed the same way they would be over coffee (or tea)
• I talk about vendors and technologies I have experience with; that’s why DDN invited me to speak (thanks)
• Feel free to reach out to me during the conference if any of this interests you
| A Note About the Big Picture
At BioTeam our mission is to enable science (see above)
i.e. Great people, enabled by great technology, actively
engaging in broader scientific communities
Technology alone doesn’t cover this mandate…
• Instruments never installed, unopened server boxes, idle accelerator racks
Gathering minds without the right resources and tools…
• They flee for the cloud, desk clusters, or other companies or institutions
Locking away resources and data stifles collaboration…
• Focus on services that empower instead of barriers that contain scientists
| NGS Analysis Challenges
Performance
• Compute is easy, just not necessarily efficient
• Analysis pipelines are getting longer and more complex
• Serial steps are usually still lurking in them
• “New programs” are often wrapper scripts with a twist
• These don’t address the performance of the fundamental algorithms underneath
Scalability
• Each year we see a few analysis algorithms scaled to 1-100K cores, and the same with accelerators
• The vast majority are still lucky to reach 10-100 cores
• Life sciences is still mostly an HTC (high-throughput computing) problem
• Checkpointing is becoming increasingly important
• Varying and mixed IO patterns make HTC problematic on shared storage
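The serial steps mentioned above limit how far a pipeline can scale. As a rough illustration (added here, not material from the talk), Amdahl's law gives the best-case speedup for a job with serial fraction s on n cores. The function name and the 5% serial fraction are assumptions chosen for the example.

```python
def amdahl_speedup(serial_fraction: float, cores: int) -> float:
    """Best-case speedup under Amdahl's law: 1 / (s + (1 - s) / n)."""
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must be between 0 and 1")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / cores)


if __name__ == "__main__":
    # Hypothetical pipeline in which 5% of the runtime is serial.
    for n in (10, 100, 1_000, 100_000):
        print(f"{n:>7} cores -> {amdahl_speedup(0.05, n):.1f}x")
```

Under these assumptions the speedup can never exceed 1/0.05 = 20x, however many cores are added, which is one reason most codes stall far below the core counts that are available.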
| NGS Analysis Challenges
Collaboration
• The community movement to more efficient sequence data structures is very encouraging (e.g. SAM/BAM/CRAM/VCF/HDF5)
• Sharing of datasets is still incredibly problematic
• Large sequencing centers, institutes, and commercial interests are embracing the Science DMZ and data transfer node concept (w/ Globus, Aspera, etc.)
• Without data lifecycle management, this newfound scientific data mobility will only amplify storage issues (enter iRODS, etc.)
• There is a last mile problem for collaborators with poor network connectivity
• We need real Big Data collaboration solutions
• Find a way to bring computation to the data so the last mile disappears
| Observation: NGS Analysis Inversion I
[Figure: two bars, "Infrastructure Spending (in theory)" and "NGS Analysis Facility Infrastructure Spending (in practice)", each split into "Infrastructure Itself (the 80%)" and "SW and HW Integrations (the 20%)".]
Minimizing integration overhead is one of the principal challenges right now when designing scientific computing environments.
• This holds for NGS, as well as other scientific domains
• Analysis environments are built from pieces which have never previously been tried together
• When environments are synthesized based on what’s best for business instead of technical merits, efficiency is wasted, and for small and midsize infrastructures integration overhead balloons
| Observation: NGS Analysis Inversion II
[Figure: two bars split into "The 20%" and "The 80%", one labeled "What the industry is shooting for (in theory)" and one labeled "Where we seem to be (in practice)", with the two segments swapped between them.]
| Traditional Infrastructure
Traditional computational infrastructures are composed of separate hardware (storage, networking, computation) and software (provisioning, monitoring, management, etc.) components
• Pieced into one-off solutions
• Integrated and tuned on-site (can take months for large systems)
• As these infrastructures scale, they become snowflakes
| Converged Infrastructure
With converged infrastructure, multiple hardware and software components are developed, selected, integrated, and tuned together, producing a pre-optimized solution
• Infrastructure building block approach
• Some vendors (e.g. DDN) offer mature converged storage products like the SFA embedded platform
• Facebook’s Open Compute Project lends itself to building converged infrastructures w/ OCP compliant components
• Analysis appliances (e.g. SlipStream) also use the converged infrastructure model
Converged infrastructure shifts the focus from integrating hardware components to building software services, which is where organizations can better distinguish and define themselves
| Converged Infrastructure: Example Solution
[Diagram: Traditional iRODS. Components shown: iRODS clients, remote iRODS server, iRODS/iCAT server, cluster, network switch, three file servers, SAN switch, three RAID controllers, three disk arrays. Access paths labeled: iRODS data and control access, NAS file access, block storage access.]
| Converged Infrastructure: Example Solution
[Diagram: Converged iRODS. Components shown: iRODS clients, remote iRODS server, iCAT & iRODS servers, cluster, network switch. Access paths labeled: iRODS data and control access, high performance NAS file access.]
Integrated Appliance Reduces Complexity and Integration Time
| Is Converged Infrastructure in Your Critical Path?
Converged infrastructure is not as necessary when you are:
1. Hiring lots of smart people and committing their time to infrastructure
2. Attacking a single or small set of large problems
3. Rarely revalidating or reintegrating your HW stack after deployment
• This is because tying your platform closely to a mixed and disparate hardware stack costs staff time to explore reintegration and revalidation issues and to rewrite code for new architectures. This can work for hyper giants and single service efforts. For legacy and vendor-controlled codes, flexible infrastructure, and infrastructure for yet unknown or unsolved problems, converged infrastructure buys these costs down…
| Tiered Service Models and Changing Staff Roles
• DevOps and the cloud have changed the relationship between the researcher and the IT practitioner permanently
• Research computing staff should be developing best practices, not acting as a human ‘sudo’ for informaticists
Solution: Move to a tiered service and support model
1. Users instantiate resources on demand which they have privileged access to, but no support is offered beyond clearing hang-ups
2. Services requiring a higher degree of reliability and/or security are built and managed by IT staff, with unprivileged access provided to users
3. Core computational services are still supported end-to-end by IT staff, and are consumed by resources in the previous two levels
Challenges With This Model:
• Need for single instance resources capable of dealing with big data
• Now need multitenancy capabilities even as a single organization
• Must minimize latency to better utilize limited resources. Public cloud’s massive scalability approach might not be suitable for a small or midsize research environment with legacy codes, inexperienced users, etc.
| Tiered Data Storage (e.g. GPFS w/ HSM)
• Raw NGS data is big
• NGS analysis datasets are getting bigger
• They can require lots of IOPS during analysis
• Lots of space required to store what comes after
Can’t satisfy all of these considerations with a single storage tier without tremendous cost
Solution: Hierarchical Storage Management (HSM)
• Create different pools of storage, policies govern data movement
• SSD for metadata, small files, VMs, etc. and SATA for capacity and sequential access
• Can also use tape, object storage, and others as cold archive or warm near-line tiers
NOTE: Lustre now has some HSM capabilities too as of version 2.5
GPFS is a fast parallel file system written by IBM
• Distributed metadata and locking
• Good performance with small files
• Tunable for large numbers of small files
• Native Linux and Windows clients
• CIFS and NFSv3 (v4 works, unsupported)
Example GPFS based GRIDScaler System from DDN: [diagram]
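The idea of storage pools with policies governing data movement can be sketched in a few lines. This is a hypothetical illustration only: it is not GPFS policy syntax, and the pool names, the 1 MiB small-file threshold, the 180-day cold threshold, and the `FileInfo` and `choose_pool` names are all assumptions made for the example.

```python
from dataclasses import dataclass

DAY = 24 * 60 * 60  # seconds


@dataclass
class FileInfo:
    path: str
    size_bytes: int
    age_seconds: float  # time since last access


def choose_pool(f: FileInfo,
                small_file_bytes: int = 1 << 20,
                cold_after_days: int = 180) -> str:
    """Return the name of the (hypothetical) pool a file should live in."""
    if f.age_seconds > cold_after_days * DAY:
        return "archive"   # tape or object storage: cold or near-line data
    if f.size_bytes <= small_file_bytes:
        return "ssd"       # small files are IOPS-bound
    return "sata"          # large, sequentially read files: capacity tier


if __name__ == "__main__":
    files = [
        FileInfo("run42/sample.bam", 80 * 2**30, 3 * DAY),
        FileInfo("run42/sample.bam.bai", 768 * 2**10, 3 * DAY),
        FileInfo("run07/sample.bam", 75 * 2**30, 400 * DAY),
    ]
    for f in files:
        print(f"{f.path} -> {choose_pool(f)}")
```

A real HSM deployment would express rules like these in the file system's own policy engine and run them on a schedule; the point here is only that placement is a function of file attributes such as size and last access time.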
| Science DMZ (e.g. ESnet Model)
Core Drivers
• Enterprise networking architecture is optimized for many small data flows (Web 2.0, mobile, Internet of Things)
• Not optimized for fewer large data flows
• Deep packet inspection & stateful firewalls can’t handle large flows, performance tanks
3 Components of a Science DMZ
1. Fast network paths with streamlined security specific to large scientific data flows
2. Data Transfer Node(s) specifically tuned and dedicated to moving large data flows
3. Network monitoring and measurement node(s)
• Government & academic sites have done similar things for years without the name
• BioTeam strongly believes in the Science DMZ concept
• At this point anybody moving large scientific data should be evaluating it
• We are already helping deploy them
ESnet has a great web resource available: http://fasterdata.es.net/
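One way to see why large flows are so sensitive to the network path is the Mathis et al. approximation for steady-state TCP throughput, rate ≤ MSS / (RTT · √p), where p is the packet loss rate. The sketch below is an illustration added here, not material from the talk; the MSS, RTT, loss rates, and the 1 TB dataset size are assumed values.

```python
import math


def mathis_throughput_bps(mss_bytes: int, rtt_seconds: float, loss_rate: float) -> float:
    """Upper bound on TCP throughput in bits/s: (MSS / RTT) * (1 / sqrt(p))."""
    if rtt_seconds <= 0 or not 0 < loss_rate <= 1:
        raise ValueError("rtt_seconds must be > 0 and loss_rate in (0, 1]")
    return (mss_bytes * 8 / rtt_seconds) / math.sqrt(loss_rate)


def transfer_hours(dataset_bytes: float, throughput_bps: float) -> float:
    """Time to move a dataset at a constant throughput."""
    return dataset_bytes * 8 / throughput_bps / 3600


if __name__ == "__main__":
    mss = 1460    # bytes, typical for a 1500-byte MTU (assumed)
    rtt = 0.050   # 50 ms round trip (assumed)
    for loss in (1e-6, 1e-4, 1e-2):
        bps = mathis_throughput_bps(mss, rtt, loss)
        hours = transfer_hours(1e12, bps)  # 1 TB dataset (assumed)
        print(f"loss={loss:g}: {bps / 1e6:8.1f} Mbit/s, 1 TB in {hours:7.1f} h")
```

Because the bound scales with 1/√p, a hundredfold increase in packet loss cuts achievable single-stream throughput by a factor of ten, which is why a clean, dedicated path matters more for a few large flows than for many small ones.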
| Implementation: Science DMZ
Design Source: “The Science DMZ: Introduction & Architecture” – ESnet
| Information Lifecycle Management (ILM) (e.g. iRODS)
iRODS, the Integrated Rule-Oriented Data System, is a project for building the next generation data management cyberinfrastructure. One of the main ideas behind iRODS is to provide a system that enables a flexible, adaptive, customizable data management architecture suitable for preserving data over its lifecycle.
At the iRODS core, a Rule Engine interprets the Rules to decide how the system is to respond to various requests and conditions.
Interfaces: GUI, Web, WebDAV, CLI
Operations:
• Search, Access and View
• Add/Extract Metadata, Annotate
• Analyze & Process
• Manage, Replicate, Copy, Share, Repurpose
• Track access, Subscribe & more…
iRODS Server software and Rule Engine run on each data server. The iRODS iCAT Metadata Catalog uses a database to track metadata describing data and everything that happens to it.
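To make the rule-oriented idea concrete, here is a toy rule engine in Python. It is a conceptual sketch only and does not use or imitate the iRODS API or rule language: the `RuleEngine` class, the event dictionaries, and the `catalog` list standing in for a metadata catalog are all hypothetical.

```python
from typing import Callable, Dict, List

# An event is a plain dict, e.g. {"type": "put", "path": "..."}.
Event = Dict[str, object]
Condition = Callable[[Event], bool]
Action = Callable[[Event], str]


class RuleEngine:
    """Toy rule engine: every rule whose condition matches an event fires."""

    def __init__(self) -> None:
        self._rules: List[tuple] = []
        self.catalog: List[str] = []  # stand-in for a metadata catalog

    def add_rule(self, condition: Condition, action: Action) -> None:
        self._rules.append((condition, action))

    def handle(self, event: Event) -> List[str]:
        results = [action(event) for condition, action in self._rules
                   if condition(event)]
        self.catalog.extend(results)  # record everything that happened
        return results


if __name__ == "__main__":
    engine = RuleEngine()
    engine.add_rule(lambda e: e["type"] == "put",
                    lambda e: f"replicate {e['path']} to archive")
    engine.add_rule(lambda e: e["type"] == "put" and str(e["path"]).endswith(".bam"),
                    lambda e: f"tag {e['path']} as format=BAM")
    for line in engine.handle({"type": "put", "path": "/lab/run42/sample.bam"}):
        print(line)
```

The design point this illustrates is that data management behavior lives in rules that can be added or changed without touching the code that stores the data.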
| Implementation: Science DMZ + ILM
| NGS Data Analysis (on a Hybrid HPC Cloud)
General Concept
• On-site local resources are a “cache” that exists…
• To be used constantly
• For best data locality
• For specialized resources
• For security
• Elastic resources from public or parent organization’s private cloud
• The middleware offers cloud-style IaaS and/or PaaS
• Multi-tenant: users/virtual communities can spin up their own resources, clusters, etc.
• These on-demand systems accommodate unique software configurations and services
(suited to varying NGS workflows, etc.)
Sounds great, but…
• It will be a while before you can pull a solution like this off the shelf
• Would be a good candidate for a converged infrastructure offering
Goal: HPC-like performance and latency, cloud-like elasticity and provisioning
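The "local resources as a cache, cloud as elastic overflow" concept can be illustrated with a small placement function. This is a hypothetical sketch, not a description of any particular middleware; the `Job` fields, the `place` function, and the order of the checks are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    cores: int
    sensitive: bool = False         # data that must stay on site
    needs_special_hw: bool = False  # e.g. accelerators only available locally


def place(job: Job, free_local_cores: int) -> str:
    """Return "local", "cloud", or "queue" for a job (illustrative policy)."""
    fits_locally = job.cores <= free_local_cores
    if job.sensitive or job.needs_special_hw:
        # Security and specialized resources pin the job to the local site.
        return "local" if fits_locally else "queue"
    # Otherwise prefer local resources and burst to the cloud when full.
    return "local" if fits_locally else "cloud"


if __name__ == "__main__":
    free = 64
    for job in (Job("align", 32), Job("assemble", 256),
                Job("clinical-variants", 128, sensitive=True)):
        print(f"{job.name}: {place(job, free)}")
```

A production scheduler would also weigh data locality and transfer cost before bursting; those factors are left out here to keep the sketch short.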
| Implementation: Science DMZ + ILM + HPC Cloud
| Parting Thoughts & Lessons Learned
1. Confirmation Bias
• Just because it wasn’t viable before doesn’t mean it won’t ever be
2. Depth Perception
• Bleeding Edge? Leading Edge? State of the Art? Legacy? Ready to Sunset?
3. Outliers
• The existence of edge or corner cases does not necessarily invalidate a solution, but it does mean you had better understand the scope the solution covers
4. The Power of &&
• Multipart solutions are seen as complex and abandoned in search of a silver bullet
• Combining ideas is more collaborative and doesn’t force an ultimatum
5. Game Theory
• Bringing a chess set to a checkers tournament…
6. Relationship over Technology
• Work with vendors and collaborators that are interested in making a long term investment in what you do
Thank You
Questions and Discussion Welcome
Aaron D. Gardner
Senior Scientific Consultant, BioTeam, Inc.
aaron@bioteam.net

 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch TuesdayIvanti
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxLoriGlavin3
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesThousandEyes
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfpanagenda
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsRavi Sanghani
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxLoriGlavin3
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demoHarshalMandlekar2
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 

Recently uploaded (20)

Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software Developers
 
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesAssure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
 
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
 
Scale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterScale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL Router
 
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch Tuesday
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and Insights
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demo
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 

Taming Big Science Data Growth with Converged Infrastructure

  • 1. Taming Big Science Data Growth with Converged Infrastructure ©2014 BioTeam, Inc. All Rights Reserved. Real-world strategies and implementation details for building converged storage infrastructure to support the performance, scalability and collaborative requirements of today's NGS workflows. Aaron D. Gardner Senior Scientific Consultant, BioTeam, Inc. aaron@bioteam.net BIOTEAM Enabling Science
  • 2. Outline: (1) Introduction; (2) The State of NGS Data Analysis; (3) Converged Infrastructure; (4) Solutions to Support NGS Data Analysis
  • 3. About Myself. Who am I? A computer engineer who spent the last 14 years with biologists in situ. Exposed to NGS in 2005. Have worked (for better or worse) with most NGS platforms and data types. Along the way learned bioinformatics, data management, HPC, storage, and general research cyberinfrastructure. Desire to help the broader life sciences community led me to BioTeam. 14 years later…
  • 4. About BioTeam. Who are we? Independent consulting shop. Staffed by scientists forced to learn IT, SW & HPC to get our own research done. 12+ years bridging the “gap” between science, IT & high performance computing. BioTeam @ Bio-IT World ’14: Did you just come from Chris’s talk? Make sure to check out his slides… We have lots going on at the conference this year. Come visit us at booth #324.
  • 5. About This Talk. What are we going to talk about? A quick look at NGS analysis trends; challenges in performance, scalability, and collaboration; strategies that address these challenges; the benefits of pairing converged infrastructure with NGS; example topologies and implementations. Approach: Topics are discussed the same way they would be over coffee (or tea). I talk about vendors and technologies I have experience with; that’s why DDN invited me to speak (thanks). Feel free to reach out to me during the conference if any of this interests you.
  • 6. A Note About the Big Picture. At BioTeam our mission is to enable science, i.e. great people, enabled by great technology, actively engaging in broader scientific communities. Technology alone doesn’t cover this mandate: instruments never installed, unopened server boxes, idle accelerator racks. Gathering minds without the right resources and tools: they flee for the cloud, desk clusters, or other companies or institutions. Locking away resources and data stifles collaboration: focus on services that empower instead of barriers that contain scientists.
  • 7. Outline: (1) Introduction; (2) The State of NGS Data Analysis; (3) Converged Infrastructure; (4) Solutions to Support NGS Data Analysis
  • 8. NGS Analysis Challenges. Performance: Compute is easy, just not necessarily efficient. Analysis pipelines are longer and more complex, usually with serial steps still lurking in them. “New programs” are often wrapper scripts with a twist that don’t address the performance of the fundamental algorithms underneath. Scalability: We see few analysis algorithms scaled to 1-100K cores each year (same with accelerators); the vast majority are still lucky to reach 10-100. Life sciences is still mostly an HTC problem. Checkpointing is becoming increasingly important. Varying and mixed IO patterns make HTC problematic on shared storage.
  • 9. NGS Analysis Challenges. Collaboration: The community movement to more efficient sequence data structures (e.g. SAM/BAM/CRAM/VCF/HDF5) is very encouraging. Sharing of datasets is still incredibly problematic. Large sequencing centers, institutes, and commercial interests are embracing the Science DMZ and data transfer node concept (w/ Globus, Aspera, etc.). Without data lifecycle management, this newfound scientific data mobility will only amplify storage issues (enter iRODS, etc.). There is a last mile problem for collaborators with poor network connectivity. We need real Big Data collaboration solutions: find a way to bring computation to the data so the last mile disappears.
  • 10. Observation: NGS Analysis Inversion I. [Figure: NGS analysis facility infrastructure spending, in theory vs. in practice; both panels carry the segment labels “Infrastructure Itself (the 80%)” and “SW and HW Integrations (the 20%)”.] Minimizing integration overhead is one of the principal challenges right now when designing scientific computing environments. This holds for NGS as well as other scientific domains. Analysis environments are built from pieces which have never previously been tried together. When they are synthesized based on what’s best for business instead of technical merits, efficiency is wasted, and for small and midsize infrastructures integration overhead balloons.
  • 11. Observation: NGS Analysis Inversion II. [Figure: what the industry is shooting for (in theory) vs. where we seem to be (in practice); segments labeled “The 20%” and “The 80%”.]
  • 12. Outline: (1) Introduction; (2) The State of NGS Data Analysis; (3) Converged Infrastructure; (4) Solutions to Support NGS Data Analysis
  • 13. Traditional Infrastructure. Traditional computational infrastructures are composed of separate hardware (storage, networking, computation) and software (provisioning, monitoring, management, etc.) components. These are pieced into one-off solutions, then integrated and tuned on-site (which can take months for large systems). As these infrastructures scale, they become snowflakes.
  • 14. Converged Infrastructure. With converged infrastructure, multiple hardware and software components are developed, selected, integrated, and tuned together, producing a pre-optimized solution. It is an infrastructure building block approach. Some vendors (e.g. DDN) offer mature converged storage products like the SFA embedded platform. Facebook’s Open Compute Project lends itself to building converged infrastructures w/ OCP compliant components. Analysis appliances (e.g. SlipStream) also use the converged infrastructure model. Converged infrastructure shifts the focus from integrating hardware components to building software services, which is where organizations can better distinguish and define themselves.
  • 15. Converged Infrastructure: Example Solution (Traditional iRODS). [Diagram components: iRODS clients, remote iRODS server, iRODS/iCAT server, cluster, network switch, three file servers, SAN switch, three RAID controllers, three disk arrays. Access paths: iRODS data and control access, NAS file access, block storage access.]
  • 16. Converged Infrastructure: Example Solution (Converged iRODS). [Diagram components: iRODS clients, remote iRODS server, iCAT & iRODS servers, network switch, cluster. Access paths: iRODS data and control access, high performance NAS file access.] Integrated appliance reduces complexity and integration time.
  • 17. Is Converged Infrastructure in Your Critical Path? Converged infrastructure is not as necessary when: 1. Hiring lots of smart people and committing their time to infrastructure. 2. Attacking a single or small set of large problems. 3. Rarely revalidating or reintegrating your HW stack after deployment. This is because tying your platform closely to a mixed and disparate hardware stack costs staff time to explore reintegration and revalidation issues and to rewrite code for new architectures. That can work for hyper giants and single-service efforts, but for legacy and vendor-controlled codes, flexible infrastructure, and infrastructure for yet unknown or unsolved problems, converged infrastructure buys these costs down…
  • 18. Tiered Service Models and Changing Staff Roles. DevOps and the cloud have changed the relationship between the researcher and the IT practitioner permanently. Research computing staff should be developing best practices, not acting as a human ‘sudo’ for informaticists. Solution: Move to a tiered service and support model. (1) Users instantiate resources on demand which they have privileged access to, but no support is offered beyond clearing hang-ups. (2) Services requiring a higher degree of reliability and/or security are built and managed by IT staff, with unprivileged access provided to users. (3) Core computational services are still supported end-to-end by IT staff, and are consumed by resources in the previous two levels. Challenges with this model: Need for single instance resources capable of dealing with big data. Now need multitenancy capabilities even as a single organization. Must minimize latency to better utilize limited resources; public cloud’s massive scalability approach might not be suitable for a small or midsize research environment with legacy codes, inexperienced users, etc.
  • 19. Outline: (1) Introduction; (2) The State of NGS Data Analysis; (3) Converged Infrastructure; (4) Solutions to Support NGS Data Analysis
  • 20. Tiered Data Storage (e.g. GPFS w/ HSM). Raw NGS data is big. NGS analysis datasets are getting bigger. They can require lots of IOPS during analysis. Lots of space is required to store what comes after. You can’t satisfy all of these considerations with a single storage tier without tremendous cost. Solution: Hierarchical Storage Management (HSM). Create different pools of storage; policies govern data movement. SSD for metadata, small files, VMs, etc. and SATA for capacity and sequential access. Can also use tape, object storage, and others as cold archive or warm near-line tiers. NOTE: Lustre now has some HSM capabilities too as of version 2.5. GPFS is a fast parallel file system written by IBM: distributed metadata and locking; good performance with small files; tunable for large numbers of small files; native Linux and Windows clients; CIFS and NFSv3 (v4 works, unsupported). Example GPFS based GRIDScaler System from DDN:
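The idea that "policies govern data movement" between pools can be sketched in a few lines of Python. This is only an illustration of the concept, not GPFS policy syntax: the pool names, the 1 MiB small-file cutoff, and the 180-day archive threshold are all assumptions made up for the example.

```python
from dataclasses import dataclass

# Hypothetical thresholds, chosen only for illustration.
SMALL_FILE_BYTES = 1024 * 1024   # files under 1 MiB count as "small"
COLD_AFTER_DAYS = 180            # untouched this long -> cold archive


@dataclass
class FileInfo:
    path: str
    size_bytes: int
    days_since_access: int


def choose_pool(f: FileInfo) -> str:
    """Return the (hypothetical) pool a file belongs in under this toy policy."""
    if f.days_since_access >= COLD_AFTER_DAYS:
        return "archive"    # tape or object storage acting as the cold tier
    if f.size_bytes < SMALL_FILE_BYTES:
        return "ssd"        # small, recently used files benefit from high IOPS
    return "capacity"       # large files on SATA for capacity and sequential access
```

For instance, `choose_pool(FileInfo("run1/sample.bam", 50 * 1024**3, 2))` lands a large, recently used file in the capacity pool, while the same file left untouched for 400 days would go to the archive pool. A real deployment would express such rules in the file system's own policy engine.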
  • 21. Science DMZ (e.g. ESnet Model). Core drivers: Enterprise networking architecture is optimized for many small data flows (Web 2.0, mobile, Internet of Things) and not for fewer large data flows. Deep packet inspection & stateful firewalls can’t handle large flows, and performance tanks. 3 components of a Science DMZ: 1. Fast network paths with streamlined security specific to large scientific data flows. 2. Data Transfer Node(s) specifically tuned and dedicated to moving large data flows. 3. Network monitoring and measurement node(s). Government & academic sites have done similar things for years without the name. BioTeam strongly believes in the Science DMZ concept. At this point anybody moving large scientific data should be evaluating it. We are already helping deploy them. ESnet has a great web resource available: http://fasterdata.es.net/
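To make the cost of a throttled large flow concrete, here is a small back-of-the-envelope calculation in Python. It is plain arithmetic, not a measurement: the dataset size, link speeds, and the `efficiency` factor (the fraction of nominal bandwidth a flow actually achieves) are illustrative assumptions.

```python
def transfer_hours(size_tb: float, link_gbps: float, efficiency: float = 1.0) -> float:
    """Hours needed to move size_tb terabytes over a link_gbps link.

    efficiency is the assumed fraction of nominal bandwidth actually achieved
    (for example, after a firewall or untuned host limits a large flow).
    Uses decimal units: 1 TB = 1e12 bytes, 1 Gbps = 1e9 bits per second.
    """
    if link_gbps <= 0 or not 0 < efficiency <= 1:
        raise ValueError("link_gbps must be > 0 and efficiency in (0, 1]")
    bits = size_tb * 1e12 * 8
    seconds = bits / (link_gbps * 1e9 * efficiency)
    return seconds / 3600
```

Under these assumptions, 1 TB over a fully used 10 Gbps path takes about 0.22 hours, while the same path delivering only 10% of its nominal rate takes about 2.2 hours, which is the kind of gap a Science DMZ is meant to close.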
  • 22. Implementation: Science DMZ. Design source: “The Science DMZ: Introduction & Architecture” (ESnet)
  • 23. Information Lifecycle Management (ILM) (e.g. iRODS). iRODS, the Integrated Rule-Oriented Data System, is a project for building the next generation data management cyberinfrastructure. One of the main ideas behind iRODS is to provide a system that enables a flexible, adaptive, customizable data management architecture, suitable for preserving data over its lifecycle. At the iRODS core, a Rule Engine interprets the Rules to decide how the system is to respond to various requests and conditions. Interfaces: GUI, Web, WebDAV, CLI. Operations: search, access and view; add/extract metadata, annotate; analyze & process; manage, replicate, copy, share, repurpose; track access, subscribe & more… iRODS Server software and Rule Engine run on each data server. The iRODS iCAT Metadata Catalog uses a database to track metadata describing data and everything that happens to it.
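The rule-engine-plus-catalog pattern described on this slide can be sketched as a toy model in Python. This is not the iRODS API or rule language: the class names, the event name `"put"`, and both example rules are hypothetical, meant only to show rules reacting to events while a catalog records metadata and everything that happens to the data.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Catalog:
    """Toy stand-in for a metadata catalog: per-object metadata plus an audit trail."""
    metadata: Dict[str, Dict[str, str]] = field(default_factory=dict)
    audit_log: List[str] = field(default_factory=list)


# A rule is any callable that inspects an object path and updates the catalog.
Rule = Callable[[Catalog, str], None]


class RuleEngine:
    """Maps event names to rules and applies them whenever an event fires."""

    def __init__(self, catalog: Catalog) -> None:
        self.catalog = catalog
        self.rules: Dict[str, List[Rule]] = {}

    def on(self, event: str, rule: Rule) -> None:
        self.rules.setdefault(event, []).append(rule)

    def fire(self, event: str, path: str) -> None:
        self.catalog.audit_log.append(f"{event}:{path}")  # track what happened
        for rule in self.rules.get(event, []):
            rule(self.catalog, path)


def tag_alignment(catalog: Catalog, path: str) -> None:
    """Hypothetical rule: annotate BAM files on ingest."""
    if path.endswith(".bam"):
        catalog.metadata.setdefault(path, {})["type"] = "alignment"


def request_replica(catalog: Catalog, path: str) -> None:
    """Hypothetical rule: mark every ingested object for replication."""
    catalog.metadata.setdefault(path, {})["replicate"] = "pending"
```

Registering both rules for a `"put"` event and then firing it for one path leaves that path annotated and flagged for replication in the catalog, with the event itself recorded in the audit log.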
  • 24. Implementation: Science DMZ + ILM
  • 25. NGS Data Analysis (on a Hybrid HPC Cloud). General concept: On-site local resources are a “cache” that exists to be used constantly, for best data locality, for specialized resources, and for security. Elastic resources come from a public cloud or the parent organization’s private cloud. The middleware offers cloud-style IaaS and/or PaaS. Multi-tenant: users/virtual communities can spin up their own resources, clusters, etc. These on-demand systems accommodate unique software configurations and services (suited to varying NGS workflows, etc.). Sounds great, but… it will be a while before you can pull a solution like this off the shelf. It would be a good candidate for a converged infrastructure offering. Goal: HPC-like performance and latency, cloud-like elasticity and provisioning.
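The "local resources as a cache, cloud as elastic overflow" idea can be sketched as a placement decision in Python. This is a deliberately simplified illustration under assumed rules, not a real scheduler: the `Job` fields, the `must_stay_local` flag, and the three placement labels are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    cores: int
    must_stay_local: bool = False  # e.g. security or data-locality constraints


def place(job: Job, local_free_cores: int) -> str:
    """Decide where a job runs under a toy 'local first, burst if needed' policy."""
    if job.cores <= local_free_cores:
        return "local"        # keep the on-site "cache" constantly busy
    if job.must_stay_local:
        return "wait-local"   # cannot leave the site, so queue until cores free up
    return "cloud"            # burst to elastic public or private cloud resources
```

With 64 free local cores, a 32-core job runs locally, a 256-core job bursts to the cloud, and a 256-core job flagged `must_stay_local` waits for on-site capacity.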
  • 26. Implementation: Science DMZ + ILM + HPC Cloud
  • 27. Parting Thoughts & Lessons Learned. 1. Confirmation Bias: Just because it wasn’t viable before doesn’t mean it won’t ever be. 2. Depth Perception: Bleeding edge? Leading edge? State of the art? Legacy? Ready to sunset? 3. Outliers: The existence of edge or corner cases does not necessarily invalidate a solution, but it does mean you had better understand the scope the solution covers. 4. The Power of &&: Multipart solutions are seen as complex and abandoned in search of a silver bullet. Combining ideas is more collaborative and doesn’t force an ultimatum. 5. Game Theory: Bringing a chess set to a checkers tournament… 6. Relationship over Technology: Work with vendors and collaborators that are interested in making a long term investment in what you do.
  • 28. Thank You. Questions and Discussion Welcome. Aaron D. Gardner, Senior Scientific Consultant, BioTeam, Inc. aaron@bioteam.net