Taming Big Science Data Growth with Converged Infrastructure

Taming Big Science Data Growth with
Converged Infrastructure
©2014 BioTeam, Inc. All Rights Reserved.
Real-world strategies and implementation details for building converged storage
infrastructure to support the performance, scalability and collaborative
requirements of today's NGS workflows.
Aaron D. Gardner
Senior Scientific Consultant, BioTeam, Inc.
aaron@bioteam.net
BIOTEAM
Enabling Science

1. Introduction
2.The State of NGS Data Analysis
3. Converged Infrastructure
4. Solutions to Support NGS Data Analysis
BIOTEAM
Enabling Science

| About Myself
Who am I?
 A computer engineer who spent the
last 14 years with biologists in situ
 Exposed to NGS in 2005
 Have worked (for better or worse)
with most NGS platforms and data
types
 Along the way learned bioinformatics,
data management, HPC, storage, and
general research cyberinfrastructure
 Desire to help the broader life sciences
community lead me to BioTeam
BIOTEAM
Enabling Science
&
14Years
Later…
&

| About BioTeam
Who are we?
 Independent consulting shop
 Staffed by scientists forced to learn IT,
SW & HPC to get our own research done
 12+ years bridging the “gap”
between science, IT & high
performance computing
BioTeam@Bio-IT World ’14
 Did you just come from Chris’s talk?
Make sure to check out his slides…
 We have lots going on at the
conference this year
 Come visit us at booth #324
BIOTEAM
Enabling Science

| AboutThisTalk
What are we going to talk about?
 A quick look at NGS analysis trends
 Challenges in performance, scalability, and collaboration
 Strategies that address these challenges
 The benefits of pairing converged infrastructure with NGS
 Example topologies and implementations
BIOTEAM
Enabling Science
Approach
 Topics discussed the same way they would
over coffee (or tea)
 I talk about vendors and technologies
I have experience with– that’s why DDN
invited me to speak (thanks)
 Feel free to reach out to me during the
conference if any of this interests you

| A Note About the Big Picture
At BioTeam our mission is to enable science (see above)
i.e. Great people, enabled by great technology, actively
engaging in broader scientific communities
Technology alone doesn’t cover this mandate…
• Instruments never installed, unopened server boxes, idle accelerator racks
Gathering minds without the right resources and tools…
• They flee for the cloud, desk clusters, or other companies or institutions
Locking away resources and data stifles collaboration…
• Focus on services that empower instead of barriers that contain scientists
BIOTEAM
Enabling Science

|NGS Analysis Challenges
Performance
 Compute is easy, just not necessarily efficient
 Analysis pipelines longer and more complex
 Usually serial steps still lurking in them
 “New programs”– wrapper scripts with a twist
 Don’t address performance of fundamental
algorithms underneath
Scalability
 See few analysis algorithms scaled to 1-100K
cores each year, same w/ accelerators
 Still vast majority lucky to reach 10-100
 Life sciences still mostly a HTC problem
 Checkpointing becoming increasingly
important
 Varying and mixed IO patterns make HTC
problematic on shared storage
BIOTEAM
Enabling Science

|NGS Analysis Challenges
Collaboration
 Community movement to more efficient
sequence data structures—very encouraging
(e.g. SAM/BAM/CRAM/VCF/HDF5)
 Sharing of datasets still incredibly problematic
 Large sequencing centers, institutes, commercial
interests embracing Science DMZ, data transfer
node concept (w/ Globus, Aspera, etc.)
 Without data lifecycle management, this
newfound scientific data mobility will only
amplify storage issues (enter iRODS, etc.)
 Last mile problem for collaborators with poor
network connectivity
 Need real Big Data collobration solutions
 Find a way to bring computation to the data so
the last mile disappears
BIOTEAM
Enabling Science

| Observation: NGS Analysis Inversion IBIOTEAM
Enabling Science
Infrastructure Spending (in theory):
NGS Analysis Facility Infrastructure Spending (in practice):
Infrastructure Itself (the 80%)
SW and HW
Integrations
(the 20%)
Infrastructure
Itself
(the 80%)
SW and HW Integrations (the 20%)
Minimizing integration overhead is one of the principal challenges right now when
designing scientific computing environments.
 This holds for NGS, as well as other scientific domains
 Analysis environments from pieces which have never previously been tried together
 Synthesized based on what’s best for business instead of technical merits, efficiency is
wasted, and for small and midsize infrastructures integration overhead balloons

| Observation: NGS Analysis Inversion IIBIOTEAM
Enabling Science
The 20% The 80%
The 80% The 20%
What the industry is shooting for (in theory):
Where we seem to be (in practice):

| Traditional Infrastructure
Traditional computational infrastructures are comprised of separate
hardware (storage, networking, computation) and software
(provisioning, monitoring, management, etc.) components
 Pieced into one-off solutions
 Integrated and tuned on-site
(can take months for large systems)
 As these infrastructures scale,
they become snow flakes
BIOTEAM
Enabling Science

| Converged Infrastructure
With converged infrastructure, multiple hardware and software
components are developed, selected, integrated, and tuned
together, producing a pre-optimized solution
 Infrastructure building block approach
 Some vendors (e.g. DDN) offer mature
converged storage products like the SFA embedded platform
 Facebook’s Open Compute Project lends itself to building
converged infrastructures w/ OCP compliant components
 Analysis appliances (e.g. SlipStream) also use the
converged infrastructure model
Converged infrastructure shifts the focus from integrating
hardware components to building software services, which is
where organizations can better distinguish and define themselves
BIOTEAM
Enabling Science

| Converged Infrastructure: Example SolutionBIOTEAM
Enabling Science
Remote iRODS
server
Traditional
iRODS
iRODS
clients
iRODS data and
control access
NAS file access
RAID controllerRAID controllerRAID controller
SAN Switch
iRODS/iCAT
server
Block storage
access
Cluster
Network Switch
File server File serverFile server
Disk
array
Disk
array
Disk
array

| Converged Infrastructure: Example SolutionBIOTEAM
Enabling Science
iRODS
clients
iRODS data and
control access
High performance
NAS file access
iCAT & iRODS
servers
Network Switch
Cluster
Integrated Appliance Reduces Complexity
and Integration Time
Remote iRODS
server
Converged
iRODS

| Is Converged Infrastructure inYour Critical Path?BIOTEAM
Enabling Science
Converged infrastructure not as necessary when:
1. Hiring lots of smart people and committing their time to infrastructure
2. Attacking a single or small set of large problems
3. Rarely revalidating or reintegrating your HW stack after deployment
• This is because if you tie your platform closely to a mixed and disparate
hardware stack: staff time to explore reintegration and revalidation
issues, rewrite code for new architectures—this can work for hyper
giants and single service efforts but legacy and vendor-controlled
codes, flexible infrastructure, infrastructure for yet unknown or
unsolved problems—converged infrastructure buys these down…

| Tiered Service Models and Changing Staff RolesBIOTEAM
Enabling Science
Challenges With This Model:
• Need for single instance resources
capable of dealing with big data
• Now need multitenancy capabilities
even as a single organization
• Must minimize latency to better utilize
limited resources—public cloud’s
massive scalability approach might not
be suitable for a small or midsize
research environment with legacy
codes, inexperienced users, etc.
 DevOps and the cloud have changed the relationship between the
researcher and the IT practitioner permanently
 Research computing staff should be developing best practices, not acting
as a human ‘sudo’ for informaticists
Users
instantiate
resources on
demand which
they have
privileged
access to–but
no support is
offered beyond
clearing hang-
ups
Services
requiring a
higher degree
of reliability
and/or security
are built and
managed by IT
staff, with
unprivileged
access
provided to
users
Core
computational
services are still
supported end-
to-end by IT
staff, and are
consumed by
resources in the
previous two
levels
Solution: Move to a tired service and support model

GPFS is a fast parallel file system written by IBM
 Distributed metadata and locking
 Good performance with small files
 Tunable for large numbers of small files
 Native Linux and Windows clients
 CIFS and NFSv3 (v4 works, unsupported)
 Raw NGS data is big
 NGS analysis datasets are getting bigger
 They can require lots of IOPS during analysis
 Lots of space required to store what comes after
Can’t satisfy all of these considerations with a single storage tier without tremendous cost
Solution: Hierarchical Storage Management (HSM)
 Create different pools of storage, policies govern data movement
 SSD for metadata, small files,VMs, etc. and SATA for capacity and sequential access
 Can also use tape, object storage, and others as cold archive or warm near-line tiers
NOTE: Lustre now has some HSM capabilities too as of version 2.5
| Tiered Data Storage (e.g. GPFS w/ HSM)BIOTEAM
Enabling Science
Example GPFS based
GRIDScaler System from DDN:

| Science DMZ (e.g. ESnet Model)
Core Drivers
• Enterprise networking architecture is optimized for many
small data flows (Web 2.0, mobile, Internet ofThings)
• Not optimized for fewer large data flows
• Deep packet inspection & stateful firewalls
can’t handle large flows, performance tanks
3 Components of a Science DMZ
1. Fast network paths with streamlined security specific to large scientific data flows
2. DataTransfer Node(s) specifically tuned and dedicated to moving large data flows
3. Network monitoring and measurement node(s)
• Government & academic sites have done similar things for years without the name
• BioTeam strongly believes in the Science DMZ concept
• At this point anybody moving large scientific data should be evaluating
• We are already helping deploy them
ESnet has a great web resource available: http://fasterdata.es.net/
BIOTEAM
Enabling Science

| Implementation: Science DMZBIOTEAM
Enabling Science
Design Source: “The Science DMZ: Introduction & Architecture” – ESnet

| Information Lifecycle Management (ILM) (e.g. iRODS)BIOTEAM
Enabling Science
iRODS, the Integrated Rule-Oriented Data System, is a project for building the next
generation data management cyberinfrastructure. One of the main ideas behind
iRODS is to provide a system that enables a flexible, adaptive, customizable data
management architecture. Suitable for preserving data over its lifecycle.
At the iRODS core, a Rule Engine interprets the Rules to decide how the system is to
respond to various requests and conditions.
Interfaces: GUI, Web, WebDAV, CLI
Operations:
 Search, Access and View,
 Add/Extract Metadata, Annotate,
 Analyze & Process,
 Manage, Replicate, Copy, Share,
Repurpose,
 Track access, Subscribe & more…
iRODS Server software and Rule Engine run
on each data server. The iRODS iCAT
Metadata Catalog uses a database to track
metadata describing data and everything that
happens to it

| Implementation: Science DMZ + ILMBIOTEAM
Enabling Science

| NGS Data Analysis (on a Hybrid HPC Cloud)
General Concept
• On-site local resources are a “cache” that exists…
• To be used constantly
• For best data locality
• For specialized resources
• For security
• Elastic resources from public or parent organization’s private cloud
• The middleware offers cloud-style IaaS and/or PaaS
• Multi-tenant– users/virtual communities can spin up their own resources, clusters, etc.
• These on-demand systems accommodate unique software configurations and services
(suited to varying NGS workflows, etc.)
Sounds great, but…
• It will be a while before you can pull a solution like this off the shelf
• Would be a good candidate for a converged infrastructure offering
Goal: HPC-like performance and latency, cloud-like elasticity and provisioning
BIOTEAM
Enabling Science

| Implementation: Science DMZ + ILM + HPC CloudBIOTEAM
Enabling Science

| PartingThoughts & Lessons Learned
1. Confirmation Bias
• Just because it wasn’t viable before, doesn’t mean it won’t ever be
2. Depth Perception
• Bleeding Edge? Leading Edge? State of the Art? Legacy? Ready to Sunset?
3. Outliers
• The existence of edge or corner cases does not necessarily invalidate a solution,
but it does mean you better understand the scope the solution covers
4. The Power of &&
• Multipart solutions seen as complex, abandoned in search of a silver bullet
• Combining ideas is more collaborative and doesn’t force an ultimatum
5. GameTheory
• Bringing a chess set to a checkers tournament…
6. Relationship overTechnology
• Work with vendors and collaborators that are interested in making a long term
investment in what you do
BIOTEAM
Enabling Science

ThankYou
Questions and Discussion Welcome
Aaron D. Gardner
Senior Scientific Consultant, BioTeam, Inc.
aaron@bioteam.net
BIOTEAM
Enabling Science

Taming Big Science Data Growth with Converged Infrastructure

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to Taming Big Science Data Growth with Converged Infrastructure

Similar to Taming Big Science Data Growth with Converged Infrastructure (20)

Recently uploaded

Recently uploaded (20)

Taming Big Science Data Growth with Converged Infrastructure