
Taming Big Science Data Growth with Converged Infrastructure

2014 BioIT World Expo presentation
"Many of the largest NGS sites have identified IO bottlenecks as their number one concern in growing their infrastructure to support current and projected data growth rates. In this talk Aaron D. Gardner, Senior Scientific Consultant, BioTeam, Inc. will share real-world strategies and implementation details for building converged storage infrastructure to support the performance, scalability and collaborative requirements of today's NGS workflows. "

For a copy of this presentation please email: chris@bioteam.net

Published in: Technology

Taming Big Science Data Growth with Converged Infrastructure

  1. Taming Big Science Data Growth with Converged Infrastructure
     Real-world strategies and implementation details for building converged storage infrastructure to support the performance, scalability and collaborative requirements of today's NGS workflows.
     Aaron D. Gardner, Senior Scientific Consultant, BioTeam, Inc., aaron@bioteam.net
     BIOTEAM: Enabling Science. ©2014 BioTeam, Inc. All Rights Reserved.
  2. Outline
     1. Introduction
     2. The State of NGS Data Analysis
     3. Converged Infrastructure
     4. Solutions to Support NGS Data Analysis
  3. About Myself: Who am I?
     • A computer engineer who spent the last 14 years with biologists in situ
     • Exposed to NGS in 2005
     • Have worked (for better or worse) with most NGS platforms and data types
     • Along the way learned bioinformatics, data management, HPC, storage, and general research cyberinfrastructure
     • Desire to help the broader life sciences community led me to BioTeam
     [Image caption: 14 Years Later…]
  4. About BioTeam: Who are we?
     • Independent consulting shop
     • Staffed by scientists forced to learn IT, SW & HPC to get our own research done
     • 12+ years bridging the “gap” between science, IT & high performance computing
     BioTeam @ Bio-IT World ’14
     • Did you just come from Chris’s talk? Make sure to check out his slides…
     • We have lots going on at the conference this year
     • Come visit us at booth #324
  5. About This Talk
     What are we going to talk about?
     • A quick look at NGS analysis trends
     • Challenges in performance, scalability, and collaboration
     • Strategies that address these challenges
     • The benefits of pairing converged infrastructure with NGS
     • Example topologies and implementations
     Approach
     • Topics are discussed the way they would be over coffee (or tea)
     • I talk about vendors and technologies I have experience with; that’s why DDN invited me to speak (thanks)
     • Feel free to reach out to me during the conference if any of this interests you
  6. A Note About the Big Picture
     At BioTeam our mission is to enable science (see above), i.e. great people, enabled by great technology, actively engaging in broader scientific communities.
     Technology alone doesn’t cover this mandate…
     • Instruments never installed, unopened server boxes, idle accelerator racks
     Gathering minds without the right resources and tools…
     • They flee for the cloud, desk clusters, or other companies or institutions
     Locking away resources and data stifles collaboration…
     • Focus on services that empower instead of barriers that contain scientists
  7. Outline
     1. Introduction
     2. The State of NGS Data Analysis
     3. Converged Infrastructure
     4. Solutions to Support NGS Data Analysis
  8. NGS Analysis Challenges
     Performance
     • Compute is easy, just not necessarily efficient
     • Analysis pipelines are longer and more complex
     • Serial steps are usually still lurking in them
     • “New programs” are often wrapper scripts with a twist
     • These don’t address the performance of the fundamental algorithms underneath
     Scalability
     • We see a few analysis algorithms scaled to 1-100K cores each year; same with accelerators
     • The vast majority are still lucky to reach 10-100
     • Life sciences is still mostly an HTC problem
     • Checkpointing is becoming increasingly important
     • Varying and mixed IO patterns make HTC problematic on shared storage
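The performance bullets above note that serial steps still lurk in most pipelines. One standard way to reason about their cost is Amdahl's law, which the slide does not name and which is brought in here only as an illustration. The sketch below assumes a hypothetical pipeline in which 5% of the runtime is serial; the function name and all numbers are illustrative, not measurements from the talk.

```python
def amdahl_speedup(serial_fraction: float, cores: int) -> float:
    """Upper bound on speedup when `serial_fraction` of the work cannot be parallelized."""
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must be between 0 and 1")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    parallel_fraction = 1.0 - serial_fraction
    # The serial part runs at full cost; only the parallel part is divided across cores.
    return 1.0 / (serial_fraction + parallel_fraction / cores)


if __name__ == "__main__":
    # Hypothetical pipeline: 5% of the runtime is a serial step.
    for cores in (10, 100, 1000, 100_000):
        print(f"{cores:>7} cores: at most {amdahl_speedup(0.05, cores):.1f}x faster")
```

Under this assumed 5% serial fraction the bound never reaches 20x, however many cores are added, which is one way to read the gap between the few codes that scale to 100K cores and the majority that do not.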
  9. NGS Analysis Challenges
     Collaboration
     • Community movement to more efficient sequence data structures is very encouraging (e.g. SAM/BAM/CRAM/VCF/HDF5)
     • Sharing of datasets is still incredibly problematic
     • Large sequencing centers, institutes, and commercial interests are embracing the Science DMZ and data transfer node concept (w/ Globus, Aspera, etc.)
     • Without data lifecycle management, this newfound scientific data mobility will only amplify storage issues (enter iRODS, etc.)
     • Last mile problem for collaborators with poor network connectivity
     • Need real Big Data collaboration solutions
     • Find a way to bring computation to the data so the last mile disappears
  10. Observation: NGS Analysis Inversion I
     [Figure: two charts for an NGS analysis facility, “Infrastructure Spending (in theory)” and “Infrastructure Spending (in practice)”; slice labels: Infrastructure Itself (the 80%), SW and HW Integrations (the 20%)]
     Minimizing integration overhead is one of the principal challenges right now when designing scientific computing environments.
     • This holds for NGS, as well as other scientific domains
     • Analysis environments are built from pieces that have never previously been tried together
     • When they are synthesized based on what’s best for business instead of technical merit, efficiency is wasted, and for small and midsize infrastructures integration overhead balloons
  11. Observation: NGS Analysis Inversion II
     [Figure: two charts captioned “What the industry is shooting for (in theory)” and “Where we seem to be (in practice)”, with slices labeled “The 20%” and “The 80%” in opposite order]
  12. Outline
     1. Introduction
     2. The State of NGS Data Analysis
     3. Converged Infrastructure
     4. Solutions to Support NGS Data Analysis
  13. Traditional Infrastructure
     Traditional computational infrastructures are composed of separate hardware (storage, networking, computation) and software (provisioning, monitoring, management, etc.) components
     • Pieced into one-off solutions
     • Integrated and tuned on-site (can take months for large systems)
     • As these infrastructures scale, they become snowflakes
  14. Converged Infrastructure
     With converged infrastructure, multiple hardware and software components are developed, selected, integrated, and tuned together, producing a pre-optimized solution
     • Infrastructure building block approach
     • Some vendors (e.g. DDN) offer mature converged storage products like the SFA embedded platform
     • Facebook’s Open Compute Project lends itself to building converged infrastructures with OCP-compliant components
     • Analysis appliances (e.g. SlipStream) also use the converged infrastructure model
     Converged infrastructure shifts the focus from integrating hardware components to building software services, which is where organizations can better distinguish and define themselves
  15. Converged Infrastructure: Example Solution
     [Diagram: Traditional iRODS. Labels: remote iRODS server; iRODS clients; iRODS data and control access; NAS file access; three RAID controllers; SAN switch; iRODS/iCAT server; block storage access; cluster; network switch; three file servers; three disk arrays]
  16. Converged Infrastructure: Example Solution
     [Diagram: Converged iRODS. Labels: iRODS clients; iRODS data and control access; high performance NAS file access; iCAT & iRODS servers; network switch; cluster; remote iRODS server]
     Integrated appliance reduces complexity and integration time
  17. Is Converged Infrastructure in Your Critical Path?
     Converged infrastructure is not as necessary when:
     1. Hiring lots of smart people and committing their time to infrastructure
     2. Attacking a single or small set of large problems
     3. Rarely revalidating or reintegrating your HW stack after deployment
     • This is because tying your platform closely to a mixed and disparate hardware stack costs staff time to explore reintegration and revalidation issues and to rewrite code for new architectures. That can work for hyper giants and single-service efforts. For legacy and vendor-controlled codes, flexible infrastructure, and infrastructure for as yet unknown or unsolved problems, converged infrastructure buys these costs down…
  18. Tiered Service Models and Changing Staff Roles
     • DevOps and the cloud have changed the relationship between the researcher and the IT practitioner permanently
     • Research computing staff should be developing best practices, not acting as a human ‘sudo’ for informaticists
     Solution: Move to a tiered service and support model
     • Users instantiate resources on demand which they have privileged access to, but no support is offered beyond clearing hang-ups
     • Services requiring a higher degree of reliability and/or security are built and managed by IT staff, with unprivileged access provided to users
     • Core computational services are still supported end-to-end by IT staff, and are consumed by resources in the previous two levels
     Challenges With This Model:
     • Need for single instance resources capable of dealing with big data
     • Now need multitenancy capabilities even as a single organization
     • Must minimize latency to better utilize limited resources; public cloud’s massive scalability approach might not be suitable for a small or midsize research environment with legacy codes, inexperienced users, etc.
  19. Outline
     1. Introduction
     2. The State of NGS Data Analysis
     3. Converged Infrastructure
     4. Solutions to Support NGS Data Analysis
  20. Tiered Data Storage (e.g. GPFS w/ HSM)
     • Raw NGS data is big
     • NGS analysis datasets are getting bigger
     • They can require lots of IOPS during analysis
     • Lots of space required to store what comes after
     Can’t satisfy all of these considerations with a single storage tier without tremendous cost
     Solution: Hierarchical Storage Management (HSM)
     • Create different pools of storage, policies govern data movement
     • SSD for metadata, small files, VMs, etc. and SATA for capacity and sequential access
     • Can also use tape, object storage, and others as cold archive or warm near-line tiers
     NOTE: Lustre now has some HSM capabilities too as of version 2.5
     GPFS is a fast parallel file system written by IBM
     • Distributed metadata and locking
     • Good performance with small files
     • Tunable for large numbers of small files
     • Native Linux and Windows clients
     • CIFS and NFSv3 (v4 works, unsupported)
     [Figure: example GPFS-based GRIDScaler system from DDN]
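The HSM idea on this slide (separate storage pools, with policies governing data movement) can be sketched as a small placement function. This is a toy illustration only: GPFS expresses such policies in its own rule language, which this sketch does not attempt to reproduce, and the pool names, the 1 MiB "small file" threshold, and the 180-day "cold" threshold are all hypothetical values chosen for the example.

```python
from dataclasses import dataclass

# Hypothetical thresholds; real sites tune these to their own workloads.
SMALL_FILE_BYTES = 1 * 1024 * 1024   # files under 1 MiB count as "small"
COLD_AFTER_DAYS = 180                # untouched this long -> archive tier


@dataclass
class FileInfo:
    path: str
    size_bytes: int
    days_since_access: int
    is_metadata: bool = False


def choose_pool(f: FileInfo) -> str:
    """Return the storage pool a file should live in under this toy policy."""
    if f.is_metadata:
        return "ssd"    # metadata always stays on the fast tier
    if f.days_since_access >= COLD_AFTER_DAYS:
        return "tape"   # cold archive tier
    if f.size_bytes < SMALL_FILE_BYTES:
        return "ssd"    # small, IOPS-heavy files
    return "sata"       # capacity and sequential access


if __name__ == "__main__":
    samples = [
        FileInfo("/seq/run42/sample.bam", 40 * 1024**3, 3),
        FileInfo("/seq/run42/sample.bam.bai", 200 * 1024, 3),
        FileInfo("/seq/run01/sample.fastq.gz", 25 * 1024**3, 400),
    ]
    for f in samples:
        print(f"{f.path} -> {choose_pool(f)}")
```

A policy engine would run a function like this periodically over the file system and migrate any file whose current pool differs from the chosen one.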
  21. Science DMZ (e.g. ESnet Model)
     Core Drivers
     • Enterprise networking architecture is optimized for many small data flows (Web 2.0, mobile, Internet of Things)
     • Not optimized for fewer large data flows
     • Deep packet inspection & stateful firewalls can’t handle large flows, performance tanks
     3 Components of a Science DMZ
     1. Fast network paths with streamlined security specific to large scientific data flows
     2. Data Transfer Node(s) specifically tuned and dedicated to moving large data flows
     3. Network monitoring and measurement node(s)
     • Government & academic sites have done similar things for years without the name
     • BioTeam strongly believes in the Science DMZ concept
     • At this point anybody moving large scientific data should be evaluating it
     • We are already helping deploy them
     ESnet has a great web resource available: http://fasterdata.es.net/
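One reason Data Transfer Nodes need to be "specifically tuned" for large flows can be shown with the bandwidth-delay product, a standard networking calculation that the slide itself does not spell out. The sketch below is an illustration under assumed numbers (a 10 Gb/s path with a 50 ms round-trip time, and a 64 KiB window as an example of an untuned host); the function names are hypothetical.

```python
def bandwidth_delay_product_bytes(link_bits_per_s: float, rtt_seconds: float) -> float:
    """Bytes that must be in flight to keep a path of this speed and latency full."""
    return link_bits_per_s * rtt_seconds / 8.0


def max_throughput_bits_per_s(window_bytes: float, rtt_seconds: float) -> float:
    """Ceiling on a single flow's throughput when at most `window_bytes` are in flight."""
    return window_bytes * 8.0 / rtt_seconds


if __name__ == "__main__":
    link = 10e9   # assumed 10 Gb/s path
    rtt = 0.050   # assumed 50 ms round-trip time

    needed = bandwidth_delay_product_bytes(link, rtt)
    print(f"Window needed to fill the path: {needed / 1e6:.1f} MB")

    small_window = 64 * 1024   # assumed untuned 64 KiB window
    ceiling = max_throughput_bits_per_s(small_window, rtt)
    print(f"Ceiling with a 64 KiB window: {ceiling / 1e6:.1f} Mb/s")
```

With these assumed numbers, a single flow limited to a small window uses only a tiny fraction of the path, which is why hosts dedicated to large transfers are configured differently from general-purpose servers.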
  22. Implementation: Science DMZ
     [Diagram] Design source: “The Science DMZ: Introduction & Architecture”, ESnet
  23. Information Lifecycle Management (ILM) (e.g. iRODS)
     iRODS, the Integrated Rule-Oriented Data System, is a project for building the next generation data management cyberinfrastructure. One of the main ideas behind iRODS is to provide a system that enables a flexible, adaptive, customizable data management architecture, suitable for preserving data over its lifecycle.
     At the iRODS core, a Rule Engine interprets the Rules to decide how the system is to respond to various requests and conditions.
     Interfaces: GUI, Web, WebDAV, CLI
     Operations:
     • Search, Access and View
     • Add/Extract Metadata, Annotate
     • Analyze & Process
     • Manage, Replicate, Copy, Share, Repurpose
     • Track access, Subscribe & more…
     iRODS Server software and Rule Engine run on each data server. The iRODS iCAT Metadata Catalog uses a database to track metadata describing data and everything that happens to it.
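The rule-engine idea described above (rules decide how the system responds to requests and conditions, and a catalog records what happened) can be sketched in a few lines. This is a conceptual toy, not the iRODS rule language or any iRODS API: the class, the event fields, and the `.bam` replication rule are all hypothetical, and a plain list stands in for the metadata catalog.

```python
from typing import Callable, Dict, List, Tuple

Event = Dict[str, str]
Condition = Callable[[Event], bool]
Action = Callable[[Event, List[str]], None]


class ToyRuleEngine:
    """Checks each incoming event against registered rules and runs matching actions."""

    def __init__(self) -> None:
        self.rules: List[Tuple[Condition, Action]] = []
        self.catalog: List[str] = []   # stands in for a metadata catalog

    def add_rule(self, condition: Condition, action: Action) -> None:
        self.rules.append((condition, action))

    def handle(self, event: Event) -> int:
        """Apply every matching rule; return how many rules fired."""
        fired = 0
        for condition, action in self.rules:
            if condition(event):
                action(event, self.catalog)
                fired += 1
        return fired


def is_new_bam(event: Event) -> bool:
    # Hypothetical condition: a BAM file was just uploaded.
    return event.get("op") == "put" and event.get("path", "").endswith(".bam")


def record_replication(event: Event, catalog: List[str]) -> None:
    # Hypothetical action: note in the catalog that a replica should be made.
    catalog.append(f"replicate {event['path']}")


if __name__ == "__main__":
    engine = ToyRuleEngine()
    engine.add_rule(is_new_bam, record_replication)
    engine.handle({"op": "put", "path": "/seq/run1/sample.bam"})
    engine.handle({"op": "put", "path": "/seq/run1/notes.txt"})
    print(engine.catalog)
```

The design point is that data management behavior lives in rules that administrators can change, rather than being hard-coded into every client that touches the data.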
  24. Implementation: Science DMZ + ILM [diagram]
  25. NGS Data Analysis (on a Hybrid HPC Cloud)
     General Concept
     • On-site local resources are a “cache” that exists…
       • To be used constantly
       • For best data locality
       • For specialized resources
       • For security
     • Elastic resources from public or parent organization’s private cloud
     • The middleware offers cloud-style IaaS and/or PaaS
     • Multi-tenant: users/virtual communities can spin up their own resources, clusters, etc.
     • These on-demand systems accommodate unique software configurations and services (suited to varying NGS workflows, etc.)
     Sounds great, but…
     • It will be a while before you can pull a solution like this off the shelf
     • Would be a good candidate for a converged infrastructure offering
     Goal: HPC-like performance and latency, cloud-like elasticity and provisioning
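The "local resources as a cache, elastic resources from the cloud" concept can be sketched as a placement decision. The sketch below is a simplified illustration, not a description of any real middleware: the class names, the single `sensitive` flag standing in for the security bullet, and the rule that sensitive jobs wait rather than burst are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Job:
    name: str
    cores: int
    sensitive: bool = False   # assumed flag: data must stay on site


@dataclass
class HybridScheduler:
    local_cores: int
    used_local: int = 0
    placements: List[Tuple[str, str]] = field(default_factory=list)

    def place(self, job: Job) -> str:
        """Prefer local capacity; burst to the cloud only when allowed and needed."""
        free = self.local_cores - self.used_local
        if job.cores <= free:
            self.used_local += job.cores
            target = "local"    # keep on-site resources constantly used
        elif job.sensitive:
            target = "queued"   # wait for local capacity instead of leaving the site
        else:
            target = "cloud"    # elastic overflow
        self.placements.append((job.name, target))
        return target


if __name__ == "__main__":
    scheduler = HybridScheduler(local_cores=64)
    for job in (Job("align", 48), Job("variant-call", 32),
                Job("clinical", 32, sensitive=True), Job("qc", 16)):
        print(f"{job.name}: {scheduler.place(job)}")
```

A real implementation would also weigh data locality and transfer cost before bursting, which is the part the slide suggests is not yet available off the shelf.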
  26. Implementation: Science DMZ + ILM + HPC Cloud [diagram]
  27. Parting Thoughts & Lessons Learned
     1. Confirmation Bias
        • Just because it wasn’t viable before doesn’t mean it won’t ever be
     2. Depth Perception
        • Bleeding Edge? Leading Edge? State of the Art? Legacy? Ready to Sunset?
     3. Outliers
        • The existence of edge or corner cases does not necessarily invalidate a solution, but it does mean you had better understand the scope the solution covers
     4. The Power of &&
        • Multipart solutions are seen as complex and abandoned in search of a silver bullet
        • Combining ideas is more collaborative and doesn’t force an ultimatum
     5. Game Theory
        • Bringing a chess set to a checkers tournament…
     6. Relationship over Technology
        • Work with vendors and collaborators that are interested in making a long term investment in what you do
  28. Thank You
     Questions and Discussion Welcome
     Aaron D. Gardner, Senior Scientific Consultant, BioTeam, Inc., aaron@bioteam.net
