GTC 2013: Practical NGS Data Management


Slides from my presentation at the GTC Bioinformatics & Data Management Strategies workshop:

  Practical NGS Data Management
2013 GTC Bioinformatics & Data Management Strategies - San Francisco
  I'm Chris.
I'm an infrastructure geek.
I work for the BioTeam.com - Twitter: @chris_dag
  Who, What, Why ...
BioTeam
‣ Independent consulting shop
‣ Staffed by scientists forced to learn IT, SW & HPC to get our own research done
‣ 10+ years bridging the "gap" between science, IT & high performance computing
‣ We get to see how many groups of smart people tackle similar problems
  Listen to me at your own risk
Standard Dag Disclaimer
‣ I'm not an expert, pundit, visionary or "thought leader"
‣ Any career success entirely due to shamelessly copying what actual smart people do
‣ I'm biased, burnt-out & cynical
‣ Filter my words accordingly
  So why are you here?
  It's a risky time to be doing Bio-IT
  Science progressing way faster than IT can refresh/change
The Central Problem Is ...
‣ Instrumentation & protocols are changing FAR FASTER than we can refresh our Research-IT & Scientific Computing infrastructure
• Bench science is changing month-to-month ...
• ... while our IT infrastructure only gets refreshed every 2-7 years
‣ We have to design systems TODAY that can support unknown research requirements & workflows over many years (gulp ...)
  The Central Problem Is ...
‣ The easy period is over
‣ 5 years ago we could toss inexpensive storage and servers at the problem; even in a nearby closet or under a lab bench if necessary
‣ That does not work anymore; real solutions required
  The new normal.
  We are here today because ...
‣ It has never been easier to acquire vast amounts of data cheaply and easily
‣ Growth rate of data creation/ingest exceeds rate at which the storage industry is improving disk capacity
‣ Not just a storage lifecycle problem. This data *moves* and often needs to be shared among multiple entities and providers
• ... ideally without punching holes in your firewall or consuming all available internet bandwidth
  If you get it wrong ...
‣ Lost opportunity
‣ Missing capability
‣ Frustrated & very vocal scientific staff
‣ Problems in recruiting, retention, publication & product development
  Topic #1: "Meta Issues"
  Get Comfortable With Insane Rates Of Change
Meta Issue #1
‣ Genome Sequencing innovation rate is simply ludicrous.
‣ Similar rapid change in tools & lab protocols
‣ We MUST acknowledge and plan for disruptive science affecting IT systems and methods
  Get Comfortable With Insane Rates Of Change
Meta Issue #1, continued
‣ Multiple ways to approach this, often affected by how your funding cycle works
1. Over-design & Over-provision
2. Spend upfront on agility & scaling at the IT core
3. Incremental refresh of smaller, "right-sized" systems
  Know Your Platforms
Meta Issue #2
‣ NGS Platform has huge impact on IT footprint
‣ Which category do you fall in?
• Single NGS instrument? Multiple NGS instruments?
• One NGS vendor or many?
• Outsourced sequencing?
• Outsourced sequencing + analysis?
• Hybrid mix of onsite & outsourced?
  Understand The Duty Cycle
Meta Issue #3
‣ NGS "duty cycle" also has huge influence on size and shape of the IT footprint
‣ What type of lab are you running?
• 24x7 NGS Industrial Production?
• Central NGS Core Facility?
• Single PI, Department or Workgroup?
‣ Operational & Analysis Workflow?
• Sequence .... Analyze .... Sequence?
• Sequence ... Sequence ... Sequence ... Analysis
  Know The Tool Landscape
Meta Issue #4
‣ Analysis & data tools are changing almost as rapidly as NGS chemistry & platforms
• Open source, commercial, on-premise and cloud-hosted are all in the mix for 2013
‣ Understand that your software landscape and toolchain may change significantly (multiple times) over the lifespan of your NGS efforts
‣ IT people also need solid understanding of WHO makes algorithm & toolchain decisions
  Diagram Your Pipelines & Data Flows
Meta Issue #5
‣ Different groups with the same NGS platform will do wildly different things
• Low end example: push FASTQ into CLCBio, export VCF files and "call it a day"
• High end example: custom reference alignment or de-novo assembly followed by intense human-driven bioinformatics
• Complex example: genomic medicine & NGS data being used to drive clinical decisions
  Can we do an NGS talk without using the 'C' word?
  Whether you like it or not ...
NGS & "The Cloud"
‣ You can't ignore or avoid the cloud
‣ Period.
  Whether you like it or not ...
NGS & "The Cloud"
‣ Why you can't ignore the cloud in 2013
1. NGS data flows small enough to allow "write to cloud"
2. NGS vendors are forcing the issue
3. Our local storage is increasingly becoming "cloud aware"
4. Your users may prefer a cloud-hosted solution
5. Sequencing partners can deliver directly to the cloud
6. Easier to partner on NGS analysis & data distribution
7. Cloud economic (particularly storage) trends are clear
  Whether you like it or not ...
NGS & "The Cloud"
‣ Why you need to start work NOW
‣ Blunt Truth: 90% of cloud technical bits are easy to understand and fast to implement.
‣ Almost frictionless to access the cloud and NGS vendors have a vested interest in making it faster and easier; cloud may be in use today without your knowledge
‣ Risk of scientists bypassing/leapfrogging internal IT
  Whether you like it or not ...
NGS & "The Cloud"
‣ You need to start NOW because of the 10% of "cloud stuff" that is neither fast nor easy ...
• Internal policies/procedures & risk assessment
• Adding additional internet capacity takes time
• Safe networking, firewall, VPN, VPC and Identity Management implementations require experts to design and potentially lengthy implementation periods
‣ "Accessing" cloud is easy. Using it properly, safely & persistently is not easy and not trivial.
  Subnets & VPC can be more complex than the compute & storage
  Storage.
(the hard bit ...)
  Storage & Information Management
‣ Compute power in 2013 is a cheap commodity
‣ Storage? Not so much.
• Still many ways to spectacularly waste money
• Incredible diversity of vendors, products & capability
‣ A significant percentage of your budget and pre-purchase design efforts should center around storage, data movement & data lifecycle management
  Storage & Information Management
‣ Only time for 3 bits of advice:
1. The need for a default storage 'stance'
2. WHAT you store is as important as HOW you store it
3. Importance of non-crap metrics
  Storage 'Stance'
‣ Storage landscape is immense & diverse
• 100TB storage can be bought for $12,000 - $400,000
‣ You need a 'default stance'
‣ Good news is you have many options
‣ ... your 'stance' is often defined by how your budget and funding cycles work
  Monolithic "all-in-one"
Storage 'Stance' - Option 1
‣ Sized for the future on Day #1
• Purchased upfront with future need in mind; looks overbuilt and over-provisioned early on
‣ Good for:
• Groups with "one-shot" funding & little refresh chances over the lifespan of the platform
‣ Not great if:
• Business or science changes unexpectedly
• You did your sizing/scaling/growth calculations wrong
  Tier 1 Storage - Easy to Grow & Manage
Storage 'Stance' - Option 2
‣ Invest upfront in peta-capable single-namespace & low operational burden
• Enterprise-grade storage that is very easy to manage, maintain and grow over time
‣ Good for:
• Organizations where getting new headcount is harder than spending CapEx; Intentional spend on hardware that does not require additional humans to run & maintain it.
• Organizations with budgeting that allows for incremental refresh cycles
• Organizations without onsite gurus & dedicated storage admins
‣ Downside:
• Upfront & ongoing investment can be large; possibly affecting compute, tools or software budget
• Expensive relative to alternatives. You are paying for "future-proof" scalability & systems engineered for the lowest possible operational burden
  Getting clever & straddling Tier 1 and Tier 2 Storage
Storage 'Stance' - Option 3
‣ Strive for peta-capable single-namespace and easy operation
• ... but be willing to make modest trade-offs in exchange for lower cost
• Look at both Tier 1 and Tier 2 storage vendors (and people like Cambridge Computer)
‣ Good for:
• Organizations willing to take a more active role in vendor selection, design, deployment & operation
• Organizations motivated by ROI and willing to make modest trade-offs in capability, performance or operational burden in exchange for lower CapEx cost
‣ Downside:
• More risk in this area - easy to make a misguided decision. Requires brains & active interest in pre-sale design and vendor selection process. May require more storage admin effort day-to-day. Some trade-offs are better than others.
  Clever but not dumb
Storage 'Stance' - Option 4
‣ Midrange "Cheap & Clever"
• There are tons of very interesting Tier 2 and Tier 3 storage options available. The hard part is separating the good stuff from the crap stuff.
• Check out: RAID Inc, NexentaStor, NexSan, etc. etc.
‣ Good for:
• Budget constrained groups with motivated IT people
‣ Downside
• Might have to throw away stuff as you outgrow it ("forklift upgrade")
• Careful pre-purchase work required to properly config/size it
• Storage design may force changes on scientific workflows
• Higher administrative burden
  DIY & Super Cheap
Storage 'Stance' - Option 5
‣ DIY & Disruptive
• Incredibly disruptive stuff is out there for motivated DIY'ers and people who can't afford Tier 1 and Tier 2 platforms; This is where you can spend $12,000 on a 100TB storage node.
• Driven largely by high-density x86_64 server chassis and many people writing clever software (both free and commercial). There are NAS, SAN, Parallel and Distributed filesystem options all in this realm
‣ Good for:
• Smart people & organizations with guru storage & sysadmin resources.
• People with no money or people who spent all their money on NGS instrument & reagents and "forgot about all that IT stuff ..."
‣ Downside
• Non-trivial risk to science. Catastrophic data loss and science-disrupting downtime can all easily occur down at this level; Mess up badly and you will LOSE YOUR JOB
  WHAT you store is as important as HOW you store it
  Information Management
‣ Often Overlooked
• Hopefully previous speakers convinced you of the value gained from "information lifecycle management"
‣ The Core Problem
• POSIX filesystem semantics are insufficient for storing all of the attributes and information we want to tag our data with
• ... "something else" is required
  Something else is required ...
Information Management
‣ The large NGS heavy hitters are all looking at "metadata aware" storage as the ultimate solution
‣ Small & midrange NGS shops usually leverage LIMS with a bit of storage reporting/analytics
‣ LIMS warning:
• NGS vendors tend to assume you will only use NGS instruments that they make! Their software may not handle a future "multi-platform" NGS environment
• Beware of the time/effort/cost required to modify many LIMS systems that are on the market today
• BioTeam consulting has resulted in some products being made in this space: MiniLIMS, Slipstream NGS & Galaxy Editions
  Importance of 'non-crap' metrics
  Storage Metrics
‣ It is VERY important that you understand what you are storing and what the short, medium and long-term trend lines look like
‣ Very few people actually bother to do this
‣ ... and many that do end up producing pretty graphs that look good on dashboards but don't actually help drive scaling, refresh or upgrade decisions.
‣ You need metrics that can drive actionable decisions related to storage management and growth
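One way to turn "actionable" into a number: estimate how many months of headroom are left, based on recent growth. This is only a minimal sketch. The function name, the monthly sampling interval and the sample figures are all assumptions made for illustration, not anything taken from the slides.

```python
from statistics import mean


def months_until_full(monthly_used_tb, capacity_tb, window=6):
    """Estimate months of headroom from recent month-over-month growth.

    monthly_used_tb: used capacity in TB, one sample per month, oldest first
    capacity_tb:     usable capacity of the storage system in TB
    window:          how many of the most recent monthly deltas to average
    """
    # window deltas need window + 1 samples; shorter histories use what exists
    recent = monthly_used_tb[-(window + 1):]
    if len(recent) < 2:
        raise ValueError("need at least two monthly samples")
    deltas = [later - earlier for earlier, later in zip(recent, recent[1:])]
    growth_per_month = mean(deltas)
    if growth_per_month <= 0:
        # flat or shrinking usage: no projected fill date
        return float("inf")
    return (capacity_tb - recent[-1]) / growth_per_month


# Hypothetical numbers: 10 TB/month growth on a 220 TB system
samples = [100, 110, 120, 130, 140, 150, 160]
print(f"{months_until_full(samples, 220):.1f} months of headroom")
```

A "months until full" figure maps directly onto a purchasing or refresh lead time, which a pretty utilization graph does not.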
  Some (Biased) Examples ...
  It's 2013 ... we know what questions to ask about our storage
  A 6 month rolling window provides real/actionable info ...
  Critical to have a handle on "raw" vs "derived" data also
Raw 70% Derived 30% PacBio (1.55 TB vs .569 TB)
Raw 86% Derived 14% Roche454 (4.55 TB vs .757 TB)
Raw 85% Derived 15% Illumina (10.171 TB vs 1.86 TB)
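The raw vs. derived split is simple arithmetic on the per-platform TB totals. Here is a rough sketch using the figures above; the function name is made up for illustration. The Roche454 and Illumina numbers reproduce the percentages shown, while the PacBio totals work out closer to 73/27 than 70/30, so the slide figure may be rounded or taken from a different snapshot.

```python
def raw_derived_split(raw_tb, derived_tb):
    """Return (raw %, derived %) as whole-number percentages of the total."""
    total = raw_tb + derived_tb
    if total <= 0:
        raise ValueError("total storage must be positive")
    raw_pct = round(100 * raw_tb / total)
    return raw_pct, 100 - raw_pct


# TB figures as listed above: (raw, derived)
platforms = {
    "PacBio": (1.55, 0.569),
    "Roche454": (4.55, 0.757),
    "Illumina": (10.171, 1.86),
}

for name, (raw_tb, derived_tb) in platforms.items():
    raw_pct, derived_pct = raw_derived_split(raw_tb, derived_tb)
    print(f"{name}: raw {raw_pct}% / derived {derived_pct}%")
```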
  Physical & Network NGS Ingest
  You need a plan for both network and physical ingest
NGS Data Ingest
‣ Whatever your 'stance' is today regarding ingest of external NGS data it will almost certainly change over time
• ... interesting public domain data sets
• Data from collaborators & partners
• Moving data among your own organization
‣ Plan for both 'network' and 'physical' methods
  You need a plan for both network and physical ingest
NGS Data Ingest
‣ Ingest is hard. It may seem easy but it's not, especially if you care about data integrity.
• Are you validating MD5 checksums on every file each time it moves from location A to location B?
‣ ... it can also sap
  "Naked" Data Movement
Wednesday, June 19, 13
  Physical data movement station; Unit = Naked Disk
  "Naked" Data Archive
  Cloud/Network-based Data Movement
High speed 7+ hour sustained transfer from US East to West Coast
Sufficient for a NGS core facility ...
  You need a plan for both network and physical ingest
Physical NGS Data Ingest
‣ Physical ingest is best done with dedicated hardware and (ideally) a dedicated workstation
‣ Things to think about
• How are you labeling/storing/tracking physical media? Who does the work? Expensive PhD? IT staff? Is there a written SOP guiding the process?
• How does physical media end up at your loading dock? Where does it go after that?
• Is your ingest workstation fast enough to handle MD5 checksumming on the fly? Enough RAM for lots of TCP sessions?
• Is your ingest station physically located in an optimal network location to facilitate the data movement to core storage?
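To make the "MD5 checksumming on the fly" question concrete, here is a minimal sketch of a copy that hashes the data as it streams, then re-reads the destination to confirm it matches. The function names and chunk size are illustrative assumptions; a real ingest station would also need logging, retries and handling for vendor-supplied checksum manifests.

```python
import hashlib
from pathlib import Path

CHUNK = 1024 * 1024  # 1 MiB reads, so large files are never held fully in RAM


def md5_of(path: Path) -> str:
    """Return the hex MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(CHUNK), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_with_md5(src: Path, dst: Path) -> str:
    """Copy src to dst while hashing the stream, then verify dst on re-read."""
    digest = hashlib.md5()
    with src.open("rb") as fin, dst.open("wb") as fout:
        for chunk in iter(lambda: fin.read(CHUNK), b""):
            digest.update(chunk)  # hash while copying: one pass over the source
            fout.write(chunk)
    expected = digest.hexdigest()
    actual = md5_of(dst)          # second pass: confirm what actually landed
    if actual != expected:
        raise IOError(f"checksum mismatch for {dst}: {actual} != {expected}")
    return expected
```

Note that re-reading a file immediately after writing it may be served from the operating system's cache rather than the disk, so a stricter workflow would verify again later or from another host.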
  You need a plan for both network and physical ingest
Network NGS Data Ingest
‣ Network ingest (at high speed) requires advance planning and potential infrastructure
‣ Things to think about
• Commercial via Aspera? OpenSource via GridFTP? Something else?
• How exactly will you safely get data inside your organization via the internet? How do you move from DMZ through firewall and onto your internal scientific IT systems?
• Can you move data at speed without taking down VOIP and Teleconferencing systems or making network admins cry?
• Will the IDS or Firewall doing deep packet inspection or protocol reassembly melt under the load?
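A quick back-of-the-envelope calculation helps with the planning questions above: how long would a given data set occupy the link? The sketch below is only an estimate. The function name and the default 70% efficiency factor are assumptions for illustration, and real throughput depends on protocol, latency and whatever else shares the pipe.

```python
def transfer_hours(size_tb, link_gbps, efficiency=0.7):
    """Estimate hours needed to move size_tb terabytes over a link.

    size_tb:    data set size in decimal terabytes (1 TB = 1e12 bytes)
    link_gbps:  nominal link speed in gigabits per second
    efficiency: assumed fraction of the nominal rate actually sustained
    """
    if link_gbps <= 0 or not 0 < efficiency <= 1:
        raise ValueError("link speed must be positive and efficiency in (0, 1]")
    bits = size_tb * 1e12 * 8
    usable_bits_per_second = link_gbps * 1e9 * efficiency
    return bits / usable_bits_per_second / 3600


# Hypothetical example: a 10 TB delivery over a 1 Gbps internet connection
print(f"{transfer_hours(10, 1):.1f} hours")
```

Even at a perfect 100% efficiency, 1 TB over a 1 Gbps link takes about 2.2 hours, which is why sustained multi-hour transfers and their effect on other network users need to be planned for.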
  Wrap-up: Summary Tips
  Ending Advice: 1 of 6
‣ Understand the 'interesting time' we are in
• Science is changing faster than we can refresh IT
• Disruptive innovation in the NGS space itself
‣ Advice:
• Spend as much time thinking about future flexibility as you spend on actual current needs & requirements
  Ending Advice: 2 of 6
‣ NGS Assumptions don't last very long
• Will you change NGS vendor, platform or method?
• Will the tools in use today still be in use tomorrow?
• How will the "local vs. outsourced vs. cloud" landscape change for you over the next few years?
‣ Advice:
• Avoid things that lock you into a vendor or platform
• Look long and hard at your default assumptions
  Ending Advice: 3 of 6
‣ You need Physical & Network Ingest Planning
• You may have standardized on one method or practice but there will always be outliers and unexpected situations; data always seems to be on the move!
- NGS data volumes mean outliers are non-trivial to handle
‣ Advice:
• Just think about how you would handle the edge cases and the unexpected; don't go crazy with upfront investment.
  Ending Advice: 4 of 6
‣ You need a cloud strategy. Today.
- Your users or vendors may force the issue
- The economic trend lines make cloud inescapable
- 90% of cloud is "easy". Remaining 10% takes time & effort
‣ Advice:
• 100% Cloud is not unreasonable in 2013*
• Do the boring/long work now (policies, procedure, etc.)
• Consider laying the tech groundwork (Bandwidth, VPN, VPC & Identity Management) now so you can easily and simply make use of the cloud when needed
  Ending Advice: 5 of 6
‣ Compute & Analysis
‣ Advice:
• Compute power is essentially a commodity in 2013
- Both local and "on the cloud"
• Easy and relatively inexpensive to acquire and deploy
• There are some potential 'gotcha' and tuning areas that deserve advance thought and attention
- ... but relative to storage & data it's an "easy" problem area
  Ending Advice: 6 of 6
‣ Storage & Data Management
‣ Advice:
• Bulk of your attention & budget goes here
• Huge diversity in product and feature offerings means more risk & more chances of mistakes
- Outside expertise & NGS-aware vendors like Accunet & Cambridge Computer really can act as "value added resellers"
• Pick the "default stance" that best matches your organization's funding & staffing model and build around that
  end; Thanks!
