GTC 2013: Practical NGS Data Management


Slides from my workshop presentation at the GTC Bioinformatics & Data Management Strategies Workshop.

Drop me a line if you want the full PDF download.


  1. Practical NGS Data Management
     2013 GTC Bioinformatics & Data Management Strategies - San Francisco
     Wednesday, June 19, 13
  2. I'm Chris. I'm an infrastructure geek. I work for BioTeam. Twitter: @chris_dag
  3. Who, What, Why ... BioTeam
     ‣ Independent consulting shop
     ‣ Staffed by scientists forced to learn IT, SW & HPC to get our own research done
     ‣ 10+ years bridging the "gap" between science, IT & high-performance computing
     ‣ We get to see how many groups of smart people tackle similar problems
  4. Standard Dag Disclaimer (listen to me at your own risk)
     ‣ I'm not an expert, pundit, visionary or "thought leader"
     ‣ Any career success entirely due to shamelessly copying what actual smart people do
     ‣ I'm biased, burnt-out & cynical
     ‣ Filter my words accordingly
  5. So why are you here?
  6. It's a risky time to be doing Bio-IT
  7. The Central Problem Is ... (science progressing way faster than IT can refresh/change)
     ‣ Instrumentation & protocols are changing FAR FASTER than we can refresh our Research-IT & Scientific Computing infrastructure
       • Bench science is changing month-to-month ...
       • ... while our IT infrastructure only gets refreshed every 2-7 years
     ‣ We have to design systems TODAY that can support unknown research requirements & workflows over many years (gulp ...)
  8. The Central Problem Is ...
     ‣ The easy period is over
     ‣ 5 years ago we could toss inexpensive storage and servers at the problem; even in a nearby closet or under a lab bench if necessary
     ‣ That does not work anymore; real solutions required
  9. The new normal.
  10. We are here today because ...
     ‣ It has never been easier to acquire vast amounts of data cheaply and easily
     ‣ Growth rate of data creation/ingest exceeds the rate at which the storage industry is improving disk capacity
     ‣ Not just a storage lifecycle problem. This data *moves* and often needs to be shared among multiple entities and providers
       • ... ideally without punching holes in your firewall or consuming all available internet bandwidth
  11. If you get it wrong ...
     ‣ Lost opportunity
     ‣ Missing capability
     ‣ Frustrated & very vocal scientific staff
     ‣ Problems in recruiting, retention, publication & product development
  12. Topic #1: "Meta Issues"
  13. Meta Issue #1: Get Comfortable With Insane Rates Of Change
     ‣ Genome sequencing innovation rate is simply ludicrous
     ‣ Similar rapid change in tools & lab protocols
     ‣ We MUST acknowledge and plan for disruptive science affecting IT systems and methods
  14. Meta Issue #1, continued
     ‣ Multiple ways to approach this, often affected by how your funding cycle works
       1. Over-design & over-provision
       2. Spend upfront on agility & scaling at the IT core
       3. Incremental refresh of smaller, "right-sized" systems
  15. Meta Issue #2: Know Your Platforms
     ‣ NGS platform has huge impact on IT footprint
     ‣ Which category do you fall in?
       • Single NGS instrument? Multiple NGS instruments?
       • One NGS vendor or many?
       • Outsourced sequencing?
       • Outsourced sequencing + analysis?
       • Hybrid mix of onsite & outsourced?
  16. Meta Issue #3: Understand The Duty Cycle
     ‣ NGS "duty cycle" also has huge influence on the size and shape of the IT footprint
     ‣ What type of lab are you running?
       • 24x7 NGS industrial production?
       • Central NGS core facility?
       • Single PI, department or workgroup?
     ‣ Operational & analysis workflow?
       • Sequence ... analyze ... sequence?
       • Sequence ... sequence ... sequence ... analysis?
  17. Meta Issue #4: Know The Tool Landscape
     ‣ Analysis & data tools are changing almost as rapidly as NGS chemistry & platforms
       • Open source, commercial, on-premise and cloud-hosted are all in the mix for 2013
     ‣ Understand that your software landscape and toolchain may change significantly (multiple times) over the lifespan of your NGS efforts
     ‣ IT people also need a solid understanding of WHO makes algorithm & toolchain decisions
  18. Meta Issue #5: Diagram Your Pipelines & Data Flows
     ‣ Different groups with the same NGS platform will do wildly different things
       • Low-end example: push FASTQ into CLCBio, export VCF files and "call it a day"
       • High-end example: custom reference alignment or de-novo assembly followed by intense human-driven bioinformatics
       • Complex example: genomic medicine & NGS data being used to drive clinical decisions
  19. Can we do an NGS talk without using the 'C' word?
  20. NGS & "The Cloud" (whether you like it or not ...)
     ‣ You can't ignore or avoid the cloud
     ‣ Period.
  21. NGS & "The Cloud": why you can't ignore the cloud in 2013
     1. NGS data flows small enough to allow "write to cloud"
     2. NGS vendors are forcing the issue
     3. Our local storage is increasingly becoming "cloud aware"
     4. Your users may prefer a cloud-hosted solution
     5. Sequencing partners can deliver directly to the cloud
     6. Easier to partner on NGS analysis & data distribution
     7. Cloud economic (particularly storage) trends are clear
  22. NGS & "The Cloud": why you need to start work NOW
     ‣ Blunt truth: 90% of the cloud technical bits are easy to understand and fast to implement
     ‣ Almost frictionless to access the cloud, and NGS vendors have a vested interest in making it faster and easier; cloud may be in use today without your knowledge
     ‣ Risk of scientists bypassing/leapfrogging internal IT
  23. NGS & "The Cloud": you need to start NOW because of the 10% of "cloud stuff" that is neither fast nor easy ...
     • Internal policies/procedures & risk assessment
     • Adding additional internet capacity takes time
     • Safe networking, firewall, VPN, VPC and identity management implementations require experts to design, and potentially lengthy implementation periods
     ‣ "Accessing" the cloud is easy. Using it properly, safely & persistently is not easy and not trivial.
  24. Subnets & VPC can be more complex than the compute & storage
  25. Storage. (the hard bit ...)
  26. Storage & Information Management
     ‣ Compute power in 2013 is a cheap commodity
     ‣ Storage? Not so much.
       • Still many ways to spectacularly waste money
       • Incredible diversity of vendors, products & capability
     ‣ A significant percentage of your budget and pre-purchase design efforts should center around storage, data movement & data lifecycle management
  27. Storage & Information Management
     ‣ Only time for 3 bits of advice:
       1. The need for a default storage 'stance'
       2. WHAT you store is as important as HOW you store it
       3. Importance of non-crap metrics
  28. Storage 'Stance'
     ‣ Storage landscape is immense & diverse
       • 100TB of storage can be bought for $12,000 - $400,000
     ‣ You need a 'default stance'
     ‣ Good news is you have many options
     ‣ ... your 'stance' is often defined by how your budget and funding cycles work
  29. Storage 'Stance' - Option 1: Monolithic "all-in-one"
     ‣ Sized for the future on Day #1
       • Purchased upfront with future need in mind; looks overbuilt and over-provisioned early on
     ‣ Good for:
       • Groups with "one-shot" funding & little chance of refresh over the lifespan of the platform
     ‣ Not great if:
       • Business or science changes unexpectedly
       • You did your sizing/scaling/growth calculations wrong
  30. Storage 'Stance' - Option 2: Tier 1 storage, easy to grow & manage
     ‣ Invest upfront in peta-capable single-namespace & low operational burden
       • Enterprise-grade storage that is very easy to manage, maintain and grow over time
     ‣ Good for:
       • Organizations where getting new headcount is harder than spending CapEx; intentional spend on hardware that does not require additional humans to run & maintain it
       • Organizations with budgeting that allows for incremental refresh cycles
       • Organizations without onsite gurus & dedicated storage admins
     ‣ Downside:
       • Upfront & ongoing investment can be large, possibly affecting compute, tools or software budget
       • Expensive relative to alternatives. You are paying for "future-proof" scalability & systems engineered for the lowest possible operational burden
  31. Storage 'Stance' - Option 3: Getting clever & straddling Tier 1 and Tier 2 storage
     ‣ Strive for peta-capable single-namespace and easy operation
       • ... but be willing to make modest trade-offs in exchange for lower cost
       • Look at both Tier 1 and Tier 2 storage vendors (and people like Cambridge Computer)
     ‣ Good for:
       • Organizations willing to take a more active role in vendor selection, design, deployment & operation
       • Organizations motivated by ROI and willing to make modest trade-offs in capability, performance or operational burden in exchange for lower CapEx cost
     ‣ Downside:
       • More risk in this area; easy to make a misguided decision. Requires brains & active interest in the pre-sale design and vendor selection process. May require more storage admin effort day-to-day. Some trade-offs are better than others.
  32. Storage 'Stance' - Option 4: Clever but not dumb
     ‣ Midrange "Cheap & Clever"
       • There are tons of very interesting Tier 2 and Tier 3 storage options available. The hard part is separating the good stuff from the crap stuff.
       • Check out: RAID Inc, NexentaStor, NexSan, etc.
     ‣ Good for:
       • Budget-constrained groups with motivated IT people
     ‣ Downside:
       • Might have to throw away stuff as you outgrow it ("forklift upgrade")
       • Careful pre-purchase work required to properly config/size it
       • Storage design may force changes on scientific workflows
       • Higher administrative burden
  33. Storage 'Stance' - Option 5: DIY & Super Cheap
     ‣ DIY & Disruptive
       • Incredibly disruptive stuff is out there for motivated DIYers and people who can't afford Tier 1 and Tier 2 platforms; this is where you can spend $12,000 on a 100TB storage node
       • Driven largely by high-density x86_64 server chassis and many people writing clever software (both free and commercial). There are NAS, SAN, parallel and distributed filesystem options all in this realm
     ‣ Good for:
       • Smart people & organizations with guru storage & sysadmin resources
       • People with no money, or people who spent all their money on NGS instruments & reagents and "forgot about all that IT stuff ..."
     ‣ Downside:
       • Non-trivial risk to science. Catastrophic data loss and science-disrupting downtime can easily occur down at this level; mess up badly and you will LOSE YOUR JOB
  34. WHAT you store is as important as HOW you store it
  35. Information Management
     ‣ Often overlooked
       • Hopefully previous speakers convinced you of the value gained from "information lifecycle management"
     ‣ The core problem
       • POSIX filesystem semantics are insufficient for storing all of the attributes and information we want to tag our data with
       • ... "something else" is required
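For a small shop, one pragmatic "something else" is a metadata sidecar file. This is a minimal sketch, not anything from the talk: the helper names and attribute fields are illustrative, and a larger operation would want a LIMS or metadata-aware storage instead.

```python
import json
from pathlib import Path

def tag_file(data_path, **attrs):
    """Write a JSON sidecar next to a data file, holding attributes
    (instrument, run ID, sample, retention class) that plain POSIX
    filesystem semantics have no place for."""
    sidecar = Path(str(data_path) + ".meta.json")
    sidecar.write_text(json.dumps(attrs, indent=2, sort_keys=True))
    return sidecar

def read_tags(data_path):
    """Read the sidecar back; an untagged file yields an empty dict."""
    sidecar = Path(str(data_path) + ".meta.json")
    return json.loads(sidecar.read_text()) if sidecar.exists() else {}
```

Sidecars travel with the data when a directory tree is copied or archived, which is exactly the property a separate database of paths lacks.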
  36. Information Management: something else is required ...
     ‣ The large NGS heavy hitters are all looking at "metadata-aware" storage as the ultimate solution
     ‣ Small & midrange NGS shops usually leverage LIMS with a bit of storage reporting/analytics
     ‣ LIMS warning:
       • NGS vendors tend to assume you will only use NGS instruments that they make! Their software may not handle a future "multi-platform" NGS environment
       • Beware of the time/effort/cost required to modify many LIMS systems that are on the market today
       • BioTeam consulting has resulted in some products being made in this space: MiniLIMS, Slipstream NGS & Galaxy Editions
  37. Importance of 'non-crap' metrics
  38. Storage Metrics
     ‣ It is VERY important that you understand what you are storing and what the short-, medium- and long-term trend lines look like
     ‣ Very few people actually bother to do this
     ‣ ... and many that do end up producing pretty graphs that look good on dashboards but don't actually help drive scaling, refresh or upgrade decisions
     ‣ You need metrics that can drive actionable decisions related to storage management and growth
  39. Some (Biased) Examples ...
  40. It's 2013 ... we know what questions to ask about our storage
  41. A 6-month rolling window provides real/actionable info ...
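The kind of actionable number a rolling window yields can be sketched in a few lines. This assumes you already collect periodic (date, terabytes-used) samples; the function name and the 180-day default are mine, not from the slides.

```python
from datetime import timedelta

def months_until_full(samples, capacity_tb, window_days=180):
    """Estimate months until storage is exhausted, using only the
    growth observed over a trailing window (~6 months by default).
    `samples` is a list of (date, used_tb) tuples. Returns None when
    usage is flat or shrinking over the window."""
    cutoff = max(d for d, _ in samples) - timedelta(days=window_days)
    window = sorted((d, u) for d, u in samples if d >= cutoff)
    (d0, u0), (d1, u1) = window[0], window[-1]
    days = (d1 - d0).days
    if days == 0 or u1 <= u0:
        return None
    tb_per_day = (u1 - u0) / days
    return (capacity_tb - u1) / tb_per_day / 30.0
```

"About 14 months until the array is full at current growth" is a number that drives a refresh decision; a dashboard line chart of raw usage is not.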
  42. Critical to have a handle on "raw" vs "derived" data also
     • PacBio: 70% raw / 30% derived (1.55 TB vs 0.569 TB)
     • Roche 454: 86% raw / 14% derived (4.55 TB vs 0.757 TB)
     • Illumina: 85% raw / 15% derived (10.171 TB vs 1.86 TB)
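A raw-vs-derived split like the one above can be approximated by walking the tree and bucketing files on extension. The extension lists here are hypothetical placeholders; every platform and pipeline names its products differently.

```python
import os

# Placeholder extension maps -- tune these for your own platforms.
RAW_EXT = (".fastq", ".fastq.gz", ".sff", ".bcl")   # instrument output
DERIVED_EXT = (".bam", ".sam", ".vcf", ".bed")      # analysis products

def raw_vs_derived(root):
    """Total bytes under `root`, bucketed into raw instrument data,
    derived analysis products, and everything else."""
    totals = {"raw": 0, "derived": 0, "other": 0}
    for dirpath, _, files in os.walk(root):
        for name in files:
            size = os.path.getsize(os.path.join(dirpath, name))
            lower = name.lower()
            if lower.endswith(RAW_EXT):
                totals["raw"] += size
            elif lower.endswith(DERIVED_EXT):
                totals["derived"] += size
            else:
                totals["other"] += size
    return totals
```

A large "other" bucket is itself a useful finding: it means your naming conventions are too loose for automated lifecycle policies.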
  43. Physical & Network NGS Ingest
  44. NGS Data Ingest (you need a plan for both network and physical ingest)
     ‣ Whatever your 'stance' is today regarding ingest of external NGS data, it will almost certainly change over time
       • ... interesting public-domain data sets
       • Data from collaborators & partners
       • Moving data within your own organization
     ‣ Plan for both 'network' and 'physical' methods
  45. NGS Data Ingest, continued
     ‣ Ingest is hard. It may seem easy but it's not, especially if you care about data integrity.
       • Are you validating MD5 checksums on every file each time it moves from location A to location B?
     ‣ ... it can also sap a lot of time and effort from your staff if done ad hoc or in a disorganized way
     ‣ Both physical and network-based ingest require non-trivial amounts of upfront thought. Some infrastructure & software may also be required
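The checksum discipline above is easy to script. A sketch (function names are mine) of streaming MD5 plus verification against an md5sum-style manifest, so multi-gigabyte FASTQ/BAM files never need to fit in RAM:

```python
import hashlib
import os

def md5sum(path, chunk=1 << 20):
    """Stream a file through MD5 in 1 MB chunks."""
    h = hashlib.md5()
    with open(path, "rb") as fh:
        for block in iter(lambda: fh.read(chunk), b""):
            h.update(block)
    return h.hexdigest()

def verify_manifest(manifest_path):
    """Check files against an md5sum-style manifest ('<hex>  <name>'
    per line, names relative to the manifest's directory). Returns
    the list of files that failed; empty means the copy is intact."""
    base = os.path.dirname(os.path.abspath(manifest_path))
    failures = []
    with open(manifest_path) as fh:
        for line in fh:
            if not line.strip():
                continue
            expected, name = line.strip().split(None, 1)
            if md5sum(os.path.join(base, name)) != expected:
                failures.append(name)
    return failures
```

Run it once at the source, ship the manifest with the data, and run it again on arrival; any non-empty return means a transfer to redo.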
  46. "Naked" Data Movement
  47. Physical data movement station; unit = naked disk
  48. "Naked" Data Archive
  49. Cloud/Network-based Data Movement
     High-speed 7+ hour sustained transfer from US East to West Coast. Sufficient for an NGS core facility ...
  50. Physical NGS Data Ingest
     ‣ Physical ingest is best done with dedicated hardware and (ideally) a dedicated workstation
     ‣ Things to think about:
       • How are you labeling/storing/tracking physical media? Who does the work? Expensive PhD? IT staff? Is there a written SOP guiding the process?
       • How does physical media end up at your loading dock? Where does it go after that?
       • Is your ingest workstation fast enough to handle MD5 checksumming on the fly? Enough RAM for lots of TCP sessions?
       • Is your ingest station physically located in an optimal network location to facilitate data movement to core storage?
  51. Network NGS Data Ingest
     ‣ Network ingest (at high speed) requires advance planning and potentially new infrastructure
     ‣ Things to think about:
       • Commercial via Aspera? Open source via GridFTP? Something else?
       • How exactly will you safely get data inside your organization via the internet? How do you move from DMZ through firewall and onto your internal scientific IT systems?
       • Can you move data at speed without taking down VOIP and teleconferencing systems, or making network admins cry?
       • Will the IDS or firewall doing deep packet inspection or protocol reassembly melt under the load?
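Before picking between Aspera, GridFTP or something else, it helps to know whether the link can carry the load at all. A back-of-envelope sketch; the 60% efficiency derate is my assumption for protocol overhead and a shared pipe, not a measured figure:

```python
def transfer_hours(data_tb, link_mbps, efficiency=0.6):
    """Hours to move `data_tb` terabytes (decimal TB) over a link of
    `link_mbps` megabits/s, derated for TCP overhead, competing
    traffic (VOIP, teleconferencing) and firewall inspection."""
    bits = data_tb * 1e12 * 8
    usable_bps = link_mbps * 1e6 * efficiency
    return bits / usable_bps / 3600.0
```

If a routine sequencing run would take days to land, that argues for more bandwidth, a WAN accelerator, or falling back to physical ingest.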
  52. Wrap-up: Summary Tips
  53. Ending Advice: 1 of 6
     ‣ Understand the 'interesting time' we are in
       • Science is changing faster than we can refresh IT
       • Disruptive innovation in the NGS space itself
     ‣ Advice:
       • Spend as much time thinking about future flexibility as you spend on actual current needs & requirements
  54. Ending Advice: 2 of 6
     ‣ NGS assumptions don't last very long
       • Will you change NGS vendor, platform or method?
       • Will the tools in use today still be in use tomorrow?
       • How will the "local vs. outsourced vs. cloud" landscape change for you over the next few years?
     ‣ Advice:
       • Avoid things that lock you into a vendor or platform
       • Look long and hard at your default assumptions
  55. Ending Advice: 3 of 6
     ‣ You need physical & network ingest planning
       • You may have standardized on one method or practice, but there will always be outliers and unexpected situations; data always seems to be on the move!
       - NGS data volumes mean outliers are non-trivial to handle
     ‣ Advice:
       • Just think about how you would handle the edge cases and the unexpected; don't go crazy with upfront investment.
  56. Ending Advice: 4 of 6
     ‣ You need a cloud strategy. Today.
       - Your users or vendors may force the issue
       - The economic trend lines make cloud inescapable
       - 90% of cloud is "easy"; the remaining 10% takes time & effort
     ‣ Advice:
       • 100% cloud is not unreasonable in 2013
       • Do the boring/long work now (policies, procedures, etc.)
       • Consider laying the tech groundwork (bandwidth, VPN, VPC & identity management) now so you can easily and simply make use of the cloud when needed
  57. Ending Advice: 5 of 6
     ‣ Compute & Analysis
     ‣ Advice:
       • Compute power is essentially a commodity in 2013
       - Both local and "on the cloud"
       • Easy and relatively inexpensive to acquire and deploy
       • There are some potential 'gotcha' and tuning areas that deserve advance thought and attention
       - ... but relative to storage & data it's an "easy" problem area
  58. Ending Advice: 6 of 6
     ‣ Storage & Data Management
     ‣ Advice:
       • The bulk of your attention & budget goes here
       • Huge diversity in product and feature offerings means more risk & more chances of mistakes
       - Outside expertise & NGS-aware vendors like Accunet & Cambridge Computer really can act as "value-added resellers"
       • Pick one of the "default stances" that best matches your organization's funding & staffing model and build around that
  59. end; Thanks! Slides:
  61. Infrastructure Tour: what does this stuff look like?
  62. The cliche image
  63. Lab-local HPC & storage
  64. More lab-local kit
  65. Small core w/ multiple NGS instrument support
  66. Small cluster; large storage
  67. Mid-sized core facility
  68. Large Core Facility
  69. Large Core Facility: Just Storage
  70. Regional Scientific Computing "Hub"
  71. Petabyte-scale Storage
  72. Yep. This counts.
     16 monster compute nodes + 22 GPU nodes. Cost? 30 bucks an hour via the AWS Spot Market. Real-world screenshot from mid-2012.
  73. Physical data movement station
  74. Physical data movement station; unit = naked disk
  75. Cloud/Network-based Data Movement: high-speed 7+ hour sustained transfer from US East to West Coast. Sufficient for an NGS core facility ...
  76. "Naked" Data Movement
  77. "Naked" Data Archive