
BioIT Trends - 2014 Internet2 Technology Exchange


This is a custom "Bio IT trends/problems" deck that I did for a general but highly technical audience at the 2014 Internet2 Technology Exchange conference.

Download of the raw PPT is disabled; contact me at chris@bioteam.net if a direct copy or PDF of the presentation would be useful.


BioIT Trends - 2014 Internet2 Technology Exchange

  1. 1. 1 Life Science Informatics: … trends from the trenches slideshare.net/chrisdag/ chris@bioteam.net @chris_dag #TechEx14
  2. 2. 2 I’m Chris. I’m an infrastructure geek. I work for the BioTeam.
  3. 3. Apologies in advance 3 Disclaimer #1 ‣ ‘Infamous’ for speaking very fast and carrying a huge slide deck ‣ With a new crowd it is always tough to guess what the audience is into ‣ Slides already online via slideshare.net/chrisdag/ Indianapolis, your espresso is amazing!
  4. 4. Disclaimer #2 This is not my usual audience - be warned! ‣ Technically I’m a corporate “suit” • also work with .GOVs and nonprofit research institutes • … my world may be quite different from your world ‣ This actually matters • … because “my people” are about to deluge your communities 4
  5. 5. 5 Why I’m here and not at work …
  6. 6. Why I’m here … and why many more of “us” are on the way …. ‣ Modern day genomics & informatics is not possible without hardcore IT infrastructure ‣ Front line consultants like BioTeam often get 1st view of emerging pain points & trends ‣ My pain points for 2015 and beyond mostly involve peta-scale scientific data movement & ScienceDMZ-like network architectures ‣ ESNet & Internet2 have been awesome 6
  7. 7. PLUG #1: High-Performance Networking Use Cases in Life Sciences 7 Today @ 3:30 - White River Ballroom
  8. 8. 8 Plug #2: “Tech Talk” Advice …
  9. 9. 9 Topics For Today: 1 Terabyte Benchtops; 2 The Bio/IT Meta Issue; 3 DevOps & Org Charts; 4 Compute; 5 Network; 6 Storage; 7 Cloud; 8 Things To Watch
  10. 10. 10 Terabyte-scale Instruments & Lab Tools Cheap, easy to acquire and popping up EVERYWHERE Semi old school (mid-2000s): one of the earlier “next-gen” sequencing platforms
  11. 11. 11 Terabyte-scale Instruments & Lab Tools Cheap, easy to acquire and popping up EVERYWHERE 1st set of “terabyte scale” lab instruments that often required lab-side server racks and enclosures <- Check this out!
  12. 12. 12 Terabyte-scale Instruments & Lab Tools Cheap, easy to acquire and popping up EVERYWHERE PacBio (30A @ 208V power required). Scientists still think they can buy these, not inform IT and just plug them into the wall. The same scientists believe that infinite storage will appear by magic when these arrive at the loading dock.
  13. 13. 13 Terabyte-scale Instruments & Lab Tools Cheap, easy to acquire and popping up EVERYWHERE HiSeq 2500 MiSeq NextSeq 500
  14. 14. 14 Terabyte-scale Instruments & Lab Tools Cheap, easy to acquire and popping up EVERYWHERE HiSeq X 10 $1,000 human genome @ 30x coverage * some caveats
  15. 15. 15 Coming Soon (it’s in beta now) USB-attached genome sequencing Crap.
  16. 16. 16
  17. 17. 17
  18. 18. 18 Doing “Bio-IT” is risky right now …
  19. 19. 19 Meta: Science evolving faster than IT can refresh infrastructure & practices
  20. 20. The Central Problem Is ... This is what keeps Bio-IT folks up at night ‣ Instrumentation & protocols are changing FAR FASTER than we can refresh our Research-IT & Scientific Computing infrastructure • Bench science is changing month-to-month ... • ... while our IT infrastructure only gets refreshed every 2-7 years ‣ Our job is to design systems TODAY that can support unknown research requirements & workflows over multi-year spans (gulp ...) 20
  21. 21. The Central Problem Is ... ‣ The easy period is over ‣ 5 years ago we could toss inexpensive storage and servers at the problem; even in a nearby closet or under a lab bench if necessary ‣ That does not work any more; real solutions required 21
  22. 22. 22 This is our “new normal” for informatics
  23. 23. 23 The Central Problem Is ... ‣ Lab technology is being refreshed, upgraded and replaced at an astonishing rate • Bigger, faster, parallel • Requiring increasingly sophisticated IT support • Cheap and easily obtainable Confocal microscope with heating enclosure that allows for 24hr live cell imaging experiments
  24. 24. 24 The Central Problem Is ... ‣ ... and IT still being caught by surprise in 2014 • Procurement practices and cheaper instrument prices result in situations where IT is bypassed or not consulted in advance (This cabinet was right next to a chemical bath and got showered :)
  25. 25. One more thing ... We can’t blame the science/lab side for everything ‣ Can’t blame the lab-side for all our woes ‣ IT innovation is causing headaches in research and program management ‣ Grant funding agencies, regulatory rules and internal risk/program management practices not updated to reflect current and emerging IT capabilities, architectures & practices • Rules & policies often simply do not cover what we are capable of doing right now 25
  26. 26. 26 A related problem ...
  27. 27. This also hurts ... ‣ It has never been easier or cheaper to acquire vast amounts of data ‣ Growth rate of data creation/ingest exceeds the rate at which the storage industry is improving disk capacity ‣ Not just a storage lifecycle problem. This data *moves* and often needs to be shared among multiple entities and providers • ... ideally without punching holes in your firewall or consuming all available internet bandwidth 27
  28. 28. The future is not looking pretty for the ill-prepared 28
  29. 29. High Costs For Getting It Wrong ‣ Lost opportunity ‣ Missing capability ‣ Frustrated & very vocal scientific staff ‣ Problems in recruiting, retention, publication & product development 29
  30. 30. 30 Enough groundwork. Let’s Talk Trends
  31. 31. 31 Trends: DevOps & Org Charts
  32. 32. 32 The social contract between scientist and IT is changing forever
  33. 33. 33 You can blame “the cloud” for this
  34. 34. 34 DevOps & Scriptable Everything ‣ On (real) clouds, EVERYTHING has an API ‣ If it’s got an API you can automate and orchestrate it ‣ “scriptable datacenters” are now a very real thing
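To make the “everything has an API” point concrete, here is a minimal sketch (not from the original deck) that launches, tags and terminates a compute node through the AWS API using the boto3 SDK; the AMI ID, instance type and tag values are placeholders.

```python
# Minimal sketch only: AMI ID, instance type and tag values are
# placeholders, not a recommended configuration.
import boto3

ec2 = boto3.resource('ec2')

# Launch a single worker node entirely through the API ...
instances = ec2.create_instances(
    ImageId='ami-xxxxxxxx',        # hypothetical machine image
    InstanceType='m3.xlarge',
    MinCount=1,
    MaxCount=1,
)

# ... tag it so automation tooling can find it later ...
instances[0].create_tags(Tags=[{'Key': 'role', 'Value': 'hpc-worker'}])

# ... and tear it down again, all without touching a console.
instances[0].terminate()
```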
  35. 35. 35 DevOps will conquer the enterprise ‣ Over the past few years cloud automation/orchestration methods have been trickling down into our local infrastructures ‣ This will have significant impact on careers, job descriptions and org charts
  36. 36. 2014: Continue to blur the lines between all these roles 36 Scientist/SysAdmin/Programmer ‣ When everything has an API ... ‣ ... anything can be ‘orchestrated’ or ‘automated’ remotely ‣ And by the way ... ‣ The APIs (‘knobs & buttons’) are accessible to all, not just the expert practitioners sitting in that room next to the datacenter
  37. 37. 2014: Continue to blur the lines between all these roles 37 Scientist/SysAdmin/Programmer ‣ IT jobs, roles and responsibilities are changing ‣ SysAdmins must learn to program in order to harness automation tools ‣ Programmers & Scientists can now self-provision and control sophisticated IT resources
  38. 38. 2014: Continue to blur the lines between all these roles 38 Scientist/SysAdmin/Programmer ‣ My take on the future ... • SysAdmins (Windows & Linux) who can’t code will have career issues • Far more control is going into the hands of the research end user • IT support roles will radically change -- no longer owners or gatekeepers ‣ IT will “own” policies, procedures, reference patterns, identity mgmt, security & best practices ‣ Research will control the “what”, “when” and “how big”
  39. 39. Trend: DevOps & Automation 2014 Summary ‣ Almost every HPC project (all sizes) BioTeam worked on in 2014 included • A bare-metal OS provisioning service (Cobbler, etc.) • A ‘next-gen’ configuration management service (Chef, Puppet, Saltstack, etc.) ‣ Gut feeling: This is going to be very useful for regulated environments • Not BS or empty hype: IT infrastructure and server/OS/service configuration encoded as text files • Easy to version control, audit, revert, rebuild, verify and fold into existing change management & documentation systems 39
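To give a flavour of the “configuration encoded as text files” point, here is a hedged sketch of driving SaltStack from Python; the minion target and state name are hypothetical, and Chef or Puppet would fill the same role.

```python
# Illustrative sketch assuming this runs on a salt-master host with the
# Salt Python libraries installed; 'compute*' and 'hpc.compute_node'
# are hypothetical minion and state names.
import salt.client

local = salt.client.LocalClient()

# Apply a version-controlled state definition to every compute node
# and report which minions came back with failed states.
results = local.cmd('compute*', 'state.apply', ['hpc.compute_node'])
for minion, states in results.items():
    failed = [s for s in states.values() if not s.get('result')]
    status = 'OK' if not failed else f'{len(failed)} failed states'
    print(f'{minion}: {status}')
```

Because the state definitions themselves are plain text, the same run can be tied back to a specific commit in version control for audit purposes.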
  40. 40. 40 Trends: Compute
  41. 41. Compute related design patterns largely static 41 Core Compute ‣ Linux compute clusters are still the baseline compute platform ‣ Even our lab instruments know how to submit jobs to common HPC cluster schedulers ‣ Compute is not hard. It’s a commodity that is easy to acquire & deploy in 2014
  42. 42. Defensive hedge against Big Data / HDFS 42 Compute: Local Disk Matters ‣ This slide is from 2013; trend is continuing ‣ The “new normal” may be 4U enclosures with massive local disk spindles - not occupied, just available ‣ Why? Hadoop & Big Data ‣ This is a defensive hedge against future HDFS or similar requirements • Remember the ‘meta’ problem - science is changing far faster than we can refresh IT. This is a defensive future-proofing play. ‣ Hardcore Hadoop rigs sometimes operate at 1:1 ratio between core count and disk count
  43. 43. Faster networks are driving compute config changes 43 Compute: NICs and Disks ‣ One pain point for me in 2013-2014: • Network links to my nodes are getting faster • It’s embarrassing my disks are slower than the network feeding them • Need to be careful about selecting and configuring high speed NICs - Example: that dual-port 10Gig card may not actually be able to drive both ports if the card was engineered for an active:passive link failover scenario • Also need to re-visit local disk configurations
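A trivial sanity check along these lines is to confirm what a NIC actually negotiated before blaming the disks. A minimal sketch; the interface name is a placeholder and `ethtool` must be available on the host.

```python
# Quick sanity-check sketch: confirm a '10Gig' NIC really negotiated
# 10000Mb/s; 'eth4' is a placeholder interface name.
import subprocess

out = subprocess.check_output(['ethtool', 'eth4']).decode()
for line in out.splitlines():
    if 'Speed:' in line:
        print(line.strip())    # expect something like 'Speed: 10000Mb/s'
```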
  44. 44. New and refreshed HPC systems running many node types 44 Compute: Huge trend in ‘diversity’ ‣ Accelerated trend since at least 2012 ... • HPC compute resources no longer homogeneous; many types and flavors now deployed in single HPC stacks ‣ Newer clusters mix and match node types to the known use cases: • GPU nodes for compute • GPU nodes for visualization • Large memory nodes (512GB+) • Very large memory nodes (1TB+) • ‘Fat’ nodes with many CPU cores • ‘Thin’ nodes with super-fast CPUs • Analytic nodes with SSD, FusionIO, flash or large local disk for ‘big data’ tasks
  45. 45. GPUs, Coprocessors & FPGAs 45 Compute: Hardware Acceleration ‣ Specialized hardware acceleration has its place but will not take over the world • “... the activation energy required for a scientist to use this stuff is generally quite high ...” ‣ GPU, Phi and FPGA best used in large scale pipelines or as a specific solution to a singular pain point
  46. 46. Emerging Trend: Hybrid HPC Also known as hybrid clouds ‣ No longer “utter crap” or “cynical vendor-supported reference case” • small local footprint • large, dynamic, scalable, orchestrated public cloud component ‣ Key enabler is availability of fast exterior bandwidth (thanks Internet2 !) ‣ DevOps is key to making this work ‣ High-speed network to public cloud required ‣ Software interface layer acting as the mediator between local and public resources ‣ Good for tight budgets, has to be done right to work ‣ Still best approached very carefully 46
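The “software interface layer” mediating between local and public resources can start out very simple. A hedged sketch of the idea, assuming a SLURM scheduler and AWS; the queue threshold, AMI and instance type are placeholders rather than anything BioTeam ships.

```python
# Illustrative burst logic only: threshold, AMI and instance type are
# placeholders; a real mediator also handles node registration,
# draining and teardown.
import subprocess
import boto3

PENDING_THRESHOLD = 500            # burst once this many jobs are queued
WORKER_AMI = 'ami-xxxxxxxx'        # hypothetical pre-built worker image

def pending_jobs():
    # Count pending jobs in the local SLURM queue.
    out = subprocess.check_output(['squeue', '--states=PD', '--noheader'])
    return len(out.splitlines())

def burst(count):
    ec2 = boto3.resource('ec2')
    return ec2.create_instances(ImageId=WORKER_AMI, InstanceType='c3.8xlarge',
                                MinCount=count, MaxCount=count)

if pending_jobs() > PENDING_THRESHOLD:
    burst(10)
```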
  47. 47. 47 Trends: Network
  48. 48. 48 Big trouble ahead ...
  49. 49. 49 Network: Speed @ Core and Edge ‣ This is why I’m at TechEx14 ‣ Huge potential pain point ‣ May surpass storage as our #1 infrastructure headache ‣ Petascale data is useless if you can’t move it or access it fast enough ‣ Problem: corporate folks are smug about 10Gig - totally unprepared for 40gig and 100gig future ‣ We often need 10Gig to some desktops for data ingest/export
  50. 50. 50 Network: Speed @ Core and Edge ‣ Remember ~2004 when research storage requirements started to dwarf what the entire enterprise was using? ‣ Same thing is happening now for network capacity requirements • Research core, edge and top-of-rack networking speeds may exceed what the rest of the organization has standardized on
  51. 51. This is going to be painful Massive data movement needs are driving innovation pain ‣ Enterprise networking folks are even more aloof than storage admins we battled in ’04 ‣ Often used to driving requirements and methods; unhappy when science starts to drive them out of their comfort zones ‣ Research needs to start pushing harder and faster for network speeds above 10GbE 51
  52. 52. Quick Real World Example Tying it all together … 52
  53. 53. Real World Example ‣ Global Pharmaceutical Company • Scientists @ Site A • Lots of “terabyte scale” instruments @ Site A • HPC Compute/Storage @ Site B • Replication/DR @ Site C ‣ Peering At the Future … • Complex global collaborations becoming the “new normal” • Potential joint ownership of a remote HiSeq X 10 platform ‣ How to handle that? • Internet2 to the rescue! 53
  54. 54. 54 Solution: Internet2 & Amazon AWS
  55. 55. 55 “Corporate Approved” Science DMZ Ouch. This stuff ain’t cheap.
  56. 56. Network: ‘ScienceDMZ’ in industry ‣ My gut feeling: 1. The fanciest and most complex Science DMZ architectures in the literature right now are not suitable for our world • Expensive specialized equipment; Expensive specialist staff expertise required • Often still experimental, not something enterprise IT would want to drop into a production environment 2. Science DMZ concepts are sound and simple implementations are possible today 3. I’m recommending that people “start small”: • Incorporate these sorts of concepts/ideas into long term planning ASAP • Start adding network performance monitoring nodes to research networks, DMZs and external circuit connections now; this entire concept falls over without actionable flow and performance data • Start work on policies and procedures for manual bypass of firewall/IDS rules when known sender/receivers are freighting high speed data; automation comes later! 56
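On the “actionable flow and performance data” point, monitoring can start as small as a cron job. A minimal sketch, assuming iperf3 is installed and an iperf3 server is running on a known data-transfer node; the host name and threshold are placeholders, and the JSON field names may differ slightly across iperf3 versions.

```python
# Minimal monitoring sketch: measure throughput to a known data-transfer
# node and complain when it drops below a site-specific threshold.
import json
import subprocess

DTN_HOST = 'dtn.example.org'   # hypothetical data-transfer node
MIN_GBPS = 5.0                 # alert threshold; pick per circuit

out = subprocess.check_output(['iperf3', '-c', DTN_HOST, '-J'])
result = json.loads(out)
gbps = result['end']['sum_received']['bits_per_second'] / 1e9

if gbps < MIN_GBPS:
    print(f'WARNING: {DTN_HOST} measured {gbps:.1f} Gbit/s, below {MIN_GBPS}')
```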
  57. 57. Buzzkill @ TechEx Software Defined Networking (“SDN”) 57
  58. 58. More hype than useful reality at the moment *(my world) 58 Network: SDN Hype vs. Reality ‣ Software Defined Networking (“SDN”) is the new buzzword ‣ It WILL become pervasive and will change how we build and architect things ‣ But ... ‣ Not hugely practical at the moment for most environments in “my world” • We need far more than APIs that control port forwarding behavior on switches • More time needed for all of the related technologies and methods to coalesce into something broadly useful and usable
  59. 59. More hype than useful reality at the moment 59 Network: SDN ‣ My gut feeling: • It is the future but right now we are still in the “mostly hype” phase for mere mortals • Production enterprise use: OpenFlow and similar stuff does not provide value relative to implementation effort right now (in my world …) • Best bang for the buck in ’14-15 will be getting ‘SDN’ features as part of some other supported stack - OpenStack, VMWare, Cloud, etc.
  60. 60. 60 Trends: Storage
  61. 61. 61 Storage ‣ Still the biggest expense, biggest headache and scariest systems to design in modern life science informatics environments ‣ Many of the pain points we’ve talked about for years are still in place: • Explosive growth forcing tradeoffs in capacity over performance • Lots of monolithic single tiers of storage • Critical need to actively manage data through its full life cycle (just storing data is not enough ...) • Need for post-POSIX solutions such as iRODS and other metadata-aware data repositories
  62. 62. 62 Storage Trends ‣ The large but monolithic storage platforms we’ve built up over the years are no longer sufficient • Do you know how many people are running a single large scale-out NAS or parallel filesystem? Most of us! ‣ Tiered storage is now an absolute requirement • At a minimum we need an active storage tier plus something far cheaper/deeper for cold files ‣ Expect the tiers to involve multiple vendors, products and technologies • The Tier1 storage vendors tend to have higher-end pricing for their “all in one” tiered data management solutions
  63. 63. 63 Storage - The Old Way ‣ Single tier of scale-out NAS or parallel FS ‣ Why? • Suitable for broadest set of use cases • Easy to procure/integrate • Lowest administrative & operational burden ‣ Example: • 400TB - 1PB of ‘something’ stores ‘everything’
  64. 64. 64 Storage - The New Way ‣ Multiple tiers; potentially from multiple vendors ‣ Why? • Way more cost efficient (size the tier to the need) • Single tier no longer capable of supporting all use cases and workflow patterns • Single tiers waste incredible money at large scale - often the scale of wastage is large enough to cover the cost of a full-time data manager ‣ Example: • 10-40 TB SSD/Flash for ingest & IOPS-sensitive workloads • 50-400 TB tier (SATA/SAS/SSD mix) for active processing • Multi-petabyte tier (Cloud, Object, SATA) for cost and operationally efficient long term (yet reachable) storage of scientific data at rest
  65. 65. Sticking 100% with Tier 1 vendors gets expensive 65 Storage: Disruptive stuff ahead ‣ Backblaze pod style methods = 200TB for $12K ish ‣ BioTeam has built 1Petabyte ZFS-based storage pools from commodity whitebox kit for about ~$100,000 in direct hardware costs (engineering effort & admin not included in this price ...) ‣ There are many storage vendors in the middle tier who can provide storage systems that are less ‘risky’ than DIY homebuilt setups yet far less expensive than the traditional Tier 1 enterprise storage options ‣ Companies like Avere Systems are producing boxes that unify disparate storage tiers, add performance to cheap storage and link them to cloud and object stores ‣ Object Storage is the future of scientific data. It will take many years before the life science informatics community is fully transitioned
  66. 66. The new thumper. Infinidat aka http://izbox.com ‣ 1 petabyte usable NAS shipped as a single integrated rack • Reasonably priced ‣ More expensive than DIY ZFS on commodity chassis but less expensive than current mainstream products ‣ Lots of interesting use cases for ‘cheap & deep’ 66
  67. 67. Object Storage ‣ Object storage is the future for scientific data at rest • Total no brainer; makes more sense than the “files and folders” paradigm, especially for automated analysis • Plus Amazon does it for super cheap ‣ But ... There will be a long transition period due to all of our legacy codes and workflows • This is where gateway devices can play ‣ It can: • Provide a much better workflow design pattern than assuming “files and folders” data storage • Save millions of dollars via efficiencies of erasure coding • Provide a much more robust and resilient peta-scale storage framework • Hide behind a metadata-aware layer such as IRODS to provide very interesting capabilities 67
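A minimal sketch of what the “object instead of files and folders” paradigm looks like in practice, using Amazon S3 via boto3; the bucket name, key layout and metadata values are hypothetical.

```python
# Sketch only: bucket name, key layout and metadata fields are
# hypothetical, chosen to show a flat key plus user metadata rather
# than a directory tree.
import boto3

s3 = boto3.client('s3')
with open('reads.fastq.gz', 'rb') as data:
    s3.put_object(
        Bucket='example-genomics-archive',
        Key='run42/sample-A/reads.fastq.gz',
        Body=data,
        Metadata={'instrument': 'hiseq-2500', 'project': 'example-study'},
    )
```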
  68. 68. Object Storage ‣ Erasure coding distributed object stores are very interesting at peta-scale ... ‣ Think about how you would handle & replicate 20 petabytes of data the “traditional way” • Purchase 2x or 3x storage capacity to handle replication overhead • Ignore the nightmare scenario of having to restore from one of the distributed replicas 68
  69. 69. Object Storage ‣ Efficiencies of erasure coding allow for LESS raw disk to be distributed across MORE geographic sites ‣ End result is a “single” usable system that is tolerant of the failure of an entire datacenter/site ‣ For the 20 petabyte problem, instead of purchasing 2x disk you buy ~1.8x and use the capex savings to add an extra colo facility or increase WAN link speed 69
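The arithmetic behind that claim fits in a few lines. A back-of-the-envelope sketch; the 10+8 erasure-coding layout is an illustrative assumption that happens to land on the ~1.8x figure above.

```python
# Back-of-the-envelope raw-capacity math for 20 PB of usable data.
usable_pb = 20

# Traditional approach: keep a full second replica.
replicated_raw = usable_pb * 2

# Erasure coding: k data shards plus m parity shards spread across sites.
k, m = 10, 8                      # illustrative layout, not a recommendation
erasure_raw = usable_pb * (k + m) / k

print(f'2x replication    : {replicated_raw:.0f} PB raw')
print(f'{k}+{m} erasure coding: {erasure_raw:.0f} PB raw')   # ~1.8x usable
```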
  70. 70. 70 Can you do a Bio-IT talk without using the ‘C’ word?
  71. 71. 71 Cloud: 2014 ‣ Life Science cloud migration is being driven by MORE than economic arguments ‣ Other critical drivers: • Neutral/Safe space for complex collaboration • Our lab instruments can ‘write to cloud’ • Data providers & genome sequencing shops can ‘deliver to the cloud’ • New capabilities via 100% virtual IaaS platforms • Multiple interesting ways to share data
  72. 72. 72 Informatics, Internet2 & IaaS Clouds
  73. 73. IaaS Cloud: 2014 What has changed .. ‣ Revisit some of my bile from prior years ‣ “... private clouds: still utter crap” ‣ “... some AWS competitors are delusional pretenders” ‣ “... AWS has a multi-year lead on the competition” 73
  74. 74. Private Clouds in 2014: ‣ I’m no longer dismissing them as “utter crap” • However it is a lot of work and money to build a system that only has 5% of the features that AWS can deliver today (for a cheaper price). Need to be careful about the use case, justification and operational/development burden. ‣ Usable & useful in certain situations ‣ BioTeam positive experiences with OpenStack ‣ Starting to see OpenStack pilots among our clients ‣ Hype vs. Reality vs. Operational Effort ratio still wacky ‣ Sensible only for certain shops but getting better • Have you seen what you have to do to your networks & gear? ‣ Still important to remain cynical and perform proper due diligence
  75. 75. Not all AWS competitors are delusional ‣ Google Compute is viable in 2014 for scientific workflows • Compute/Memory: Late start into IaaS means CPUs and memory are current generation; we have ‘war stories’ from AWS users who probe /proc/cpuinfo on EC2 servers so they can instantly kill any instance running on older chipsets • Price: Competitive on price although the shooting war between IaaS providers means it is hard to pin down the current “winner”; The “sustained use” pricing is easier to navigate than AWS Reserved Instances. Overall AWS pricing algorithms for various services seem more complicated than Google equivalents. • Network performance: Fantastic networking and excellent performance/latency figures between regions and zones. VPC type features are baked into the default resource set • Ops: Priced in 1min increments; no more need to hunt and kill servers at 55 min past the hour. Google has a concept of “Projects” with assigned collaborators and quotas. Quite different from the AWS account structure and IAM-based access control model. Project-based paradigm easier to think about for scientific use case. • IaaS Building Blocks: Still far fewer features than AWS but the core building blocks that we need for science and engineering workflows are present. ‣ My $.02 • AWS is still the clear leader but Google Compute is now a viable option and it is worth ‘kicking the tires’ in 2014 and beyond ... to me AWS has had no serious competition until now
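For what the “/proc/cpuinfo probe” war story looks like in practice, here is a hedged sketch of the pattern; the list of acceptable CPU models is a placeholder, and the actual termination would be handled by whatever wrapper launched the instance.

```python
# Sketch of the 'probe and bail out on old chipsets' pattern described
# above; the acceptable model strings are placeholders.
import re
import sys

ACCEPTABLE = ('E5-2670', 'E5-2680')    # hypothetical 'new enough' CPUs

with open('/proc/cpuinfo') as f:
    match = re.search(r'model name\s*:\s*(.+)', f.read())
model = match.group(1) if match else 'unknown'

if not any(chip in model for chip in ACCEPTABLE):
    print(f'Older chipset detected ({model}); flagging instance for teardown')
    sys.exit(1)   # the launching wrapper would then terminate the instance
```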
  76. 76. 76 The road ahead for Bio-IT ...
  77. 77. This has been a slow-moving trend for years now ... 77 POSIX Alternatives Coming ‣ The range of organizations running into the limitations of POSIX filesystems will continue to expand ‣ We desperately need some sort of “metadata aware” data management solution in life science ‣ Nobody has an easy solution yet; several bespoke installations but no clear mass-market options ‣ iRODS front-ending “cheap & deep” storage tiers or object stores appears to be gaining significant interest out in our community
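As a flavour of what “metadata aware” means here, a minimal sketch using the python-irodsclient package to register a file and attach searchable metadata; the host, zone, paths and key/value pairs are all placeholders.

```python
# Illustrative sketch assuming the python-irodsclient package; the
# connection details, paths and metadata values are placeholders.
from irods.session import iRODSSession

with iRODSSession(host='irods.example.org', port=1247, user='svc_ingest',
                  password='********', zone='exampleZone') as session:
    # Register the raw file into the iRODS namespace ...
    session.data_objects.put(
        'reads.fastq.gz',
        '/exampleZone/home/svc_ingest/run42/reads.fastq.gz')

    # ... then attach queryable metadata instead of encoding it in the path.
    obj = session.data_objects.get(
        '/exampleZone/home/svc_ingest/run42/reads.fastq.gz')
    obj.metadata.add('instrument', 'hiseq-2500')
    obj.metadata.add('project', 'example-study')
```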
  78. 78. Application Containers are getting interesting 78 Watch out for: Containerization ‣ Application containerization via methods like http://docker.io gaining significant attention • Docker support now in native RHEL kernel • AWS Elastic Beanstalk recently added Docker support ‣ If broadly adopted, these techniques will stretch research IT infrastructures in interesting directions • This is far more interesting to me than moving virtual machines around a network or into the cloud ‣ ... with a related impact on storage location, features & capability ‣ Major news and progress expected in 2014
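A hedged sketch of why containerized analysis steps are interesting for research IT: the whole tool chain travels with the job while the data stays on mounted storage. The image name, mount point and command line here are hypothetical.

```python
# Sketch only: 'example/aligner' and its command line are hypothetical;
# the point is that the tool chain ships inside the image and the data
# stays on a bind-mounted volume.
import subprocess

subprocess.check_call([
    'docker', 'run', '--rm',
    '-v', '/data/run42:/data',               # bind-mount instrument output
    'example/aligner:latest',                # hypothetical analysis image
    'align', '--input', '/data/reads.fastq', '--output', '/data/aln.bam',
])
```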
  79. 79. 79 Keep an eye on: Storage ‣ Data generation out-pacing technology ‣ Really interesting disruptive stuff on the market now ‣ Cheap/easy laboratory assays taking over • Researchers largely don’t know what to do with it all • Holding on to the data until someone figures it out • This will cause some interesting headaches for IT • Huge need for real “Big Data” applications to be developed
  80. 80. 80 Keep an eye on: Networking ‣ Unless there’s an investment in ultra-high-speed networking, we need to change how we think about analysis ‣ Data commons are setting a precedent • Need to minimize the movement of data • Include compute power and an analysis platform with the data commons ‣ Move the analysis to the data, don’t move the data ‣ ScienceDMZs coming online
  81. 81. 81 Long term trends ... ‣ Compute continues to become easier ‣ Peta-scale storage becomes a commodity ‣ Data movement and ingest (physical & network) gets harder ‣ Cost of storage will be dwarfed by “cost of managing stored data” ‣ We can see end-of-life for our current IT architecture and design patterns; new patterns will start to appear over next 2-5 years
  82. 82. end; Thanks! 82 slideshare.net/chrisdag/ chris@bioteam.net @chris_dag #TechEx14
