Trends from the Trenches (Singapore Edition)


(PDF available upon request). This is an updated version of my 2012 Bio-IT World Boston talk, given 6 weeks later at Bio-IT World Asia in June 2012. I updated and revised some slide content and deleted a number of slides to shorten the talk, since I'm known to speak fast. There was legit concern I'd be unintelligible to non-native English speakers!

Published in: Technology, Business

  1. Trends from the Trenches: 2012 Bio-IT World Asia, Singapore
  2. I’m Chris. I’m an infrastructure geek. I work for the BioTeam.
  3. BioTeam: Who, what & why ‣ Independent consulting shop ‣ Staffed by scientists forced to learn IT, SW & HPC to get our own research done ‣ 10+ years bridging the “gap” between science, IT & high performance computing
  4. BioTeam: Why we get invited to these sorts of talks ... ‣ Lots of people hire us across a wide range of project types • Pharma, Biotech, EDU, Nonprofit, .Gov, .Mil, etc. ‣ We get to see how groups of smart people approach similar problems ‣ We can speak honestly & objectively about what we see “in the real world”
  5. Disclaimer.
  6. Listen to me at your own risk. Seriously. ‣ I’m not an expert, pundit, visionary or “thought leader” ‣ All career success entirely due to shamelessly copying what actual smart people do ‣ I’m biased, burnt-out & cynical ‣ Filter my words accordingly
  7. 1. Introduction ‣ 2. Business & Marketplace ‣ 3. Storage ‣ 4. Cloud ‣ 5. Hot for ’12 ...
  8. Business Landscape: So far 2012 feels a lot like 2011 ...
  9. Business & Meta Observations: More of the same in ’12 ... ‣ ~4 staff full time on issues involving data handling, data management and multi-instrument Next-Gen sequencing/analysis ‣ ~2 staff full time on infrastructure, storage and facility related projects • Dwan: Big infrastructure & facility projects for Fortune 20 companies, research consortia & .GOV customers • Dag: 40% infrastructure, 20% storage, 20% cloud ‣ ~1 staff full time on Amazon Cloud projects
  10. What that tells us ‣ Same problem(s) as last year ‣ Next-gen sequencing still causing a lot of pain when it comes to data handling, storage, organization & integration ‣ As sequencing continues to be commoditized, this will likely only get worse
  11. Storage
  12. Science-centric Storage: Current State Assessment ‣ Storage still making me crazy in ’12
  13. Science-centric Storage: Why I’m not worried ‣ Peta-capable storage is trivial to acquire in 2012 ‣ Scale-out NAS has won the battle ‣ It’s simply not as hard/risky as it used to be
  14. On the other hand ...
  15. OMG! The Sky Is Falling! Maybe a little panic is appropriate ...
  16. The sky IS falling! Uncomfortable truths ‣ Cost of acquiring data (genomes) falling faster than rate at which industry is increasing drive capacity ‣ Human researchers downstream of these datasets are also consuming more storage (and less predictably) ‣ High-scale labs must react or potentially have catastrophic issues in 2012-2013
  17. The sky IS falling! Current Practices Are Not Sustainable ‣ FACT: Chemistry changing faster than we can refresh our datacenters and research IT infrastructure ‣ FACT: Rate at which we can cheaply acquire interesting data exceeds rate at which storage companies can increase the capacity of their products ‣ FACT: We are poor at managing, tagging, valuing & curating our data. Few scientists really understand true cost/complexity involved with keeping data safe, online & accessible ‣ FACT: In 2012 people still think “keep everything online, forever” is a viable demand to be making of IT staff ‣ FACT: Something is going to break. Soon.
  18. CRAM it.
  19. The sky IS falling! CRAM it in 2012 ... ‣ Minor improvements are useless; order-of-magnitude needed ‣ Some people are talking about radical new methods – compressing against reference sequences and only storing the diffs • With a variable compression “quality budget” to spend on lossless techniques in the areas you care about ‣ Ewan Birney on “Compressing DNA” ‣ The actual CRAM paper ‣ If CRAM takes off, storage landscape will change
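The "compress against a reference and only store the diffs" idea can be sketched in a few lines. This is a toy illustration, not the CRAM format: it assumes the sample and the reference have the same length and differ only by substitutions, and the function names and example sequences are made up for the sketch.

```python
from typing import List, Tuple

# A diff is (0-based position, base observed in the sample).
Diff = Tuple[int, str]


def encode_against_reference(reference: str, sample: str) -> List[Diff]:
    """Keep only the positions where the sample differs from the reference.

    Toy assumption: equal-length sequences, substitutions only.
    """
    if len(reference) != len(sample):
        raise ValueError("this sketch only handles equal-length sequences")
    return [(i, s) for i, (r, s) in enumerate(zip(reference, sample)) if r != s]


def decode_against_reference(reference: str, diffs: List[Diff]) -> str:
    """Rebuild the sample by applying the stored diffs to the reference."""
    bases = list(reference)
    for position, base in diffs:
        bases[position] = base
    return "".join(bases)


# Hypothetical example data.
reference = "ACGTACGTACGT"
sample = "ACGTACCTACGA"

diffs = encode_against_reference(reference, sample)
restored = decode_against_reference(reference, diffs)
print(diffs)               # two stored substitutions instead of twelve bases
print(restored == sample)  # the round trip is lossless for this toy case
```

The saving comes from the fact that a resequenced genome mostly matches its reference, so the list of diffs is far shorter than the sequence itself. Real tools also have to deal with insertions, deletions, read alignment and quality scores, which this sketch ignores.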
  20. Storage: What comes next? Next 18 months will be really fun...
  21. What comes next. The same rules apply for 2012 and beyond ... ‣ Accept that science changes faster than IT infrastructure ‣ Be glad you are not Broad/Sanger/BGI/NCBI ‣ Flexibility, scalability and agility become the key requirements of research informatics platforms • Tiered storage is in your future ... ‣ Shared/concurrent access is still the overwhelming storage use case
  22. What comes next. In the following year ... ‣ Many peta-scale capable systems deployed • Most will operate in the hundreds-of-TBs range ‣ Far more aggressive “data triage” ‣ Genome compression via CRAM ‣ Even more data will sit untouched & unloved ‣ Growing need for tiers, HSM & even tape
  23. What comes next. In the following year ... ‣ Broad and others are paving the way with respect to metadata-aware & policy driven storage frameworks • And we’ll shamelessly copy a year or two later ‣ I’m still on my cloud storage kick • Economics are inescapable; Will be built into storage platforms, gateways & VMs • Cloud object stores are only an HTTP RESTful call away • Cloud will become “just another tier”
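One way to picture "data triage" across tiers, with cloud as "just another tier", is an age-based placement policy. The sketch below is an assumption-laden illustration: the tier names, the idle-day thresholds and the `choose_tier` helper are all hypothetical and are not taken from the talk or from any storage product.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class TierRule:
    name: str
    max_idle_days: float  # data idle for up to this many days stays on this tier


# Hypothetical policy: thresholds are illustrative only.
POLICY: List[TierRule] = [
    TierRule("scale-out NAS", 90),
    TierRule("nearline disk", 365),
    TierRule("cloud object store", 3 * 365),
]
ARCHIVE_TIER = "tape"  # everything older than the last rule lands here


def choose_tier(idle_days: float) -> str:
    """Return the first tier whose idle-time limit covers this dataset."""
    for rule in POLICY:
        if idle_days <= rule.max_idle_days:
            return rule.name
    return ARCHIVE_TIER


for idle in (10, 200, 1000, 5000):
    print(idle, "days idle ->", choose_tier(idle))
```

A metadata-aware framework would key on much more than idle time (project, instrument, whether the data can be regenerated), but even a rule this simple makes the "untouched & unloved" data visible and movable.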
  24. What comes next. Expect your storage to be smarter & more capable ... ‣ What do DDN, Panasas, Isilon, BlueArc, etc. have in common? • Under the hood they all run Unix or Unix-like OS’s on x86_64 architectures ‣ Some storage arrays can already run applications natively • More will follow • Likely a big trend for 2012
  25. Storage: The road ahead. My $.02 for 2012...
  26. The Road Ahead: Trends & Tips for 2012 ‣ Peta-capable platforms required ‣ Scale-out NAS still the best fit ‣ Customers will no longer build one big scale-out NAS tier ‣ My ‘hack’ of using nearline spec storage as primary science tier is obsolete in ’12 ‣ pNFS mainstream in 2012? ‣ Not everything is worth backing up ‣ Expect disruptive stuff
  27. The Road Ahead: Trends & Tips for 2012 ‣ Your storage will be able to run apps • Dedupe, cloud gateways & replication • ‘CRAM’ or similar compression • Storage Resource Brokers (iRODS) & metadata servers • HDFS/Hadoop hooks? • Lab, Data management & LIMS applications ‣ Drobo Appliance running BioTeam MiniLIMS internally...
  28. The Road Ahead: Trends & Tips for 2012 ‣ Hadoop / MapReduce / BigData • Just like GRID and CLOUD, the space is being over-hyped • You still need to think about it • ... and have a roadmap for doing it • Deep, deep ties to your storage • Your users want/need it • My $.02? Fantastic cloud use case
  29. Disruptive Storage Example
  30. Backblaze Pod For Biotech
  31. 100 Terabytes for $12,000 USD
  32. Storage Future Feels Like This ... Multiple Tiers, Multiple Vendors, Multiple Products
  33. The ‘C’ word: Does a Bio-IT talk exist if it does not mention “the cloud”?
  34. Cloud Stuff ‣ Before I make some blunt comments ... ‣ I am not an Amazon Cloud shill ‣ I am a jaded, cynical, zero-loyalty consumer of IT services and products that let me get work done ‣ Because I only get paid when my solutions work, I am picky about what tools I keep in my toolkit ‣ Amazon Web Services is a fantastic tool
  35. So you think you have a cloud?
  36. No APIs? Not a cloud.
  37. No self-service? Not a cloud.
  38. Installing VMware & issuing a press release? Not a cloud.
  39. Block storage and virtual servers only? (Barely) a cloud.
  40. Amazon is the IaaS Cloud Leader ‣ Why Amazon is attractive for infrastructure clouds: • Anyone can do virtual servers and block/object storage • Bio-IT needs “more stuff” in order to get real work done • AWS product & service stack (“the glue”) is far more comprehensive than that of any other cloud competitor - Need some examples? - ElasticIP, VPC, IAM, SQS, SNS, SES, SimpleDB, DynamoDB, CloudFormation, ElasticBeanstalk, SWF, DirectConnect, etc.
  41. Amazon Cloud Dominance Could Be A Good Thing ‣ Amazon Cloud Dominance May Be Good For Bio-IT ‣ The competition must innovate in really interesting ways in order to compete. This is already happening. • Purpose-built platforms for regulated/compliant operation • “Hands-on” Managed Services for Healthcare/Pharma • Hybrid on-premise/off-premise solutions • Full life science solution & software service stacks • Bespoke Service Level Agreements (SLAs) • ...
  42. Private Clouds: My $.02
  43. Private Clouds in 2012: ‣ I’m no longer dismissing them as “useless” ‣ Usable & useful in certain situations ‣ Hype vs. Reality ratio still unbalanced ‣ Sensible only for certain environments • Have you seen what you have to do to your networks & gear? ‣ There are easier ways
  44. Private Clouds: My Advice for ’12 ‣ Remain cynical (test vendor claims) ‣ Due Diligence still essential ‣ I personally would not deploy anything that does not explicitly provide Amazon API compatibility
  45. Private Clouds: My Advice for ’12 Most people are better off: 1. Adding VM platforms to existing HPC clusters & environments 2. Extending enterprise VM platforms to allow user self-service & server catalogs
  46. Cloud Advice: My $.02
  47. Cloud Advice: Don’t get left behind ‣ Research IT Organizations need a cloud strategy today ‣ Those that don’t will be bypassed by frustrated users ‣ IaaS cloud services are only a departmental credit card away ... and some senior scientists are too big to be fired for violating IT policy
  48. Cloud Advice: Design Patterns ‣ You will need three tested cloud design patterns: ‣ (1) To handle ‘legacy’ scientific apps & workflows ‣ (2) The special stuff that is worth re-architecting ‣ (3) Hadoop & big data analytics
  49. Cloud Advice: (1) Legacy HPC on the Cloud ‣ MIT StarCluster ‣ This is your baseline for legacy apps on ‘the cloud’ ‣ Extend as needed
  50. Cloud Advice: (2) “Cloudy” HPC ‣ Some of our research workflows are important enough to be rewritten for “the cloud” and the advantages that a truly elastic & API-driven infrastructure can deliver ‣ This is where you have the most freedom ‣ Many published best practices you can borrow ‣ Good commercial options: Cycle Computing, BT, etc.
  51. Cloud Advice: (3) Big Data HPC ‣ It will be a MapReduce world, get used to it ‣ Little need to roll your own Hadoop in 2012 ‣ ISV & commercial ecosystem already healthy ‣ Multiple providers today; both onsite & cloud-based ‣ Often an excellent cloud use case
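For anyone who has not yet had to "get used to it", the MapReduce pattern itself is small. The sketch below shows only the shape of the idea (a map step that emits key/value pairs, a reduce step that sums them per key) in plain single-process Python. It does not use Hadoop or any Hadoop API, and the k-mer counting task, function names and reads are illustrative assumptions.

```python
from collections import Counter
from typing import Iterable, Iterator, Tuple


def map_kmers(read: str, k: int) -> Iterator[Tuple[str, int]]:
    """Map step: emit (k-mer, 1) for every k-length window in one read."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k], 1


def reduce_counts(pairs: Iterable[Tuple[str, int]]) -> Counter:
    """Reduce step: sum the emitted values for each distinct key."""
    totals: Counter = Counter()
    for kmer, count in pairs:
        totals[kmer] += count
    return totals


# Hypothetical input reads.
reads = ["ACGTAC", "GTACGT"]

# In a real framework the map calls run in parallel across many nodes and
# the pairs are shuffled by key before the reduce; here it is one process.
pairs = (pair for read in reads for pair in map_kmers(read, 3))
counts = reduce_counts(pairs)
print(counts)
```

Because each read is mapped independently and each key is reduced independently, the same logic spreads across as many workers as the data needs, which is why it pairs so well with elastic cloud capacity.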
  52. Cloud Data Movement: My $.02
  53. Cloud Data Movement ‣ Over several years we have participated in a number of large “cloud data movement” efforts ‣ We used to be big fans of physical media movement ‣ However ...
  54. Physical Data Movement Is Not Easy.
  55. Cloud Data Movement ‣ At first glance, physical data movement “seems easy” ‣ It’s not. It is hard to do correctly and requires significant human effort and operational resources ‣ This has been a hard lesson learned over several years ‣ We have a new strategy for 2012 and the next image shows why ...
  56. March 2012
  57. Cloud Data Movement: Wow! ‣ With a 1GbE internet connection ... ‣ and using Aspera software .... ‣ We sustained 700 Mb/sec for more than 7 hours freighting genomes into Amazon Web Services ‣ This is fast enough for many use cases, including genome sequencing core facilities* ‣ Chris Dwan’s webinar on this topic:
  58. Cloud Data Movement: Wow! ‣ Results like this mean we now favor network-based data movement over physical media movement ‣ Large-scale physical data movement carries a high operational burden and consumes non-trivial staff time & resources ‣ *Unclear if our experience holds true for Asia or Asia-EU-Americas data transfers
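To put the 700 Mb/sec figure in perspective, here is a back-of-the-envelope calculation. It assumes "Mb" means megabits, uses decimal units (1 TB = 10^12 bytes) and ignores protocol overhead. The helper names are made up for this sketch, and the 10 TB dataset is a hypothetical example.

```python
def terabytes_moved(megabits_per_sec: float, hours: float) -> float:
    """Data volume (decimal TB) moved at a sustained rate for a given time."""
    bits = megabits_per_sec * 1e6 * hours * 3600
    return bits / 8 / 1e12


def hours_to_move(terabytes: float, megabits_per_sec: float) -> float:
    """Hours needed to move a dataset (decimal TB) at a sustained rate."""
    bits = terabytes * 1e12 * 8
    return bits / (megabits_per_sec * 1e6) / 3600


moved = terabytes_moved(700, 7)      # roughly 2.2 TB in 7 hours
needed = hours_to_move(10, 700)      # a hypothetical 10 TB dataset
print(f"{moved:.2f} TB moved in 7 hours at 700 Mb/sec")
print(f"{needed:.1f} hours to move 10 TB at 700 Mb/sec")
```

At that rate a hypothetical 10 TB dataset moves in well under two days of wall-clock time with nobody handling disks, which is the comparison that matters against shipping physical media.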
  59. Cloud Data Movement: There are three ways to do network data movement ... ‣ (1) Buy software from Aspera and be done with it ‣ (2) Attend the annual SuperComputing conference & see which student group wins the bandwidth challenge contest; use their code ‣ (3) Get GridFTP from the Globus folks • Trend: At every single “data movement” talk I’ve been to in 2011 it seemed that any speaker who was NOT using Aspera was a very happy user of GridFTP. #notCoincidence
  60. Hot topics for 2012 ...
  61. Hot for ’12: BioTeam side projects & research interests ‣ Like to wrap up with some topics we think are interesting ‣ Who knows? These might be trends for 2013!
  62. Siri Voice Control of Instruments/Pipelines ‣ BioTeam recently revealed work with BT and Accelrys ‣ Demonstrated Siri voice control of a Pipeline Pilot experiment running in the BT Compute Cloud ‣ We expect to continue doing cool things with Siri in ’12
  63. Smart Storage & Lab-local Appliances ‣ I firmly expect the “storage arrays running apps & VMs” trend to go mainstream ‣ This has beneficial implications for life science informatics ‣ We’ll be hitting this topic hard on systems ranging from Drobo to DataDirect ‣ Also working with the Intel Modular Server concept
  64. Lab Local Appliances: Intel Modular Server ‣ Interesting hardware combination; storage + servers + native hypervisor ‣ VM Pool 1: MiniLIMS + other useful lab software ‣ VM Pool 2: Amazon Storage Gateway Appliance ‣ Server Blade 3: BrightCluster HPC Stack
  65. Cloud, Community & Orchestration ‣ The emerging class of “DevOps” and “Infrastructure Automation” methods is incredibly interesting • We love Opscode & Chef ‣ We’ll be doing more with systems orchestration in ’12 • And hopefully expanding our community collection of useful Chef cookbooks for life science informatics ‣ We also still love MIT StarCluster and will hopefully be contributing plugins and enhancements
  66. Thanks! Slides online at: