
2015 Bio-IT Trends From the Trenches


Talk slides from my annual address at the Bio-IT World Expo & Conference, where I cover trends, best practices and emerging pain points for life science-focused HPC, scientific computing and "research IT".

Email "chris@bioteam.net" if you want a PDF copy of these slides. I've disabled the raw powerpoint download option on slideshare.


  1. 1. 1 2015 Bio IT World Expo: Trends From The Trenches slideshare.net/chrisdag/ chris@bioteam.net @chris_dag
  2. 2. 2 I’m Chris. I’m an infrastructure geek. (and failed scientist …) I work for the BioTeam.
  3. 3. 3 BioTeam.
  4. 4. 4 BioTeam. Independent Consulting Shop
  5. 5. 5 BioTeam. Independent Consulting Shop Virtual company with nationwide staff
  6. 6. 6 BioTeam. Independent Consulting Shop Run by scientists forced to learn IT to “get science done” Virtual company with nationwide staff
  7. 7. 7 BioTeam. Independent Consulting Shop Run by scientists forced to learn IT to “get science done” Virtual company with nationwide staff 15+ years “bridging the gap” between hardcore science, HPC & IT
  8. 8. 8 BioTeam. Independent Consulting Shop Run by scientists forced to learn IT to “get science done” Virtual company with nationwide staff 15+ years “bridging the gap” between hardcore science, HPC & IT Honest. Objective. Vendor & Technology Agnostic.
  9. 9. 9 Every. Darn. Year. Why?
  10. 10. 10 Disclaimer I speak mainly from my own personal experiences … and what I learn via osmosis from coworkers, clients & colleagues. Not an expert. Not a pundit. “Thought Leader”? Hell no!
  11. 11. 11 Disclaimer I speak mainly from my own personal experiences … and what I learn via osmosis from coworkers, clients & colleagues. Not an expert. Not a pundit. “Thought Leader”? Hell no! Be cynical. Filter my words through your own insight & experience.
  12. 12. 12 [image: shanelin via Flickr] Tick … Tick … Tick … Insufferable, huh? Let’s talk trends … and steal/recycle some bits from last year …
  13. 13. 13 Still a risky time to be doing Bio-IT
  14. 14. 14 Bottom Line: Science evolves faster than IT can refresh infrastructure & practices
  15. 15. 15 Bench science changes ~monthly … while our IT infrastructure gets refreshed every 2-7 years
  16. 16. 16 Our job is to design systems TODAY that can support unknown research requirements over a multi-year timespan
  17. 17. 17 Our job is to design systems TODAY that can support unknown research requirements over a multi-year timespan Gulp …
  18. 18. 18 The easy period is over.
  19. 19. 19 The easy period is over. Years ago we could stage cheap kit in the lab or a nearby telco closet.
  20. 20. 20 The easy period is over. Years ago we could stage cheap kit in the lab or a nearby telco closet. This approach has not been viable for years now… real solutions are required.
  21. 21. 21 This is the new normal.
  22. 22. 22 This is the new normal.
  23. 23. 23 This is the new normal.
  24. 24. 24 This is the new normal.
  25. 25. 25 And speaking of the Broad Institute
  26. 26. 26 … and smart people who think about infrastructure
  27. 27. Homework Assignment #1 2015 Bio IT World Expo ‣ 11:10am Tomorrow - Track 1 ‣ “Infrastructure, Architecture, and Organization: Data Engineering at Scale at the Broad” • Chris Dwan - Assistant Director, Research Computing and Data Services, Broad Institute of MIT and Harvard • … and the DDN/Qumulo stuff preceding Dwan should be interesting as well 27
  28. 28. 28 Enough groundwork. Let’s Talk Trends
  29. 29. 29 Trends: DevOps, Automation & Org Charts
  30. 30. 30 2015: Time to stop screwing around.
  31. 31. 31 SysAdmins: If you can’t script, your upward career mobility is done.
  32. 32. 32 It’s not “just” cloud & virtualization … Everything will have an API soon
  33. 33. 33 Orchestration, configuration management & automation stacks are already in your enterprise
  34. 34. 34 You will need to learn them. Bonus: You will be 10x more productive
  35. 35. 35 Organizations: “We are not on the cloud yet” is no longer a viable excuse
  36. 36. 36 DevOps methods & infrastructure automation are already transforming on-premise IT
  37. 37. 37 These methods are transformational “force-multipliers” for overworked & understaffed IT teams
  38. 38. 38 Not using these methods in 2015 implies a certain ‘legacy’ attitude or method of operation
  39. 39. 39 .. that your competition may not have
  40. 40. 40 Chef, Puppet, Ansible, SaltStack … Pick what works for you (and that all can agree on) … and commit to learning, using and evangelizing it … ideally across the enterprise (not just Research)
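For anyone who has never driven one of these stacks programmatically, here is a minimal sketch of kicking off a configuration-management run from Python via the ansible-playbook CLI. It assumes Ansible is installed; the playbook and inventory file names are hypothetical placeholders.

```python
#!/usr/bin/env python3
"""Minimal sketch: drive an Ansible playbook run from Python.

Assumes the ansible-playbook CLI is installed and that 'site.yml'
and 'hosts.ini' (hypothetical names) exist in the working directory."""
import subprocess
import sys

def run_playbook(playbook: str, inventory: str, check_mode: bool = True) -> int:
    """Run a playbook; check_mode=True does a dry run (no changes)."""
    cmd = ["ansible-playbook", "-i", inventory, playbook]
    if check_mode:
        cmd.append("--check")  # report what would change without changing it
    result = subprocess.run(cmd)
    return result.returncode

if __name__ == "__main__":
    # Dry-run first; drop check_mode once the diff looks sane.
    sys.exit(run_playbook("site.yml", "hosts.ini", check_mode=True))
```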
  41. 41. 41 Hey Network Engineers … Same API-driven automation trends are steamrolling your way You’ll need more than a Cisco certification in the future
  42. 42. 42 Practical software defined networking will not be “mostly cloud” for long
  43. 43. 43 Practical software defined networking will not be “mostly cloud” for long
  44. 44. Homework Assignment #2 2015 Bio IT World Expo ‣ 2:55pm Today - Track 1 ‣ “Accelerating Biomedical Research Discovery: The 100G Internet2 Network – Built and Engineered for the Most Demanding Big Data Science Collaborations” • Christian Todorov, Director, Network Services Management, Internet2 ‣ Why? Internet2 has some of the most interesting non-cloud SDN stuff in production today 44
  45. 45. 45 Trends: Compute
  46. 46. 46 Unchanged in 2015: Linux clusters are still the baseline platform
  47. 47. 47 Unchanged in 2015: Even our lab instruments know how to submit jobs to the common HPC resource allocation & scheduling tools
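As a concrete illustration of the pattern (a sketch, not any specific vendor's integration): an instrument-attached workstation handing work to a shared cluster. It assumes a Slurm scheduler with sbatch on the PATH; the batch script and run-directory names are hypothetical.

```python
#!/usr/bin/env python3
"""Sketch: submit an analysis job to a shared HPC scheduler from an
instrument-attached workstation. Assumes Slurm (sbatch on PATH);
the script name and run directory are hypothetical."""
import subprocess

def submit_job(script_path: str, run_dir: str, cores: int = 8) -> str:
    """Submit a batch script and return the scheduler's job ID."""
    cmd = [
        "sbatch",
        "--parsable",              # print just the job ID on success
        "-D", run_dir,             # working directory for the job
        f"--cpus-per-task={cores}",
        script_path,
    ]
    out = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return out.stdout.strip()

if __name__ == "__main__":
    job_id = submit_job("process_run.sh", "/data/runs/2015-04-21", cores=8)
    print(f"Submitted job {job_id}")
```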
  48. 48. 48 Unchanged in 2015: Still feels like a solved problem. Compute is a commodity. Compute power is rarely challenging these days; most of the interesting action is with ‘outlier’ projects. OK, designing a 60,000 CPU core HPC environment is still hard :)
  49. 49. 49 Compute: NICs & Disks At 10Gig and higher speeds, careful attention is needed to ensure that our server, NIC and disk configurations do not become the new bottleneck. Pay attention to NIC selection and consider host/kernel tuning when playing at 10Gig and above. The latest Intel Haswell kit is driving 40Gig NICs today.
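A hedged sketch of what “host/kernel tuning” can mean in practice: checking a few Linux TCP sysctls that commonly matter at 10Gig and above. The “suggested” values are illustrative assumptions only, not recommendations from this talk.

```python
#!/usr/bin/env python3
"""Sketch: check a few Linux kernel TCP settings that commonly matter
at 10Gig+. The 'suggested' values are illustrative assumptions only --
tune for your own hosts and workloads."""
from pathlib import Path

# sysctl name -> illustrative value often seen on tuned 10/40Gig hosts
SUGGESTED = {
    "net.core.rmem_max": "134217728",            # max socket receive buffer (bytes)
    "net.core.wmem_max": "134217728",            # max socket send buffer (bytes)
    "net.ipv4.tcp_congestion_control": "htcp",   # high-speed-friendly algorithm
}

def read_sysctl(name: str) -> str:
    """Read a sysctl value via /proc/sys."""
    return Path("/proc/sys/" + name.replace(".", "/")).read_text().strip()

if __name__ == "__main__":
    for name, suggested in SUGGESTED.items():
        current = read_sysctl(name)
        flag = "ok" if current == suggested else "review"
        print(f"{name}: current={current} suggested={suggested} [{flag}]")
```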
  50. 50. 50 Compute: Diversity Trend Still Strong. Homogeneous Research IT is fading away in favor of “capability-based” computing via standard HPC building blocks. HPC life science informatics via modular “Building Blocks”: GPU Compute Block, GPU Visualization Block, Phi Coprocessor Block, Dev/Application Block, Hadoop/HDFS Block, Large Memory Block, Very Large Memory Block, Fat Nodes (large SMP), Thin Nodes (fastest CPUs), Flash/SSD Analytic Block.
  51. 51. 51
  52. 52. 52 Ever see 96 DIMMs in a server? 1.5TB RAM
  53. 53. 53 “FAT GPU” Building Block Example Image source: nor-tech.com
  54. 54. 54 Compute: FPGAs & Phis. 2015: Thoughts on hardware acceleration largely unchanged. Exotic hardware has its place but will not rule the world. Why not? “... the activation energy required for a scientist to use this stuff is generally quite high ...” Best deployed as a point solution for a pain point or as a component of a large-scale / high-value analytical workflow.
  55. 55. Homework Assignment #3 2015 Bio IT World Expo ‣ 1:10pm Today - Lunch Presentation II ‣ “Optimizing Genomic Sequence Searches to Next-Generation Intel Architectures” • Bhanu Rekepalli, Ph.D., Senior Scientific Consultant & Principal Investigator, BioTeam Inc. ‣ Interested in Phi? Bhanu has deep experience and will be talking at lunch about Intel Phi used for massively scalable BLAST searches 55
  56. 56. 56 Topic / Trend: Converged Infrastructure Yep. 100Gig line card in the wild
  57. 57. 57 HyperConvergence in Research IT Not yet a widespread trend. Something to watch though Warning/Caveat: We see it mostly in very large greenfield deployments Feels like scale-out petabyte+ NAS a few years ago — This may be an area that most of us “watch” to see how the big orgs approach it … and then we copy them 2-3 years later
  58. 58. 58 HyperConvergence in Research IT. “ISP Model” seeing use within large campus network upgrade projects. Small examples: ultra-converged virtualization blocks packed with CPU/Disk/Flash/NICs; Avere kit front-ending on-prem NAS/Object + Google & Amazon object stores; DDN disk arrays running native applications via onboard hypervisors; iRODS + object store efforts (including Cleversafe/BioTeam work …).
  59. 59. 59 HyperConvergence in Research IT. One big example (and topic to watch …): Infiniband EDR, particularly the Mellanox stuff. Mellanox ConnectX-4 VPI and EDR Infiniband gets you: 684 ports of 100Gig performance in one director-class switch; split-personality host adaptors supporting IB, Ethernet or both; Infiniband at 56Gb/s FDR or 100Gb/s EDR; Ethernet at 1GbE, 10GbE, 40GbE, 56GbE, 100GbE.
  60. 60. 60 Infiniband Convergence EDR in the Core & FDR @ the Edge enable large non-blocking HPC designs This is enabling some cool stuff in large greenfield projects: Infiniband for parallel filesystem access AND low-latency MPI apps 10/40/100 Gig Ethernet wherever you need it 7-figure CAPEX cost savings at very large scale (**) Compute, storage & message passing on one managed fabric
  61. 61. 61 One more thing …
  62. 62. 62
  63. 63. 63 2015 Converged IT Summit 1st ever BioTeam conference series (w/ CHI of course) 2-day meeting of the minds / Total focus on life science topics Brochures @ BioTeam Booth #357 September 9-10, 2015 Intercontinental Hotel, San Francisco USA http://convergeditsummit.com
  64. 64. 64 Trends: Cloud
  65. 65. 65 Cloud-based Science: Still real. Still useful. Still growing strong. Still polluted by marketeers and thick layers of BS. One quick recap and then a few slides on some BioTeam 2015 “firsts”.
  66. 66. 66 Cloud-based Science: Cost and economics ARE NOT the primary drivers. The primary cloud/science driver is CAPABILITY: lab instruments are now capable of “write to cloud” operation; a neutral meeting ground for collaboration w/ competitors; ease of data ingest/exchange; IaaS environments like AWS/Google offer capabilities that we can’t easily match on-premise; and many more …
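To make “write to cloud” concrete, a minimal sketch (illustrative only, not BioTeam tooling) of pushing a finished instrument run to object storage with boto3. The bucket, prefix and run-directory names are hypothetical; credentials come from the normal AWS credential chain.

```python
#!/usr/bin/env python3
"""Sketch: 'write to cloud' from an instrument workstation -- push a
finished run folder to S3 with boto3. Bucket name and prefix are
hypothetical."""
from pathlib import Path
import boto3

def upload_run(run_dir: str, bucket: str, prefix: str) -> int:
    """Upload every file under run_dir to s3://bucket/prefix/..."""
    s3 = boto3.client("s3")
    root = Path(run_dir)
    count = 0
    for path in root.rglob("*"):
        if path.is_file():
            key = f"{prefix}/{path.relative_to(root)}"
            s3.upload_file(str(path), bucket, key)
            count += 1
    return count

if __name__ == "__main__":
    n = upload_run("/data/runs/2015-04-21", "example-lab-ingest", "sequencer01/2015-04-21")
    print(f"Uploaded {n} files")
```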
  67. 67. 67 Cloud-based Science: BioTeam IaaS Cloud Milestones (2014-2015): Our first use of a 10Gig Direct Connect circuit to Amazon. Our first use of Internet2 as layer2/3 transit for AWS Direct Connect. Our first large-scale production project on Google Compute. Our first large-scale use of the Google Genomics API.
  68. 68. 68 Cloud-based Science: If Amazon, 10Gig, template-driven HPC, multi-VPC security, Internet2 interests you …
  69. 69. Homework Assignment #4 2015 Bio IT World Expo ‣ 10:40am Thursday - Track 3 ‣ “Next-Generation Sequencing and Cloud Scale: A Journey to Large-Scale Flexible Infrastructures in AWS” • Jason Tetrault, Associate Director, Business and Information Architect, R&D IT, Biogen 69
  70. 70. 70 Cloud-based Science: If Google Genomics API interests you …
  71. 71. 71
  72. 72. 72 Autism Speaks: Variant Finding via Google Genomics API (I can’t disclose anything else) Interested? Chat with one of the devs @ the BioTeam Booth
  73. 73. 73 Trends: Storage
  74. 74. 74 Storage: Lots of attention to this area in the 2014 talk. No time to recap all of the stuff that has only modestly changed; check out the 2014 Trends slides online at http://slideshare.net/chrisdag. Today: Quick recap followed by new stuff/thoughts/trends.
  75. 75. 75 Storage: Quick Recap Still a huge pain point Still amazing ways to waste large amounts of money Petabyte-class storage has not been scary for years now Single Tier of Scale-out NAS or Parallel FS insufficient in 2015 Multiple Tiers are a requirement; probably multi-vendor
  76. 76. 76 Storage: Reasonable Tier Example 5-40TB SSD/Flash tier for ingest & IOPS-sensitive workflows 50-400TB tier (SATA,SAS,SSD mix) for active processing Petabyte-capable (Cloud/Object/SATA) nearline tier 100TB - 1PB “Trash” Tier (optional) 100TB - 500TB Fast Scratch (optional)
  77. 77. 77 Storage: Object Is the Future. Not a trend. Yet. I’m still a believer though … Object storage is the future of scientific data at rest. Expect a lot more on this in the 2016 talk …
  78. 78. 78 Storage: Object Is the Future Don’t believe me? Check out how many object vendors are on the show floor this week! Amplidata, Avere*, CleverSafe, DDN, Swiftstack Inc. etc. etc.
  79. 79. 79 Storage: Object Is the Future. This is what my metadata looks like on a POSIX filesystem: Owner. Group membership. Read/write/execute permissions based on Owner or Group. File size. Creation/Modification/Last-access Timestamps.
  80. 80. 80 Storage: Object Is the Future This is what I WOULD LIKE TO TRACK on a PER-FILE basis: What instrument produced this data? What funding source paid to produce this data? What revision was the instrument/flowcell at? Who is the primary PI or owner of this data? Secondary? What protocol was used to prepare the sample? Where did the sample come from? Where is the consent information? Can this data be used to identify an individual? What is the data retention classification for this file? What is the security classification for this file? Can this file be moved offsite? etc. etc. etc. …
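A small illustration of the gap (a sketch, not from the talk): os.stat() surfaces only the POSIX attributes from the previous slide, while an object store (the S3 API via boto3 shown here) lets arbitrary key/value metadata like the list above ride along with each object. Every key, value, bucket name and file name below is illustrative.

```python
#!/usr/bin/env python3
"""Sketch: POSIX metadata vs. object-store metadata. os.stat() gives
owner/size/timestamps only; an object store lets arbitrary key/value
metadata travel with each object. All names/values are illustrative."""
import os
import boto3

path = "sample_001.bam"   # hypothetical data file
st = os.stat(path)
print(f"POSIX view: uid={st.st_uid} size={st.st_size} mtime={st.st_mtime}")

# The kind of metadata we actually want to keep with the data:
science_metadata = {
    "instrument": "sequencer-07",
    "flowcell-rev": "v3",
    "funding-source": "example-grant-id",
    "primary-pi": "jdoe",
    "consent-ref": "IRB-2015-042",
    "retention-class": "7-years",
    "offsite-ok": "no",
}

s3 = boto3.client("s3")
with open(path, "rb") as fh:
    s3.put_object(
        Bucket="example-research-archive",   # hypothetical bucket
        Key=f"projects/example-study/{path}",
        Body=fh,
        Metadata=science_metadata,           # stored as x-amz-meta-* headers
    )
```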
  81. 81. 81 Storage: Object Is the Future Historically metadata has been tracked via several methods: DIY LIMS and Relational Database Systems iRODS or other “metadata aware” systems … all at significant human, development & operational cost
  82. 82. 82 Storage: Object Is the Future. My gut feeling for the future: Economics and tech benefits like erasure coding will draw interest, but most adoption will be motivated by the ease with which arbitrary metadata can be stored with each file or object … and later searched / sorted / retrieved based on queries against the stored metadata.
  83. 83. 83 Storage: Object Is the Future Advice for the audience: It will take years for our field to get here. You've got time! When evaluating Object Storage Solutions: … consider scoring or evaluating them on how well they handle metadata search and retrieval operations
  84. 84. 84 Storage: Object Is the Future Would you like object storage with your iRODS? Go talk to the CleverSafe people on the show floor We did some neat stuff with them related to using iRODS with CleverSafe object store backend
  85. 85. 85 Time for a few 2014 war stories?
  86. 86. 86 Storage: Where the Action Is Primary storage is still challenging, but … The really interesting work lies at the edges: Small & Cheap Storage Ludicrously Large Storage Some examples …
  87. 87. 87 Storage War Story 1: Small Ingest. Pharmaceutical company with an ingest issue: funky lab instrument puts 30,000 tiny files in one directory. Trivial from a size viewpoint — ~6GB per experiment. Copying each experiment across a 1Gb link to the SAN took HOURS. Root cause: SAN system choking on tiny-file metadata.
  88. 88. 88 Storage War Story 1: Small Ingest. This was a VERY interesting project. There are MANY large/expensive systems that can handle small-file ingest, but we didn’t need BIG and couldn’t afford EXPENSIVE. Incredibly difficult to find a small 5-10TB usable solution with the right mix of hardware to handle small-file ingest. Winner: NexSan, due to their FASTier smart SSD caching.
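For illustration only (a generic technique, not the NexSan fix described above): one common way to blunt tiny-file ingest pain is to bundle each experiment into a single tar archive before it crosses the wire, so the target system sees one large sequential write instead of tens of thousands of metadata operations. Paths are hypothetical.

```python
#!/usr/bin/env python3
"""Sketch: bundle an experiment's ~30,000 small files into one tarball
so the target filesystem sees a single big sequential write instead of
tens of thousands of metadata operations. Paths are hypothetical."""
import tarfile
from pathlib import Path

def bundle_experiment(src_dir: str, dest_tar: str) -> int:
    """Tar up every file in src_dir; return the number of files bundled."""
    src = Path(src_dir)
    count = 0
    with tarfile.open(dest_tar, "w") as tar:   # no compression: raw data rarely shrinks
        for path in sorted(src.iterdir()):
            if path.is_file():
                tar.add(path, arcname=path.name)
                count += 1
    return count

if __name__ == "__main__":
    n = bundle_experiment("/instrument/exp_0421", "/san/ingest/exp_0421.tar")
    print(f"Bundled {n} files into one archive")
```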
  89. 89. 89 Storage War Story 2: Backblaze Do these pictures look familiar?
  90. 90. 90
  91. 91. 91
  92. 92. 92
  93. 93. 93 Storage War Story 2: Backblaze Most popular BioTeam blog post ever 2011: 135TB raw storage for $12,000 Guess what?
  94. 94. 94 Next-Generation “Pod” Now available
  95. 95. 95 Storage War Story 2: Backblaze. The Backblaze ‘pod’ style still lives on, now with much better fault tolerance and no more custom wiring harnesses! What can YOU do with 45x 6TB drives at a rock-bottom price?
  96. 96. 96
  97. 97. 97
  98. 98. 98
  99. 99. 99 Storage War Story 2: Backblaze. Hope to blog about this in Summer 2015 … including updated ‘real world’ costs … and their 30-drive, totally silent chassis (!)
  100. 100. 100 One last storage war story … From a lab with long history of innovative storage projects
  101. 101. 101 Storage War Story 3: Petascale Disruption Very cool 2014 project BioTeam + Pineda Lab @ Johns Hopkins Intel Lustre + Linux + ZFS + Commodity HW
  102. 102. 102 Storage War Story 3: Petascale Disruption 2 Petabytes (raw) / 1.4PB usable for $165,000 PUE of 1.5 = $10,000/year in electrical savings Performance close to much more $$$ options Expect to see more details released in 2015
  103. 103. 103 Transition Time.
  104. 104. 104 Last section of this talk is going to discuss what keeps me up at night.
  105. 105. 105 This is what I expect my hardest projects will involve 2015-2016 and beyond …
  106. 106. 106 Tipping Point #1 Effort/cost of generating or acquiring vast piles of data in 2015 is far less than real world cost of storing and managing that data through a realistic lifecycle.
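A back-of-the-envelope sketch of Tipping Point #1. Every dollar figure below is an assumption chosen purely for illustration; substitute your own numbers.

```python
#!/usr/bin/env python3
"""Back-of-the-envelope sketch for Tipping Point #1.
All dollar figures are illustrative assumptions, not quotes."""

dataset_tb = 50                 # one project's instrument output (assumed)
generation_cost = 20_000        # assumed reagents + instrument time ($)

storage_cost_per_tb_year = 300  # assumed fully loaded primary storage ($/TB/yr)
replica_copies = 2              # primary + backup/DR copy (assumed)
lifecycle_years = 7             # assumed retention requirement

lifecycle_cost = dataset_tb * storage_cost_per_tb_year * replica_copies * lifecycle_years
print(f"Cost to generate the data:        ${generation_cost:,}")
print(f"Cost to keep it for {lifecycle_years} years:      ${lifecycle_cost:,}")
# With these assumptions the keep-it cost ($210,000) dwarfs the cost of
# producing the data in the first place.
```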
  107. 107. 107 Tipping Point #2 Scientists still believe storage is cheap & near-infinite. Data triage no longer sufficient. Scientists rarely asked to articulate a scientific/business case for storage.
  108. 108. 108 Tipping Point #3 Centralized infrastructure models are not sufficient and must be modified. Data & compute WILL span sites and locations with or without active IT involvement. “Data Spread” is real. We need to start preparing now.
  109. 109. 109 “Center Of Gravity” Problem
  110. 110. 110 “Center Of Gravity” Problem Current methods involving centralized storage and bringing “users” and “compute” very close “… to the data” are going to face significant problems in 2015 and beyond.
  111. 111. 111 “Center Of Gravity” Pain #1 Terabyte class instruments. Everywhere. Gulp. We cannot stop this trend - large-scale data generation will span labs, buildings, campus sites & WANs
  112. 112. 112 “Center Of Gravity” Pain #2 Collaborations & Peta-scale Open Access Data The future of large scale genomics|informatics increasingly involves multi-party / multi-site collaboration. Also: Petabytes of free data (!!)
  113. 113. 113 “Center Of Gravity” Pain #3 Object Storage Less Effective @ Single Site Object storage is the future of scientific data at rest. Some major side benefits (erasure coding, etc.) can only be realized when 3 or more sites are involved
  114. 114. 114 “Center Of Gravity” Summarized Data spread is unavoidable. Effectively Unstoppable. We have a WAN-scale data movement/access problem. There are ~2 viable approaches going forward ...
  115. 115. 115 Option 1 - “Stay Centralized”: Still totally viable, but much faster connectivity to instruments & collaborators will be essential. Nutshell: Significant investment in edge/WAN connectivity is required, likely with bandwidth exceeding 10Gbps.
  116. 116. 116 Option 2 - “Go With The Flow”: Embrace the distributed & “cloudy” future where compute & storage span multiple zones. Nutshell: Still requires massive bandwidth upgrades to support metadata-aware or location-aware access & compute.
  117. 117. 117 It all boils down to …
  118. 118. 118 Terabyte-scale data movement is going to be an informatics “grand challenge” for the next 2-3+ years And far harder/scarier than previous compute & storage challenges
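A quick sketch of why terabyte-scale movement hurts: wall-clock time to move a single terabyte at various link speeds. The 60% effective-utilization factor is an assumption standing in for protocol overhead, shared links and imperfect tuning.

```python
#!/usr/bin/env python3
"""Sketch: wall-clock time to move 1 TB at various link speeds.
The 60% effective-utilization factor is an assumption standing in for
protocol overhead, shared links and less-than-perfect host tuning."""

TB_BITS = 1e12 * 8      # 1 terabyte expressed in bits
UTILIZATION = 0.6       # assumed fraction of line rate actually achieved

for gbps in (1, 10, 40, 100):
    seconds = TB_BITS / (gbps * 1e9 * UTILIZATION)
    print(f"{gbps:>3} Gbps link: {seconds / 3600:6.2f} hours "
          f"({seconds / 60:7.1f} minutes) per terabyte")
# At 1 Gbps a single terabyte eats most of a workday; daily terabyte-class
# instrument output makes the LAN:WAN bandwidth ratio problem obvious.
```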
  119. 119. 119 History Time …
  120. 120. Research IT vs. Enterprise IT: Long history of engagement & cooperation ‣ Historically our infrastructure requirements often surpassed what the Enterprise uses to sustain day-to-day operation ‣ We’ve spent ~20 years working closely with Enterprise IT to enable “data intensive science” ‣ Relatively easy to align informatics IT infrastructure with established vendor, product, technology and architecture standards ‣ This held true until this year … 120
  121. 121. $#%(*&@#@*&^@!*^@!(*&# !!!!!!!!!!!!!!!!!!!!!!!! 121 Data Movement Prepare For Pain …
  122. 122. 122 2015 Grand Challenge Large-scale Data Movement (and why this will be very difficult …)
  123. 123. 123 Issue #1 Current LAN/WAN stacks bad for emerging use case Existing technology we’ve used for decades has been architected to support many small network flows; not a single big data flow
  124. 124. 124 Issue #2 Ratio of LAN:WAN bandwidth is out of whack We will need faster links to “outside” than most organizations have anticipated or accounted for in long-term technology planning
  125. 125. 125 Issue #3 Core, Campus, Edge and “Top of Rack” bandwidth Enterprise networking types can be *smug* about 10Gbps at the network core. Boy are they in for a bad surprise.
  126. 126. 126 Issue #4 Bigger blast radius when stuff goes wrong Compute & storage can be logically or physically contained to minimize disruption/risk when Research does stupid things. Networks, however, touch EVERYTHING EVERYWHERE. Major risk.
  127. 127. 127 What We Need: - Ludicrous bandwidth @ network core - Very fast (10-40Gbps) ToR, Edge, Campus links - 1Gbps - 10Gbps connections to “outside” - Switches/Routers/Firewalls that can support small #s of very large data flows
  128. 128. 128 Why this will be difficult to achieve
  129. 129. 129 Issue #5 Social, trust & cultural issues We lack the multi-year relationship and track record we’ve built with facility, compute & storage teams. We are “strangers” to many WAN and SecurityOps types
  130. 130. 130 Issue #6 Our “deep bench” of internal expertise is lacking Research IT usually has very good “shadow IT” skills but we don’t have homegrown experts in BGP, Firewalls, Dark Fiber, Routing etc.
  131. 131. 131 Issue #7 Cost. Cost. Cost. Have you seen what Cisco charges for a 100Gbps line card?
  132. 132. 132 Issue #8 Cisco. Cisco. Cisco. The elephant in the room. Cisco is rarely the 1st choice for greenfield efforts in this space, but Cisco shops often refuse to entertain any alternatives. Massive existing install base & on-premise expertise must be balanced, recognized & carefully handled.
  133. 133. 133 Issue #9 Firewalls, SecOps & Incumbent Vendors Legacy security products supporting 10Gbps can cost $150,000+ and still utterly fail to perform without heroic tuning & deep config magic. Alternatives exist but massive institutional inertia to overcome. Deeply Challenging Issue.
  134. 134. Wrapping Up … 134
  135. 135. 135 ‣ Peta-scale becoming the norm, not exception ‣ Compute is a commodity; Storage getting there ‣ Historically it has been pretty easy to integrate “Research Computing” with “Enterprise” facilities and operational standards ‣ We can no longer assume the majority of our infrastructure will reside in a single datacenter
  136. 136. 136 ‣ We need a massive increase in end-to-end network connectivity & bandwidth ‣ … and kit that can handle large data flows ‣ Current state of “Enterprise” LAN/WAN networking is not aligned with emerging needs: • Cost, Capability, Performance, Security …
  137. 137. 137 ‣ New hardware, reference architectures, best practices and methods will be required ‣ There is no easy path forward …
  138. 138. 138 ‣ And this brings us to … ‣ ScienceDMZ
  139. 139. 139 ‣ Science DMZ • Only viable reference architecture & collection of operational practices / philosophy BioTeam has seen to date • In-use today. Real world. No BS. • High level visibility & support within US.GOV, grant funding agencies and supporters of data intensive science and R&E networks
  140. 140. 140 ‣ BioTeam has three current ScienceDMZ projects going on right now. Speeds ranging from 10Gig to 100Gig ‣ This is likely just the beginning of a long and difficult transformation in our world ‣ We are going to try to collect useful public info at http://sciencedmz.org starting this summer
  141. 141. 141 The “come to jesus” graph …
  142. 142. Two final announcements … 142
  143. 143. 143 ‣ Science DMZ Overview Webinar • May 18, 2-4pm EDT • http://bioteam-events.webex.com • No BS; No Hype; No Marketing - 60 min of content from the inventors of Science DMZ (ES.NET, of course!) - 60 min for questions/discussion
  144. 144. 144 ‣ Announcing BioTeam 100Gig ConvergedIT Lab • Hosted at Texas Advanced Computing Center (“TACC”) • Compute/storage/networking/security kit all available for use/experimentation • Access to TACC 100Gig Internet2 circuit • Access to STAMPEDE and other TACC Supercomputers • Support from Intel, Juniper and many other vendors (Hint, hint!) • Goal #1: Showcase and test ScienceDMZ reference architectures for LifeSci • Goal #2: Have a killer demo for SuperComputing 2016 :)
  145. 145. 145 end; Thanks! slideshare.net/chrisdag/ chris@bioteam.net @chris_dag
  146. 146. 146 I am only here today because #TurkeyFAILED at #Genocide #100YearsOfDenial www.neverforget1915.us
