Trends from the Trenches: 2019

Annual address covering trends, emerging requirements, pain points and infrastructure issues in the "Bio-IT" aka life science informatics and HPC realm; Email me if you want a PDF of this talk - chris@bioteam.net


  1. 1. Trends from the Trenches 2019 Bio-IT World Conference Chris Dagdigian https://bioteam.net
  2. 2. Want these slides? slideshare.net/chrisdag or https://bioteam.net
  3. 3. Image by Deanna & Amy; used with permission https://metoostem.com/ ● Seems appropriate to include this ● Recurring 2019 theme for me has been listening to the stories of women forced out of early-career academic paths or jobs because of systemic harassment & bias in STEM fields
  4. 4. @chris_dag - https://bioteam.net I’m Chris. I work for BioTeam ● Failed scientist turned infrastructure nerd ● 20 years working on infrastructure for life science research; Now I’m old & lame ● As a consultant I get to see how many different groups of smart people tackle similar challenges ● Often I’m allowed to talk about what I see so I collect trends, observations and common pain points ● Started talking at BioIT in 2010 and they won’t let me gracefully retire
  5. 5. Thought Excretor Magic Quadrant Competence / Domain Insight Can talk bluntly in public @hpc_guru @fdmnts @glennklockwood … you get the idea @{ many smart people } @{ vendor shills } @mndoci
  6. 6. @chris_dag - https://bioteam.net Tune Me Out or Filter My Words Accordingly ● Not a pundit ● Not a “thought leader” ● Not pretending to speak on behalf of our huge and diverse industry ● This is a personal talk delivered through the prism of prior work, clients, projects and conversations ● Lots of industry/government work recently ● My observations have the same diversity/inclusion problems as science/workplaces in general ● Heavily influenced by past and current projects and the interesting people I’ve spoken or interacted with
  7. 7. 2019 Catch-all: Observations, Anecdotes & Emergent Stuff
  8. 8. @chris_dag - https://bioteam.net 01: We’ve done OK entering “data intensive science” era Turbulent for sure but we’ve managed ... ● Compute ○ Physical, virtual and cloud based computing is a tractable problem at most scales ● Networking ○ > 10-Gbps still painful and expensive ○ Science DMZ design patterns are working ● Storage ○ Large capacity is a solved problem ○ Consumption rate still scary One of the biggest unsolved problems ● Data Management, Discovery, Cataloging and Classification ● It’s easy to store vast piles of data; we are still terrible at understanding what we have Question: How many vendors and products did you see this week at BioIT’19 explicitly focusing on data management, curation, metadata or discovery?
  9. 9. @chris_dag - https://bioteam.net 02: Scientific Computing: Still Undervalued By Leadership Prior Talks / Younger Me ● “Computers are digital benchtops, not the simple business process endpoints that Enterprise IT treats them as” ● “HPC capability is essential for R&D; we need leadership and investment parity with the wetlab folk” Today’s Talk / Older, Heftier & Wiser ● Still viewed as cost center to be minimized, optimized and “value engineered” ● Only a few treat “extract insight & value from data” as core competitive differentiator beyond vapid words in mission statements ● HR not touting as major recruitment and retention asset Incompetence in this space is an existential survival threat to your company or organization.
  10. 10. @chris_dag - https://bioteam.net 02: Scientific Computing: Still Undervalued By Leadership Incompetence in this space is an existential survival threat to your company or organization.
  11. 11. @chris_dag - https://bioteam.net 03: Scientific Computing: User Trends User Base Climbing Rapidly ● HPC and analytic capabilities are extending from discovery and spreading across the enterprise ● Pervasive need for HPC and analytic competence across the entire organization ○ % of staff outgrowing “laptop scale” analysis is climbing fast ○ Competitive differentiator ○ Recruitment/Retention resource ○ Survival requirement in 2019 Users getting MORE and LESS Sophisticated ● Users forced away from laptop-scale methods require significant training and onboarding ● Yet new hires and early-career recruits often show up with prior HPC and cloud expertise ● In general we are still pretty bad at training, best practice propagation and knowledge transfer ○ Especially in helping “intermediate” level users become experts Past talks ... Today ...
  12. 12. @chris_dag - https://bioteam.net 04: Definition of HPC Being Stretched In Extreme Ways ● “Beyond laptop scale” computing requirement is becoming pervasive across organizations ● We are often the only multi-user / shared-service entities with large scale compute, storage, memory, GPU and visualization capabilities ● HPC in danger of becoming the dumping ground for any problem that does not fit on a laptop ● Parasitic usage causes: ○ Infrastructure tuned & biased for generic workflows ○ Support org becomes even more overwhelmed ○ Angry users demanding high-touch support and special accommodations for niche stuff “Any analysis that can’t run on a cheap leased laptop must require HPC” -- league of bad mgmt
  13. 13. @chris_dag - https://bioteam.net 05: Compilers & Toolchains: Mini Trend? Coming out of a relatively stable era ● Intel dominated compute ● Genomics/informatics dominated workload ● Hardware & software well characterized ● 10+ years since I had to mess with commercial compilers This may be changing … ● Ludicrous rate of innovation seen in the instrument space is starting to appear in our tooling & applications ● Now? ○ Software in rapid improve/innovate cycles ○ Kernels and kernel modules matter ○ Compiler & glibc versions matter ○ Conservative RHEL/CentOS Linux distributions may be moving too slowly for some scientific domains ● May be time to re-evaluate some of our foundational environments & toolchains
  14. 14. @chris_dag - https://bioteam.net 05: Compilers & Toolchains: Mini Trend? The latest/fastest CPUs are expensive. GPUs are expensive. NVLINK is expensive. DGX-2 list price is $400,000/ea A reasonable investment in compiler and toolchain optimizations could pay significant dividends
  15. 15. @chris_dag - https://bioteam.net 06: Compilers & Toolchains : Relion CryoEM Homework Try this at home, kids! (if you have CPUs with AVX-512 support) ● Download latest Relion codebase from https://github.com/3dem/relion ● Test #1 ○ Build using stock compiler and developer tools with CPU acceleration enabled ○ Run & time “relion_refine” using the common benchmark data set and commands ● Test #2 ○ Repeat build with upgraded compiler and developer tools (ie GCC-7 on CentOS/RHEL 7) ○ Time how long the run takes ● Test #3 (if possible) ○ Repeat work using Intel ICC compiler (Intel Parallel Studio) ○ Time how long the run takes
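The three timed tests above could be driven by a small harness like the following. This is a minimal sketch, not the author's method: `time_command` and `speedup` are hypothetical helper names, and the command you pass in is whatever benchmark invocation you already use (no Relion flags are assumed here).

```python
import subprocess
import sys
import time


def time_command(argv):
    """Run a command to completion and return its wall-clock time in seconds."""
    start = time.perf_counter()
    subprocess.run(argv, check=True)  # raises CalledProcessError on failure
    return time.perf_counter() - start


def speedup(baseline_seconds, candidate_seconds):
    """Return how many times faster the candidate run was than the baseline."""
    return baseline_seconds / candidate_seconds


if __name__ == "__main__":
    # Stand-in command so the sketch runs anywhere; replace it with your own
    # benchmark command line for each build (stock, upgraded GCC, ICC).
    elapsed = time_command([sys.executable, "-c", "pass"])
    print(f"elapsed: {elapsed:.3f} s")
```

Run each build's benchmark several times and compare medians; a single run on a shared system is easy to misread.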
  16. 16. @chris_dag - https://bioteam.net 07: Call To Action - Bigger Relion CryoEM Benchmark Sets ● The most prevalent/popular Relion benchmark uses a ~50GB input data set ● Everyone appears to be using it; especially vendors trying to sell you stuff ○ Small enough to fit in RAM and hit the caching effect of almost all storage systems ○ We (BioTeam) do not believe this is a realistic test in 2019 ■ … for anything other than getting compiler and CPU/GPU optimizations correct ● Seeking multi-terabyte CryoEM data organized for Relion 2D or 3D classification ○ We think the community needs MUCH larger benchmarking resources and data sets ○ We will happily host, share & re-distribute ○ We will publish our own results testing against this data ○ Contact chris@bioteam.net How many scientists do you know with CryoEM experimental data sets that are less than 60GB in size?
  17. 17. @chris_dag - https://bioteam.net 08: Machine Learning & Training Data - Awesome and Ugly ● Proper ML/AI requires lots of training data ● Need “training” & “validation” sets ● The data engineering work is non-trivial ● Metadata is essential; bad data will sink you ● Competitors with “better” data will beat you The race to acquire, generate or license has the potential to be both awesome and ugly ● Significant opportunities for both innovation and abuse Innovation Example ● We are starting to see organizations doing really interesting things to acquire the training data they need ● An Example: Publish/Host useful tools on the cloud ○ Users get access to sophisticated analysis resources they do not have locally ○ Opt-in data sharing process generates ... ○ 30,000 de-identified MRI scans per week
  18. 18. @chris_dag - https://bioteam.net Topic: Lean Times & Resource Scarcity
  19. 19. @chris_dag - https://bioteam.net "Unit cost of storage is decreasing but not as fast as data production is increasing. Our computing costs grow ~10%/year while budget grows at ~3% so we've had to cut [research] mission to preserve essential capability " -- Scientific leader @ nationally recognized institution
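The arithmetic behind that quote is simple compounding. Here is a minimal sketch, assuming costs and budget each grow at a fixed annual rate; the function name and the starting figures are hypothetical, and only the ~10% and ~3% rates come from the quote.

```python
def years_until_shortfall(cost, budget, cost_growth, budget_growth, max_years=100):
    """Return the first year (1-based) in which cost exceeds budget, else None.

    cost and budget are starting annual amounts in the same currency;
    the growth arguments are fractions, e.g. 0.10 for 10% per year.
    """
    for year in range(1, max_years + 1):
        cost *= 1 + cost_growth
        budget *= 1 + budget_growth
        if cost > budget:
            return year
    return None


if __name__ == "__main__":
    # Hypothetical start: computing consumes half of the total budget today.
    print(years_until_shortfall(50.0, 100.0, 0.10, 0.03))
```

With that hypothetical starting point, computing alone outgrows the whole budget in year 11, which is why something has to give long before then.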
  20. 20. @chris_dag - https://bioteam.net Lean Times: Prior Talks ● Cheaper to repeat experiment than store the data over full lifecycle ● Unit cost of storage out of sync with ease of data generation ● Petabytes of open access data easily available; & valid reasons to use it ● IT knows you haven’t touched that data in years Also: ● Deleting raw and derived scientific data is OK ● Performing data triage is OK ● … as long as data deletion decisions are made by scientists, not IT
  21. 21. @chris_dag - https://bioteam.net Lean Times: Today ● Data management still a source of existential dread for Bio-IT ● Core problem has seeped beyond “it is easier to acquire vast piles of scientific data than it is to sensibly and safely store it over time” ● Today we see single scientists asking research questions that can totally consume a leadership-class supercomputer or system like ANTON-2 For biotech/pharma this means our researchers can easily swamp any system of any size or capability we can reasonably deliver. That is … not sustainable.
  22. 22. @chris_dag - https://bioteam.net Lean Times: What this could mean in coming years ● We stop half-assing governance in discovery-oriented Bio-IT? ● Scientific computing orgs tighten scope, scale & supported services ● HPC resource allocation explicitly under control of scientific leadership ○ Remember it is *never* appropriate for IT to make these types of decisions ● What about moonshots and open-ended research? ○ Maybe we adopt DOE/NSF national lab model and hand out internal credits, grants or allocations for researchers to “spend” however they see fit ...
  23. 23. @chris_dag - https://bioteam.net Lean Times: Effective Operation Principles Required: ● Governance driven by Science (not IT groups) becomes essential ● Honest & transparent operational cost data spanning cloud/on-prem ● Full transparency of usage and resource allocation metrics ● Good logging of scientific tools and codes being invoked
  24. 24. @chris_dag - https://bioteam.net Lean Times: It’s not all doom and gloom Talking to people who have lived this before: ● Forced hard examination of bespoke/custom/standalone systems (silos) ● Helped push for internal agreement and alignment re: adopting common platforms, APIs and shared services/sysadmin operations ● “Made us think hard about how to run technological operations in a different way”
  25. 25. @chris_dag - https://bioteam.net Topic: Silicon Matters Again
  26. 26. @chris_dag - https://bioteam.net Silicon Matters Again: Then & Now Prior Talks / Younger Me ● “Compute is commodity” ● “Intel x86 rules the world” ● “GPU usage starting to differentiate between visualization and MD/Chemistry” Today’s Talk / Older, Larger & Wiser ● Ahh crap … ● CPUs, GPUs, FPGAs and custom silicon are back on the table again and it’s getting messy Bottom Line: ● Significantly more benchmark & eval work ● Developer preference vs. Cost/ROI analysis ● GIANT EXCEPTION ○ Serverless folk don’t care
  27. 27. @chris_dag - https://bioteam.net Silicon Matters Again: CPU ● AMD is back with EPYC ● … it’s benchmarking time again GPUs ● Increasingly complicated landscape ● Needed for: VDI, Viz, MD/Chem/Structure, ML/AI and CryoEM ● Pain points ○ Need different products (VDI vs. Science) ○ Need various GPU memory configs ○ Need various #s of GPUs per chassis ○ NVLINK - when, where & how much? ○ Will cloud have them when you need them? TPUs, FPGAs & Custom Silicon ● Many trying to differentiate in ML/AI space via custom devices ● Clouds now have proprietary accelerators ● More benchmarking required ● SDK/Framework decisions required ● Deeper engagement with IT required
  28. 28. @chris_dag - https://bioteam.net Topic: Facility & Infrastructure
  29. 29. @chris_dag - https://bioteam.net Facility & Infrastructure: General Observations ● Yes, you still have to do hybrid vs. cloud vs. on-prem analysis & math ● Economics still favor on-prem or colo for 24x7 scientific workloads ○ When other capability or business requirements don’t supersede cost concern ● Why? ○ Cloud-based on-demand elastic computing is easy and well understood ○ Serverless is effing transformative; both for capability and cost, but ... ○ … Persistent, accessible petascale cloud storage is still expensive month over month ○ At petascale egress fees start to matter
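The "analysis & math" can start as small as this. It is a rough sketch under loud assumptions: every price and volume is a parameter you supply (none are real vendor prices), the function names are hypothetical, and the model ignores staffing, refresh cycles and discounts.

```python
import math


def monthly_cloud_storage_cost(stored_tb, price_per_tb_month, egress_tb, egress_price_per_tb):
    """Monthly bill for data at rest plus data transferred out of the cloud."""
    return stored_tb * price_per_tb_month + egress_tb * egress_price_per_tb


def breakeven_months(onprem_capex, onprem_opex_per_month, cloud_cost_per_month):
    """Months until cumulative cloud spend reaches or exceeds on-prem spend.

    Returns None when cloud is not more expensive per month than running
    the on-prem system, in which case there is no breakeven point.
    """
    monthly_saving = cloud_cost_per_month - onprem_opex_per_month
    if monthly_saving <= 0:
        return None
    return math.ceil(onprem_capex / monthly_saving)


if __name__ == "__main__":
    # All figures below are made-up placeholders, not quotes.
    cloud = monthly_cloud_storage_cost(1000, 20.0, 50, 90.0)
    print(cloud, breakeven_months(500_000, 5_000, cloud))
```

Keeping storage and egress as separate terms makes it easy to see which one dominates as the data set and the number of downstream consumers grow.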
  30. 30. @chris_dag - https://bioteam.net Definite Trend: Colocation Suites & Cabinets ● Seeing this actively in 2019 ○ I’ve got an active on-prem to colo (Markley Group) project right now ● Sign of the times: ○ Steve Lister from Novartis is now CTO for HPC & Data Analytics @ Markley Group ● Drivers ○ Cost of new-build or upgrades to on-premise facilities ○ Poor cloud economics for certain 24x7 workloads and use cases ○ Growth, merger & consolidation activities ○ Colocation is the new “Network Hub” for ■ Offsite backup and data continuity efforts ■ Flexible aggregation of cloud connectivity (single cloud or multi-cloud) ■ Speciality links to partners, collaborators and high speed research networks (Internet2) ■ Bespoke IaaS, PaaS, SaaS offerings from colo operators
  31. 31. @chris_dag - https://bioteam.net Story: “Innovation Space” Horror Show
  32. 32. @chris_dag - https://bioteam.net Dumbest Thing I Saw in 2018 ● Where: Boston, Massachusetts ● What: Shiny new top tier incubator/innovation space for life science startups ● Wow: Office, lab space, managed stockroom/chem service, etc. ● WTF: ○ No IT/infrastructure space for tenants. At all. ■ Big Data? Exotic instruments? Data intensive science? Eff you. ■ Don’t want to place a tower server deskside or in the wet lab? Eff you. ■ Shared internet circuit & firewall (logical tenant isolation & traffic shaping though) ○ Any changes require 3-party negotiations (Space Operator, Floor leaseholder, Building owners)
  33. 33. @chris_dag - https://bioteam.net I’m not kidding - Dumbest thing I saw in 2018 ● Brand new incubator space targeting life science startups in the middle of Boston ● … did a new facility build with the assumption that biotech/pharma startups need nothing but laptops, 1Gig network drops and a bit of cloud + managed IT services ● Any physical IT infrastructure not owned by the space operator or managed IT service provider has to live under-desk or inside the wet lab space Yes. A subset of agile, fast-moving startups need nothing more than internet and a set of cloud-based wifi & domain controllers. Building a new facility that caters only to these shops means you’ve guzzled a bit too much telecom vendor and cloud marketing (or you fell asleep on a pile of “CIO Magazine”)
  34. 34. @chris_dag - https://bioteam.net Topic: Org Charts & Scientific Support
  35. 35. @chris_dag - https://bioteam.net Support: Data Intensive Life Science Is A Different Beast Other HPC / Supercomputing : ● Modest set of dominant, well profiled & well optimized domain-specific codes ● May have large user base or very extreme HPC needs but the domain and application landscape is approachable In life science HPC … ● 5 - 5000 users ● 600+ applications spanning 10+ domains ○ Molecular Dynamics, Fluid Dynamics, Structural Biology, Chemistry, Genomics, Bioinformatics, Medical/Clinical, Optical Imaging, EM Imaging, Sensor/IoT, etc. etc. ○ Each of these breaks down into specialities typed by species, disease, organ, pathway etc. etc.
  36. 36. @chris_dag - https://bioteam.net Support: Data Intensive Life Science Is A Different Beast I hate to bust out the “... but life science is SPECIAL and UNIQUE …” take But … If you survey commodity supercomputing and capability supercomputing environments at both small and “national lab” scale you will see stark differences Domain and workload diversity (and crap code) are our distinguishing characteristics
  37. 37. @chris_dag - https://bioteam.net Support: Data Intensive Life Science Is A Different Beast Not a trend because org charts vary wildly by mission & org but ... ● Domain expertise needs to spread to the edge of the org while Scientific Computing groups retain and grow the expertise that spans groups/projects/orgs and domains Domain-Specific Expertise: ● Embedded within the group, lab, program or R&D organization Cross-Domain/Cross-Org Expertise: ● Science Gateways, Portals, Middleware & APIs ● User & Workflow Optimization ● High Value Application Optimization ● Data Engineering ● Data Science & Analytics ● Data Visualization ● CUDA / ML / AI Expertise ● Training Broadly useful cross-org capabilities get consolidated within Scientific Computing; Exotic domain expertise moves to stakeholder teams.
  38. 38. @chris_dag - https://bioteam.net Model We Like: Service Oriented Delivery ● Large scientific computing shops reorganize around scientific use cases and end-user requirements ● … not on technological expertise ● Great way to blow away traditional IT silos ● “Team of teams” approach to service delivery
  39. 39. @chris_dag - https://bioteam.net Topic: Storage
  40. 40. @chris_dag - https://bioteam.net Storage Landscape: Prior Talks ● Everyone needs peta-capable storage ● { insert scary growth of storage graph } OMG OMG OMG ● In tough times it is OK to favor storage capacity over performance ● Scale-out NAS is best storage platform for most ● Parallel File Systems when workload requires due to higher ops burden ● Object Storage is the future of scientific data at rest ● It is easier to generate/acquire vast piles of data than it is to safely and sensibly store and manage it over a full lifecycle - this is a big problem Legacy Dag
  41. 41. @chris_dag - https://bioteam.net Storage Landscape: Today [1] Major fundamental changes ● The capacity|performance calculus has swung the other way ● We now need very fast storage to handle machine learning, AI and image-based workflow requirements ● ML training & validation requires persistent access to the “old” data so we still need massive storage capacity ● Dominant file type at the moment is image-based, no longer genomes Current Dag
  42. 42. @chris_dag - https://bioteam.net Storage Landscape: Today [2] Contributing Factors ● Lots of deployed storage is nearing EOL or end of support contract ● Some really interesting next-gen storage companies have launched ● Parallel storage is a lot more attractive w/ performance as key driver ● Operational benefits of scale-out NAS slightly less valuable in context Current Dag
  43. 43. @chris_dag - https://bioteam.net Storage Landscape: Data As Currency ● Your organization has a big problem when the default stance among leadership or users is “all our data is important” - way worse than “we don’t know how to figure out what is important …” ● Not understanding the true value of data leads to hoarding, massive inefficiencies and inability to properly leverage the data at hand ● Data management, scientifically-relevant metadata, tracking the use, derivative uses, and amount of repeated uses of data could totally change how we approach scientific data storage ● It's about the data, not the storage platform Current Dag
  44. 44. @chris_dag - https://bioteam.net Storage Tiering & Namespaces: Prior Talks ● Single namespace storage is really important ● If we don’t give users a single view of storage we end up with: ○ Multiple islands of data ○ Scientists store the same data in N different locations ○ Nasty data location and data provenance issues ● Seamless tiering within the namespace is desirable if possible Legacy Dag
  45. 45. @chris_dag - https://bioteam.net Storage Tiering & Namespaces: Today [1] ● We've done a bad job at encouraging active data handling ● Data is currency; IT training focuses on "spend wisely" not "manage effectively" ● This is a multi-partner, multi-platform, multi-cloud world ● Global data protection methods hedge against disaster but not personal/group/lab/publication needs ● Still inappropriate for IT to make data classification decisions Current Dag
  46. 46. @chris_dag - https://bioteam.net Storage Tiering & Namespaces: Today [2] Current Dag My biggest attitude change: IT attempts to make seamless namespaces and automatic tiering generally have failed to meet expectations; also hard to do efficiently or without researcher input anyway We need to place data management responsibilities back onto the end-users* Users whining about having to move/manage data when their career and publications are based on “data intensive science” will no longer be coddled. SCIENTIFIC DATA IS YOUR JOB.
  47. 47. @chris_dag - https://bioteam.net Storage Tiering & Namespaces: Today [4] * Some Exceptions: ● There ARE data management tasks that are a waste of time and skill for highly trained scientists ● Biggest example: large-scale physical data ingest and export - scientists should not be dealing with portable hard drives beyond a certain scale ● Large-scale physical data movement needs written SOPs and a process owned by IT Current Dag
  48. 48. @chris_dag - https://bioteam.net Storage Tiering & Namespaces: Today [3] New social contract between IT and Users IT Provides: ● Storage that meets business and scientific requirements ○ Including scratch, active, nearline, archive and object ○ Durable, available and reliable ● Metrics, monitoring, reporting and tools that enable user self-service End User Responsibilities: ● Users responsible for scientific data management through full lifecycle ○ Including classifying, curating, organizing and moving it Current Dag
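One example of a self-service tool IT could hand to users is a report of files nobody has touched in a long time. The sketch below is an illustration only: `find_stale_files` is a hypothetical name, and it assumes access times are meaningful on your filesystem (mounts using `noatime` or `relatime` will make `st_atime` unreliable, so the newer of access and modification time is used).

```python
import os
import time


def find_stale_files(root, max_idle_days, now=None):
    """Return (path, size_bytes) for files under root idle longer than max_idle_days."""
    now = time.time() if now is None else now
    cutoff = now - max_idle_days * 86400  # seconds per day
    stale = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                info = os.stat(path)
            except OSError:
                continue  # file vanished or is unreadable; skip it
            last_used = max(info.st_atime, info.st_mtime)
            if last_used < cutoff:
                stale.append((path, info.st_size))
    return stale


if __name__ == "__main__":
    for path, size in find_stale_files(".", 365):
        print(f"{size:>12}  {path}")
```

The report only surfaces candidates; in keeping with the contract above, deciding what to keep, move or delete stays with the scientists who own the data.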
  49. 49. @chris_dag - https://bioteam.net Storage: What this all means ● The new requirements for speed + capacity are deeply scary ● Image workloads and ML/AI mean we can’t trade away performance in exchange for larger capacity any more ● Enterprise IT has more justification to transition platforms: ○ Conservative shops can buy the faster flash-powered levels of Scale-out NAS ○ Conservative shops can go IBM Spectrum Scale (managed GPFS) ○ Forward-looking shops will bring in new platforms and vendors ○ BeeGFS, Ceph & Lustre will find new audiences ● I’m cool with tiers, namespaces and making end-users more responsible Current Dag
  50. 50. @chris_dag - https://bioteam.net Storage: Interesting Players Metadata, Discovery, Data Protection ● Starfish Storage, https://starfishstorage.com/ ● Atavium, https://www.atavium.com/ ● Arcitecta, https://www.arcitecta.com/ ● Igneous, https://www.igneous.io/ Next-Gen / Flash Storage Architectures ● VAST Data, https://www.vastdata.com/ ● WekaIO, https://www.weka.io/ ● Pure Storage, https://www.purestorage.com/ Current Dag
  51. 51. @chris_dag - https://bioteam.net Storage: Interesting Players, Continued Data Movement ● Globus, https://www.globus.org/ ● DataDobi, https://datadobi.com/ ● Zettar, https://www.zettar.com/ Current Dag
  52. 52. @chris_dag - https://bioteam.net Topic: Networking
  53. 53. @chris_dag - https://bioteam.net Networking: Still the #1 hassle but little change since 2018 Still the #1 IT infrastructure problem in data intensive life science ● Still have trouble moving scientific data at scale across networks ● We still lag in deploying 40-gig and 100-gig networking ● Enterprise IT still focusing on datacenter rather than edge & lab ● We still need to separate business network traffic from science data traffic using Science DMZ design patterns ● Our connections to the Internet and Cloud are still too small ● Our firewalls and security controls are still designed for business traffic and not monster “elephant” flows ● Biggest new thing was Nvidia purchasing Mellanox ! Past & current!
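To put rough numbers on the point about link speeds, here is a small Python sketch of best-case transfer time for a dataset over links of different rates. It is an illustration only: the function name `transfer_hours`, the 100 TB dataset size, and the `efficiency` parameter are assumptions, and real transfers will be slower because of protocol overhead, firewalls, and storage limits at either end.

```python
def transfer_hours(data_terabytes, link_gbps, efficiency=1.0):
    """Best-case hours to move data_terabytes (decimal TB) over a link_gbps link.

    efficiency is a hypothetical fraction (0-1] of the line rate that is
    actually achieved; 1.0 means the theoretical maximum.
    """
    bits = data_terabytes * 1e12 * 8          # TB -> bytes -> bits
    seconds = bits / (link_gbps * 1e9 * efficiency)
    return seconds / 3600


if __name__ == "__main__":
    # Compare common link speeds for a hypothetical 100 TB dataset.
    for gbps in (1, 10, 40, 100):
        print(f"{gbps:>3} Gbps: {transfer_hours(100, gbps):7.1f} hours")
```

At full line rate the same 100 TB that takes more than nine days on a 1-gig link takes a little over two hours on a 100-gig link, which is why the size of the pipe to the Internet and cloud matters so much.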
  54. 54. @chris_dag - https://bioteam.net Topic: Cloud
  55. 55. @chris_dag - https://bioteam.net Cloud: Meta issues still the same but some changes ... Past & current! Consistent message for 10 years now ● Cloud is a capability play for life science research organizations ● Saving money is not the primary driver*
  56. 56. @chris_dag - https://bioteam.net Cloud: Meta issues still the same but some changes ... * About that “not a cost saving thing” message … ● Serverless Computing is transformational for capability ● Serverless Computing is transformational for cost Read this: https://rise.cs.berkeley.edu/blog/a-berkeley-view-on-serverless-computing/ Search engine shortcut: “berkeley view on serverless 2019” ● Primary caveat is that discovery oriented science still relies heavily on interactive human efforts with bespoke tooling. A large chunk of our Bio-IT landscape cannot be codified into APIs & service mesh architectures
  57. 57. @chris_dag - https://bioteam.net Cloud: Meta issues still the same but some changes ... ● Microsoft acquisition of Cycle Computing is really starting to become apparent on Azure Cloud - lots of interesting HPC and storage offerings ● Cloud efforts to build bespoke accelerated hardware for AI/ML and inference are of some concern. What used to be a simple cost or capability eval now will require deep IT interaction with end-users to learn their preferences and needs for SDKs, frameworks and tooling ● Scarcity of GPU resources on AWS has been a consistent trend across multiple projects. We can’t get them at all, let alone within a placement group!
  58. 58. @chris_dag - https://bioteam.net Wrapping Up
  59. 59. @chris_dag - https://bioteam.net Recap - Bottom Line 2019 Summary 1. Unit cost of storage vs. consumption rate will force hard choices and new governance 2. Data discovery, management, curation and movement are still major concerns 3. Storage selection pendulum has moved in a big way. We now have to be BIG and FAST. This will have a major impact 4. Responsibility for scientific data management must lie with the end user and not IT 5. Compilers, toolchains and silicon matter again; it’s time to resurrect the benchmark and eval crew 6. Science users can now swamp systems of any scale with valid research questions; expect governance and service scope constraints to become more prevalent 7. Colo Facilities are being used more often 8. Life Science stands apart in the HPC and supercomputing worlds for the sheer size and diversity of our domains and workloads
  60. 60. @chris_dag - https://bioteam.net Crowdsourcing thanks! Sincere thanks to the folk who responded online with comments and suggestions. Including: ● Philippe Neron ● Matthew Trunnell ● Glenn Lockwood ● Tim Cutts ● Nick Weber ● Tom Bolton ● Dirk Petersen ● Gregg TeHennepe ● Eduardo Zaborowski ● Remy Evard ● Tom Plasterer ● Joe Stanganelli ● Jason Tetrault 2020 is the 10-year BioIT World anniversary! The conference organizers are very interested in what you’d like to see and hear to make next year very special.
  61. 61. End; Thanks!; Want these slides? slideshare.net/chrisdag or https://bioteam.net
  62. 62. Portrait commissioned from the artist who did the illustrations for the “Heroines of JavaScript Trading Cards”. Want your own? https://twitter.com/mirlu_exe
