- The speaker observes that bench science and lab instrumentation are changing more rapidly than IT can refresh systems, creating design challenges. This includes new instruments generating vastly more data.
- There is a blurring of roles between scientists, sysadmins, and programmers as everything becomes more automated and "scriptable." Sysadmins must learn programming and researchers can now self-provision resources.
- Virtualization is widely used even in HPC to provide flexibility and address business needs. Very large "fat node" servers are replacing clusters of smaller nodes. Local disk is coming back as a hedge against big data requirements.
- Object storage is becoming more viable and approachable on commodity hardware for a wide variety of use cases.
5. I’m Chris.
I’m an infrastructure geek.
I work for the BioTeam.
www.bioteam.net - Twitter: @chris_dag
6. BioTeam
Who, What, Why ...
‣ Independent consulting shop
‣ Staffed by scientists forced to
learn IT, SW & HPC to get our
own research done
‣ 10+ years bridging the “gap”
between science, IT & high
performance computing
7. If you have not heard me speak ...
Apologies in advance
‣ “Infamous” for speaking
very fast and carrying a
huge slide deck
• ~70 slides for 25 minutes is
about average for me
• Let me mention what
happened after my Pharma
HPC best practices talk
yesterday ...
By the time you see this slide
I’ll be on my ~4th espresso
8. Why I do this talk every year ...
‣ Bioteam works for
everyone
• Pharma, Biotech, EDU,
Nonprofit, .Gov, etc.
‣ We get to see how
groups of smart people
approach similar
problems
‣ We can speak honestly &
objectively about what
we see “in the real
world”
9. Standard Dag Disclaimer
Listen to me at your own risk
‣ I’m not an expert, pundit,
visionary or “thought leader”
‣ Any career success entirely due
to shamelessly copying what
actual smart people do
‣ I’m biased, burnt-out & cynical
‣ Filter my words accordingly
12. Big Picture / Meta Issue
‣ HUGE revolution in the rate at which
lab platforms are being redesigned,
improved & refreshed
• Example: CCD sensor upgrade on that
confocal microscopy rig just doubled
storage requirements
• Example: The 2D ultrasound imager is
now a 3D imager
• Example: Illumina HiSeq upgrade just
doubled the rate at which you can acquire
genomes. Massive downstream increase
in storage, compute & data movement
needs
‣ For the above examples, do you
think IT was informed in advance?
13. The Central Problem Is ...
Science progressing way faster than IT can refresh/change
‣ Instrumentation & protocols are changing FAR
FASTER than we can refresh our Research-IT &
Scientific Computing infrastructure
• Bench science is changing month-to-month ...
• ... while our IT infrastructure only gets refreshed every
2-7 years
‣ We have to design systems TODAY that can
support unknown research requirements &
workflows over many years (gulp ...)
14. The Central Problem Is ...
‣ The easy period is over
‣ 5 years ago we could toss
inexpensive storage and
servers at the problem;
even in a nearby closet or
under a lab bench if
necessary
‣ That does not work any
more; real solutions
required
16. And a related problem ...
‣ It has never been easier to
acquire vast amounts of data
cheaply and easily
‣ Growth rate of data creation/
ingest exceeds rate at which
the storage industry is
improving disk capacity
‣ Not just a storage lifecycle
problem. This data *moves*
and often needs to be shared
among multiple entities and
providers
• ... ideally without punching holes in
your firewall or consuming all
available internet bandwidth
17. If you get it wrong ...
‣ Lost opportunity
‣ Missing capability
‣ Frustrated & very vocal scientific staff
‣ Problems in recruiting, retention,
publication & product development
22. DevOps & Scriptable Everything
‣ On (real) clouds,
EVERYTHING has an
API
‣ If it’s got an API you can
automate and
orchestrate it
‣ “scriptable datacenters”
are now a very real thing
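For a concrete taste of the "scriptable datacenter," here is a minimal sketch using the Python boto library to provision a server through nothing but an API call. The region, AMI ID and key pair name are placeholder assumptions, and AWS credentials are assumed to be in the environment.

    # Provision a compute node entirely through an API call.
    # Assumes AWS credentials in the environment
    # (AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY).
    import boto.ec2

    # Region is just an example; pick your own.
    conn = boto.ec2.connect_to_region("us-east-1")

    # AMI ID and key pair name below are hypothetical placeholders.
    reservation = conn.run_instances(
        "ami-12345678",
        instance_type="m1.large",
        key_name="my-keypair",
        min_count=1,
        max_count=1,
    )
    instance = reservation.instances[0]
    print("Launched %s (state: %s)" % (instance.id, instance.state))

The same call can be wrapped in a loop or an orchestration tool, which is the whole point: once it is an API call, it is automatable.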
23. DevOps & Scriptable Everything
‣ Incredible innovation in
the past few years
‣ Driven mainly by
companies with
massive internet
‘fleets’ to manage
‣ ... but the benefits
trickle down to us little
people
24. DevOps will conquer the enterprise
‣ Over the past few years
cloud automation/
orchestration methods
have been trickling
down into our local
infrastructures
‣ This will have
significant impact on
careers, job
descriptions and org
charts
25. Scientist/SysAdmin/Programmer
2013: Continue to blur the lines between all these roles
www.opscode.com
‣ Radical change in how IT is
provisioned, delivered,
managed & supported
• Technology Driver:
Virtualization & Cloud
• Ops Driver:
Configuration Mgmt, Systems
Orchestration & Infrastructure
Automation
‣ SysAdmins & IT staff need
to re-skill and retrain to
stay relevant
26. Scientist/SysAdmin/Programmer
2013: Continue to blur the lines between all these roles
‣ When everything has an
API ...
‣ ... anything can be
‘orchestrated’ or
‘automated’ remotely
‣ And by the way ...
‣ The APIs (‘knobs &
buttons’) are accessible to
all, not just the bearded
practitioners sitting in that
room next to the datacenter
27. Scientist/SysAdmin/Programmer
2013: Continue to blur the lines between all these roles
‣ IT jobs, roles and
responsibilities are going
to change significantly
‣ SysAdmins must learn to
program in order to
harness automation tools
‣ Programmers &
Scientists can now self-
provision and control
sophisticated IT
resources
28. Scientist/SysAdmin/Programmer
2013: Continue to blur the lines between all these roles
‣ My take on the future ...
• SysAdmins (Windows & Linux) who
can’t code will have career issues
• Far more control is going into the
hands of the research end user
• IT support roles will radically change
-- no longer owners or gatekeepers
‣ IT will “own” policies,
procedures, reference patterns,
identity mgmt, security & best
practices
‣ Research will control the
“what”, “when” and “how big”
30. Facility 1: Enterprise vs Shadow IT
‣ Marked difference in the
types of facilities we’ve
been working in
‣ Discovery Research
systems are firmly
embedded in the
enterprise datacenter
‣ ... moving away from “wild
west” unchaperoned
locations and mini-
facilities
31. Facility 2: Colo Suites for R&D
‣ Marked increase in use of commercial colocation
facilities for R&D systems
• And they’ve noticed!
- Markley Group (One Summer) has a booth
- Sabey is on this afternoon’s NYGenome panel
‣ Potential reasons:
• Expensive to build high-density hosting at small scale
• Easier metro networking to link remote users/sites
• Direct connect to cloud provider(s)
• High-speed research nets only a cross-connect away
32. Facility 3: Some really old stuff ...
‣ Final facility observation
‣ Average age of infrastructure we work on seems to be
increasing
‣ ... very few aggressive 2-year refresh cycles these days
‣ Potential reasons
• Recession & consolidation still affecting or deferring major
technology upgrades and changes
• Cloud: local upgrades deferred pending strategic cloud decisions
• Cloud: economic analysis showing stark truth that local setups
need to be run efficiently and at high utilization in order to justify
existence
33. Facility 3: Virtualization
‣ Every HPC environment
we’ve worked on since
2011 has included (or
plans to include) a local
virtualization environment
• True for big systems: 2k
cores / 2 petabyte disk
• True for small systems: 96
core CompChem cluster
‣ Unlikely to change; too
many advantages
34. Facility 3: Virtualization
‣ HPC + Virtualization solves a lot of problems
• Deals with valid biz/scientific need for researchers to
run/own/manage their own servers ‘near’ HPC stack
‣ Solves a ton of research IT support issues
• Or at least leaves us a clear boundary line
‣ Lets us obtain useful “cloud” features without
choking on endless BS shoveled at us by
“private cloud” vendors
• Example: Server Catalogs + Self-service Provisioning
36. Compute:
‣ Still feels like a solved
problem in 2013
‣ Compute power is a
commodity
• Inexpensive relative to other
costs
• Far less vendor differentiation
than storage
• Easy to acquire; easy to
deploy
37. Compute: Fat Nodes
Fat nodes are wiping out small and midsized clusters
‣ This box has 64 CPU Cores
• ... and up to 1TB of RAM
‣ Fantastic Genomics/
Chemistry system
• A 256GB RAM version only
costs $13,000*
‣ BioIT Homework:
• Go visit the Silicon Mechanics
booth and find out the current
cost of a box with 1TB RAM
39. Compute: Local Disk is Back
Defensive hedge against Big Data / HDFS
‣ We’ve started to see organizations move
away from blade servers and 1U pizza box
enclosures for HPC
‣ The “new normal” may be 4U enclosures
with massive numbers of local disk bays -
not necessarily occupied, just available
‣ Why? Hadoop & Big Data
‣ This is a defensive hedge against future
HDFS or similar requirements
• Remember the ‘meta’ problem - science is
changing far faster than we can refresh IT. This
is a defensive future-proofing play.
‣ Hardcore Hadoop rigs sometimes operate
at 1:1 ratio between core count and disk
count
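If you do pre-position spindle bays like this, the eventual payoff is that HDFS can be pointed straight at the local disks. A sketch of the relevant hdfs-site.xml stanza (property name per Hadoop 2.x; the mount points are hypothetical):

    <!-- hdfs-site.xml: give each DataNode its local spindles.
         Mount points below are hypothetical examples. -->
    <property>
      <name>dfs.datanode.data.dir</name>
      <value>/data/disk01,/data/disk02,/data/disk03,/data/disk04</value>
    </property>

The DataNode round-robins new blocks across the listed directories, which is why core-to-spindle ratio matters more here than RAID-style aggregation.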
41. Network:
‣ 10 Gigabit Ethernet still the
standard
• ... although not as pervasive as I
predicted in prior trend talks
‣ Non-Cisco options attractive
• BioIT homework: listen to the Arista
talks and visit their booth.
‣ SDN still more hype than reality
in our market
• May not see it until next round of
large private cloud rollouts or new
facility construction (if even)
42. Network:
‣ InfiniBand for message passing
in decline
• Still see it for comp chem, modeling &
structure work; started building such
a system last week
• Still see it for parallel and clustered
storage
• Decline seems to match decreasing
popularity of MPI for latest generation
of informatics and ‘omics tools
‣ Hadoop / HDFS seems to favor
throughput and bandwidth over
latency
44. Storage
‣ Still the biggest expense, biggest headache and scariest
systems to design in modern life science informatics
environments
‣ Most of my slides for last year’s trends talk focused on
storage & data lifecycle issues
• Check http://slideshare.net/chrisdag/ if you want to see what I’ve said
in the past
• Dag accuracy check: It was great yesterday to see DataDirect talking
about the KVM hypervisor running on their storage shelves! I’m
convinced more and more apps will run directly on storage in the future
‣ ... not doing that this year. The core problems and common
approaches are largely unchanged and don’t need to be
restated
45. It’s 2013, we know what questions to ask of our storage
46. NGS new data generation: 6-month window
Data like this lets us make realistic capacity planning and purchase decisions
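(The slide's chart isn't reproduced here, but the arithmetic it enables is simple. A sketch with entirely hypothetical numbers; substitute the per-run output and cadence observed in your own 6-month window.)

    # Back-of-envelope capacity planning from measured instrument output.
    # Every number below is a hypothetical placeholder.
    runs_per_month = 6        # sequencer runs per month (observed)
    tb_per_run = 0.6          # data retained per run, in TB (observed)
    analysis_overhead = 1.5   # downstream/derived data multiplier
    upgrade_factor = 2.0      # expected jump after next instrument upgrade

    monthly_tb = runs_per_month * tb_per_run * analysis_overhead
    yearly_tb = monthly_tb * 12
    print("Current ingest: %.1f TB/month (%.1f TB/year)" % (monthly_tb, yearly_tb))
    print("Post-upgrade: %.1f TB/year" % (yearly_tb * upgrade_factor))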
47. Storage: 2013
‣ Advice: Stay on top of the
“compute nodes with
many disks” trends.
‣ If HDFS is suddenly required
by your scientists, it can be
painful to deploy in a
standard scale-out NAS
environment
49. Storage: 2013
Object Storage + Commodity Disk Pods
‣ Object storage is far more approachable
• ... used to see it in proprietary solutions for specific niche needs
• potentially on its way to the mainstream now
‣ Why?
• Benefits are compelling across a wide variety of interesting use cases
• Amazon S3 showed what a globe-spanning general purpose object
store could do; this is starting to convince developers & ISVs to modify
their software to support it
• www.swiftstack.com and others are making local object stores easy,
inexpensive and approachable on commodity gear
• Most of your Tier1 storage and server vendors have a fully supported
object store stack they can sell to you (or simply enable in a product
you already have deployed in-house)
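Part of why developers adopt the model so readily: the whole interface boils down to put/get/list/delete on buckets and keys. A minimal sketch with Python boto against S3 (bucket and object names are placeholders; many Swift deployments can expose an S3-compatible API via optional middleware, but verify that before assuming endpoint portability):

    # Object storage in a nutshell: no filesystem, just buckets and keys.
    # Assumes AWS credentials in the environment; names are placeholders.
    import boto

    conn = boto.connect_s3()
    bucket = conn.create_bucket("example-lab-archive")

    # PUT: store a result file as an object.
    key = bucket.new_key("run42/variants.vcf")
    key.set_contents_from_filename("variants.vcf")

    # GET: pull it back down anywhere credentials exist.
    key.get_contents_to_filename("variants-copy.vcf")

    # LIST: enumerate objects under a prefix.
    for k in bucket.list(prefix="run42/"):
        print(k.name, k.size)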
52. Storage: 2013
‣ There are MANY reasons why you should
not build that $12K Backblaze pod
• ... done wrong you will potentially inconvenience
researchers, lose critical scientific information and
(probably) lose your job
‣ Inexpensive or open source object storage
software makes the ultra-cheap storage
pod concept viable
53. Storage: 2013
‣ A single unit like this is risky and should only
be used for well known and scoped use cases.
Risks generally outweigh the disruptive price
advantage
‣ However ...
‣ What if you had 3+ of these units running an
object store stack with automatic triple
location replication, recovery and self-healing?
• Then things get interesting
• This is one of the ‘lab’ projects I hope to work on in ’13
54. Storage: 2013
‣ Caveat/Warning
• The 2013 editions of “Backblaze-like” enclosures mitigate
many of the earlier availability, operational and reliability
concerns
• Still an aggressive play that carries risk in exchange for a
disruptive price point
‣ There is a middle ground
• Lots of action in the ZFS space with safer & more mainstream
enclosures
• BioIT Homework: Visit the Silicon Mechanics booth and
check out what they are doing with Nexenta’s Open Storage
stuff.
58. Cloud: 2013
Core Advice
‣ Research Organizations need a cloud
strategy today
• Those that don’t will be bypassed by frustrated
users
‣ IaaS cloud services are only a departmental
credit card away ... and some senior
scientists are too big to be fired for violating
IT policy
59. Cloud Advice
Design Patterns
‣ You actually need three tested cloud design
patterns:
‣ (1) To handle ‘legacy’ scientific apps & workflows
‣ (2) The special stuff that is worth re-architecting
‣ (3) Hadoop & big data analytics
60. Cloud Advice
Legacy HPC on the Cloud
‣ MIT StarCluster
• http://web.mit.edu/star/cluster/
‣ This is your baseline
‣ Extend as needed
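For orientation, a skeletal ~/.starcluster/config along the lines of StarCluster's documented template; every value below (credentials, key, AMI, sizes) is a placeholder:

    [global]
    DEFAULT_TEMPLATE = smallcluster

    [aws info]
    AWS_ACCESS_KEY_ID = your_access_key
    AWS_SECRET_ACCESS_KEY = your_secret_key

    [key mykey]
    KEY_LOCATION = ~/.ssh/mykey.rsa

    [cluster smallcluster]
    KEYNAME = mykey
    CLUSTER_SIZE = 4
    NODE_IMAGE_ID = ami-12345678
    NODE_INSTANCE_TYPE = m1.large

From there, `starcluster start smallcluster` boots a ready-to-use cluster and `starcluster terminate smallcluster` tears it down.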
61. Cloud Advice
“Cloudy” HPC
‣ Some of our research workflows are important
enough to be rewritten for “the cloud” and the
advantages that a truly elastic & API-driven
infrastructure can deliver
‣ This is where you have the most freedom
‣ Many published best practices you can borrow
‣ Warning: Cloud vendor lock-in potential is
strongest here
62. Hadoop & “Big Data”
What you need to know
‣ “Hadoop” and “Big Data” are now general
terms
‣ You need to drill down to find out what people
actually mean
‣ We are still in the period where senior
leadership may demand “Hadoop” or “BigData”
capability without any actual business or
scientific need
63. Hadoop & “Big Data”
What you need to know
‣ In broad terms you can break “Big Data” down into two
very basic use cases:
1. Compute: Hadoop can be used as a very powerful
platform for the analysis of very large data sets. The
Google search term here is “map reduce”
2. Data Stores: Hadoop is driving the development of very
sophisticated “NoSQL” / non-relational databases and
data query engines. The Google search terms include
“nosql”, “couchdb”, “hive”, “pig”, “mongodb”, etc.
‣ Your job is to figure out which type applies for the
groups requesting “Hadoop” or “BigData” capability
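For the “Compute” case, the canonical illustration is a Hadoop Streaming job: the map and reduce steps are plain scripts that read stdin and write stdout. A minimal word-count sketch in Python (file names and the streaming jar location vary by distribution):

    # mapper.py - emit "word<TAB>1" for every word on stdin.
    import sys

    for line in sys.stdin:
        for word in line.split():
            print("%s\t%d" % (word, 1))

    # reducer.py - sum counts per word. Hadoop sorts by key before the
    # reduce phase, so all lines for a given word arrive consecutively.
    import sys

    current_word, current_count = None, 0
    for line in sys.stdin:
        word, count = line.rsplit("\t", 1)
        if word != current_word:
            if current_word is not None:
                print("%s\t%d" % (current_word, current_count))
            current_word, current_count = word, 0
        current_count += int(count)
    if current_word is not None:
        print("%s\t%d" % (current_word, current_count))

Submitted with roughly `hadoop jar hadoop-streaming.jar -file mapper.py -mapper mapper.py -file reducer.py -reducer reducer.py -input /data/in -output /data/out`.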
64. Cloud: 2013
What has changed ..
‣ Let’s revisit some of my bile from prior years
‣ “... private clouds: still utter crap”
‣ “... some AWS competitors are delusional
pretenders”
‣ “... AWS has a multi-year lead on the
competition”
65. Private Clouds in 2013:
‣ I’m no longer dismissing them as “utter crap”
‣ Usable & useful in certain situations
‣ BioTeam positive experiences with OpenStack
‣ Hype vs. Reality ratio still wacky
‣ Sensible only for certain shops
• Have you seen what you have to do
to your networks & gear?
‣ Still important to remain cynical and perform proper due diligence
66. Non-AWS IaaS in 2013
‣ Three main drivers for BioTeam’s evolving IaaS practices and thinking
for 2013:
‣ (1) Real world success with OpenStack & BT
‣ (2) Real world success with Google Compute
‣ (3) Real world multi-cloud DevOps
‣ Just to remain honest though:
• AWS still has a multi-year lead in product, service and features
• ... and many novel capabilities
• But some of the competition has some interesting benefits that AWS can’t match
67. BioTeam, BT & OpenStack
‣ We’ve been working with BT for a while now on
various projects
‣ BT Cloud uses OpenStack under the hood with some
really nice architecture and operational features
‣ BioTeam developed a Chef-based HPC clustering
stack and other tools that are currently being used by
BT customers
• ... some of whom have spoken openly at this meeting
68. BioTeam & Google Compute Engine
‣ We can’t even get into the preview program
‣ But one of our customers did
‣ ... and we’ve been able to do some successful and
interesting stuff
• Without changing operations or DevOps tools our client is capable of
running both on AWS and Google Compute
• For this client and a few other use cases we believe we can span both
clouds or construct architectures that would enable fast and relatively
friction-free transitions
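Our spanning work was done with Chef-based DevOps tooling, not the library below, but as an illustration of the multi-cloud idea: Apache libcloud wraps EC2 and Google Compute Engine behind one Python API, so the same provisioning logic can target either. Credentials, project and region values here are all placeholders.

    # One code path, two clouds via Apache libcloud.
    # All credentials, projects and regions below are placeholders.
    from libcloud.compute.types import Provider
    from libcloud.compute.providers import get_driver

    def connect(cloud):
        if cloud == "aws":
            cls = get_driver(Provider.EC2)
            return cls("ACCESS_KEY", "SECRET_KEY", region="us-east-1")
        cls = get_driver(Provider.GCE)
        return cls("svc-account@example-project.iam.gserviceaccount.com",
                   "/path/to/key.pem", project="example-project")

    # Everything past the connection is cloud-agnostic.
    for cloud in ("aws", "gce"):
        driver = connect(cloud)
        print(cloud, [node.name for node in driver.list_nodes()])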
69. Chef, AWS, OpenStack & Google
Wrapping up ...
‣ 2012 was the 1st year we did real work spanning multiple
IaaS cloud platforms or at least replicating workloads on
multiple platforms
‣ We’ve learned a lot - I think this may result in some
interesting talks at next year’s Bio-IT meeting
- By BioTeam and actual end-users
‣ What makes this all possible is the DevOps / Orchestration
stuff mentioned at the beginning of this presentation.