Talk slides from the annual "trends from the trenches" address at BioITWorld Expo. 2014 Edition.
### Email chris@bioteam.net if you'd like a PDF copy of this deck ###
2014 BioIT World - Trends from the trenches - Annual presentation
1. 1
Trends from the trenches: 2014
slideshare.net/chrisdag/ chris@bioteam.net @chris_dag #BioIT14
2. 2
I’m Chris.
I’m an infrastructure geek.
I work for the BioTeam.
3. Apologies in advance
3
If you have not heard me speak ...
‣ ‘Infamous’ for speaking
very fast and carrying a
huge slide deck
‣ In 2014 CHI finally gave
up and just gave me a
60min talk slot
‣ Aiming to end with
enough time for
questions & discussions
By the time you see this slide
I’ll be on my ~4th espresso
4. Who, What, Why ...
4
BioTeam
‣ Independent consulting shop
‣ Staffed by scientists forced to
learn IT, SW & HPC to get our
own research done
‣ 12+ years bridging the “gap”
between science, IT & high
performance computing
‣ Our wide-ranging work is what
gets us invited to speak at
events like this ...
5. 5
Why I do this talk every year ...
‣ Bioteam works for
everyone
• Pharma, Biotech, EDU,
Nonprofit, .Gov, etc.
‣ We get to see how
groups of smart people
approach similar
problems
‣ We can speak honestly &
objectively about what
we see “in the real
world”
6. Listen to me at your own risk
6
Standard Disclaimer
‣ I’m not an expert, pundit,
visionary or “thought leader”
‣ There are ~2000 smart people
at this event; I don’t presume to
speak for us as a whole
‣ All career success entirely due
to shamelessly copying what
actual smart people do
‣ I’m biased, burnt-out & cynical
‣ Filter my words accordingly
8. aka ‘spreading the blame ...’
8
What’s new 1: Acknowledgements
‣ This talk used to be
made in a vacuum
each year
• ... often mere minutes
before the scheduled talk
time
‣ Not this year
• Heavily influenced by
peer group of smarter
people who get chatty
when given beer
‣ Non-comprehensive
blame gang:
• Ari Berman
• Aaron Gardner
• Adam Kraut
• Chris Botka (Harvard)
• Chris Dwan (Broad)
• James Cuff (Harvard)
• ... many more ...
9. What has not changed in recent talks
Not new 2: Recycled Content
‣ The core Bio-IT ‘meta’
issue remains unchanged
‣ Minor updates to report
for cloud landscape
‣ Compute landscape
largely unchanged
• ... a few updates to share in
this space but nothing earth
shattering
9
11. 11
The #1 ‘meta issue’ is unchanged in 2014
12. 12
It’s a risky time to be doing Bio-IT
13. 13
Meta: Science evolving faster than IT
can refresh infrastructure & practices
14. This is what keeps Bio-IT folks up at night
The Central Problem Is ...
‣ Instrumentation & protocols are changing FAR
FASTER than we can refresh our Research-IT &
Scientific Computing infrastructure
• Bench science is changing month-to-month ...
• ... while our IT infrastructure only gets refreshed every
2-7 years
‣ Our job is to design systems TODAY that can
support unknown research requirements &
workflows over multi-year spans (gulp ...)
14
15. The Central Problem Is ...
‣ The easy period is over
‣ 5 years ago we could toss
inexpensive storage and
servers at the problem;
even in a nearby closet or
under a lab bench if
necessary
‣ That does not work any
more; real solutions
required
15
16. 16
This is our “new normal” for informatics
17. 17
The Central Problem Is ...
‣ Lab technology is being
refreshed, upgraded and
replaced at an
astonishing rate
• Bigger, faster, parallel
• Requiring increasingly
sophisticated IT support
• Cheap and easily obtainable
18. 18
The Central Problem Is ...
‣ ... and IT still being
caught by surprise in
2014
• Procurement practices and
cheaper instrument prices
result in situations where IT is
bypassed or not consulted in
advance
19. True Story - 48 Hours Ago
19
20. A conversation with a client
Just 48 hours ago ...
‣ Scientists tell IT that they
are getting a new PacBio
sequencing platform
• Gave IT a 5-node cluster
quote that PacBio provided
as blueprint for SMRT Portal
• Wanted confirmation that
everything was cool with IT
support
20
21. A conversation with a client
Just 48 hours ago ...
‣ Partial “Minor” Issue List:
• Scientists had no clue about power
requirements. A pair of 60amp 220v
power outlets = multi-month facility
project
• ... assumed IT would be cool
accepting and supporting a one-off
HPC system sized for 1 instrument &
1 workgroup
• ... also appeared to believe that
storage was infinite and free. At
least that is what their budget
assumed.
21
23. We can’t blame the science/lab side for everything
One more thing ...
‣ Can’t blame the lab-side for all our woes
‣ IT innovation is causing headaches in research
and program management
‣ Grant funding agencies, regulatory rules and
internal risk/program management practices
not updated to reflect current and emerging IT
capabilities, architectures & practices
• Rules & policies often simply do not cover what we are
capable of doing right now
23
25. This also hurts ...
‣ It has never been easier to
acquire vast amounts of data
cheaply and easily
‣ Growth rate of data creation/
ingest exceeds rate at which
the storage industry is
improving disk capacity
‣ Not just a storage lifecycle
problem. This data *moves*
and often needs to be shared
among multiple entities and
providers
• ... ideally without punching holes in
your firewall or consuming all
available internet bandwidth
25
26. The future is not looking pretty for the ill prepared
26
27. High Costs For Getting It Wrong
‣ Lost opportunity
‣ Missing capability
‣ Frustrated & very vocal scientific staff
‣ Problems in recruiting, retention,
publication & product development
27
32. 32
DevOps & Scriptable Everything
‣ On (real) clouds,
EVERYTHING has an
API
‣ If it’s got an API you can
automate and
orchestrate it
‣ “Scriptable datacenters” are now a very real thing (a minimal sketch follows below)
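Since this slide leans on the “if it has an API you can automate it” idea, here is a minimal sketch of what that looks like in practice using the boto library that was current at the time. The region is real; the AMI ID, key pair and security group names are placeholders, and credentials are assumed to come from the environment or ~/.boto.

```python
# Minimal "everything has an API" sketch: provision compute programmatically
# instead of clicking in a console. AMI ID, key pair and security group are
# placeholders; AWS credentials are assumed to be configured already.
import boto.ec2

conn = boto.ec2.connect_to_region("us-east-1")

reservation = conn.run_instances(
    "ami-00000000",                        # placeholder AMI
    min_count=1,
    max_count=4,                           # scale out by changing a number
    instance_type="m3.xlarge",
    key_name="my-keypair",                 # placeholder key pair
    security_groups=["analysis-workers"],  # placeholder security group
)

for instance in reservation.instances:
    print("Launched %s" % instance.id)
```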
33. 33
DevOps & Scriptable Everything
‣ Incredible innovation in
the past few years
‣ Driven mainly by
companies with
massive internet
‘fleets’ to manage
‣ ... but the benefits
trickle down to us
mere mortals
34. 34
DevOps will conquer the enterprise
‣ Over the past few years
cloud automation/
orchestration methods
have been trickling
down into our local
infrastructures
‣ This will have
significant impact on
careers, job
descriptions and org
charts
35. 2014: Continue to blur the lines between all these roles
35
Scientist/SysAdmin/Programmer
‣ Radical change in how IT is
provisioned, delivered,
managed & supported
• Technology Driver:
Virtualization & Cloud
• Ops Driver:
Configuration Mgmt, Systems
Orchestration & Infrastructure
Automation
‣ SysAdmins & IT staff need to
re-skill and retrain to stay
relevant
www.opscode.com
36. 2014: Continue to blur the lines between all these roles
36
Scientist/SysAdmin/Programmer
‣ When everything has an
API ...
‣ ... anything can be
‘orchestrated’ or
‘automated’ remotely
‣ And by the way ...
‣ The APIs (‘knobs &
buttons’) are accessible to
all, not just the expert
practitioners sitting in that
room next to the
datacenter
37. 2014: Continue to blur the lines between all these roles
37
Scientist/SysAdmin/Programmer
‣ IT jobs, roles and
responsibilities are
changing
‣ SysAdmins must learn to
program in order to
harness automation tools
‣ Programmers &
Scientists can now self-
provision and control
sophisticated IT
resources
38. 2014: Continue to blur the lines between all these roles
38
Scientist/SysAdmin/Programmer
‣ My take on the future ...
• SysAdmins (Windows & Linux) who
can’t code will have career issues
• Far more control is going into the
hands of the research end user
• IT support roles will radically change
-- no longer owners or gatekeepers
‣ IT will “own” policies,
procedures, reference patterns,
identity mgmt, security & best
practices
‣ Research will control the
“what”, “when” and “how big”
39. 2014 Summary
Trend: DevOps & Automation
‣ Almost every HPC project (all sizes) BioTeam worked
on in 2014 included
• A bare-metal OS provisioning service (Cobbler, etc.)
• A ‘next-gen’ configuration management service (Chef, Puppet,
Saltstack, etc.)
‣ Gut feeling: This is going to be very useful for
regulated environments
• Not BS or empty hype: IT infrastructure and server/OS/service
configuration encoded as text files
• Easy to version control, audit, revert, rebuild, verify and fold into existing change management & documentation systems (see the toy drift-check sketch below)
39
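As a deliberately toy illustration of why “configuration encoded as text files” matters for auditability: the sketch below is not Chef, Puppet or Salt, just a stand-in that loads a version-controlled desired-state file and reports drift on a node. The desired.json format and the RPM-based package check are assumptions made for this example.

```python
# Toy "configuration as text" sketch: compare a version-controlled desired-state
# file against what is actually installed on this node and report drift.
# The desired.json format and the rpm-based check are assumptions for this example.
import json
import os
import subprocess

DEVNULL = open(os.devnull, "w")

def installed(package):
    # RHEL/CentOS assumption: ask the RPM database whether the package exists
    return subprocess.call(["rpm", "-q", package], stdout=DEVNULL, stderr=DEVNULL) == 0

# e.g. desired.json: {"packages": ["openmpi", "environment-modules", "gridengine-execd"]}
with open("desired.json") as fh:
    desired = json.load(fh)

drift = [pkg for pkg in desired["packages"] if not installed(pkg)]
if drift:
    print("Drift from desired state, missing: %s" % ", ".join(drift))
else:
    print("Node matches the configuration recorded in version control.")
```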
41. Compute related design patterns largely static
41
Core Compute
‣ Linux compute clusters
are still the baseline
compute platform
‣ Even our lab instruments know how to submit jobs to common HPC cluster schedulers (see the DRMAA sketch below)
‣ Compute is not hard. It’s a
commodity that is easy to
acquire & deploy in 2014
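A hedged sketch of the “instruments submit jobs to the scheduler” point, using the drmaa-python bindings against a DRMAA-capable scheduler such as Grid Engine. The pipeline script and input path are placeholders, and it assumes the site’s DRMAA library is installed and configured.

```python
# Sketch: submit a job to a DRMAA-capable scheduler (e.g. Grid Engine) from
# Python, the same mechanism an instrument-side pipeline can use.
# Requires drmaa-python plus a scheduler DRMAA library; paths are placeholders.
import drmaa

session = drmaa.Session()
session.initialize()

template = session.createJobTemplate()
template.remoteCommand = "/opt/pipeline/run_analysis.sh"  # placeholder pipeline script
template.args = ["/data/run_0423"]                        # placeholder input directory
template.joinFiles = True                                 # merge stdout and stderr

job_id = session.runJob(template)
print("Submitted job %s" % job_id)

info = session.wait(job_id, drmaa.Session.TIMEOUT_WAIT_FOREVER)
print("Job %s finished with exit status %s" % (info.jobId, info.exitStatus))

session.deleteJobTemplate(template)
session.exit()
```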
42. Defensive hedge against Big Data / HDFS
42
Compute: Local Disk Matters
‣ This slide is from 2013; trend is
continuing
‣ The “new normal” may be 4U enclosures with room for many local disk spindles - not populated on day one, just available
‣ Why? Hadoop & Big Data
‣ This is a defensive hedge against future
HDFS or similar requirements
• Remember the ‘meta’ problem - science is
changing far faster than we can refresh IT. This
is a defensive future-proofing play.
‣ Hardcore Hadoop rigs sometimes
operate at 1:1 ratio between core count
and disk count
43. Faster networks are driving compute config changes
43
Compute: NICs and Disks
‣ One pain point for me in 2013-2014:
• Network links to my nodes are getting
faster
• It’s embarrassing my disks are slower
than the network feeding them
• Need to be careful about selecting and
configuring high speed NICs
- Example: that dual-port 10Gig card may
not actually be able to drive both ports if
the card was engineered for an
active:passive link failover scenario
• Also need to re-visit local disk configurations (a quick sanity check is sketched below)
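For the “my disks are slower than the network feeding them” worry, a crude sanity check: time a large sequential write and compare it with 10GbE line rate (~1.25 GB/s). Real benchmarking should use tools like fio and iperf; the scratch path and file size here are placeholders.

```python
# Crude check: can local disk absorb what a 10GbE NIC can deliver (~1.25 GB/s)?
# Not a substitute for fio/iperf; path and test size are placeholders.
import os
import time

TEST_FILE = "/scratch/throughput_test.bin"  # placeholder scratch path
BLOCK = b"\0" * (4 * 1024 * 1024)           # 4 MiB writes
TOTAL_BYTES = 8 * 1024 ** 3                 # 8 GiB test file

start = time.time()
with open(TEST_FILE, "wb") as fh:
    written = 0
    while written < TOTAL_BYTES:
        fh.write(BLOCK)
        written += len(BLOCK)
    fh.flush()
    os.fsync(fh.fileno())                   # make sure data actually reached disk
elapsed = time.time() - start
os.remove(TEST_FILE)

print("Sequential write: %.2f GB/s (10GbE line rate is ~1.25 GB/s)"
      % ((TOTAL_BYTES / 1e9) / elapsed))
```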
44. New and refreshed HPC systems running many node types
44
Compute: Huge trend in ‘diversity’
‣ Accelerated trend since at least 2012 ...
• HPC compute resources are no longer homogeneous; many types and flavors now deployed in single HPC stacks
‣ Newer clusters mix and match node types to fit the known use cases:
• GPU nodes for compute
• GPU nodes for visualization
• Large memory nodes (512GB +)
• Very Large memory nodes (1TB +)
• ‘Fat’ nodes with many CPU cores
• ‘Thin’ nodes with super-fast CPUs
• Analytic nodes with SSD, FusionIO, flash or large local
disk for ‘big data’ tasks
45. GPUs, Coprocessors & FPGAs
45
Compute: Hardware Acceleration
‣ Specialized hardware
acceleration has its place
but will not take over the
world
• “... the activation energy required
for a scientist to use this stuff is
generally quite high ...”
‣ GPU, Phi and FPGA best
used in large scale pipelines
or as specific solution to a
singular pain point
46. Compute: Big Data & Analytics
‣ BioTeam is starting to build
“Big Data” labs and
environments for clients
‣ The most interesting trend:
• We are not designing for specific
analytic use cases; in most projects
we are adding in basic “capabilities”
with the expectation that the apps
and users will come later
• ... defensive IT hedge against
rapidly changing science
requirements, remember?
46
47. Compute: Big Data & Analytics
‣ This translates to infrastructure designed
to support certain capabilities rather than
specific software or applications.
‣ Example:
• Beefy HDFS friendly servers
• 100% bare metal provisioning and dynamic
system reconfiguration
• Systems for ingest
• Very large RAM systems
• Big PCIx bus systems
• Memory-resident database systems
• Mix of very fast and capacity optimized storage
• Very fast core, top-of-rack and server networking
47
48. Also known as hybrid clouds
Emerging Trend: Hybrid HPC
‣ No longer “utter crap” or “cynical
vendor-supported reference case”
• small local footprint
• large, dynamic, scalable, orchestrated
public cloud component
‣ DevOps is key to making this work
‣ High-speed network to public cloud
required
‣ Software interface layer acting as the
mediator between local and public
resources
‣ Good for tight budgets, has to be
done right to work
‣ Still best approached very carefully
48
49. BioIT World Homework
‣ We’ve got interesting hardware vendors on the
show floor this week; check them out
• Silicon Mechanics, Thinkmate, Microway:
cool commodity
• Intel, IBM, Dell, SGI: Large & enterprise
• Timelogic: hardware acceleration
• ...
49
52. 52
Network: Speed @ Core and Edge
‣ Huge potential pain point
‣ May surpass storage as our #1
infrastructure headache
‣ Petascale data is useless if you
can’t move it or access it fast
enough
‣ Don’t be smug about 10 Gigabit
- folks need to start thinking
*now* about 40 and even 100
Gigabit Ethernet
‣ You may need 10Gig to some
desktops for data ingest/export
53. 53
Network: Speed @ Core and Edge
‣ Remember ~2004 when
research storage
requirements started to dwarf
what the enterprise was
using?
‣ Same thing is happening now
for networking
‣ Research core, edge and top-
of-rack networking speeds
may exceed what the rest of
the organization has
standardized on
54. Massive data movement needs are driving innovation pain
This is going to be painful
‣ Enterprise networking folks
are even more aloof than
storage admins we battled in
’04
‣ Often used to driving
requirements and methods;
unhappy when science starts
to drive them out of their
comfort zones
‣ Research needs to start
pushing harder and faster for
network speeds above 10GbE
• This will take a long time so start
now!
54
55. Not sure how this will play out
‣ It will be interesting to see what large-scale data
movement does to our local infrastructure and
desktop experience
‣ Especially with other trends like BYOD
‣ My $.02
• Speeds to our desktops are going to get very fast, or
• We give up on growing massive bandwidth to the client
and embrace a full VDI model where the users just
“remote desktop” into a well-networked scientific
informatics environment
55
56. BioIT World Homework
‣ Visit the Internet2 booth to chat high speed
networking
• Ask about their free or low-cost training events and
technical workshops; start thinking about how you can
get your internal networking teams/leadership to attend
• Ask them about the new trend of private/corporate links
into I2 and other fast research networks
‣ Arista is here. Talking and exhibiting. They are
not Cisco. Listen, visit & talk to them.
56
58. It’s real and becoming necessary
Network: ‘ScienceDMZ’
‣ BioTeam building them in 2014 and beyond
‣ Central premise:
• Legacy firewall, network and security methods
architected for “many small data flows” use cases
• Not built to handle smaller #s of massive
data flows
• Also very hard to deploy ‘traditional’ security gear
on 10Gigabit and faster networks
‣ More details, background & documents at
http://fasterdata.es.net/science-dmz/
58
[Diagram: DTN traffic with wire-speed bursts vs. background traffic and competing bursts, carried over 10GE links]
59. Network: ‘ScienceDMZ’
‣ Start thinking/discussing this sooner rather
than later
‣ Just like “the cloud” this may fundamentally
change internal operations and technology
‣ Will also require conscious buy-in and
support from senior network, security and
risk management professionals
• ... these talks take time. Best to plan ahead
59
60. Network: ‘ScienceDMZ’
‣ A Science DMZ has 3 required components:
1. Very fast “low-friction” network links and paths with
security policy and enforcement specific to scientific
workflows
2. Dedicated, high performance data transfer nodes
(“DTNs”) highly optimized for high speed data xfer
3. Dedicated network performance/measurement nodes
60
61. Network: ‘ScienceDMZ’
‣ Implementation specifics are complex; the
basic concept is not:
1. The research need to move scientific data at high speeds
is already being negatively affected by networks not
designed for this requirement
2. Likely to force fundamental changes in core enterprise
architectures on a similar disruptive scale as what
genome data storage forced in ~2004
3. Firewalls/IDS and security in particular will be affected
61
62. 62
Simple Science DMZ:
Image source: “The Science DMZ: Introduction & Architecture” -- esnet
63. Network: ‘ScienceDMZ’
‣ My gut feeling:
1. The fanciest and most complex Science DMZ architectures in the literature right
now are not suitable for our world
• Expensive specialized equipment; Expensive specialist staff expertise required
• Often still experimental, not something enterprise IT would want to drop into a
production environment
2. Science DMZ concepts are sound and simple implementations are possible today
3. Start small:
• Incorporate these sorts of concepts/ideas into long term planning ASAP
• Start adding network performance monitoring nodes to research networks, DMZs and external circuit connections now; this entire concept falls over without actionable flow and performance data (a minimal monitoring sketch follows after this list)
• Start work on policies and procedures for manual bypass of firewall/IDS rules when
known sender/receivers are freighting high speed data; automation comes later!
63
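On the “this falls over without actionable performance data” point: perfSONAR is the canonical tooling, but as a minimal stand-in the sketch below drives iperf3 against a remote data transfer node and logs the measured throughput. The host name is a placeholder and the JSON field names reflect my understanding of iperf3’s -J output.

```python
# Poor man's Science DMZ performance probe: run iperf3 against a remote DTN
# and log throughput. perfSONAR is the real answer; hostname is a placeholder
# and the JSON fields reflect iperf3's -J output as I understand it.
import json
import subprocess
import time

REMOTE_DTN = "dtn01.example.org"  # placeholder data transfer node

raw = subprocess.check_output(["iperf3", "-c", REMOTE_DTN, "-J", "-t", "10"])
result = json.loads(raw)

gbps = result["end"]["sum_received"]["bits_per_second"] / 1e9
print("%s: %.2f Gbit/s to %s" % (time.ctime(), gbps, REMOTE_DTN))
```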
64. BioIT World Homework
‣ Bookmark http://fasterdata.es.net and check
out the published materials and advice
‣ Monitor http://www.oinworkshop.com/ to see
when a workshop/event may be coming near
you (send your networking people ...)
‣ Both ESNet and Internet2 run training and
technical workshops that deliver far more value
for price than the usual training junkets
64
65. Check out this talk
BioIT World Homework
‣ Track 1 - 3:10pm today:
• Christian Todorov talks “Accelerating Biomedical
Research Discovery: The 100G Internet2 Network – Built
and Engineered for the Most Demanding Big Data
Science Collaborations”
65
66. Not very significant trend in 2014:
Software Defined Networking (“SDN”)
66
67. More hype than useful reality at the moment
67
Network: SDN Hype vs. Reality
‣ Software Defined Networking (“SDN”) is
the new buzzword
‣ It WILL become pervasive and will
change how we build and architect things
‣ But ...
‣ Not hugely practical at the moment for
most environments
• We need far more than APIs that control port
forwarding behavior on switches
• More time needed for all of the related
technologies and methods to coalesce into
something broadly useful and usable
68. More hype than useful reality at the moment
68
Network: SDN
‣ My gut feeling:
• It is the future but right now we are still in the
“mostly empty hype” phase if you wanna be
cynical about it; best to wait and watch
• Production enterprise use: OpenFlow
and similar stuff does not provide value
relative to implementation effort right now
• Best bang for the buck in ’14 will be getting
‘SDN’ features as part of some other
supported stack
- OpenStack, VMWare, Cloud, etc.
70. 70
Storage
‣ Still the biggest expense, biggest headache and
scariest systems to design in modern life science
informatics environments
‣ Many of the pain points we’ve talked about for years
are still in place:
• Explosive growth forcing us to trade performance for capacity
• Lots of monolithic single tiers of storage
• Critical need to actively manage data through its full life cycle
(just storing data is not enough ...)
• Need for post-POSIX solutions such as iRODS and other
metadata-aware data repositories
71. 71
Storage Trends
‣ The large but monolithic storage platforms we’ve
built up over the years are no longer sufficient
• Do you know how many people are running a single large
scale-out NAS or parallel filesystem? Most of us!
‣ Tiered storage is now an absolute requirement
• At a minimum we need an active storage tier plus
something far cheaper/deeper for cold files
‣ Expect the tiers to involve multiple vendors,
products and technologies
• The Tier1 storage vendors tend to have higher-end pricing
for their “all in one” tiered data management solutions
72. 72
Storage - The Old Way
‣ Single tier of scale-out NAS or parallel FS
‣ Why?
• Suitable for broadest set of use cases
• Easy to procure/integrate
• Lowest administrative & operational burden
‣ Example:
• 400TB - 1PB of ‘something’ stores ‘everything’
73. 73
Storage - The New Way
‣ Multiple tiers; potentially from multiple vendors
‣ Why?
• Way more cost efficient (size the tier to the need)
• Single tier no longer capable of supporting all use cases and
workflow patterns
• Single tiers waste incredible money at large scale
‣ Example:
• 10-40 TB SSD/Flash for ingest & IOPS-sensitive workloads
• 50-400 TB tier (SATA/SAS/SSD mix) for active processing
• Multi-petabyte tier (Cloud, Object, SATA) for cost and operationally
efficient long term (yet reachable) storage of scientific data at rest
74. Sticking 100% with Tier 1 vendors gets expensive
74
Storage: Disruptive stuff ahead
‣ BioTeam has built 1Petabyte ZFS-based storage pools from
commodity whitebox kit for about ~$100,000 in direct hardware
costs (engineering effort & admin not included in this price ...)
‣ There are many storage vendors in the middle tier who can
provide storage systems that are less ‘risky’ than DIY
homebuilt setups yet far less expensive than the traditional
Tier 1 enterprise storage options
• Several of these vendors are here at the show!
‣ Companies like Avere Systems are producing boxes that unify
disparate storage tiers and link them to cloud and object
stores
• This is a route to unifying “tier 1” storage with the “cheap & deep” storage
75. Infinidat aka http://izbox.com
The new thumper.
‣ 1 petabyte usable NAS
shipped as a single
integrated rack
• List price: $500 per usable
terabyte
‣ More expensive than DIY
ZFS on commodity
chassis but less
expensive than current
mainstream products
‣ Lots of interesting use
cases for ‘cheap & deep’
75
76. Avere Systems
Wait, I can DO that?
‣ These folks caught my eye in late 2013 for
one very specific use case
‣ Since then I keep them in mind for 4-5
common problems I regularly face
‣ It can:
• Add performance layer on top of storage bought
to be “cheap & deep”
• Virtualize many NAS islands into a single
namespace
• Replicate & move data between tiers and sites
• Act as CIFS/NFS gateway to on-premise or
offsite object stores ***
• Treat Amazon S3 and Glacier as simply another
storage tier fully integrated into your environment
76
77. Object Storage
‣ Object storage is the future for scientific data at rest
• Total no brainer; makes more sense than the “files and
folders” paradigm, especially for automated analysis
• Plus Amazon does it for super cheap
‣ But ... There will be a long transition period due to all
of our legacy codes and workflows
• This is where gateway devices can play
‣ It can:
• Provide a much better workflow design pattern than
assuming “files and folders” data storage
• Save millions of dollars via efficiencies of erasure coding
• Provide a much more robust and resilient peta-scale storage
framework
• Hide behind a metadata-aware layer such as iRODS to provide very interesting capabilities (see the small S3 metadata sketch below)
77
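To make the “objects plus metadata instead of files and folders” idea concrete, the sketch below pushes a finished result file to Amazon S3 with searchable metadata attached, using the boto library of the era. Bucket, object key and metadata values are placeholders, and credentials are assumed to be configured in the environment.

```python
# Sketch: scientific data at rest as objects with metadata, not files in folders.
# Bucket name, object key and metadata values are placeholders; AWS credentials
# are assumed to be configured in the environment.
import boto
from boto.s3.key import Key

conn = boto.connect_s3()
bucket = conn.get_bucket("my-archive-bucket")        # placeholder bucket

obj = Key(bucket)
obj.key = "runs/2014/run_0423/variants.vcf.gz"       # placeholder object name
obj.set_metadata("project", "exome-pilot")           # metadata travels with the object
obj.set_metadata("instrument", "hiseq-01")
obj.set_contents_from_filename("/data/run_0423/variants.vcf.gz")

print("Stored %s" % obj.key)
```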
78. Object Storage
‣ Erasure coding distributed
object stores are very interesting
at peta-scale ...
‣ Think about how you would
handle & replicate 20 petabytes
of data the “traditional way”
• Purchase 2x or 3x storage capacity to
handle replication overhead
• Ignore the nightmare scenario of
having to restore from one of the
distributed replicas
78
79. Object Storage
‣ Efficiencies of erasure coding allow
for LESS raw disk to be distributed
across MORE geographic sites
‣ End result is a “single” usable
system that is tolerant to the failure
of an entire datacenter/site
‣ For the 20 petabyte problem, instead of purchasing 2x disk you buy ~1.8x and use the capex savings to add an extra colo facility or increase WAN link speed (the arithmetic is sketched below)
79
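The arithmetic behind the “buy ~1.8x instead of 2x” claim, as a small worked example. The 20 PB usable figure comes from the slide; the specific code parameters (10 data + 8 coded fragments spread across 3 sites) are just one plausible configuration that yields 1.8x raw overhead while surviving the loss of an entire site.

```python
# Worked example of the raw-capacity arithmetic on the previous slide.
# k=10 data + m=8 coded fragments across 3 sites is one plausible layout
# giving ~1.8x overhead; it is not any specific vendor's scheme.
USABLE_PB = 20.0

# Traditional approach: whole replicas
for replicas in (2, 3):
    print("%dx replication: %.0f PB raw" % (replicas, USABLE_PB * replicas))

# Erasure coding: any k of the k+m fragments reconstruct the data
k, m, sites = 10, 8, 3
overhead = (k + m) / float(k)                      # 1.8x
per_site = (k + m) // sites                        # 6 fragments per site
survives_site_loss = (k + m) - per_site >= k       # lose a site, still have >= k
print("%d+%d erasure code: %.1fx overhead, %.0f PB raw, survives full site loss: %s"
      % (k, m, overhead, USABLE_PB * overhead, survives_site_loss))
```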
80. Exercise
BioIT World Homework
‣ Pick a storage size that makes sense for you (100TB or
1PB suggested)
‣ Visit the various storage vendors on the show floor and
price out what 100TB or 1PB would cost
‣ You will see an awesome diversity of products,
performance, features and capabilities at various price
points
• DO NOT fixate on price alone. This is a mistake.
‣ This is REALLY worth doing - there is incredible
diversity in the mix of price/features/performance/
capability out there
80
81. Check out these booths
BioIT World Homework
‣ Object storage:
• Amplidata & CleverSafe
‣ Glue/Gateway/Acceleration:
• Avere Systems
‣ Enterprise:
• EMC Isilon, IBM, Dell, SGI, Hitachi, Panasas
‣ Mid-tier/Commodity:
• Silicon Mechanics, Thinkmate, RAID Inc., Xyratex
81
82. Check out these talks
BioIT World Homework
‣ Track 5 - noon today:
• Aaron Gardner talks “Taming big scientific data growth with
converged infrastructure”
‣ Track 1 - 2:55pm today:
• Jacob Farmer talks “Bridging the Worlds of Files, Objects,
NAS, and Cloud: A Blazing Fast Crash Course in Object
Storage”
‣ Track 1 - 4:30pm today:
• Dirk Petersen talks “Deploying Very Low Cost Cloud Storage Technology in a Traditional Research HPC Environment”
82
83. 83
Can you do a Bio-IT talk without using the ‘C’ word?
84. 84
Cloud: 2014
‣ Core advice remains the same
‣ A few new permutations ...
85. Core Advice
85
Cloud: 2014
‣ Research organizations need a cloud strategy today (really, yesterday)
• Those that don’t will be bypassed by frustrated
users or sneaky “cloud aware” devices
‣ IaaS cloud services are only a departmental
credit card away ... some senior scientists
are too big to be fired for violating IT policy
‣ Instrument vendors are forcing the issue
‣ Storage vendors are forcing the issue
86. Design Patterns
86
Cloud Advice
‣ We actually need several tested cloud
design patterns:
‣ (1) To handle ‘legacy’ scientific apps & workflows
‣ (2) The special stuff that is worth re-architecting
‣ (3) Hadoop & big data analytics
‣ ... and maybe (4) Regulated/sensitive efforts...
‣ ...and maybe (5) a way to evaluate Commercial
solutions
87. Legacy HPC on the Cloud
87
Cloud Advice
‣ MIT StarCluster
• http://star.mit.edu/cluster/
• This is your baseline
• Extend as needed
‣ Also check out Univa
• Commercially supported Grid Engine
stack with compelling roadmap and
native cloud capabilities
88. “Cloudy” HPC
88
Cloud Advice
‣ Some of our research workflows are important
enough to be rewritten for “the cloud” and the
advantages that a truly elastic & API-driven
infrastructure can deliver
‣ This is where you have the most freedom
‣ Many published best practices you can borrow
‣ Warning: Cloud vendor lock-in potential is
strongest here
89. What has changed ..
Cloud: 2014
‣ Let’s revisit some of my bile from prior years
‣ “... private clouds: still utter crap”
‣ “... some AWS competitors are delusional
pretenders”
‣ “... AWS has a multi-year lead on the
competition”
89
90. Private Clouds in 2014:
‣ I’m no longer dismissing them as “utter crap”
• However it is a lot of work and money to build a system that only has 5% of the
features that AWS can deliver today (for a cheaper price). Need to be careful
about the use case, justification and operational/development burden.
‣ Usable & useful in certain situations
‣ BioTeam positive experiences with OpenStack
‣ Starting to see OpenStack pilots among our clients
‣ Hype vs. Reality ratio still wacky
‣ Sensible only for certain shops
• Have you seen what you have to do
to your networks & gear?
‣ Still important to remain cynical and perform proper due diligence
91. Not all AWS competitors are delusional
‣ Google Compute is viable in 2014 for scientific workflows
• Compute/Memory: Late start into IaaS means CPUs and memory are current generation; we have ‘war stories’ from AWS users who probe /proc/cpuinfo on EC2 servers so they can instantly kill any instance running on older chipsets (a hypothetical version of that check is sketched below)
• Price: Competitive on price although the shooting war between IaaS providers means it is hard to
pin down the current “winner”; The “sustained use” pricing is easier to navigate than AWS Reserved
Instances. Overall AWS pricing algorithms for various services seem more complicated than Google
equivalents.
• Network performance: Fantastic networking and excellent performance/latency figures between
regions and zones. VPC type features are baked into the default resource set
• Ops: Priced in 1min increments; no more need to hunt and kill servers at 55 min past the hour.
Google has a concept of “Projects” with assigned collaborators and quotas. Quite different from the
AWS account structure and IAM-based access control model. Project-based paradigm easier to
think about for scientific use case.
• IaaS Building Blocks: Still far fewer features than AWS but the core building blocks that we need
for science and engineering workflows are present.
‣ My $.02
• AWS is still the clear leader but Google Compute is now a viable option and it is worth ‘kicking the
tires’ in 2014 and beyond ... to me AWS has had no serious competition until now
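The /proc/cpuinfo “war story” above is easy to picture; here is a hypothetical version of that instance-side check. The list of acceptable CPU model strings is invented purely for illustration.

```python
# Hypothetical version of the "probe /proc/cpuinfo and relaunch" trick some AWS
# users reportedly use to avoid older chipsets. The acceptable model strings
# below are invented for illustration only.
ACCEPTABLE = ("E5-2670", "E5-2680")   # placeholder "new enough" CPU models

def cpu_model():
    with open("/proc/cpuinfo") as fh:
        for line in fh:
            if line.startswith("model name"):
                return line.split(":", 1)[1].strip()
    return "unknown"

model = cpu_model()
if any(tag in model for tag in ACCEPTABLE):
    print("Acceptable CPU: %s" % model)
else:
    # In the war story, this is where the instance terminates itself so the
    # next launch can hopefully land on newer hardware.
    print("Undesirable CPU (%s): terminate and relaunch." % model)
```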
92. Cloud Science Facilitators
‣ Cycle Computing is legit
• They’ve proven themselves
on some of the largest IaaS HPC
grids ever built
• Experience with hybrid
systems (cloud & premise)
‣ Smart people. Nice
people.
‣ They have a booth, stop
by and chat them up ...
94. This has been a slow moving trend for years now ...
94
POSIX Alternatives Coming
‣ The scope of organizations faced with
the limitations of POSIX filesystems will
continue to expand
‣ We desperately need some sort of
“metadata aware” data management
solution in life science
‣ Nobody has an easy solution yet;
several bespoke installations but no
clear mass-market options
‣ iRODS front-ending “cheap & deep” storage tiers or object stores appears to be gaining significant interest out in our community (a minimal sketch follows below)
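A minimal sketch of what “iRODS front-ending cheap & deep storage” looks like from a workflow’s point of view: register a file and attach queryable metadata using the standard icommands. It assumes the icommands are installed and `iinit` has already been run; the paths and attribute names are placeholders.

```python
# Sketch: register a result file into iRODS and attach attribute/value metadata
# so it can later be found by query rather than by directory convention.
# Assumes the icommands are on PATH and iinit has been run; paths are placeholders.
import subprocess

local_file = "/data/run_0423/variants.vcf.gz"               # placeholder local path
logical_path = "/tempZone/home/rods/runs/variants.vcf.gz"   # placeholder iRODS path

subprocess.check_call(["iput", "-f", local_file, logical_path])

for attribute, value in [("project", "exome-pilot"), ("instrument", "hiseq-01")]:
    subprocess.check_call(["imeta", "add", "-d", logical_path, attribute, value])
```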
95. Application Containers are getting interesting
95
Watch out for: Containerization
‣ Application containerization via methods like
http://docker.io gaining significant attention
• Docker support now in native RHEL kernel
• AWS Elastic Beanstalk recently added Docker
support
‣ If broadly adopted, these techniques will
stretch research IT infrastructures in
interesting directions
• This is far more interesting to me than moving virtual machines around a network or into the cloud (a minimal ‘docker run’ sketch follows below)
‣ ... with a related impact on storage location,
features & capability
‣ Major new news and progress expected in
2014
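A minimal ‘docker run’ sketch of why containerized applications interest research IT: the whole tool chain ships as an image and runs wherever the engine runs. This simply wraps the docker CLI from Python; the image name, tool command and data path are all hypothetical.

```python
# Sketch: run a containerized analysis tool against a local data directory by
# wrapping the docker CLI. Image name, command and data path are hypothetical.
import subprocess

IMAGE = "biotools/aligner:1.0"   # hypothetical image
DATA_DIR = "/data/run_0423"      # placeholder host directory with input data

subprocess.check_call([
    "docker", "run", "--rm",            # discard the container when it exits
    "-v", "%s:/work" % DATA_DIR,        # bind-mount the data into the container
    IMAGE,
    "align", "--input", "/work/reads.fq", "--output", "/work/aligned.bam",
])
```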
96. 96
Keep an eye on: Storage
‣ Data generation out-pacing
technology
‣ Really interesting disruptive
stuff on the market now
‣ Cheap/easy laboratory
assays taking over
• Researchers largely don’t know
what to do with it all
• Holding on to the data until
someone figures it out
• This will cause some interesting
headaches for IT
• Huge need for real “Big Data”
applications to be developed
97. 97
Keep an eye on: Networking
‣ Unless there’s an investment in ultra-high-speed networking, we need to rethink how and where analysis happens
‣ Data commons are becoming
a precedent
• Need to minimize the movement of
data
• Include compute power and
analysis platform with data
commons
‣ Move the analysis to the data,
don’t move the data
• Requires sharing/Large core
institutional resources
98. 98
Long term trends ...
‣ Compute continues to become easier
‣ Data movement and ingest (physical & network)
gets harder
‣ Cost of storage will be dwarfed by “cost of
managing stored data”
‣ We can see end-of-life for our current IT
architecture and design patterns; new patterns
will start to appear over next 2-5 years
100. Embrace The Innovation
100
Ending Advice: 1 of 5
‣ Understand the ‘interesting times’ we are in
• Science is changing faster than we can refresh IT
• This is not going to change any time soon
‣ Advice:
• Spend as much time thinking about future flexibility as
you spend on actual current needs & requirements
• Design for agility & responsiveness
101. Capacity
101
Ending Advice: 2 of 5
‣ Many of us will need ‘petabyte capable’ storage
‣ However:
• Only some of us will ever have 1PB+ under management
• The hard part is knowing who that will be
102. Tiers are in your future
102
Ending Advice: 3 of 5
‣ Tiers are now a requirement, at least long-term
• At a minimum we need an ‘active’ tier for processing &
ingest
• ... and some sort of inexpensive cold/nearline/archive
option as well
‣ Advice:
• It’s OK to buy a single block/tier of disk
• ... but always have a strategy for diversification
103. 103
Ending Advice: 4 of 5
‣ Above a certain scale, inefficient data management
& simple storage practices are hugely wasteful
‣ Advice:
• A new “data manager” or curator hire may be cheaper and far more beneficial to your organization than continuing to throw CapEx dollars at keeping a badly run storage platform under its capacity limit
• Many opportunities to get clever & recapture efficiency &
capability: tiers, replication, cloud, dedupe, CRAM
compression, iRODS
• BROADEN YOUR PERSPECTIVE
104. 104
Ending Advice: 5 of 5
‣ You need a cloud strategy. Yesterday.
- Users, instrument makers & IT vendors are forcing the issue
- Economic trends indicate cloud storage is inescapable
- 90% of cloud is “easy”. Remaining 10% takes time & effort
‣ Advice:
• The technical aspects of using “the cloud” are trivial
• The political, policy and risk management aspects are
difficult and time consuming; start these ASAP