October 2013 "Beyond the Genome" presentation slides. Talk is mostly focused on issues around IaaS cloud usage for "Bio-IT" and life science informatics & scientific computing.
PDF slides available directly: please email chris@bioteam.net for a copy.
2014 BioIT World - Trends from the trenches - Annual presentation - Chris Dagdigian
Talk slides from the annual "trends from the trenches" address at BioITWorld Expo. 2014 Edition.
### Email chris@bioteam.net if you'd like a PDF copy of this deck ###
This is a very short slide deck I did for a 10-minute slot on a http://pistoiaalliance.org/ webinar. The slides do not fully cover what I intend to talk about so if the webinar is recorded and available afterwards I'll update this description with the recording URL.
PDF copy of the slides available upon request ("chris@bioteam.net")
Talk slides from my annual address at the Bio-IT World Expo & Conference where I cover trends, best practices and emerging pain points for life science focused HPC, scientific computing and "research IT"
Email "chris@bioteam.net" if you want a PDF copy of these slides. I've disabled the raw powerpoint download option on slideshare.
This was a 30-minute talk intended as one of the opening/overview presentations before a full-day deep dive into ScienceDMZ design patterns and architectures.
Direct downloads are not enabled. Contact me directly (chris@bioteam.net) if you for some odd reason want a copy of this slide deck!
This is a custom "Bio IT trends/problems" deck that I did for a general but highly technical audience at the 2014 Internet2 Technology Exchange conference.
Download of the raw PPT is disabled; contact me at chris@bioteam.net if a direct copy or PDF of the presentation would be useful.
BioIT World 2016 - HPC Trends from the Trenches - Chris Dagdigian
As presented at BioIT World 2016. In one of the more popular presentations of the Expo, Chris delivers a candid assessment of the best, the worthwhile, and the most overhyped information technologies (IT) for life sciences. He’ll cover what has changed (or not) in the past year around infrastructure, storage, computing, and networks. This presentation will help you understand the IT needed to build and support data-intensive science.
Video link from the presentation: biote.am/bs
[Note: email chris@bioteam.net if you would like a PDF copy of this presentation]
This is a massive slide deck I used as the starting point for a 1.5-hour talk at the 2012 www.nerlscd.org conference. A mixture of old and (some) new slides from my usual stuff.
Taming Big Science Data Growth with Converged Infrastructure - The BioTeam Inc.
2014 BioIT World Expo presentation
"Many of the largest NGS sites have identified I/O bottlenecks as their number one concern in growing their infrastructure to support current and projected data growth rates. In this talk, Aaron D. Gardner, Senior Scientific Consultant, BioTeam, Inc., will share real-world strategies and implementation details for building converged storage infrastructure to support the performance, scalability and collaborative requirements of today's NGS workflows."
For a copy of this presentation please email: chris@bioteam.net
Mapping Life Science Informatics to the Cloud - Chris Dagdigian
Infrastructure cloud platforms such as those offered by Amazon Web Services are not designed and built with scientific research as the primary use case. These presentation slides cover the current state of mapping life science research and HPC techniques onto “the cloud” and how to work around the common engineering, orchestration and data movement problems.
[Note: I've replaced the 2011 version of this talk deck with a slightly updated version as delivered at the AIRI Petabyte Challenge Meeting]
BioITWorld 2013 presentation - Best practices for building multi-tenant HPC clusters for Pharma/BioTech
Essentially a mini case study of a recent deployment of a multi-petabyte, 1000+ CPU-core Linux cluster in the Boston area.
Please email me at: chris@bioteam.net if you would like the actual PDF file itself.
Disruptive Innovation: how do you use these theories to manage your IT? - Mark Madsen
The term disruptive innovation was popularized by Harvard professor Clayton Christensen in his 1997 book “The Innovator’s Dilemma.” Nearly 20 years later, “Disrupt!” is a popular leadership mantra that is more frequently uttered than experienced. You can't productize it. You can't always control it – at least what effects it has in practice. You aren't necessarily going to like every product of innovation. So are you sure you want it? If so, how do you promote a culture in which innovation can flower – and, potentially, thrive? Because that's probably the best that you can do.
Perhaps there's a better framing for innovation than just "disruption." This session is an overview of commoditization and innovation theories, followed by basic things you can do to apply that theory to your daily job architecting, choosing and managing a data environment in your company.
Cloud Sobriety for Life Science IT Leadership (2018 Edition) - Chris Dagdigian
Candid/blunt AWS advice for research IT and life science IT leadership. Hard lessons learned from many years of AWS consulting. Contact dag@bioteam.net if you want a PDF copy of this presentation
BI isn't big data and big data isn't BI (updated) - Mark Madsen
Big data is hyped, but isn't hype. There are definite technical, process and business differences in the big data market when compared to BI and data warehousing, but they are often poorly understood or explained. BI isn't big data, and big data isn't BI. By distilling the technical and process realities of big data systems and projects we can separate fact from fiction. This session examines the underlying assumptions and abstractions we use in the BI and DW world, the abstractions that evolved in the big data world, and how they are different. Armed with this knowledge, you will be better able to make design and architecture decisions. The session is sometimes conceptual, sometimes detailed technical explorations of data, processing and technology, but promises to be entertaining regardless of the level.
Yes, it’s about the data normally called “big”, but it’s not Hadoop for the database crowd, despite the prominent role Hadoop plays. The session will be technical, but in a technology preview/overview fashion. I won’t be teaching you to write MapReduce jobs or anything of the sort.
The first part will be an overview of the types, formats and structures of data that aren’t normally in the data warehouse realm. The second part will cover some of the basic technology components, vendors and architecture.
The goal is to provide an overview of the extent of data available and some of the nuances or challenges in processing it, coupled with some examples of tools or vendors that may be a starting point if you are building in a particular area.
Annual address covering trends, emerging requirements, pain points and infrastructure issues in the "Bio-IT" aka life science informatics and HPC realm; Email me if you want a PDF of this talk - chris@bioteam.net
Bio-IT Trends From The Trenches (digital edition) - Chris Dagdigian
Note: Contact me directly dag@bioteam.net if you would like a PDF download of these slides
This is Chris Dagdigian’s 10th year delivering his no-holds-barred, candid state-of-the-industry address at BioIT World, and we are not going to let a pandemic stop him.
Instead of his typical talk, five distinguished panelists will join Chris for a spirited discussion on Current Events and Scientific Computing and the impacts of the COVID-19 Pandemic:
Tiny slide deck from a 5-minute lightning talk covering a recent project involving live replication of 2 petabytes of scientific data.
Please leave feedback if you'd like to see this as a long-form technical blog article or conference talk, thanks!
Data lakes, data exhaust, web scale, data is the new oil. Vendors are throwing new terms and analogies at us to convince us to buy their products as the market around data technologies grows. We change data persistence and transaction layers because "databases don't scale" or because data is "unstructured". If data had no structure then it wouldn't be data, it would be noise. Schema on read, schema on write, schemaless databases; they imply structure underlying the data. All data has schema, but that word may not mean what you think it means.
This presentation will describe concepts of data storage and retrieval from technology prehistory (i.e. before the 1980s) and examine the design principles behind both old and new technology for managing data because sometimes post-relational is actually pre-relational. It is important to separate what is identical to things that were tried in the past from new twists on old topics that deliver new capabilities.
Directly related to these topics are performance, scalability and the realities of what organizations do with data over time. All of these topics should guide architecture decisions to avoid the trap of creating technical debts that must be paid later, after systems are in place and change is difficult.
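The schema-on-read vs. schema-on-write distinction the abstract above alludes to can be sketched in a few lines. This is a hypothetical illustration, not material from the talk: a relational table enforces structure when data is written, while raw log records only acquire structure when a query imposes it.

```python
import json
import sqlite3

# Schema-on-write: structure is enforced when data is stored.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE events (user TEXT NOT NULL, action TEXT NOT NULL)")
db.execute("INSERT INTO events VALUES (?, ?)", ("alice", "login"))
# A row missing a required field is rejected at write time:
# db.execute("INSERT INTO events (user) VALUES (?)", ("bob",))  # IntegrityError

# Schema-on-read: raw records are stored as-is; structure (and defaults for
# missing fields) is imposed only when the data is read.
raw_log = ['{"user": "alice", "action": "login"}', '{"user": "bob"}']
events = [json.loads(line) for line in raw_log]
actions = [e.get("action", "unknown") for e in events]  # schema applied here
print(actions)  # ['login', 'unknown']
```

Either way the data has structure; the sketch only changes *when* that structure is checked, which is the point the abstract makes about "schemaless" systems.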
Briefing Room: An alternative for streaming data collection - Mark Madsen
Knowing what’s happening in your enterprise right now can mark the difference between success and failure. The key is to have a rich view of activity, such that analysts and others can explore in a fully multidimensional fashion. Benefiting from such a detailed perspective can help professionals identify the exact nature of problems or opportunities, thus enabling precise actions that make a difference quickly.
Register for this episode of The Briefing Room to hear veteran Analyst Mark Madsen of Third Nature explain how a nexus of innovations for analyzing network traffic can help companies stay on top of their game. He’ll be briefed by Erik Giesa of ExtraHop, who will showcase his company’s stream analytics technology for wire data, which provides real-time, multidimensional views of network traffic. He’ll share success stories of how ExtraHop has solved otherwise intractable problems and enabled a new level of root-cause analysis.
Everything Has Changed Except Us: Modernizing the Data Warehouse - Mark Madsen
Keynote, Munich, June 2016
The way we make decisions has changed. The data we use has changed. The techniques we can apply to data and decisions have changed. Yet what we build and how we build it has barely changed in 20 years.
The definition of madness is doing more of what you already do and expecting different results. The threat to the data warehouse is not from new technology that will replace the data warehouse. It is from destabilization caused by new technology as it changes the architecture, and from failure to adapt to those changes.
The technology that we use is problematic because it constrains and sometimes prevents necessary activities. We don’t need more technology and bigger machines. We need different technology that does different things. More product features from the same vendors won’t solve the problem.
The data we want to use is challenging. We can’t model and clean and maintain it fast enough. We don’t need more data modeling to solve this problem. We need less modeling and more metadata.
And lastly, a change in scale has occurred. It isn’t a simple problem of “big”. The problem with current workloads has been solved, despite the performance problems that many people still have today. Scale has many dimensions – important among them are the number of discrete sources and structures, the rate of change of individual structures, the rate of change in data use, the variety of uses and the concurrency of those uses.
In short, we need new architecture that is not focused on creating stability in data, but one that is adaptable to continuous and rapidly changing uses of data.
(PDF available upon request). This is an updated version of the 2012 BioITWorld Boston talk that I gave 6 weeks later at Bio IT World Asia in June 2012. Some slide content was updated and revised, and I also deleted a number of slides in an attempt to shorten the talk since I'm known to speak fast. There was legit concern I'd be unintelligible to non-native English speakers!
First Nonfiction Reading is a brand-new three-level reading series for young, emergent-level readers that helps students transition from phonics to reading. Each realistic fiction passage is based on a school subject and helps bridge fiction and nonfiction topics.
Makers Go To College - Your Digital Future 2016 - Martin Hamilton
Young digital makers will need a new kind of college - some thoughts from me, presented at the City of Liverpool College Your Digital Future event in June 2016.
Find the complete schedule and dates for the BITSAT entrance exam of BITS Pilani, along with information regarding the application form and other details. The BITS Pilani entrance is a tough exam.
http://www.entrancezone.com/engineering/bitsat-2017-important-dates/
Just because you can doesn't mean that you should - ThingMonk 2016 - Boris Adryan
Big data! Fast data! Real-time analytics! These are buzzwords commonly associated with platform offerings around IoT.
Although the law of large numbers always applies, just because you can deploy more sensors doesn't automatically mean that you should. After all, they cost money and bandwidth, and can be a pain to maintain. Using the example of the Westminster Parking Trial, I'd like to show how analytics on preliminary survey data could have reduced the number of deployed sensors significantly.
A similar logic applies to fast and real-time analytics. While these are advertised as killer features, many people new to IoT and analytics are not even aware that they might get away with batch processing. Using the example of flying a drone, I'd like to discuss for which use cases I'd apply edge processing (on the drone), stream or micro-batch analytics (when data arrives at the platform), or work on batched data (stored in a database).
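The stream-vs-batch trade-off mentioned above can be illustrated with a minimal sketch (hypothetical, not from the talk): a streaming aggregate updates with each sensor reading in constant memory, while a batch job stores everything and computes once later.

```python
# Streaming: update an aggregate as each reading arrives (constant memory,
# a result is available after every sample).
def streaming_mean(readings):
    count, total = 0, 0.0
    for r in readings:          # one pass, no stored history
        count += 1
        total += r
        yield total / count

# Batch: persist everything first, process afterwards (simpler, higher latency).
def batch_mean(readings):
    stored = list(readings)     # the "database" in this sketch
    return sum(stored) / len(stored)

samples = [2.0, 4.0, 6.0]
print(list(streaming_mean(samples)))  # [2.0, 3.0, 4.0]
print(batch_mean(samples))            # 4.0
```

If only the final mean matters and latency is tolerable, the batch path is the simpler choice, which is exactly the "you might get away with batch processing" point.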
Making the Most of In-Memory: More than Speed - Inside Analysis
The Briefing Room with Robin Bloor and Kognitio
Live Webcast Oct. 1, 2013
Watch the archive: https://bloorgroup.webex.com/bloorgroup/lsr.php?AT=pb&SP=EC&rID=7539482&rKey=bc304aa8dac7b781
Everyone’s talking about in-memory these days, and the term has become synonymous with speed. But pinning data into memory is just the beginning, and it’s about more than speed. In-memory solutions need a tailored architecture, one that can take full advantage of RAM processing from every aspect, and this requires an approach that considers memory and CPU from the ground up.
Register for this episode of The Briefing Room to hear from veteran Analyst Robin Bloor as he explains how memory is on the fast track to supersede disk, at least with respect to advanced analytics. He’ll be briefed by Kognitio CTO Roger Gaskell, who has pioneered the in-memory analytical platform since its inception in 1989. He will also discuss how this type of solution changes the landscape for the modern data architecture and its impact on advanced analytical capabilities.
Visit InsideAnalysis.com for more information
CIW Lab with CohesiveFT: Get started in public cloud - Part 1 Cloud & Virtual... - Ryan Koop
CohesiveFT: Get started with public cloud
It's time to explore the public cloud. Get familiar with Amazon's AWS EC2 compute and S3 storage. Demo and guides will prep you to do big things with hosting for your websites and apps!
Part 1 Cloud & Virtualization: Welcome! We'll run through the basics of public vs. private cloud, the cloud marketplace, and why we picked AWS to demonstrate
Hosted by: Ryan Koop, Director of Marketing
CIW Lab with CohesiveFT: Get started in public cloud - Part 2 Hands On - Cohesive Networks
CohesiveFT: Get started with public cloud
It's time to explore the public cloud. Get familiar with Amazon's AWS EC2 compute and S3 storage. Demo and guides will prep you to do big things with hosting for your websites and apps!
Part 2 Hands On: After covering the basics of cloud and virtualization, we'll dive into AWS terminology and getting set up, then we'll all find an image and launch our own AWS instance. Additional information includes VPC vs. VNS3 features, real cloud use cases, and further reading.
Hosted by: Ryan Koop, Director of Product Marketing
AZUG.BE - Azure User Group Belgium - First public meeting - Maarten Balliauw
- What is AZUG? Who is who?
- An overview of the Azure platform
- .NET Services
- Enterprise reasons to adopt the cloud
- Getting started with Azure
- Open discussion
CloudCamp Chicago - November 2013: Fighting Cloud FUD - CloudCamp Chicago
Slides from the November CloudCamp Chicago. This time, we fought off cloud FUD ("fear, uncertainty, and doubt").
Lightning talks included in these slides:
- "Tech in Illinois" - Fred Hoch, Chairman, Illinois Technology Association @fredhoch
- "A retrospective of the Cloud, then and now" - Michael Segel, Segel & Associates; Founder of CHUG @chihadoopusers
- "Scientific Clouds: Hard Numbers vs. FUD" - Steve Timm, Lead FermiCloud Project, FermiLabs @StevenCTimm
- "Enterprise Adoption - The Chasm is Crossed" - Sashi Desikan, Global Executive, Pega Cloud at PegaSystems @PegaSashi
- "hybrid cloud governance" - Mike Bresett, Account CTO, Unisys @bresett
- "How we fought and are fighting Cloud FUD" - Paul Inboriboon, Director of Tech Infrastructure,Alzheimer's
Association @inboriboon
- "Can you make your cloud rain at the press of a button?" Robert Clarke, Account Executive, Crissie Insurance Group @RobertKClarke
CIW Lab with CoheisveFT: Get started in public cloud - Part 1 Cloud & Virtual...Cohesive Networks
CohesiveFT: Get started with public cloud
It's time to explore the public cloud. Get familiar with Amazon's AWS EC2 compute and S3 storage. Demo and guides will prep you to do big things with hosting for your websites and apps!
Part 1 Cloud & Virtualization: Welcome! We'll run through the basics of public vs. private cloud, the cloud marketplace, and why we picked AWS to demonstrate
Hosted by: Margaret Walker, Marketing Specialist
Slides for talk by Prof Christopher Millard on "Cloud computing: identifying and managing legal risks" at Google's Oxford Internet Institute Learned Lunches, Brussel, February 2011
Internet of Things (IoT) - in the cloud or rather on-premises?Guido Schmutz
You want to implement a Big Data or Internet of Things (IoT) solution and like to know if it should be implemented in the cloud or on-premises. You are interested in the cloud offerings of vendors and what benefits they provide and if a similar solution would not be possible on-premises.
This presentation deals with this and other questions. Starting from a vendor-independent reference architecture and corresponding design patterns, different cloud solutions from various vendors are compared and rated. Additionally, it will be shown how such solution could be implemented on-premises and how a hybrid IoT solution could look like.
Mehr und schneller ist nicht automatisch besser - data2day, 06.10.16Boris Adryan
Das Gesetz der großen Zahlen gilt immer: Die statistische Sicherheit nimmt mit der Anzahl der Datenpunkte immer zu, sofern die Datennahme fair erfolgt. Leider kostet das Sammeln der Daten oftmals Geld, und so ist man vor allem im Bereich der Sensorik (Stichwort: Internet der Dinge) gezwungen, sinnvolle Kompromisse einzugehen. In diesem Vortrag fasse ich die Erkenntnisse eines Projekts zusammen, in dem die Datenanalytik zeigte, dass man zukünftig nur 60% der ausgebrachten Sensoren wirklich braucht. Auch muss es nicht immer Echtzeit-Analyse sein: Mit einer auf den Business-Case abgestimmten Datenstrategie lassen sich unnötige Ausgaben vermeiden.
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...SOFTTECHHUB
The choice of an operating system plays a pivotal role in shaping our computing experience. For decades, Microsoft's Windows has dominated the market, offering a familiar and widely adopted platform for personal and professional use. However, as technological advancements continue to push the boundaries of innovation, alternative operating systems have emerged, challenging the status quo and offering users a fresh perspective on computing.
One such alternative that has garnered significant attention and acclaim is Nitrux Linux 3.5.0, a sleek, powerful, and user-friendly Linux distribution that promises to redefine the way we interact with our devices. With its focus on performance, security, and customization, Nitrux Linux presents a compelling case for those seeking to break free from the constraints of proprietary software and embrace the freedom and flexibility of open-source computing.
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdfPeter Spielvogel
Building better applications for business users with SAP Fiori.
• What is SAP Fiori and why it matters to you
• How a better user experience drives measurable business benefits
• How to get started with SAP Fiori today
• How SAP Fiori elements accelerates application development
• How SAP Build Code includes SAP Fiori tools and other generative artificial intelligence capabilities
• How SAP Fiori paves the way for using AI in SAP apps
Dr. Sean Tan, Head of Data Science, Changi Airport Group
Discover how Changi Airport Group (CAG) leverages graph technologies and generative AI to revolutionize their search capabilities. This session delves into the unique search needs of CAG’s diverse passengers and customers, showcasing how graph data structures enhance the accuracy and relevance of AI-generated search results, mitigating the risk of “hallucinations” and improving the overall customer journey.
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024Neo4j
Neha Bajwa, Vice President of Product Marketing, Neo4j
Join us as we explore breakthrough innovations enabled by interconnected data and AI. Discover firsthand how organizations use relationships in data to uncover contextual insights and solve our most pressing challenges – from optimizing supply chains, detecting fraud, and improving customer experiences to accelerating drug discoveries.
State of ICS and IoT Cyber Threat Landscape Report 2024 previewPrayukth K V
The IoT and OT threat landscape report has been prepared by the Threat Research Team at Sectrio using data from Sectrio, cyber threat intelligence farming facilities spread across over 85 cities around the world. In addition, Sectrio also runs AI-based advanced threat and payload engagement facilities that serve as sinks to attract and engage sophisticated threat actors, and newer malware including new variants and latent threats that are at an earlier stage of development.
The latest edition of the OT/ICS and IoT security Threat Landscape Report 2024 also covers:
State of global ICS asset and network exposure
Sectoral targets and attacks as well as the cost of ransom
Global APT activity, AI usage, actor and tactic profiles, and implications
Rise in volumes of AI-powered cyberattacks
Major cyber events in 2024
Malware and malicious payload trends
Cyberattack types and targets
Vulnerability exploit attempts on CVEs
Attacks on counties – USA
Expansion of bot farms – how, where, and why
In-depth analysis of the cyber threat landscape across North America, South America, Europe, APAC, and the Middle East
Why are attacks on smart factories rising?
Cyber risk predictions
Axis of attacks – Europe
Systemic attacks in the Middle East
Download the full report from here:
https://sectrio.com/resources/ot-threat-landscape-reports/sectrio-releases-ot-ics-and-iot-security-threat-landscape-report-2024/
GraphRAG is All You need? LLM & Knowledge GraphGuy Korland
Guy Korland, CEO and Co-founder of FalkorDB, will review two articles on the integration of language models with knowledge graphs.
1. Unifying Large Language Models and Knowledge Graphs: A Roadmap.
https://arxiv.org/abs/2306.08302
2. Microsoft Research's GraphRAG paper and a review paper on various uses of knowledge graphs:
https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/
Removing Uninteresting Bytes in Software FuzzingAftab Hussain
Imagine a world where software fuzzing, the process of mutating bytes in test seeds to uncover hidden and erroneous program behaviors, becomes faster and more effective. A lot depends on the initial seeds, which can significantly dictate the trajectory of a fuzzing campaign, particularly in terms of how long it takes to uncover interesting behaviour in your code. We introduce DIAR, a technique designed to speedup fuzzing campaigns by pinpointing and eliminating those uninteresting bytes in the seeds. Picture this: instead of wasting valuable resources on meaningless mutations in large, bloated seeds, DIAR removes the unnecessary bytes, streamlining the entire process.
In this work, we equipped AFL, a popular fuzzer, with DIAR and examined two critical Linux libraries -- Libxml's xmllint, a tool for parsing xml documents, and Binutil's readelf, an essential debugging and security analysis command-line tool used to display detailed information about ELF (Executable and Linkable Format). Our preliminary results show that AFL+DIAR does not only discover new paths more quickly but also achieves higher coverage overall. This work thus showcases how starting with lean and optimized seeds can lead to faster, more comprehensive fuzzing campaigns -- and DIAR helps you find such seeds.
- These are slides of the talk given at IEEE International Conference on Software Testing Verification and Validation Workshop, ICSTW 2022.
In his public lecture, Christian Timmerer provides insights into the fascinating history of video streaming, starting from its humble beginnings before YouTube to the groundbreaking technologies that now dominate platforms like Netflix and ORF ON. Timmerer also presents provocative contributions of his own that have significantly influenced the industry. He concludes by looking at future challenges and invites the audience to join in a discussion.
Transcript: Selling digital books in 2024: Insights from industry leaders - T...BookNet Canada
The publishing industry has been selling digital audiobooks and ebooks for over a decade and has found its groove. What’s changed? What has stayed the same? Where do we go from here? Join a group of leading sales peers from across the industry for a conversation about the lessons learned since the popularization of digital books, best practices, digital book supply chain management, and more.
Link to video recording: https://bnctechforum.ca/sessions/selling-digital-books-in-2024-insights-from-industry-leaders/
Presented by BookNet Canada on May 28, 2024, with support from the Department of Canadian Heritage.
Elevating Tactical DDD Patterns Through Object CalisthenicsDorra BARTAGUIZ
After immersing yourself in the blue book and its red counterpart, attending DDD-focused conferences, and applying tactical patterns, you're left with a crucial question: How do I ensure my design is effective? Tactical patterns within Domain-Driven Design (DDD) serve as guiding principles for creating clear and manageable domain models. However, achieving success with these patterns requires additional guidance. Interestingly, we've observed that a set of constraints initially designed for training purposes remarkably aligns with effective pattern implementation, offering a more ‘mechanical’ approach. Let's explore together how Object Calisthenics can elevate the design of your tactical DDD patterns, offering concrete help for those venturing into DDD for the first time!
A tale of scale & speed: How the US Navy is enabling software delivery from l...sonjaschweigert1
Rapid and secure feature delivery is a goal across every application team and every branch of the DoD. The Navy’s DevSecOps platform, Party Barge, has achieved:
- Reduction in onboarding time from 5 weeks to 1 day
- Improved developer experience and productivity through actionable findings and reduction of false positives
- Maintenance of superior security standards and inherent policy enforcement with Authorization to Operate (ATO)
Development teams can ship efficiently and ensure applications are cyber ready for Navy Authorizing Officials (AOs). In this webinar, Sigma Defense and Anchore will give attendees a look behind the scenes and demo secure pipeline automation and security artifacts that speed up application ATO and time to production.
We will cover:
- How to remove silos in DevSecOps
- How to build efficient development pipeline roles and component templates
- How to deliver security artifacts that matter for ATO’s (SBOMs, vulnerability reports, and policy evidence)
- How to streamline operations with automated policy checks on container images
DevOps and Testing slides at DASA ConnectKari Kakkonen
My and Rik Marselis slides at 30.5.2024 DASA Connect conference. We discuss about what is testing, then what is agile testing and finally what is Testing in DevOps. Finally we had lovely workshop with the participants trying to find out different ways to think about quality and testing in different parts of the DevOps infinity loop.
PHP Frameworks: I want to break free (IPC Berlin 2024)Ralf Eggert
In this presentation, we examine the challenges and limitations of relying too heavily on PHP frameworks in web development. We discuss the history of PHP and its frameworks to understand how this dependence has evolved. The focus will be on providing concrete tips and strategies to reduce reliance on these frameworks, based on real-world examples and practical considerations. The goal is to equip developers with the skills and knowledge to create more flexible and future-proof web applications. We'll explore the importance of maintaining autonomy in a rapidly changing tech landscape and how to make informed decisions in PHP development.
This talk is aimed at encouraging a more independent approach to using PHP frameworks, moving towards a more flexible and future-proof approach to PHP development.
PHP Frameworks: I want to break free (IPC Berlin 2024)
Bio-IT & Cloud Sobriety: 2013 Beyond The Genome Meeting
1. Bio-IT & Cloud Sobriety
Beyond the Genome, San Francisco 2013
Thursday, October 3, 13
2. Agenda
1. Intro & Terminology: Getting our buzzwords straight
2. The ‘Meta’ Issue: What is driving all of this?
3. Drivers For Cloud Adoption In Bio-IT
4. What The Cloud Salespeople Will Not Tell You
5. Private Clouds & Practical Advice
6. The Road Ahead
3.
I’m Chris.
I’m an infrastructure geek.
I work for the BioTeam.
Twitter: @chris_dag
4. Who, What, Why ...
BioTeam
‣ Independent consulting shop
‣ Staffed by scientists forced to learn IT, SW & HPC to get our own research done
‣ 10+ years bridging the “gap” between science, IT & high performance computing
‣ Our wide-ranging work is what gets us invited to speak at events like this ...
5. Seriously.
Listen to me at your own risk
‣ Clever people find multiple solutions to common issues
‣ I’m fairly blunt, burnt-out and cynical in my advanced age
‣ A significant portion of my work has been done in demanding production Biotech & Pharma environments
‣ Filter my words accordingly
7.
Defining Terms
‣ The term ‘cloud computing’ is almost meaning-free today – too many marketers have fuzzed and co-opted the term
‣ Before serious discussion can occur it is essential that all parties are operating from similar baseline presumptions
8. Gartner
Defining Terms
‣ Gartner: “Cloud computing is a style of computing where scalable and elastic IT-enabled capabilities are delivered as a service to external customers using Internet technologies.”
9.
My preferred definition
‣ Jinesh Varia on Amazon Web Services: “… a highly reliable and scalable infrastructure for deploying web-scale solutions, with minimal support and administration costs, and more flexibility than you’ve come to expect from your own infrastructure, either on-premise or at a datacenter facility.”
10. I’m an infrastructure geek, which do you think I prefer?
Cloud Subtypes
‣ Software as a Service (SaaS)
‣ Platform as a Service (PaaS)
‣ Infrastructure as a Service (IaaS)
11.
This is an IaaS cloud talk
‣ We need flexible scientific computing and informatics capability “on the cloud”
‣ Service and Platform clouds are not a good fit for the flexible/general use case
‣ IaaS clouds provide “building blocks” that allow us to build the informatics environments we require
29. Non-Trivial HPC on the cloud
‣ 16 of AWS’s biggest servers + 22 GPU nodes ... at a cost of $30/hour via the Spot Market
30. Why this work was ‘easy’ on Amazon AWS ...
Difficult on any other cloud
‣ Let’s discuss why this simulation workload would be much, much harder to do on some other cloud platform ...
31. Why this work was ‘easy’ on Amazon AWS ...
Nightmare on any other cloud
‣ Brand ‘X’ Cloud offers: (1) Virtual Servers, (2) Block Storage, (3) Object Storage, (4) ... and maybe some other stuff if I’m lucky
‣ Amazon offers: EC2, S3, EBS, RDS, SNS, SQS, SWS, GPUs, SSDs, CloudFormation, VPC, ENIs, SecurityGroups, 10GbE, DirectConnect, Reserved Instances, ImportExport, Spot Market ... and ~30 other products and service features, with more added monthly
32. Easy on AWS; much harder elsewhere
One very specific example
‣ The widely used FLEXlm license server uses NIC MAC addresses when generating license keys
‣ Different MAC? Science stops. Screwed.
‣ VPC ENIs decouple the MAC address from the underlying instance: the ENI (and its MAC) can be kept and moved between instances. Badass.
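The node-locking failure mode is easy to sketch. This is not FLEXlm's actual key algorithm (the hashing scheme and function names are purely illustrative), but it shows why a key minted against one MAC-derived hostid stops validating the moment the MAC changes, and therefore why an ENI that keeps a stable MAC across instance replacements matters:

```python
import hashlib
import uuid

def mac_hostid() -> str:
    """Primary NIC MAC as a 12-digit hex string, the kind of value
    node-locked license managers use as a machine identity."""
    return f"{uuid.getnode():012x}"

def make_license_key(feature: str, hostid: str) -> str:
    """Hypothetical stand-in for a vendor key generator: the key is
    derived from the hostid, so it is bound to that MAC."""
    return hashlib.sha256(f"{feature}:{hostid}".encode()).hexdigest()[:16]

def key_is_valid(feature: str, key: str, hostid: str) -> bool:
    """A key only validates on the machine whose MAC it was minted for."""
    return key == make_license_key(feature, hostid)
```

Replace the instance and get a new MAC, and `key_is_valid` returns False; pin the MAC with a persistent ENI and the same key keeps working.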
33. Section 2: The ‘Meta’ Issue (What is driving all of this?)
35.
Big Picture / Meta Issue
‣ HUGE revolution in the rate at which lab platforms are being redesigned, improved & refreshed
• Example: a CCD sensor upgrade on that confocal microscopy rig just doubled storage requirements
• Example: the 2D ultrasound imager is now a 3D imager
• Example: an Illumina HiSeq upgrade just doubled the rate at which you can acquire genomes. Massive downstream increase in storage, compute & data movement needs
‣ For the above examples, do you think IT was informed in advance?
36. Science progressing way faster than IT can refresh/change
The Central Problem Is ...
‣ Instrumentation & protocols are changing FAR FASTER than we can refresh our Research-IT & Scientific Computing infrastructure
• Bench science is changing month-to-month ...
• ... while our IT infrastructure only gets refreshed every 2-7 years
‣ We have to design systems TODAY that can support unknown research requirements & workflows over many years (gulp ...)
37. The Central Problem Is ...
‣ The easy period is over
‣ 5 years ago we could toss inexpensive storage and servers at the problem; even in a nearby closet or under a lab bench if necessary
‣ That does not work any more; real solutions required
38. And a related problem ...
‣ It has never been easier to acquire vast amounts of data cheaply and easily
‣ Growth rate of data creation/ingest exceeds the rate at which the storage industry is improving disk capacity
‣ Not just a storage lifecycle problem. This data *moves* and often needs to be shared among multiple entities and providers
• ... ideally without punching holes in your firewall or consuming all available internet bandwidth
39. If we get it wrong ...
‣ Lost opportunity
‣ Missing capability
‣ Beaten by the competition
‣ Frustrated & very vocal scientific staff
‣ Problems in recruiting, retention,
publication & product development
40. Section 3: Drivers For Cloud Adoption In Bio-IT
42. Mainstream in life science for quite some time ...
Public IaaS Clouds
‣ Public infrastructure clouds offer an excellent “pressure release valve” when rapidly changing scientific requirements can’t be satisfied by on-premise infrastructure
‣ Economics can’t be ignored
‣ Popular meeting ground for data swapping and collaboration
‣ ‘Scriptable Datacenters’ enabling entirely new capabilities
‣ Money people like converting CapEx to OpEx
43. The ‘neutral’ meeting ground ...
Cloud Hubs & Portals
‣ Many types of entities need to meet, collaborate and exchange life science data
‣ Data sharing hubs and portals becoming popular on public IaaS clouds like AWS
‣ Why?
• Far easier than punching holes in your firewall and issuing VPN credentials to outsiders
44. Compelling economics
Cloud Data Repositories
‣ IaaS clouds becoming the ‘center of gravity’ for some large scale scientific data hosting
‣ Why?
• Compelling pricing
• No need to own & operate mirror sites
• AWS has some very interesting ‘downloader pays’ models that seem to be a good fit for grant-funded science with mandated multi-year data accessibility requirements
www.1000genomes.org
45. My $.02
Amazon vs. Everyone Else
‣ AWS is the clear leader for Bio-IT IaaS cloud use
‣ Why?
• By far the largest number of IaaS building blocks
• Rate of innovation puts AWS years ahead of the competition
‣ Exceptions
• For specific high-value pipelines & workstreams, Google & Microsoft are valid alternatives
46. Section 4: What The Cloud Salespeople Will Not Tell You
47. What the salesfolk won’t tell you ...
‣ There is no one-size-fits-all research design pattern ...
‣ You are not going to toss everything and replace it with “Big Data”
‣ Very few of us have a single pipeline or workflow that we can devote endless engineering effort to
‣ We are not going to toss out hundreds of legacy codes and rewrite everything for GPUs or MapReduce
‣ For research HPC it’s all about the building blocks { and how we can effectively use/deploy them }
48.
What the salesfolk won’t tell you
‣ Your organization actually needs THREE tested cloud design patterns:
‣ (1) To handle ‘legacy’ scientific apps & workflows
‣ (2) The special stuff that is worth re-architecting
‣ (3) Hadoop & big data analytics
49. Legacy HPC on the Cloud
Design Pattern #1 - Legacy
‣ There are many hundreds of existing algorithms and applications in the life science informatics space
‣ We’ll be running/using these codes for years to come
‣ Many can’t or will never be refactored or rewritten
‣ I call this the “legacy” design pattern
51. StarCluster
Design Pattern #1 - Legacy
‣ MIT StarCluster
• http://web.mit.edu/star/cluster/
‣ Infinite Awesomeness. Worth a talk by itself.
‣ This is your baseline
‣ Extend as needed
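A minimal sketch of what "baseline" means in practice: a `~/.starcluster/config` fragment defining a small cluster template (credentials, AMI ID, key name, and sizes below are placeholders you must fill in), after which `starcluster start mycluster` launches the cluster and `starcluster sshmaster mycluster` logs you into the head node.

```ini
[global]
DEFAULT_TEMPLATE = smallcluster

[aws info]
AWS_ACCESS_KEY_ID = <your-access-key>
AWS_SECRET_ACCESS_KEY = <your-secret-key>

[key mykey]
KEY_LOCATION = ~/.ssh/mykey.rsa

[cluster smallcluster]
; a 4-node cluster: 1 master + 3 compute nodes, scheduler preconfigured
KEYNAME = mykey
CLUSTER_SIZE = 4
NODE_IMAGE_ID = ami-xxxxxxxx
NODE_INSTANCE_TYPE = c1.xlarge
```

When you are done, `starcluster terminate mycluster` tears everything down so the meter stops running.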
52.
Design Pattern #2 - “Cloudy”
‣ Some of our research workflows are important enough to be rewritten for “the cloud” and the advantages that a truly elastic & API-driven infrastructure can deliver
‣ This is where you have the most freedom
‣ Many published best practices you can borrow
‣ Warning: cloud vendor lock-in potential is strongest here
53.
Design Pattern #3 - Hadoop/BigData
‣ Hadoop and “big data” need to be on your radar
‣ Be careful though; you’ll need a gas mask to avoid the smog of marketing and vapid hype
‣ The utility is real and this does represent one “future path” for analysis of large data sets
54.
Design Pattern #3 - Hadoop/BigData
‣ It’s gonna be a MapReduce world, get used to it
‣ Little need to roll your own Hadoop in 2013
‣ ISV & commercial ecosystem already healthy
‣ Multiple providers today; both onsite & cloud-based
‣ Often a slam-dunk cloud use case
55. What you need to know
Design Pattern #3 - Hadoop/BigData
‣ “Hadoop” and “Big Data” are now general terms
‣ You need to drill down to find out what people actually mean
‣ We are still in the period where senior leadership may demand “Hadoop” or “BigData” capability without any actual business or scientific need
56. What you need to know
Hadoop & “Big Data”
‣ In broad terms you can break “Big Data” down into two very basic use cases:
1. Compute: Hadoop can be used as a very powerful platform for the analysis of very large data sets. The Google search term here is “map reduce”
2. Data Stores: Hadoop is driving the development of very sophisticated “NoSQL”, non-relational databases and data query engines. The Google search terms include “nosql”, “couchdb”, “hive”, “pig”, “mongodb”, etc.
‣ Your job is to figure out which type applies for the groups requesting “Hadoop” or “BigData” capability
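The "Compute" use case is easiest to grasp from the map-reduce model itself. Here is a toy, single-process Python sketch of the canonical word-count example; Hadoop's contribution is running these same two phases, fault-tolerantly, across thousands of nodes:

```python
from collections import defaultdict
from itertools import chain

def map_phase(record):
    """Mapper: emit a (word, 1) pair for every word in one input record."""
    for word in record.split():
        yield (word.lower(), 1)

def reduce_phase(pairs):
    """Reducer: sum the emitted counts for each distinct key."""
    totals = defaultdict(int)
    for key, count in pairs:
        totals[key] += count
    return dict(totals)

def word_count(records):
    """Run both phases over an iterable of text records."""
    return reduce_phase(chain.from_iterable(map_phase(r) for r in records))
```

For example, `word_count(["GATTACA gattaca", "gattaca"])` yields `{"gattaca": 3}`. Because mappers are independent per record and reducers are independent per key, both phases parallelize trivially, which is the whole point of the model.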
57. What you need to know
Hadoop & “Big Data”
‣ Hadoop adoption is being driven by a small group of academics writing and releasing open source life science Hadoop applications
‣ Your people will want to run these codes
‣ In some academic environments you may find people wanting to develop on this platform
58. Section 5: Private Clouds & Practical Advice
60.
Private Clouds: Only 60% BS in ’13
‣ I’m known as a private cloud cynic
‣ The hype::usefulness ratio is still extreme
‣ For vendors it’s still a play to get you to toss everything in your datacenter and ‘start fresh’
‣ However ...
61.
Private Clouds: Make sense for ...
‣ If you are a globe-spanning enterprise with tens of thousands of employees or “customers”
‣ If you want to leverage hardcore DevOps for serious infrastructure automation and configuration management
‣ If you want to use Private Cloud to drive fresh new tech like object storage and software defined networking (SDN) into your environment
62.
Private Clouds: However ...
‣ My $.02 is that the two primary science-facing benefits from Cloud are:
1. Browsable catalogs of available server images
2. Self-service (scientists can select & provision systems)
‣ And guess what? You can do that TODAY on most enterprise virtualization stacks WITHOUT jumping on the private cloud bandwagon
‣ My advice:
• Think hard about what you hope to gain from private clouds and do some extra due diligence to see if you can gain those capabilities in a simpler and cheaper way
64. Design Patterns
Practical Advice
‣ Remember the three design patterns on the cloud:
• Legacy HPC systems (replicate traditional clusters in the cloud)
• Hadoop
• Cloudy (when you rewrite something to fully leverage cloud capability)
65. Policies and Procedures
Practical Advice
‣ Cloud technology bits are easy. Cloud Process and Policy discussions take forever
‣ Start these conversations sooner rather than later!
66. Core services that take time and advance planning
Practical Advice
‣ A few key cloud services take time and advance planning to deploy properly:
‣ VPNs & subnet schemes
‣ Identity Management & Access Control
‣ Data Movement
68.
Physical Ingest: Just Plain Nasty
‣ Easy to talk about in theory
‣ Seems “easy” to scientists and even IT at first glance
‣ Really, really nasty in practice
• Incredibly time consuming
• Significant operational burden
• Easy to do badly / lose data
69. And a huge need for fast(er) research networks!
Huge Need For Network Ingest
1. Public data repositories have petabytes of useful data
2. Collaborators still need to swap data in serious ways
3. Amazon becoming an important repo of public and private sources
4. Many vendors now “deliver” to the cloud
76. Network vs. Physical
Cloud Data Movement
‣ With a 1GbE internet connection ...
‣ ... and using Aspera software ...
‣ We sustained 700 Mb/sec for more than 7 hours freighting genomes into Amazon Web Services
‣ This is fast enough for many use cases, including genome sequencing core facilities
‣ Chris Dwan’s webinar on this topic: http://biote.am/7e
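Back-of-envelope transfer math is worth doing before committing to network ingest, and units matter: megabits and megabytes per second differ by 8x. A small helper (illustrative numbers; `efficiency` is a fudge factor for protocol overhead on lossy WAN links, which is exactly what tools like Aspera exist to claw back):

```python
def transfer_hours(data_gb: float, rate_mbps: float, efficiency: float = 1.0) -> float:
    """Hours needed to move `data_gb` gigabytes over a link sustaining
    `rate_mbps` megabits/second (decimal units: 1 GB = 8000 megabits)."""
    megabits = data_gb * 8000
    return megabits / (rate_mbps * efficiency) / 3600

# e.g. a 1 TB dataset at a sustained 700 Mb/sec:
# transfer_hours(1000, 700) -> roughly 3.2 hours
```

Run the same numbers at the 20-30% of nominal bandwidth that untuned TCP often delivers over long-haul links and the answer shifts from hours to days, which is when teams start mailing disks.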
77. Network vs. Physical
Cloud Data Movement
‣ Results like this mean we now favor network-based data movement over physical media movement
‣ Large-scale physical data movement carries a high operational burden and consumes non-trivial staff time & resources
78. There are three ways to do network data movement ...
Cloud Data Movement
1. Buy software from Aspera and be done with it
2. Attend the annual SuperComputing conference & see which student group wins the bandwidth challenge contest; use their code
3. Get GridFTP from the Globus folks
79. Section 6: The Road Ahead
81. Some final thoughts
Future Trends & Patterns
‣ Compute continues to become easier
‣ Data movement (physical & network) gets harder
‣ The cloud decision may be made by where your data actually resides
‣ Cost of storage will be dwarfed by the “cost of managing stored data”
‣ We can see end-of-life for our current IT architecture and design patterns; new patterns will start to appear over the next 2-5 years
82. Very blurry lines in 2013 for all of these roles
Scientist/SysAdmin/Programmer
‣ Cloud is forcing these issues ...
‣ Far more control is going into the hands of the research end user
‣ IT support roles will radically change; no longer owners or gatekeepers
‣ IT will handle policies, procedures, reference patterns, security & best practices
‣ Researchers will control the “what”, “when” and “how big”