High-Performance Networking Use Cases in Life Sciences
1 
2014 Internet2 Technology Exchange; Indianapolis, IN 
Slides available at http://www.slideshare.net/arieberman
Who am I? 
2 
Director of Government Services, Principal Investigator
I'm a fallen scientist: Ph.D. in Molecular Biology, Neuroscience, Bioinformatics
I'm an HPC/infrastructure geek: 15 years
I help enable science!
I’m Ari
3 
BioTeam 
‣ Independent consulting shop
‣ Staffed by scientists forced to learn IT, SW & HPC to get our own research done
‣ Infrastructure, Informatics, Software Development, Cross-disciplinary Assessments
‣ 11+ years bridging the "gap" between science, IT & high performance computing
‣ Our wide-ranging work is what gets us invited to speak at events like this ...
BioTeam 
What do we do? 
4 
Laboratory Knowledge
Converged Solution
Our domain coverage 
Mostly work in Life Sciences 
• Government 
• Universities 
• Big pharma 
• Biotech 
• Private institutes 
• Diagnostic startups 
• Oil and Gas 
• Geospatial 
• Hollywood Animation 
• Law Enforcement 
5
6 
OK, so why am I here talking to you?
We’ve noticed a few things 
We have a unique perspective across much of the life sciences
‣ Big Data has arrived in Life Sciences
‣ Data is being generated at unprecedented rates
‣ Research and biomedical orgs were caught off guard
‣ IT is running to catch up on limited budgets
‣ Money is tight; orgs are reluctant to invest in Bio-IT
7 
25% of all Life Scientists will require HPC in 2015!
8 
Big Picture / Meta Issue 
‣ HUGE revolution in the rate at which lab platforms are being redesigned, improved & refreshed
‣ IT is not part of the conversation and is running to catch up
The Central Problem Is ... 
Science is progressing far faster than IT can refresh or change
‣ Instrumentation & protocols are changing FAR FASTER than we can refresh our Research-IT & Scientific Computing infrastructure
• Bench science is changing month-to-month ...
• ... while our IT infrastructure only gets refreshed every 2-7 years
‣ We have to design systems TODAY that can support unknown research requirements & workflows over many years (gulp ...)
9
10 
It’s a risky time to be doing Bio-IT 
11 
What are the drivers in Bio-IT today?
Genomics: Next-Generation Sequencing (NGS)
It’s like the hard drive of life 
12 
The big deal about DNA 
‣ DNA is the template of life
‣ DNA is read --> RNA
‣ RNA is read --> Proteins
‣ Proteins are the functional machinery that makes life possible
‣ Understanding the template = understanding the basis for disease
How does NGS work? 
Sequencing by Synthesis 
13
How does NGS work? 
Reference assembly, variant calling 
14
The Human Genome 
Gateway to personalized medicine 
‣ 3.2 Gbp 
‣ 23 chromosomes 
‣ ~21,000 genes 
‣ Over 55M known variations
15
...and why NGS is the primary driver 
16 
The Problem...
‣ Sequencers are now relatively cheap and fast
‣ Some can generate a human genome in 18 hours, for $2,000
‣ Everyone is doing it
‣ One run can generate 3TB of data in that time
‣ The first genome took 13 years and $2.7B to complete
‣ We know of 10 organizations planning 100,000 genomes over 5 years
That’s 14PB of data, folks
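A back-of-envelope check of that 14 PB figure. The per-genome footprint below is my assumption, not the deck's; retained data per genome varies widely with coverage and file formats.

```python
# Rough check of the 14 PB claim for 100,000 genomes.
# Assumption: ~140 GB of retained data per genome (illustrative).
GB = 10**9
PB = 10**15

genomes = 100_000
bytes_per_genome = 140 * GB

total_pb = genomes * bytes_per_genome / PB
print(f"{total_pb:.0f} PB")  # 14 PB
```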
17 
Other Methodologies Not Far Behind
High-throughput Imaging 
‣ Robotics screening millions of compounds on live cells 24/7
• Not as much data as genomics in volume, but just as complex
• Data volumes in the 10's of TB/week
‣ Confocal Imaging
• Scanning 100's of tissue sections/week, each with 10's of scans, each with 20-40 layers and multiple fluorescent channels
• Data volumes in the 1's-10's of TB/week
18
High-res medical imaging 
High-power, dense-detector MRI scanners in use 24/7 at large research hospitals
‣ Creating 3D models of brains, comparing large datasets
‣ Using those models to perform detailed neurosurgery with real-time analytic feedback from a supercomputer in the OR (cool stuff)
‣ Also generates 10's of TB/week
19
20 
This is a huge problem
‣ Causing a literal deluge of data, in the 10's of petabytes
‣ NIH is generating 1.5PB of data/month
‣ The first real case in life science where 100Gb networking might really be needed
‣ But there's not enough storage or compute
21 
And, just to make things more complicated
File & Data Types 
We have them all 
‣ Massive text files 
‣ Massive binary files 
‣ Flatfile ‘databases’ 
‣ Spreadsheets everywhere 
‣ Directories w/ 6 million files
‣ Large files: 600GB+
‣ Small files: 30KB or smaller
22
Why, giant meta-analyses, of course 
23 
What to do with all that data? 
‣ Typical problem across all of big data: how do you use it?
‣ In life sciences: no real standards for data formats
‣ Data scattered all over, despite the push for Data Commons
‣ Not always accessible
‣ Combining the data, if you have it all, is a real challenge
A Compounding Problem... 
Scientists don’t like to share (really!) 
‣ The fear:
• if someone sees data before it is published, they might steal it and publish it themselves (getting scooped)
‣ Causes:
• Long time to publication
• Outdated methods of assigning scientific credit
• Sharing is not properly incentivized
24
A Problem for Data Commons 
Sharing required 
‣ Data piling up (scientists are hoarders)
‣ Bad network infrastructures
‣ Few central analytics platforms
‣ Wild-west file formats/algorithms
‣ No sharing
25
A Problem for Data Commons
Sharing required
Hyperscale analytics will only work if the data is accessible!
Clear issue for Networking 
Every kind of flow imaginable 
‣ Mouse —> elephant flows
‣ Typical problem: firewalls not designed for this
‣ Potentially massive amounts of constant data movement
‣ How are people handling all of this?
26
27 
Use Cases in Life Sciences
28 
Getting Data out of the Laboratory
Laboratories not Integrated 
Usually very little IT infrastructure in labs 
‣ Tons of data-generating equipment going in now
‣ Some instruments generate 15GB of data in 50 hours
‣ Others can generate 64GB/day
‣ Labs are not designed to transmit data; you're lucky if they're wired for Ethernet
29
Getting data out 
OK, so write data over Ethernet to a network drive…
‣ Sounds good: 64GB in 24 hours ~= 6Mb/s
‣ Problem: desktop-class Ethernet adaptors
‣ No error checking, no retries, no MD5, no local buffer
‣ If the network goes, the whole run is lost
30
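The ~6 Mb/s figure is easy to verify. A quick sketch of the sustained rate an instrument's daily output implies (numbers from the slide: 64 GB over 24 hours):

```python
# Sustained bandwidth needed to stream an instrument's output in real time.
def required_mbps(gbytes: float, hours: float) -> float:
    bits = gbytes * 1e9 * 8          # data volume in bits
    return bits / (hours * 3600) / 1e6  # megabits per second

print(f"{required_mbps(64, 24):.1f} Mb/s")  # ~5.9 Mb/s
```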
Getting data out 
Scientists have to get creative, but not in a good way
‣ Data usually ends up going to a local workstation
‣ They buy the cheapest disks they can
‣ Carry the disk somewhere and transfer the data to a workstation
‣ Put the disk in a drawer under a sink (really)
‣ Works if a lab only does one or two runs/month; fails if more
31
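Since the instruments themselves do no checksumming or retries, a minimal post-copy integrity check is one pragmatic stopgap. This is an illustrative sketch under that assumption, not a recommendation over proper transfer tools (rsync, Aspera FASP and the like handle verification for you):

```python
# Minimal integrity check for a manually copied instrument run:
# hash source and destination, then compare digests.
import hashlib
from pathlib import Path

def md5sum(path: Path, chunk: int = 1 << 20) -> str:
    """Stream the file in 1 MB chunks so large runs don't exhaust RAM."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk), b""):
            h.update(block)
    return h.hexdigest()

def verify_copy(src: Path, dst: Path) -> bool:
    return md5sum(src) == md5sum(dst)
```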
Lab data transit is not huge!
Unless you're dealing with a bigger lab with lots of equipment, or a core facility
‣ Fast networking is not required; 100Mb is OK
‣ Just GOOD networking
‣ ….for now (more later)
32
Successful models
Some generalized network models that have successfully solved the problem
‣ Most of it is protocol and topology
‣ Quality of Service (QoS)
‣ Appropriate segmentation (L2 and/or L3)
‣ MPLS paths
‣ Intermediate protocols (e.g., Aspera FASP)
‣ One way or another, guarantee the transfer
33
34 
Storing the Data
Storage: a networking problem 
As storage needs increase, the need to transmit data goes up too
‣ Networking will quickly replace storage as the #1 headache in Bio-IT
‣ Petascale storage is useless without high-performance networking
‣ Most enterprise networks won't cut it
35
Storage: an Org Problem 
Most single laboratories don't have an immediate need for peta-scale storage
‣ BUT labs need to be peta-capable
‣ Can't predict how much or what kind of equipment is coming
‣ Have to build for an indeterminate future
‣ Does it make sense for each lab to buy its own storage?
• Probably not; it doesn't scale well financially
36
Storage: an Org Problem 
Orgs that don't invest will find themselves in a mess of storage support
‣ This is when the storage problem becomes a networking problem
‣ Scientists need to share and collaborate
‣ A lab with 100TB of data needs to share it with offsite or onsite scientists
‣ Also: backups and disaster recovery; data is the new commodity
37
Storage: a networking problem 
Without high-performance networking, petascale anything is useless
‣ Traditional enterprise networks don't cut it
‣ Large single-stream flows get squashed by firewalls and IDS
‣ Centralized: 10's of PBs
‣ Distributed: 100's of PBs
• Likely a lot of duplication
‣ The network becomes key
‣ Cloud use makes this an even bigger problem
38
Storage: options! 
‣ There are a ton of options for storage
• Local: small and large
• Institutional: mostly large
• Distributed institutional: distributed NAS (GPFS over WAN), object store networks, iRODS
• Public clouds: block and object storage
‣ All require high-performance networking
‣ Anything external requires an awesome external connection
39
Storage networking: solutions 
External connections that make petascale storage useful to scientists
‣ OC-192
• Works for large institutions willing to make the investment
• Cost prohibitive: $200-$300k/month
• Start-up cost of at least $1-2M for border equipment
‣ Internet2 10/100Gb hybrid ports
• Much better cost, fewer routing options
• $200k/year
‣ Google Fiber, AT&T GigaPower?
40
Storage networking: solutions 
Internal networking is more critical than external for petascale storage
‣ Infrastructure must be able to support the inevitable 1PB transit
• Disaster recovery
• High availability
• Backup
‣ Need at least 10Gb
• Probably a dedicated 10Gb per >1PB storage facility: 40Gb minimum —> 1Tb backbone
‣ 1Gb will not cut it for that data size
• ~97 days to transmit 1PB at saturation
• 10Gb: ~9.7 days
41
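Those transfer-time estimates check out if you assume roughly 95% effective throughput on the link. The efficiency factor is my assumption, chosen because it reproduces the slide's figures:

```python
# Time to move 1 PB at a given line rate, with ~95% effective goodput.
def days_to_move(petabytes: float, gbps: float, efficiency: float = 0.95) -> float:
    bits = petabytes * 1e15 * 8                       # payload in bits
    return bits / (gbps * 1e9 * efficiency) / 86400   # seconds -> days

print(f"{days_to_move(1, 1):.0f} days at 1 Gb/s")    # ~97 days
print(f"{days_to_move(1, 10):.1f} days at 10 Gb/s")  # ~9.7 days
```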
Storage networking: solutions 
And now, the real problem: topology and logical design
‣ Need a scaling internal topology
‣ One core switch doing all routing and packet transit == bad
‣ More advanced designs are needed
‣ Also: prioritize performance over security
• Nearly impossible for most orgs
‣ Most implemented option: the Science DMZ
42
Science DMZ: not for everything 
Sensitive data have policies and compliance issues; breaking them can be illegal
‣ Need a logical topology flexible enough for security AND performance
‣ Best example: the ISP model
• Collapsed PE/CE on a single router at the edge
• OSPF routing at the edge, fast label switching on dual 100Gb cores
• VRFs for network segments
• MPLS for fast transit and bandwidth guarantees
‣ Side benefit: trusted and untrusted Science DMZs
43
44 
Analyzing the data
Compute == Answers! 
The pinnacle of data transit, and the reason we store data in the first place
‣ High-performance computing: clusters, supercomputers, single servers, powerful workstations, etc.
‣ Mostly a datacenter issue
‣ Unless…
• Storage is not centralized or co-located: data gets duplicated unless you have a killer network
• New methods: the data doesn't move; compute moves to the data
45
Use Case: Get data to cluster 
Assumes the use of a central high-performance storage system
‣ An easier problem within the same datacenter
‣ Large data needs a large pipe
‣ Output of the storage device needs to be fast
• Needs to drive data to/from all compute nodes simultaneously
‣ Large clusters: big problem
• Need parallel filesystems: GPFS, Lustre
46
Internal network esp. important 
Use of local disk in newer clusters
‣ Implementation of storage/analytics systems for Big Data/HDFS
‣ Hadoop, Gluster, local ZFS volumes, virtual disk pools
‣ Now storage can be both internal and external
‣ I/O throughput is critical
47
Application characteristics 
‣ Mostly single-process apps
‣ Some SMP/threaded apps, performance bound by IO and/or RAM
‣ Lots of Perl/Python/R
‣ Hundreds of apps, codes & toolkits
‣ 1TB-2TB RAM "high memory" nodes becoming essential
‣ MPI is rare
• Well-written MPI is even rarer
‣ Few MPI apps actually benefit from expensive low-latency interconnects*
• *Chemistry, modeling and structure work is the exception
48
Life Science very I/O bound 
Genomics especially 
‣ Sync time for data often takes longer than the job itself
‣ Jobs may load up to 300GB into memory for a 1-minute process
‣ Do this thousands of times
‣ Largely due to bad programming and improperly configured systems
49
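A rough illustration of why such workloads are I/O bound: staging 300 GB dwarfs a one-minute compute step at most realistic speeds. The link rates below are illustrative, not from the deck:

```python
# Time to stage a 300 GB working set over links of various speeds,
# versus ~1 minute of actual compute once the data is in memory.
def staging_seconds(gbytes: float, gbps: float) -> float:
    return gbytes * 8 / gbps  # GB -> Gb, divided by Gb/s

for rate in (1, 10, 40):  # illustrative link speeds in Gb/s
    mins = staging_seconds(300, rate) / 60
    print(f"{rate:>2} Gb/s: {mins:.0f} min to load 300 GB")
```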
Cluster networking Solutions 
Interconnects between the nodes, and the cluster's connection to the main network, are critical
‣ Optimal cluster networks: fat-tree and torus topologies
• All layer 2, internally
‣ Most keep oversubscription to 1:4, depending on usage
‣ Top-level switches connect at high speed to the datacenter network
• The newest are multiple 10Gb or 40Gb
• InfiniBand internal networks: Mellanox ConnectX-3, with Ethernet- and IB-capable switch ports
50
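Oversubscription is just aggregate downlink capacity divided by aggregate uplink capacity. The port counts below are illustrative assumptions, chosen to reproduce the 1:4 ratio mentioned above:

```python
# Leaf-switch oversubscription ratio: total downlink vs total uplink bandwidth.
# Illustrative config: 48 x 10 Gb node-facing ports, 3 x 40 Gb uplinks.
def oversubscription(down_ports: int, down_gbps: float,
                     up_ports: int, up_gbps: float) -> float:
    return (down_ports * down_gbps) / (up_ports * up_gbps)

print(f"{oversubscription(48, 10, 3, 40):.0f}:1")  # 4:1
```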
51 
Sharing the data: Collaboration
Collaboration 
Fundamental to science 
‣ Now that data production is reaching petascale, collaboration is getting harder
‣ Projects are getting more complex, more data is being generated, and it takes more people to work on the science
‣ Journal authorships: it's common to see 40+ authors now
‣ Clearly a networking problem at its core
‣ Let's face it, doing this right is expensive!
52
Data Movement & Data Sharing 
The gist of collaborative data sharing in life sciences
‣ Peta-scale data movement needs
• Within an organization
• To/from collaborators
• To/from suppliers
• To/from public data repos
‣ Peta-scale data sharing needs
• Collaborators and partners may be all over the world
53
54 
Most common high-speed network: FedEx
We Have Both Ingest Problems 
Physical & Network 
‣ Significant physical ingest occurring in life science
• Standard media: naked SATA drives shipped via FedEx
‣ Cliche example:
• 30 genomes outsourced means 30 drives will soon be sitting in your mail pile
‣ Organizations often use similar methods to freight data between buildings and among geographic sites
55
Physical Ingest Just Plain Nasty 
‣ Easy to talk about in theory
‣ Seems "easy" to scientists, and even to IT, at first glance
‣ Really, really nasty in practice
• Incredibly time consuming
• Significant operational burden
• Easy to do badly / lose data
56
Collaboration Solutions 
Science DMZ: making it easier to collaborate 
Image source: "The Science DMZ: Introduction & Architecture", ESnet
57
Collaboration Solutions 
Internet2: making data accessible and affordable 
‣ Internet2 is bringing research and education together
• High-speed, clean networking at its core
• Novel and advanced uses of SDN
• Subsidized rates make national high-performance networking affordable
‣ AL2S: quickly establish national networks at high speed
‣ Combined with a Science DMZ: a platform for collaboration
58
Collaboration Solutions 
Push for cloud use: most use Amazon Web Services, with Google Cloud not far behind
‣ Many orgs are pushing for cloud
‣ Unsupported scientists end up using the cloud anyway
‣ It's fast, flexible, and affordable, if done right
‣ A great place for large public datasets to live
‣ Has existing high(ish)-performance networking
‣ If done wrong, it's way more expensive than local compute
‣ Biggest problem: getting data to it!
59
Collaboration Solutions 
Hybrid HPC: also known as hybrid clouds
‣ Relatively new idea
• Small local footprint
• Large, dynamic, scalable, orchestrated public cloud component
‣ DevOps is key to making this work
‣ A high-speed network to the public cloud is required
‣ A software interface layer acts as the mediator between local and public resources
‣ Good for tight budgets, but has to be done right to work
‣ Not many working examples yet
60
Data Commons 
Central storage of knowledge, with compute
‣ Common structure for data storage and indexing (a cloud?)
‣ Associated compute for analytics
‣ Development platform for application development (PaaS)
‣ Make discovery more possible
61
62 
An Example of Progress
USDA: Agricultural Research Service 
A huge government agency trying to make agriculture better in every way
‣ Researchers doing amazing work on how crops and animals can be better farmed
‣ Lower environmental impacts
‣ Better economic returns
‣ How to optimize how agriculture functions in the US
‣ But, there's a problem…
63
They're doing all the things!
They are doing every kind of high-throughput research discussed here, and more, and on a massive scale
64
Just to list a few… 
‣ Genomics (a lot of de novo assembly)
‣ Large-scale imaging
• LIDAR
• Satellite
‣ Simulations
‣ Climatology
‣ Remote sensing
‣ Farm equipment sensors (IoT)
65
Their current network 
66 
• Upgrading to DS3
• Still a lot of T1
• Won't cut it for science
The new initiative 
Build a Science DMZ: SciNet, on an Internet2 AL2S backbone
67
SciNet to feature compute 
Hybrid HPC, Storage, Virtualization environment 
68
69 
What’s the Big Picture?
Problems getting solved 
Utilizing scientific computing to enable discovery 
70 
Laboratory Knowledge
Converged Infrastructure 
71 
The meta issue
‣ Individual technologies and their general successful use are fine
‣ Unless they all work together as a unified solution, it all means nothing
‣ Creating an end-to-end solution based on the use case (the science!): converged infrastructure
[Hyper-]convergence 
It’s what we do 
72 
Laboratory Knowledge
Converged Solution
Convergence 
People matter too 
73 
Laboratory Knowledge 
Converged Solution
Universal Truth 
"The network IS the computer" - John Gage, Sun Microsystems
‣ Convergence is not possible without networking
‣ Also not possible without GOOD networking
‣ Life sciences is learning lessons that physics and astronomy learned 5-10 years ago
‣ The biggest problem is org acceptance of, and investment in, personnel and equipment
‣ Next-gen biomedical research is advancing too quickly: we must invest now
74
75 
end; Thanks! 
slides at http://www.slideshare.net/arieberman

waterlessdyeingtechnolgyusing carbon dioxide chemicalspdfwaterlessdyeingtechnolgyusing carbon dioxide chemicalspdf
waterlessdyeingtechnolgyusing carbon dioxide chemicalspdf
 
Shallowest Oil Discovery of Turkiye.pptx
Shallowest Oil Discovery of Turkiye.pptxShallowest Oil Discovery of Turkiye.pptx
Shallowest Oil Discovery of Turkiye.pptx
 
Direct Seeded Rice - Climate Smart Agriculture
Direct Seeded Rice - Climate Smart AgricultureDirect Seeded Rice - Climate Smart Agriculture
Direct Seeded Rice - Climate Smart Agriculture
 
NuGOweek 2024 Ghent programme overview flyer
NuGOweek 2024 Ghent programme overview flyerNuGOweek 2024 Ghent programme overview flyer
NuGOweek 2024 Ghent programme overview flyer
 
Equivariant neural networks and representation theory
Equivariant neural networks and representation theoryEquivariant neural networks and representation theory
Equivariant neural networks and representation theory
 
Authoring a personal GPT for your research and practice: How we created the Q...
Authoring a personal GPT for your research and practice: How we created the Q...Authoring a personal GPT for your research and practice: How we created the Q...
Authoring a personal GPT for your research and practice: How we created the Q...
 
molar-distalization in orthodontics-seminar.pptx
molar-distalization in orthodontics-seminar.pptxmolar-distalization in orthodontics-seminar.pptx
molar-distalization in orthodontics-seminar.pptx
 
Randomised Optimisation Algorithms in DAPHNE
Randomised Optimisation Algorithms in DAPHNERandomised Optimisation Algorithms in DAPHNE
Randomised Optimisation Algorithms in DAPHNE
 
Eukaryotic Transcription Presentation.pptx
Eukaryotic Transcription Presentation.pptxEukaryotic Transcription Presentation.pptx
Eukaryotic Transcription Presentation.pptx
 
The cost of acquiring information by natural selection
The cost of acquiring information by natural selectionThe cost of acquiring information by natural selection
The cost of acquiring information by natural selection
 
Mending Clothing to Support Sustainable Fashion_CIMaR 2024.pdf
Mending Clothing to Support Sustainable Fashion_CIMaR 2024.pdfMending Clothing to Support Sustainable Fashion_CIMaR 2024.pdf
Mending Clothing to Support Sustainable Fashion_CIMaR 2024.pdf
 
在线办理(salfor毕业证书)索尔福德大学毕业证毕业完成信一模一样
在线办理(salfor毕业证书)索尔福德大学毕业证毕业完成信一模一样在线办理(salfor毕业证书)索尔福德大学毕业证毕业完成信一模一样
在线办理(salfor毕业证书)索尔福德大学毕业证毕业完成信一模一样
 
aziz sancar nobel prize winner: from mardin to nobel
aziz sancar nobel prize winner: from mardin to nobelaziz sancar nobel prize winner: from mardin to nobel
aziz sancar nobel prize winner: from mardin to nobel
 
快速办理(UAM毕业证书)马德里自治大学毕业证学位证一模一样
快速办理(UAM毕业证书)马德里自治大学毕业证学位证一模一样快速办理(UAM毕业证书)马德里自治大学毕业证学位证一模一样
快速办理(UAM毕业证书)马德里自治大学毕业证学位证一模一样
 
ESA/ACT Science Coffee: Diego Blas - Gravitational wave detection with orbita...
ESA/ACT Science Coffee: Diego Blas - Gravitational wave detection with orbita...ESA/ACT Science Coffee: Diego Blas - Gravitational wave detection with orbita...
ESA/ACT Science Coffee: Diego Blas - Gravitational wave detection with orbita...
 
Immersive Learning That Works: Research Grounding and Paths Forward
Immersive Learning That Works: Research Grounding and Paths ForwardImmersive Learning That Works: Research Grounding and Paths Forward
Immersive Learning That Works: Research Grounding and Paths Forward
 
23PH301 - Optics - Optical Lenses.pptx
23PH301 - Optics  -  Optical Lenses.pptx23PH301 - Optics  -  Optical Lenses.pptx
23PH301 - Optics - Optical Lenses.pptx
 
Compexometric titration/Chelatorphy titration/chelating titration
Compexometric titration/Chelatorphy titration/chelating titrationCompexometric titration/Chelatorphy titration/chelating titration
Compexometric titration/Chelatorphy titration/chelating titration
 
Basics of crystallography, crystal systems, classes and different forms
Basics of crystallography, crystal systems, classes and different formsBasics of crystallography, crystal systems, classes and different forms
Basics of crystallography, crystal systems, classes and different forms
 

High-Performance Networking Use Cases in Life Sciences

  • 1. High-Performance Networking Use Cases in Life Sciences 1 2014 Internet2 Technology Exchange; Indianapolis, IN Slides available at http://www.slideshare.net/arieberman
  • 2. Who am I? 2 Director of Government Services, Principal Investigator I’m a fallen scientist - Ph.D. Molecular Biology, Neuroscience, Bioinformatics I’m an HPC/Infrastructure geek - 15 years I help enable science! I’m Ari
  • 3. 3 BioTeam ‣ Independent consulting shop ‣ Staffed by scientists forced to learn IT, SW & HPC to get our own research done ‣ Infrastructure, Informatics, Software Development, Cross-disciplinary Assessments ‣ 11+ years bridging the “gap” between science, IT & high performance computing ‣ Our wide-ranging work is what gets us invited to speak at events like this ...
  • 4. BioTeam What do we do? 4 Laboratory Knowledge
  • 10. BioTeam What do we do? 4 Laboratory Knowledge Converged Solution
  • 12. Our domain coverage Mostly work in Life Sciences • Government • Universities • Big pharma • Biotech • Private institutes • Diagnostic startups • Oil and Gas • Geospatial • Hollywood Animation • Law Enforcement 5
  • 13. 6 OK, so why am I here talking to you?
  • 14. We’ve noticed a few things We have a unique perspective across much of life sciences ‣ Big Data has arrived in Life Sciences ‣ Data is being generated at unprecedented rates ‣ Research and Biomedical Orgs were caught off guard ‣ IT running to catch up, limited budgets ‣ Money is tight, Orgs reluctant to invest in Bio-IT 7 25% of all Life Scientists will require HPC in 2015!
  • 15. 8 Big Picture / Meta Issue ‣ HUGE revolution in the rate at which lab platforms are being redesigned, improved & refreshed ‣ IT not a part of the conversation, running to catch up
  • 16. The Central Problem Is ... Science progressing way faster than IT can refresh/ change ‣ Instrumentation & protocols are changing FAR FASTER than we can refresh our Research-IT & Scientific Computing infrastructure • Bench science is changing month-to-month ... • ... while our IT infrastructure only gets refreshed every 2-7 years ‣ We have to design systems TODAY that can support unknown research requirements & workflows over many years (gulp ...) 9
  • 17. 10 It’s a risky time to be doing Bio-IT 11 What are the drivers in Bio-IT today?
  • 18. 11 Genomics: Next Generation Sequencing (NGS)
  • 19. It’s like the hard drive of life 12 The big deal about DNA ‣ DNA is the template of life ‣ DNA is read --> RNA ‣ RNA is read --> Proteins ‣ Proteins are the functional machinery that make life possible ‣ Understanding the template = understanding basis for disease
  • 20. How does NGS work? Sequencing by Synthesis 13
  • 21. How does NGS work? Reference assembly, variant calling 14
  • 24. The Human Genome Gateway to personalized medicine ‣ 3.2 Gbp ‣ 23 chromosomes ‣ ~21,000 genes ‣ Over 55M known variations 15
  • 26. ...and why NGS is the primary driver 16 The Problem... ‣ Sequencers are now relatively cheap and fast ‣ Some can generate a human genome in 18 hours, for $2,000 ‣ Everyone is doing it ‣ Can generate 3TB of data in that time ‣ First genome took 13 years and $2.7B to complete ‣ Know of 10 organizations: 100,000 genomes over 5 years That’s 14PB of data, folks
  • 27. 17 Other Methodologies Not Far Behind
  • 28. High-throughput Imaging ‣ Robotics screening millions of compounds on live cells 24/7 • Not as much data as genomics in volume, but just as complex • Data volumes in the 10’s TB/week ‣ Confocal Imaging • Scanning 100’s of tissue sections/week, each with 10’s of scans, each with 20-40 layers and multiple fluorescent channels • Data volumes in the 1’s - 10’s TB/week 18
  • 29. High-res medical imaging High-power, dense detector MRI scanners in use 24/7 at large research hospitals ‣ Creating 3D models of brains, comparing large datasets ‣ Using those models to perform detailed neurosurgery with real-time analytic feedback from supercomputer in the OR (cool stuff) ‣ Also generates 10’s of TB/ week 19
  • 30. 20 This is a huge problem ‣ Causing a literal deluge of data, in the 10’s of Petabytes ‣ NIH generating 1.5PB of data/month ‣ First real case in life science where 100Gb networking might really be needed ‣ But, not enough storage or compute
  • 31. 21 And, just to make things more complicated
  • 32. File & Data Types We have them all ‣ Massive text files ‣ Massive binary files ‣ Flatfile ‘databases’ ‣ Spreadsheets everywhere ‣ Directories w/ 6 million files ‣ Large files: 600GB+ ‣ Small files: 30kb or smaller 22
  • 33. Why, giant meta-analyses, of course 23 What to do with all that data? ‣ Typical problem across all of big data: how do you use it? ‣ In life sciences: no real standards of data formats ‣ Data scattered all over, despite push for Data Commons ‣ Not always accessible ‣ Combining the data if you have it all is a real challenge
  • 34. A Compounding Problem... Scientists don’t like to share (really!) ‣ The fear: • if someone sees data before it is published, they might steal it and publish it themselves (getting scooped) ‣ Causes: • Long time to publication • Outdated methods of assigning scientific credit • Not properly incentivized 24
  • 35. A Problem for Data Commons Sharing required ‣ Data piling up (scientists are hoarders) ‣ Bad network infrastructures ‣ Few central analytics platforms ‣ Wild-west file formats/ algorithms ‣ No sharing 25
  • 36. A Problem for Data Commons Sharing required ‣ Hyperscale analytics will only work if the data is accessible! 25
  • 37. Clear issue for Networking Every kind of flow imaginable ‣ Mouse —> Elephant ‣ Typical problem: firewalls not designed for this ‣ Potentially massive amount of constant data movement ‣ How are people handling all of this? 26
  • 38. 27 Use Cases in Life Sciences
  • 39. 28 Getting Data out of the Laboratory
  • 40. Laboratories not Integrated Usually very little IT infrastructure in labs ‣ Tons of data generating equipment going in now ‣ Can generate 15GB of data in 50 hours ‣ Others can generate 64GB/day ‣ Labs are not designed to transmit data, lucky if wired for ethernet 29
  • 43. Getting data out OK, so write data over ethernet to network drive… ‣ Sounds good, 64GB in 24 hours ~= 6Mb/s ‣ Problem: desktop class ethernet adaptors ‣ No error checking, no retries, no MD5, no local buffer ‣ If network goes, whole run is lost 30
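Slide 43's complaint — no error checking, no retries, no MD5 — is fixable even at the lab-workstation level. A minimal sketch (hypothetical tooling, not anything from the talk) of a transfer that streams a checksum and retries on network failure, assuming the destination is a mounted network drive:

```python
import hashlib
import shutil
from pathlib import Path

def sha256_of(path: Path, chunk: int = 1 << 20) -> str:
    """Stream the file through SHA-256 in 1 MB chunks, never holding it all in RAM."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for block in iter(lambda: f.read(chunk), b""):
            h.update(block)
    return h.hexdigest()

def copy_with_verify(src: Path, dst: Path, retries: int = 3) -> bool:
    """Copy src to dst, then re-read the destination and confirm the
    checksums match; retry if the network drops mid-transfer."""
    want = sha256_of(src)
    for _ in range(retries):
        try:
            shutil.copyfile(src, dst)
            if sha256_of(dst) == want:
                return True
        except OSError:
            continue  # transient network failure: try the copy again
    return False
```

With verification like this, a dropped connection costs a retry rather than the whole sequencing run.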
  • 44. Getting data out Scientists have to get creative, but not in a good way ‣ Usually ends up going to local workstation ‣ Go buy the cheapest disks they can ‣ Carry it somewhere, transfer the data to a workstation ‣ Put the disk in a drawer under a sink (really) ‣ Works if lab only does one or two runs/month, fails if more 31
  • 45. Lab data transit not huge! Unless you’re dealing with a bigger lab with lots of equipment, or a core facility ‣ Fast networking not required, 100Mb OK ‣ Just GOOD networking ‣ ….for now (more later) 32
  • 46. Successful models Some generalized network models that have successfully solved the problem ‣ Most of it is protocol and topology ‣ Quality of Service (QoS) ‣ Appropriate segmentation (L2 and/or L3) ‣ MPLS paths ‣ Intermediate protocols (e.g., Aspera FASP) ‣ One way or another, guarantee transfer 33
  • 48. Storage: a networking problem As storage needs increase, the need to transmit it goes up too ‣ Networking will quickly replace storage as #1 headache in Bio-IT ‣ Petascale storage is useless without high-performance networking ‣ Most enterprise networks won’t cut it 35
  • 49. Storage: an Org Problem Most single laboratories don’t have an immediate need for peta-scale storage ‣ BUT - labs need to be peta-capable ‣ Can’t predict how much or what kind of equipment ‣ Have to build for an indeterminate future ‣ Does it make sense for each lab to buy own storage? • Probably not, doesn’t scale well financially 36
  • 50. Storage: an Org Problem Orgs that don’t invest will find themselves in a mess of storage support ‣ This is when the storage problem becomes a networking problem ‣ Scientists need to share, collaborate ‣ Lab with 100TB of data, needs to share with offsite or onsite scientist ‣ Also: backups and disaster recovery: data is the new commodity 37
  • 51. Storage: a networking problem Without high-performance networking, petascale anything is useless ‣ Traditional enterprise networks don’t cut it ‣ Large single-stream flows get squashed through firewalls and IDS ‣ Centralized: 10’s of PBs ‣ Distributed: 100’s of PBs • Likely a lot of duplication ‣ Network becomes key ‣ Cloud use makes this an even bigger problem 38
  • 52. Storage: options! ‣ There are a ton of options for storage • Local: small and large • Institutional: mostly large • Distributed Institutional: distributed NAS (GPFS over WAN), Object store networks, iRODS • Public clouds: block and object storage ‣ All require high-performance networking ‣ Anything external requires awesome external connection 39
  • 53. Storage networking: solutions External connections that make petascale storage useful to scientists ‣ OC-192 • Works for large institutions willing to make investment • Cost prohibitive: $200-$300k/month • Start-up cost of at least $1-2M for border equipment ‣ Internet2 10/100Gb Hybrid ports • Much better cost, fewer routing options • $200k/year ‣ Google Fiber, AT&T Gigapower? 40
  • 54. Storage networking: solutions Internal networking more critical than external for petascale storage ‣ Infrastructure must be able to support the inevitable 1PB transit • Disaster recovery • High-availability • Backup ‣ Need at least 10Gb • Probably dedicated 10Gb per >1PB storage facility: 40Gb min —> 1Tb backbone ‣ 1Gb will not cut it for that data size • ~97 days to transmit at saturation • 10Gb: ~9.7 days 41
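The transfer-time figures on slide 54 fall out of simple arithmetic. A quick sketch (the slide's ~97-day and ~9.7-day numbers depend on the assumed petabyte size and on sustaining full saturation, which real links rarely do):

```python
def transfer_days(bytes_total: float, link_bps: float, efficiency: float = 1.0) -> float:
    """Days to move bytes_total over a link of link_bps (bits per second)
    at the given utilisation (1.0 = full saturation, optimistic in practice)."""
    seconds = (bytes_total * 8) / (link_bps * efficiency)
    return seconds / 86400

PB = 1e15  # decimal petabyte
print(f"1 PB over   1 Gb/s: {transfer_days(PB, 1e9):6.1f} days")   # ~93 days
print(f"1 PB over  10 Gb/s: {transfer_days(PB, 1e10):6.1f} days")  # ~9.3 days
print(f"1 PB over 100 Gb/s: {transfer_days(PB, 1e11):6.1f} days")  # under a day
```

Either way, the conclusion on the slide holds: 1Gb links are hopeless at petascale, and even 10Gb makes a full-PB disaster-recovery copy a week-long event.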
  • 55. Storage networking: solutions And now, the real problem: topology and logical design ‣ Need a scaling internal topology ‣ One core switch doing all routing and packet transit == bad ‣ More advanced designs needed ‣ Also: prioritize performance over security • Nearly impossible for most orgs ‣ Most implemented option: Science DMZ 42
  • 56. Science DMZ: not for everything Sensitive data have policies and compliance issues, breaking them can be illegal ‣ Need logical topology flexible enough for security AND performance ‣ Best example: ISP model • Collapsed PE/CE on single router at edge • OSPF routing at edge, fast label switching on dual 100Gb cores • VRF for network segments • MPLS for fast transit and bandwidth guarantees ‣ Side benefit: trusted and untrusted Science DMZ 43
  • 58. Compute == Answers! The pinnacle of data transit, the reason we store it in the first place ‣ High performance computing: clusters, supercomputers, single servers, powerful workstations, etc. ‣ Mostly a datacenter issue ‣ Unless… • Storage not centralized or co-located: data duplicated unless have a killer network • New methods: data doesn’t move, compute moves to data 45
  • 59. Use Case: Get data to cluster Assumes the use of central high-performance storage system ‣ Easier problem within the same datacenter ‣ Large data needs large pipe ‣ Output of storage device needs to be fast • Needs to drive data to/from all compute nodes simultaneously ‣ Large clusters: big problem • Needs parallel filesystems: GPFS, Lustre 46
  • 60. Internal network esp. important Use of local disk in newer clusters ‣ Implementation of storage/analytics systems for Big Data/HDFS ‣ Hadoop, Gluster, local ZFS volumes, virtual disk pools ‣ Now storage can be both internal and external ‣ I/O throughput is critical 47
  • 61. Application characteristics ‣ Mostly single process apps ‣ Some SMP/threaded apps performance bound by IO and/or RAM ‣ Lots of Perl/Python/R ‣ Hundreds of apps, codes & toolkits ‣ 1TB - 2TB RAM “High Memory” nodes becoming essential ‣ MPI is rare • Well written MPI is even rarer ‣ Few MPI apps actually benefit from expensive low-latency interconnects* • *Chemistry, modeling and structure work is the exception 48
  • 62. Life Science very I/O bound Genomics especially ‣ Sync time for data often takes longer than the job itself ‣ Have to load up to 300GB into memory, for 1min process ‣ Do this thousands of times ‣ Largely due to bad programming and improperly configured systems 49
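The "bad programming" pattern slide 62 describes — loading an entire file into memory for a trivial computation — has a straightforward streaming alternative. A hypothetical illustration using GC content (a standard sequence statistic; this is not code from the talk):

```python
def gc_content_load_all(path: str) -> float:
    """Anti-pattern: slurps the whole file (possibly hundreds of GB)
    into RAM before a computation that takes seconds."""
    seq = open(path).read()
    bases = len(seq.replace("\n", ""))
    return (seq.count("G") + seq.count("C")) / max(bases, 1)

def gc_content_streaming(path: str) -> float:
    """Same answer in constant memory: process the file line by line,
    so I/O and compute overlap and the job fits on any node."""
    gc = total = 0
    with open(path) as f:
        for line in f:
            line = line.strip()
            gc += line.count("G") + line.count("C")
            total += len(line)
    return gc / max(total, 1)
```

The streaming version gives the identical result without the 300GB sync-and-load step dominating the job's wall-clock time.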
  • 63. Cluster networking Solutions Interconnects between the nodes and the cluster’s connection to the main network critical ‣ Optimal cluster networks: fat tree and torus topologies • All layer 2, internally ‣ Most keep subscription to 1:4, depending on usage ‣ Top-level switches connect at high speed to datacenter network • Newest are multiple 10Gb or 40Gb • Infiniband internal networks: Mellanox ConnectX3 - ethernet and IB capable switch ports 50
  • 64. 51 Sharing the data: Collaboration
  • 65. Collaboration Fundamental to science ‣ Now that data production is reaching petascale, collaboration is getting harder ‣ Projects are getting more complex, more data is being generated, takes more people to work on the science ‣ Journal authorships: common to see 40+ authors now ‣ Clearly a networking problem at its core ‣ Let’s face it, doing this right is expensive! 52
  • 66. Data Movement & Data Sharing The gist of collaborative data sharing in life sciences ‣ Peta-scale data movement needs • Within an organization • To/from collaborators • To/from suppliers • To/from public data repos ‣ Peta-scale data sharing needs • Collaborators and partners may be all over the world 53
  • 67. 54 Most common high-speed network: FedEx
  • 68. We Have Both Ingest Problems Physical & Network ‣ Significant physical ingest occurring in Life Science • Standard media: naked SATA drives shipped via FedEx ‣ Cliché example: • 30 genomes outsourced means 30 drives will soon be sitting in your mail pile ‣ Organizations often use similar methods to freight data between buildings and among geographic sites 55
  • 69. Physical Ingest Just Plain Nasty ‣ Easy to talk about in theory ‣ Seems “easy” to scientists and even IT at first glance ‣ Really really nasty in practice • Incredibly time consuming • Significant operational burden • Easy to do badly / lose data 56
  • 70. Collaboration Solutions Science DMZ: making it easier to collaborate Image source: “The Science DMZ: Introduction & Architecture” -- esnet 57
  • 71. Collaboration Solutions Internet2: making data accessible and affordable ‣ Internet2 is bringing Research and Education together • High-speed, clean networking at its core • Novel and advanced uses of SDN • Subsidized rates: national high-performance networking affordable ‣ AL2S: quickly establish national networks at high-speed ‣ Combined with Science DMZ: platform for collaboration 58
  • 72. Collaboration Solutions Push for Cloud use: Most use Amazon Web Services, Google Cloud not far behind ‣ Many Orgs are pushing for cloud ‣ Unsupported scientists end up using cloud ‣ It’s fast, flexible, affordable, if done right ‣ Great place for large public datasets to live ‣ Has existing high(ish)-performance networking ‣ If done wrong, way more expensive than local compute ‣ Biggest problem: getting data to it! 59
  • 73. Collaboration Solutions Hybrid HPC: Also known as hybrid clouds ‣ Relatively new idea • small local footprint • large, dynamic, scalable, orchestrated public cloud component ‣ DevOps is key to making this work ‣ High-speed network to public cloud required ‣ Software interface layer acting as the mediator between local and public resources ‣ Good for tight budgets, has to be done right to work ‣ Not many working examples yet 60
  • 74. Data Commons Central storage of knowledge with compute ‣ Common structure for data storage and indexing (a cloud?) ‣ Associated compute for analytics ‣ Development platform for application development (PaaS) ‣ Make discovery more possible 61
  • 75. 62 An Example of Progress
  • 76. USDA: Agricultural Research Service Huge Government Agency trying to make agriculture better in every way ‣ Researchers doing amazing research on how crops and animals can be better farmed ‣ Lower environmental impacts ‣ Better economic returns ‣ How to optimize how agriculture functions in the US ‣ But, there’s a problem… 63
  • 77. They’re doing all the things! Every kind of high-throughput research talked about they are doing, and more, and on a massive scale 64
  • 78. Just to list a few… ‣ Genomics (a lot of de novo assembly) ‣ Large scale imaging • LIDAR • Satellite ‣ Simulations ‣ Climatology ‣ Remote sensing ‣ Farm equipment sensors (IoT) 65
  • 79. Their current network 66 • Upgrading to DS3 • Still a lot of T1 • Won’t cut it for science
  • 80. The new initiative Build a Science DMZ: SciNet, on an Internet2 AL2S Backbone 67
  • 81. SciNet to feature compute Hybrid HPC, Storage, Virtualization environment 68
  • 82. 69 What’s the Big Picture?
  • 83. Problems getting solved Utilizing scientific computing to enable discovery 70 Laboratory Knowledge
  • 89. Converged Infrastructure 71 The meta issue ‣ Individual technologies and their general successful use are fine ‣ Unless they all work together as a unified solution, it all means nothing ‣ Creating an end-to-end solution based on the use case (science!): converged infrastructure
  • 90. [Hyper-]convergence It’s what we do 72 Laboratory Knowledge
  • 91. [Hyper-]convergence It’s what we do 72 Laboratory Knowledge Converged Solution
  • 93. Convergence People matter too 73 Laboratory Knowledge Converged Solution
  • 94. Universal Truth “The network IS the computer” - John Gage, Sun Microsystems ‣ Convergence is not possible without networking ‣ Also not possible without GOOD networking ‣ Life Sciences is learning lessons learned by physics and astronomy 5-10 years ago ‣ Biggest problem is Org acceptance and investment in personnel and equipment ‣ Next-Gen biomedical research advancing too quickly: must invest now 74
  • 95. 75 end; Thanks! slides at http://www.slideshare.net/arieberman