High-Performance Networking Use Cases in Life Sciences
1 
2014 Internet2 Technology Exchange; Indianapolis, IN 
Slides available at http://www.slideshare.net/arieberman
Who am I? 
2 
Director of Government Services, Principal Investigator
I'm a fallen scientist: Ph.D. in Molecular Biology, Neuroscience, Bioinformatics
I'm an HPC/infrastructure geek: 15 years
I help enable science!
I’m Ari
3 
BioTeam 
‣ Independent consulting shop
‣ Staffed by scientists forced to learn IT, SW & HPC to get our own research done
‣ Infrastructure, Informatics, Software Development, Cross-disciplinary Assessments
‣ 11+ years bridging the "gap" between science, IT & high performance computing
‣ Our wide-ranging work is what gets us invited to speak at events like this ...
BioTeam 
What do we do? 
4 
Laboratory Knowledge
Converged Solution
Our domain coverage 
Mostly work in Life Sciences 
• Government 
• Universities 
• Big pharma 
• Biotech 
• Private institutes 
• Diagnostic startups 
• Oil and Gas 
• Geospatial 
• Hollywood Animation 
• Law Enforcement 
5
6 
OK, so why am I here talking to you?
We’ve noticed a few things 
We have a unique perspective across much of the life sciences
‣ Big Data has arrived in Life Sciences
‣ Data is being generated at unprecedented rates
‣ Research and biomedical orgs were caught off guard
‣ IT is running to catch up on limited budgets
‣ Money is tight; orgs are reluctant to invest in Bio-IT
7 
25% of all Life Scientists will require HPC in 2015!
8 
Big Picture / Meta Issue 
‣ HUGE revolution in the rate at which lab platforms are being redesigned, improved & refreshed
‣ IT is not part of the conversation and is running to catch up
The Central Problem Is ... 
Science is progressing far faster than IT can refresh or change
‣ Instrumentation & protocols are changing FAR FASTER than we can refresh our Research-IT & Scientific Computing infrastructure
• Bench science is changing month-to-month ...
• ... while our IT infrastructure only gets refreshed every 2-7 years
‣ We have to design systems TODAY that can support unknown research requirements & workflows over many years (gulp ...)
9
10 
It’s a risky time to be doing Bio-IT 
11 
What are the drivers in Bio-IT today?
Genomics: Next-Generation Sequencing (NGS)
It’s like the hard drive of life 
12 
The big deal about DNA 
‣ DNA is the template of life
‣ DNA is read --> RNA
‣ RNA is read --> Proteins
‣ Proteins are the functional machinery that makes life possible
‣ Understanding the template = understanding the basis for disease
How does NGS work? 
Sequencing by Synthesis 
13
How does NGS work? 
Reference assembly, variant calling 
14
The Human Genome 
Gateway to personalized medicine 
‣ 3.2 Gbp 
‣ 23 chromosomes 
‣ ~21,000 genes 
‣ Over 55M known variations
15
...and why NGS is the primary driver 
16 
The Problem...
‣ Sequencers are now relatively cheap and fast
‣ Some can generate a human genome in 18 hours, for $2,000
‣ Everyone is doing it
‣ One run can generate 3TB of data in that time
‣ The first genome took 13 years and $2.7B to complete
‣ We know of 10 organizations planning 100,000 genomes over 5 years
That’s 14PB of data, folks
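A back-of-envelope check of that 14 PB figure. The per-genome footprint below is my assumption, not the deck's; retained data per genome varies widely with coverage and file formats.

```python
# Rough check of the 14 PB claim for 100,000 genomes.
# Assumption: ~140 GB of retained data per genome (illustrative).
GB = 10**9
PB = 10**15

genomes = 100_000
bytes_per_genome = 140 * GB

total_pb = genomes * bytes_per_genome / PB
print(f"{total_pb:.0f} PB")  # 14 PB
```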
17 
Other Methodologies Not Far Behind
High-throughput Imaging 
‣ Robotics screening millions of compounds on live cells 24/7
• Not as much data as genomics in volume, but just as complex
• Data volumes in the 10's of TB/week
‣ Confocal Imaging
• Scanning 100's of tissue sections/week, each with 10's of scans, each with 20-40 layers and multiple fluorescent channels
• Data volumes in the 1's-10's of TB/week
18
High-res medical imaging 
High-power, dense-detector MRI scanners in use 24/7 at large research hospitals
‣ Creating 3D models of brains, comparing large datasets
‣ Using those models to perform detailed neurosurgery with real-time analytic feedback from a supercomputer in the OR (cool stuff)
‣ Also generates 10's of TB/week
19
20 
This is a huge problem
‣ Causing a literal deluge of data, in the 10's of petabytes
‣ NIH is generating 1.5PB of data/month
‣ The first real case in life science where 100Gb networking might really be needed
‣ But there's not enough storage or compute
21 
And, just to make things more complicated
File & Data Types 
We have them all 
‣ Massive text files 
‣ Massive binary files 
‣ Flatfile ‘databases’ 
‣ Spreadsheets everywhere 
‣ Directories w/ 6 million files
‣ Large files: 600GB+
‣ Small files: 30KB or smaller
22
Why, giant meta-analyses, of course 
23 
What to do with all that data? 
‣ Typical problem across all of big data: how do you use it?
‣ In life sciences: no real standards for data formats
‣ Data scattered all over, despite the push for Data Commons
‣ Not always accessible
‣ Combining the data, if you have it all, is a real challenge
A Compounding Problem... 
Scientists don’t like to share (really!) 
‣ The fear:
• if someone sees data before it is published, they might steal it and publish it themselves (getting scooped)
‣ Causes:
• Long time to publication
• Outdated methods of assigning scientific credit
• Sharing is not properly incentivized
24
A Problem for Data Commons 
Sharing required 
‣ Data piling up (scientists are hoarders)
‣ Bad network infrastructures
‣ Few central analytics platforms
‣ Wild-west file formats/algorithms
‣ No sharing
25
A Problem for Data Commons
Sharing required
Hyperscale analytics will only work if the data is accessible!
Clear issue for Networking 
Every kind of flow imaginable 
‣ Mouse —> elephant flows
‣ Typical problem: firewalls not designed for this
‣ Potentially massive amounts of constant data movement
‣ How are people handling all of this?
26
27 
Use Cases in Life Sciences
28 
Getting Data out of the Laboratory
Laboratories not Integrated 
Usually very little IT infrastructure in labs 
‣ Tons of data-generating equipment going in now
‣ Some instruments generate 15GB of data in 50 hours
‣ Others can generate 64GB/day
‣ Labs are not designed to transmit data; you're lucky if they're wired for Ethernet
29
Getting data out 
OK, so write data over Ethernet to a network drive…
‣ Sounds good: 64GB in 24 hours ~= 6Mb/s
‣ Problem: desktop-class Ethernet adaptors
‣ No error checking, no retries, no MD5, no local buffer
‣ If the network goes, the whole run is lost
30
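The ~6 Mb/s figure is easy to verify. A quick sketch of the sustained rate an instrument's daily output implies (numbers from the slide: 64 GB over 24 hours):

```python
# Sustained bandwidth needed to stream an instrument's output in real time.
def required_mbps(gbytes: float, hours: float) -> float:
    bits = gbytes * 1e9 * 8          # data volume in bits
    return bits / (hours * 3600) / 1e6  # megabits per second

print(f"{required_mbps(64, 24):.1f} Mb/s")  # ~5.9 Mb/s
```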
Getting data out 
Scientists have to get creative, but not in a good way
‣ Data usually ends up going to a local workstation
‣ They buy the cheapest disks they can
‣ Carry the disk somewhere and transfer the data to a workstation
‣ Put the disk in a drawer under a sink (really)
‣ Works if a lab only does one or two runs/month; fails if more
31
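Since the instruments themselves do no checksumming or retries, a minimal post-copy integrity check is one pragmatic stopgap. This is an illustrative sketch under that assumption, not a recommendation over proper transfer tools (rsync, Aspera FASP and the like handle verification for you):

```python
# Minimal integrity check for a manually copied instrument run:
# hash source and destination, then compare digests.
import hashlib
from pathlib import Path

def md5sum(path: Path, chunk: int = 1 << 20) -> str:
    """Stream the file in 1 MB chunks so large runs don't exhaust RAM."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk), b""):
            h.update(block)
    return h.hexdigest()

def verify_copy(src: Path, dst: Path) -> bool:
    return md5sum(src) == md5sum(dst)
```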
Lab data transit is not huge!
Unless you're dealing with a bigger lab with lots of equipment, or a core facility
‣ Fast networking is not required; 100Mb is OK
‣ Just GOOD networking
‣ ….for now (more later)
32
Successful models
Some generalized network models that have successfully solved the problem
‣ Most of it is protocol and topology
‣ Quality of Service (QoS)
‣ Appropriate segmentation (L2 and/or L3)
‣ MPLS paths
‣ Intermediate protocols (e.g., Aspera FASP)
‣ One way or another, guarantee the transfer
33
34 
Storing the Data
Storage: a networking problem 
As storage needs increase, the need to transmit data goes up too
‣ Networking will quickly replace storage as the #1 headache in Bio-IT
‣ Petascale storage is useless without high-performance networking
‣ Most enterprise networks won't cut it
35
Storage: an Org Problem 
Most single laboratories don't have an immediate need for peta-scale storage
‣ BUT labs need to be peta-capable
‣ Can't predict how much or what kind of equipment is coming
‣ Have to build for an indeterminate future
‣ Does it make sense for each lab to buy its own storage?
• Probably not; it doesn't scale well financially
36
Storage: an Org Problem 
Orgs that don't invest will find themselves in a mess of storage support
‣ This is when the storage problem becomes a networking problem
‣ Scientists need to share and collaborate
‣ A lab with 100TB of data needs to share it with offsite or onsite scientists
‣ Also: backups and disaster recovery; data is the new commodity
37
Storage: a networking problem 
Without high-performance networking, petascale anything is useless
‣ Traditional enterprise networks don't cut it
‣ Large single-stream flows get squashed by firewalls and IDS
‣ Centralized: 10's of PBs
‣ Distributed: 100's of PBs
• Likely a lot of duplication
‣ The network becomes key
‣ Cloud use makes this an even bigger problem
38
Storage: options! 
‣ There are a ton of options for storage
• Local: small and large
• Institutional: mostly large
• Distributed institutional: distributed NAS (GPFS over WAN), object store networks, iRODS
• Public clouds: block and object storage
‣ All require high-performance networking
‣ Anything external requires an awesome external connection
39
Storage networking: solutions 
External connections that make petascale storage useful to scientists
‣ OC-192
• Works for large institutions willing to make the investment
• Cost prohibitive: $200-$300k/month
• Start-up cost of at least $1-2M for border equipment
‣ Internet2 10/100Gb hybrid ports
• Much better cost, fewer routing options
• $200k/year
‣ Google Fiber, AT&T GigaPower?
40
Storage networking: solutions 
Internal networking is more critical than external for petascale storage
‣ Infrastructure must be able to support the inevitable 1PB transit
• Disaster recovery
• High availability
• Backup
‣ Need at least 10Gb
• Probably a dedicated 10Gb per >1PB storage facility: 40Gb minimum —> 1Tb backbone
‣ 1Gb will not cut it for that data size
• ~97 days to transmit 1PB at saturation
• 10Gb: ~9.7 days
41
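Those transfer-time estimates check out if you assume roughly 95% effective throughput on the link. The efficiency factor is my assumption, chosen because it reproduces the slide's figures:

```python
# Time to move 1 PB at a given line rate, with ~95% effective goodput.
def days_to_move(petabytes: float, gbps: float, efficiency: float = 0.95) -> float:
    bits = petabytes * 1e15 * 8                       # payload in bits
    return bits / (gbps * 1e9 * efficiency) / 86400   # seconds -> days

print(f"{days_to_move(1, 1):.0f} days at 1 Gb/s")    # ~97 days
print(f"{days_to_move(1, 10):.1f} days at 10 Gb/s")  # ~9.7 days
```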
Storage networking: solutions 
And now, the real problem: topology and logical design
‣ Need a scaling internal topology
‣ One core switch doing all routing and packet transit == bad
‣ More advanced designs are needed
‣ Also: prioritize performance over security
• Nearly impossible for most orgs
‣ Most implemented option: the Science DMZ
42
Science DMZ: not for everything 
Sensitive data have policies and compliance issues; breaking them can be illegal
‣ Need a logical topology flexible enough for security AND performance
‣ Best example: the ISP model
• Collapsed PE/CE on a single router at the edge
• OSPF routing at the edge, fast label switching on dual 100Gb cores
• VRFs for network segments
• MPLS for fast transit and bandwidth guarantees
‣ Side benefit: trusted and untrusted Science DMZs
43
44 
Analyzing the data
Compute == Answers! 
The pinnacle of data transit, and the reason we store data in the first place
‣ High-performance computing: clusters, supercomputers, single servers, powerful workstations, etc.
‣ Mostly a datacenter issue
‣ Unless…
• Storage is not centralized or co-located: data gets duplicated unless you have a killer network
• New methods: the data doesn't move; compute moves to the data
45
Use Case: Get data to cluster 
Assumes the use of a central high-performance storage system
‣ An easier problem within the same datacenter
‣ Large data needs a large pipe
‣ Output of the storage device needs to be fast
• Needs to drive data to/from all compute nodes simultaneously
‣ Large clusters: big problem
• Need parallel filesystems: GPFS, Lustre
46
Internal network esp. important 
Use of local disk in newer clusters
‣ Implementation of storage/analytics systems for Big Data/HDFS
‣ Hadoop, Gluster, local ZFS volumes, virtual disk pools
‣ Now storage can be both internal and external
‣ I/O throughput is critical
47
Application characteristics 
‣ Mostly single-process apps
‣ Some SMP/threaded apps, performance bound by IO and/or RAM
‣ Lots of Perl/Python/R
‣ Hundreds of apps, codes & toolkits
‣ 1TB-2TB RAM "high memory" nodes becoming essential
‣ MPI is rare
• Well-written MPI is even rarer
‣ Few MPI apps actually benefit from expensive low-latency interconnects*
• *Chemistry, modeling and structure work is the exception
48
Life Science very I/O bound 
Genomics especially 
‣ Sync time for data often takes longer than the job itself
‣ Jobs may load up to 300GB into memory for a 1-minute process
‣ Do this thousands of times
‣ Largely due to bad programming and improperly configured systems
49
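A rough illustration of why such workloads are I/O bound: staging 300 GB dwarfs a one-minute compute step at most realistic speeds. The link rates below are illustrative, not from the deck:

```python
# Time to stage a 300 GB working set over links of various speeds,
# versus ~1 minute of actual compute once the data is in memory.
def staging_seconds(gbytes: float, gbps: float) -> float:
    return gbytes * 8 / gbps  # GB -> Gb, divided by Gb/s

for rate in (1, 10, 40):  # illustrative link speeds in Gb/s
    mins = staging_seconds(300, rate) / 60
    print(f"{rate:>2} Gb/s: {mins:.0f} min to load 300 GB")
```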
Cluster networking Solutions 
Interconnects between the nodes, and the cluster's connection to the main network, are critical
‣ Optimal cluster networks: fat-tree and torus topologies
• All layer 2, internally
‣ Most keep oversubscription to 1:4, depending on usage
‣ Top-level switches connect at high speed to the datacenter network
• The newest are multiple 10Gb or 40Gb
• InfiniBand internal networks: Mellanox ConnectX-3, with Ethernet- and IB-capable switch ports
50
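Oversubscription is just aggregate downlink capacity divided by aggregate uplink capacity. The port counts below are illustrative assumptions, chosen to reproduce the 1:4 ratio mentioned above:

```python
# Leaf-switch oversubscription ratio: total downlink vs total uplink bandwidth.
# Illustrative config: 48 x 10 Gb node-facing ports, 3 x 40 Gb uplinks.
def oversubscription(down_ports: int, down_gbps: float,
                     up_ports: int, up_gbps: float) -> float:
    return (down_ports * down_gbps) / (up_ports * up_gbps)

print(f"{oversubscription(48, 10, 3, 40):.0f}:1")  # 4:1
```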
51 
Sharing the data: Collaboration
Collaboration 
Fundamental to science 
‣ Now that data production is reaching petascale, collaboration is getting harder
‣ Projects are getting more complex, more data is being generated, and it takes more people to work on the science
‣ Journal authorships: it's common to see 40+ authors now
‣ Clearly a networking problem at its core
‣ Let's face it, doing this right is expensive!
52
Data Movement & Data Sharing 
The gist of collaborative data sharing in life sciences
‣ Peta-scale data movement needs
• Within an organization
• To/from collaborators
• To/from suppliers
• To/from public data repos
‣ Peta-scale data sharing needs
• Collaborators and partners may be all over the world
53
54 
Most common high-speed network: FedEx
We Have Both Ingest Problems 
Physical & Network 
‣ Significant physical ingest occurring in life science
• Standard media: naked SATA drives shipped via FedEx
‣ Cliche example:
• 30 genomes outsourced means 30 drives will soon be sitting in your mail pile
‣ Organizations often use similar methods to freight data between buildings and among geographic sites
55
Physical Ingest Just Plain Nasty 
‣ Easy to talk about in theory
‣ Seems "easy" to scientists, and even to IT, at first glance
‣ Really, really nasty in practice
• Incredibly time consuming
• Significant operational burden
• Easy to do badly / lose data
56
Collaboration Solutions 
Science DMZ: making it easier to collaborate 
Image source: "The Science DMZ: Introduction & Architecture", ESnet
57
Collaboration Solutions 
Internet2: making data accessible and affordable 
‣ Internet2 is bringing research and education together
• High-speed, clean networking at its core
• Novel and advanced uses of SDN
• Subsidized rates make national high-performance networking affordable
‣ AL2S: quickly establish national networks at high speed
‣ Combined with a Science DMZ: a platform for collaboration
58
Collaboration Solutions 
Push for cloud use: most use Amazon Web Services, with Google Cloud not far behind
‣ Many orgs are pushing for cloud
‣ Unsupported scientists end up using the cloud anyway
‣ It's fast, flexible, and affordable, if done right
‣ A great place for large public datasets to live
‣ Has existing high(ish)-performance networking
‣ If done wrong, it's way more expensive than local compute
‣ Biggest problem: getting data to it!
59
Collaboration Solutions 
Hybrid HPC: also known as hybrid clouds
‣ Relatively new idea
• Small local footprint
• Large, dynamic, scalable, orchestrated public cloud component
‣ DevOps is key to making this work
‣ A high-speed network to the public cloud is required
‣ A software interface layer acts as the mediator between local and public resources
‣ Good for tight budgets, but has to be done right to work
‣ Not many working examples yet
60
Data Commons 
Central storage of knowledge, with compute
‣ Common structure for data storage and indexing (a cloud?)
‣ Associated compute for analytics
‣ Development platform for application development (PaaS)
‣ Make discovery more possible
61
62 
An Example of Progress
USDA: Agricultural Research Service 
A huge government agency trying to make agriculture better in every way
‣ Researchers doing amazing work on how crops and animals can be better farmed
‣ Lower environmental impacts
‣ Better economic returns
‣ How to optimize how agriculture functions in the US
‣ But, there's a problem…
63
They're doing all the things!
They are doing every kind of high-throughput research discussed here, and more, and on a massive scale
64
Just to list a few… 
‣ Genomics (a lot of de novo assembly)
‣ Large-scale imaging
• LIDAR
• Satellite
‣ Simulations
‣ Climatology
‣ Remote sensing
‣ Farm equipment sensors (IoT)
65
Their current network 
66 
• Upgrading to DS3
• Still a lot of T1
• Won't cut it for science
The new initiative 
Build a Science DMZ: SciNet, on an Internet2 AL2S backbone
67
SciNet to feature compute 
Hybrid HPC, Storage, Virtualization environment 
68
69 
What’s the Big Picture?
Problems getting solved 
Utilizing scientific computing to enable discovery 
70 
Laboratory Knowledge
Converged Infrastructure 
71 
The meta issue
‣ Individual technologies and their general successful use are fine
‣ Unless they all work together as a unified solution, it all means nothing
‣ Creating an end-to-end solution based on the use case (the science!): converged infrastructure
[Hyper-]convergence 
It’s what we do 
72 
Laboratory Knowledge
Converged Solution
Convergence 
People matter too 
73 
Laboratory Knowledge 
Converged Solution
Universal Truth 
"The network IS the computer" - John Gage, Sun Microsystems
‣ Convergence is not possible without networking
‣ Also not possible without GOOD networking
‣ Life sciences is learning lessons that physics and astronomy learned 5-10 years ago
‣ The biggest problem is org acceptance of, and investment in, personnel and equipment
‣ Next-gen biomedical research is advancing too quickly: we must invest now
74
75 
end; Thanks! 
slides at http://www.slideshare.net/arieberman

waterlessdyeingtechnolgyusing carbon dioxide chemicalspdfwaterlessdyeingtechnolgyusing carbon dioxide chemicalspdf
waterlessdyeingtechnolgyusing carbon dioxide chemicalspdf
 
Shallowest Oil Discovery of Turkiye.pptx
Shallowest Oil Discovery of Turkiye.pptxShallowest Oil Discovery of Turkiye.pptx
Shallowest Oil Discovery of Turkiye.pptx
 
Direct Seeded Rice - Climate Smart Agriculture
Direct Seeded Rice - Climate Smart AgricultureDirect Seeded Rice - Climate Smart Agriculture
Direct Seeded Rice - Climate Smart Agriculture
 
NuGOweek 2024 Ghent programme overview flyer
NuGOweek 2024 Ghent programme overview flyerNuGOweek 2024 Ghent programme overview flyer
NuGOweek 2024 Ghent programme overview flyer
 
Equivariant neural networks and representation theory
Equivariant neural networks and representation theoryEquivariant neural networks and representation theory
Equivariant neural networks and representation theory
 
Authoring a personal GPT for your research and practice: How we created the Q...
Authoring a personal GPT for your research and practice: How we created the Q...Authoring a personal GPT for your research and practice: How we created the Q...
Authoring a personal GPT for your research and practice: How we created the Q...
 
molar-distalization in orthodontics-seminar.pptx
molar-distalization in orthodontics-seminar.pptxmolar-distalization in orthodontics-seminar.pptx
molar-distalization in orthodontics-seminar.pptx
 
Randomised Optimisation Algorithms in DAPHNE
Randomised Optimisation Algorithms in DAPHNERandomised Optimisation Algorithms in DAPHNE
Randomised Optimisation Algorithms in DAPHNE
 
Eukaryotic Transcription Presentation.pptx
Eukaryotic Transcription Presentation.pptxEukaryotic Transcription Presentation.pptx
Eukaryotic Transcription Presentation.pptx
 
The cost of acquiring information by natural selection
The cost of acquiring information by natural selectionThe cost of acquiring information by natural selection
The cost of acquiring information by natural selection
 
Mending Clothing to Support Sustainable Fashion_CIMaR 2024.pdf
Mending Clothing to Support Sustainable Fashion_CIMaR 2024.pdfMending Clothing to Support Sustainable Fashion_CIMaR 2024.pdf
Mending Clothing to Support Sustainable Fashion_CIMaR 2024.pdf
 
在线办理(salfor毕业证书)索尔福德大学毕业证毕业完成信一模一样
在线办理(salfor毕业证书)索尔福德大学毕业证毕业完成信一模一样在线办理(salfor毕业证书)索尔福德大学毕业证毕业完成信一模一样
在线办理(salfor毕业证书)索尔福德大学毕业证毕业完成信一模一样
 
aziz sancar nobel prize winner: from mardin to nobel
aziz sancar nobel prize winner: from mardin to nobelaziz sancar nobel prize winner: from mardin to nobel
aziz sancar nobel prize winner: from mardin to nobel
 
快速办理(UAM毕业证书)马德里自治大学毕业证学位证一模一样
快速办理(UAM毕业证书)马德里自治大学毕业证学位证一模一样快速办理(UAM毕业证书)马德里自治大学毕业证学位证一模一样
快速办理(UAM毕业证书)马德里自治大学毕业证学位证一模一样
 
ESA/ACT Science Coffee: Diego Blas - Gravitational wave detection with orbita...
ESA/ACT Science Coffee: Diego Blas - Gravitational wave detection with orbita...ESA/ACT Science Coffee: Diego Blas - Gravitational wave detection with orbita...
ESA/ACT Science Coffee: Diego Blas - Gravitational wave detection with orbita...
 
Immersive Learning That Works: Research Grounding and Paths Forward
Immersive Learning That Works: Research Grounding and Paths ForwardImmersive Learning That Works: Research Grounding and Paths Forward
Immersive Learning That Works: Research Grounding and Paths Forward
 
23PH301 - Optics - Optical Lenses.pptx
23PH301 - Optics  -  Optical Lenses.pptx23PH301 - Optics  -  Optical Lenses.pptx
23PH301 - Optics - Optical Lenses.pptx
 
Compexometric titration/Chelatorphy titration/chelating titration
Compexometric titration/Chelatorphy titration/chelating titrationCompexometric titration/Chelatorphy titration/chelating titration
Compexometric titration/Chelatorphy titration/chelating titration
 
Basics of crystallography, crystal systems, classes and different forms
Basics of crystallography, crystal systems, classes and different formsBasics of crystallography, crystal systems, classes and different forms
Basics of crystallography, crystal systems, classes and different forms
 

High-Performance Networking Use Cases in Life Sciences

  • 1. High-Performance Networking Use Cases in Life Sciences 1 2014 Internet2 Technology Exchange; Indianapolis, IN Slides available at http://www.slideshare.net/arieberman
  • 2. Who am I? 2 Director of Government Services, Principal Investigator I’m a fallen scientist - Ph.D. Molecular Biology, Neuroscience, Bioinformatics I’m an HPC/Infrastructure geek - 15 years I help enable science! I’m Ari
  • 3. 3 BioTeam ‣ Independent consulting shop ‣ Staffed by scientists forced to learn IT, SW & HPC to get our own research done ‣ Infrastructure, Informatics, Software Development, Cross-disciplinary Assessments ‣ 11+ years bridging the “gap” between science, IT & high performance computing ‣ Our wide-ranging work is what gets us invited to speak at events like this ...
  • 4. BioTeam What do we do? 4 Laboratory Knowledge
  • 10. BioTeam What do we do? 4 Laboratory Knowledge Converged Solution
  • 12. Our domain coverage Mostly work in Life Sciences • Government • Universities • Big pharma • Biotech • Private institutes • Diagnostic startups • Oil and Gas • Geospatial • Hollywood Animation • Law Enforcement 5
  • 13. 6 OK, so why am I here talking to you?
  • 14. We’ve noticed a few things We have a unique perspective across much of life sciences ‣ Big Data has arrived in Life Sciences ‣ Data is being generated at unprecedented rates ‣ Research and Biomedical Orgs were caught off guard ‣ IT running to catch up, limited budgets ‣ Money is tight, Orgs reluctant to invest in Bio-IT 7 25% of all Life Scientists will require HPC in 2015!
  • 15. 8 Big Picture / Meta Issue ‣ HUGE revolution in the rate at which lab platforms are being redesigned, improved & refreshed ‣ IT not a part of the conversation, running to catch up
  • 16. The Central Problem Is ... Science progressing way faster than IT can refresh/ change ‣ Instrumentation & protocols are changing FAR FASTER than we can refresh our Research-IT & Scientific Computing infrastructure • Bench science is changing month-to-month ... • ... while our IT infrastructure only gets refreshed every 2-7 years ‣ We have to design systems TODAY that can support unknown research requirements & workflows over many years (gulp ...) 9
  • 17. 10 It’s a risky time to be doing Bio-IT 11 What are the drivers in Bio-IT today?
  • 18. 11 Genomics: Next Generation Sequencing (NGS)
  • 19. It’s like the hard drive of life 12 The big deal about DNA ‣ DNA is the template of life ‣ DNA is read --> RNA ‣ RNA is read --> Proteins ‣ Proteins are the functional machinery that make life possible ‣ Understanding the template = understanding basis for disease
  • 20. How does NGS work? Sequencing by Synthesis 13
  • 21. How does NGS work? Reference assembly, variant calling 14
  • 24. The Human Genome Gateway to personalized medicine ‣ 3.2 Gbp ‣ 23 chromosomes ‣ ~21,000 genes ‣ Over 55M known variations 15
  • 26. ...and why NGS is the primary driver 16 The Problem... ‣ Sequencers are now relatively cheap and fast ‣ Some can generate a human genome in 18 hours, for $2,000 ‣ Everyone is doing it ‣ Can generate 3TB of data in that time ‣ First genome took 13 years and $2.7B to complete ‣ Know of 10 organizations: 100,000 genomes over 5 years That’s 14PB of data, folks
  • 27. 17 Other Methodologies Not Far Behind
  • 28. High-throughput Imaging ‣ Robotics screening millions of compounds on live cells 24/7 • Not as much data as genomics in volume, but just as complex • Data volumes in the 10’s TB/week ‣ Confocal Imaging • Scanning 100’s of tissue sections/week, each with 10’s of scans, each with 20-40 layers and multiple fluorescent channels • Data volumes in the 1’s - 10’s TB/week 18
  • 29. High-res medical imaging High-power, dense detector MRI scanners in use 24/7 at large research hospitals ‣ Creating 3D models of brains, comparing large datasets ‣ Using those models to perform detailed neurosurgery with real-time analytic feedback from supercomputer in the OR (cool stuff) ‣ Also generates 10’s of TB/ week 19
  • 30. 20 This is a huge problem ‣ Causing a literal deluge of data, in the 10’s of Petabytes ‣ NIH generating 1.5PB of data/month ‣ First real case in life science where 100Gb networking might really be needed ‣ But, not enough storage or compute
  • 31. 21 And, just to make things more complicated
  • 32. File & Data Types We have them all ‣ Massive text files ‣ Massive binary files ‣ Flatfile ‘databases’ ‣ Spreadsheets everywhere ‣ Directories w/ 6 million files ‣ Large files: 600GB+ ‣ Small files: 30kb or smaller 22
  • 33. Why, giant meta-analyses, of course 23 What to do with all that data? ‣ Typical problem across all of big data: how do you use it? ‣ In life sciences: no real standards of data formats ‣ Data scattered all over, despite push for Data Commons ‣ Not always accessible ‣ Combining the data if you have it all is a real challenge
  • 34. A Compounding Problem... Scientists don’t like to share (really!) ‣ The fear: • if someone sees data before it is published, they might steal it and publish it themselves (getting scooped) ‣ Causes: • Long time to publication • Outdated methods of assigning scientific credit • Not properly incentivized 24
  • 35. A Problem for Data Commons Sharing required ‣ Data piling up (scientists are hoarders) ‣ Bad network infrastructures ‣ Few central analytics platforms ‣ Wild-west file formats/ algorithms ‣ No sharing 25
  • 36. A Problem for Data Commons Sharing required ‣ Hyperscale analytics will only work if the data is accessible! 25
  • 37. Clear issue for Networking Every kind of flow imaginable ‣ Mouse —> Elephant ‣ Typical problem: firewalls not designed for this ‣ Potentially massive amount of constant data movement ‣ How are people handling all of this? 26
  • 38. 27 Use Cases in Life Sciences
  • 39. 28 Getting Data out of the Laboratory
  • 40. Laboratories not Integrated Usually very little IT infrastructure in labs ‣ Tons of data generating equipment going in now ‣ Can generate 15GB of data in 50 hours ‣ Others can generate 64GB/day ‣ Labs are not designed to transmit data, lucky if wired for ethernet 29
  • 43. Getting data out OK, so write data over ethernet to network drive… ‣ Sounds good, 64GB in 24 hours ~= 6Mb/s ‣ Problem: desktop class ethernet adaptors ‣ No error checking, no retries, no MD5, no local buffer ‣ If network goes, whole run is lost 30
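Slide 43's complaint — no error checking, no retries, no MD5 — is fixable even at the lab-workstation level. A minimal sketch (hypothetical tooling, not anything from the talk) of a transfer that streams a checksum and retries on network failure, assuming the destination is a mounted network drive:

```python
import hashlib
import shutil
from pathlib import Path

def sha256_of(path: Path, chunk: int = 1 << 20) -> str:
    """Stream the file through SHA-256 in 1 MB chunks, never holding it all in RAM."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for block in iter(lambda: f.read(chunk), b""):
            h.update(block)
    return h.hexdigest()

def copy_with_verify(src: Path, dst: Path, retries: int = 3) -> bool:
    """Copy src to dst, then re-read the destination and confirm the
    checksums match; retry if the network drops mid-transfer."""
    want = sha256_of(src)
    for _ in range(retries):
        try:
            shutil.copyfile(src, dst)
            if sha256_of(dst) == want:
                return True
        except OSError:
            continue  # transient network failure: try the copy again
    return False
```

With verification like this, a dropped connection costs a retry rather than the whole sequencing run.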
  • 44. Getting data out Scientists have to get creative, but not in a good way ‣ Usually ends up going to local workstation ‣ Go buy the cheapest disks they can ‣ Carry it somewhere, transfer the data to a workstation ‣ Put the disk in a drawer under a sink (really) ‣ Works if lab only does one or two runs/month, fails if more 31
  • 45. Lab data transit not huge! Unless you’re dealing with a bigger lab with lots of equipment, or a core facility ‣ Fast networking not required, 100Mb OK ‣ Just GOOD networking ‣ ….for now (more later) 32
  • 46. Successful models Some generalized network models that have successfully solved the problem ‣ Most of it is protocol and topology ‣ Quality of Service (QoS) ‣ Appropriate segmentation (L2 and/or L3) ‣ MPLS paths ‣ Intermediate protocols (e.g., Aspera FASP) ‣ One way or another, guarantee transfer 33
  • 48. Storage: a networking problem As storage needs increase, the need to transmit it goes up too ‣ Networking will quickly replace storage as #1 headache in Bio-IT ‣ Petascale storage is useless without high-performance networking ‣ Most enterprise networks won’t cut it 35
  • 49. Storage: an Org Problem Most single laboratories don’t have an immediate need for peta-scale storage ‣ BUT - labs need to be peta-capable ‣ Can’t predict how much or what kind of equipment ‣ Have to build for an indeterminate future ‣ Does it make sense for each lab to buy own storage? • Probably not, doesn’t scale well financially 36
  • 50. Storage: an Org Problem Orgs that don’t invest will find themselves in a mess of storage support ‣ This is when the storage problem becomes a networking problem ‣ Scientists need to share, collaborate ‣ Lab with 100TB of data, needs to share with offsite or onsite scientist ‣ Also: backups and disaster recovery: data is the new commodity 37
  • 51. Storage: a networking problem Without high-performance networking, petascale anything is useless ‣ Traditional enterprise networks don’t cut it ‣ Large single-stream flows get squashed through firewalls and IDS ‣ Centralized: 10’s of PBs ‣ Distributed: 100’s of PBs • Likely a lot of duplication ‣ Network becomes key ‣ Cloud use makes this an even bigger problem 38
  • 52. Storage: options! ‣ There are a ton of options for storage • Local: small and large • Institutional: mostly large • Distributed Institutional: distributed NAS (GPFS over WAN), Object store networks, iRODS • Public clouds: block and object storage ‣ All require high-performance networking ‣ Anything external requires awesome external connection 39
  • 53. Storage networking: solutions External connections that make petascale storage useful to scientists ‣ OC-192 • Works for large institutions willing to make investment • Cost prohibitive: $200-$300k/month • Start-up cost of at least $1-2M for border equipment ‣ Internet2 10/100Gb Hybrid ports • Much better cost, fewer routing options • $200k/year ‣ Google Fiber, AT&T Gigapower? 40
  • 54. Storage networking: solutions Internal networking more critical than external for petascale storage ‣ Infrastructure must be able to support the inevitable 1PB transit • Disaster recovery • High-availability • Backup ‣ Need at least 10Gb • Probably dedicated 10Gb per >1PB storage facility: 40Gb min —> 1Tb backbone ‣ 1Gb will not cut it for that data size • ~97 days to transmit at saturation • 10Gb: ~9.7 days 41
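The transfer-time figures on slide 54 fall out of simple arithmetic. A quick sketch (the slide's ~97-day and ~9.7-day numbers depend on the assumed petabyte size and on sustaining full saturation, which real links rarely do):

```python
def transfer_days(bytes_total: float, link_bps: float, efficiency: float = 1.0) -> float:
    """Days to move bytes_total over a link of link_bps (bits per second)
    at the given utilisation (1.0 = full saturation, optimistic in practice)."""
    seconds = (bytes_total * 8) / (link_bps * efficiency)
    return seconds / 86400

PB = 1e15  # decimal petabyte
print(f"1 PB over   1 Gb/s: {transfer_days(PB, 1e9):6.1f} days")   # ~93 days
print(f"1 PB over  10 Gb/s: {transfer_days(PB, 1e10):6.1f} days")  # ~9.3 days
print(f"1 PB over 100 Gb/s: {transfer_days(PB, 1e11):6.1f} days")  # under a day
```

Either way, the conclusion on the slide holds: 1Gb links are hopeless at petascale, and even 10Gb makes a full-PB disaster-recovery copy a week-long event.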
  • 55. Storage networking: solutions And now, the real problem: topology and logical design ‣ Need a scaling internal topology ‣ One core switch doing all routing and packet transit == bad ‣ More advanced designs needed ‣ Also: prioritize performance over security • Nearly impossible for most orgs ‣ Most implemented option: Science DMZ 42
  • 56. Science DMZ: not for everything Sensitive data have policies and compliance issues, breaking them can be illegal ‣ Need logical topology flexible enough for security AND performance ‣ Best example: ISP model • Collapsed PE/CE on single router at edge • OSPF routing at edge, fast label switching on dual 100Gb cores • VRF for network segments • MPLS for fast transit and bandwidth guarantees ‣ Side benefit: trusted and untrusted Science DMZ 43
  • 58. Compute == Answers! The pinnacle of data transit, the reason we store it in the first place ‣ High performance computing: clusters, supercomputers, single servers, powerful workstations, etc. ‣ Mostly a datacenter issue ‣ Unless… • Storage not centralized or co-located: data duplicated unless have a killer network • New methods: data doesn’t move, compute moves to data 45
  • 59. Use Case: Get data to cluster Assumes the use of central high-performance storage system ‣ Easier problem within the same datacenter ‣ Large data needs large pipe ‣ Output of storage device needs to be fast • Needs to drive data to/from all compute nodes simultaneously ‣ Large clusters: big problem • Needs parallel filesystems: GPFS, Lustre 46
  • 60. Internal network esp. important Use of local disk in newer clusters ‣ Implementation of storage/analytics systems for Big Data/HDFS ‣ Hadoop, Gluster, local ZFS volumes, virtual disk pools ‣ Now storage can be both internal and external ‣ I/O throughput is critical 47
  • 61. Application characteristics ‣ Mostly single process apps ‣ Some SMP/threaded apps performance bound by IO and/or RAM ‣ Lots of Perl/Python/R ‣ Hundreds of apps, codes & toolkits ‣ 1TB - 2TB RAM “High Memory” nodes becoming essential ‣ MPI is rare • Well written MPI is even rarer ‣ Few MPI apps actually benefit from expensive low-latency interconnects* • *Chemistry, modeling and structure work is the exception 48
  • 62. Life Science very I/O bound Genomics especially ‣ Sync time for data often takes longer than the job itself ‣ Have to load up to 300GB into memory, for 1min process ‣ Do this thousands of times ‣ Largely due to bad programming and improperly configured systems 49
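The "bad programming" pattern slide 62 describes — loading an entire file into memory for a trivial computation — has a straightforward streaming alternative. A hypothetical illustration using GC content (a standard sequence statistic; this is not code from the talk):

```python
def gc_content_load_all(path: str) -> float:
    """Anti-pattern: slurps the whole file (possibly hundreds of GB)
    into RAM before a computation that takes seconds."""
    seq = open(path).read()
    bases = len(seq.replace("\n", ""))
    return (seq.count("G") + seq.count("C")) / max(bases, 1)

def gc_content_streaming(path: str) -> float:
    """Same answer in constant memory: process the file line by line,
    so I/O and compute overlap and the job fits on any node."""
    gc = total = 0
    with open(path) as f:
        for line in f:
            line = line.strip()
            gc += line.count("G") + line.count("C")
            total += len(line)
    return gc / max(total, 1)
```

The streaming version gives the identical result without the 300GB sync-and-load step dominating the job's wall-clock time.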
  • 63. Cluster networking Solutions Interconnects between the nodes and the cluster’s connection to the main network critical ‣ Optimal cluster networks: fat tree and torus topologies • All layer 2, internally ‣ Most keep subscription to 1:4, depending on usage ‣ Top-level switches connect at high speed to datacenter network • Newest are multiple 10Gb or 40Gb • Infiniband internal networks: Mellanox ConnectX3 - ethernet and IB capable switch ports 50
  • 64. 51 Sharing the data: Collaboration
  • 65. Collaboration Fundamental to science ‣ Now that data production is reaching petascale, collaboration is getting harder ‣ Projects are getting more complex, more data is being generated, takes more people to work on the science ‣ Journal authorships: common to see 40+ authors now ‣ Clearly a networking problem at its core ‣ Let’s face it, doing this right is expensive! 52
  • 66. Data Movement & Data Sharing The gist of collaborative data sharing in life sciences ‣ Peta-scale data movement needs • Within an organization • To/from collaborators • To/from suppliers • To/from public data repos ‣ Peta-scale data sharing needs • Collaborators and partners may be all over the world 53
  • 67. 54 Most common high-speed network: FedEx
  • 68. We Have Both Ingest Problems Physical & Network ‣ Significant physical ingest occurring in Life Science • Standard media: naked SATA drives shipped via FedEx ‣ Cliché example: • 30 genomes outsourced means 30 drives will soon be sitting in your mail pile ‣ Organizations often use similar methods to freight data between buildings and among geographic sites 55
  • 69. Physical Ingest Just Plain Nasty ‣ Easy to talk about in theory ‣ Seems “easy” to scientists and even IT at first glance ‣ Really really nasty in practice • Incredibly time consuming • Significant operational burden • Easy to do badly / lose data 56
  • 70. Collaboration Solutions Science DMZ: making it easier to collaborate Image source: “The Science DMZ: Introduction & Architecture” -- esnet 57
  • 71. Collaboration Solutions Internet2: making data accessible and affordable ‣ Internet2 is bringing Research and Education together • High-speed, clean networking at its core • Novel and advanced uses of SDN • Subsidized rates: national high-performance networking affordable ‣ AL2S: quickly establish national networks at high-speed ‣ Combined with Science DMZ: platform for collaboration 58
  • 72. Collaboration Solutions Push for Cloud use: Most use Amazon Web Services, Google Cloud not far behind ‣ Many Orgs are pushing for cloud ‣ Unsupported scientists end up using cloud ‣ It’s fast, flexible, affordable, if done right ‣ Great place for large public datasets to live ‣ Has existing high(ish)-performance networking ‣ If done wrong, way more expensive than local compute ‣ Biggest problem: getting data to it! 59
  • 73. Collaboration Solutions Hybrid HPC: Also known as hybrid clouds ‣ Relatively new idea • small local footprint • large, dynamic, scalable, orchestrated public cloud component ‣ DevOps is key to making this work ‣ High-speed network to public cloud required ‣ Software interface layer acting as the mediator between local and public resources ‣ Good for tight budgets, has to be done right to work ‣ Not many working examples yet 60
  • 74. Data Commons Central storage of knowledge with compute ‣ Common structure for data storage and indexing (a cloud?) ‣ Associated compute for analytics ‣ Development platform for application development (PaaS) ‣ Make discovery more possible 61
  • 75. 62 An Example of Progress
  • 76. USDA: Agricultural Research Service Huge Government Agency trying to make agriculture better in every way ‣ Researchers doing amazing research on how crops and animals can be better farmed ‣ Lower environmental impacts ‣ Better economic returns ‣ How to optimize how agriculture functions in the US ‣ But, there’s a problem… 63
  • 77. They’re doing all the things! Every kind of high-throughput research talked about they are doing, and more, and on a massive scale 64
  • 78. Just to list a few… ‣ Genomics (a lot of de novo assembly) ‣ Large scale imaging • LIDAR • Satellite ‣ Simulations ‣ Climatology ‣ Remote sensing ‣ Farm equipment sensors (IoT) 65
  • 79. Their current network 66 • Upgrading to DS3 • Still a lot of T1 • Won’t cut it for science
  • 80. The new initiative Build a Science DMZ: SciNet, on an Internet2 AL2S Backbone 67
  • 81. SciNet to feature compute Hybrid HPC, Storage, Virtualization environment 68
  • 82. 69 What’s the Big Picture?
  • 83. Problems getting solved Utilizing scientific computing to enable discovery 70 Laboratory Knowledge
  • 89. Converged Infrastructure 71 The meta issue ‣ Individual technologies and their general successful use are fine ‣ Unless they all work together as a unified solution, it all means nothing ‣ Creating an end-to-end solution based on the use case (science!): converged infrastructure
  • 90. [Hyper-]convergence It’s what we do 72 Laboratory Knowledge
  • 91. [Hyper-]convergence It’s what we do 72 Laboratory Knowledge Converged Solution
  • 93. Convergence People matter too 73 Laboratory Knowledge Converged Solution
  • 94. Universal Truth “The network IS the computer” - John Gage, Sun Microsystems ‣ Convergence is not possible without networking ‣ Also not possible without GOOD networking ‣ Life Sciences is learning lessons learned by physics and astronomy 5-10 years ago ‣ Biggest problem is Org acceptance and investment in personnel and equipment ‣ Next-Gen biomedical research advancing too quickly: must invest now 74
  • 95. 75 end; Thanks! slides at http://www.slideshare.net/arieberman