This was a 30 min talk intended as one of the opening/overview presentations before a full-day deep dive into ScienceDMZ design patterns and architectures.
Direct downloads are not enabled. Contact me directly (chris@bioteam.net) if, for some odd reason, you want a copy of this slide deck!
This is a custom "Bio IT trends/problems" deck that I did for a general but highly technical audience at the 2014 Internet2 Technology Exchange conference.
Download of the raw PPT is disabled; contact me at chris@bioteam.net if a direct copy or PDF of the presentation would be useful.
This is a very short slide deck I did for a 10-minute slot on a http://pistoiaalliance.org/ webinar. The slides do not fully cover what I intend to talk about, so if the webinar is recorded and available afterwards I'll update this description with the recording URL.
PDF copy of the slides available upon request ("chris@bioteam.net")
Talk slides from my annual address at the Bio-IT World Expo & Conference where I cover trends, best practices and emerging pain points for life science focused HPC, scientific computing and "research IT"
Email "chris@bioteam.net" if you want a PDF copy of these slides. I've disabled the raw powerpoint download option on slideshare.
BioIT World 2016 - HPC Trends from the Trenches - Chris Dagdigian
As presented at BioIT World 2016. In one of the more popular presentations of the Expo, Chris delivers a candid assessment of the best, the worthwhile, and the most overhyped information technologies (IT) for life sciences. He’ll cover what has changed (or not) in the past year around infrastructure, storage, computing, and networks. This presentation will help you understand IT to build and support data intensive science.
Video link from the presentation: biote.am/bs
[Note: email chris@bioteam.net if you would like a PDF copy of this presentation]
Taming Big Science Data Growth with Converged Infrastructure - The BioTeam Inc.
2014 BioIT World Expo presentation
"Many of the largest NGS sites have identified IO bottlenecks as their number one concern in growing their infrastructure to support current and projected data growth rates. In this talk, Aaron D. Gardner, Senior Scientific Consultant, BioTeam, Inc., will share real-world strategies and implementation details for building converged storage infrastructure to support the performance, scalability and collaborative requirements of today's NGS workflows."
For a copy of this presentation please email: chris@bioteam.net
Mapping Life Science Informatics to the Cloud - Chris Dagdigian
Infrastructure cloud platforms such as those offered by Amazon Web Services are not designed and built with scientific research as the primary use case. These presentation slides cover the current state of mapping life science research and HPC technique onto “the cloud” and how to work around the common engineering, orchestration and data movement problems.
[Note: I've replaced the 2011 version of this talk deck with a slightly updated version as delivered at the AIRI Petabyte Challenge Meeting]
Bio-IT & Cloud Sobriety: 2013 Beyond The Genome Meeting - Chris Dagdigian
October 2013 "Beyond the Genome" presentation slides. Talk is mostly focused on issues around IaaS cloud usage for "Bio-IT" and life science informatics & scientific computing.
PDF slides available directly - please email chris@bioteam.net for slides
Facilitating Collaborative Life Science Research in Commercial & Enterprise E... - Chris Dagdigian
This is a talk I put together for a http://www.neren.org/ seminar called "Bridging the Gap: Research Facilitation". Tried to give a biotech/pharma view for a mostly academic audience.
2014 BioIT World - Trends from the trenches - Annual presentation - Chris Dagdigian
Talk slides from the annual "trends from the trenches" address at BioITWorld Expo. 2014 Edition.
### Email chris@bioteam.net if you'd like a PDF copy of this deck ###
BioITWorld 2013 presentation - Best practices for building multi-tenant HPC clusters for Pharma/BioTech
Essentially a mini case study of a recent deployment of a multi-petabyte, 1000+ CPU core Linux cluster in the Boston area.
Please email me at: chris@bioteam.net if you would like the actual PDF file itself.
This is a massive slide deck I used as the starting point for a 1.5 hour talk at the 2012 www.nerlscd.org conference. Mixture of old and (some) new slides from my usual stuff.
Cloud Sobriety for Life Science IT Leadership (2018 Edition) - Chris Dagdigian
Candid/blunt AWS advice for research IT and life science IT leadership. Hard lessons learned from many years of AWS consulting. Contact dag@bioteam.net if you want a PDF copy of this presentation
Annual address covering trends, emerging requirements, pain points and infrastructure issues in the "Bio-IT" aka life science informatics and HPC realm; Email me if you want a PDF of this talk - chris@bioteam.net
Bio-IT Trends From The Trenches (digital edition) - Chris Dagdigian
Note: Contact me directly dag@bioteam.net if you would like a PDF download of these slides
This is Chris Dagdigian's 10th year delivering his no-holds-barred, candid state of the industry address at BioIT World, and we are not going to let a pandemic stop him.
Instead of his typical talk, five distinguished panelists will join Chris for a spirited discussion on Current Events and Scientific Computing and the impacts of the COVID-19 Pandemic:
Tiny slide deck from a 5-min lightning talk covering a recent project involving live replication of 2-petabytes of scientific data.
Please leave feedback if you'd like to see this as a long-form technical blog article or conference talk, thanks!
Disruptive Innovation: how do you use these theories to manage your IT? - Mark Madsen
The term disruptive innovation was popularized by Harvard professor Clayton Christensen in his 1997 book “The Innovator’s Dilemma.” Nearly 20 years later “Disrupt!” is a popular leadership mantra that is more frequently uttered than experienced. You can't productize it. You can't always control it – at least what effects it has in practice. You aren't necessarily going to like every product of innovation. So are you sure you want it? If so, how do you promote a culture in which innovation can flower – and, potentially, thrive? Because that's probably the best that you can do.
Perhaps there's a better framing for innovation than just "disruption." This session is an overview of commoditization and innovation theories, followed by basic things you can do to apply that theory to your daily job architecting, choosing and managing a data environment in your company.
Everything Has Changed Except Us: Modernizing the Data Warehouse - Mark Madsen
Keynote, Munich, June 2016
The way we make decisions has changed. The data we use has changed. The techniques we can apply to data and decisions have changed. Yet what we build and how we build it has barely changed in 20 years.
The definition of madness is doing more of what you already do and expecting different results. The threat to the data warehouse is not from new technology that will replace the data warehouse. It is from destabilization caused by new technology as it changes the architecture, and from failure to adapt to those changes.
The technology that we use is problematic because it constrains and sometimes prevents necessary activities. We don’t need more technology and bigger machines. We need different technology that does different things. More product features from the same vendors won’t solve the problem.
The data we want to use is challenging. We can’t model and clean and maintain it fast enough. We don’t need more data modeling to solve this problem. We need less modeling and more metadata.
And lastly, a change in scale has occurred. It isn’t a simple problem of “big”. The problem with current workloads has been solved, despite the performance problems that many people still have today. Scale has many dimensions – important among them are the number of discrete sources and structures, the rate of change of individual structures, the rate of change in data use, the variety of uses and the concurrency of those uses.
In short, we need new architecture that is not focused on creating stability in data, but one that is adaptable to continuous and rapidly changing uses of data.
BI isn't big data and big data isn't BI (updated) - Mark Madsen
Big data is hyped, but isn't hype. There are definite technical, process and business differences in the big data market when compared to BI and data warehousing, but they are often poorly understood or explained. BI isn't big data, and big data isn't BI. By distilling the technical and process realities of big data systems and projects we can separate fact from fiction. This session examines the underlying assumptions and abstractions we use in the BI and DW world, the abstractions that evolved in the big data world, and how they are different. Armed with this knowledge, you will be better able to make design and architecture decisions. The session is sometimes conceptual, sometimes detailed technical explorations of data, processing and technology, but promises to be entertaining regardless of the level.
Yes, it’s about the data normally called “big”, but it’s not Hadoop for the database crowd, despite the prominent role Hadoop plays. The session will be technical, but in a technology preview/overview fashion. I won’t be teaching you to write MapReduce jobs or anything of the sort.
The first part will be an overview of the types, formats and structures of data that aren’t normally in the data warehouse realm. The second part will cover some of the basic technology components, vendors and architecture.
The goal is to provide an overview of the extent of data available and some of the nuances or challenges in processing it, coupled with some examples of tools or vendors that may be a starting point if you are building in a particular area.
Briefing room: An alternative for streaming data collection - Mark Madsen
Knowing what’s happening in your enterprise right now can mark the difference between success and failure. The key is to have a rich view of activity, such that analysts and others can explore in a fully multidimensional fashion. Benefiting from such a detailed perspective can help professionals identify the exact nature of problems or opportunities, thus enabling precise actions that make a difference quickly.
Register for this episode of The Briefing Room to hear veteran Analyst Mark Madsen of Third Nature explain how a nexus of innovations for analyzing network traffic can help companies stay on top of their game. He’ll be briefed by Erik Giesa of ExtraHop, who will showcase his company’s stream analytics technology for wire data, which provides real-time, multidimensional views of network traffic. He’ll share success stories of how ExtraHop has solved otherwise intractable problems and enabled a new level of root-cause analysis.
Data lakes, data exhaust, web scale, data is the new oil. Vendors are throwing new terms and analogies at us to convince us to buy their products as the market around data technologies grows. We change data persistence and transaction layers because "databases don't scale" or because data is "unstructured". If data had no structure then it wouldn't be data, it would be noise. Schema on read, schema on write, schemaless databases; they imply structure underlying the data. All data has schema, but that word may not mean what you think it means.
This presentation will describe concepts of data storage and retrieval from technology prehistory (i.e. before the 1980s) and examine the design principles behind both old and new technology for managing data because sometimes post-relational is actually pre-relational. It is important to separate what is identical to things that were tried in the past from new twists on old topics that deliver new capabilities.
Directly related to these topics are performance, scalability and the realities of what organizations do with data over time. All of these topics should guide architecture decisions to avoid the trap of creating technical debts that must be paid later, after systems are in place and change is difficult.
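The schema-on-read versus schema-on-write distinction the abstract above touches on can be made concrete with a toy sketch (the function and field names here are illustrative, not from the talk): schema-on-write validates structure before anything is stored, while schema-on-read stores raw records and imposes structure only at query time.

```python
import json

# Schema-on-write: validate/shape the record before storing it.
# A malformed record is rejected up front, so reads can trust the data.
def write_with_schema(store, record, required_fields=("id", "name")):
    missing = [f for f in required_fields if f not in record]
    if missing:
        raise ValueError(f"rejected at write time, missing: {missing}")
    store.append(json.dumps(record))

# Schema-on-read: the store holds raw text; structure is imposed only
# when a query asks for a particular field.
def read_with_schema(store, field, default=None):
    return [json.loads(raw).get(field, default) for raw in store]

store = []
write_with_schema(store, {"id": 1, "name": "sensor-a", "temp": 21.5})
names = read_with_schema(store, "name")  # structure applied here, at read time
```

Either way, some schema exists; the design choice is only about when it is enforced and who pays the cost of a mismatch.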
IT Performance Management Handbook for CIOs - Vikram Ramesh
Learn why measuring performance on individual devices and systems often leaves admins flying blind when it comes to SLA management and identifying performance bottlenecks. This in-depth e-Guide talks about how VirtualWisdom4 can give administrators a live, up-to-the-second view across the system-wide IT infrastructure.
Innovation with big data - Chr. Hansen's experiences - Microsoft
In many places, Big Data is still the new and unknown, with no top priority at IT, since "we don't have large data volumes." But Big Data is much more than large data volumes. At Chr. Hansen A/S, the Research and Development (Innovation) department has worked with the value of data and, as a result, established a cross-disciplinary BioInformatics program built on Big Data technologies from Microsoft.
(PDF available upon request). This is an updated version of the 2012 BioITWorld Boston talk that I gave 6 weeks later at Bio IT World Asia in June 2012. Some slide content was updated and revised, and I also deleted a number of slides in an attempt to shorten the talk since I'm known to speak fast. There was legitimate concern I'd be unintelligible to non-native English speakers!
Talk slides as delivered at the 2012 Bio-IT World Conference in Boston, MA
This is my annual "state of the state" address that has become somewhat popular.
Big data is everywhere, although sometimes we may not immediately realize it. Most of us don't deal with large amounts of data in our daily lives except in unusual circumstances. Lacking this immediate experience, we often fail to understand both the opportunities and the challenges presented by big data. A number of issues and challenges remain in addressing these characteristics going forward.
General overview of the Big Data Concept.
Presentation of the Hierarchical Linear Subspace Indexing Method to perform exact similarity search in high dimensional data
Introduction to Big Data (non-technical) and the importance of Data Science to create meaning.
First, we define Big Data in the light of the 3 Vs: volume, velocity and variety; next we move on to redefine Big Data and touch on the topic of a data lake. We envision that Big Data will become mainstream for small organisations as well, and we cover what we can do with Big Data, how to tackle Big Data projects, what challenges lie ahead, and what opportunities there are to reap. And of course, how important data science is for finding the meaning in all the data.
High-Performance Networking Use Cases in Life Sciences - Ari Berman
Big data has arrived in the life science research domain and has driven the need for optimized high-performance networks in these research environments. Many petabytes of data transfer, storage and analytics are now a reality because data is being produced cheaply and rapidly at unprecedented rates in academic, commercial and clinical laboratories. These data flows are complicated by the combination of high-frequency mice flows and high-volume elephant flows, sometimes from the same application operating in parallel environments. Additional complicating factors include collaborative research efforts on large data stores that utilize both common and disparate compute resources, the need for high-performance in-flight data encryption to cover the transmission and handling of clinical data, and the relatively poor state of algorithm development from an IO standpoint throughout the industry. This presentation will cover representative advanced networking use cases from life sciences research, the challenges they present in networking environments, some solutions being deployed within both small and large institutions, and an overview of a few of the unresolved problems to date.
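The mice-versus-elephant flow distinction mentioned above is commonly operationalized by a simple volume threshold: short, frequent control-plane flows are "mice," while long-lived, high-volume dataset transfers are "elephants." A minimal sketch (the 10 GiB cutoff and flow names are hypothetical, not from the talk):

```python
# Classify observed network flows as "mouse" (short, frequent) or
# "elephant" (long-lived, high-volume) by a byte-count threshold.
ELEPHANT_BYTES = 10 * 1024**3  # 10 GiB cutoff; an assumed, tunable value

def classify_flows(flows):
    """flows: iterable of (flow_id, total_bytes) tuples."""
    return {fid: ("elephant" if nbytes >= ELEPHANT_BYTES else "mouse")
            for fid, nbytes in flows}

labels = classify_flows([
    ("dns-lookup", 512),               # tiny control-plane flow
    ("genome-transfer", 2 * 1024**4),  # 2 TiB dataset movement
])
```

Real deployments often use such a classification to steer elephant flows onto dedicated high-throughput paths (e.g. a Science DMZ) so they do not starve the latency-sensitive mice.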
Big Data brings big promise and also big challenges, the primary and most important one being the ability to deliver Value to business stakeholders who are not data scientists!
The Next-Generation sequencing data-deluge requires storage and compute services to be provisioned at an ever-increasing rate. Can Cloud (and last decade's buzzword, Grid), help us?
Talk given at the NHGRI Cloud computing workshop, 2010.
Big Data, NoSQL, NewSQL & The Future of Data ManagementTony Bain
It is an exciting and interesting time to be involved in data. More influential change has occurred in database management in the last 18 months than in the previous 18 years. New technologies such as NoSQL and Hadoop, and radical redesigns of existing technologies such as NewSQL, will dramatically change how we manage data moving forward.
These technologies bring with them possibilities both in terms of the scale of data retained but also in how this data can be utilized as an information asset. The ability to leverage Big Data to drive deep insights will become a key competitive advantage for many organisations in the future.
Join Tony Bain as he takes us through both the high level drivers for the changes in technology, how these are relevant to the enterprise and an overview of the possibilities a Big Data strategy can start to unlock.
Introduction to Cloud computing and Big Data-HadoopNagarjuna D.N
Cloud Computing Evolution
Why is Cloud Computing needed?
Cloud Computing Models
Cloud Solutions
Cloud job opportunities
Criteria for Big Data
Big Data challenges
Technologies to process Big Data- Hadoop
Hadoop History and Architecture
Hadoop Eco-System
Hadoop Real-time Use cases
Hadoop job opportunities
Hadoop and SAP HANA integration
Summary
Watch full webinar here: https://bit.ly/2Y0vudM
What is Data Virtualization and why should I care? In this webinar we intend to help you understand not only what Data Virtualization is, but why it is a critical component of any organization's data fabric and how it fits in. Data virtualization liberates and empowers your business users via data discovery and data wrangling, through to the generation of reusable reporting objects and data services. Digital transformation demands that we empower all consumers of data within the organization, and it demands agility too. Data Virtualization gives you meaningful access to information that can be shared by a myriad of consumers.
Register to attend this session to learn:
- What is Data Virtualization?
- Why do I need Data Virtualization in my organization?
- How do I implement Data Virtualization in my enterprise?
How to select a modern data warehouse and get the most out of it?Slim Baltagi
In the first part of this talk, we will give a setup and definition of modern cloud data warehouses as well as outline problems with legacy and on-premise data warehouses.
We will speak to selecting, technically justifying, and practically using modern data warehouses, including criteria for picking a cloud data warehouse, where to start, and how to use it optimally and cost-effectively.
In the second part of this talk, we discuss the challenges and where people are not getting a return on their investment. In this business-focused track, we cover how to get business engagement, how to identify the business cases/use cases, and how to leverage data-as-a-service and consumption models.
Modern Data Integration Expert Session Webinar ibi
William McKnight, President of McKnight Consulting Group and Information Builders’ Jake Freivald discuss the tools needed for a successful modern data integration.
Information Builders provides the industry’s most scalable software solutions for data management and analytics. We help organizations operationalize and monetize their data through insights that drive action. Our integrated platform for BI, analytics, data integration, and data quality, combined with our proven expertise, delivers value faster, with less risk. We believe data and analytics are the drivers of digital transformation, and we’re on a mission to help our customers capitalize on new opportunities in the connected world. Information Builders is headquartered in New York, NY, with global offices, and remains one of the largest privately held companies in the industry.
ER(Entity Relationship) Diagram for online shopping - TAEHimani415946
https://bit.ly/3KACoyV
The ER diagram for the project is the foundation for the building of the database of the project. The properties, datatypes, and attributes are defined by the ER diagram.
Multi-cluster Kubernetes Networking- Patterns, Projects and GuidelinesSanjeev Rampal
Talk presented at Kubernetes Community Day, New York, May 2024.
Technical summary of Multi-Cluster Kubernetes Networking architectures with focus on 4 key topics.
1) Key patterns for Multi-cluster architectures
2) Architectural comparison of several OSS/CNCF projects to address these patterns
3) Evolution trends for the APIs of these projects
4) Some design recommendations & guidelines for adopting/deploying these solutions.
1.Wireless Communication System_Wireless communication is a broad term that i...JeyaPerumal1
Wireless communication involves the transmission of information over a distance without the help of wires, cables or any other forms of electrical conductors.
Wireless communication is a broad term that incorporates all procedures and forms of connecting and communicating between two or more devices using a wireless signal through wireless communication technologies and devices.
Features of Wireless Communication
The evolution of wireless technology has brought many advancements with its effective features.
The transmitted distance can be anywhere between a few meters (for example, a television's remote control) and thousands of kilometers (for example, radio communication).
Wireless communication can be used for cellular telephony, wireless access to the internet, wireless home networking, and so on.
3. 3
Chris & Ari: Why 2 of Us Today?
Answer: Ari concentrates on Federal/US.Gov while I deal mostly
with commercial biotech/pharma, EDU and non-profit Orgs.
They are very different.
8. This data will be moving constantly …
Illumina HiSeq x 10
‣ Raw Instrument Data
• +13 TB every 3 days
‣ FASTQ Conversion
• +8 TB every 3 days
‣ Align -> Compressed BAM
• +2 TB every three days
‣ Data Distribution
• ?
8
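The per-3-day rates above annualize into striking totals. A back-of-envelope sketch (rates taken from the slide; assuming idealized continuous operation, which is an assumption, not a vendor figure):

```python
# Annualize the per-3-day output rates quoted above for a single
# HiSeq X Ten installation (idealized: assumes continuous operation).
rates_tb_per_cycle = {
    "raw instrument data": 13,
    "FASTQ conversion": 8,
    "compressed BAM": 2,
}
cycles_per_year = 365 / 3  # one 3-day production cycle, back to back

annual_tb = {stage: tb * cycles_per_year for stage, tb in rates_tb_per_cycle.items()}
total_pb = sum(annual_tb.values()) / 1024  # using 1 PB = 1024 TB

for stage, tb in annual_tb.items():
    print(f"{stage}: ~{tb:,.0f} TB/year")
print(f"total: ~{total_pb:.1f} PB/year")
```

Even before any distribution copies, that is roughly 2.7 PB of new data per year from one instrument installation - which is why "peta-scale" stops being exceptional.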
9. 9
Coming Soon To a Researcher Near You:
USB-attached genomic sequencing
Gulp.
10. 10
Tipping Point #1
Effort/cost of generating or acquiring vast piles of data
in 2015 is far less than real world cost of storing and
managing that data through a realistic lifecycle.
11. 11
Tipping Point #2
Scientists still believe storage is cheap & near-infinite.
Data triage no longer sufficient. Scientists rarely asked
to articulate a scientific/business case for storage.
12. 12
Tipping Point #3
Centralized infrastructure models are not sufficient and
must be modified. Data & compute WILL span sites and
locations with or without active IT involvement.
We need to start preparing now.
14. 14
“Center Of Gravity” Problem
Current methods involving centralized storage and bringing
“users” and “compute” very close “… to the data” are going
to face significant problems in 2015 and beyond.
15. 15
“Center Of Gravity” Pain #1
Terabyte class instruments. Everywhere. Gulp.
We cannot stop this trend - large-scale data generation will span labs,
buildings, campuses & WANs
16. 16
“Center Of Gravity” Pain #2
Collaborations & Peta-scale Open Access Data
The future of large scale genomics|informatics increasingly involves
multi-party / multi-site collaboration. Also: Petabytes of free data (!!)
17. 17
“Center Of Gravity” Pain #3
Object Storage Less Effective @ Single Site
Object storage is the future of scientific data at rest. Some major side
benefits (erasure coding, etc.) can only be realized when 3 or more
sites are involved
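The erasure-coding side benefit is easy to quantify. A sketch with illustrative shard counts (the 10+6 layout is an assumed example, not a product recommendation):

```python
# Raw-capacity overhead: N-way replication vs. erasure coding.
# Shard counts below are illustrative; real layouts vary by product.
def ec_overhead(data_shards, parity_shards):
    """Raw bytes stored per logical byte."""
    return (data_shards + parity_shards) / data_shards

replication_overhead = 3.0   # three full copies, one per site
ec = ec_overhead(10, 6)      # 10 data + 6 parity shards, dispersed
print(f"3-way replication: {replication_overhead:.1f}x raw capacity")
print(f"10+6 erasure code: {ec:.1f}x raw capacity")
```

Dispersed across three or more sites, a layout like 10+6 can be arranged to survive the loss of a whole site at 1.6x raw capacity instead of 3x - the multi-site benefit the slide refers to.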
18. 18
“Center Of Gravity” Summarized
Data spread is unavoidable. Effectively Unstoppable.
We have a WAN-scale data movement/access problem.
There are ~2 viable approaches going forward ...
19. 19
Option 1 - “Stay Centralized”
Still totally viable but much faster connectivity to
instruments & collaborators will be essential
Nutshell: Significant investment in edge/WAN connectivity required,
likely at bandwidths exceeding 10Gbps
20. 20
Option 2 - “Go With The Flow”
Embrace the distributed & “cloudy” future where
compute & storage span multiple zones
Nutshell: Still requires massive bandwidth upgrades to support
metadata-aware or location-aware access & compute
22. 22
Terabyte-scale data movement is
going to be an informatics “grand
challenge” for the next 2-3+ years
And far harder/scarier than previous compute & storage challenges
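To see why, consider the idealized wire-time for a single 13 TB instrument run (a sketch only; the 70% link-efficiency factor is an assumption standing in for protocol overhead and competing traffic):

```python
# Idealized wall-clock time to move a dataset across a WAN link.
# `efficiency` is an assumed fudge factor for protocol overhead and
# competing traffic; real-world transfers often do considerably worse.
def transfer_hours(dataset_tb, link_gbps, efficiency=0.7):
    bits = dataset_tb * 1e12 * 8               # dataset size in bits
    usable_bps = link_gbps * 1e9 * efficiency  # effective throughput
    return bits / usable_bps / 3600            # seconds -> hours

for gbps in (1, 10, 40, 100):
    print(f"13 TB over {gbps:>3} Gbps: ~{transfer_hours(13, gbps):.1f} hours")
```

At 1 Gbps a single run monopolizes the link for the better part of two days; at 10 Gbps it is around four hours - which is why the bandwidth numbers in the options above look the way they do.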
24. Long history of engagement & cooperation
Research IT vs. Enterprise IT
‣ Historically our infrastructure requirements
often surpassed what the Enterprise uses to
sustain day to day operation
‣ We’ve spent ~20 years working closely with
Enterprise IT to enable “data intensive
science”
‣ Relatively easy to align informatics IT
infrastructure with established vendor,
product, technology and architecture
standards
24
25. Barely worth talking about in 2015
25
Computing Power
‣ 32 CPU cores to 60,000
cores - it almost does not
matter
‣ Simple commodity
‣ Interesting & challenging
but not insanely hard.
‣ Easy to acquire & deploy
in 2015 at whatever scale
is needed (budget
permitting)
26. Still a hassle but no longer intractable
26
Storage
‣ Petabyte-capable storage is no
big deal in 2015
‣ Pricing slowly being
commoditized
‣ Many opportunities to do clever
stuff or waste phenomenal
amounts of money
‣ Biggest risk may be research
driving towards object storage
faster than Enterprise is willing
to commit/support
27. Hard but not insurmountable
27
Data Management
‣ Managing scientific data
at rest is still very hard
‣ … but we have seen a
few successful ways
forward
‣ DIY/RDBMS/LIMS
‣ iRODS
‣ Object Storage
30. 30
Issue #1
Current LAN/WAN stacks bad for emerging use cases
Existing technology we’ve used for decades has been architected to
support many small network flows; not a single big data flow
31. 31
Issue #2
Ratio of LAN:WAN bandwidth is out of whack
We will need faster links to “outside” than most organizations have
anticipated or accounted for in long-term technology planning
32. 32
Issue #3
Core, Campus, Edge and “Top of Rack” bandwidth
Enterprise networking types can be *smug* about 10Gbps at the
network core. Boy are they in for a bad surprise.
33. 33
Issue #4
Bigger blast radius when stuff goes wrong
Compute & storage can be logically or physically contained to
minimize disruption/risk when Research does stupid things.
Networks, however, touch EVERYTHING EVERYWHERE. Major risk.
34. 34
What We Need:
- Ludicrous bandwidth @ network core
- Very fast (10-40Gbps) ToR, Edge, Campus links
- 1Gbps - 10Gbps connections to “outside”
- Switches/Routers/Firewalls that can support
small #s of very large data flows
36. 36
Issue #5
Social, trust & cultural issues
We lack the multi-year relationship and track record we’ve built with
facility, compute & storage teams. We are “strangers” to many WAN
and SecurityOps types
37. 37
Issue #6
Our “deep bench” of internal expertise is lacking
Research IT usually has very good “shadow IT” skills but we don’t
have homegrown experts in BGP, Firewalls, Dark Fiber, Routing etc.
39. 39
Issue #7
Cisco. Cisco. Cisco.
The elephant in the room. Cisco rarely 1st choice for greenfield efforts
in this space but Cisco shops often refuse to entertain any
alternatives. Massive existing install base & on-premise expertise
must be balanced, recognized & carefully handled.
40. 40
Issue #8
Firewalls, SecOps & Incumbent Vendors
Legacy security products supporting 10Gbps can cost $150,000+ and
still utterly fail to perform without heroic tuning & deep config magic.
Alternatives exist but massive institutional inertia to overcome.
Deeply Challenging Issue.
42. 42
‣ Peta-scale becoming the norm, not exception
‣ Compute is a commodity; Storage getting there
‣ Historically it has been pretty easy to integrate
“Research Computing” with “Enterprise”
facilities and operational standards
‣ We can no longer assume the majority of our
infrastructure will reside in a single datacenter
43. 43
‣ We need a massive increase in end-to-end
network connectivity & bandwidth
‣ … and kit that can handle large data flows
‣ Current state of “Enterprise” LAN/WAN
networking is not aligned with emerging needs:
‣ Cost, Capability, Performance, Security …
44. 44
‣ New hardware, reference architectures, best
practices and methods will be required
‣ There is no easy path forward …
46. 46
‣ Science DMZ
‣ Only viable reference architecture &
collection of operational practices /
philosophy BioTeam has seen to date
‣ In-use today. Real world. No BS.
‣ High level visibility & support within US.GOV,
grant funding agencies and supporters of
data intensive science and R&E networks
47. 47
‣ If you did not know why you were attending this
workshop today, hopefully you do now!
‣ Enjoy the rest of the talks!