© 2013 New York Genome Center 1
Bio-IT World
IT Infrastructure for the
New York Genome
Center
Chris Dwan
(cdwan@nygenome.org)
“Traveller, there is no path. The path is made by
walking.”
Antonio Machado
THANK YOU
NYGC Leadership: Bob Darnell, Nancy Kelley
NYGC, Research Computing: Scott Bunnell, Uday Evani, James
Spencer
NYGC, Bioinformatics: Avinash Abhyankar, Dirk Evers, Filipe
Ribeiro, Nico Robine, Vlada Vacic
NYGC, Sequencing: Kevin Shianna, Soren Germer, Dayna
Oshwald, Frank Wos
NYGC: Patty Bradley, Chris Duignan, Bill Fair, Jen Feueustein,
Paula Leca, Yasmeen Pattie, Matt Pelo, Dave Whelan, Zane Wruble
Sabey Data Centers: Tom Beckwith, Mike Bosco, Jim Glen, Paul
Ryan, John Sabey, Dave Sabey
CDI: Julie Baez, Tom Baker, Vince Collado, Tony Daniello, Ramon
Gil, George Gosseling, Adam Jacobs, Rob Sienrukos
Rockefeller University: Stuart Cohnen, Armand Gazes, Dave Seay
Bioteam: Chris Dagdigian, Stan Gloss
Bio-IT World / CHI: Kevin Davies, Cindy Crowninshield
I consider the IT infrastructure, as
well as the DNA sequencing
operation itself, to be a necessary
evil on the road to transforming
health care and improving people’s
lives.
THE FUTURE
Data will not all be in one place
Flexible APIs and management protocols for
large data warehouses will be the key
differentiator
Computation will go to data
Relocatable code and scriptable infrastructure
will determine tool adoption and scientific
success
Cloud pricing will drive budgeting
Over time, financial and operational incentives
will settle the cloud argument.
NY Genome
Business Offices
(November 2011)
NYGC
NYGC SERVICE MODEL
NYGC provides
centralized resources
for sequencing,
bioinformatics, data
warehousing, and
high performance
computing
Service offerings are
integrated across
that stack.
Collaborations
reinforce the
institutional members
INTEGRATED SERVICE
OFFERING
“Sequencing”
Whole genome, exome, RNA Seq
“Informatics”
Analysis pipeline, defined and tuned collaboratively
with customer
“Data archive”
Two years data storage for BAM or FASTQ plus
variant files and other analytic results
Direct access (VPN to dedicated server) to NYGC
data storage and computing services.
The goal is to enable direct collaboration while
providing a high level of service.
Business Offices
(November 2011)
Initial Service
Offering: (Feb 2012)
OUTSOURCED DATA
PRODUCTION
IGN
Sequencing.
S3
Outsourced data are transferred to
Amazon over the internet.
Primary analysis of outsourced data
occurs in Amazon.
Primary data and results are
transferred to the larger data archive
Users may access data directly from
Amazon for a short time period (three
months or as needed)
Amazon Web
Services
EBS
Data Center
Data Archive
Internet
Amazon “direct connect” 1 Gb/sec dedicated line
Shipping disks via FedEx
• Assume a 48 hour point to point latency
– Two sets of checksums
– Sneakerbot / human interaction on either side
• Very low potential for automation
• Many opportunities for error
• Sadly, this is the state of the art in many places
• Sometimes this is necessary, but it is not the plan.
DATA BANDWIDTH
Bandwidth                      1 Gigabyte   1 Genome (130GB)   Genomes / day   HiSeq 2000 daily raw output (55GB/day)
T1 business link (12Mb/sec)    11m 22s      24.6h              1               2
T3 business link (45Mb/sec)    3m 10s       6.9h               3.4             9
700Mb/sec                      11s          24m                60              134
Gigabit                        8 sec        17m                84              192
If we can make full use of the available bandwidth, gigabit networking is
sufficient for the long term data motion needs of a large genome center*
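The arithmetic behind the bandwidth table can be sketched as follows. This is a rough reconstruction, not the author's own calculation: it assumes 1 GB = 8192 Mb (binary prefixes) and a fully utilized link. Under that assumption it matches the 11m 22s figure for 1 GB at 12Mb/sec and the last column (number of HiSeq 2000 instruments a link can keep up with); several other cells on the slide appear to be rounded and differ by a few percent. The function names are illustrative.

```python
# Assumption: 1 GB = 8192 Mb (binary prefixes) and a fully utilized link.
SECONDS_PER_DAY = 24 * 60 * 60
GENOME_GB = 130.0       # one genome, from the slide
HISEQ_DAILY_GB = 55.0   # HiSeq 2000 daily raw output, from the slide


def transfer_seconds(gigabytes: float, megabits_per_sec: float) -> float:
    """Seconds needed to move `gigabytes` over the link."""
    return gigabytes * 8192 / megabits_per_sec


def per_day(gigabytes_each: float, megabits_per_sec: float) -> float:
    """How many payloads of `gigabytes_each` fit through the link in 24 hours."""
    return SECONDS_PER_DAY / transfer_seconds(gigabytes_each, megabits_per_sec)


def fmt(seconds: float) -> str:
    """Render whole seconds (truncated) as hours, minutes, seconds."""
    minutes, secs = divmod(int(seconds), 60)
    hours, minutes = divmod(minutes, 60)
    return f"{hours}h {minutes}m {secs}s"


# Link rates in Mb/sec, as labelled on the slide.
LINKS_MBPS = {"12Mb/sec": 12, "45Mb/sec": 45, "700Mb/sec": 700, "Gigabit": 1000}

for name, rate in LINKS_MBPS.items():
    print(
        f"{name:>10}: 1 GB in {fmt(transfer_seconds(1, rate))}, "
        f"1 genome in {fmt(transfer_seconds(GENOME_GB, rate))}, "
        f"{per_day(GENOME_GB, rate):.1f} genomes/day, "
        f"{per_day(HISEQ_DAILY_GB, rate):.0f} HiSeq 2000s"
    )
```

The point of the table survives the rounding differences: at gigabit speed a single link absorbs the raw output of well over a hundred instruments, provided the bandwidth can actually be used.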
COAST TO COAST 1 Gb/sec
Pilot Lab at
Rockefeller University
(June 2012)
Primary Colocation
Site
(June 2012)
Business Offices
(November 2011)
Initial Service
Offering: (Feb 2012)
LAB AT ROCKEFELLER
UNIVERSITY
IN-HOUSE DATA CAPTURE
Pilot Lab
Data staging:
20TB Isilon X200
HiSeq
Primary data capture should be
physically close to the instruments.
This reduces the number of factors that
can interrupt data production.
Local storage scales with the number
of instruments.
Initial data motion is performed
according to vendor practices. For
Illumina, this means CIFS (SAMBA)
mounts directly from the instrument.
Data are organized according to instrument run.
HiSeq
1Gb/sec ethernet
DATA ARCHIVE AND
DISASTER RECOVERY
Pilot Lab
Lab NAS
Data are transferred from in-lab
storage to the data archive at our
colocation center.
Data archive grows with
production.
In-lab storage can be reclaimed
once data are incorporated in the
archive.
In-house data are analyzed using
the in-house compute cluster.
Data Center
Data Archive
1Gb/sec
private wide
area
ethernet
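One way to make the reclaim step safe, sketched here under assumptions: compare checksums of every file in an in-lab run directory against its archived copy before freeing lab storage. The one-directory-per-instrument-run layout follows the slides; the function names and the choice of SHA-256 are illustrative and are not taken from NYGC's actual tooling.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Checksum a file in 1 MiB chunks so large run files never sit fully in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def safe_to_reclaim(lab_run_dir: Path, archive_run_dir: Path) -> bool:
    """True only if every file under the lab run directory has an identical archived copy.

    An empty lab directory is trivially safe to reclaim.
    """
    for lab_file in sorted(p for p in lab_run_dir.rglob("*") if p.is_file()):
        archived = archive_run_dir / lab_file.relative_to(lab_run_dir)
        if not archived.is_file():
            return False  # missing from the archive
        if sha256_of(lab_file) != sha256_of(archived):
            return False  # present but not identical
    return True
```

A check like this is cheap relative to re-sequencing, and it can run unattended, which is the property the disk-shipping workflow lacks.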
Pilot Lab at
Rockefeller University
(June 2012)
Primary Colocation
Site
(June 2012)
Business Offices
(November 2011)
Initial Service
Offering: (Feb 2012)
32 Avenue of the
Americas
(NYSERNet)
NYSERNET AT 32 AOA
THE FUTURE IS CONNECTED
INITIAL NETWORK CONNECTION
Sabey Data Centers:
• 3 x 10^6 square feet of data center, nationwide
• Recent purchase of Manhattan High Rise Data Center
• Strong interest in genomic medicine
NYSERNet:
• New York State Educational and Research Network
• Pre-existing fiber network
• Dark fiber available from Rockefeller!
Your
headquarters is
a terrible place
to put a data
center*
375 PEARL: INTERGATE
MANHATTAN
INITIAL DEPLOYMENT: 26TH
FLOOR
ROOM TO GROW
6TH FLOOR AT 375 PEARL
DATA ARCHIVE AND
DISASTER RECOVERY
Pilot Lab
Lab NAS
Data are transferred from in-lab
storage to the data archive at our
colocation center.
Data archive grows with
production.
Data are mirrored (disk to disk) to a
disaster recovery site.
In-lab storage can be reclaimed
once data are incorporated in the
archive.
In-house data are analyzed using
the in-house compute cluster.
Data Center
Data Archive
1Gb/sec
private wide
area
ethernet
Disaster Recovery
Data Archive
1Gb/sec
private point to
point ethernet
MOST OF BIOINFORMATICS IS NOT LATENCY BOUND
Quincy, WA
Power: ~$0.025 / kWh (Hydroelectric)
Rent: “Lower”
Manhattan
Power: ~$0.145 / kWh
Rent: “Higher”
time=82.0 ms
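To put the two power prices side by side, here is a small worked calculation. It is a sketch under simplifying assumptions: a constant load running all year, with cooling overhead, demand charges, and rent ignored. Only the two per-kWh rates come from the slide; the function name and the 1 kW reference load are illustrative.

```python
HOURS_PER_YEAR = 24 * 365

# Approximate rates from the slide, in dollars per kWh.
RATES = {"Quincy, WA": 0.025, "Manhattan": 0.145}


def annual_power_cost(kilowatts: float, dollars_per_kwh: float) -> float:
    """Yearly electricity cost of a constant load (no cooling overhead included)."""
    return kilowatts * HOURS_PER_YEAR * dollars_per_kwh


for site, rate in RATES.items():
    # Cost of keeping 1 kW of equipment powered for a year at each site.
    print(f"{site}: ${annual_power_cost(1.0, rate):,.2f} per kW-year")

ratio = RATES["Manhattan"] / RATES["Quincy, WA"]
print(f"Manhattan power costs about {ratio:.1f}x as much")
```

That works out to roughly $219 per kW-year in Quincy against roughly $1,270 in Manhattan. For workloads that move large files and run for hours, an 82 ms round trip is a small price for that difference.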
PHYSICAL NETWORK
101 Avenue of the Americas: Headquarters
32 Avenue of the Americas: Switching
375 Pearl Street: Primary Colocation
Quincy, WA: Disaster Recovery
Link speeds: 1Gb/s, 40 Gb/s
External connections: Collaborating Institutions, Amazon East Coast (Direct Connect), NYSERNet, Internet / Phone
CROSS COUNTRY NETWORK
SHARED VISION
UNIQUE PARTY LOCATIONS
Pilot Lab at
Rockefeller University
(June 2012)
Primary Colocation
Site
(June 2012)
NY Genome Headquarters
(June 2013)
Business Offices
(November 2011)
Initial Service
Offering: (Feb 2012)
NYGC HEADQUARTERS
Headquarters: 101 Avenue of the
Americas
7 floors (170,000 sq. ft.) in a Manhattan high-rise
Sequencing and Laboratory space for 80+
instruments
Administrative and scientific offices for ~500
staff members
Training / conference facilities
Small onsite data center for data capture and
primary analysis
HEADQUARTERS FLOOR
STACKING
1st: Entrance, auditorium, café, meeting rooms
2nd: Mechanical
3rd: Expansion Space
4th: Laboratory and Faculty
5th: Sequencing
6th: Informatics
7th: Leadership
VIEW FROM THE ROOF
6TH FLOOR, MARCH 2013
DATA CENTER AT 101 AOA
Network
Virtualization
& Utility
Computing
Storage
• In row cooling
• 12.5kW / rack
• Fiber between floors
• Copper within floors
“Infrastructure is code”
Joe Landman
COMPUTING INFRASTRUCTURE
Commodity Compute Node
1U: 2 x 8 core Intel Xeon (E5)
16GB RAM per core: 256GB/chassis
5TB Local RAID 0 / scratch space
100MB/sec / core = 20GbE
High Memory Node
4U: 8 x 8 core Intel Xeon (E7)
1024GB RAM
11TB Local RAID 0 / scratch space
4 x 10Gb/sec Ethernet
CentOS / Univa / SGE
CLUSTER DASHBOARD
HOW DO I USE THE CLOUD?
This goes in the
cloud
This stays in house
BUY VS RENT
Cloud first:
Only own what cannot be rented
Do not buy capital equipment for burst needs
Not just finance, also regulatory compliance
Use cloud pricing for budgetary numbers.
Virtualize second:
“Bare metal” operating systems are good for
special purpose hardware (huge RAM, GPU)
Also, 10 – 15% performance bump.
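The advice against buying capital equipment for burst needs can be made concrete with a break-even estimate. This is a minimal sketch, not a budgeting method from the talk: it assumes straight-line cost over the equipment's lifetime and a flat hourly rental price, and every number in the example call is a placeholder chosen only to show the arithmetic.

```python
HOURS_PER_YEAR = 24 * 365


def breakeven_utilization(
    purchase_cost: float,
    yearly_operating_cost: float,
    lifetime_years: float,
    cloud_price_per_hour: float,
) -> float:
    """Fraction of hours an owned node must be busy for owning to cost the same as renting.

    Below this utilization, renting only the hours actually used is cheaper.
    A result above 1.0 means owning never pays off at these prices.
    """
    owned_total = purchase_cost + yearly_operating_cost * lifetime_years
    rented_around_the_clock = cloud_price_per_hour * HOURS_PER_YEAR * lifetime_years
    return owned_total / rented_around_the_clock


# Placeholder inputs (hypothetical, not real prices): a $10,000 node,
# $1,000 per year to run, kept for 3 years, versus $1.00 per hour rented.
example = breakeven_utilization(10_000, 1_000, 3, 1.00)
print(f"Owning breaks even at {example:.0%} utilization")
```

Burst workloads sit far below any plausible break-even utilization, which is why they are the natural candidates for renting.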
“Data Flows Downhill”
Jeff Hammerbacher
DATA STORAGE
2012: RAW STORAGE
2013: Deploying
petabyte scale storage
is an exercise in
requirements,
purchasing, and project
management.
If you find yourself
designing custom
storage, you are
doing it wrong*
DATA STORAGE
INFRASTRUCTURE
Local cache:
20 TB Isilon X200
Data Archive:
160TB of Isilon NL 36TB units
Expanding with Isilon X400 144TB units
Disaster Recovery:
160TB of Isilon NL 36TB units
Expanding with Isilon NL 144TB units
DATA ACCESS
S3
Amazon Web
Services
EBS
Data Center
Data Archive
Most users log in to
access their data directly
via VPN on a dedicated
virtual machine
Some users may download
data directly from Amazon
Institutional Founding
Members
Selected sites will participate
in a high performance private
research network.
Big Data is just small data that you
do not yet understand*
GENOME SEQUENCING DATA FLOW
Sequencing Lab
“Reads”: 130GB / 30x genome
Working Space: ~2TB temp files
“Variants”: ~5GB
Actionable report: PDF or user screen
Data archive
Clinical Use
Most research use
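The per-genome sizes above translate directly into archive capacity. The sketch below is illustrative only: it assumes that reads and variants are archived while the ~2TB of temporary files are discarded, that 1TB = 1000GB, and that the whole archive is usable. The 160TB and 1PB figures are the archive sizes quoted elsewhere in this deck; the function names are hypothetical.

```python
GB_PER_TB = 1000      # assumption: decimal units

READS_GB = 130        # per 30x genome, from the slide
TEMP_GB = 2000        # ~2TB of temp files per genome while it is being analyzed
VARIANTS_GB = 5       # approximate, from the slide


def archived_gb_per_genome(keep_reads: bool = True) -> int:
    """Long-term footprint of one genome; temp files are assumed to be deleted."""
    return (READS_GB if keep_reads else 0) + VARIANTS_GB


def genomes_that_fit(archive_tb: int, keep_reads: bool = True) -> int:
    """Whole genomes that fit in an archive of `archive_tb` terabytes."""
    return int(archive_tb * GB_PER_TB // archived_gb_per_genome(keep_reads))


def scratch_tb_needed(genomes_in_flight: int) -> float:
    """Working space required when several genomes are analyzed at once."""
    return genomes_in_flight * TEMP_GB / GB_PER_TB


for archive_tb in (160, 1000):
    print(
        f"{archive_tb}TB archive: {genomes_that_fit(archive_tb)} genomes with reads, "
        f"{genomes_that_fit(archive_tb, keep_reads=False)} with variants only"
    )
```

The working space is the larger transient cost: under these assumptions, ten genomes in flight need about 20TB of scratch, far more than the same ten genomes occupy once archived.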
DATA FLOW
Lab Data Production
“Sequence”: Sequence data automatically move to analysis and archive areas
“Analyze”: Data are demultiplexed and analyzed using our in-house workflow engine “EastRiver”
“Deliver”: Volumes, one per quote / order, accessed through a dedicated virtual machine
The Future
NY GENOME CENTER
Faculty recruitment:
~5 investigators by mid 2014
Service operations:
~20 HiSeq 2500s by late 2013
IT Infrastructure:
~300 core service cluster
~1,000 core research cluster
~1PB unstructured data archive
THE INEVITABLE FUTURE
Data are segregated by scientific
discipline, instrument type,
accident of funding, and
organizational boundaries
Highly virtualizable computing
resources are built next to the
larger data repositories
APIs implement controlled queries and
access to both data and compute
resources
Interesting datasets are defined in
terms of URIs against these APIs
Interesting analysis techniques incorporate
scriptable infrastructure to define their hardware
and software requirements
Executable Methods and Materials
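One possible reading of "Executable Methods and Materials", sketched under assumptions: an analysis carries a machine-readable description of its dataset (as a URI) and its hardware and software needs, and the infrastructure matches that description to a node type. The two node sizes come from the computing infrastructure slide in this deck; the class names, the example URI, and the selection rule are hypothetical illustrations, not an existing NYGC API.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class NodeType:
    name: str
    cores: int
    ram_gb: int


@dataclass(frozen=True)
class MethodsAndMaterials:
    """What an analysis needs, stated as data rather than prose."""
    dataset_uri: str            # the dataset, named as a URI against a data API
    cores: int
    ram_gb: int
    software: Tuple[str, ...]   # tool names; hypothetical placeholders


# Node sizes from the computing infrastructure slide.
NODE_TYPES = (
    NodeType("commodity", cores=16, ram_gb=256),
    NodeType("high-memory", cores=64, ram_gb=1024),
)


def smallest_fitting_node(
    method: MethodsAndMaterials,
    nodes: Tuple[NodeType, ...] = NODE_TYPES,
) -> Optional[NodeType]:
    """Pick the smallest node type that satisfies the stated requirements, if any."""
    candidates = [
        n for n in nodes if n.cores >= method.cores and n.ram_gb >= method.ram_gb
    ]
    return min(candidates, key=lambda n: (n.ram_gb, n.cores)) if candidates else None


# Hypothetical example: the URI and tool name are placeholders.
assembly = MethodsAndMaterials(
    dataset_uri="https://data.example.org/api/datasets/example-cohort",
    cores=32,
    ram_gb=512,
    software=("example-assembler",),
)
print(smallest_fitting_node(assembly))
```

Because the requirements are data, the same description can be evaluated against an in-house cluster or a cloud provider's instance catalog, which is what makes the method relocatable.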
THE FUTURE IS CONNECTED
END;

Editor's Notes

  • #3 Inspiration of Kepler & Tycho Brahe. Remove non-scientific obstacles to scientific understanding.