Calcul Québec - Université Laval
Building a Storage System for
Genomics
1
HPCS 2014
Halifax, NS
Florent.Parent@calculquebec.ca
Frederick.Lefebvre@calculquebec.ca
Agenda
Genomics storage project background
Reviewing and optimizing the proposal
Network + politics issues
Writing the RFP
Lessons learned
2
Genomics storage project: background
FCI Leading Edge Fund (2012 competition)
“Human and Microbial Integrative Genomics”
Project Lead: Dr. Jacques Simard, CRCHUQ, (Université Laval)
16 researchers from Université Laval
Bioinformatics and Computational Infrastructure
Arnaud Droit
Large storage component
3
CRCHUQ
Genomics, proteomics and metabolomics
Data sources: HiSeq 2500, 2000 and MiSeq
Applications: RAY, genomics pipeline …
Some researchers already active HPC users
Jacques Corbeil, Sébastien Boisvert (RAY), Arnaud Droit,
Yohan Bossé
4
Site specifications: Physical
Number of racks in silo: 56 max
Floor loading capacity: 940 lb/ft²
5
Site specifications: Power
6
1 MW generator
Campus Data
Center
CII: Centre des infrastructures informationnelles
Silo:
1.1 MW available (~33% used)
72 kW UPS (+ generator)
25 kV hydro line
2 MVA transformer
Site specifications: Cooling
Rack cooling: 100% air
No CRAC units! Using the campus-wide
chilled-water loop for cooling
Cooling capacity: 1.5 MW
Residual heat transferred to the campus
hot-water loop
Partial free air cooling (up to 300 kW)
7
[Diagram: cooling coils, air blowers, free-air cooling path]
Site specifications: Networking
8
Fibre optic network
to Québec hospital
research networks
Timeline
9
2012
Feb: First meeting
March: Finalize budget
April: FCI conditions met
Nov: Identify FW issue
(a number of meetings with vendors/manufacturers throughout)
2013
Jan 15: MSSS meeting
July 9: MSSS derogation
Oct 3: RFP published
Nov 20: RFP winner announced
2014
Jan 6: Physical installation
Jan 22: Acceptance testing starts
Feb 26: All tests pass, system accepted
Researcher LEF proposal
Initial contact: Researcher → VPR → CC staff
Initial meetings: Review proposal, discussions
Researchers planned to install storage at CRCHUQ facilities
Based on quote from local supplier
Review and optimize
Discuss possible optimization in proposal
Scheduled meetings with HPC storage suppliers
10
CFI LEF
Discussed option to host storage at CC site
Install some storage and compute at CRCHUQ
Bulk storage at UL/CQ/CC site
High speed connectivity already in place (10G UL-CRCHUQ)
Sounds simple, right? …
11
Concerns raised
Ease of access to CC hosted storage
Security
12
Refining the proposal
Evaluate benefits of hosting at CQ/CC
Power, cooling infrastructure already in place
O&M handled by CQ staff.
Collaborate with CRCHUQ sysadmin
Initial budget planned for room renovations and
extra A/C: savings redirected to more infrastructure
13
MOU
MOU sent to CFI (Jan 2013)
CRCHUQ and CQ/CC staff work on RFP and acquisition process
CQ/CC staff manage storage
Storage is for exclusive usage of CRCHUQ
Local storage for CRCHUQ, Parallel FS at CQ/UL
Archival (tapes) will use existing system available at remote
CQ/CC site
14
Genomics Storage Components
15
Interconnect
Université Laval owns a fibre-optic MAN
Interconnects all QC hospital research networks
16
Test node
Network testing
17
[Diagram: test node (VM), 4 Gbps link; 141 Mbps measured; “IS-QC” firewall limits flows to 1.2 Gbps]
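To put the measured rates in context, a rough back-of-the-envelope calculation (mine, not from the slides) of how long one week of sequencer output (~10 TB, the rate quoted later in the deck) would take at each speed:

```python
# Illustrative only: time to move 10 TB of sequencer output at the two
# measured rates (141 Mbps through the IS-QC firewall vs. 4 Gbps without).
def transfer_days(terabytes: float, megabits_per_s: float) -> float:
    bits = terabytes * 8e12                  # 1 TB = 8e12 bits (decimal units)
    seconds = bits / (megabits_per_s * 1e6)  # Mbps -> bits/s
    return seconds / 86400                   # seconds per day

through_firewall = transfer_days(10, 141)    # ~6.6 days
without_firewall = transfer_days(10, 4000)   # ~0.23 days (~5.6 hours)
```

At 141 Mbps a week of data takes almost a week to move, which is why removing the firewall from the path mattered so much.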
“Pile of firewalls”
CRCHUQ already manages its security
firewall at its periphery
IS-QC under MSSS authority
acts as a “safety valve”
Work with CRCHUQ to request a derogation to
remove IS-QC
18
MSSS Derogation
Document and prepare meeting in Dec 2012
Jan 2013: meeting with MSSS security staff
Jan 2013: regional security coordinator refusal
Feb 2013: CRCHUQ director writes to deputy minister
of MSSS IT
July 2013: Deputy minister (MSSS IT) visits UL/CQ
Derogation granted.
19
Network Measurements
perfSONAR for periodic network measurements
20
Archival
Use existing tape archives at CQ
Plenty of network bandwidth (for now…)
21
RFP
22
Building the RFP
An iterative process
Based on multiple meetings with researchers
+ Expertise and market knowledge of local HPC team
23
Vendors
RFP
Researcher HPC team
Premise
2 storage systems with different requirements in
very different environments
Parallel storage
Large and high-speed in modern datacenter with plenty of
power and cooling
On-site storage
Smaller capacity with slower interconnect in air-conditioned
server room
24
Challenges
Budget is limited. We want to get the most out
of it!
But the most of what?
Parallel storage capacity?
Parallel storage write speed?
On-site storage capacity?
etc…
25
Challenges (cont.)
Most of the budget to be allocated to the
parallel storage
To enable computing and mid-term storage
On-site storage must be large enough. No more.
A quality based RFP allows for such distinctions
26
How large is large enough?
The sequencing platform could generate 10 TB of
data per week
Operating at full capacity
40 TB would provide 1 month of buffering
27
[Diagram: Sequencers → On-site Storage (buffering) → Parallel Storage, with automated data synchronisation]
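The buffering arithmetic on this slide can be sketched directly:

```python
# Buffer sizing from the slide: ~10 TB/week of sequencer output at full
# capacity, 40 TB of on-site storage -> roughly 4 weeks (about a month)
# of buffering if synchronisation to the parallel store is interrupted.
WEEKLY_OUTPUT_TB = 10
ON_SITE_BUFFER_TB = 40
weeks_of_buffer = ON_SITE_BUFFER_TB / WEEKLY_OUTPUT_TB   # 4.0 weeks
```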
Quality-based RFP
We chose to publish a quality-based RFP
In contrast to a lowest-bidder process
Evaluated on cost + “quality criteria”
Vendors are asked to spend at least 95% of budget
28
Challenges (solution)
Define 2 independent sets of requirements
Use the “quality criteria” to let vendors know
what they should prioritize
More weight will be given to the parallel storage components
29
Hardware only or integrated solution?
A) Hardware only: Write an RFP to buy X TB of raw disk space
+ Y servers and the accompanying interconnect. Integrate
everything in-house to deploy a storage system.
B) Integrated solution: Ask for a complete system that meets
size and performance requirements.
First things first
30
Integrated Solution
A cumbersome question… Lustre, GPFS, or
anything?
Should we ask for a specific parallel FS?
Some parallel FS are tied to a specific vendor or a very small
set of vendors
Went with Lustre because it is a multi-vendor
ecosystem
… and our team is already familiar with it
31
Fostering competition
The RFP can be so specific as to open the door
only to a single product
Or it can let bidders come up with their own
solution to our problem
32
[Diagram: spectrum of RFP specificity, from “specific product” to “surprise…”]
Fostering competition (cont.)
Vendors know when an RFP is targeted to them
They will price accordingly
Inversely, vendors will not bid if they do not feel
they have a fair chance
Fewer bids will often mean a higher price
A less constrained RFP will generally attract
more proposals
33
Fostering competition (cont.)
Example of being too specific:
“Storage units with 60 drives in raid5, 8+2 configuration”
Such a statement could apply to a single vendor,
while limiting the available technologies
34
Spec'ing a storage system
Power & Cooling capacity
Physical space and room topology
Compatibility with existing infrastructure
Software
Physical
35
Physical infrastructure
36
Document floor/rack plan
Maximum weight per square foot?
How much space do we actually have?
Where does the system need to connect?
Both power and interconnect
Cable length
Power & Cooling
37
How much electrical capacity is available?
Total?
Per rack?
UPS?
Can our room cooling system handle that much
new power?
Requirements for parallel storage
1 PB usable (or more) Lustre FS
Compatible with Lustre clients 1.8.9 and 2.4.x
20 GB/s aggregate read/write speed (or more)
Drives and Lustre servers redundancy
the “how” is purposely left unspecified
Infiniband interconnect
2:1 blocking factor with computing resources
38
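A 2:1 blocking factor means the edge bandwidth facing the storage and compute nodes is twice the uplink bandwidth toward the core. A minimal sketch (the port counts are hypothetical, not from the RFP):

```python
# Hypothetical leaf-switch port split illustrating a 2:1 blocking factor:
# down-facing (edge) bandwidth is twice the uplink bandwidth to the core.
QDR_GBPS = 40                      # one QDR InfiniBand 4x link
down_ports, up_ports = 24, 12      # illustrative port counts only
blocking_factor = (down_ports * QDR_GBPS) / (up_ports * QDR_GBPS)  # 2.0
```

Under full load, each edge port can therefore count on only half its line rate through the core, a deliberate cost/performance trade-off.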
Requirements for parallel storage
Vendor to provide all interconnect
Leaf IB switch, Ethernet switch for management, and cabling
Site provides uplink to core switches
20 kW maximum electrical consumption
Vendor to supply PDUs (switched)
Site to connect PDUs to existing electrical infrastructure
39
Requirements for on-site storage
Export network filesystem
Compatible with sequencers, Windows 7, Linux and Mac
10G Ethernet interconnect
50 TB usable capacity (or more)
with option to grow up to 300 TB
Drives and servers redundancy
40
Requirements for on-site storage
Site to provide all cabling and interconnect for
on-site storage
PDUs and rack space provided by the site
41
Measuring the quality of a proposal
Final evaluation is based on an “adjusted price”
calculated from the bid price and the rating of
the “quality criteria” given by the evaluation
committee
The “adjusted price” can vary from the real price by
up to 30%
42
Quality criteria
43
Parallel Storage 45 %
On-site Storage 20 %
Interconnect & Networking 10 %
Vendor’s Experience & Reputation 25 %
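The slides do not give the exact adjustment formula, only the category weights and that the adjusted price can differ from the real price by up to 30%. One plausible sketch (an assumption, not the RFP's actual rule), mapping a weighted quality score to a discount, with 0.70 as the passing level:

```python
# Hypothetical adjusted-price model (the real RFP formula is not given in
# the slides): weight each category score, then convert quality above the
# 0.70 passing level into a price discount of up to 30%.
WEIGHTS = {"parallel": 0.45, "on_site": 0.20, "network": 0.10, "vendor": 0.25}

def adjusted_price(bid_price: float, scores: dict) -> float:
    """scores maps each category to a rating in [0.70, 1.00]."""
    quality = sum(WEIGHTS[c] * scores[c] for c in WEIGHTS)
    return bid_price * (1.0 - (quality - 0.70))   # perfect scores -> -30%

# A bid that only meets base requirements keeps (about) its real price;
# a perfect bid is evaluated as if it cost 30% less.
```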
Quality criteria (cont.)
In the first three categories, meeting the base
requirements gives a passing score of 70%.
Any specs or meaningful features above base
requirements will improve the mark.
44
Quality criteria (cont.)
In the “vendor” category, the score is based on the
bidder’s experience deploying similar
systems, with a requirement for at least 1 such
system in the past 18 months.
Support structure and the resume of the lead
architect for the project are also factors.
45
Benchmarks &
stability tests
46
Acceptance tests
We define stability tests to validate the system
can operate in a real production environment.
!
We run synthetic benchmarks to make sure the
system hits the performance targets set by the
vendor as requested by the quality criteria.
47
Stability tests
To validate normal operation
Homogeneous firmware and software versions everywhere
No errors or warning
Verify the system reboots cleanly
Lustre mounts properly
Simulate drive failures
Verify rebuild process
48
Benchmarks
We set some base rules
No custom tools. Re-use existing software
Let the vendor tune the tests for their system
But tests must be large enough to avoid cache effects
What to benchmark
Read/write speed of a single target: IOZone
Maximum aggregate read/write speed: IOR
Maximum I/O operations per second (IOPS): mdtest
49
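A common rule of thumb for "large enough to avoid cache effects" (my assumption; the slides do not state the actual sizing rule) is to write well beyond the combined RAM of the participating nodes:

```python
# Sizing benchmark data sets to defeat caching: write at least ~2x the
# combined RAM of the participating nodes so results reflect the storage,
# not the page cache.  (Rule of thumb; the RFP's actual rule may differ.)
def min_test_size_gb(node_ram_gb: float, n_nodes: int, safety: float = 2.0) -> float:
    return node_ram_gb * n_nodes * safety

# e.g. 16 client nodes with 64 GB of RAM each -> at least ~2 TB of test data
minimum = min_test_size_gb(64, 16)   # 2048.0 GB
```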
RFP results
50
Bids
We received 6 valid proposals
Parallel storage capacity varied by more than 60% across bids
Aggregate speed for parallel storage varied by almost 50%
On-site capacity varied by almost 100%
On-site storage went from a NAS on ZFS to full-fledged Lustre or
GPFS systems
51
System selected
Parallel storage: Xyratex CS6000
1.4 PB usable Lustre FS
12 OSS and 4 targets per OSS
4TB NL SAS drives +SSD for journals
30 GB/s maximum aggregate R/W speed
On-site storage: Xyratex CS1500
120TB usable Lustre FS (scales to 7 PB)
4 CIFS/NFS exporters
52
Deployment
53
Operation
Both systems in production since early February
Parallel storage dedicated to the research group
mounted on compute resources
Data transfers are enabled by Globus endpoints
on dedicated DTNs at both sites.
Todo: Review network topology for transfers
perfSONAR nodes to be deployed at the research center
54
Operation (cont.)
Researchers need a CC account to access Parallel
Storage
Access control and allocations are a challenge
Shared spreadsheet filled by research center to allocate space
on parallel FS for their users (Cumbersome!)
Integration with the CCDB would leverage existing system to
manage storage allocations
55
Lessons learned
56
Lessons learned
Time consuming (a 2-year project)
Mostly trust and relationship building
Time needed to write an RFP should not be underestimated
Benefits for the research group
Access to a team of specialists to lead their project
Major cost savings on the infrastructure. No investment needed to
upgrade an existing server room (UPS, power, cooling, etc.)
57
Cost to integrate CS6000
Installation: $900 (rack enclosure)
Power: $1,457 (new outlets)
Cooling: $0
Infiniband: Used existing cables
6 CXP - QSFP cables (18 QDR links)
58
Improving the process
Sharing RFPs between Compute Canada sites
could ease the process for new projects
Common benchmarks across Compute Canada
would help when designing acceptance tests
Applies to both storage and computing
59

More Related Content

What's hot

Cisco UCS: meeting the growing need for bandwidth
Cisco UCS: meeting the growing need for bandwidthCisco UCS: meeting the growing need for bandwidth
Cisco UCS: meeting the growing need for bandwidthPrincipled Technologies
 
IBM Data Centric Systems & OpenPOWER
IBM Data Centric Systems & OpenPOWERIBM Data Centric Systems & OpenPOWER
IBM Data Centric Systems & OpenPOWERinside-BigData.com
 
Ceph Day New York 2014: Best Practices for Ceph-Powered Implementations of St...
Ceph Day New York 2014: Best Practices for Ceph-Powered Implementations of St...Ceph Day New York 2014: Best Practices for Ceph-Powered Implementations of St...
Ceph Day New York 2014: Best Practices for Ceph-Powered Implementations of St...Ceph Community
 
Design installation-commissioning-red raider-cluster-ttu
Design installation-commissioning-red raider-cluster-ttuDesign installation-commissioning-red raider-cluster-ttu
Design installation-commissioning-red raider-cluster-ttuAlan Sill
 
UberCloud HPC Experiment Introduction for Beginners
UberCloud HPC Experiment Introduction for BeginnersUberCloud HPC Experiment Introduction for Beginners
UberCloud HPC Experiment Introduction for Beginnershpcexperiment
 
Ac922 watson 180208 v1
Ac922 watson 180208 v1Ac922 watson 180208 v1
Ac922 watson 180208 v1IBM Sverige
 
Mellanox Announces HDR 200 Gb/s InfiniBand Solutions
Mellanox Announces HDR 200 Gb/s InfiniBand SolutionsMellanox Announces HDR 200 Gb/s InfiniBand Solutions
Mellanox Announces HDR 200 Gb/s InfiniBand Solutionsinside-BigData.com
 
Ncar globally accessible user environment
Ncar globally accessible user environmentNcar globally accessible user environment
Ncar globally accessible user environmentinside-BigData.com
 
HPC Storage and IO Trends and Workflows
HPC Storage and IO Trends and WorkflowsHPC Storage and IO Trends and Workflows
HPC Storage and IO Trends and Workflowsinside-BigData.com
 
OpenCAPI-based Image Analysis Pipeline for 18 GB/s kilohertz-framerate X-ray ...
OpenCAPI-based Image Analysis Pipeline for 18 GB/s kilohertz-framerate X-ray ...OpenCAPI-based Image Analysis Pipeline for 18 GB/s kilohertz-framerate X-ray ...
OpenCAPI-based Image Analysis Pipeline for 18 GB/s kilohertz-framerate X-ray ...Ganesan Narayanasamy
 
TAU E4S ON OpenPOWER /POWER9 platform
TAU E4S ON OpenPOWER /POWER9 platformTAU E4S ON OpenPOWER /POWER9 platform
TAU E4S ON OpenPOWER /POWER9 platformGanesan Narayanasamy
 
Red Hat® Ceph Storage and Network Solutions for Software Defined Infrastructure
Red Hat® Ceph Storage and Network Solutions for Software Defined InfrastructureRed Hat® Ceph Storage and Network Solutions for Software Defined Infrastructure
Red Hat® Ceph Storage and Network Solutions for Software Defined InfrastructureIntel® Software
 
Xilinx Edge Compute using Power 9 /OpenPOWER systems
Xilinx Edge Compute using Power 9 /OpenPOWER systemsXilinx Edge Compute using Power 9 /OpenPOWER systems
Xilinx Edge Compute using Power 9 /OpenPOWER systemsGanesan Narayanasamy
 
Covid-19 Response Capability with Power Systems
Covid-19 Response Capability with Power SystemsCovid-19 Response Capability with Power Systems
Covid-19 Response Capability with Power SystemsGanesan Narayanasamy
 
Dynamic Provisioning of Data Intensive Computing Middleware Frameworks
Dynamic Provisioning of Data Intensive Computing Middleware FrameworksDynamic Provisioning of Data Intensive Computing Middleware Frameworks
Dynamic Provisioning of Data Intensive Computing Middleware FrameworksLinh Ngo
 

What's hot (20)

Cisco UCS: meeting the growing need for bandwidth
Cisco UCS: meeting the growing need for bandwidthCisco UCS: meeting the growing need for bandwidth
Cisco UCS: meeting the growing need for bandwidth
 
IBM Data Centric Systems & OpenPOWER
IBM Data Centric Systems & OpenPOWERIBM Data Centric Systems & OpenPOWER
IBM Data Centric Systems & OpenPOWER
 
Ceph Day New York 2014: Best Practices for Ceph-Powered Implementations of St...
Ceph Day New York 2014: Best Practices for Ceph-Powered Implementations of St...Ceph Day New York 2014: Best Practices for Ceph-Powered Implementations of St...
Ceph Day New York 2014: Best Practices for Ceph-Powered Implementations of St...
 
Design installation-commissioning-red raider-cluster-ttu
Design installation-commissioning-red raider-cluster-ttuDesign installation-commissioning-red raider-cluster-ttu
Design installation-commissioning-red raider-cluster-ttu
 
UberCloud HPC Experiment Introduction for Beginners
UberCloud HPC Experiment Introduction for BeginnersUberCloud HPC Experiment Introduction for Beginners
UberCloud HPC Experiment Introduction for Beginners
 
Ac922 watson 180208 v1
Ac922 watson 180208 v1Ac922 watson 180208 v1
Ac922 watson 180208 v1
 
POWER10 innovations for HPC
POWER10 innovations for HPCPOWER10 innovations for HPC
POWER10 innovations for HPC
 
Hadoop 1.x vs 2
Hadoop 1.x vs 2Hadoop 1.x vs 2
Hadoop 1.x vs 2
 
Mellanox Announces HDR 200 Gb/s InfiniBand Solutions
Mellanox Announces HDR 200 Gb/s InfiniBand SolutionsMellanox Announces HDR 200 Gb/s InfiniBand Solutions
Mellanox Announces HDR 200 Gb/s InfiniBand Solutions
 
OpenPOWER System Marconi100
OpenPOWER System Marconi100OpenPOWER System Marconi100
OpenPOWER System Marconi100
 
Ncar globally accessible user environment
Ncar globally accessible user environmentNcar globally accessible user environment
Ncar globally accessible user environment
 
HPC Storage and IO Trends and Workflows
HPC Storage and IO Trends and WorkflowsHPC Storage and IO Trends and Workflows
HPC Storage and IO Trends and Workflows
 
OpenCAPI-based Image Analysis Pipeline for 18 GB/s kilohertz-framerate X-ray ...
OpenCAPI-based Image Analysis Pipeline for 18 GB/s kilohertz-framerate X-ray ...OpenCAPI-based Image Analysis Pipeline for 18 GB/s kilohertz-framerate X-ray ...
OpenCAPI-based Image Analysis Pipeline for 18 GB/s kilohertz-framerate X-ray ...
 
TAU E4S ON OpenPOWER /POWER9 platform
TAU E4S ON OpenPOWER /POWER9 platformTAU E4S ON OpenPOWER /POWER9 platform
TAU E4S ON OpenPOWER /POWER9 platform
 
Red Hat® Ceph Storage and Network Solutions for Software Defined Infrastructure
Red Hat® Ceph Storage and Network Solutions for Software Defined InfrastructureRed Hat® Ceph Storage and Network Solutions for Software Defined Infrastructure
Red Hat® Ceph Storage and Network Solutions for Software Defined Infrastructure
 
Xilinx Edge Compute using Power 9 /OpenPOWER systems
Xilinx Edge Compute using Power 9 /OpenPOWER systemsXilinx Edge Compute using Power 9 /OpenPOWER systems
Xilinx Edge Compute using Power 9 /OpenPOWER systems
 
Status of HDF-EOS, Related Software and Tools
Status of HDF-EOS, Related Software and ToolsStatus of HDF-EOS, Related Software and Tools
Status of HDF-EOS, Related Software and Tools
 
Covid-19 Response Capability with Power Systems
Covid-19 Response Capability with Power SystemsCovid-19 Response Capability with Power Systems
Covid-19 Response Capability with Power Systems
 
IBM HPC Transformation with AI
IBM HPC Transformation with AI IBM HPC Transformation with AI
IBM HPC Transformation with AI
 
Dynamic Provisioning of Data Intensive Computing Middleware Frameworks
Dynamic Provisioning of Data Intensive Computing Middleware FrameworksDynamic Provisioning of Data Intensive Computing Middleware Frameworks
Dynamic Provisioning of Data Intensive Computing Middleware Frameworks
 

Viewers also liked

11урок теорія меню
11урок теорія меню11урок теорія меню
11урок теорія менюAndy Levkovich
 
1 вступ робоче місце
1 вступ робоче місце1 вступ робоче місце
1 вступ робоче місцеAndy Levkovich
 
Beaver Creek Park Mgmt Plan
Beaver Creek Park Mgmt PlanBeaver Creek Park Mgmt Plan
Beaver Creek Park Mgmt PlanMonty Horton
 
Ахітектура та скульптура
Ахітектура та скульптураАхітектура та скульптура
Ахітектура та скульптураAndy Levkovich
 
Bimba kids 21-10-2012 (2)
Bimba kids   21-10-2012 (2)Bimba kids   21-10-2012 (2)
Bimba kids 21-10-2012 (2)Debora Teixeira
 
RESUME administrator 010616
RESUME administrator  010616RESUME administrator  010616
RESUME administrator 010616Paul Firetto
 
Трудове навчання_6 клас_ 4 параграф
Трудове навчання_6 клас_ 4 параграфТрудове навчання_6 клас_ 4 параграф
Трудове навчання_6 клас_ 4 параграфAndy Levkovich
 
11 урок письмове завдання меню
11 урок письмове завдання меню11 урок письмове завдання меню
11 урок письмове завдання менюAndy Levkovich
 
Bimba kids 06-01-2013 (1)
Bimba kids   06-01-2013 (1)Bimba kids   06-01-2013 (1)
Bimba kids 06-01-2013 (1)Debora Teixeira
 
SITES Certifies 12 New Projects _ SITES
SITES Certifies 12 New Projects _ SITESSITES Certifies 12 New Projects _ SITES
SITES Certifies 12 New Projects _ SITESMary Abe
 

Viewers also liked (20)

Tabla reymol
Tabla reymolTabla reymol
Tabla reymol
 
Presentación1 cb
Presentación1 cbPresentación1 cb
Presentación1 cb
 
Powerpoint
PowerpointPowerpoint
Powerpoint
 
I os x android
I os x androidI os x android
I os x android
 
11урок теорія меню
11урок теорія меню11урок теорія меню
11урок теорія меню
 
E mail marketing
E mail marketingE mail marketing
E mail marketing
 
1 вступ робоче місце
1 вступ робоче місце1 вступ робоче місце
1 вступ робоче місце
 
mark cv 2016
mark cv 2016mark cv 2016
mark cv 2016
 
Bimba kids 13-01-2013
Bimba kids   13-01-2013Bimba kids   13-01-2013
Bimba kids 13-01-2013
 
Beaver Creek Park Mgmt Plan
Beaver Creek Park Mgmt PlanBeaver Creek Park Mgmt Plan
Beaver Creek Park Mgmt Plan
 
Ахітектура та скульптура
Ахітектура та скульптураАхітектура та скульптура
Ахітектура та скульптура
 
Powerpoint
PowerpointPowerpoint
Powerpoint
 
Bimba kids 21-10-2012 (2)
Bimba kids   21-10-2012 (2)Bimba kids   21-10-2012 (2)
Bimba kids 21-10-2012 (2)
 
RESUME administrator 010616
RESUME administrator  010616RESUME administrator  010616
RESUME administrator 010616
 
Трудове навчання_6 клас_ 4 параграф
Трудове навчання_6 клас_ 4 параграфТрудове навчання_6 клас_ 4 параграф
Трудове навчання_6 клас_ 4 параграф
 
11 урок письмове завдання меню
11 урок письмове завдання меню11 урок письмове завдання меню
11 урок письмове завдання меню
 
Bimba kids 06-01-2013 (1)
Bimba kids   06-01-2013 (1)Bimba kids   06-01-2013 (1)
Bimba kids 06-01-2013 (1)
 
Blogger, word press e tumblr
Blogger, word press e tumblrBlogger, word press e tumblr
Blogger, word press e tumblr
 
Dougherty 1
Dougherty 1Dougherty 1
Dougherty 1
 
SITES Certifies 12 New Projects _ SITES
SITES Certifies 12 New Projects _ SITESSITES Certifies 12 New Projects _ SITES
SITES Certifies 12 New Projects _ SITES
 

Similar to HPCS2014 - Building a storage system for genomics

MayaData Datastax webinar - Operating Cassandra on Kubernetes with the help ...
MayaData  Datastax webinar - Operating Cassandra on Kubernetes with the help ...MayaData  Datastax webinar - Operating Cassandra on Kubernetes with the help ...
MayaData Datastax webinar - Operating Cassandra on Kubernetes with the help ...MayaData Inc
 
The Cambridge Research Computing Service
The Cambridge Research Computing ServiceThe Cambridge Research Computing Service
The Cambridge Research Computing Serviceinside-BigData.com
 
CloudLightning and the OPM-based Use Case
CloudLightning and the OPM-based Use CaseCloudLightning and the OPM-based Use Case
CloudLightning and the OPM-based Use CaseCloudLightning
 
Optimized placement in Openstack for NFV
Optimized placement in Openstack for NFVOptimized placement in Openstack for NFV
Optimized placement in Openstack for NFVDebojyoti Dutta
 
Linac Coherent Light Source (LCLS) Data Transfer Requirements
Linac Coherent Light Source (LCLS) Data Transfer RequirementsLinac Coherent Light Source (LCLS) Data Transfer Requirements
Linac Coherent Light Source (LCLS) Data Transfer Requirementsinside-BigData.com
 
40 Powers of 10 - Simulating the Universe with the DiRAC HPC Facility
40 Powers of 10 - Simulating the Universe with the DiRAC HPC Facility40 Powers of 10 - Simulating the Universe with the DiRAC HPC Facility
40 Powers of 10 - Simulating the Universe with the DiRAC HPC Facilityinside-BigData.com
 
Alan Crosswell Canarie20090304
Alan Crosswell  Canarie20090304Alan Crosswell  Canarie20090304
Alan Crosswell Canarie20090304Bill St. Arnaud
 
Accelerate Big Data Processing with High-Performance Computing Technologies
Accelerate Big Data Processing with High-Performance Computing TechnologiesAccelerate Big Data Processing with High-Performance Computing Technologies
Accelerate Big Data Processing with High-Performance Computing TechnologiesIntel® Software
 
20190620 accelerating containers v3
20190620 accelerating containers v320190620 accelerating containers v3
20190620 accelerating containers v3Tim Bell
 
A Library for Emerging High-Performance Computing Clusters
A Library for Emerging High-Performance Computing ClustersA Library for Emerging High-Performance Computing Clusters
A Library for Emerging High-Performance Computing ClustersIntel® Software
 
AWS re:Invent 2016: Large-Scale, Cloud-Based Analysis of Cancer Genomes: Less...
AWS re:Invent 2016: Large-Scale, Cloud-Based Analysis of Cancer Genomes: Less...AWS re:Invent 2016: Large-Scale, Cloud-Based Analysis of Cancer Genomes: Less...
AWS re:Invent 2016: Large-Scale, Cloud-Based Analysis of Cancer Genomes: Less...Amazon Web Services
 
How to Design Scalable HPC, Deep Learning, and Cloud Middleware for Exascale ...
How to Design Scalable HPC, Deep Learning, and Cloud Middleware for Exascale ...How to Design Scalable HPC, Deep Learning, and Cloud Middleware for Exascale ...
How to Design Scalable HPC, Deep Learning, and Cloud Middleware for Exascale ...inside-BigData.com
 
Ceph Day London 2014 - Ceph Over High-Performance Networks
Ceph Day London 2014 - Ceph Over High-Performance Networks Ceph Day London 2014 - Ceph Over High-Performance Networks
Ceph Day London 2014 - Ceph Over High-Performance Networks Ceph Community
 
Improving the Efficiency of Cloud Infrastructures with Elastic Tandem Machine...
Improving the Efficiency of Cloud Infrastructures with Elastic Tandem Machine...Improving the Efficiency of Cloud Infrastructures with Elastic Tandem Machine...
Improving the Efficiency of Cloud Infrastructures with Elastic Tandem Machine...Frank Dürr
 
Ceph Day New York 2014: Ceph over High Performance Networks
Ceph Day New York 2014: Ceph over High Performance NetworksCeph Day New York 2014: Ceph over High Performance Networks
Ceph Day New York 2014: Ceph over High Performance NetworksCeph Community
 
Carrier Grade OCP: Open Solutions for Telecom Data Centers
Carrier Grade OCP: Open Solutions for Telecom Data CentersCarrier Grade OCP: Open Solutions for Telecom Data Centers
Carrier Grade OCP: Open Solutions for Telecom Data CentersRadisys Corporation
 
Stream Data Processing at Big Data Landscape by Oleksandr Fedirko
Stream Data Processing at Big Data Landscape by Oleksandr Fedirko Stream Data Processing at Big Data Landscape by Oleksandr Fedirko
Stream Data Processing at Big Data Landscape by Oleksandr Fedirko GlobalLogic Ukraine
 

Similar to HPCS2014 - Building a storage system for genomics (20)

MayaData Datastax webinar - Operating Cassandra on Kubernetes with the help ...
MayaData  Datastax webinar - Operating Cassandra on Kubernetes with the help ...MayaData  Datastax webinar - Operating Cassandra on Kubernetes with the help ...
MayaData Datastax webinar - Operating Cassandra on Kubernetes with the help ...
 
The Cambridge Research Computing Service
The Cambridge Research Computing ServiceThe Cambridge Research Computing Service
The Cambridge Research Computing Service
 
CloudLightning and the OPM-based Use Case
CloudLightning and the OPM-based Use CaseCloudLightning and the OPM-based Use Case
CloudLightning and the OPM-based Use Case
 
Optimized placement in Openstack for NFV
Optimized placement in Openstack for NFVOptimized placement in Openstack for NFV
Optimized placement in Openstack for NFV
 
Linac Coherent Light Source (LCLS) Data Transfer Requirements
Linac Coherent Light Source (LCLS) Data Transfer RequirementsLinac Coherent Light Source (LCLS) Data Transfer Requirements
Linac Coherent Light Source (LCLS) Data Transfer Requirements
 
40 Powers of 10 - Simulating the Universe with the DiRAC HPC Facility

HPCS2014 - Building a storage system for genomics

  • 1. Calcul Québec - Université Laval Building a Storage System for Genomics 1 HPCS 2014 Halifax, NS Florent.Parent@calculquebec.ca Frederick.Lefebvre@calculquebec.ca
  • 2. Calcul Québec - Université Laval Agenda Genomics storage project background Reviewing and optimizing the proposal Network + politics issues Writing the RFP Lessons learned 2
  • 3. Calcul Québec - Université Laval Genomics storage project: background FCI Leading Edge Fund (2012 competition) “Human and Microbial Integrative Genomics” Project Lead: Dr. Jacques Simard, CRCHUQ, (Université Laval) 16 researchers from Université Laval Bioinformatics and Computational Infrastructure Arnaud Droit Large storage component 3
  • 4. Calcul Québec - Université Laval CRCHUQ Genomics, proteomics and metabolomics Data sources: HiSeq 2500, 2000 and MiSeq Applications: RAY, genomics pipeline … Some researchers already active HPC users Jacques Corbeil, Sébastien Boisvert (RAY), Arnaud Droit, Yohan Bossé 4
  • 5. Calcul Québec - Université Laval Site specifications: Physical Number of racks in silo: 56 max Floor loading capacity: 940 lb/ft² 5
  • 6. Calcul Québec - Université Laval Site specifications: Power 6 1 MW generator Campus Data Center CII: Centre des Infrastructure Informationelles Silo: 1.1 MW available (~33% used) 72 kW UPS (+ generator) 25 kV hydro line 2 MVA transformer
  • 7. Calcul Québec - Université Laval Site specifications: Cooling Rack cooling: 100% air No CRAC units! Using campus-wide chilled-water loop for cooling Cooling capacity: 1.5 MW Residual heat transferred to campus hot-water loop Partial free air cooling (up to 300 kW) 7 (Diagram: cooling coils, air blowers, free air cooling)
  • 8. Calcul Québec - Université Laval Site specifications: Networking 8 Fibre optic network to Québec hospital research networks
  • 9. Calcul Québec - Université Laval Timeline 9 2012 Feb: first meeting (followed by a number of meetings with vendors/manufacturers) 2013 Jan 15: MSSS meeting March: finalize budget April: FCI conditions met July 9: MSSS derogation Oct 3: RFP published Nov: identify FW issue Nov 20: RFP winner announced 2014 Jan 6: physical installation Jan 22: acceptance testing starts Feb 26: all tests pass, system accepted
  • 10. Calcul Québec - Université Laval Researcher LEF proposal Initial contact: Researcher->VPR->CC staff Initial meetings: Review proposal, discussions Researchers planned to install storage at CRCHUQ facilities Based on quote from local supplier Review and optimize Discuss possible optimization in proposal Scheduled meetings with HPC storage suppliers 10
  • 11. Calcul Québec - Université Laval CFI LEF Discussed option to host storage at CC site Install some storage and compute at CRCHUQ Bulk storage at UL/CQ/CC site High speed connectivity already in place (10G UL-CRCHUQ) Sounds simple, right? … 11
  • 12. Calcul Québec - Université Laval Concerns raised Ease of access to CC hosted storage Security 12
  • 13. Calcul Québec - Université Laval Refining the proposal Evaluate benefits of hosting at CQ/CC Power, cooling infrastructure already in place O&M handled by CQ staff, in collaboration with CRCHUQ sysadmin Initial budget planned room renovations and extra A/C: $$ saved for more infrastructure 13
  • 14. Calcul Québec - Université Laval MOU MOU sent to CFI (Jan 2013) CRCHUQ and CQ/CC staff work on RFP and acquisition process CQ/CC staff manage storage Storage is for exclusive usage of CRCHUQ Local storage for CRCHUQ, Parallel FS at CQ/UL Archival (tapes) will use existing system available at remote CQ/CC site 14
  • 15. Calcul Québec - Université Laval Genomics Storage Components 15
  • 16. Calcul Québec - Université Laval Interconnect Université Laval owns fibre optics MAN Interconnects all QC hospital research networks 16
  • 17. Calcul Québec - Université Laval Test node Network testing 17 (Diagram: test node (VM) measures 141 Mbps through the “IS-QC” firewall on a 4 Gbps path; the firewall limits flows to 1.2 Gbps)
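The measured numbers above make the scale of the firewall problem concrete. A back-of-the-envelope sketch (plain arithmetic, not from the slides; link rates in decimal Mbps) of how long one week of sequencer output would take to move:

```python
def transfer_days(terabytes, link_mbps):
    """Days needed to move `terabytes` over a link of `link_mbps` (decimal units)."""
    bits = terabytes * 1e12 * 8          # TB -> bits
    seconds = bits / (link_mbps * 1e6)   # link rate in bits/s
    return seconds / 86400

# One week of sequencer output (~10 TB, per the slides):
print(round(transfer_days(10, 141), 1))   # through the firewall: ~6.6 days
print(round(transfer_days(10, 4000), 2))  # on the raw 4 Gbps path: ~0.23 days
```

At 141 Mbps, moving one week of data takes most of a week, so a backlog can never drain; this is the situation that motivated the derogation request described on the following slides.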
  • 18. Calcul Québec - Université Laval “Pile of firewalls” CRCHUQ already manages its own security firewall at its periphery IS-QC, under MSSS authority, acts as a “safety valve” Work with CRCHUQ to request a derogation to remove IS-QC 18
  • 19. Calcul Québec - Université Laval MSSS Derogation Document and prepare meeting in Dec 2012 Jan 2013: meeting with MSSS security staff Jan 2013: regional security coordinator refusal Feb 2013: CRCHUQ director writes to deputy minister of MSSS IT July 2013: Deputy minister (MSSS IT) visits UL/CQ Derogation done. 19
  • 20. Calcul Québec - Université Laval Network Measurements perfSonar for periodic network measurements 20
  • 21. Calcul Québec - Université Laval Archival Use existing tape archives at CQ Plenty of network bandwidth (for now…) 21
  • 22. V1.0Calcul Québec - Université Laval RFP 22
  • 23. Calcul Québec - Université Laval Building the RFP An iterative process Based on multiple meetings with researchers + Expertise and market knowledge of local HPC team 23 Vendors RFP Researcher HPC team
  • 24. Calcul Québec - Université Laval Premise 2 storage systems with different requirements in very different environments Parallel storage Large and high-speed in modern datacenter with plenty of power and cooling On-site storage Smaller capacity with slower interconnect in air-conditioned server room 24
  • 25. Calcul Québec - Université Laval Challenges Budget is limited. We want to get the most out of it! But the most of what? Parallel storage capacity? Parallel storage write speed? On-site storage capacity? etc… 25
  • 26. Calcul Québec - Université Laval Challenges (cont.) Most of the budget to be allocated to the parallel storage To enable computing and mid-term storage On-site storage must be large enough. No more. A quality based RFP allows for such distinctions 26
  • 27. Calcul Québec - Université Laval How large is large enough The sequencing platform could generate 10TB of data per week when operating at full capacity 40TB would provide 1 month of buffering 27 (Diagram: Sequencers → On-site Storage (buffering) → Parallel Storage, with automated data synchronisation)
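The sizing argument on this slide reduces to simple arithmetic. A hedged sketch, using only the 10 TB/week figure from the slide:

```python
def buffer_weeks(capacity_tb, output_tb_per_week):
    """How many weeks of sequencer output the on-site buffer can absorb."""
    return capacity_tb / output_tb_per_week

def sustained_mbps(tb_per_week):
    """Average link rate (Mbps) needed to drain tb_per_week continuously."""
    return tb_per_week * 1e12 * 8 / (7 * 86400) / 1e6

print(buffer_weeks(40, 10))        # -> 4.0 weeks, i.e. ~1 month of buffering
print(round(sustained_mbps(10)))   # -> 132 Mbps just to keep pace with output
```

The second number shows why the synchronisation link matters: even the average drain rate exceeds what the test node measured through the IS-QC firewall.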
  • 28. Calcul Québec - Université Laval Quality-based RFP We chose to publish a quality-based RFP In contrast to a lowest-bidder process Evaluated on cost + « quality criteria » Vendors are asked to spend at least 95% of the budget 28
  • 29. Calcul Québec - Université Laval Challenges (solution) Define 2 independent sets of requirements Use the « quality criteria » to let vendors know what they should prioritize More weight will be given to the parallel storage components 29
  • 30. Calcul Québec - Université Laval Hardware only or integrated solution? A) Hardware only: write an RFP to buy X TB of raw disk space + Y servers and the accompanying interconnect. Integrate everything in-house to deploy a storage system. B) Integrated solution: ask for a complete system meeting size and performance requirements. First things first 30
  • 31. Calcul Québec - Université Laval Integrated Solution Cumbersome question: should we ask for a specific parallel FS (Lustre, GPFS or anything else)? Some parallel FS are tied to a specific vendor or a very small set of vendors Went with Lustre because it is a multi-vendor ecosystem … and our team is already familiar with it 31
  • 32. Calcul Québec - Université Laval Fostering competition The RFP can be so specific as to open the door only to a single product Or it can let bidders come up with their own solution to our problem 32 (Diagram: specific product vs. surprise…)
  • 33. Calcul Québec - Université Laval Fostering competition (cont.) Vendors know when an RFP is targeted to them They will price accordingly Conversely, vendors will not bid if they do not feel they have a fair chance Fewer bids often mean « higher prices » A less constrained RFP will generally attract more proposals 33
  • 34. Calcul Québec - Université Laval Fostering competition (cont.) Example of being too specific: « Storage units with 60 drives in raid5, 8+2 configuration » ! Such a statement could apply to a single vendor, while limiting the available technologies 34
  • 35. Calcul Québec - Université Laval Spec'ing a storage system Power & Cooling capacity Physical space and room topology Compatibility with existing infrastructure Software Physical 35
  • 36. Calcul Québec - Université Laval Physical infrastructure 36 Document floor/rack plan Maximum weight per square foot ? How much space do we actually have ? Where does the system need to connect? Both power and interconnect Cable length
  • 37. Calcul Québec - Université Laval Power & Cooling 37 How much electrical capacity is available? Total? Per rack? UPS? Can our room cooling system handle that much new power?
  • 38. Calcul Québec - Université Laval Requirements for parallel storage 1 PB usable (or more) Lustre FS Compatible with Lustre clients 1.8.9 and 2.4.x 20 GB/s aggregate read/write speed (or more) Drives and Lustre servers redundancy « how » is purposely left unspecified Infiniband interconnect 2:1 blocking factor with computing resources 38
  • 39. Calcul Québec - Université Laval Requirements for parallel storage Vendor to provide all interconnect Leaf IB switch, ethernet switch for management and cabling Site provides uplink to core switches 20 kW maximum electrical consumption Vendor to supply PDUs (switched) Site to connect PDUs to existing electrical infrastructure 39
  • 40. Calcul Québec - Université Laval Requirements for on-site storage Export network filesystem Compatible with sequencers, Windows 7, Linux and Mac 10G Ethernet interconnect 50TB usable capacity (or more) with option to grow up to 300TB Drives and servers redundancy 40
  • 41. Calcul Québec - Université Laval Requirements for on-site storage Site to provide all cabling and interconnect for on-site storage PDUs and rack space provided by the site 41
  • 42. Calcul Québec - Université Laval Measuring the quality of a proposal Final evaluation is based on an « adjusted price » calculated from the bid price and the rating of the « quality criteria » given by the evaluation committee The « adjusted price » can vary from the real price by up to 30% 42
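The exact formula is not given on the slide. A minimal sketch, assuming the standard Québec quality-based adjustment (passing score of 70 and adjustment parameter K of 30%, matching the "up to 30%" figure above; both are assumptions, not taken from the slides):

```python
# Assumed formula: adjusted = bid_price / (1 + K * (score - 70) / 30),
# so a bid at the passing score keeps its real price, and each quality
# point above 70 makes the bid look proportionally cheaper to the committee.
def adjusted_price(bid_price, quality_score, k=0.30):
    return bid_price / (1 + k * (quality_score - 70) / 30)

print(round(adjusted_price(1_000_000, 70)))   # passing score: no adjustment
print(round(adjusted_price(1_000_000, 100)))  # perfect score: ~769,231
```

Under this scheme a high-quality bid can beat a cheaper but lower-rated one, which is exactly the lever the quality criteria on the next slide are meant to pull.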
  • 43. Calcul Québec - Université Laval Quality criteria 43 Parallel Storage: 45% On-site Storage: 20% Interconnect & Networking: 10% Vendor’s Experience & Reputation: 25%
  • 44. Calcul Québec - Université Laval Quality criteria (cont.) In the first three categories, meeting the base requirements gives a passing score of 70%. Any specs or meaningful features above base requirements improve the mark. 44
  • 45. Calcul Québec - Université Laval Quality criteria (cont.) In the « vendor » category, the score is based on the bidder’s experience in deploying similar systems, with a requirement for at least 1 such system in the past 18 months. The support structure and the resume of the lead architect for the project are also a factor. 45
  • 46. V1.0Calcul Québec - Université Laval Benchmarks & stability tests 46
  • 47. Calcul Québec - Université Laval Acceptance tests We define stability tests to validate the system can operate in a real production environment. ! We run synthetic benchmarks to make sure the system hits the performance targets set by the vendor as requested by the quality criteria. 47
  • 48. Calcul Québec - Université Laval Stability tests To validate normal operation Homogeneous firmware and software versions everywhere No errors or warnings Verify the system reboots cleanly Lustre mounts properly Simulate drive failures Verify rebuild process 48
  • 49. Calcul Québec - Université Laval Benchmarks We set some base rules No custom tools. Re-use existing software Let the vendor tune the tests for their system But tests must be large enough to avoid cache effects What to benchmark Read/write speed of a single target: IOzone Maximum aggregate read/write speed: IOR Maximum I/O operations per second (IOPS): mdtest 49
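A minimal sketch of the acceptance step that follows these benchmarks: comparing the numbers reported by the IOR/mdtest/IOzone runs against the committed floors. The metric names are illustrative; the 20 GB/s floor is the RFP minimum from the earlier requirements slide:

```python
# Vendor-committed minimums (illustrative names; floors from the RFP spec)
TARGETS = {
    "aggregate_write_GBps": 20.0,
    "aggregate_read_GBps": 20.0,
}

def passes(measured, targets=TARGETS):
    """Return the list of metrics that fell short of their target floor."""
    return [name for name, floor in targets.items()
            if measured.get(name, 0.0) < floor]

# Example with numbers in the range the selected system actually delivered:
print(passes({"aggregate_write_GBps": 30.0, "aggregate_read_GBps": 28.5}))  # -> []
```

Since the quality criteria reward vendors for committing to numbers above the RFP floors, the acceptance check is against the vendor's committed figures, not just the RFP minimums.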
  • 50. V1.0Calcul Québec - Université Laval RFP results 50
  • 51. Calcul Québec - Université Laval Bids We got 6 valid proposals Parallel storage capacity varied by more than 60% across bids Aggregate speed for parallel storage varied by almost 50% On-site capacity varied by almost 100% On-site storage offerings ranged from a NAS on ZFS to full-fledged Lustre or GPFS systems 51
  • 52. Calcul Québec - Université Laval System selected Parallel storage: Xyratex CS6000 1.4 PB usable Lustre FS 12 OSS and 4 targets per OSS 4TB NL-SAS drives + SSD for journals 30 GB/s maximum aggregate R/W speed On-site storage: Xyratex CS1500 120TB usable Lustre FS (scales to 7 PB) 4 CIFS/NFS exporters 52
  • 53. V1.0Calcul Québec - Université Laval Deployment 53
  • 54. Calcul Québec - Université Laval Operation Both systems in production since early February Parallel storage dedicated to the research group, mounted on compute resources Data transfers are enabled by Globus endpoints on dedicated DTNs at both sites. Todo: Review network topology for transfers perfSonar nodes to be deployed at the research center 54
  • 55. Calcul Québec - Université Laval Operation (cont.) Researchers need a CC account to access the Parallel Storage Access control and allocations are a challenge Shared spreadsheet filled in by the research center to allocate space on the parallel FS for their users (cumbersome!) Integration with the CCDB would leverage an existing system to manage storage allocations 55
  • 56. V1.0Calcul Québec - Université Laval Lessons learned 56
  • 57. Calcul Québec - Université Laval Lessons learned Time consuming (a 2-year project) Mostly trust and relationship building Time needed to write an RFP should not be underestimated Benefits for the research group Access to a team of specialists to lead their project Major cost savings on the infrastructure. No investment to upgrade an existing server room (UPS, power, cooling, etc.) 57
  • 58. Calcul Québec - Université Laval Cost to integrate CS6000 Installation: $900 (rack enclosure) Power: $1,457 (new outlets) Cooling: $0 Infiniband: used existing cables 6 CXP - QSFP cables (18 QDR links) 58
  • 59. Calcul Québec - Université Laval Improving the process Sharing RFPs between Compute Canada sites could ease the process for new projects Common benchmarks across Compute Canada would help when designing acceptance tests Applies to both storage and computing 59