Best Practices for Ceph-Powered Implementations of Storage as-a-Service
Kamesh Pemmaraju, Sr. Product Mgr, Dell 
Ceph Developer Day, New York City, Oct 2014
Outline 
• Planning your Ceph implementation 
• Ceph Use Cases 
• Choosing targets for Ceph deployments 
• Reference Architecture Considerations 
• Dell Reference Configurations 
• Customer Case Study
Planning your Ceph Implementation 
• Business Requirements 
– Budget considerations, organizational commitment 
– Avoiding lock-in – use open source and industry standards 
– Enterprise IT use cases 
– Cloud applications/XaaS use cases for massive-scale, cost-effective storage 
• Sizing requirements (a back-of-the-envelope sizing sketch follows this list) 
– What is the initial storage capacity? 
– Is data usage steady-state or spiky? 
– What is the expected growth rate? 
• Workload requirements 
– Does the workload need high performance, or is it more capacity-focused? 
– What are IOPS/Throughput requirements? 
– What type of data will be stored? 
– Ephemeral vs. persistent data, Object, Block, File? 
• Ceph is like a Swiss Army knife – it can be tuned for a wide variety of use cases. Let us look at some of them.
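
The sizing questions above usually lead to a quick back-of-the-envelope calculation. Below is a minimal, hypothetical sketch (every number and the 3x replication assumption are placeholders, not recommendations) of turning an initial capacity, a growth rate, and a replication factor into the raw capacity to provision:

```python
# Hypothetical sizing sketch: raw capacity to provision for a target usable
# capacity after growth, with replication overhead and fill-ratio headroom.
# Every number below is an example input, not a recommendation.

def projected_usable_tb(initial_tb, annual_growth, years):
    """Usable capacity needed after compounding annual growth."""
    return initial_tb * (1 + annual_growth) ** years

def raw_tb_to_provision(usable_tb, replicas=3, fill_ratio=0.7):
    """Raw TB so the cluster stays under fill_ratio at that usable capacity."""
    return usable_tb * replicas / fill_ratio

usable = projected_usable_tb(initial_tb=100, annual_growth=0.40, years=3)
print(f"Usable capacity needed in 3 years: ~{usable:.0f} TB")
print(f"Raw capacity to provision (3x replication): ~{raw_tb_to_provision(usable):.0f} TB")
```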
Ceph is like a Swiss Army Knife – it can fit in a wide variety of target use cases 
(Quadrant diagram spanning Capacity vs. Performance and Traditional IT vs. Cloud Applications, mapping Ceph targets across:) 
• Virtualization and Private Cloud (traditional SAN/NAS) 
• High Performance (traditional SAN) 
• NAS & Object Content Store (traditional NAS) 
• XaaS Compute Cloud (Open Source Block) 
• XaaS Content Store (Open Source NAS/Object)
USE CASE: OPENSTACK 
(Diagram: Ceph backing OpenStack storage – volumes, ephemeral disks, and copy-on-write snapshots)
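
To make the diagram's volumes and copy-on-write snapshots concrete, here is a hedged sketch using the python-rados and python-rbd bindings: it creates a format-2 image, protects a snapshot, and clones it copy-on-write. Pool and image names are assumptions, and a reachable cluster with a client keyring is presumed.

```python
# Hedged sketch of the volume / copy-on-write snapshot flow in the diagram,
# using the python-rados and python-rbd bindings.
import rados
import rbd

cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
cluster.connect()
ioctx = cluster.open_ioctx("rbd")                       # pool name is an assumption
try:
    # A format-2 "golden" image with layering enabled so it can be cloned.
    rbd.RBD().create(ioctx, "golden-image", 10 * 1024 ** 3,
                     old_format=False, features=rbd.RBD_FEATURE_LAYERING)
    image = rbd.Image(ioctx, "golden-image")
    try:
        image.create_snap("base")                       # point-in-time snapshot
        image.protect_snap("base")                      # clones require a protected snap
    finally:
        image.close()
    # Copy-on-write clone: shares unmodified blocks with the snapshot, which is
    # what makes provisioning many volumes/instances from one image fast.
    rbd.RBD().clone(ioctx, "golden-image", "base", ioctx, "vm-0001-disk",
                    features=rbd.RBD_FEATURE_LAYERING)
finally:
    ioctx.close()
    cluster.shutdown()
```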
USE CASE: CLOUD STORAGE 
(Diagram: S3/Swift clients accessing the cluster)
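
A hedged sketch of the S3 path shown above, using the classic boto library against a RADOS Gateway endpoint; the endpoint, credentials, and bucket/object names are placeholders for a real radosgw user and DNS name.

```python
# Hedged sketch: talking to the RADOS Gateway's S3-compatible API with boto.
import boto
import boto.s3.connection

conn = boto.connect_s3(
    aws_access_key_id="ACCESS_KEY",          # radosgw user credentials (placeholders)
    aws_secret_access_key="SECRET_KEY",
    host="rgw.example.com",                  # gateway endpoint (assumption)
    is_secure=False,
    calling_format=boto.s3.connection.OrdinaryCallingFormat(),
)

bucket = conn.create_bucket("research-data")
key = bucket.new_key("dataset-001.txt")
key.set_contents_from_string("example payload")   # upload an object
for obj in bucket.list():                          # list what the gateway stores
    print(obj.name, obj.size)
```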
USE CASE: WEBSCALE APPLICATIONS 
(Diagram: applications accessing the cluster over the native protocol)
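
A minimal sketch of what "native protocol" access looks like from an application, using the librados Python binding; the pool name, object key, and payload are illustrative.

```python
# Minimal sketch of native-protocol access from an application via librados.
import rados

cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
cluster.connect()
ioctx = cluster.open_ioctx("app-data")                  # pool name is an assumption
try:
    ioctx.write_full("user:1234:profile", b'{"name": "example"}')   # whole-object write
    print(ioctx.read("user:1234:profile"))                           # read it back
    ioctx.set_xattr("user:1234:profile", "content-type", b"application/json")
finally:
    ioctx.close()
    cluster.shutdown()
```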
USE CASE: PERFORMANCE BLOCK 
(Diagram: Ceph storage cluster serving block reads and writes)
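
As an illustration of the block read/write path in the diagram, a hedged librbd sketch that creates a scratch image, writes one extent, and reads it back; pool and image names are assumptions.

```python
# Hedged illustration of the block read/write path via python-rbd.
import rados
import rbd

cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
cluster.connect()
ioctx = cluster.open_ioctx("rbd")                       # pool name is an assumption
rbd.RBD().create(ioctx, "perf-scratch", 1 * 1024 ** 3)  # 1 GiB scratch image
image = rbd.Image(ioctx, "perf-scratch")
try:
    chunk = b"\xab" * (4 * 1024 * 1024)                 # a 4 MiB write at offset 0
    image.write(chunk, 0)
    assert image.read(0, len(chunk)) == chunk           # read the same extent back
finally:
    image.close()
    ioctx.close()
    cluster.shutdown()
```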
USE CASE: ARCHIVE / COLD STORAGE 
(Diagram: Ceph storage cluster)
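
Archive tiers are where erasure-coded pools (a Firefly feature mentioned later in this deck) typically land. A hedged CLI sketch, driven from Python for consistency with the other examples; the profile name, k/m values, PG counts, and pool name are placeholders.

```python
# Hedged sketch (not a turnkey script): creating an erasure-coded pool for an
# archive tier via the ceph CLI, against a Firefly-or-later cluster.
import subprocess

def ceph(*args):
    """Run one ceph CLI command, echoing it and failing loudly on error."""
    cmd = ["ceph", *args]
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# 4 data chunks + 2 coding chunks: tolerates two lost OSDs at 1.5x raw overhead.
ceph("osd", "erasure-code-profile", "set", "archive-profile", "k=4", "m=2")
ceph("osd", "pool", "create", "archive-pool", "128", "128", "erasure", "archive-profile")
```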
USE CASE: DATABASES 
(Diagram: database hosts accessing the cluster over the native protocol)
USE CASE: HADOOP 
(Diagram: Hadoop nodes accessing the cluster over the native protocol)
Architectural considerations – Redundancy and replication 
• Trade-off between cost and reliability (use-case dependent) 
• Use CRUSH rules to map out your failure domains and performance pools (see the sketch after this list) 
• Failure domains 
– Disk (OSD and OS) 
– SSD journals 
– Node 
– Rack 
– Site (replication at the RADOS level, Block replication, consider latencies) 
• Storage pools 
– SSD pool for higher performance 
– Capacity pool 
• Plan for failure domains of the monitor nodes 
• Consider failure replacement scenarios, lowered redundancies, and performance 
impacts
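
A hedged sketch of the CRUSH plumbing behind an SSD performance pool alongside a default capacity pool, expressed as ceph CLI calls driven from Python. The bucket, host, rule, and pool names are assumptions, the PG counts are placeholders, and the numeric id passed to crush_ruleset must be looked up on a real cluster (e.g. from `ceph osd crush rule dump`).

```python
# Hedged sketch (not a turnkey script) of CRUSH setup for an SSD pool.
import subprocess

def ceph(*args):
    """Run one ceph CLI command, echoing it and failing loudly on error."""
    cmd = ["ceph", *args]
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# A dedicated CRUSH root for SSD-backed hosts...
ceph("osd", "crush", "add-bucket", "ssd", "root")
ceph("osd", "crush", "move", "ssd-host-01", "root=ssd")   # repeat per SSD host

# ...a simple replicated rule placing copies on distinct hosts under it...
ceph("osd", "crush", "rule", "create-simple", "ssd-rule", "ssd", "host")

# ...and a performance pool bound to that rule; the capacity pool keeps the default.
ceph("osd", "pool", "create", "ssd-pool", "512", "512")
ceph("osd", "pool", "set", "ssd-pool", "crush_ruleset", "1")  # id from rule dump;
# newer releases use `ceph osd pool set ssd-pool crush_rule ssd-rule` instead.
```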
Server Considerations 
• Storage node: 
– One OSD per HDD, 1–2 GB RAM per OSD, and roughly 1 GHz of a core per OSD (see the sizing sketch after this list) 
– SSDs for journaling and for SSD cache pools (tiering) in Firefly 
– Erasure coding will increase usable capacity at the expense of additional compute load 
– SAS JBOD expanders for extra capacity (beware of extra latency, oversubscribed SAS lanes, and a large footprint per failure zone) 
• Monitor nodes (MON): use an odd number for quorum; the service can be hosted on storage nodes for smaller deployments, but larger installations will need dedicated nodes 
• Dedicated RADOS Gateway nodes for large object store deployments and for federated, multi-site gateways
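
The per-OSD rules of thumb above translate directly into per-node RAM and CPU budgets; a minimal sketch (the 12-drive node is an example, not a recommendation):

```python
# Minimal sketch of the per-node rule of thumb: one OSD per HDD,
# 1-2 GB of RAM per OSD and roughly 1 GHz of CPU per OSD.

def node_requirements(hdd_count, ram_gb_per_osd=2.0, ghz_per_osd=1.0):
    """Return (RAM in GB, aggregate CPU in GHz) for one OSD per data drive."""
    return hdd_count * ram_gb_per_osd, hdd_count * ghz_per_osd

ram_gb, cpu_ghz = node_requirements(hdd_count=12)
print(f"12-OSD node: ~{ram_gb:.0f} GB RAM and ~{cpu_ghz:.0f} GHz of aggregate CPU")
# e.g. two 8-core 2.0 GHz CPUs give 32 GHz of headroom versus the ~12 GHz floor.
```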
Networking Considerations 
• Dedicated or Shared network 
– Be sure to involve the networking and security teams early when designing your 
networking options 
– Network redundancy considerations 
– Dedicated client and OSD networks (see the bandwidth sketch after this list) 
– VLANs vs. dedicated switches 
– 1 GbE vs. 10 GbE vs. 40 GbE! 
• Networking design 
– Spine and Leaf 
– Multi-rack 
– Core fabric connectivity 
– WAN connectivity and latency issues for multi-site deployments
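
One reason the dedicated-OSD-network and link-speed bullets matter: every client write is re-transmitted to the replicas over the cluster network. A minimal sketch with example numbers:

```python
# Back-of-the-envelope sketch of replication traffic on the OSD/cluster network.
# With N-way replication, each client write is re-sent (N - 1) more times.

def osd_network_write_load(client_write_mb_s, replicas=3):
    """Extra replication traffic on the cluster (OSD) network, in MB/s."""
    return client_write_mb_s * (replicas - 1)

client_writes_mb_s = 800       # example: client writes landing on one node
backend_mb_s = osd_network_write_load(client_writes_mb_s)
print(f"Cluster-network replication load: ~{backend_mb_s} MB/s "
      f"(~{backend_mb_s * 8 / 1000:.1f} Gb/s)")
# 800 MB/s of client writes with 3x replication already exceeds a 10 GbE link,
# which is the argument for separate client/OSD networks and faster links.
```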
Ceph additions coming to the Dell Red Hat OpenStack solution 
Pilot configuration components 
• Dell PowerEdge R620/R720/R720XD servers 
• Dell Networking S4810/S55 switches, 10 GbE 
• Red Hat Enterprise Linux OpenStack Platform 
• Dell ProSupport 
• Dell Professional Services 
• Available with or without High Availability (HA) 
Specs at a glance 
• Node 1: Red Hat OpenStack Manager 
• Node 2: OpenStack Controller (2 additional controllers for HA) 
• Nodes 3-8: OpenStack Nova Compute 
• Nodes 9-11: Ceph, 12 x 3 TB raw storage 
• Network switches: Dell Networking S4810/S55 
• Supports ~170-228 virtual machines 
Benefits 
• Rapid on-ramp to OpenStack cloud 
• Scale up, modular compute and storage blocks 
• Single point of contact for solution support 
• Enterprise-grade OpenStack software package 
Example Ceph Dell Server Configurations 
Type: Performance – 20 TB 
• R720XD 
• 24 GB DRAM 
• 10 x 4 TB HDD (data drives) 
• 2 x 300 GB SSD (journal) 
Type: Capacity – 44 TB / 105 TB* 
• R720XD 
• 64 GB DRAM 
• 10 x 4 TB HDD (data drives) 
• 2 x 300 GB SSD (journal) 
• MD1200 
• 12 x 4 TB HDD (data drives) 
Type: Extra Capacity – 144 TB / 240 TB* 
• R720XD 
• 128 GB DRAM 
• 12 x 4 TB HDD (data drives) 
• MD3060e (JBOD) 
• 60 x 4 TB HDD (data drives) 
* The larger figure applies when erasure coding is in use (see the Editor's Notes).
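
How the asterisked figures relate to raw disk, using the erasure-coding factor of ~1.2 given in the speaker notes; a minimal check against the Extra Capacity bundle's drive counts:

```python
# Minimal check: usable = raw / overhead, with overhead 2.0 for 2x replication
# and ~1.2 for the erasure-coded case cited in the speaker notes.
# Drive counts follow the Extra Capacity row.

raw_tb = (12 + 60) * 4   # R720XD 12 x 4 TB + MD3060e 60 x 4 TB = 288 TB raw
print(f"2x replication: ~{raw_tb / 2.0:.0f} TB usable")                  # ~144 TB
print(f"Erasure coded (~1.2x overhead): ~{raw_tb / 1.2:.0f} TB usable")  # ~240 TB
```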
What Are We Doing to Enable This? 
• Dell, Red Hat, and Inktank have partnered to deliver a complete enterprise-grade storage solution for RHEL-OSP + Ceph 
• The joint solution provides: 
– Co-engineered and validated Reference Architecture 
– Pre-configured storage bundles optimized for performance or 
storage 
– Storage enhancements to existing OpenStack Bundles 
– Certification against RHEL-OSP 
– Professional Services, Support, and Training 
› Collaborative Support for Dell hardware customers 
› Deployment services & tools
UAB Case Study
Overcoming a data deluge 
A US university specializing in cancer and genomics research 
• 900 researchers 
• Data sets challenging resources 
• Research data scattered everywhere 
• Transferring datasets took forever and clogged 
shared networks 
• Distributed data management reduced 
productivity and put data at risk 
• Needed centralized repository for compliance 
Research Computing System (Originally) 
A collection of grids, proto-clouds, tons of virtualization, and DevOps 
(Diagram: HPC clusters and HPC storage connected over DDR/QDR InfiniBand and 1 Gb Ethernet to the University Research Network; interactive services; research data scattered across laptops, thumb drives, and local servers)
Solution: a scale-out storage cloud 
Based on OpenStack and Ceph 
• Housed and managed centrally, accessible 
across campus network 
− File system + cluster, can grow as big as you want 
− Provisions from a massive common pool 
− 400+ TBs at less than 41¢/GB; scalable to 5PB 
• Researchers gain 
− Work with larger, more diverse data sets 
− Save workflows for new devices & analysis 
− Qualify for grants due to new levels of protection 
• Demonstrating utility with applications 
− Research storage 
− CrashPlan (cloud backup) on the POC 
− GitLab hosting on the POC 
“We’ve made it possible for users to 
satisfy their own storage needs with 
the Dell private cloud, so that their 
research is not hampered by IT.” 
David L. Shealy, PhD 
Faculty Director, Research Computing 
Chairman, Dept. of Physics 
Research Computing System (Today) 
Centralized storage cloud based on OpenStack and Ceph 
(Diagram: five Ceph nodes plus a POC OpenStack node forming the cloud services layer – a virtualized server and storage computing cloud based on OpenStack, Crowbar, and Ceph – connected over 10 Gb Ethernet to the University Research Network, alongside the existing HPC clusters and HPC storage on DDR/QDR InfiniBand)
Building a research cloud 
Project goals extend well beyond data management 
• Designed to support emerging 
data-intensive scientific computing paradigm 
− 12 x 16-core compute nodes 
− 1 TB RAM, 420 TBs storage 
− 36 TBs storage attached to each compute node 
• Individually customized test/development/ 
production environments 
− Direct user control over all aspects of the 
application environment 
− Rapid setup and teardown 
• Growing set of cloud-based tools & services 
− Easily integrate shareware, open source, and 
commercial software 
“We envision the OpenStack-based 
cloud to act as the gateway to our 
HPC resources, not only as the 
purveyor of services we provide, but 
also enabling users to build their own 
cloud-based services.” 
John-Paul Robinson, System Architect 
Research Computing System (Next Gen) 
A cloud-based computing environment with high-speed access to dedicated and dynamic compute resources 
(Diagram: an expanded cloud services layer of OpenStack and Ceph nodes – a virtualized server and storage computing cloud based on OpenStack, Crowbar, and Ceph – connected over 10 Gb Ethernet to the University Research Network, alongside the HPC clusters and HPC storage on DDR/QDR InfiniBand)
THANK YOU!
Contact Information 
Reach Kamesh for additional information: 
Kamesh_Pemmaraju@Dell.com 
@kpemmaraju 
http://www.cloudel.com
Editor's Notes

  • #21 R720XD configurations use 4 TB data drives, 2 x 300 GB OS drives, 2 x 10 Gb NICs, iDRAC7 Enterprise, LSI 9207-8i/-8e HBAs, and 2 x E5-2650 2 GHz processors. (*) The larger capacity figure is for the case where erasure coding is in use; to get the same redundancy as 2x replication, erasure coding uses a factor of roughly 1.2. Erasure coding is a feature of the Ceph Firefly release, which is in its final phase of development. Additional performance could be gained by adding either Intel's CAS or Dell's FluidFS DAS caching software packages, at the cost of additional memory and processing overhead and more deployment/installation work (they would have to be installed and configured).
  • #24 https://dev.uabgrid.uab.edu/wiki/OpenStackPlusCeph
    The research computing system (RCS) is built on a collection of distinct hardware systems designed to provide specific services to applications. The RCS hardware includes dedicated compute fabrics that support high performance computing (HPC) applications where hundreds of compute cores can work together on a single application. These clusters of commodity compute hardware make it possible to do data analysis and modelling work in hours, work that would have taken months using a single computer. The clusters are connected with dedicated high bandwidth, low latency networks so that applications can efficiently coordinate their actions across many computers and access a shared high speed storage system for working efficiently with terabytes of data.
    Our newest hardware fabric, acquired 2012Q4, is designed to support emerging data intensive scientific computing and virtualization paradigms. This hardware is very similar to the commodity computers used by our traditional HPC fabrics; however, in addition to having many compute cores and lots of RAM, each individual computer contains 36TB of built-in disk storage. Taken together, this newest hardware fabric adds 192 cores, 1TB RAM, and 420TB of storage to the RCS. The built-in disk storage is designed to support applications running local to each computer. The data intensive computing paradigm exchanges the external storage networks of traditional HPC clusters for the native, very high speed system buses that provide access to local hard disks in each computer. Large datasets are distributed across these computers, and applications are then assigned to run on the specific computer that stores the portion of the dataset they have been assigned to analyze.
    The hardware requirements for data intensive computing closely resemble the requirements for virtualization and can benefit tremendously from the configuration flexibility that a virtualization fabric offers. In order to enhance flexibility and further improve support for scaling research applications, we are engineering our latest hardware cluster to act as a virtualized storage and compute fabric. This enables support for a wide variety of storage and compute use cases, most prominently ample storage capacity for reliably housing large research data collections and flexible application development and deployment capabilities that allow direct user control over all aspects of the application environment. In short, we are tooling this hardware to build a cloud computing environment. We are building this cloud using OpenStack for compute virtualization and Ceph for storage virtualization. Crowbar will provision the raw hardware fabric. This approach is very similar to the model we have been following with our traditional ROCKS-based HPC cluster environment. The new approach enhances our ability to automatically provision hardware and further improves the economics of large-scale computing.
    We are implementing this environment with Dell and Inktank. These vendors, and the upstream open source projects on which this platform is built, embrace the DevOps model for systems development. This will support further engineering collaboration with our vendors, enabling the UAB research community to continually enhance our fabric as needed and feed those enhancements upstream for inclusion in future support releases. This solution rounds out the feature set of the RCS core and will provide a general framework to scale future growth.
  • #25 User base: 900+ researchers across campus. KVM-based; 2 Nova nodes, 4 primary storage nodes, 4 replication nodes, 2 control nodes; 12 x R720XD systems.