Ceph Storage and Penguin Computing on Demand

Presentation to Ceph Days Santa Clara on 9/12/2013

Transcript

  • 1. Ceph and Penguin Computing On Demand
       Travis Rhoden
  • 2. Who is Penguin Computing?
       ● Founded in 1997, with a focus on custom Linux systems
       ● Core markets: HPC, enterprise/data-center, and HPC cloud services
         – We see about a 50/50 mix between HPC and enterprise orders
         – We offer turn-key clusters and a full range of Linux servers
       ● Now the largest private system integrator in North America
       ● Stable, profitable, growing...
  • 3. What is Penguin Computing On Demand (POD)?
       ● POD launched in 2009 as an HPC-as-a-Service offering
       ● Purpose-built HPC cluster for on-demand customers
         – Offers low-latency interconnects, high core counts, and plentiful RAM for processing
         – Non-virtualized compute resources, focused on absolute compute performance
         – Tuned MPI/cluster stack available “out of the box”
       ● “Pay as you go” – pay only for what you use, charged per core-hour
       ● Customizable, persistent user environment
       ● Over 50 million commercial jobs run
  • 4. Original POD designs
       ● Original clusters used standalone DAS NFS servers
       ● Login nodes ran on VMware, then KVM, with disks stored locally on the host
  • 5. Original POD limitations
       ● Disparate NFS servers led to a non-global namespace
         – Users were unable to take advantage of all installed storage
         – Not all disks could contribute to performance (no scale-out effect)
         – A full NFS server affected all co-resident users
         – The NFS server RAID card was a SPoF
           ● Never lost data, but did have times when data was inaccessible
       ● VM login nodes were handled by a standalone set of hardware
         – Storage servers were not leveraged for hosting VM disks
  • 6. POD New Architecture
       ● Time for something different
         – More expandable
         – More fault tolerant
         – More flexible
       ● OpenStack & Ceph
  • 7. POD Ceph Usage – OpenStack
       ● Ceph/OpenStack integration is a big plus (see the configuration sketch below)
         – Store disk images in Ceph (Glance)
         – Store volumes in Ceph (Cinder)
         – Boot VMs straight from Ceph (boot from volume)
         – Leverage COW semantics for boot-volume creation
         – Live migration
       ● No immediate need for RADOSGW
         – Nice to know it's there if we need it
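    A minimal sketch of the kind of Glance and Cinder RBD configuration this integration implies; the pool names, Ceph user names, and secret UUID below are assumptions, not values from the presentation:

        # /etc/glance/glance-api.conf – store images in a Ceph pool (assumed name "images")
        default_store = rbd
        rbd_store_ceph_conf = /etc/ceph/ceph.conf
        rbd_store_user = glance
        rbd_store_pool = images

        # /etc/cinder/cinder.conf – back volumes with RBD (assumed pool "volumes")
        volume_driver = cinder.volume.drivers.rbd.RBDDriver
        rbd_pool = volumes
        rbd_user = cinder
        rbd_secret_uuid = <libvirt secret UUID for the cinder Ceph key>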
  • 8. POD Ceph Usage – RBD
       ● The same storage system hosts RBDs for us
       ● Each POD user has their $HOME in an RBD (see the sketch below)
         – To make it visible to all compute nodes and customer-accessible login nodes, we mount the RBD on one of several NFS servers and export it from there
         – We aren't quite ready to throw our full weight into CephFS, but early testing has started
         – We know this creates a performance bottleneck, but the pros outweigh the cons
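    A rough sketch of how one such per-user $HOME could be provisioned and re-exported over NFS; the pool name, image name, size, paths, and export network are assumptions:

        # create and map an RBD for one user's $HOME (assumed pool "homes")
        rbd create homes/user1 --size 51200        # 50 GB, thin provisioned
        rbd map homes/user1
        mkfs.xfs /dev/rbd/homes/user1
        mkdir -p /export/home/user1
        mount /dev/rbd/homes/user1 /export/home/user1

        # export it from the NFS server to the compute and login nodes
        echo '/export/home/user1 10.0.0.0/16(rw,no_root_squash,sync)' >> /etc/exports
        exportfs -ra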
  • 9. POD Ceph Usage – RBD Pros and Cons
       ● Pros
         – Thin provisioning
         – Per-user backups and snapshots (see the sketch below)
         – A clean 1:1 mapping of block device to export
       ● Cons
         – The NFS server is a SPoF and a bottleneck
         – Loss of parallel access to OSDs
         – Slow-ish resize
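    For illustration, per-user snapshots and image resize come down to standard rbd commands; the image name, snapshot name, and mount point are assumptions:

        # snapshot one user's $HOME image
        rbd snap create homes/user1@2013-09-12

        # grow the image, then grow the filesystem on top of it
        rbd resize homes/user1 --size 102400
        xfs_growfs /export/home/user1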
  • 10. POD Storage Hardware
       ● Started with 5x Penguin Computing IB2712 chassis
         – Dual Xeon 5600-series
         – 48GB RAM
         – Dual 10GbE
         – 12x hot-swap 3.5” SATA drives
         – 2x internal SSDs for OS and OSD journals
           ● 6 journals on each SSD
       ● 60x 2TB → 120TB raw storage
         – 109TB available in Ceph
       ● XFS on OSDs (see the ceph.conf sketch below)
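    A minimal ceph.conf fragment along these lines, with SSD-backed journals and XFS-formatted OSDs; the journal size, hostname, and journal partition path are assumptions:

        [osd]
            osd journal size = 10240                 ; 10 GB journal per OSD
            osd mkfs type = xfs
            osd mount options xfs = rw,noatime,inode64

        [osd.0]
            host = storage01
            osd journal = /dev/ssd0p1                ; journal partition on an internal SSD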
  • 11. POD Ceph Storage Config
       ● Running 3 monitors
         – On the same chassis as the OSDs (not recommended by Inktank)
       ● Running 2 MDS processes
         – On the same chassis as the OSDs
         – 1 active, 1 backup
       ● Each chassis has a 2-port 10GbE LAG to the ToR switch
       ● 2 replicas
       ● Separate pools for Glance, Cinder, and user $HOMEs (see the pool sketch below)
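    For illustration, separate two-replica pools can be created like this; the pool names and placement-group counts are assumptions:

        ceph osd pool create images 512
        ceph osd pool create volumes 512
        ceph osd pool create homes 512
        for pool in images volumes homes; do
            ceph osd pool set $pool size 2
        done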
  • 12. CephFS on POD
       ● The primary use case for storage on POD is users reading and writing data in their $HOME directory
       ● On our HPC clusters this tends to be primarily sequential writes, but we also see sequential reads and some random I/O
       ● Running VMs also produces random I/O
       ● Since users can run jobs spanning dozens of compute nodes, potentially all hitting the same folder(s), it would be nice to use CephFS rather than NFS
       ● Testing on a scratch space is a good way to start
       ● Using ceph-fuse, as the cluster runs CentOS 6.3 (mount sketch below)
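    Mounting such a scratch space with ceph-fuse is a one-liner; the monitor address and mount point are assumptions:

        mkdir -p /mnt/ceph-scratch
        ceph-fuse -m mon01:6789 /mnt/ceph-scratch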
  • 13. CephFS initial benchmarks
       ● Simple dd, 1GB file, 4MB blocks
         – dd if=/dev/zero of=[dir] bs=4M count=256 conv=fdatasync
  • 14. Ceph Lessons Learned
       ● This is our 3rd production Ceph cluster
         – The 1st has been decommissioned; it ran Argonaut and Bobtail and used IPoIB
         – The 2nd is being decommissioned, still running Bobtail
         – The 3rd is the primary workhorse for a production POD cluster; it launched on Bobtail and now runs the latest Cuttlefish
       ● For RBD, a very recent Linux kernel is a must if using the kernel client (kclient); see the sketch below
         – Pre-3.10 kernels had kernel-panic issues when using cephx
       ● SSDs are nice, but may not be the best bang for the buck
         – 3-4 OSD journals per SSD is ideal, but adds significant cost
         – We've seen promising results using higher-end RAID controllers in lieu of SSDs, thanks to their write-back cache, at an overall lower cost
         – We still need more testing to determine how this behavior carries over across sequential vs. random and small vs. large I/O
       ● Need to work hard to balance density against manageable failure domains
         – Density is very popular, but leads to a lot of recovery traffic if a server fails
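    As an illustration of the kernel-client path the cephx caveat refers to, mapping an RBD with the kernel driver looks roughly like this; the pool, image, user, and keyring path are assumptions:

        # requires a recent kernel with the rbd module; pre-3.10 kernels panicked with cephx
        modprobe rbd
        rbd map volumes/test-image --id admin --keyring /etc/ceph/ceph.client.admin.keyring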
  • 15. Thanks!
       @off_rhoden
       trhoden@penguincomputing.com
       @PenguinHPC