1. Scaling Ceph at CERN
Dan van der Ster (email@example.com)
Data and Storage Service Group | CERN IT Department
2. CERN’s Mission and Tools
●  CERN studies the fundamental laws of nature
○  Why do particles have mass?
○  What is our universe made of?
○  Why is there no antimatter left?
○  What was matter like right after the “Big Bang”?
●  The Large Hadron Collider (LHC)
○  Built in a 27km long tunnel, ~200m underground
○  Dipole magnets operated at -271°C (1.9K)
○  Particles do ~11’000 turns/sec, 600 million collisions/sec
○  Four main experiments, each the size of a cathedral
○  DAQ systems processing petabytes per second
3. Big Data at CERN
Physics Data on CASTOR/EOS
●  LHC experiments produce ~10GB/s
User Data on OpenAFS & DFS
●  Home directories for 30k users
●  Physics analysis development
●  Project spaces for applications
Service Data on AFS/NFS
●  Databases, admin applications
Tape archival with CASTOR/TSM
●  RAW physics outputs
●  Desktop/Server backups
Service    Size      Files
OpenAFS    290TB     2.3B
CASTOR     89.0PB    325M
EOS        20.1PB    160M
4. IT Evolution at CERN
Cloudifying CERN’s IT infrastructure ...
●  Centrally-managed and uniform hardware
○  No more service-specific storage boxes
●  OpenStack VMs for most services
○  Building for 100k nodes (mostly for batch processing)
●  Attractive desktop storage services
○  Huge demand for a local Dropbox, Google Drive …
●  Remote data centre in Budapest
○  More rack space and power, plus disaster recovery
… brings new storage requirements
●  Block storage for OpenStack VMs
○  Images and volumes
●  Backend storage for existing and new services
○  AFS, NFS, OwnCloud, Data Preservation, ...
●  Regional storage
○  Use of our new data centre in Hungary
●  Failure tolerance, data checksumming, easy to operate, security, ...
5. Ceph at CERN
6. 12 racks of disk server quads
Wiebalck / van der Ster -- Building an organic block storage service at CERN with Ceph
8. Use-Cases Being Evaluated
1.  Images and Volumes for OpenStack
2.  S3 Storage for Data Preservation / Public
3.  Physics data storage for archival and/or analysis
#1 is moving into production. #2 and #3 are more
exploratory at the moment.
9. OpenStack Volumes & Images
•  Glance: using RBD for ~3 months now.
•  Only issue was to increase ulimit -n above 1024 (e.g. to 10k)
•  Cinder: testing with close colleagues.
•  126 Cinder Volumes attached today – 56TB used
[Charts: growing number of volumes/images with current usage; usual traffic is ~50-100MB/s (~idle)]
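As an aside, a minimal Python sketch of the fd-limit check we mean (the 10k target is just our working assumption; the hard limit must be raised system-wide first, e.g. in limits.conf):

    # Sketch: verify the open-file limit before opening many librados/RBD
    # connections. setrlimit can only raise the soft limit up to the hard
    # limit, so the hard limit must already be raised system-wide.
    import resource

    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft < 10000:
        resource.setrlimit(resource.RLIMIT_NOFILE, (min(10000, hard), hard))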
10. RBD for OpenStack Volumes
•  Before general availability, we need to test and
enable qemu iops/bps throttling
•  Otherwise VMs with many IOs can disrupt others
•  One ongoing issue is that a few clients are
getting an (infrequent) segfault of qemu during
a VM reboot.
•  Happens on VMs with many attached RBDs.
•  Difficult to get a complete (16GB) core dump.
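For illustration, a hedged sketch of applying such throttles at runtime via the libvirt Python bindings (the domain name 'my-test-vm', the device 'vda', and the numbers are hypothetical; in production we would wire this through the Cinder/Nova configuration instead):

    # Sketch: cap a VM disk's IOPS and bandwidth so one noisy guest cannot
    # flood the OSDs. Requires a libvirt with the blkdeviotune API.
    import libvirt

    conn = libvirt.open('qemu:///system')
    dom = conn.lookupByName('my-test-vm')          # hypothetical domain
    params = {
        'total_iops_sec': 200,                     # IOPS cap
        'total_bytes_sec': 50 << 20,               # ~50 MB/s cap
    }
    dom.setBlockIoTune('vda', params, libvirt.VIR_DOMAIN_AFFECT_LIVE)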
11. CASTOR & XRootD/EOS
•  Exploring a RADOS backend for these two HEP-developed storage systems
•  Gateway model, similar to S3 via RADOSGW
•  CASTOR needs raw throughput performance (to feed
many tape drives at 250MBps each).
•  Striped RWs across many OSDs are important.
•  XRootD/EOS may benefit from the highly scalable
namespace to store O(billion) objects
•  Bonus: XRootD also offers http/webdav with X509/kerberos,
possibly even fuse mountable.
•  Developments are in early stages.
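To make the striping idea concrete, a sketch using the python-rados bindings: chunk a big file into 4MB objects so IO is spread over many OSDs (pool and object names are made up; a real gateway would use proper striping support such as libradosstriper):

    # Sketch: write a large file as a series of 4MB RADOS objects.
    import rados

    STRIPE = 4 << 20                               # 4MB, like RBD's default
    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    ioctx = cluster.open_ioctx('castor-test')      # hypothetical pool

    with open('/data/rawfile', 'rb') as f:
        idx = 0
        while True:
            chunk = f.read(STRIPE)
            if not chunk:
                break
            ioctx.write_full('rawfile.%08d' % idx, chunk)  # one object per chunk
            idx += 1

    ioctx.close()
    cluster.shutdown()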
12. Operations & Lessons Learned
13. Configuration and Deployment
•  Dumpling 0.67.7
•  Fully Puppet-ized
•  Automated server deployment,
automated OSD replacement
•  Very few custom ceph.conf options
•  Experimenting with the wbthrottle options
•  we find that disabling it completely gives better IOps
•  But don’t do this!!!
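For reference, the experiment amounts to a one-line ceph.conf change (shown only as a sketch; as said above, leave the throttle on in production):

    [osd]
    # experiment only: disable the filestore writeback throttle
    filestore wbthrottle enable = false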
14. Cluster Activity
15. General Comments…
•  In these ~7 months of running the cluster, there have been very few problems:
•  No outages
•  No data losses/corruptions
•  No unfixable performance issues
•  Behaves well during stress tests
•  But now we’re starting to get real/varied/creative users, and this
brings up many interesting issues...
•  “No amount of stress testing can prepare you for real users”
•  (point being, don’t take the next slides to be too negative – I’m
just trying to give helpful advice ;)
16. Latency & Slow Requests
•  Best latency we can achieve is 20-40ms
•  Slow SATA disks, no SSDs: hard to justify SSDs in a multi-PB cluster,
but could in a smaller limited use-case cluster (e.g. for Cinder-only)
•  Latency can increase dramatically with heavy usage
•  Don’t mix latency-bound and throughput-bound users on the same cluster
•  Local processes scanning the disks can hurt performance
•  Add /var/lib/ceph to the updatedb PRUNEPATH
•  If you have slow disks like us, you need to understand your disk IO
scheduler – e.g. deadline prefers reads over writes: writes are given a
5 second deadline vs. 500ms for reads!
•  Kernel tuning: vm.* sysctl, dirty page flushing, memory management
•  “Something is flushing the buffers, blocking the OSD processes”
•  Slow requests: monitor them, eliminate them.
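A small sketch for checking those scheduler deadlines (assumes the deadline scheduler is active on sda; values are in milliseconds):

    # Print the deadline scheduler expiry tunables; typical defaults are
    # read_expire=500 and write_expire=5000 -- the asymmetry noted above.
    base = '/sys/block/sda/queue/iosched'
    for knob in ('read_expire', 'write_expire'):
        with open('%s/%s' % (base, knob)) as f:
            print('%s = %s' % (knob, f.read().strip()))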
17. Life with 250 million objects
•  Recently, a user decided to write 250 million 1kB objects
•  Not so unreasonable: 250M * 4MB = 1PB, so this simulates the cluster
being full of RBD images, at least in terms of # objects
•  It worked – no big problems from holding this many objects.
•  Tested single OSD failure: ~7 hours to backfill, including a
double-backfill glitch that we’re trying to understand.
•  But now we want to cleanup, and it is not trivial to remove 250M objects
•  rados rmpool generated quite a load when we rm’d a 3 million object
pool (some OSDs were temporarily marked down).
•  Probably due to a mistake in our wbthrottle tuning
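A gentler alternative we are considering is deleting the objects incrementally and pacing the deletes; a python-rados sketch (pool name hypothetical, pacing deliberately crude):

    # Sketch: drain a pool object-by-object instead of `rados rmpool`,
    # sleeping periodically so the OSDs are not overwhelmed.
    import time
    import rados

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    ioctx = cluster.open_ioctx('tiny-objects-test')

    deleted = 0
    for obj in ioctx.list_objects():
        ioctx.remove_object(obj.key)
        deleted += 1
        if deleted % 1000 == 0:
            time.sleep(1)                          # tune to your cluster

    ioctx.close()
    cluster.shutdown()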
18. Other backfilling issues
•  During a backfilling event (draining a whole server),
we started observing repeated monitor elections
•  Caused by the mons’ LevelDBs being so active that the
local SATA disks couldn’t keep up.
•  When a mon falls behind, it calls an election
•  Could be due to LevelDB compaction…
•  We moved /var/lib/ceph/mon to SSDs – no more
elections during backfilling
•  Avoid double backfilling when taking an OSD out of the cluster
•  Start with ceph osd crush rm
•  If you mark the OSD out first, then crush rm it, you will
compute a new CRUSH map twice, i.e. backfill twice.
19. Fun with CRUSH
•  CRUSH is simple yet powerful, so it is tempting to
play with the cluster layout
•  But once you have non-zero amounts of data, significant
CRUSH changes will lead to massive data movements,
which create extra disk load and may disrupt users.
•  Early CRUSH planning is crucial!
•  A network switch is a failure domain, so we should
configure CRUSH to replicate across switches,
•  But (assuming we don’t have a private cluster network)
that would send all replication traffic via the switch uplinks
•  Unclear tradeoff between uptime and performance.
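A sketch of what the rule could look like, assuming a custom 'switch' bucket type has been declared in the CRUSH map between host and root (names are illustrative):

    rule replicated_across_switches {
        ruleset 1
        type replicated
        min_size 1
        max_size 10
        step take default
        step chooseleaf firstn 0 type switch       # one replica per switch
        step emit
    }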
20. CRUSH & Data distribution
•  CRUSH may give your cluster
an uneven data distribution
•  An OSD’s used space will
scale with the number of PGs
assigned to it
•  After you have designed your cluster, created your pools, and started adding data, check the PG and volume distributions
•  Reweighting OSDs (e.g. with ceph osd reweight) is useful to iron out an uneven distribution
•  The hashpspool flag is also important if you have many pools
[Histogram: Number of OSDs having N PGs (for pool = volumes)]
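The histogram above can be reproduced with a few lines against ceph pg dump; a sketch (the pool id '5' is hypothetical, and the JSON keys match our Dumpling-era output):

    # Sketch: count PGs per OSD for one pool, then invert into a histogram.
    import json
    import subprocess
    from collections import Counter

    dump = json.loads(subprocess.check_output(
        ['ceph', 'pg', 'dump', '--format=json']))

    pgs_per_osd = Counter()
    for pg in dump['pg_stats']:
        if pg['pgid'].startswith('5.'):            # PGs of the volumes pool
            for osd in pg['acting']:
                pgs_per_osd[osd] += 1

    for n, osds in sorted(Counter(pgs_per_osd.values()).items()):
        print('%3d PGs: %d OSDs' % (n, osds))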
21. RBD Reliability with 3 Replicas
•  RBD devices are chunked across thousands of objects:
•  A full 1TB volume is composed of 250,000 4MB objects
•  If any single object is lost, the whole RBD can be considered to be corrupted
(obviously, it depends which blocks are lost!)
•  If you lose an entire PG, you can consider all RBDs to be lost / corrupted.
•  Our incorrect & irrational fears:
•  Any simultaneous triple disk failure in the cluster would lead to objects being
lost – and somehow all RBDs would be corrupted.
•  As we add OSDs to the cluster, the data gets spread wider, and the chances of
RBD data loss increase.
•  But this is wrong!!
•  The only triple disk failures that can lead to data loss are those combinations
actively used by PGs – so having e.g. 4096 PGs for RBDs means that only
4096 combinations out of the 10^9 possible combinations matter.
•  P(loss) ≈ N_PGs * P_diskfailure^3 / 3!
•  We use 4 replicas for the RBD volumes, but this is probably overkill.
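A back-of-envelope evaluation of that estimate (the per-disk failure probability here is purely illustrative, not a measured value):

    # Sketch: expected chance that some PG loses all three replicas.
    n_pgs = 4096
    p_disk = 1e-3          # hypothetical P(disk dies within the repair window)
    p_loss = n_pgs * p_disk**3 / 6   # divide by 3! for unordered failures
    print(p_loss)          # ~6.8e-07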
22. Trust your clients
•  There is no server-side per-client throttling
•  A few nasty clients can overwhelm an OSD, leading to slow requests
•  When you have a high load / slow requests, it is not always
trivial to identify and blacklist/firewall the misbehaving client
•  Could use some help in the monitoring: per-client perf stats?
•  One of our creative users found a way to make the mons generate 5*40 MBps of outbound network traffic
•  Could saturate the mon network, lead to disruptions
•  RADOS is not for end-users. A cephx keyring is for trusted
persons only, not for Joe Random User.
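When a keyring must be handed out anyway, at least scope its capabilities to one pool; a sketch via python-rados' mon_command (entity and pool names are hypothetical):

    # Sketch: mint a cephx key that can only touch a single pool.
    import json
    import rados

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    cmd = json.dumps({
        'prefix': 'auth get-or-create',
        'entity': 'client.jsmith',
        'caps': ['mon', 'allow r', 'osd', 'allow rwx pool=jsmith-scratch'],
    })
    ret, keyring, errs = cluster.mon_command(cmd, b'')
    print(keyring.decode())
    cluster.shutdown()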
23. Fat fingers
•  A healthy cluster is always vulnerable to human errors
•  We’ve thus far avoided any big mistakes
•  Used PG splitting to grow a pool from 8 to 2048 PGs
•  Leads to unresponsive OSDs which get marked down → degraded objects
•  Safer & now-enforced to grow in 2x or 4x steps
•  ulimits, ulimits, ulimits
•  With a large number of OSDs (say, more than 500), you will hit num
file and num process limits everywhere:
•  Glance, qemu, radosgw, ceph/rados CLI, …
•  If you use XFS, don’t put your OSD journal as a file on the disk
•  Use a separate partition, the first partition!
•  We still need to reinstall our whole cluster to re-partition the OSDs
24. Scale up and out
•  Scale up: we are demonstrating the viability of a
3PB cluster with O(1000) OSDs.
•  What about 10,000 or 100,000 OSDs?
•  What about 10,000 or 100,000 clients?
•  Many Ceph instances is always an option, but not ideal
•  Scale out: our growing data centre in Budapest
brings many options:
•  Replicate over the WAN (though, 30ms RTT)
•  Tiering / Caching pools (new feature, need to gain experience)
•  Data locality – direct IOs to nearby replica or caching pool
25. Conclusions
•  CERN IT infrastructure is undergoing a private cloud revolution, and Ceph is providing the underlying storage
•  Our CASTOR and XRootD physics data use-cases may exploit RADOS for improved scalability
•  In seven months with a 3PB cluster, we’ve not had any disasters. Actually it’s working quite well
•  I presented some lessons learned; I hope they prove useful in your Ceph explorations.