This lightning talk introduces the challenges of implementing a free software petabyte-scale storage system based on Ceph: the rationale for choosing it over a non-free commercial solution, and the difficulties of operating and managing such a system in an enterprise-like environment. It also provides a short technical overview of Ceph and its current usage inside Eurac Research.
Ceph: a unified, distributed storage system designed for excellent performance, reliability and scalability.
• Created by Sage Weil for his PhD dissertation at the University of California, Santa Cruz, in 2007
• From fall 2007 he worked full-time on Ceph at DreamHost (he is one of the co-founders of DreamHost)
• In 2012 he founded Inktank Storage for professional services and support of Ceph
• First release, Argonaut, on July 2, 2012
• April 2014: Red Hat purchased Inktank
• October 2015: the Ceph Community Advisory Board was formed (Canonical, CERN, Cisco, Fujitsu, Intel, Red Hat, SanDisk, and SUSE)
• November 2018: the Ceph Foundation was founded under the Linux Foundation
• 13 releases so far (Mimic is the latest)
• All components must scale horizontally
• There can be no single point of failure
• The solution must be hardware agnostic
• Should use commodity hardware
• Self-managed whenever possible
• Open Source (LGPL)
• RGW: a web services gateway for object storage, compatible with S3 and Swift (consumed by apps)
• LIBRADOS: a library allowing apps to directly access RADOS from C, C++, Java, Python, Ruby, and PHP (example below)
• RADOS: a software-based, reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes and lightweight monitors
• RBD: a reliable, fully-distributed block device with cloud platform integration (consumed by hosts/VMs)
• CEPHFS: a distributed file system with POSIX semantics and scale-out metadata management (consumed by clients)
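Since LIBRADOS lets applications talk to RADOS directly, here is a minimal sketch of that path using the Python rados bindings. The config path and the pool name 'data' are assumptions for illustration, not details from the talk.

    import rados

    # Connect using a local ceph.conf and the default keyring (assumed paths)
    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()

    # Open an I/O context on a pool; the pool name 'data' is hypothetical
    ioctx = cluster.open_ioctx('data')

    # Store and fetch an object directly in RADOS, with no gateway in between
    ioctx.write_full('greeting', b'stored via librados')
    print(ioctx.read('greeting'))

    ioctx.close()
    cluster.shutdown()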
Monitors
• Maintain cluster membership and state
• Provide consensus for distributed decision-making
• Small and odd number
• Not part of the data path

Object Storage Daemons (OSDs)
• Store and serve data to clients
• One per disk (HDD, SSD, NVMe)
• 10s to 1000s in a cluster
• Intelligently peer for replication & recovery (see the sketch below)
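To make the monitor/OSD split concrete, a small sketch (Python rados bindings again, same assumed config path) that asks the monitors for the current quorum and then reads aggregate OSD usage. Note that object data itself never flows through the monitors.

    import json
    import rados

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()

    # Ask the monitors (the consensus layer) who is currently in quorum
    ret, outbuf, errs = cluster.mon_command(
        json.dumps({"prefix": "quorum_status", "format": "json"}), b'')
    print(json.loads(outbuf)["quorum_names"])

    # Aggregate usage across the OSDs, which store and serve the actual data
    print(cluster.get_cluster_stats())

    cluster.shutdown()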
Data Placement
CRUSH: Controlled, Scalable, Decentralized Placement of Replicated Data
Pseudo-random placement algorithm (toy sketch below)
• Fast calculation, no lookup
• Repeatable, deterministic
Statistically uniform distribution
Rule-based configuration
• Infrastructure topology aware (CRUSH map)
• Adjustable replication
• Weighted devices (different sizes)
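The following is not CRUSH itself, only a toy Python sketch of the property the bullets describe: every client computes the same object -> PG -> OSD mapping from a shared map by hashing, with no lookup table and no central directory. Topology rules and device weights are deliberately omitted.

    import hashlib

    OSDS = [f"osd.{i}" for i in range(12)]  # stand-in for the CRUSH map
    PG_NUM = 64                             # placement groups in the pool
    REPLICAS = 3

    def stable_hash(*parts):
        # A hash that every client and server computes identically
        return int(hashlib.md5("/".join(parts).encode()).hexdigest(), 16)

    def place(obj_name):
        # Object -> PG by hashing, PG -> OSDs by ranking pseudo-random draws
        pg = stable_hash(obj_name) % PG_NUM
        ranked = sorted(OSDS, key=lambda osd: stable_hash(str(pg), osd))
        return pg, ranked[:REPLICAS]

    # Repeatable and deterministic: the same answer on any machine, any time
    print(place("my-object"))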
• 39 storage nodes (12 × 4 TB SATA, 4 × 400 GB SSD; raw capacity worked out below)
• 546 Object Storage Daemons
• 5 monitor nodes
• 2 metadata nodes
• Dedicated public and cluster network
• Storage nodes with 2 × 10 Gbit/s for public & 2 × 40 Gbit/s for cluster connectivity
• Monitor & metadata nodes have 2 × 10 Gbit/s public connectivity
• Hardware: Supermicro, Mellanox
• OS: CentOS 7.4
• Ceph version: Luminous 12.2.4
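A back-of-the-envelope capacity check, assuming the disk counts above are per node and that the 1.7 PB figure later in the talk refers to the HDD tier reported in binary units (both are assumptions):

    # Raw HDD capacity: 39 nodes x 12 drives x 4 TB (per-node reading assumed)
    nodes, hdds_per_node, hdd_tb = 39, 12, 4
    raw_tb = nodes * hdds_per_node * hdd_tb   # 1872 TB
    raw_pib = raw_tb * 10**12 / 2**50         # ~1.66 PiB, close to the quoted 1.7 PB
    print(raw_tb, round(raw_pib, 2))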
• Mainly CephFS and RBD used (librbd sketch below)
• Integrated into the internal Kubernetes and LXD infrastructure
• Started in 2014 with Hammer LTS
• 3 major version upgrades completed (Hammer -> Jewel -> Luminous)
• Scaled from the initial 600 TB to 1.7 PB
• Raw used: ~1 PB
• Usage: CephFS 77%, RBD 23%
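As a taste of the RBD side, a minimal sketch using the Python rbd bindings. The pool and image names are hypothetical, and the actual Kubernetes/LXD integration goes through their storage drivers rather than code like this.

    import rados
    import rbd

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    ioctx = cluster.open_ioctx('rbd')  # pool name is an assumption

    # Create a 1 GiB image, the same primitive container volumes map onto
    rbd.RBD().create(ioctx, 'demo-volume', 1024 ** 3)

    # Write through librbd as if to a block device
    with rbd.Image(ioctx, 'demo-volume') as image:
        image.write(b'hello from librbd', 0)

    ioctx.close()
    cluster.shutdown()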