OVERVIEW, EXPERIENCES & OUTLOOK
1
Danny Al-Gaaf
Linux-Stammtisch Munich, 24.11.2015
OUTLINE
● Architecture
○ Ceph basics
○ Ceph and OpenStack
● Experiences
○ Support
○ Performance
○ Security
○ HA
● Outlook
○ Hardware
○ Features
● Conclusions
2
ARCHITECTURE
CEPH - BASICS
WHY SOFTWARE DEFINED STORAGE?
4
Proprietary hardware → Common, off-the-shelf hardware
Low cost, standardized supply chain
Scale-Up architecture → Scale-Out architecture
Increase operational flexibility
Hardware-based intelligence → Software-based intelligence
More programmability, agility, and control
Closed development process → Open development process
More flexible, well integrated technology
OVERVIEW
5
● LGPL v2.1
● Object based storage
● Scale horizontally
● Terabytes to exabytes
● Fault tolerant: no SPoF (single point of failure)
● Hardware agnostic
● Commodity Hardware
○ Ethernet (1, 10, 40 GbE …)
○ Fibre Channel
○ SATA/SAS, HDD/SSD, …
● Self-managed
● Community:
○ 635 developers
○ 116 companies
RADOS
6
RADOS COMPONENTS
● OSDs
○ 10s - 1000s per cluster
○ One per device (HDD/SSD/RAID group, SAN …)
○ Store objects
○ Handle replication and recovery
○ Flexible backend
■ FileStore (XFS, btrfs, …)
■ KeyValueStore (levelDB, rocksDB, libkinetic)
■ ErasureCoding (ISA, SHEC, Jerasure, LRC)
○ Cluster and public networks
● MONs
○ Maintain cluster membership and states
○ Use PAXOS protocol to establish quorum consensus
○ Small, lightweight
○ Odd number (to maintain a majority quorum)
7
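A quick way to inspect these components on a running cluster (a minimal sketch, assuming the default cluster name and an admin keyring):
ceph -s               # overall health, MON quorum and PG states
ceph osd tree         # OSDs and their place in the CRUSH hierarchy
ceph quorum_status    # which MONs currently form the PAXOS quorum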
DATA DISTRIBUTION - CRUSH
8
CRUSH - QUICK CALCULATION
9
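The same calculation can be reproduced from the CLI; a small sketch, assuming a pool named rbd and an arbitrary object name:
ceph osd map rbd my-object    # prints the PG the object hashes to and the set of OSDs it maps to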
CRUSH - PLACEMENT
10
● Each PG maps independently to a pseudo-random set of OSDs
● PGs that map to the same OSD generally have their other replicas on different OSDs
● In case of an OSD failure:
○ CRUSH provides another OSD target
○ Each PG on the failed OSD will be re-replicated by a different OSD
○ Highly parallel recovery and rebalancing
CRUSH - PLACEMENT
11
Controlled Replication Under Scalable Hashing
● Pseudo-random data placement function
○ Fast calculation - O(log n)
○ No lists or tables, no lookups
○ Repeatable and deterministic
● Uniform, weighted distribution
● Stable mapping
○ Predictable, bounded migration of data on changes (e.g. adding/removing nodes)
● Rule-based configuration
○ Aware of physical infrastructure topology
■ OSDs in hosts/racks/rows/fire compartments/data center …
○ Adjustable replication
○ Adjustable weighting
● Calculated by clients
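A hedged sketch of how the topology-aware rules can be inspected and extended (file, bucket and rule names are examples):
ceph osd getcrushmap -o crushmap.bin              # export the current CRUSH map
crushtool -d crushmap.bin -o crushmap.txt         # decompile it for inspection or editing
ceph osd crush rule create-simple rack-rule default rack   # simple rule: spread replicas across racks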
LOGICAL SEPARATION - POOLS
12
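Pools are created and tuned from the CLI; a sketch with illustrative pool name, PG count, replica count and rule id:
ceph osd pool create volumes 128 128 replicated   # pool with 128 placement groups
ceph osd pool set volumes size 3                  # keep three replicas
ceph osd pool set volumes crush_ruleset 1         # bind the pool to a specific CRUSH rule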
CLIENT DATA HANDLING
● Client
○ Calculates placement with CRUSH and writes only to the primary OSD
● OSDs
○ Handle data replication, rebalancing, and recovery based on CRUSH
13
REPLICATION VS ERASURE CODING
Full copies of stored objects
● Very high durability
● 3x (200% overhead)
● Quicker recovery
14
One copy plus parity
● Cost-effective durability
● 1.5x (50% overhead)
● Expensive recovery
● RBD/CephFS would require cache tiering
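For the erasure-coding case, a hedged sketch of an EC pool with a replicated cache tier in front of it (pool names and k/m values are examples):
ceph osd erasure-code-profile set ec-profile k=2 m=1 plugin=jerasure
ceph osd pool create ecpool 128 128 erasure ec-profile
ceph osd pool create cachepool 128 128 replicated
ceph osd tier add ecpool cachepool
ceph osd tier cache-mode cachepool writeback
ceph osd tier set-overlay ecpool cachepool        # clients now go through the cache tier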
RADOS BLOCK DEVICE
15
RBD FEATURES
16
● Provides a block interface to Ceph
● Stripes images across the entire pool/cluster
● Snapshots (read-only)
● Copy-on-write clones
● Integration
○ Qemu/KVM
○ Xen
○ LXC
○ Linux kernel (rbd.ko)
○ iSCSI (through a gateway), e.g. for VMware or Windows
○ OpenStack, CloudStack, OpenNebula, …
● Incremental backup
● Client-side caching
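A sketch of the snapshot, clone and incremental-backup workflow listed above (pool, image and snapshot names are assumptions):
rbd create volumes/base --size 10240 --image-format 2    # 10 GB image, format 2 needed for cloning
rbd snap create volumes/base@snap1
rbd snap protect volumes/base@snap1                      # snapshots must be protected before cloning
rbd clone volumes/base@snap1 volumes/vm-disk             # copy-on-write clone
rbd export-diff volumes/base@snap1 base-snap1.diff       # full export up to snap1
rbd snap create volumes/base@snap2
rbd export-diff --from-snap snap1 volumes/base@snap2 base-snap1-2.diff   # incremental backup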
RBD - VIRTUAL DISKS
17
RBD - KERNEL MODULE
18
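Mapping an image through rbd.ko looks roughly like this (pool, image and mount point are examples; supported image features depend on the kernel version):
rbd create volumes/kernel-test --size 10240
modprobe rbd
rbd map volumes/kernel-test --id admin        # creates /dev/rbdX plus /dev/rbd/volumes/kernel-test
mkfs.xfs /dev/rbd/volumes/kernel-test
mount /dev/rbd/volumes/kernel-test /mnt
rbd showmapped                                # list currently mapped images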
THE RADOS GATEWAY
19
THE RADOS GATEWAY
20
THE RADOS GATEWAY
● RESTful object storage proxy
○ Store objects via RADOS
○ Stripes large RESTful objects across many RADOS objects
○ API supports buckets, accounts
○ Usage accounting
● Interfaces compatible with:
○ Amazon S3
○ Swift
● Federated RGW
○ Zones, regions topology
○ Global bucket and user namespace
○ Asynchronously replicated across data centers
○ Read affinity
■ Serve local data from local/closest DC
21
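Users and their S3/Swift credentials are managed with radosgw-admin; a sketch with example uid and gateway host:
radosgw-admin user create --uid=demo --display-name="Demo User"              # prints S3 access/secret keys
radosgw-admin subuser create --uid=demo --subuser=demo:swift --access=full
radosgw-admin key create --subuser=demo:swift --key-type=swift --gen-secret
swift -A http://rgw.example.com/auth/1.0 -U demo:swift -K <swift_key> stat   # quick Swift-API smoke test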
CEPHFS
22
CEPHFS
23
META DATA SERVER
● MDS
○ Manages only the metadata for a POSIX-compliant shared file system
■ Directory hierarchy
■ File metadata (owner, timestamps, mode, …)
○ Clients stripe file data in RADOS
■ MDS is not in the data path
○ MDS stores metadata in RADOS
○ Dynamic MDS cluster scales to 10s or 100s of daemons
● CephFS
○ Linux kernel module or fuse driver
○ Not production ready !!!
24
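For testing only (not production, as noted above), a CephFS can be created and mounted roughly like this (pool names, MON address and secret file are assumptions):
ceph osd pool create cephfs_data 128
ceph osd pool create cephfs_metadata 128
ceph fs new cephfs cephfs_metadata cephfs_data
mount -t ceph 192.168.10.1:6789:/ /mnt/cephfs -o name=admin,secretfile=/etc/ceph/admin.secret   # kernel client
ceph-fuse /mnt/cephfs                                                                           # or FUSE client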
ARCHITECTURE
OPENSTACK
ARCHITECTURE: CEPH AND OPENSTACK
26
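The glue between the two are usually dedicated cephx clients per OpenStack service; a sketch roughly following the upstream Ceph/OpenStack integration docs of that time (the pool names are the conventional ones and may differ):
ceph osd pool create volumes 128
ceph osd pool create images 128
ceph osd pool create vms 128
ceph auth get-or-create client.glance mon 'allow r' \
    osd 'allow class-read object_prefix rbd_children, allow rwx pool=images'
ceph auth get-or-create client.cinder mon 'allow r' \
    osd 'allow class-read object_prefix rbd_children, allow rwx pool=volumes, allow rwx pool=vms, allow rx pool=images'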
EXPERIENCES
SUPPORT
SUPPORT
● Who claims to provide support?
○ Red Hat (Red Hat Storage)
○ SUSE (SUSE Enterprise Storage)
○ Canonical
○ Mirantis
● What is required for support?
○ Automation
○ Support organisation and processes
○ Knowledge and experience of the technology
■ Ceph, filesystems, kernel, KVM, ...
○ Developers
○ Active upstream contributions
● Consider which distribution you already know and use!
28
SUPPORT
Ceph contributions
29
# company commits authors
1 Red Hat / Inktank 57147 71 / 42 (overlap)
2 Dreamhost 15774 26
3 Intel 1613 11
4 SUSE 1591 18
5 Deutsche Telekom 1480 2
13 Mirantis 283 9
37 Canonical 25 4
Source: metrics.ceph.com (20151121)
Kernel contributions
# company changesets authors
1 Intel 37602 733
2 Red Hat 31490 333
3 SUSE 18018 132
80 Canonical 187 22
173 Mirantis 15 1
Source: linux.git + gitdm (20151121)
EXPERIENCES
PERFORMANCE
PERFORMANCE
● Performance depends on
○ Network
○ OSD drives
○ Journal drives
○ CPU, memory
○ Ceph backend
● What performance do you need?
● What is your use-case and load profile?
31
or:
DO YOU LIKE BUG REPORTS?
32
The customer complained about bad performance on a new production system:
# Current live env:
root@webserver01:~# dd if=/dev/zero of=test.bin bs=1k count=1M
1073741824 bytes (1.1 GB) copied, 11.4649 s, 93.7 MB/s
# Staging env:
root@webserver01:~# dd if=/dev/zero of=test.bin bs=1k count=1M
1073741824 bytes (1.1 GB) copied, 22.4607 s, 47.8 MB/s
... Ceph rbd disk shows bad 4k performance with synthetic MongoDB perf tests ...
HOW TO IDENTIFY BOTTLENECKS ?
33
GET THE RIGHT TOOL
● Ceph’s own performance tools are not fully reliable for deep analysis
● Needed a tool to analyse all layers: rbd.ko, librbd, block devices and files
○ Don’t reinvent the wheel !
○ There is a “swiss army knife” - fio
○ Contributed RBD support for fio upstream
○ Other helpers: blktrace, systemtap
34
“Get Jens' FIO code. It does things right, including writing
actual pseudo-random contents, which shows if the disk
does some de-duplication (aka optimize for
benchmarks): [...] Anything else is suspect - forget about
bonnie or other traditional tools.”
Linus Torvalds
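A minimal fio job against librbd, assuming a pool rbd, a test image fio-test and the admin client; the numbers are purely illustrative:
rbd create rbd/fio-test --size 10240
fio --name=rbd-4k-randwrite --ioengine=rbd --clientname=admin \
    --pool=rbd --rbdname=fio-test --rw=randwrite --bs=4k \
    --iodepth=32 --numjobs=1 --runtime=60 --time_based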
PERFORMANCE ANALYSIS - SYMPTOMS
35
ANALYSE THE DISK
36
MITIGATE
● Make use of the RAID controller
○ RAID 0 out of multiple disks
○ Use the RAID write-back cache
● Always move journals to separate drives
● Consider scaling out your cluster
● Carefully design your hardware
○ commodity != consumer grade
● Users must understand your platform
○ e.g. don’t run replicated databases like MongoDB on replicated storage
37
STORAGE HARDWARE RECOMMENDATIONS
● Hardware recommendations in Ceph documentation good starting point
● Network
○ Ethernet (1/10/40 GbE)
○ Multiple NICs and/or ports
○ Bonded NICs
○ Jumbo frames
○ Use the fastest network you can afford
● CPU
○ x86_64 or 64bit ARM
○ Consider at least one dedicated physical core per OSD
● Memory
○ Recommendation: 1 GByte of RAM per TByte of storage
○ More memory doesn’t hurt!
38
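Jumbo frames only help if they work end-to-end; a quick way to set and verify the MTU on a node (interface name and peer address are examples):
ip link set dev eth0 mtu 9000
ping -M do -s 8972 -c 3 192.168.10.2    # 8972 = 9000 minus 20 byte IP and 8 byte ICMP header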
STORAGE HARDWARE RECOMMENDATIONS
● Storage
○ One disk per OSD
○ Journals on separate disk, SSD highly recommended
○ Use enterprise devices !
○ If affordable, prefer SAS over SATA
■ e.g. unrecoverable silent data corruption on the channel: 10^21 vs 10^17
○ Highly depends on performance requirements and cost
○ Consider density regarding size of failure zones (node, rack)
○ Consider different storage types for different use cases
■ Standard, performance, archive/backup/cold data
○ SSDs: keep DWPD in mind!
● Storage controller
○ Bandwidth, performance, cache size
○ HBA preferred over RAID
○ If using RAID (RAID 0): a battery- or flash-protected write-back cache has a huge impact on writes
39
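With the FileStore backend, the journal is placed on a separate (SSD) device when the OSD is created; a hedged sketch using ceph-disk (device names are examples):
ceph-disk prepare /dev/sdb /dev/sdd    # data on the HDD /dev/sdb, journal partition on the SSD /dev/sdd
ceph-disk activate /dev/sdb1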
EXPERIENCES
SECURITY
ATTACK VECTORS
41
● Generic vectors
○ Network
○ Authentication
○ Code flaws
● Many potential component-specific vectors
○ RBD
■ librbd vs. kernel module
○ RadosGW
■ same exposure as any web service
○ CephFS
● Deep dive: http://www.slideshare.net/dalgaaf/open-stack-dost-frankfurt-ceph-security
COUNTERMEASURES - TIPS
● Network
○ Always use separate cluster and public networks
○ Always separate your control nodes from other networks
○ Don’t expose to the open internet
○ Encrypt inter-datacenter traffic
● Avoid hyper-converged infrastructure
○ Isolate compute and storage resources
○ Scale them independently
○ Risk mitigation if daemons are compromised or DoS’d
○ Don’t mix
■ compute and storage
■ control nodes (OpenStack and Ceph)
42
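On top of the separated networks it is cheap to restrict the Ceph ports to the trusted subnets; a sketch with example interfaces and subnets (default ports: 6789 for MONs, 6800-7300 for OSDs):
iptables -A INPUT -i eth0 -p tcp --dport 6789 -s 192.168.10.0/24 -j ACCEPT       # MONs, public network
iptables -A INPUT -i eth0 -p tcp --dport 6800:7300 -s 192.168.10.0/24 -j ACCEPT  # OSDs, public network
iptables -A INPUT -i eth1 -p tcp --dport 6800:7300 -s 192.168.20.0/24 -j ACCEPT  # OSDs, cluster network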
COUNTERMEASURES - RADOSGW
● Big and easy target through the HTTP(S) protocol
● Small appliance per tenant with
○ Separate network
○ SSL-terminating proxy forwarding requests to radosgw
○ WAF (mod_security) to filter requests
○ Placed in a secure/managed zone
● Don’t share buckets/users between tenants
43
COUNTERMEASURES - CEPHX
● Monitors are trusted key servers
○ Store copies of all entity keys
○ Each key has an associated “capability”
■ Plaintext description of what the key user is allowed to do
● What you get
○ Mutual authentication of client + server
○ Extensible authorization w/ “capabilities”
○ Protection from man-in-the-middle, TCP session hijacking
● What you don’t get
○ Secrecy (encryption over the wire)
● What you can do
○ Restrict capabilities associated with each key
○ Limit administrators’ power
■ use ‘allow profile admin’ and ‘allow profile readonly’
■ restrict role-definer or ‘allow *’ keys
○ Careful key distribution (Ceph/OpenStack nodes)
44
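Restricting capabilities per key is done via ceph auth; a sketch with example client name and pool:
ceph auth get-or-create client.backup mon 'allow r' osd 'allow rwx pool=backup'   # limited to one pool
ceph auth caps client.backup mon 'allow r' osd 'allow r pool=backup'              # tighten an existing key
ceph auth list                                                                    # review all keys and caps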
SECURITY - STATUS
● Reactive processes are in place
○ security@ceph.com , CVEs, downstream product updates, etc.
● Proactive measures in progress
○ Code quality improves (SCA, etc.)
○ Unprivileged daemons
○ MAC (SELinux, AppArmor)
○ Encryption
● Progress defining security best-practices
○ Document best practices for security
● Ongoing process
45
EXPERIENCES
HIGH AVAILABILITY
HA - FAILURE SCENARIOS
● Power and network outage
● Failure of a server or component
● Failure of a software service
● Disaster
● Human errors
○ Still often the leading cause of outages
○ Misconfiguration
○ Accidents
○ Emergency power-off
● Deep dive: http://www.slideshare.net/dalgaaf/99999-available-openstack-cloud-a-builders-guide
47
CEPH AND OPENSTACK - WORST CASES
48
● Fire compartment fails
CEPH AND OPENSTACK - WORST CASES
49
● Split brain scenarios
CEPH AND OPENSTACK - MITIGATION
50
● Most DCs have backup rooms
● Only a few servers are needed to host quorum-related services
● Less cost-intensive than three fire compartments
● Can mitigate split brain between FCs (depending on network layout)
MITIGATION - APPLICATIONS
51
● The target of five 9’s is end-to-end (E2E)
○ Five 9’s at the data center level is very expensive
○ No pets allowed !!!
○ Only cloud-ready applications
OUTLOOK
HARDWARE
HARDWARE
● Open Kinetic
○ Classic 3.5” form factor
○ SAS HDD connector
○ Dual 1Gb/s Ethernet
○ Connect drives directly to data center fabric
○ Requires enclosures (e.g. Supermicro)
○ Allows intelligent drives
● Seagate Kinetic KV drives
○ Provide a RESTful KV interface
○ Improve performance of SMR (Shingled Magnetic Recording) drives
○ Eliminates the filesystem and simplifies logic in Ceph
○ Still requires separate compute power for Ceph
53
HARDWARE
● HGST
○ Open Ethernet drive architecture
○ Drive running Linux
○ CPU (32bit ARM) and RAM
○ Each drive appears as a Linux server
○ Can be provisioned with SDS applications
● Toshiba
○ Open Kinetic form factor with an Open-Ethernet-drive-like architecture
○ 64bit MIPS processor
○ Contains e.g. 2x2.5” drives (HDD or SSD plus NVRAM)
○ Drive running Linux
○ Can be provisioned with Ceph
○ Eliminates need for compute power to run OSDs
○ Consumes 14-16 W per drive
54
OUTLOOK
FEATURES
SOME ROADMAP ITEMS
● RBD
○ Mirroring (async)
○ Native VMware driver
● RGW
○ Active/Active multi-site
● CephFS
○ Get it production ready (fsck, repair, …)
● QoS for clients
● Performance
○ NewStore (bypass journal and file system where possible)
○ Cache Tiering improvements
● Security
○ SELinux/AppArmor MAC profiles
○ Non-root daemons
● Encryption in flight ?
56
CONCLUSIONS
Conclusions
● Ceph
○ RBD/RGW is ready for enterprise production environments
○ CephFS is on the way
● Security
○ Reactive processes are in place
○ Proactive measures in progress
● High Availability
○ Requires careful planning, especially within a cloud setup
○ A third fire compartment or a dedicated quorum room is required
○ NO PETS, only cloud-ready applications !!!
● Open Kinetic based drives can simplify your DC
58
Get involved !
● Ceph
○ https://ceph.com/community/contribute/
○ ceph-devel@vger.kernel.org
○ IRC: OFTC
■ #ceph,
■ #ceph-devel
○ Ceph Developer Summit
● OpenStack
○ Cinder, Glance, Manila, ...
59
danny.al-gaaf@bisect.de
dalgaaf
blog.bisect.de
@dannnyalgaaf
linkedin.com/in/dalgaaf
xing.com/profile/Danny_AlGaaf
Danny Al-Gaaf
Senior Cloud Technologist
Q&A - THANK YOU!
