5 Reasons to Use Arm-Based Micro-Server Architecture for Ceph Cluster
Aaron Joue
Ambedded Technology
Ceph is Scalable & No Single Point of Failure
• CRUSH algorithm – distributes objects across OSDs according to a pre-defined failure domain
• Clients use cluster maps and hash calculations to write/read objects to/from OSDs (see the placement sketch below)
• No controller and no bottleneck to limit scalability; no single point of failure
• Self-healing: automated data recovery following server/device failure
• Key point: hardware will eventually fail, and Ceph protects the data when it does
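To make the calculated-placement idea concrete, here is a minimal sketch, assuming a made-up cluster map and simple rendezvous (highest-random-weight) hashing rather than the real CRUSH implementation. It only illustrates how every client can derive the same set of OSDs, one per failure domain, from the map alone, with no central controller to ask.

# Illustrative only: CRUSH-like calculated placement via rendezvous hashing.
# The cluster map and names below are hypothetical; real Ceph uses the CRUSH map and rules.
import hashlib

CLUSTER_MAP = {  # osd_id -> failure domain (e.g., host or chassis)
    "osd.0": "host-a", "osd.1": "host-a",
    "osd.2": "host-b", "osd.3": "host-b",
    "osd.4": "host-c", "osd.5": "host-c",
}

def _weight(obj: str, osd: str) -> int:
    # Deterministic pseudo-random weight for the (object, OSD) pair.
    return int(hashlib.sha256(f"{obj}:{osd}".encode()).hexdigest(), 16)

def place(obj: str, replicas: int = 3) -> list:
    """Pick `replicas` OSDs for an object, at most one per failure domain."""
    chosen, used = [], set()
    for osd in sorted(CLUSTER_MAP, key=lambda o: _weight(obj, o), reverse=True):
        domain = CLUSTER_MAP[osd]
        if domain not in used:
            chosen.append(osd)
            used.add(domain)
        if len(chosen) == replicas:
            break
    return chosen

print(place("rbd_data.1234"))  # every client with the same map gets the same answer

Every client that holds the same map computes the same placement, which is why there is no lookup service to scale and none to fail.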
Hardware Failure Is Still Painful
• In a conventional deployment each OSD node hosts N disks (N is 8, 12, or more)
• A single node failure takes multiple OSDs down at once
• Single HDD/SSD capacity keeps growing
• Re-healing the lost data causes a network storm (see the worked example below)
• Solution: reduce the failure domain to a single OSD
[Diagram: client servers connected over a 20/40 Gb network to multiple OSD nodes, each node holding N disks]
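As a rough, hedged illustration of the re-heal bullet above: the disk count, disk size, and fill level below are assumptions (12 x 8 TB disks at 70% utilization), chosen only to show that losing a whole node forces the cluster to re-replicate N disks' worth of data instead of one.

# Hypothetical recovery-traffic estimate; disk count, size, and fill level are assumptions.
disks_per_node = 12        # the N in the slide above
disk_capacity_tb = 8.0
utilization = 0.70         # fraction of each disk actually holding data

data_per_disk_tb = disk_capacity_tb * utilization
node_failure_tb = disks_per_node * data_per_disk_tb   # whole node: N OSDs lost at once
single_osd_failure_tb = data_per_disk_tb              # one micro-server: one OSD lost

print(f"Node failure : ~{node_failure_tb:.1f} TB to re-replicate")
print(f"Single OSD   : ~{single_osd_failure_tb:.1f} TB to re-replicate")
# Shrinking the failure domain to a single OSD divides the recovery traffic by N (12x here).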
(1) Minimize the Failure Domain
• A single micro-server failure brings down only a single OSD
• Arm-based micro-servers: low power, small footprint
• A micro-server is more reliable because it has fewer components
[Diagram: client servers connected over a 40 Gb network to chassis of N micro-servers, each micro-server paired with its own disk]
Arm-Based Micro-Server Architecture
• 1 x HDD & SSD per micro-server
• 8 Arm servers in a 1U chassis
• 4 x 10 Gbps uplink per 1U
• Max. 105 W power consumption
• A single Ceph daemon runs on each node
(2) Single Processor, Single Task
• No NUMA issues to deal with
• Dedicated CPU, memory & LAN for every single OSD
• OSD workload is evenly balanced
• No resource competition
• Result: higher Ceph performance than a big server node (see the affinity sketch below)
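For contrast, a hedged illustration of the kind of tuning a large multi-socket x86 OSD node may need and a single-processor micro-server does not: pinning each OSD process to the CPUs of one NUMA node so it does not compete with other OSDs for cores and memory. The PIDs and CPU ranges are invented for illustration.

# Hypothetical NUMA-aware CPU pinning on a big dual-socket OSD node (Linux only).
# PIDs and CPU ranges are made up; they are not from any real deployment.
import os

# Pretend these are the PIDs of four ceph-osd daemons sharing one dual-socket server.
osd_pids = {1101: range(0, 8),    # OSDs pinned to socket 0 cores
            1102: range(0, 8),
            1103: range(8, 16),   # OSDs pinned to socket 1 cores
            1104: range(8, 16)}

for pid, cpus in osd_pids.items():
    try:
        os.sched_setaffinity(pid, cpus)  # restrict the OSD to one NUMA node's cores
        print(f"pinned PID {pid} to CPUs {set(cpus)}")
    except (ProcessLookupError, PermissionError) as exc:
        print(f"could not pin PID {pid}: {exc} (illustrative values)")

On the Arm micro-server architecture each OSD already owns its processor, memory, and NIC, so this whole class of tuning disappears.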
(3) Bonus: Reduced Power Consumption
• ~60% power saving on a 1 PB Ceph cluster (see the estimate below)
[Chart: 5-year power saving]
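A hedged, illustrative estimate of what a saving of this order looks like over five years, using the node power figures from the benchmark slide later in this deck (1,980 W vs. 900 W, excluding HDDs) rather than an exact 1 PB bill of materials; the duty cycle and electricity price are assumptions.

# Illustrative 5-year energy saving using this deck's own benchmark power figures.
# 24/7 operation and the electricity price are assumptions.
x86_watts = 1980.0        # 6 x 2U x86 nodes (excluding HDDs), from the benchmark slide
arm_watts = 900.0         # 9 x Mars 200 1U Arm chassis (excluding HDDs)

hours_5y = 24 * 365 * 5
saved_kwh = (x86_watts - arm_watts) * hours_5y / 1000.0
price_per_kwh = 0.15      # assumed USD/kWh

print(f"Node-power saving: {(1 - arm_watts / x86_watts) * 100:.0f}%")
print(f"Energy saved over 5 years: ~{saved_kwh:,.0f} kWh (~${saved_kwh * price_per_kwh:,.0f})")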
(4) Higher Density in Rack
• Rack power supply limit: 5 kVA
• x86 server: 2U, 16 OSDs, 560 W
• Arm server (Mars 400): 1U, 8 OSDs, 160 W
• One rack can therefore accommodate:
  – x86 servers: 9 units = 144 OSDs
  – Arm servers (Mars 400): 30 units = 240 OSDs
• The Arm micro-server delivers more than 1.66x the OSD density, providing more TB of capacity per rack (see the calculation below)
[Diagram: two racks, each under the 5 kVA power limit – 9 x86 servers with 144 OSDs vs. 30 Arm servers with 240 OSDs]
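The rack math behind these figures, written out as a short sketch. The server counts, per-server power, and per-server OSD counts come from this slide; the arithmetic is the only thing added.

# OSD density comparison per rack, using the counts quoted on this slide.
x86_osds_per_rack = 9 * 16    # 9 x 2U x86 servers x 16 OSDs = 144 OSDs
arm_osds_per_rack = 30 * 8    # 30 x 1U Mars 400 chassis x 8 OSDs = 240 OSDs

ratio = arm_osds_per_rack / x86_osds_per_rack
print(f"{x86_osds_per_rack} vs. {arm_osds_per_rack} OSDs -> {ratio:.2f}x density")
# Power check with the per-server figures: 30 x 160 W = 4,800 W stays under 5 kVA,
# while 9 x 560 W = 5,040 W sits right at the limit.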
(5) Start an HA Ceph Cluster from 3U
• 3 x decentralized 1U servers with 8 nodes each provide 24 Arm server nodes
  – Nodes can also run MDS or RGW
  – Relaxes the limitation on erasure-code pool size (see the sketch below)
• The smallest HA Ceph cluster
• Best for edge data centers
[Diagram: three 1U chassis, each running 1 x MON node and 7 x OSD nodes]
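A hedged sketch of why 24 one-OSD nodes relax erasure-code constraints: an EC profile k+m needs at least k+m failure domains, and with one OSD per micro-server the host-level failure-domain count equals the node count. The profiles below are example values, not recommendations.

# Illustrative erasure-code feasibility check; the EC profiles are examples only.
# With one OSD per micro-server, host-level failure domains == number of OSD nodes.
osd_nodes = 21          # 24 Arm nodes minus the 3 used as MONs in the layout above

for k, m in [(4, 2), (8, 3), (16, 4)]:
    fits = (k + m) <= osd_nodes
    overhead = (k + m) / k
    print(f"EC {k}+{m}: needs {k + m} failure domains -> "
          f"{'fits' if fits else 'does not fit'}, storage overhead {overhead:.2f}x")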
Benchmark
Cluster: customer x86 Ceph cluster vs. Ambedded Ceph cluster
• Server type: 6 x 2U nodes, 64 GB RAM, dual Xeon E5-2600, dual 10 Gb LAN vs. 9 x Ambedded Mars 200 (72 Arm v7 dual-core micro-servers)
• Total OSDs: 72 SATA OSDs on 6 servers vs. 72 SATA OSDs on 9 servers
• Rack space: 12U (6 x 2U) vs. 9U (9 x 1U)
• Power consumption (excluding HDDs): 1,980 W (330 W x 6) vs. 900 W (100 W x 9)
• Performance, MB/s (replica 3, 100% write / 100% read): read 2,529, write 587 vs. read 6,637, write 923
Test setup: Ceph 10.2.10, replica 3 pool, object storage, CosBench 100% write & 100% read (32-bit Arm)
Object Store Performance
[Bar chart: CosBench throughput in MB/s for 100% read, 100% write, and 80% read / 20% write workloads, replica 3 vs. EC 4+2. Setup: Ceph Luminous, 20 x OSD, RGW on HDD, 2 x RGW, 64-bit Arm]
RBD FIO Performance
[Bar chart: FIO throughput with 20 x OSD, replica 3, 4 MB RBD object size]
• 64 KB block size: read 1,286 MB/s, write 380 MB/s
• 1 MB block size: read 2,377 MB/s, write 740 MB/s
• 4 MB block size: read 2,529 MB/s, write 756 MB/s
Thank You
Editor's Notes

• #3 SCALABILITY AND HIGH AVAILABILITY: In traditional architectures, clients talk to a centralized component (e.g., a gateway, broker, API, or facade) that acts as a single point of entry to a complex subsystem. This limits both performance and scalability while introducing a single point of failure: if the centralized component goes down, the whole system goes down with it. Ceph eliminates the centralized gateway so that clients interact with Ceph OSD Daemons directly. Ceph OSD Daemons create object replicas on other Ceph nodes to ensure data safety and high availability, and Ceph also uses a cluster of monitors for high availability. To eliminate centralization, Ceph uses an algorithm called CRUSH.

Before discussing the CRUSH algorithm, consider the traditional way of storing data. Traditional storage relies on a metadata lookup table; metadata is data about data, such as the location of a file. When new data is added, the metadata table is updated first, and only then is the data written to disk. When data is retrieved, the metadata table is searched to find the file's location. This metadata lookup slows down storage and retrieval, and if the metadata is lost, all data is lost. As the amount of data grows, the storage/retrieval mechanism becomes a bottleneck and performance degrades.

Ceph has no metadata lookup mechanism for object placement, which improves cluster performance. It is the duty of the CRUSH algorithm to calculate the location of an object, and it does so only when needed.
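A minimal sketch of the contrast this note describes, assuming entirely hypothetical data structures: a lookup-table design must consult (and protect) a central metadata table, whereas a calculated-placement design derives the location from the object name alone, which is the idea behind CRUSH, not its actual implementation.

# Hypothetical contrast: metadata-table lookup vs. calculated placement.
import hashlib

OSDS = ["osd.0", "osd.1", "osd.2", "osd.3"]

class MetadataTableStore:
    """Traditional design: a central table records every object's location."""
    def __init__(self):
        self.table = {}                        # grows with every object; must never be lost

    def write(self, name):
        self.table[name] = OSDS[len(self.table) % len(OSDS)]   # update the table first

    def locate(self, name):
        return self.table[name]                # every read consults the central table

def calculated_locate(name):
    """Ceph-style idea: derive the location from the name; no table to consult."""
    h = int(hashlib.sha256(name.encode()).hexdigest(), 16)
    return OSDS[h % len(OSDS)]

store = MetadataTableStore()
store.write("img-001")
print(store.locate("img-001"))       # depends on a table that can be lost or overloaded
print(calculated_locate("img-001"))  # any client computes this independently, on demand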