How an ARM-Based Microserver Cluster Performs in Ceph
1
晨宇創新
Aaron 周振倫
Agenda
• About Ambedded
• The Issues of Using a Single Server Node with Multiple Ceph OSDs
• Using One ARM Microserver to Host Only One Ceph OSD
• The Benefits
• The Basic High Availability Ceph Cluster
• Scale It Out
• Does the Network Matter?
• How Fast Can It Self-Heal a Failed OSD?
• Ambedded Makes Ceph Easy
• How Much You Can Save on Energy
2
About Ambedded Technology
• 2013: Founded in Taipei, Taiwan; office in the National Taiwan University Innovation & Incubation Center
• 2014: Launched the Gen 1 microserver-architecture storage server product; demoed at the ARM Global Partner Meeting in Cambridge, UK
• 2015: Partnership with a European customer for a cloud storage service; 1,500+ microservers and 5.5 PB installed and in operation since 2014
• 2016: Launched the first-ever Ceph storage appliance powered by the Gen 2 ARM microserver; awarded 2016 Best of Interop Las Vegas in storage, beating VMware Virtual SAN
3
Issues of Using a Single Server Node with Multiple Ceph OSDs
• The smallest failure domain is the whole set of OSDs inside a server: one server failure takes many OSDs down.
• CPU utilization is only 30%-40% when the network is saturated; the bottleneck is the network, not computing (see the back-of-the-envelope sketch after this slide).
• The power consumption and thermal heat are eating your money.
4
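A rough back-of-the-envelope check of the network-bottleneck point, using assumed figures that are not from the slide (12 HDD OSDs per server at roughly 150 MB/s each, behind a single 10 Gb front-side NIC):

```python
# Rough sketch: aggregate disk bandwidth vs. a single 10 Gb NIC.
# All figures below are assumptions for illustration, not measurements from the slide.
osds_per_server = 12
hdd_mb_s = 150                       # assumed sequential throughput per HDD
nic_gbps = 10

disk_bw = osds_per_server * hdd_mb_s         # ~1800 MB/s the disks could deliver
nic_bw = nic_gbps * 1000 / 8                 # ~1250 MB/s the NIC can carry

print(f"Disks: {disk_bw} MB/s, NIC: {nic_bw:.0f} MB/s")
print(f"The NIC covers only {nic_bw / disk_bw:.0%} of the disk bandwidth,")
print("so CPUs and disks sit partly idle once the network saturates.")
```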
One OSD with One Microserver
[Diagram: three ARM microserver clusters, each with one microserver (MS) per OSD and 40Gb aggregated uplinks, compared with three traditional servers (N OSDs each) sharing 10Gb uplinks to the clients over the network]

ARM microserver cluster
- 1-to-1 mapping reduces the failure risk
- Aggregated network bandwidth without a bottleneck

Traditional server
- 1-to-many mapping means a higher risk when a server fails
- CPU utilization is low due to the network bottleneck
5
The Benefits of Using a 1-Node-to-1-OSD Architecture in Ceph
• Truly no single point of failure
• The smallest failure domain is one OSD
• The MTBF of a microserver is much higher than that of an all-in-one motherboard
• Dedicated hardware resources for a stable OSD service
• Aggregated network bandwidth with failover
• Low power consumption and cooling cost
• OSD, MON, and gateway all run in the same boxes
• 3 units form a high availability cluster
6
Mars 200: 8-Node ARM Microserver Cluster
8x 1.6 GHz ARMv7 dual-core hot-swappable microservers, each with:
- 2 GB DRAM
- 8 GB flash
- 5 Gbps LAN
- < 5 W power consumption
Storage
- 8x hot-swappable SATA3 HDDs/SSDs
- 8x SATA3 journal SSDs
300 W redundant power supply
OOB BMC port
Dual hot-swappable uplink switches
- Total 4x 10 Gbps
- SFP+/10GBASE-T combo
7
The Basic High Availability Cluster
Scale it out
8
Scale Out Test (SSD)
4K random IOPS vs. number of OSDs:

OSDs   4K Random Read IOPS   4K Random Write IOPS
  7           62,546                  8,955
 14          125,092                 17,910
 21          187,639                 26,866
9
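Dividing the published IOPS by the OSD count gives an essentially constant per-OSD rate, which is what near-linear scale-out looks like; a small sketch recomputing it from the table above:

```python
# Per-OSD IOPS from the scale-out test above; a constant per-OSD rate
# indicates near-linear scaling as OSDs are added.
results = {7: (62_546, 8_955), 14: (125_092, 17_910), 21: (187_639, 26_866)}

for osds, (read_iops, write_iops) in results.items():
    print(f"{osds:>2} OSDs: {read_iops / osds:,.0f} read IOPS/OSD, "
          f"{write_iops / osds:,.0f} write IOPS/OSD")
# -> roughly 8,935 read and 1,279 write IOPS per OSD at every cluster size
```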
The Network Does Matter
16x OSDs, 4K random write, 20Gb vs. 40Gb uplink (BW in MB/s):

                       20Gb uplink        40Gb uplink       IOPS
                       BW      IOPS       BW      IOPS      increase
4K Write, 1 client     7.2     1,800      11      2,824     57%
4K Write, 2 clients    13      3,389      20      5,027     48%
4K Write, 4 clients    22      5,570      35      8,735     57%
4K Write, 10 clients   39      9,921      60     15,081     52%
4K Write, 20 clients   53     13,568      79     19,924     47%
4K Write, 30 clients   63     15,775      90     22,535     43%
4K Write, 40 clients   68     16,996      96     24,074     42%

The purpose of this test is to measure how much performance improves when the uplink bandwidth is increased from 20Gb to 40Gb. The Mars 200 has 4x 10Gb uplink ports. The result is a 42-57% improvement in IOPS.
10
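The "increase" column follows directly from the two IOPS columns, and the bandwidth figures are consistent with IOPS times the 4 kB block size; a quick recomputation from the table above:

```python
# Recompute the IOPS increase from the 20Gb/40Gb table above.
clients   = [1, 2, 4, 10, 20, 30, 40]
iops_20gb = [1_800, 3_389, 5_570, 9_921, 13_568, 15_775, 16_996]
iops_40gb = [2_824, 5_027, 8_735, 15_081, 19_924, 22_535, 24_074]

for c, slow, fast in zip(clients, iops_20gb, iops_40gb):
    print(f"{c:>2} clients: {fast / slow - 1:.0%} more IOPS with the 40Gb uplink")
# Bandwidth column check: BW[MB/s] is roughly IOPS * 4 kB / 1000, e.g. 1,800 * 4 / 1000 = 7.2.
```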
OSD Self-Heal vs. RAID Re-build
11
Test condition                   Microserver Ceph cluster            Disk array
Disk number / capacity           16 x 10 TB OSDs                     16 x 3 TB disks
Data protection                  Replica = 2                         RAID 5
Data stored on the failed disk   3 TB                                Not relevant
Time to re-heal / rebuild        5 hours 10 minutes                  41 hours
Administrator involvement        Re-heal starts automatically        Rebuild starts only after a new disk is installed
Re-heal vs. rebuild scope        Only the lost data is re-healed     The whole disk capacity is rebuilt
Recovery time vs. disk count     More disks -> shorter recovery      More disks -> longer recovery
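The recovery times in the table imply very different effective recovery rates, because Ceph re-heals only the lost data across many surviving OSDs while RAID 5 rewrites a single replacement disk; a small sketch of that arithmetic:

```python
# Effective recovery throughput implied by the table above.
TB = 1e12                                   # decimal terabytes, in bytes

ceph_seconds = 5 * 3600 + 10 * 60           # 5 h 10 min to re-heal 3 TB of lost data
raid_seconds = 41 * 3600                    # 41 h to rebuild a whole 3 TB disk

print(f"Ceph self-heal: {3 * TB / ceph_seconds / 1e6:.0f} MB/s aggregate "
      "(recovery is spread across many surviving OSDs)")
print(f"RAID 5 rebuild: {3 * TB / raid_seconds / 1e6:.0f} MB/s "
      "(limited by writing a single replacement disk)")
```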
Ceph Storage Appliance
12
ARM microserver + Ceph + Unified Virtual Storage Manager = Ceph Storage Appliance
- 2U 8-node model: front-panel disk access
- 1U 8-node model: high density
We Make Ceph Simple
Unified Virtual Storage Manager (UniVir Store)
- Dashboard, Cluster Manager, CRUSH Map
13
What You Can Do with UniVir Store
• Deploy OSDs, MONs, and MDSs
• Create pools, RBD images, iSCSI LUNs, and S3 users (equivalent Ceph operations are sketched after this list)
• Support for replica (1-10) and erasure-coded (K+M) pools
• OpenStack backend storage management
• Create CephFS
• Snapshot, clone, and flatten images
• CRUSH map configuration
• CephX user access rights management
• Scale out your cluster
14
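UniVir Store is a web GUI, but the operations it exposes correspond to standard Ceph calls. A minimal sketch of the pool and RBD-image steps using the upstream python-rados and python-rbd bindings rather than Ambedded's own tooling; the pool and image names are placeholders:

```python
# Minimal sketch with the upstream Ceph Python bindings (python3-rados, python3-rbd).
# 'mypool' and 'vol01' are placeholder names, not anything from the slides.
import rados
import rbd

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
try:
    if not cluster.pool_exists('mypool'):
        cluster.create_pool('mypool')            # replication follows the cluster defaults
    ioctx = cluster.open_ioctx('mypool')
    try:
        rbd.RBD().create(ioctx, 'vol01', 10 * 1024**3)   # 10 GiB RBD image
    finally:
        ioctx.close()
finally:
    cluster.shutdown()
```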
How Much You Can Save on Energy
(200 W - 60 W) x 24 h x 365 days / 1000 x $0.20 per kWh x 40 units x 2 (power & cooling)
= $19,622 per rack per year
This electricity cost is based on the Taiwan rate; it could be double or triple in Japan or Germany.
15
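The slide's arithmetic checks out; a short reproduction, where the 200 W vs. 60 W per-unit figures, the $0.20/kWh rate, and the x2 power-and-cooling multiplier are the slide's own assumptions:

```python
# Reproduce the energy-saving estimate from the slide above.
watts_saved    = 200 - 60        # per unit, traditional server vs. microserver (slide's figures)
hours_per_year = 24 * 365
usd_per_kwh    = 0.2             # Taiwan electricity rate used on the slide
units_per_rack = 40
power_and_cooling = 2            # the slide's x2 multiplier

savings = watts_saved * hours_per_year / 1000 * usd_per_kwh * units_per_rack * power_and_cooling
print(f"Estimated savings: ${savings:,.0f} per rack per year")   # -> $19,622
```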
16
Aaron 周振倫
aaron@ambedded.com.tw
晨宇創新股份有限公司 (Ambedded Technology Co., Ltd.)

Editor's Notes

• #5 It will take a very long time to re-heal multiple failed OSDs. Everyone takes the power consumption for granted because they have no other choice.
• #12 The rise and fall of technology and the market.