Presentation from OpenStack Summit Tokyo
Online video link is below.
https://www.openstack.org/summit/tokyo-2015/videos/presentation/approaching-open-source-hyper-converged-openstack-using-40gbit-ethernet-network
1. Copyright 2015 Bit-isle Inc. All Rights Reserved
Approaching
Open-Source Hyper-Converged OpenStack
using 40Gbit Ethernet Network
Ikuo Kumagai – Bit-isle Inc.
Yuki Kitajima – Altima Corp.
Special Thanks
Masayoshi Oka - Netone systems
Background
The Services of Bit-isle
‣ IDC service
▪ iDCs
- 5 iDCs in the Tokyo metropolitan area and 1 iDC in Osaka.
▪ Network connectivity
- Provides Internet or private network connectivity.
▪ Rental service
- Server and network equipment rentals are available for colocation.
▪ Managed service
- Offers a fully managed environment across on-premises and colocation in the data center.
‣ Cloud services are also offered
▪ They are not in today's topic.
Hyper-converged infrastructure – our needs
Elements of hyper-convergence
‣ Structure as simple as possible
‣ Deployment as rapid as possible
‣ Integrated management
‣ Scalability as flexible as possible
Our Concept
‣ No special appliances
‣ No proprietary products
Our Goals ① Providing easily
【Goal】
‣ Short lead time at low cost
▪ Supply as fast as possible
▪ Stock as little as possible
▪ Cost as low as possible
‣ Scale easily
▪ Easy to deploy physical machines
▪ Easy to deploy logical components
【Method】
‣ Physical systems as simple as possible
▪ Simple: 1U servers (our service base)
▪ Only two switch systems
▪ Using a Ceph cluster
Basic Structure
Network devices
‣ 1 × 40G network for all services
‣ 1 × 1G network for IPMI
OpenStack nodes
‣ 1 control and network node
‣ 5 compute and storage nodes
Deployment node
‣ Juju/MAAS server
[Diagram: a router and the CTRL/NW node, five Compute/OSD nodes, and the MAAS/Juju deployment node, attached to the OpenStack segment and the IPMI segment]
Our Goals ② Performance
【Goal】
‣ Much higher performance
▪ The base is provided as open source.
▪ Specific application options are for profit.
【Method】
▪ Basic servers (specs upgrade linearly)
▪ Using 40Gbit/56Gbit switches
▪ PCIe SSD (for Ceph journal & OSD)
Compute & Storage Server
[Diagram: compute & storage servers connected by 40Gb Ethernet, each running a KVM hypervisor hosting several VMs; the Ceph cluster spans the nodes, with SSD(Journal) and SSD(OSD) devices on each]
Server fundamentals
‣ Server: HP ProLiant DL360 Gen9
‣ CPU: E5-2690v3 2.60GHz 1P/12C × 2
‣ HDD: SAS 1TB × 2 (RAID1 for OS)
‣ PCIe SSD: Fusion-io ioDrive 320GB × 1 (1 for journal, 1 for OSD)
‣ 40Gbps NIC: Mellanox ConnectX-3 Pro
Server & Storage
‣ Server: HP DL360 Gen9
‣ PCIe SSD: Fusion-io ioDrive Duo 320GB
Network Device
HW selection
‣ Adapter (NIC)
▪ 10/40/56GbE
▪ RDMA supported
▪ VXLAN offload supported
‣ Switch
▪ 36 × QSFP ports
▪ 48 × SFP+ ports + 12 × QSFP ports
▪ 12 × QSFP ports (48 × SFP+ via breakout cable)
▪ 10/40/56GbE
▪ 220 ns low latency
▪ Best suited for an SDS network
Our Goals ③ Knowledge Sharing
【Goal】
‣ Easy to customize
‣ Deploy servers more easily
‣ Share knowledge
【Method】
‣ Using Juju/MAAS (open-source deployment tools)
Deploy
Using Juju/MAAS
‣ Node setup (by local charm)
▪ Installing the OS
▪ Installing device drivers & network settings
- For the 40G NIC & PCIe SSD drivers
‣ Deploy Ceph and OpenStack components
cs:trusty/ntp
cs:trusty/ceph
cs:trusty/ceph-osd
cs:trusty/rsyslog
cs:trusty/rsyslog-forwarder-ha
local:trusty/nova-compute
cs:trusty/percona-cluster
cs:trusty/rabbitmq-server-32
cs:trusty/keystone
local:trusty/openstack-dashboard
local:trusty/nova-cloud-controller
cs:trusty/neutron-api
cs:trusty/neutron-gateway
cs:trusty/cinder
cs:trusty/glance
cs:trusty/cinder-ceph
cs:trusty/neutron-openvswitch
cs:trusty/hacluster
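Deployed with Juju, the charm list above would typically be captured in a bundle file; a minimal sketch covering a few of the services (unit counts, placement, and the single relation shown are assumptions not stated in the deck):

```yaml
series: trusty
services:
  ceph:
    charm: cs:trusty/ceph
    num_units: 3          # unit count assumed
  ceph-osd:
    charm: cs:trusty/ceph-osd
    num_units: 5          # one per compute/storage node, per the basic structure
  nova-compute:
    charm: local:trusty/nova-compute
    num_units: 5
  keystone:
    charm: cs:trusty/keystone
relations:
  - ["nova-compute", "ceph"]
```

A full bundle would add the remaining charms and relations, after which `juju deployer` (or `juju quickstart`) can bring up the whole stack in one step.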
Test Items (Network)
[Diagram: VMs on Compute Node-1 and Compute Node-2, connected through OVS with a VXLAN tunnel between the nodes]
‣ VM to VM between physical nodes
‣ 1 – 16 VMs per physical node
‣ Measured with iperf3 (TCP & UDP)
Basic Performance: TCP Bandwidth
Total & average performance (iperf3 defaults)

Bandwidth (Gbit/s)
VM pairs   1-1    2-2    4-4    8-8    16-16
Total      2.05   3.18   5.74   7.18   10.53
Average    2.05   1.59   1.43   0.90   0.66
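The reported averages are (up to rounding) simply the measured total divided by the number of VM pairs; a quick check in Python, using the totals from the table above:

```python
# Measured TCP totals from the table (Gbit/s), keyed by VM pairs per node.
totals = {1: 2.05, 2: 3.18, 4: 5.74, 8: 7.18, 16: 10.53}

# Per-VM-pair average bandwidth = total / number of pairs.
averages = {pairs: round(total / pairs, 2) for pairs, total in totals.items()}
print(averages)
```

The computed values match the Average row to within rounding (e.g. 10.53 / 16 ≈ 0.66 Gbit/s per pair), confirming that per-VM bandwidth falls as the VM count rises even though the aggregate grows.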
Basic Performance (UDP Bandwidth by packet size)
[Charts: total and average UDP bandwidth by packet size]
Basic Performance (UDP Latency by packet size)
[Charts: latency (jitter) and lost packets by packet size]
Test Items (IOPS)
[Diagram: fio running in VMs on Compute Node-1/2/3, accessing the Ceph cluster over the network; each node contributes SSD(Journal) and HDD(OSD) devices]
‣ fio (8k, 100 jobs)
‣ 1 – 16 VMs (1, 2, or 4 VMs per host; host count: 1 – 4)
Basic Performance of Storage (Bandwidth)
Bandwidth (8k, MByte/sec)
‣ Total
‣ Average
【FYI】 fio parameters
bs=8k size=10M runtime=60 iodepth=32 numjobs=80 group_reporting
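The parameters above correspond to a fio job file like the following; the access pattern, IO engine, and target path are assumptions, since the slide lists only the parameters shown:

```ini
[ceph-bench]
bs=8k
size=10M
runtime=60
iodepth=32
numjobs=80
group_reporting
; the following are assumed, not stated on the slide
rw=randwrite
ioengine=libaio
directory=/mnt/rbd-test
```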
Basic Performance of Storage (IOPS)
IOPS (8k block size)
‣ Total
‣ Average
【FYI】 fio parameters
bs=8k size=10M runtime=60 iodepth=32 numjobs=80 group_reporting
Counter Plan
To use the 40Gbit network more effectively:
‣ Network performance improvement
▪ Using VXLAN offload
- Offloads the CPU workload of VXLAN processing
▪ Using DPDK
- Reduces the network-function cost of the Linux kernel
‣ Ceph IO performance improvement
▪ Using Ceph RDMA
- Enables remote direct memory access over Ethernet for the storage cluster
VXLAN offload
OVS + normal NIC [general understanding]
‣ VXLAN processing is handled by OVS.
‣ This means the CPU does the packet processing for VXLAN packets.
‣ A normal NIC can NOT take care of:
▪ Checksum, TSO, RSS, etc.
VXLAN offload
What is VXLAN offload?
‣ Offloads the VXLAN protocol at the edge point (NIC)
‣ The VXLAN offload engine enables TCP/IP offload
▪ Enables checksum, TSO, RSS, GRO
‣ More throughput, lower latency, and less CPU usage
‣ The VM generates the inner packet; OVS generates the outer packet.
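Whether a NIC exposes VXLAN offload can be checked from `ethtool -k <dev>` output, where the relevant feature flag is `tx-udp_tnl-segmentation`. A minimal parser sketch; the sample output below is hypothetical, not captured from the deck's hardware:

```python
# Hypothetical `ethtool -k <dev>` output; real NICs print many more lines.
SAMPLE_ETHTOOL_K = """\
tx-checksumming: on
tcp-segmentation-offload: on
tx-udp_tnl-segmentation: on
rx-checksumming: on
"""

def vxlan_offload_enabled(ethtool_output: str) -> bool:
    """Return True if the UDP tunnel (VXLAN) segmentation offload flag is on."""
    for line in ethtool_output.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "tx-udp_tnl-segmentation":
            return value.strip().startswith("on")
    return False

print(vxlan_offload_enabled(SAMPLE_ETHTOOL_K))
```

In practice the output would come from running `ethtool -k` against the ConnectX-3 Pro interface rather than a literal string.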
VXLAN offload
HW selection

Model      MCX311A-XCCT    MCX312B-XCCT   MCX313A-BCCT        MCX314A-BCCT
Ports      Single 10GbE    Dual 10GbE     Single 10/40/56GbE  Dual 10/40/56GbE
Port type  SFP+            SFP+           QSFP                QSFP
Cable      Copper, optical (all models)
Host bus   PCIe 3.0 x8 (all models)
Features   VXLAN/NVGRE offload, RDMA, SR-IOV, etc. (all models)
OS         RHEL, SLES, Microsoft Windows Server, FreeBSD, Ubuntu, VMware ESXi
VXLAN offload result (TCP Bandwidth)
Comparing VXLAN offload with normal

Bandwidth (Gbit/s) – VXLAN offload
VM pairs   1-1     2-2     4-4     8-8     16-16
Total      14.40   21.70   30.00   31.43   24.63
Average    14.40   10.85   7.50    3.93    1.54

Bandwidth (Gbit/s) – normal
VM pairs   1-1    2-2    4-4    8-8    16-16
Total      2.05   3.18   5.74   7.18   10.53
Average    2.05   1.59   1.43   0.90   0.66
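Dividing the offload totals by the normal totals gives the speedup at each scale; a quick computation from the two tables (values as printed on the slides):

```python
# TCP totals from the slides (Gbit/s), with and without VXLAN offload.
offload = {1: 14.40, 2: 21.70, 4: 30.00, 8: 31.43, 16: 24.63}
normal = {1: 2.05, 2: 3.18, 4: 5.74, 8: 7.18, 16: 10.53}

# Speedup factor per VM-pair count.
speedup = {pairs: round(offload[pairs] / normal[pairs], 1) for pairs in offload}
print(speedup)
```

The gain shrinks as the VM count grows (roughly 7.0× at 1-1 down to roughly 2.3× at 16-16), which is what makes the bottleneck question worth asking.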
Virtualization bottleneck

Bandwidth (Gbit/s) – VXLAN offload
VM pairs   1-1     2-2     4-4     8-8     16-16
Total      14.40   21.70   30.00   31.43   24.63
Average    14.40   10.85   7.50    3.93    1.54

Why does the total drop at 16-16?
How to use CPU
Allocate CPU cores to DPDK explicitly.
[Diagram: two processors with four cores each; some cores run the Linux kernel (control plane), while others are dedicated to DPDK data-plane threads]
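With DPDK, the data-plane cores are handed to the EAL explicitly, e.g. as a core list (`-l 1-3`) or a hex coremask (`-c 0xe`). A sketch of deriving the mask; the core numbers are illustrative only, not the layout used in this deck:

```python
def coremask(cores):
    """Fold a list of CPU core IDs into the hex coremask DPDK's -c flag expects."""
    mask = 0
    for core in cores:
        mask |= 1 << core  # set the bit for each dedicated data-plane core
    return hex(mask)

print(coremask([1, 2, 3]))  # cores 1-3 -> 0xe
```

Keeping core 0 out of the mask leaves it for the Linux control plane, matching the split shown in the diagram.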
Network Bottleneck by the Linux Kernel Stack
‣ DPDK: Data Plane Development Kit
[Diagram: in a general process, packets pass through the device driver, the kernel, system calls, and packet copies before reaching the application; with DPDK, the application drives the device directly through the DPDK library, bypassing the kernel]
DPDK
‣ 1-to-1 performance
‣ N-to-N performance
To be verified next month.
Ceph RDMA
What is RDMA?
‣ Remote DMA (direct memory access)
‣ Zero-copy technology
‣ Protocols
▪ iSER, RoCE, iWARP
[Diagram: a general process goes through the Socket API, the kernel TCP/socket stack, and the device driver; an RDMA process uses the RDMA verbs API to reach the device directly]
Ceph RDMA
An RDMA network suits flash storage.
RDMA advantages for Ceph
‣ Reduces the hypervisor's CPU workload for IO transactions
‣ Much faster IO for east-west traffic and fail-over (fail-back)
‣ Higher throughput and IOPS
[Latency comparison: total 45 µs vs total 25.7 µs with RoCE]
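From the figures on the slide, the relative saving can be computed directly (assuming the 45 µs total is the non-RDMA path, which the slide implies but does not label):

```python
# Round-trip totals from the slide, in microseconds.
tcp_usec = 45.0    # assumed: the non-RDMA (TCP) path
roce_usec = 25.7   # RoCE path

# Percentage of time saved per IO round trip.
reduction_pct = round((tcp_usec - roce_usec) / tcp_usec * 100, 1)
print(reduction_pct)
```

That is roughly a 43% reduction in per-IO latency, which is where the throughput and IOPS gains claimed above would come from.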
Ceph RDMA
Ceph supports RDMA
‣ v0.94 "Hammer" released
https://ceph.com/releases/v0-94-hammer-released/
http://tracker.ceph.com/projects/ceph/wiki/Accelio_RDMA_Messenger
‣ Function: XioMessenger
‣ Library: Accelio
Ceph RDMA Result
For reference purposes only
‣ 3-node Ceph cluster, with fio accessing RBD directly
‣ Bandwidth
‣ IOPS
Summary
‣ VXLAN offload is one of the effective solutions.
‣ The other solutions require continued verification.
To be continued.
Next Plan
More performance
‣ Network workload offload
‣ Increased memory (DIMM NAND flash)
‣ NVMe SSD and DIMM storage
Flexible scaling
‣ Scale the Internet gateway (SDN or NFV)
‣ Multi-region scaling