Unrevealed Story Behind Viettel Network Cloud Hotpot | Đặng Văn Đại, Hà Mạnh Đông, Nguyễn Tuấn Anh

Unrevealed Story
behind the Biggest Cloud in Vietnam
Đặng Văn Đại @daikk115 Viettel Networks
Hà Mạnh Đông @donghm Viettel Networks
Nguyễn Tuấn Anh @anhnt425 Viettel Networks

Image source: https://www.discoverlosangeles.com

Cloud Hotpot?
[Diagram labels: Client (Internal, VTNet, VTM, VTT, ...); Storage; Server; Switch]

Agenda
1. Viettel Networks Cloud Hotpot
2. Mixing Compute Resources
3. Mixing Storage Resources
4. Tuning Sensitive Points on OpenStack

1. Viettel Networks Cloud Hotpot

Viettel Networks Cloud Hotpot
The early days

Today's containerized cloud environment

Tech Stack
OpenStack Docker Prometheus EFK Grafana
Ceph Ansible Kolla StackStorm
Python, Go & Bash

❖ OpenStack projects:
➢ Nova, Neutron, Cinder, Keystone, Glance
➢ Octavia, Mistral, Magnum, Ironic, Heat
❖ External services developed by Viettel
➢ Notifications over SMS
➢ Auto healing for compute resources
➢ V2V, P2V tools

2. Mixing Compute Resources

Mixing Compute Resources
Dealing with different CPU models
❖ CPU models we have: SandyBridge, SandyBridge-IBRS, Broadwell, Skylake, ...
❖ OpenStack configurations

Dealing with different CPU models
❖ Compute nodes of the same model should run the same BIOS and firmware versions
❖ Check the CPU flags and the CPU model mapping in /usr/share/libvirt/cpu_map.xml
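One way to do the flag check across a mixed fleet is to intersect the flag sets reported by each host. This is a minimal sketch, assuming the `flags` line of /proc/cpuinfo has already been collected from every compute node. The function names, host names, and flag strings are made up for illustration and are not from the talk.

```python
def common_cpu_flags(flags_by_host):
    """Return the set of CPU flags that every host reports."""
    flag_sets = [set(flags.split()) for flags in flags_by_host.values()]
    if not flag_sets:
        return set()
    return set.intersection(*flag_sets)


def missing_flags(flags_by_host, required):
    """Map each host to the required flags it lacks.

    Hosts that have every required flag are left out of the result.
    """
    result = {}
    for host, flags in flags_by_host.items():
        lacking = set(required) - set(flags.split())
        if lacking:
            result[host] = lacking
    return result


# Hypothetical sample data (assumption, not measured on real hardware).
SAMPLE = {
    "compute-a": "fpu sse4_2 avx aes",
    "compute-b": "fpu sse4_2 avx aes avx2 spec_ctrl",
}
```

The common set is a starting point for choosing a CPU model that all nodes can expose to guests. The flags a named model actually requires still have to be read from cpu_map.xml.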

CPU Pinning
❖ Dedicated CPUs for virtual machines
❖ Supported configurations
➢ nova.conf:
[DEFAULT]
vcpu_pin_set=0,1,2,3,8,9,10,11,16,17,18,19,24,25,26,27
➢ # grubby --update-kernel=ALL --args="isolcpus=0,1,2,3,8,9,10,11,16,17,18,19,24,25,26,27"
➢ # cat /sys/devices/system/cpu/isolated
➢ # cat /sys/devices/system/cpu/present
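The kernel reports isolated CPUs as a range list (for example `0-3,8-11`), while the Nova setting above is written as individual numbers, so comparing them by eye is error-prone. A minimal sketch of that comparison follows; the helper names are hypothetical, and the inputs are assumed to be the raw strings read from nova.conf and from /sys/devices/system/cpu/isolated.

```python
def parse_cpu_list(text):
    """Parse a CPU list such as "0-3,8-11" or "0,1,2" into a set of ints."""
    cpus = set()
    for part in text.strip().split(","):
        part = part.strip()
        if not part:
            continue  # empty input, e.g. no CPUs are isolated
        if "-" in part:
            start, end = part.split("-", 1)
            cpus.update(range(int(start), int(end) + 1))
        else:
            cpus.add(int(part))
    return cpus


def pinning_mismatch(vcpu_pin_set, isolated):
    """Return CPUs listed in vcpu_pin_set that the kernel does not isolate."""
    return sorted(parse_cpu_list(vcpu_pin_set) - parse_cpu_list(isolated))
```

An empty result from `pinning_mismatch` means every CPU handed to Nova is also isolated from the host scheduler.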

CPU Pinning
❖ Flavor properties
➢ hw:cpu_policy=dedicated
➢ hw:cpu_thread_policy=*
■ require
■ isolate
■ prefer (default)
❖ Pinning CPUs for existing VMs
➢ virsh vcpupin instance-0000002e 0 5
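For a VM with several vCPUs, the `virsh vcpupin <domain> <vcpu> <cpu>` call has to be repeated once per vCPU. The sketch below only builds the command strings from a vCPU-to-host-CPU mapping. The function name and the mapping are illustrative assumptions, and the one-host-CPU-per-vCPU check is our own guard for dedicated pinning, not something virsh enforces.

```python
def vcpupin_commands(domain, vcpu_to_pcpu):
    """Build one `virsh vcpupin` command per vCPU, ordered by vCPU number."""
    used = list(vcpu_to_pcpu.values())
    if len(used) != len(set(used)):
        # For dedicated pinning we assume no two vCPUs share a host CPU.
        raise ValueError("each host CPU may be assigned to only one vCPU")
    return [
        f"virsh vcpupin {domain} {vcpu} {pcpu}"
        for vcpu, pcpu in sorted(vcpu_to_pcpu.items())
    ]
```

Pins applied directly through virsh bypass Nova, so it is worth reviewing the generated commands before running them.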

3. Mixing Storage Resources

Mixing Storage Resources
Schedule VMs to different storage backends
❖ Not all of our compute nodes have an HBA to connect to SAN storage
❖ Use host aggregates to mark which compute nodes have an HBA
➢ Metadata: hba=false or hba=true
❖ Flavor properties
➢ aggregate_instance_extra_specs:hba=true
➢ aggregate_instance_extra_specs:hba=false
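The matching between the flavor property and the aggregate metadata can be pictured as follows. This is a simplified model of what the scheduler's aggregate extra-specs filtering does, not Nova's actual code. The function name and data layout are assumptions made for the example.

```python
SCOPE = "aggregate_instance_extra_specs:"


def host_passes(flavor_extra_specs, aggregate_metadata):
    """Return True if every scoped flavor key matches the host's metadata.

    aggregate_metadata maps a metadata key to the set of values found
    across all aggregates the host belongs to.
    """
    for key, wanted in flavor_extra_specs.items():
        if not key.startswith(SCOPE):
            continue  # other properties, e.g. hw:cpu_policy, are ignored here
        name = key[len(SCOPE):]
        if wanted not in aggregate_metadata.get(name, set()):
            return False
    return True
```

In this model a host that is in no aggregate fails any flavor that asks for `hba`, which is why both `hba=true` and `hba=false` are set explicitly.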

Ceph
❖ One of our Ceph clusters mixes different OSD types and sizes, so the following need to be adjusted:
➢ CRUSH rules
➢ OSD weight
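By convention an OSD's CRUSH weight reflects its capacity, with 1.0 corresponding to 1 TiB, and within one CRUSH hierarchy data is distributed roughly in proportion to weight. The sketch below works through that arithmetic for mixed disk sizes. The function names and OSD sizes are illustrative assumptions, not values from the cluster described here.

```python
TIB = 2 ** 40


def crush_weight(size_bytes):
    """Capacity-based CRUSH weight: 1.0 per TiB of disk."""
    return round(size_bytes / TIB, 5)


def expected_share(weights):
    """Fraction of data each OSD is expected to hold, given its weight."""
    total = sum(weights.values())
    return {osd: weight / total for osd, weight in weights.items()}
```

With one 1 TiB and one 3 TiB OSD under the same rule, the larger disk is expected to receive about three quarters of the data, which is the reason mixed sizes usually call for a weight review.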

Ceph
❖ Non-RAID mode for OSD disks
❖ Three networks: public-net, replica-net and monitoring-net; all use bonding mode 4 (802.3ad)
❖ With HDD OSDs: use SSD for RocksDB + WAL
➢ Create SSD partitions (LVM)
➢ 40 GB of SSD per HDD OSD (RocksDB level 1 only)

SAN
❖ Using FC (Fibre Channel)
❖ Fabric mode (Storage controller <-> Switch
<-> Server)
❖ Redundancy: multipath
➢ Update the HBA (qlogic/emulex/...) configuration to fit all SAN storage systems
➢ Update multipath.conf for multiple SAN storage systems if needed

4. Tuning Sensitive Points in OpenStack

Tuning Sensitive Points in OpenStack
HAProxy
❖ By default, HAProxy does not accept more than 2000 established connections for one backend per thread
❖ In our case, when HAProxy stopped accepting new connections to the MariaDB backend, most of the services that connect to MariaDB through HAProxy went down

HAProxy
❖ Raising the connection limit
➢ Increase maxconn
➢ Use multi-threading for HAProxy
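To pick a value for maxconn it helps to estimate how many database connections the services can open at peak. A rough upper bound is the number of worker processes times the connection pool size plus overflow, summed over all services. The sketch below does that arithmetic. The field names follow common connection-pool settings, and the sample numbers are assumptions, not figures from this deployment.

```python
def peak_db_connections(services):
    """Upper bound on DB connections: workers * (pool + overflow), summed."""
    return sum(
        s["workers"] * (s["max_pool_size"] + s["max_overflow"])
        for s in services
    )


def exceeds_limit(services, maxconn):
    """True if the estimated peak would hit the proxy's connection limit."""
    return peak_db_connections(services) > maxconn


# Hypothetical services and pool settings, for illustration only.
SERVICES = [
    {"workers": 8, "max_pool_size": 5, "max_overflow": 50},
    {"workers": 4, "max_pool_size": 10, "max_overflow": 20},
]
```

The estimate counts one node only; with several controller nodes behind the same HAProxy backend the totals add up.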

HAProxy
❖ Increase the HAProxy timeout for cinder-api when creating multiple VMs simultaneously on SAN
❖ Related to timeouts: the RPC timeouts for Nova and Cinder also need to be increased

RabbitMQ
❖ Some OpenStack services publish messages to queues that do not have any consumer
➢ Need to set a TTL for them
❖ OpenStack Ceilometer with the notification bus can open too many connections to RabbitMQ
➢ Use a separate RabbitMQ cluster for Ceilometer
➢ Or remove Ceilometer and all Telemetry services, as we did

RabbitMQ
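❖ Set a TTL for the queues (5400000 ms = 90 minutes); each policy needs its own name, otherwise the second call replaces the first
# rabbitmqctl set_policy TTL-notifications "^notifications\." '{"message-ttl":5400000}' --priority 1 --apply-to queues -p "/"
# rabbitmqctl set_policy TTL-versioned-notifications "^versioned_notifications\." '{"message-ttl":5400000}' --priority 1 --apply-to queues -p "/"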
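RabbitMQ policy patterns are regular expressions matched against queue names, and `message-ttl` is given in milliseconds. The sketch below checks which queues a pattern would cover and converts minutes to milliseconds. Python's `re` module stands in for RabbitMQ's own regex engine here, which is an assumption that holds for simple patterns like these. The queue list is illustrative.

```python
import re


def minutes_to_ms(minutes):
    """Convert a TTL in minutes to the milliseconds RabbitMQ expects."""
    return int(minutes * 60 * 1000)


def matching_queues(pattern, queue_names):
    """Return the queue names that an unanchored regex search matches."""
    return [name for name in queue_names if re.search(pattern, name)]


# Illustrative queue names (assumption).
QUEUES = [
    "notifications.error",
    "notifications.info",
    "versioned_notifications.error",
    "versioned_notifications.info",
    "conductor",
]
```

An unanchored pattern such as `notifications*` also matches the `versioned_notifications.*` queues, because the search may start anywhere in the name. Anchoring with `^` keeps each policy limited to its own queues.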

RabbitMQ
❖ Purge all messages from the queues if needed
$ rabbitmqctl purge_queue notifications.error -p "/"
$ rabbitmqctl purge_queue versioned_notifications.error -p "/"
$ rabbitmqctl purge_queue versioned_notifications.info -p "/"

Thank You for Listening!
Any questions?
Be Part of Our Story!

Appendix
❖ Live migration problems

Live migration problems
Bugs
❖ 100% CPU usage, virtual machine hangs
➢ Occurs rarely
➢ Not fixed yet
➢ Observed on a VM with 30 vCPUs

❖ VM live migration aborts due to high memory usage
➢ Occurs sometimes
➢ Solution
■ Force the live migration to complete
● nova live-migration-force-complete <instance_id> <migration_id>
■ Slow down the guest CPU
● set live_migration_permit_auto_converge=true in the [libvirt] section of nova.conf
Reference: https://docs.openstack.org/nova/pike/admin/live-migration-usage.html
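Why slowing down the guest helps can be seen with a simple model of pre-copy migration: each round copies the memory that is still outstanding, and whatever the guest dirties during that copy becomes the next round's work. The sketch below is a deliberately simplified model with made-up numbers, not how QEMU or Nova compute anything. The function name, the stop threshold, and the round limit are assumptions.

```python
def precopy_rounds(memory_mb, dirty_mb_s, bandwidth_mb_s,
                   stop_mb=50, max_rounds=30):
    """Return the number of copy rounds until at most stop_mb remains.

    Returns None if the remaining data never drops to stop_mb within
    max_rounds, i.e. the migration does not converge in this model.
    """
    remaining = float(memory_mb)
    for round_no in range(1, max_rounds + 1):
        seconds = remaining / bandwidth_mb_s   # time to copy what is left
        remaining = dirty_mb_s * seconds       # memory dirtied meanwhile
        if remaining <= stop_mb:
            return round_no
    return None
```

In this model the migration converges only when the dirty rate is below the transfer bandwidth. Auto-converge throttles the guest CPU, which lowers the dirty rate and moves a busy VM back into the converging case.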

❖ "Unacceptable CPU info: CPU doesn't have compatibility"
➢ Occurs sometimes
➢ Solution
■ Set cpu_mode = custom (with a cpu_model that all nodes support) in nova.conf
■ Use host aggregates to schedule VMs onto compatible compute nodes
