
CodiLime Tech Talk - Michal Ploski: Creation of a SDWAN performance test infrastructure based on bare-metals


Tech Talk CodiLime 31.01.2018 DevOps by Example

Michał Płoski - Creation of a SDWAN performance test infrastructure based on bare-metals

You can find the recording here: https://youtu.be/W68UizLJDcg



  1. 1. DevOps by Example 31.01.2018
  2. 2. Creation of a SDWAN performance test infrastructure based on bare-metals Michał Płoski
  3. 3. Agenda ● General SDWAN architecture ● What is the problem? ● Bootstrapping ● Provisioning ● Images management ● Statistics ● Monitoring ● Machines tuning ● What does the future bring?
  4. 4. SDWAN Architecture (diagram: controller exposing dashboards and an API for device, network and user management; an overlay network built on top of the underlay network and the Internet)
  5. 5. What is the actual problem? We need to be able to deploy around 2000 routers. Old way: 1.5 weeks of work (60 working hours). New way: around 1.5 hours! Boost: 40 times!
  6. 6. What is the actual problem? - Assumptions ● Bare-metals ● Infrastructure as code ● Limited time ● Monitoring :-)
  7. 7. What is the actual problem? - Domain of the problem Split the problem into smaller pieces: 1. Bootstrapping 2. Provisioning 3. Monitoring / Statistics + system tuning
  8. 8. Bootstrapping - Automate stuff
  9. 9. Bootstrapping - Automate stuff Problem: How can we deploy several dozen machines with one click? We need a tool that ● Allows PXE usage ● Hides the complexity (DHCP, DNS, TFTP, HTTP) ● Can be integrated via API (not GUI only) ● Is easy to set up and use
  10. 10. Bootstrapping - 1st stage Cobbler in 2 minutes: a Linux installation server that allows rapid setup of network installation environments. The user needs to configure a few Cobbler elements: ● Distro ● Profile ● System
  11. 11. Bootstrapping - 1st stage Cobbler in 2 minutes
  12. 12. Bootstrapping - 2nd stage So we have installed the base OS… But how do we do it for dozens of machines? We don’t want to do this via a graphical user interface...
  13. 13. HOST Bootstrapping - 2nd stage 1
  14. 14. HOST Bootstrapping - 2nd stage 2
      variable "lime_host_interface1_config" {
        description = "First lime host interface configuration"
        type        = "map"
        default = {
          "1" = "87:d3:cd:1c:4c:3a"
          "2" = "87:d3:cd:1c:61:b3"
          "3" = "87:d3:cd:1c:43:e1"
          "4" = "87:d3:cd:1c:21:c3"
          "5" = "87:d3:cd:1c:23:54"
          "6" = "87:d3:cd:1c:54:89"
        }
      }
      resource "cobbler_distro" "lime_distro" { ... }
      resource "cobbler_kickstart_file" "lime_base_os_kickstart" { ... }
      resource "cobbler_profile" "lime_profile" { ... }
      resource "cobbler_system" "lime_base_machine" { ... }
  15. 15. HOST Bootstrapping - 2nd stage 3
      Intel(R) Boot Agent XE v2.1.40
      CLIENT MAC ADDR: 90 E2 BA A1 9F 10  GUID: B4BB4732 B25A D911 9937 386C1F4F2300
      CLIENT IP: 10.10.1.100  MASK: 255.255.255.0  DHCP IP: 10.12.5.150  GATEWAY IP: 10.10.1.254
      PXELINUX 4.05 0x54f93f16  Copyright (C) 1994-2011 H. Peter Anvin et al
      !PXE entry point found (we hope) at 974F:0106 via plan A
      My IP address seems to be 0A220129 10.34.1.41
      ip=10.10.1.100:10.12.5.150:10.10.1.254:255.255.255.0
      BOOTIF=01-90-e2-ba-a1-9f-74 SYSUUID=b4bb4732-b25b-d911-9937-387c1f4f3300
      TFTP prefix: /
      Trying to load: pxelinux.cfg/01-90-e2-ba-a1-9f-10  ok
      Loading /images/ubuntu-16.04.2-server-x86_64/linux...
      Loading /images/ubuntu-16.04.2-server-x86_64/initrd.gz...
  16. 16. HOST Bootstrapping - 2nd stage 4
      provisioner "remote-exec" {
        connection {
          type    = "ssh"
          user    = "root"
          host    = "${cidrhost("10.12.5.0/24", element(keys(var.lime_host_interface1_config), count.index))}"
          timeout = "40m"
        }
        inline = ["echo 'just checking for ssh. ttyl. bye.'"]
      }
  17. 17. Bootstrapping - 3rd stage Problem: Password/access management What must we take into account? ● Cobbler API ● IPMI We don’t want to pass passwords via the command line every time or keep them in a plain-text configuration file. Solution: Hashicorp Vault ● Secrets as a Service ● Password access defined via policies ● Token-based access ● Easy integration with Terraform ● Audit - we know who fetched data and when
  18. 18. Bootstrapping - 3rd stage Problem: Keeping state consistent What happens when multiple users start to modify the infrastructure at the same time? What happens when the state file is removed? Solution: AWS S3 ● Infrastructure state kept in an S3 bucket ● Access managed via an IAM role (separate user + IP filtering) ● Versioning
  19. 19. Bootstrapping - 3rd stage 2
      terraform {
        backend "s3" {
          bucket = "sdwan-terraform-state"
          key    = "lime/base-os/terraform.tfstate"
          region = "us-east-2"
        }
      }
  20. 20. Bootstrapping - 3rd stage 3
      provider "vault" {
        address = "https://vault.testdomain.io"
      }
      data "vault_generic_secret" "admin_password" {
        path = "secret/v1/infrastructure/cobbler/passwords/password"
      }
      data "vault_generic_secret" "ipmi_admin_password" {
        path = "secret/v1/infrastructure/lime/passwords/ipmi_password"
      }
  21. 21. Bootstrapping - 3rd stage 4
      $ vault auth -method=ldap username=michal.ploski
      Password (will be hidden):
      Successfully authenticated! You are now logged in.
      $ terraform plan
      Refreshing Terraform state in-memory prior to plan...
      -/+ module.lime.cobbler_system.lime_base_machine[35] (tainted) (new resource required)
          id: "lime-node-69" => <computed> (forces new resource)
      Plan: 1 to add, 0 to change, 1 to destroy.
      $ terraform apply
  22. 22. Bootstrapping - 3rd stage 7
      variable "ansible_playbook" {
        description = "Ansible lime node playbook"
        type        = "string"
        default     = "deploy_vrouter_infra.yml"
      }
      provisioner "local-exec" {
        command = "cd ${path.root}/provision/; ansible-playbook -i ${cidrhost("10.12.5.0/24", element(keys(var.lime_host_interface1_config), count.index))}, ${var.ansible_playbook}"
      }
  23. 23. Bootstrapping - 3rd stage 8
      PLAY [all] *********************************************************************
      Thursday 25 January 2018  16:50:42 +0100 (0:00:00.086)  0:00:00.086 ******
      TASK [setup] *******************************************************************
      ok: [10.12.5.41]
      Thursday 25 January 2018  16:50:50 +0100 (0:00:08.205)  0:00:08.292 ******
      TASK [vrouter : Add vrouter user] **********************************************
      ok: [10.12.5.41]
      Thursday 25 January 2018  16:50:51 +0100 (0:00:01.318)  0:00:09.610 ******
      TASK [vrouter : Set authorized key took from file] *****************************
      ok: [10.12.5.41]
  24. 24. Provisioning
  25. 25. Provisioning - Model Problem: How to simulate virtual router behaviour? A physical router is a Linux OS on x86 architecture… so use an already created and tested ISO image along with KVM/libvirt virtualization. A LAN client is a Linux OS connected to the SDWAN router's LAN interface: use a Docker image and connect its bridge to the router interface.
  26. 26. Provisioning - Model (diagram: a LAN client container's eth0 attaches via br-lan to the virtual router's LAN interface; the router's WAN interface attaches via br-wan on the host and reaches the Internet)
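One way to expose the host bridges from the diagram to libvirt guests is a bridged network definition; this XML is an illustrative sketch (the bridge names come from the diagram, and the bridges themselves must already exist on the host):

```xml
<!-- define once per bridge: virsh net-define br-wan.xml && virsh net-start br-wan -->
<network>
  <name>br-wan</name>
  <!-- hand packets straight to an existing host bridge, no NAT -->
  <forward mode='bridge'/>
  <bridge name='br-wan'/>
</network>
```

The virtual router's WAN interface then references `<source network='br-wan'/>` in its domain XML; br-lan is defined the same way for the LAN side.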
  27. 27. Provisioning 1. virtual router: a. configure b. network c. deploy 2. virtual router LAN: a. install b. configure c. network d. deploy 3. statistics: a. install b. configure 4. monitoring client: a. install b. configure c. scripts
  28. 28. Provisioning - Password management Problem: How to manage Ansible passwords? What must we take into account? ● Docker Registry ● SSH ● Monitoring What about the access policy?
  29. 29. Provisioning - Password management
      graphite_db_pass: "{{ lookup('vault', 'secret/v1/infrastructure/lime/passwords/graphite_web_db_password', 'graphite_web_db_password') }}"
      graphite_web_pass: "{{ lookup('vault', 'secret/v1/infrastructure/lime/passwords/graphite_web_secret_key', 'graphite_web_secret_key') }}"
  30. 30. Provisioning Ansible is slow… What to do? ● Set a proper number of forks when running on a large number of hosts ● Configure pipelining ● If you don’t use facts, disable fact gathering ● Enable the profiling plugin ● Use asynchronous tasks for long-running steps ● If possible, try to run multiple plays in parallel ● Use the free strategy for plays
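Most of the items above map directly to ansible.cfg settings; a sketch with illustrative values (the option names are the Ansible 2.x ones current at the time of the talk, and the numbers are not from the talk):

```ini
[defaults]
# raise from the default 5 when targeting many hosts
forks = 50
# skip fact gathering unless a play explicitly asks for it
gathering = explicit
# profiling plugin: print per-task timing
callback_whitelist = profile_tasks
# hosts proceed through the play independently instead of in lockstep
strategy = free

[ssh_connection]
# fewer SSH round-trips per task
pipelining = True
```

Asynchronous tasks are set per task in the playbook (`async:` / `poll:`), not here.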
  31. 31. Provisioning - Images management Problem: How to manage static images? KVM virtual machine image ● Take an ISO image and build a QCOW image using a Jenkins pipeline Docker image ● Build a simple image and push it to a private registry using a Jenkins pipeline
  32. 32. Jenkins - Build virtual router image
  33. 33. Jenkins - Build Docker image
  34. 34. Statistics
  35. 35. How to access used virtual router IPs How to access dynamically created virtual routers? ● Present a list of the current infrastructure status ● Simple health-check ● Do not modify the virtual router image Possible solutions: ● Fetch IP addresses from DHCP servers ● Register router IP addresses in an external service ● Install qemu-guest-agent on the guest machine ● Scan the router-dedicated subnet every few minutes (*)
  36. 36. How to access used virtual router IPs - Per node Dynamic list per host: 1. List of MACs: extract the list of used MAC addresses from libvirt 2. Scan network: scan the subnet used by routers via fping 3. Pair MAC with IP: search for the MAC-IP pair in the ARP table 4. Write list to file: create a YAML file based on the host IP address, router MAC and router IP 5. Present list to user: run the script in cron and present the data via NGINX
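The pair-MAC-with-IP step can be sketched in Python, assuming the standard Linux `arp -an` output format; the function name, MACs and IPs below are illustrative, not from the talk:

```python
import re

def pair_mac_with_ip(arp_output, macs):
    """Pair each known vrouter MAC with its IP, given `arp -an` output."""
    pairs = {}
    # Each ARP entry looks like: "? (10.12.6.21) at 52:54:00:ab:cd:ef [ether] on br-lan"
    for line in arp_output.splitlines():
        m = re.search(r'\((\d+\.\d+\.\d+\.\d+)\) at ([0-9a-f:]{17})', line)
        if m and m.group(2) in macs:
            pairs[m.group(2)] = m.group(1)
    return pairs

# sample ARP table as the fping scan would have populated it
arp = (
    "? (10.12.6.21) at 52:54:00:ab:cd:ef [ether] on br-lan\n"
    "? (10.12.6.254) at 00:11:22:33:44:55 [ether] on br-lan\n"
)
print(pair_mac_with_ip(arp, {"52:54:00:ab:cd:ef"}))
```

In the pipeline above, the `macs` set would come from libvirt and the ARP text from the host after the fping sweep warmed the ARP cache.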
  37. 37. How to access used virtual router IPs - Aggregate YAML list Aggregated dynamic list: 1. Download lists: a cron script downloads all the dynamic YAML lists from the vrouter physical hosts 2. Parse lists: parse and concatenate the lists into one 3. Present aggregated list: expose the list via NGINX
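The parse-and-concatenate step can be sketched as follows; the per-host lists are shown as already-parsed dictionaries (the HTTP download and YAML loading are omitted), and all names and addresses are illustrative:

```python
def aggregate(per_host_lists):
    """Concatenate per-host vrouter lists into one aggregated list,
    tagging each entry with the physical host it came from."""
    aggregated = []
    for host, routers in per_host_lists.items():
        for router in routers:
            entry = dict(router)       # copy, don't mutate the source list
            entry["host"] = host
            aggregated.append(entry)
    return aggregated

# one downloaded list per physical vrouter host
per_host = {
    "10.12.5.41": [{"mac": "52:54:00:ab:cd:ef", "ip": "10.12.6.21"}],
    "10.12.5.42": [{"mac": "52:54:00:12:34:56", "ip": "10.12.6.35"}],
}
print(aggregate(per_host))
```

The cron job would dump the result back to YAML and drop it under the NGINX document root.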
  38. 38. Monitoring & Tuning
  39. 39. First things first - Monitoring
  40. 40. First things first - Monitoring
  41. 41. Overbooking We want to utilize the machines fully ● Virtual routers won’t run at 100% ● Docker instances are used only in specific cases, i.e. QA tests ● Slow disks :-( Solution: Tune for overbooking ● RAM usage: KSM ● CPU: pinning + libvirt cache tuning ● DISK: maximize host cache ● NETWORK: maximize throughput
  42. 42. Overbooking Problem ● High RAM usage ● Not enough space for cache Solution ● Kernel Same-page Merging
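KSM is toggled through sysfs; one minimal way to enable it at boot is a systemd tmpfiles.d fragment (the sysfs paths are the standard kernel ones, but the scan parameters below are illustrative, not values from the talk):

```
# /etc/tmpfiles.d/ksm.conf - enable Kernel Same-page Merging at boot
w /sys/kernel/mm/ksm/run - - - - 1
# scan more aggressively than the defaults (illustrative values)
w /sys/kernel/mm/ksm/pages_to_scan - - - - 1000
w /sys/kernel/mm/ksm/sleep_millisecs - - - - 50
```

KSM only merges pages that guests mark as mergeable, which QEMU/KVM does by default.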
  43. 43. Overbooking - Kernel Same-page Merging - Memory Used
  44. 44. Overbooking - Kernel Same-page Merging - Memory Cached
  45. 45. Overbooking - Kernel Same-page Merging - Load
  46. 46. Overbooking - CPU Problem ● High overall CPU usage ● High context switching Solution (partial) ● CPU pinning ● KVM host-passthrough CPU model
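In libvirt domain XML the two measures look roughly like this (the vCPU count and the host cores chosen for pinning are illustrative):

```xml
<domain type='kvm'>
  <vcpu placement='static'>2</vcpu>
  <cputune>
    <!-- pin each vCPU to a dedicated host core -->
    <vcpupin vcpu='0' cpuset='2'/>
    <vcpupin vcpu='1' cpuset='3'/>
  </cputune>
  <!-- expose the host CPU directly instead of a generic emulated model -->
  <cpu mode='host-passthrough'/>
</domain>
```

Pinning keeps each vCPU's cache and scheduler state on one core, and host-passthrough lets the guest use the host's native CPU features.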
  47. 47. Overbooking - CPU - Pinning + Host passthrough
  48. 48. Overbooking - Problem Problem ● High CPU system usage ● Still high context switching ● Strange network usage Solution ● Receive Packet Steering tuning ● Remove the source of network saturation
  49. 49. Overbooking - Network tune - Receive Packet Steering
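Receive Packet Steering is configured per receive queue through a sysfs CPU bitmask; a configuration sketch (the interface name, queue and mask values are illustrative, not from the talk):

```
# spread softirq processing for eth0 rx queue 0 across CPUs 0-3 (bitmask 0xf)
echo f > /sys/class/net/eth0/queues/rx-0/rps_cpus
# Receive Flow Steering: global flow table size (illustrative value)
sysctl -w net.core.rps_sock_flow_entries=32768
echo 2048 > /sys/class/net/eth0/queues/rx-0/rps_flow_cnt
```

These settings do not survive a reboot, so they would be persisted via an init script or tmpfiles.d fragment.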
  50. 50. Overbooking - Remove Network Scanner
  51. 51. Overbooking - CPU - DISK Problem: Huge load and disk usage during virtual router creation Solution ● KVM guest caching mode = writeback ● Host kernel dirty pages/cache tuning ○ vm.swappiness ○ vm.vfs_cache_pressure ○ vm.dirty_background_ratio ○ vm.dirty_ratio ○ vm.dirty_writeback_centisecs ○ vm.dirty_expire_centisecs
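The knobs listed above are all sysctl settings; a sketch of a drop-in file (the values are illustrative, as the talk does not give the actual numbers used):

```
# /etc/sysctl.d/99-vrouter-host.conf - host cache/dirty-page tuning
# prefer page cache over swapping guest memory out
vm.swappiness = 10
# keep dentry/inode caches around longer
vm.vfs_cache_pressure = 50
# start background writeback early, but allow a large dirty ceiling
vm.dirty_background_ratio = 5
vm.dirty_ratio = 40
# how often writeback runs / how old dirty pages may get (centiseconds)
vm.dirty_writeback_centisecs = 1500
vm.dirty_expire_centisecs = 3000
```

Combined with the guests' writeback cache mode, this lets bursts of image writes land in host RAM instead of hammering the slow disks.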
  52. 52. Overbooking - CPU - DISK - Before
  53. 53. Overbooking - CPU - DISK - After
  54. 54. What does the future bring? ● Another way of building deployment statistics ● Shorten bare-metal creation time ● Speed up provisioning (Docker and libvirt roles running in parallel) ● Bootstrap and provisioning tests ● Terraform run from Rundeck or Jenkins
  55. 55. DevOps by Example 31.01.2018 Thanks!
