How our Cloudy Mindsets Approached Physical Routers


Talk at DENOG12 (German Network Operators Group)
Virtual, 09.11.2020


  1. 1. How our Cloudy Mindsets Approached Physical Routers: SNMP was not an option. Steffen Gebert, DENOG12, 09.11.2020
  2. 2. Abstract: After its latest project, EMnify became a 99% cloud-only company. To meet the growing scalability and reliability requirements of the interconnection between our AWS-based deployments and multiple carriers, BGP peerings had to be moved out of AWS. Therefore, a pair of Juniper routers was put into place. For a company that had relied fully on cloud services so far, this alien technology resulted in several challenges. We want to share how we solved the puzzle of integrating this physical equipment into our existing workflows and tools. Our use of CI/CD systems for applying changes and of AWS CloudWatch, Prometheus and Grafana for monitoring, as well as our reluctance to run applications that require a lot of shepherding, led our search for the right glue between these pieces of iron and our cloud infrastructure. Being used to CI/CD processes backed by automated tests, we wanted to adopt these practices here as well. As a result, configuration changes are rolled out by an automated pipeline using Ansible. We attempted automated testing but failed. We explain why, what we did instead, and what we envision for the future. As with every other part of our system, we want its monitoring data accessible via Grafana. With the help of pmacct and Fluent Bit, we can treat IPFIX flow records as if they were logs. With the help of jtimon, Prometheus stores the routers' metrics the way we are used to, where necessary teased out through a few custom YANG models. In summary, the integration worked very well, though we still have several lessons learned and pain points to share.
  3. 3. Thanks to our Sponsors!
  4. 4. Cloudy Mindset?
  5. 5. 5 years ago
  6. 6. 3-10 years ago (@StGebert)
  7. 7. Since 2017
  8. 8. Is This a Better World?
  9. 9. Focus on Business Value
  10. 10. Prefer Managed Services
  11. 11. And Suddenly… Hardware?
  12. 12. Agenda: Context, Deployment, Monitoring
  13. 13. EMnify's IoT Connectivity Platform • Cellular connectivity in 500+ networks in 185 countries • RESTful APIs • Pay-as-you-go pricing • SMS/USSD to REST bridge • Secure connectivity via VPN and natively via AWS • Implemented using our own virtualized mobile core network
  14. 14. Supporting Global IoT Deployments • Home-routing of roaming SIM data prevents a distributed architecture • EMnify's mobile core network is deployed in multiple AWS regions, keeping data local
  15. 15. GRX/IPX Network (GPRS Roaming Exchange). Diagram labels: Internet, Telco world, GRX/IPX
  16. 16. Our scale[throughput] bores you
  17. 17. We’re critical to our customers’ success
  18. 18. Increased demands vs. AWS as “General Purpose Cloud”
  19. 19. Running BGP on AWS?
  20. 20. We had to move logic out of AWS
  21. 21. We could not find a fitting managed service
  22. 22. We had to get hardware
  23. 23. We chose boring technology
  24. 24. Greenfield project
  25. 25. Setup: Twice per region • Juniper MX204 • Colocation rack space • Fiber links towards 2 carriers and AWS • Out-of-band management access via OPNsense
  26. 26. Integration Points: Metrics, Syslog, Flow records, Alerts, Config
  27. 27. Design Principles • 80/20 rule aka MVP • Don't get out of our comfort zone • Don't set up anything that requires a lot of handholding
  28. 28. Deployment
  29. 29. "A human shall not SSH into something" (my inner self)
  30. 30. juniper_junos Ansible Modules
  31. 31. Configuration Deployment
      Render Config
        • Single config file
        • Parametrized using variables
      Commit
        • Upload using juniper_junos_config
        • Load override mode
        • Issue (to be) confirmed commit
      Test
        • Ping tests
      Confirm
        • Confirm commit
  32. 32. Ansible Playbook - Code Example

        - name: install generated configuration file onto device
          juniper_junos_config:
            provider: "{{ juniper_connection_settings }}"
            src: "{{ conf_file }}"
            load: override
            comment: "playbook execution, commit confirmed"
            confirmed: 3  # wait X minutes until rollback
            diff: yes
            ignore_warning: yes
          register: config_results
          notify: confirm previous commit
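
The render, commit-confirmed, test, confirm sequence described on the deployment slides can be sketched in plain Python. This is only an illustration of the control flow, not the talk's implementation (which uses Ansible): the `Router` interface, `FakeRouter`, `render_config` and `deploy` are hypothetical names, and the ping test is passed in as a callable.

```python
from string import Template
from typing import Callable, Dict, List, Protocol


class Router(Protocol):
    """Hypothetical device interface for illustration; not a real library API."""

    def load_override(self, config: str) -> None: ...
    def commit_confirmed(self, minutes: int) -> None: ...
    def confirm(self) -> None: ...


def render_config(template: str, variables: Dict[str, str]) -> str:
    """Render the single config file from a parametrized template."""
    # substitute() raises KeyError for a missing variable, so an
    # incomplete config is rejected before it reaches the device.
    return Template(template).substitute(variables)


def deploy(router: Router, config: str, ping_ok: Callable[[], bool],
           rollback_minutes: int = 3) -> bool:
    """Load, commit confirmed, test, and confirm only if the test passes."""
    router.load_override(config)
    router.commit_confirmed(rollback_minutes)
    if not ping_ok():
        # Without confirmation the device rolls back by itself once
        # the confirmation window expires.
        return False
    router.confirm()
    return True


class FakeRouter:
    """In-memory stand-in that records the calls it receives."""

    def __init__(self) -> None:
        self.calls: List[str] = []

    def load_override(self, config: str) -> None:
        self.calls.append("load")

    def commit_confirmed(self, minutes: int) -> None:
        self.calls.append(f"commit confirmed {minutes}")

    def confirm(self) -> None:
        self.calls.append("confirm")
```

The point of the pattern is that a failed ping test needs no explicit rollback step: skipping the confirmation is enough.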
  33. 33. Config Pipeline • Separate AWS account • Isolated connectivity • Diagram labels: AWS CodePipeline, AWS CodeBuild
  34. 34. And where are the tests?
  35. 35. In a Perfect World... • 2-star review could have been mine :D • Latest version: 18.4R1 • Takes ~30min to be ready • AWS does not support VLANs! • Only for manual testing • Maybe eve-ng or GNS3 could help?
  36. 36. On My Bucket List • Start virtualized topology in network emulator • Apply configuration pipeline • Emulate BGP peers • Execute end-to-end connectivity tests • Emulate link failures • Verify connectivity • AWS: run on bare metal host (b/c CPU VMX)
  37. 37. Routine Operations (Runbooks)
  38. 38. Firmware Update - Checks

        /workspace/prod # ansible-playbook upgrade_check.yaml -u steffen.gebert
        ...
        TASK [Validate result] *****************
        [ mx204-am3 ] Chassis Alarms
        ----------------------------
        Expect: No alarms currently active
        Actual: No alarms currently active
        ...
        [ mx204-am3 ] Core Dumps
        ----------------------------
        Expect: /var/crash/*core*: No such file or directory
        Actual: /var/crash/*core*: No such file or directory
        [ mx204-am3 ] ⚠ Proceed? ⚠ :
        Press 'C' to continue the play or 'A' to abort
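
The Expect/Actual comparison in the output above boils down to a small check table. The following is a minimal sketch of that idea in Python, assuming each check is a pair of expected and collected command output; `Check`, `failed_checks` and `render_report` are illustrative names and not part of the playbook shown in the talk.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Check:
    name: str    # heading shown in the report, e.g. "Chassis Alarms"
    expect: str  # output that indicates a healthy device
    actual: str  # output collected from the device


def failed_checks(checks: List[Check]) -> List[str]:
    """Return the names of checks whose output differs from the expectation."""
    return [c.name for c in checks if c.actual.strip() != c.expect.strip()]


def render_report(host: str, checks: List[Check]) -> str:
    """Format the checks in an Expect/Actual layout for a human to review."""
    lines: List[str] = []
    for c in checks:
        lines.append(f"[ {host} ] {c.name}")
        lines.append(f"Expect: {c.expect}")
        lines.append(f"Actual: {c.actual}")
    return "\n".join(lines)
```

An operator would still make the final call, as in the talk's runbook; a non-empty `failed_checks` result is just a reason to abort before the prompt is even shown.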
  39. 39. Firmware Update - Draining

        - name: Drain traffic
          juniper_junos_config:
            provider: "{{ juniper_connection_settings }}"
            load: 'set'
            lines:
              - 'activate policy-options policy-statement OUT-OF-SERVICE-SWITCH term as-path-pr
            comment: 'Drain traffic to router for upgrade'

        - name: Traffic drained
          pause:
            prompt: |
              [ {{item}} ] Traffic is draining. Verify that traffic is completely drained on the following dashboard before proce
              [ {{item}} ] ⚠ Proceed with the JunOS upgrade ⚠?
          loop: "{{ ansible_play_hosts }}"

  40. 40. Firmware Update - Execute!

        - name: Install Junos OS package
          juniper_junos_software:
            provider:
              host: "{{ ansible_host }}"
              timeout: 3600
            remote_package: "{{ junos_vm_file }}"
            validate: True
            cleanfs: False
            vmhost: True
            reboot: True
          ignore_errors: yes  # rpc times out when upgrading, despite the provider timeout setti
          register: output
  41. 41. Challenges • Deploy a file ¯\_(ツ)_/¯ • Max length of file copy URLs • Feedback for invalid config • Amount of boilerplate code
  42. 42. Monitoring
  43. 43. Syslogs
  44. 44. Syslog Implementation • Who logged into the router? • What's happening in the router? • Components: rsyslog, Amazon CloudWatch Logs • 1 container to run
  45. 45. Flow Records
  46. 46. Flow Records • Network-to-Network Interface (GTP traffic) • ~20k parallel flows • 1 flow => 1 log line?!
  47. 47. Flow Records Collection • How much traffic per AS? • Did we receive any signaling from XYZ and did we really respond? • Components: IPFIX, BMP, FireLens, Amazon CloudWatch Logs • 2 containers to run
  48. 48. pmacct / nfacct • Inputs: libpcap, NFLOG, NetFlow, IPFIX • Additional feeds: BGP, BMP, RPKI • Processing: tag, filter, aggregate, split • Outputs: RDBMS, NoSQL, MQ, NetFlow, Memory, File, stdout
  49. 49. Enrichment • Flow fields: host_ip, src_ip, dst_ip, src_port, dst_port, src_asn, dst_asn, src_ifindex, dst_ifindex, … • Added fields: protocol, direction, organization, hostname, carrier • Drop BGP & BFD
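
To make "enrichment" concrete, here is a small Python sketch that adds `direction` and `organization` to a flow record. It illustrates the concept only and does not show how pmacct or the talk's pipeline derives these fields: the prefix in `OWN_NETWORK`, the ASN table in `ORGANIZATIONS` and the `enrich` function are assumptions, using documentation address ranges and private-use AS numbers.

```python
import ipaddress
from typing import Dict

# Assumption: our own address space (an RFC 5737 documentation prefix).
OWN_NETWORK = ipaddress.ip_network("192.0.2.0/24")

# Assumption: a lookup table from AS number to organization name,
# keyed by private-use AS numbers with made-up names.
ORGANIZATIONS: Dict[int, str] = {
    64512: "Example Carrier A",
    64513: "Example Carrier B",
}


def enrich(flow: Dict[str, object]) -> Dict[str, object]:
    """Return a copy of the flow with direction and organization added."""
    enriched = dict(flow)
    dst = ipaddress.ip_address(str(flow["dst_ip"]))
    # Traffic towards our own prefix counts as inbound.
    direction = "inbound" if dst in OWN_NETWORK else "outbound"
    enriched["direction"] = direction
    # The remote side is the source for inbound traffic and the
    # destination for outbound traffic.
    remote_asn = flow["src_asn"] if direction == "inbound" else flow["dst_asn"]
    enriched["organization"] = ORGANIZATIONS.get(remote_asn, "unknown")
    return enriched
```

Returning a copy keeps the raw flow record intact, which makes the function easy to test against recorded samples.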
  50. 50. Lua Magic

        -- Sets GTP-c or GTP-u protocol depending on port numbers
        function setGTPProtocol(tag, timestamp, record)
          local code = 0  -- 0: record left unmodified
          local gtp_ports = {
            ["GTP-c"] = 2123,
            ["GTP-u"] = 2152,
            ["GTP'"] = 3386,
          }
          local new_record = record
          for protocol, port in pairs(gtp_ports) do
            if record["source.port"] == port or record["destination.port"] == port then
              new_record["network.application"] = "GTP"
              new_record["network.protocol"] = protocol
              code = 2  -- 2: record modified, timestamp kept
            end
          end
          return code, timestamp, new_record
        end
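
The port-based classification in the Lua filter above is easy to unit-test when mirrored in Python. The sketch below is a port of that logic for illustration, not something shown in the talk; `set_gtp_protocol` is a hypothetical name, and unlike the Lua version it returns a copy of the record instead of modifying it in place. The return codes follow the Fluent Bit Lua filter convention used on the slide (0 for unmodified, 2 for a modified record).

```python
from typing import Any, Dict, Tuple

# Well-known GTP ports, as listed on the slide.
GTP_PORTS = {"GTP-c": 2123, "GTP-u": 2152, "GTP'": 3386}


def set_gtp_protocol(record: Dict[str, Any]) -> Tuple[int, Dict[str, Any]]:
    """Tag a flow record as GTP-c, GTP-u or GTP' based on its port numbers."""
    new_record = dict(record)
    code = 0  # 0: record left unmodified
    ports = (record.get("source.port"), record.get("destination.port"))
    for protocol, port in GTP_PORTS.items():
        if port in ports:
            new_record["network.application"] = "GTP"
            new_record["network.protocol"] = protocol
            code = 2  # 2: record modified
    return code, new_record
```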
  51. 51. (Image only)
  52. 52. (Image only)
  53. 53. Inbound traffic by AS query

        fields concat(source.as.organization.name, ', ', source.as.organization.country, ' (AS ', source.as.number, ')') as org
        | filter @logStream = "flows"
        | filter host.name like /^$host$/
        | filter concat(source.as.number, ' ', source.as.organization.name, ' ', source.as.organization.country, ' ', source.as.organization.tadig) like /$operator/
            OR concat(destination.as.number, ' ', destination.as.organization.name, ' ', destination.as.organization.country, ' ', destination.as.organization.tadig) like /$operator/
        | filter network.peer.destination.as.organization.name like /^$carrier$/
        | filter network.direction = "inbound"
        | filter network.protocol like /$protocol/
        | limit 10000
        | stats sum(network.bytes)/60*8 as `` by org,bin($time_interval)
        | sort `` desc
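
The `stats` expression in this query sums bytes per bin, divides by 60 and multiplies by 8, which gives bit/s as long as each bin covers 60 seconds. A worked example in Python, with the bin length as an explicit parameter; `bits_per_second` and `rate_by_org` are illustrative helpers and the organization names in the usage note are made up.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def bits_per_second(total_bytes: int, bin_seconds: int = 60) -> float:
    """Average bit rate for a byte sum collected over one time bin."""
    return total_bytes / bin_seconds * 8


def rate_by_org(flows: Iterable[Tuple[str, int]],
                bin_seconds: int = 60) -> Dict[str, float]:
    """Sum bytes per organization within one bin and convert to bit/s."""
    totals: Dict[str, int] = defaultdict(int)
    for org, byte_count in flows:
        totals[org] += byte_count
    return {org: bits_per_second(total, bin_seconds)
            for org, total in totals.items()}
```

For instance, 75,000,000 bytes in a 60-second bin is 1,250,000 bytes per second, or 10 Mbit/s. With a different bin length the fixed divisor of 60 would no longer yield bit/s, so the divisor has to follow the bin size.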
  54. 54. Challenges • CloudWatch Read Limits • CloudWatch Write Limits • pmacct config "creativity"
  55. 55. Metrics
  56. 56. Metrics Demand • Hardware: temperature, light levels, etc. • Interfaces: state, throughput, errors, etc. • BGP: state, prefixes received/accepted/installed
  57. 57. Metrics Implementation • High cardinality, high frequency metrics collection • Components: Junos Telemetry (JTI), JTIMON, Prometheus • 1 container to run
  58. 58. Metrics Examples
  59. 59. Challenges • JTI sensor availability • JTIMON config file duplication • JTIMON ENUM support • PKI setup
  60. 60. Alerting
  61. 61. Alerting Implementation • Prometheus-integrated alerting • Components: Junos Telemetry (JTI), JTIMON, Prometheus, Alertmanager
  62. 62. Summary & Conclusion • Integrated hardware into an otherwise fully cloud-based environment • Avoid new processes • Avoid new (user-facing) tooling • Found tooling to bridge gaps to "what we're comfortable with" • 1-2 containers running existing open source tooling • No guarantee that this scales to 10s of devices • Please contact me if you want details (configs etc.) or have suggestions!
  63. 63. Need a Lockdown Project? Go to emnify.com/devs
