© 2018 LUMINA NETWORKS, INC.
SDN Meetup
Hitless Controller Upgrade
Introduction
Towards a Hitless upgrade.
• Traditional Network Upgrades
– Closed Systems
• HW and Control Bundled (From the one Vendor)
• HW upgrade sometimes requires Control plane refresh
– Line card needs new OS and/or RE upgrade.
– Large Events
• Sometimes Months of Planning
• Failure is handled by rollback
– End Game is lots of small Automated Upgrades.
Brutal Automation is the only way
It's easy to regress back to inefficient practices.
• Arash Ashouriha, deputy chief technology officer of Deutsche Telekom AG
(NYSE: DT), said the only way that his company could
now succeed was through a process of "brutal automation."
THE HAGUE -- SDN NFV World Congress 2017
Controller Upgrade CI/CD Toolsets
Software Practices and Toolsets that need to be employed.
• Upgrades MUST be Automated.
• Automated Dev Test Framework.
– NO Shortcuts!
• Pre Validation Checks.
• Engineer Hands off Upgrade Process.
• Post Validation Checks.
• Automated Rollback.
• Post Rollback Validations.
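The sequence above (pre validation, hands-off upgrade, post validation, automated rollback, post rollback validation) can be sketched as a small driver. This is only an illustration under stated assumptions: `run_upgrade` and the callables passed to it are hypothetical names, not part of any Lumina or OpenDaylight tooling, and a real pipeline would wrap actual checks and upgrade steps.

```python
from typing import Callable, Iterable

# Hypothetical type: a validation check returns True when it passes.
Check = Callable[[], bool]


def run_upgrade(
    pre_checks: Iterable[Check],
    upgrade: Callable[[], None],
    post_checks: Iterable[Check],
    rollback: Callable[[], None],
    post_rollback_checks: Iterable[Check],
) -> str:
    """Run one hands-off upgrade attempt and report how it ended."""
    # Pre validation: refuse to start if the current state is not healthy.
    if not all(check() for check in pre_checks):
        return "aborted"

    # Hands-off upgrade: any exception counts as a failed upgrade.
    try:
        upgrade()
        upgraded = True
    except Exception:
        upgraded = False

    # Post validation: only a fully validated upgrade is accepted.
    if upgraded and all(check() for check in post_checks):
        return "upgraded"

    # Automated rollback, followed by its own validation pass.
    rollback()
    if all(check() for check in post_rollback_checks):
        return "rolled_back"
    return "rollback_failed"
```

Because every outcome is a return value rather than a prompt for an engineer, the same driver can be run from a CI/CD job in the lab and in production.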
Data and Control Layer Separation
• Data Plane
– Rule driven
• OpenFlow rules
• Configured by application on controller
– Isolated from control plane
– Benefits of no control traffic between nodes
– Decisions made by application
– Any "white box" with OF interface
– Flows and groups are static until reprogrammed
Data and Control Layer Separation
• Control Plane
– Application/"Flow Manager"
• Controller acts as message bus
• Application calculates flows/groups
– Receives LLDP from nodes
• Topology built
– Shares/Distributes network state to all Controllers
– Drives potential for "hitless" upgrade
– Has its challenges…
Challenges with OpenFlow Hitless Upgrade
Can it be hitless?
Types of Changes we need to understand.
• Controller APP Change
– Path computation change that requires an algorithm change
– Service Change (new way of using abstracted resources)
• Controller Change
– Project Updates - OpenFlow plugin / stats manager / topology manager, etc.
– Plugin Updates - OpenFlow 1.3 -> 1.4
– MD-SAL/Model changes - YANG model changes
• Dataplane Pipeline
– No Pipeline Change >>>> HITLESS ☺
• Flows, Groups, Tables stay the same
– Pipeline Change
• Existing Flows, Groups, Tables do not support the new Pipeline
Controller APP Change
• Can you overlay a PCE Change?
• New LSP Mesh / SR topo (Nodes SID)
• Even if you could handle a new Label base, you need to handle:
– Match Duplication (on ingress)
• How would you handle this?
– Action Duplication (on egress)
• Resource Limits
– Group Limits - stats manager with lots of groups - clustering then replicates that data
– Flow limits
Controller Infrastructure
• Plugin Changes
– Experimenter (mechanism for proprietary messages within the protocol)
– Version Bump
• Controller Project Changes
– Is Hitless Upgrade Considered Part of the Project?
– Namespace
– Functionality
No pipeline change
No PCE change or Pipeline change (Easiest Scenario). But we still have to be aware of:
• Group Limits
• Flow Limits
• Stats Manager
– Reconciling Flows
– General Load (lots of data)
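Group and flow limits can be checked before an upgrade starts. The sketch below assumes per-switch flow and group counts have already been collected by some other tool; the function name, the dictionary layout and the 20% headroom default are illustrative assumptions, not values from the talk.

```python
from typing import Dict, List, Tuple


def switches_over_budget(
    usage: Dict[str, Dict[str, int]],
    limits: Dict[str, int],
    headroom: float = 0.2,
) -> List[Tuple[str, str]]:
    """Return (switch, resource) pairs that leave less than `headroom` free.

    usage:  {"switch-name": {"flows": n, "groups": m}}  (hypothetical layout)
    limits: {"flows": max_flows, "groups": max_groups}
    """
    over = []
    for node, counts in usage.items():
        for kind in ("flows", "groups"):
            # Keep a fraction of each table free so that reconciliation
            # after the upgrade does not hit the hardware limit.
            budget = limits[kind] * (1 - headroom)
            if counts[kind] > budget:
                over.append((node, kind))
    return sorted(over)


usage = {
    "sw1": {"flows": 500, "groups": 10},
    "sw2": {"flows": 950, "groups": 10},
}
print(switches_over_budget(usage, {"flows": 1000, "groups": 100}))
# prints [('sw2', 'flows')]
```

An empty result is the condition a pre validation check would require before letting the upgrade proceed.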
Pipeline Change
• Flow and/or Group type changes.
– Flow actions you may need to change
• Ingress flow now has a new action?
– Group Tables you may need to change
• Change from All to a Hierarchy
– New Tables
• Table reassignment
• Flow and group tables perform different functions
• Packet match lookups/forwarding
Node Upgrades
• Switch OS upgrade
– Remove from service
• Rerouting any transit services
• Got ingress or egress services?
– They are dual-homed, right? If they aren't, well...
– Upgrade
– Check
– Place Back into Service.
Controller & Application Upgrades
• Option A
• Single cluster
• Disconnect switches - data plane continues, flows/groups state is persistent
• Perform upgrade
• Re-deploy
• Reconnect Switches
• Reliably manage outage window
• Not completely hitless
Multi Site Cluster/Controller groups
Not so easy
• Option B
• Idea of having a fall back cluster
• Increased redundancy, Increased cost
• Point switches to this cluster - if datastores are shared across both clusters, can upgrade one cluster at a time
• Will this be hitless?
• Key lies in what is actually being upgraded
• However - hitless rollback if required
• Saves production state in case of emergency
How we do it
Not so easy
• Avoiding initial data plane impact
– Prepare
• Stop running controller process
• Disconnect controllers from switches
• Environment tools - orchestration/monitoring systems
– Checks
• Switch connections
• Controller status
• Data plane
– Upgrade
Automation Tools
• Software provisioning/IT automation
• Completely hands off - process driven upgrade
• Operational ready process - tested and proven
• Powerful automation tool - Ansible Project
• Concept of roles/playbooks and inventories
– Pre-Check
• Ability to check for existing packages/files/information
• Make decisions based on OS
• Run native/non-native commands direct to servers
– Upgrade
• Copy, move and edit files
• Extract and install packages
• Native Linux functionality built into native Ansible commands
– Post-Check
• Validation
• File cksum checks
• Application Config
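The post-check step of verifying file checksums can be sketched as follows. Note the swap: the slide mentions cksum, while this sketch uses SHA-256 from Python's standard `hashlib`. The helper names and the idea of an "expected digests" mapping are assumptions for illustration, not the tooling used in the talk.

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def sha256_of(path: str, chunk_size: int = 65536) -> str:
    """Hash a file in chunks so large packages do not load into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_files(expected: Dict[str, str]) -> List[str]:
    """Return the paths that are missing or whose digest does not match.

    expected maps a file path to the hex digest recorded at build time
    (a hypothetical manifest format).
    """
    mismatched = []
    for path, want in expected.items():
        if not Path(path).is_file() or sha256_of(path) != want:
            mismatched.append(path)
    return mismatched
```

An empty list means every deployed file matches its recorded digest; anything else is a reason to trigger the automated rollback.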
In-house DevOps Tools
• Compare and validate datastore with switches
• Used to understand the current state of the network:
– Nodes?
• LLDP received?
– Links?
• Is topology built internally?
• Is appropriate topology datastore populated correctly?
– Flows?
• Comparison of operational/config datastore
• Are flows reported on switches and in operational?
• Verify correct flow and group calculation
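The flow comparison described above can be reduced to set differences once flow identifiers have been gathered from the three sources. The sketch below assumes each source has already been exported as a collection of flow ID strings; `diff_flows` and the result keys are hypothetical names, and how the IDs are fetched from the controller or the switches is out of scope here.

```python
from typing import Dict, Iterable, List


def diff_flows(
    config: Iterable[str],
    operational: Iterable[str],
    on_switch: Iterable[str],
) -> Dict[str, List[str]]:
    """Compare intended flows (config) with what the controller reports
    (operational) and what the switches actually hold (on_switch)."""
    config, operational, on_switch = set(config), set(operational), set(on_switch)
    return {
        # Configured but never reported back by the controller.
        "missing_from_operational": sorted(config - operational),
        # Configured but not programmed in the data plane.
        "missing_from_switch": sorted(config - on_switch),
        # Present in the data plane with no matching configuration.
        "unexpected_on_switch": sorted(on_switch - config),
    }


report = diff_flows(
    config=["f1", "f2", "f3"],
    operational=["f1", "f2"],
    on_switch=["f1", "f3", "f9"],
)
print(report["missing_from_operational"])  # prints ['f3']
print(report["missing_from_switch"])       # prints ['f2']
print(report["unexpected_on_switch"])      # prints ['f9']
```

Running the same comparison before and after an upgrade gives a direct signal of whether the data plane was left untouched.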
Challenges
• Lab and Production environment differences
• Users/Permissions
• Directory Structure
• Addressing schemes
• Resource limitation
• Hard to get "identical" production environment
• Inventory management
• Variables, secrets, package versioning
• Process needs to be "bullet proof"
• Tested, refined, feedback, etc.
• CI/CD
• Accounting for differences between lab and production can be tricky
• Product Changes/Customer tool changes
• Changes in orchestration applications
• Application namespace changes and functionality changes
• Regression testing needs to be thorough and capture corner cases
• Appropriate testing framework
Way around the challenges
• Automation, automation, automation
• Know the environment/product well enough to automate the entire process
• Automated Testing framework - thorough use case and functionality testing
• No changes implemented that aren’t tested
• No engineering "hands on" during upgrade
• The goal: anyone can run the upgrade
• Knowledge
– Knowledge is in the process
– Knowledge is in the automation and toolset / CI/CD
– Efficiency, effectiveness - not reliant on individuals or their knowledge in a constantly changing industry
Thank you!
