We introduce NRE and DevNetOps with inspiration from SRE and DevOps. We get into some detail on what it means to be a Network Reliability Engineer (NRE) and implement a DevNetOps pipeline. We cover the journey from manual and basic automation in NetOps to NRE generically and with the example from the journey at Riot Games. In closing we flip the narrative and look at networking in tradition DevOps and software engineering as well to contrast networking's role in DevOps vs. DevNetOps and NRE.
Network Reliability Engineering and DevNetOps - Presented at ONS March 2018
1. James Kelly, Juniper Networks
Matt Oswalt, Juniper Networks
Doug Lardo, Riot Games
NETWORK RELIABILITY ENGINEERING
and DevNetOps
2. • From DevOps/SRE to DevNetOps/NRE
• What is an NRE? And why do you care?
• A DevNetOps pipeline and cycle
• 5 steps to NRE
• Riot Games journey to NRE
• Shift-left SDN in DevOps
• Invisible networking/security for developers
AGENDA
3. SRE is published by Google
2016
DevOps Handbook
2015
DevOps is Coined
2009
INSPIRATION
4. NEW HEROS IN THE DEVOPS SAGA
DevNetOps & DevSecOps
5. DEFINING TERMS
For application development ops DevOps mentality around security ops DevOps mentality around network ops
DevOps DevSecOps DevNetOps
DevOps brings together development and operations:
- PEOPLE and cultural principles and behavior through the entire business-level service lifecycle
- PROCESSES from design to production to maintenance reliability, scale, performance, security
- TOOLS to scale architecture, automate, collaborate, measure and thus improve quality and speed
Software is crafted, built and run in the
same organization
Silos are internal to IT department
Security and networking solutions are mostly bought and assembled
Silos are vendor-customer so co-creation is required
6. DEFINING TERMS… and ROLES
For application development ops DevOps mentality around security ops DevOps mentality around network ops
DevOps DevSecOps DevNetOps
DevOps brings together development and operations:
- PEOPLE and cultural principles and behavior through the entire business-level service lifecycle
- PROCESSES from design to production to maintenance reliability, scale, performance, security
- TOOLS to scale architecture, automate, collaborate, measure and thus improve quality and speed
Software is crafted, built and run in the
same organization
Silos are internal to IT department
NRE: Network Reliability Engineer
7. DEFINING TERMS… and ROLES
For application development ops DevOps mentality around security ops DevOps mentality around network ops
DevOps DevSecOps DevNetOps
DevOps brings together development and operations:
- PEOPLE and cultural principles and behavior through the entire business-level service lifecycle
- PROCESSES from design to production to maintenance reliability, scale, performance, security
- TOOLS to scale architecture, automate, collaborate, measure and thus improve quality and speed
In classic DevOps, traditional ops concerns like security and infrastructure are shifting left, moving earlier on the
code-to-cash timeline. These alter egos are part of classic DevOps and app development + operations:
• ? SecDevOps ? aka Rugged DevOps propels security earlier in considerations of DevOps
• ? NetDevOps ? (less popular term) propels networking into considerations of DevOps (eg. apps controlling the network)
The Shift Left
NRE: Network Reliability Engineer
Software is crafted, built and run in the
same organization
Silos are internal to IT department
8. WHAT IS (an) NRE?
Defn An NRE
The professional that implements network reliability engineering
Defn NREing
Engineering an automated network to deliver measurable reliability
(SLO/SLA of MTTF, MTTR, etc.) under measureable conditions
(scale, rate of change, performance, etc.)
Defn DevNetOps
Like NRE, engineering automated network, but more explicitly says:
• Take a developer (software engineering) approach
• The application of the approach is to NetOps
• Focus on shorter cycles and lead time in code-to-prod pipeline
• Our work begins in pre-production and follows CI/CD/CR
9. WHY N- RELIABILITY -E
1. It’s not just about network automation or technology:
“Network automation does not an automated network make.”
2. Reliability must be ensured before acceleration
“It’s not how fast you drive, it’s how you drive fast”
3. Reliability forces us to automate and simplify
Encompassing NetOps goals: resiliency, security, metrics…
Higher-order DevOps principles…
1. Eliminate toil and technical debt with automation
2. Truth and transparency. Management by metrics.
3. Allow for failure; Iterate and evolve with Agile; then triangulate...
4. Continuous improvement: turn local lessons into global ones
5. Continuous learning (kaizen)
10. WHY NR - ENGINEER
1. NRE focus sidesteps DevOps vs. DevNetOps confusion
There are clear NetOps projects outside of software teams, but
some confusion on terms remains. NRE is more straightforward.
2. Engineers are builders and we build networks
Vendors cannot do this; pick-up where they leave off
3. More to it than in-production NOC dashboards
More creative, more satisfying, more money
4. By building, testing, stressing, staging what you build…
you prepare for better operations and better outcomes
5. By operating what you build…
you better understand what you’ve built
11. A DevNetOps PIPELINE
Resiliency Design and Drills
Orchestrated Upgrades
Pipeline Orchestration
Network as Code
Micro & Immutable Architecture
Continuous Response
Continuous Measurement
TOOLING PROCESSES PEOPLE
Repos of config, secrets, artifacts + gitOps branching, reviewing, pairing, Agile Code skills (not necessarily programming)
Pipeline CI/CD tools, test frameworks TDD, measurement judgements Build and debug skills, pipeline pros
Baking deliveries for ZTP, vendor refactors Small-step commits/deployment Hands off CLI/TTY
ZTD, virtualization, labs, traffic draining Staging and simulation, canary analysis In-hours maintenance, maybe roll forward
Traffic generation, DoS, chaos monkey Chaos windows, document limits Force failure for understanding
Big data analytics, ML, ITops integration Incident playbooks, capacity planning Management by stats, metrics, efficiency
Auto-remediation, FaaS, predictive stats Supervise self-driving Drink tea, meditate
Continuous Improvement Upgrades, features, fixes, changes Record local lesson into global knowledge Active open-mindedness, post-mortems
12. 5-STEP JOURNEY
NetOps at the device or
system UI
People are technicians not
technologists / engineers
Automate the design of ops:
workflows
Focus on frequent tasks like
troubleshooting more than
config management
Connect actions to triggers
and think in TDD terms
Rethink troubleshooting as
testing and provisioning +
actions as code to be tested
DevNetOps pipeline and
principles for agility, speed
Accuracy at speed, shorten
feedback, smaller changes,
automate safe deployments
Network reliability engineer
service-level outcomes
Plan error budgets and
objectives, use service-level
indicators, plan capacity
Manual ops
Automated workflows
Automated triggers and
networks as code
Continuous processes
on a continuous pipeline
(CI/CD/CR)
Engineering Outcomes
(NRE)
DESTINATION
People:
Network
Reliability
Engineers (NRE)
Process:
DevNetOps
And NRE’ing
Technology:
Self-Driving
Network
13. STEPS ON JOURNEY
Manual ops
Automated workflows
Automated triggers and
networks as code
Continuous processes
on a continuous pipeline
(CI/CD/CR)
Engineering Outcomes
(NRE)
Master architecture.
Document workflows.
Automate and aggregate
repetitive workflow actions,
especially troubleshooting
Trigger actions
automatically. Codify like a
developer. Think TDD and
proactive instead of reactive
troubleshooting
Automate build, testing,
deployment & response like
engineering teams. Move
quickly in small steps for
agility with accuracy.
Engineering. Simplicity.
Network Reliability,
Business Agility,
Continuous Improvement,
Positive Outcomes
Your present does not determine where you can go, merely where you start
14. RIOT’S JOURNEY
Visio & GNS.
Manual
Validation
Perl. New
VLAN with a
Firewall & LB
Context
DC for Infra
Dev & QA
T shaped
Infrastructure
Engineers
Python &
Jenkins
Fully
Automated
Failover
Testing
Ansible &
Jinja2
templates.
50/50
Automated /
Manual
Failover
Testing
15.
16. Next for Riot: Can’t take away CLI yet. But…
dlardo@rpod6-k2007-spine1> show system commit
0 2017-12-13 20:11:31 UTC by netbldr via netconf
pushed from ClosBuilder
1 2017-10-04 23:54:13 UTC by whogan via cli commit confirmed,
rollback in 3mins
2 2017-10-04 23:50:07 UTC by whogan via cli commit confirmed,
rollback in 3mins
3 2017-10-03 01:12:51 UTC by netbldr via netconf
pushed from ClosBuilder
4 2017-10-02 23:17:42 UTC by netbldr via netconf
pushed from ClosBuilder
5 2017-09-01 23:07:16 UTC by netbldr via netconf
pushed from ClosBuilder
6 2017-08-29 22:44:29 UTC by netbldr via netconf
pushed from ClosBuilder
7 2017-08-25 00:50:01 UTC by llambert via netconf
8 2017-08-25 00:49:44 UTC by llambert via netconf
17. Next for Riot
• Investigating Service Meshes. Envoy, Istio, others?
• Introducing Chaos Engineering
• Microservices are hard. Brownouts are real hard.
• Investigating Software Defined Edge
• DDoSaaS, IPBaaS, Game Server aware load balancers
• Evolving App <> App permissions for better app deploys
• Networking Global Policy
18. SDN in TRADITIONAL DevOps
Consider SDN in modern cloud-native clusters:
• If you have different IPAM, DNS, Policy, LB, etc. in test / staging / prod
• …then you break consistency between environments!
• Ensure the same setups between environments
• Overlays and multi-tenancy make this easier
• Multi-tenant SDN makes multi-use/multi-project clusters much simpler
• Overlays provide separation of concerns and network/security isolation and
reproducible unit
Can we make network security invisible for app and platform engineering?
• Automatic micro-segmentation
• Avoid a ton of Kubernetes NetworkPolicy objects
• Improve infosec and netops visibility
What about SDN with a DevNetOps pipeline ?
• SDN as code, building, testing, etc. is easier because it’s software
The Shift Left
19. James Kelly, Juniper Networks
Matt Oswalt, Juniper Networks
Doug Lardo, Riot Games
NETWORK RELIABILITY ENGINEERING
and DevNetOps
A network reliability engineer (NRE) is an IT operations role that applies an engineering approach to measuring and automating the reliability of the network to align with service-level objectives, agreements, and goals of the IT organization and business
Img credit https://pixabay.com/en/human-gear-man-machine-machine-192607/
- txt files
- perl
- handed off
- team had no idea how to do simple programming. Replaced variables in code using global search / replace.
- 2nd DC build went live after failover testing, but the configs actually were a mess. Dual HSRP & OSPF running at the same time - only one did anything.
- Junos Space policy management was worked around as it was too slow & confusing. Configs became the source of truth. People would half ass it by not applying global policy changes to all firewalls. Then the next engineer would come along and have to worry about pushing changes they didn't understand. What did they do? Copy/paste the commands around Junos space, and leave the gernade for the next guy.
Next Project - rCluster v1
- Ansible with Jinja Templates
- Some automated failover testing. Humans had to watch it.
- Implementation team "ran the failover" testing. Cheated and copy / pasted results from one DC to the next. We didn't figure it out till years later when outages occurred and we found out that only some uplinks were functional due to wires being ran to the wrong place.
rCluster v2
- Moved to 100% python, Ansible wasn't designed to handle a lot of logic. Still used Jinja Templates at it's core.
- Fully automated the failover testing. Results printed to a log file.
- Jenkins jobs added, all templates are generated and ran against production every night.
- All run inside Docker containers
Today
- Sites are mostly red / jobs are failing.
- Why? People work around the system because they can.
- Culture isn't there yet. Leaders are writing down "fix it for real" and then putting at the bottom of the backlog. Teams don't embody the mindset so they don't push back enough.
Next steps?
- Shame Bot