Speaker: Garrett Nowak, 11:11 Systems
Abstract: I lead the network automation effort at our company, and I would love a chance to show off how we went from no network automation to majority network automation in under a year. Our current network automation infrastructure hooks into Prometheus, Grafana, and NetBox to maintain our network infrastructure.
4. Square one
M O N I T O R I N G
All monitoring, configuration backups, and alerting happen through SolarWinds Orion NPM and NCM. Alerts go to an email distribution list. Backups occur once every 24 hours and are stored on the Orion server.
C H A N G E C O N T R O L
There is no peer review and no QA process. Changes are made at any time of day regardless of environment or impact. When changes are pushed, an email is sent to an email distribution list for historical reference, but only a select few people have access to read them.
A S S E T M A N A G E M E N T
Device inventory is kept in SolarWinds and is updated as new devices come into the network. Physical datacenter locations are kept in a separate DCIM tool called Nlyte. IP addresses are documented in yet another separate tool called 6Connect. All data entry is manual.
T E M P L A T E S
Configuration templates are held in text files on user desktops and shared via copy/paste when needed.
6. Configuration changes
Square one
01. Locate a template for the device and configure it accordingly.
02. Send an email to our change control distribution list.
03. Update all of our systems manually: monitoring, backups, DCIM, IPAM, BMS.
04. SolarWinds backs up the device within 24 hours and stores it on its local server.
7. Major issues
Square one
M U L T I P L E S O U R C E S O F T R U T H
Any time something in the environment changes, engineers must update a handful of systems with similar information in each system.
U N C O O R D I N A T E D S Y S T E M S
If data is entered into one of our systems, or a change in the network occurs, none of the other systems know about it.
L A C K O F V I S I B I L I T Y
Peer review and QA are completely hidden from those who don't participate in them. If a change is pushed and someone wasn't involved in it, they don't know about it.
M A N U A L C H A N G E S
Everything is done manually. Whether we're deploying new gear or making configuration changes, everything is done by hand.
9. Core concepts
Planning phase
S I N G L E S O U R C E O F T R U T H
We can't trust multiple systems because a disparity in one system renders the information in all other systems suspect.
C E N T R A L I Z E D M A N A G E M E N T
Once the single source of truth is established, update that one source and force the other systems to react to it.
N O M I S T A K E S
Nobody's perfect, but we should strive to be, and build systems around us that can get us as close to a 100% success rate as possible.
10. Systems
Planning phase
M O N I T O R I N G & D A T A G A T H E R I N G
D A T A V I S U A L I Z A T I O N
D C I M & I P A M
P R O J E C T M A N A G E M E N T
C O N F I G U R A T I O N B A C K U P S
T E A M C O M M U N I C A T I O N S
11. Major issues
Planning phase
U N C E R T A I N T Y
We had everything down on paper, but theory and reality rarely line up perfectly. We weren't sure if everything would work like we expected, we didn't know if our code would break everything, and since this was our first time rolling out a project like this, we didn't know what we didn't know.
S C O P E
This project was a massive undertaking. We constantly had to check ourselves to make sure we weren't biting off more than we could chew. It was incredibly important for us to define the scope and stick to it.
M A I N T A I N A B I L I T Y
Deploying something from scratch and maintaining something day in and day out are two very different things. We needed to make sure that not only could we deploy this in our infrastructure, but that we could also maintain it for years to come.
B U Y - I N
It's hard selling this to upper management when it's never been done in the company before. This project was very cost effective, but it still had a cost. We addressed this by deploying our systems in parallel to the existing ones and proving that it would work and that cutover would be seamless.
13. Introducing Autonet
Autonet is our in-house automation platform that handles communications between all of our systems, pushes configurations to devices, audits devices for configuration drift, and dynamically keeps track of the devices on our network. Everything we do with network automation goes through Autonet.
14. Autonet ecosystem
Current design
A U T O N E T: Centralized automation server running on Ubuntu
P R O M E T H E U S: Metrics, monitoring, and alerting
G R A F A N A: Data visualization
N E T B O X: DCIM and IPAM
U N I M U S: Configuration backups
P A G E R D U T Y: Incident resolution
S L A C K: Team communications
B I T B U C K E T: Code repository
J I R A: Software development and project management
C O N F L U E N C E: Documentation
15. New device workflow
Current design
A D D D E V I C E T O P R O M E T H E U S
New devices in our network are added to Prometheus, our single source of truth. Devices can be added manually, or found automatically by Autonet scripts.
A U T O N E T U P D A T E S A L L S Y S T E M S
Once a device is in Prometheus, Autonet triggers updates across all of our systems via a set of API requests.
G R A F A N A
Grafana is updated in real time as soon as Prometheus is updated. It has a direct connection to all of our Prometheus servers.
U N I M U S
Autonet updates the device list in Unimus and triggers device backups on newly added devices.
N E T B O X
Autonet updates the device list in NetBox, racks the devices in their physical datacenter location in DCIM, and scans the device for IP addresses to add to IPAM.
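To make the fan-out step concrete, here is a minimal sketch of how an Autonet-style script might register a new device in NetBox and Unimus over their REST APIs. The hostnames, tokens, endpoint paths, and payload fields are assumptions based on the public NetBox and Unimus APIs, not Autonet's actual code.

```python
"""Sketch: fan a new device out to NetBox (DCIM/IPAM) and Unimus (backups).
All hosts, tokens, endpoints, and fields below are assumptions; check them
against your NetBox and Unimus versions before use."""
import requests

NETBOX_URL = "https://netbox.example.com"    # hypothetical host
UNIMUS_URL = "https://unimus.example.com"    # hypothetical host
NETBOX_TOKEN = "changeme"                    # read from a secret store in practice
UNIMUS_TOKEN = "changeme"

def add_to_netbox(name: str, site_id: int, device_type_id: int, role_id: int) -> dict:
    """Create the device record in NetBox (field names vary by NetBox version)."""
    resp = requests.post(
        f"{NETBOX_URL}/api/dcim/devices/",
        headers={"Authorization": f"Token {NETBOX_TOKEN}"},
        json={"name": name, "site": site_id,
              "device_type": device_type_id, "role": role_id},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()

def add_to_unimus(address: str, description: str) -> dict:
    """Register the device in Unimus so it gets picked up for backups."""
    resp = requests.post(
        f"{UNIMUS_URL}/api/v2/devices",
        headers={"Authorization": f"Bearer {UNIMUS_TOKEN}"},
        json={"address": address, "description": description},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()
```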
16. New configuration workflow
Current design
A U T O N E T G E N E R A T E S C O N F I G
Engineers select the appropriate script to generate a configuration, add any required arguments, and then run a script that outputs a configuration.
E N G I N E E R R E V I E W S C O N F I G
The engineer reviews the configuration on the spot and performs QA, validating that the configuration is correct and looking for any improvements that could be made to the automation.
C O N F I G I S P U T I N T O J I R A
The engineer either manually puts the configuration into Jira, or we automatically create a Jira issue if the script allows it. When a Jira issue is made for a configuration, the engineer has the option to request a peer review. If a configuration came from automation, peer reviews are optional.
E N G I N E E R P U S H E S C O N F I G
The engineer pushes the configuration to the device(s). This can be done either manually or via the script that generated the configuration, depending on the use case.
T E A M I S N O T I F I E D V I A S L A C K
After an engineer pushes a configuration, the Jira issue is marked as "configuration pushed." Jira automation sends a message to our team channel in Slack notifying the group that a change was just made.
U N I M U S D E T E C T S C H A N G E
Unimus scans all devices every hour and looks for changes. When a change is noticed, it triggers a full configuration backup and sends a message to the team channel in Slack that shows a diff between the previous configuration and the new one.
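The generation step maps naturally to template rendering. Below is a minimal sketch using Jinja2; the template text, function name, and arguments are hypothetical, not the actual Autonet scripts.

```python
"""Sketch: engineer supplies arguments, script renders a device-ready
configuration. Template and variables are illustrative."""
from jinja2 import Template

INTERFACE_TEMPLATE = Template("""\
interface {{ name }}
 description {{ description }}
 ip address {{ ip }} {{ mask }}
 no shutdown
""")

def generate_interface_config(name: str, description: str, ip: str, mask: str) -> str:
    """Render one interface stanza from engineer-supplied arguments."""
    return INTERFACE_TEMPLATE.render(name=name, description=description, ip=ip, mask=mask)

if __name__ == "__main__":
    # Output would then be reviewed, attached to Jira, and pushed.
    print(generate_interface_config("GigabitEthernet0/1", "uplink to core",
                                    "192.0.2.1", "255.255.255.252"))
```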
17. ISP connectivity issues
C U R R E N T D E S I G N
B G P S E S S I O N F L A P S
A BGP session with one of our upstream ISPs goes down. It can come back up or stay down; that part is irrelevant to our automation.
R O U T E R A D D S P R E P E N D S
Our router automatically prepends advertisements out to that provider.
A U T O N E T V E R I F I C A T I O N
Autonet keeps track of changes so that we can review and resolve them.
P R O M E T H E U S N O T I F I E S U S I N S L A C K / P D
Prometheus sends API requests to Slack and PagerDuty.
S Q U A R E O N E
B G P S E S S I O N F L A P S
A BGP session with one of our upstream ISPs goes down. If the circuit stays down, we reach out to our ISP for assistance. If the circuit bounces, our engineers make a judgement call on whether or not to take further action.
S H U T D O W N T H E B G P S E S S I O N
If further action is deemed necessary, an engineer manually shuts down the BGP session until we can get a resolution from the ISP.
V E R I F I C A T I O N
Our engineers monitor the circuit status and bring up the circuit when all issues are perceived to be resolved.
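In the current design the notification path is handled natively by Prometheus and Alertmanager, but the same check can be sketched as a short script for illustration: query Prometheus for BGP session state and post a summary to a Slack webhook. The metric name bgp_session_up and both URLs are assumptions.

```python
"""Sketch: poll Prometheus for down BGP sessions and notify Slack.
Metric name and URLs are hypothetical placeholders."""
import requests

PROMETHEUS_URL = "https://prometheus.example.com"                  # hypothetical
SLACK_WEBHOOK = "https://hooks.slack.com/services/T000/B000/XXXX"  # hypothetical

def down_bgp_sessions() -> list[dict]:
    """Return the label sets of every BGP session currently reported down."""
    resp = requests.get(
        f"{PROMETHEUS_URL}/api/v1/query",
        params={"query": "bgp_session_up == 0"},
        timeout=10,
    )
    resp.raise_for_status()
    return [r["metric"] for r in resp.json()["data"]["result"]]

def notify_slack(sessions: list[dict]) -> None:
    """Post one summary message to the team channel."""
    if not sessions:
        return
    lines = [f"BGP session down: {s.get('peer', '?')} on {s.get('instance', '?')}"
             for s in sessions]
    requests.post(SLACK_WEBHOOK, json={"text": "\n".join(lines)}, timeout=10)

if __name__ == "__main__":
    notify_slack(down_bgp_sessions())
```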
18. VMware deployment
Current design
R E Q U I R E M E N T S A R E D E F I N E D
Things like IP addresses and vCloud organization IDs are provided so that our automation knows what it's deploying and where.
A U T O N E T D E P L O Y S R O U T E R A N D S W I T C H C O N F I G S
Based on the engineer's parameters from the previous step, our automation builds the necessary network configurations for our devices.
A U T O N E T D E P L O Y S V M W A R E C O M P O N E N T S
With the network layer complete, our automation moves on to the VMware stack. It deploys a dvPortGroup in vCenter, external and org networks in vCloud Director, and an NSX Edge firewall in the appropriate vCD org.
A U T O N E T C O N F I G U R E S N S X E D G E
Once the NSX Edge has been deployed, our automation configures its firewall and NAT rules, and builds several VPN tunnels required for management and security.
A U T O N E T C O N F I G U R E S A S A V P N
Lastly, our automation configures the remote end of the VPN tunnels, which terminate on Cisco ASA firewalls.
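A minimal orchestration skeleton for the pipeline above might look like the following. Every helper is a stub, and the spec fields are illustrative rather than Autonet's actual parameters; real implementations would call the vCenter, vCloud Director, NSX, and network-device APIs.

```python
"""Sketch: the VMware deployment pipeline as an ordered set of steps."""
from dataclasses import dataclass

@dataclass
class DeploymentSpec:
    """Engineer-provided requirements (step 1); fields are illustrative."""
    vcd_org_id: str
    external_ip: str
    datacenter: str

def push_network_configs(spec: DeploymentSpec) -> None:
    """Step 2: build and push router/switch configs for the new org."""

def deploy_vmware_components(spec: DeploymentSpec) -> None:
    """Step 3: dvPortGroup in vCenter, external/org networks in vCD, NSX Edge."""

def configure_nsx_edge(spec: DeploymentSpec) -> None:
    """Step 4: firewall rules, NAT rules, and management/security VPN tunnels."""

def configure_asa_vpn(spec: DeploymentSpec) -> None:
    """Step 5: remote end of the VPN tunnels on the Cisco ASA firewalls."""

def deploy(spec: DeploymentSpec) -> None:
    """Run the whole pipeline in order; each step assumes the previous succeeded."""
    push_network_configs(spec)
    deploy_vmware_components(spec)
    configure_nsx_edge(spec)
    configure_asa_vpn(spec)
```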
19. Major issues
Current design
D O U B L E T H E S K I L L S , H A L F T H E F O C U S
We don't have a dedicated automation team, so we handle all of the programming ourselves. Not only do our engineers need to keep progressing in their network skills, but now they also need to progress in their programming skills. We've doubled the set of skills they need without doubling the time they have to work on them, so every hour spent progressing in programming effectively halves the time available to focus on networking, and vice versa.
We’ve accepted that it takes time for us to get to a true “network as
code” environment and for now, our answer to this problem is to lean
on each other for help. We hold team meetings where we shadow
someone on a network automation script, we teach each other the
things we learn throughout the week, and we make sure that if we see
someone struggling, we pick them up and help them. We move
forward as a team and without that, I don’t think we would have
succeeded like we did in such a short amount of time.
D E V I A T I O N S A D D C O M P L E X I T Y
Our goal is to standardize as much as possible. However, due to things
like customer requirements, supply chain issues, technology advances,
and shifting business requirements, it’s impossible for us to
standardize all of our devices and infrastructure across all of our
datacenters. This leads to one-offs and slight deviations throughout
the infrastructure that add complexity to our automation.
We account for this as high up the programming chain as possible so that it propagates down to all of our automation and reduces the amount of work required when our network requirements change. For example, if something in a specific datacenter changes, we account for it in our Device class for that datacenter so that all of the scripts using that class pick up the update, as sketched below.
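Here is a minimal sketch of that pattern: datacenter-specific deviations live in a subclass, so every script that builds devices for that datacenter inherits the change automatically. Class and attribute names are illustrative, not Autonet's actual code.

```python
"""Sketch: handle one-off deviations at the class level so they
propagate to every script that uses the class."""

class Device:
    """Baseline settings shared by every device in the fleet."""
    ntp_servers = ["192.0.2.10", "192.0.2.11"]
    mgmt_vlan = 100

class ChicagoDevice(Device):
    """One-off for a specific datacenter: a different management VLAN.
    Scripts that instantiate ChicagoDevice pick this up automatically."""
    mgmt_vlan = 250

def mgmt_interface_config(device: Device) -> str:
    """Generation scripts render from class attributes, so a change made
    once in the subclass appears in every configuration that uses it."""
    return f"interface Vlan{device.mgmt_vlan}\n description management\n"

print(mgmt_interface_config(ChicagoDevice()))
```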
20. Improvements
Future design
N E T W O R K A S C O D E
We’re currently restructuring Autonet to become the single source of
truth. All changes to the network will be defined as a configuration file
on the Autonet server and our automation will convert it to a network
configuration and push it to devices as requested.
All configuration generation, peer review, network changes, QA, and
change logs will live within Autonet.
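A minimal sketch of that flow follows, assuming changes are defined as YAML files on the Autonet server; the schema, file path, and push mechanism are all assumptions rather than the actual design.

```python
"""Sketch: a network change defined as data, converted to device
configuration, and pushed on request."""
import yaml  # pip install pyyaml

CHANGE_FILE = "changes/add-vlan-310.yaml"  # hypothetical path

# Example file contents:
#   device: dist-sw-01
#   vlans:
#     - id: 310
#       name: customer-a

def render(change: dict) -> str:
    """Convert the declarative change into device syntax."""
    lines = []
    for vlan in change["vlans"]:
        lines += [f"vlan {vlan['id']}", f" name {vlan['name']}"]
    return "\n".join(lines)

def main() -> None:
    with open(CHANGE_FILE) as f:
        change = yaml.safe_load(f)
    print(f"# config for {change['device']}")
    print(render(change))
    # The real pipeline would route this through peer review and change
    # logging, then push it to the device instead of printing it.

if __name__ == "__main__":
    main()
```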
F U L L T E S T S U I T E
All changes to the Autonet code repository will go through a full test
suite before making it into production, and all aspects of Autonet will
be automatically tested daily.
By building a virtual lab that contains all of our current firmware
versions, we’ll be able to make sure all of our authentication,
authorization, syntax, and overall logic remains functional and
performs how we expect.
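As an illustration of the kind of check such a suite would run, here is a minimal pytest-style test against the hypothetical render() helper from the previous sketch; a real suite would also exercise authentication, authorization, syntax, and overall logic against the virtual lab.

```python
"""Sketch: a pytest check that config generation still produces the
exact expected syntax on every commit."""

def render(change: dict) -> str:
    """Stand-in for the generator under test (see previous sketch)."""
    lines = []
    for vlan in change["vlans"]:
        lines += [f"vlan {vlan['id']}", f" name {vlan['name']}"]
    return "\n".join(lines)

def test_vlan_rendering():
    """A regression here fails the suite before the change ships."""
    change = {"vlans": [{"id": 310, "name": "customer-a"}]}
    assert render(change) == "vlan 310\n name customer-a"
```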
21. LESSONS LEARNED
DO, OR DO NOT. THERE IS NO TRY.
FEAR IS THE PATH TO THE DARK SIDE.
SIZE MATTERS NOT.
PASS ON WHAT YOU HAVE LEARNED.
STRENGTH, MASTERY… BUT WEAKNESS,
FOLLY, FAILURE, ALSO.
YES, FAILURE, MOST OF ALL. THE GREATEST
TEACHER, FAILURE IS.
22. Questions
3 GREAT WAYS TO CONNECT
F I N D M E A T T H E C O N F E R E N C E
I'd love to meet you, talk more about my presentation, and hear any feedback you have for me.
E M A I L M E
Garrett Nowak
gnowak@1111systems.com
B U Y M E A B E E R
I like networking. I like automation. I like beer.
23. Thanks
N E T W O R K A U T O M A T I O N F O R U M
Thank you to everyone at NAF for giving me the opportunity to speak at this conference, and thank you to everyone who came here and listened to me. I hope to be included in future conferences and to meet with you all again!
1 1 : 1 1 S Y S T E M S
Thank you to the company for supporting me. Thank you to my mentors for providing me with a foundation on which I could build a fulfilling career. Thank you to my team for always being there for me; I couldn't have accomplished these things without you.
M Y W I F E
Thank you for listening to this presentation 900 times over the past few months. Thank you for being an amazing wife and mother. Thank you for always believing in me.
G A R R E T T N O W A K
Senior Director of Network Architecture