In December 2017, Roblox’s network was managed in a traditional way, without automation.
To sustain its growth, the team had to deploy two datacenters, a global network and multiple points of presence around the world in a few months. The only way to achieve that was to automate everything.
Six months later, the team has made tremendous progress, and many aspects of the network lifecycle have been automated, from the routers and switches to the load balancers.
Synopsis
This talk is a retrospective of Roblox’s journey into Network automation:
How we got started and how we automated an existing network.
How we organized the project around GitHub and a DCIM/IPAM solution (Netbox).
How Docker helped us to package Ansible and create a consistent environment.
How we managed many roles and variations of our design in a single project.
How we have automated the provisioning of our F5 Load Balancers.
For each point, we’ll cover what was successful, what was more challenging and what limitations we had to deal with.
Ansiblefest 2018 Network automation journey at roblox
1. Adam Mills, Principal Network Engineer
Damien Garros, Network Reliability Engineer
Ansiblefest, Austin October 3rd 2018
Network Automation Journey at Roblox
from manual to highly automated network in 6 months
2. 1. How did you get started with Ansible?
2. How long have you been using it?
3. What's your favorite thing to do when you Ansible?
Adam Mills Damien Garros
3. 1. What is Roblox ?
2. Automation Project Architecture
3. Managing Device Configurations
4. Managing Changes from Design to Implementation
5. Culture & Organization
Questions ?
Agenda
5. ● Educational platform for young software developers
● Gaming and Social platform
● Core audience is children ages 9-12
● 70 million monthly active users
What is Roblox?
6. [Chart: Comparison of Top Online Media Properties, Total Monthly Hours (in Millions)]
Source: comScore Custom Analysis, Total Digital (does not include Mobile data), December 2017
7. The Challenge in Front of us
[Diagram: datacenters (DC1, DC2, DC3) and POPs, Dec 2017 vs. Dec 2018]
9. How we got started ..
● A GitHub account
● Network engineers with laptops
● Some VMs
10. You don’t need much to get started
● Doesn’t require a lot of resources
● Doesn’t require a team of python developers
● Doesn’t require a CI/CD pipeline
11. Netbox as a DCIM / IPAM Solution
● We decided to use Netbox for
○ IPAM Solution
○ Cabling information
○ Device Inventory management system
■ Network and Server
https://github.com/digitalocean/netbox
12. Track Device status in Netbox
Planned for future deployment
Physically Racked and Powered, not configured yet
Configured but not in production / maintenance mode
Production
Device configuration changes based on its own status
and status of peers
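A minimal sketch of how a device's status and its peers' statuses could drive what gets configured. The status names and the rule itself are assumptions made for illustration; the slide only says that configuration depends on both.

```python
# Hypothetical status names; the slide describes four lifecycle stages
# without listing the exact labels used in Netbox.
PRODUCTION_STATUSES = {"Active"}


def link_is_active(local_status: str, peer_status: str) -> bool:
    """Assumed rule: a link carries traffic only if both ends are in production."""
    return local_status in PRODUCTION_STATUSES and peer_status in PRODUCTION_STATUSES


def active_peers(local_status: str, peers: list) -> list:
    """Return the names of peers whose links should be enabled in the config."""
    return [
        peer["name"]
        for peer in peers
        if link_is_active(local_status, peer["status"])
    ]


if __name__ == "__main__":
    peers = [
        {"name": "cs1-c1-chi1", "status": "Active"},
        {"name": "cs2-c1-chi1", "status": "Planned"},
    ]
    print(active_peers("Active", peers))
```

With a rule like this, racking a new peer in Netbox does not touch production neighbors until its status changes.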
13. High level Workflow
Devices list / Role
IP addresses
Connections
Jinja
Existing
Devices
New
Devices
Network
Builder
(Roblox developed)
14. Import existing devices
1. Create the inventory file manually
2. Create a playbook to create devices in Netbox using the API
3. Create a playbook to get the interface list from devices using Napalm and create the interfaces in Netbox
4. Create a playbook to get IPs from devices using Napalm and create them in Netbox
5. Create a playbook to get LLDP info and create links in Netbox
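Step 5 can be sketched as a small transformation, assuming the LLDP data has the shape NAPALM's `get_lldp_neighbors` returns (interface name mapped to a list of `hostname`/`port` entries). The function name is hypothetical, and the Netbox API call that would create each link is deliberately left out.

```python
def lldp_to_links(lldp_by_device: dict) -> list:
    """Turn per-device LLDP neighbor tables into a de-duplicated list of links.

    Each link is a pair of (device, interface) endpoints, sorted so that the
    same cable seen from both ends produces a single entry.
    """
    links = set()
    for device, neighbors in lldp_by_device.items():
        for local_int, entries in neighbors.items():
            for entry in entries:
                a_end = (device, local_int)
                b_end = (entry["hostname"], entry["port"])
                links.add(tuple(sorted((a_end, b_end))))
    return sorted(links)


if __name__ == "__main__":
    # Sample data only; device and interface names are made up.
    sample = {
        "rs1": {"et-0/0/1": [{"hostname": "cs1", "port": "et-0/0/63"}]},
        "cs1": {"et-0/0/63": [{"hostname": "rs1", "port": "et-0/0/1"}]},
    }
    for link in lldp_to_links(sample):
        print(link)
```

In practice LLDP hostnames may need normalizing (for example stripping a domain suffix) before they match device names in Netbox.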
16. Pull information from Netbox
Dynamic Inventory
● Runs before each playbook
● Pulls the device list and basic device attributes
● Creates groups dynamically based on role, custom fields, location, etc.
● Needs to be fast
Custom Module
● Executes a lot of queries and merges all the information into a single data structure (device model)
● One execution per device
● Creates a local cache in host_vars
● Somewhat slow: takes a couple of minutes to run
17. Custom module to generate device model
● Pull interfaces / IP / links / circuits information from Netbox
● Create a single data structure with all information
● Pre-calculate all peer IPs for point-to-point links (/31 & /127)
● Generate interface description based on internal rules
● Save all information in a local cache under host_vars
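The last two steps (one merged data structure, cached under host_vars) might look roughly like the sketch below. The field names and both helper functions are assumptions; the actual Roblox module is not shown in the slides. The cache is written as JSON, which Ansible reads from host_vars alongside YAML.

```python
import json
import tempfile
from pathlib import Path


def build_device_model(name, interfaces, ips, links):
    """Merge separately queried data into a single per-device structure."""
    return {"name": name, "interfaces": interfaces, "ips": ips, "links": links}


def cache_device_model(model: dict, host_vars_dir) -> Path:
    """Write the device model to host_vars/<device>.json and return the path."""
    path = Path(host_vars_dir) / f"{model['name']}.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(model, indent=2, sort_keys=True))
    return path


if __name__ == "__main__":
    model = build_device_model(
        "rs1-c1-chi1",
        interfaces=["et-0/0/1"],
        ips={"et-0/0/1.0": "10.10.194.31/31"},
        links=[{"local_int": "et-0/0/1", "peer_name": "cs1-c1-chi1"}],
    )
    with tempfile.TemporaryDirectory() as tmp:
        print(cache_device_model(model, Path(tmp) / "host_vars").name)
```

Caching trades freshness for speed: later playbooks read the file instead of repeating the slow Netbox queries.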
18. Precalculated peer IP addresses for point-to-point links
p2p_peers:
  - ip_family: 4
    link_is_active: true
    local_int: et-0/0/1.0
    local_ip: 10.10.194.31/31
    local_status: Active
    peer_int: et-0/0/63
    peer_ip: 10.10.194.30/31
    peer_name: cs1-c1-chi1
  - ip_family: 4
    link_is_active: true
    local_int: et-0/0/7.0
    local_ip: 10.10.194.39/31
    local_status: Active
    peer_int: et-0/0/63
    peer_ip: 10.10.194.38/31
    peer_name: cs2-c1-chi1
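For /31 and /127 links the peer address can be derived from the local one, since the subnet holds exactly two addresses. A small sketch using Python's standard `ipaddress` module; the function name is hypothetical and this is not necessarily how the Roblox module computes it.

```python
import ipaddress


def p2p_peer_ip(local_ip: str) -> str:
    """Return the other address of a two-address subnet (/31 or /127)."""
    iface = ipaddress.ip_interface(local_ip)
    network = iface.network
    if network.num_addresses != 2:
        raise ValueError(f"{local_ip} is not on a /31 or /127 point-to-point subnet")
    # The subnet contains exactly two addresses; pick the one that is not ours.
    peer = next(addr for addr in network if addr != iface.ip)
    return f"{peer}/{network.prefixlen}"


if __name__ == "__main__":
    print(p2p_peer_ip("10.10.194.31/31"))
    print(p2p_peer_ip("2001:db8::1/127"))
```

Precomputing this once in the device model keeps the Jinja templates free of address arithmetic.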
22. Use Docker to create multiple environments
Datacenter
Dynamic inventory config
Specific Playbooks
Specific Local Variables
Backbone
Dynamic inventory config
Specific Playbooks
Specific Local Variables
Load Balancer
Dynamic inventory config
Specific Playbooks
Specific Local Variables
Servers
Dynamic inventory config
Specific Playbooks
Specific Local Variables
Dynamic Inventory Script
Shared Roles and Modules
Shared Playbooks
23. Dynamic Inventory Configuration
netbox:
  group_by:
    default: [ device_role, rack, site ]
    custom: [ design_rev, service_group ]
  filters:
    dc:
      - site: [ dc1, dc2 ]
    border-router:
      - role: border-router
  hosts_vars:
    ip:
      ansible_ssh_host: primary_ip
    general:
      platform: platform
      role: device_role
      site: site
    device_type:
      device_type: slug
    status:
      status: label
Based on AAbouZaid/netbox-as-ansible-inventory project
Dynamic Inventory script behavior defined in a config file
group_by to define the ansible groups we need
Filters to limit the devices list that get pulled from netbox
host_vars to define device host_vars to populate
● This is a very important piece of the puzzle
● So strategic that we decided to fork the initial project
and maintain our own version
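To illustrate what `group_by` does conceptually, here is a minimal sketch that builds Ansible-style groups from a device list. It is not the code of the AAbouZaid project or of the Roblox fork; the function name and the flat device dictionaries are assumptions.

```python
from collections import defaultdict


def build_groups(devices: list, group_by: list) -> dict:
    """Map each value of the chosen attributes to the devices that carry it."""
    groups = defaultdict(list)
    for device in devices:
        for key in group_by:
            value = device.get(key)
            if value:  # skip attributes that are unset for this device
                groups[str(value)].append(device["name"])
    return {group: sorted(hosts) for group, hosts in groups.items()}


if __name__ == "__main__":
    # Sample devices only; names and attribute values are made up.
    devices = [
        {"name": "br1-dc1", "device_role": "border-router", "site": "dc1"},
        {"name": "rs1-dc1", "device_role": "rack-switch", "site": "dc1"},
        {"name": "rs1-dc2", "device_role": "rack-switch", "site": "dc2"},
    ]
    for group, hosts in sorted(build_groups(devices, ["device_role", "site"]).items()):
        print(group, hosts)
```

Because groups come straight from Netbox attributes, adding a device there is enough for playbooks targeting its role or site to pick it up.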
28. Different approaches to automation
Config deployment    Load Override     Merge-ish
Change Diff          Supported/Easy    --check
Add new elements     Easy              Easy
Remove elements      Easy              Hard
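The "Remove elements" row is the key difference between the two approaches. A toy model, treating a configuration as a dictionary, shows why; real device configs are hierarchical and the function names here are purely illustrative.

```python
def apply_merge(running: dict, intended: dict) -> dict:
    """Merge: intended entries are added or updated, nothing is removed."""
    result = dict(running)
    result.update(intended)
    return result


def apply_override(running: dict, intended: dict) -> dict:
    """Override: the intended config replaces the running config entirely."""
    return dict(intended)


if __name__ == "__main__":
    running = {"vlan10": "web", "vlan20": "old-app"}
    intended = {"vlan10": "web", "vlan30": "game"}

    # vlan20 was dropped from the intended config.
    print("merge:   ", sorted(apply_merge(running, intended)))
    print("override:", sorted(apply_override(running, intended)))
```

With merge, the stale `vlan20` entry survives unless something explicitly deletes it, which means tracking removals separately. With override, leaving it out of the generated config is enough.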
29. Build the configs with reusable templates
● P1 to P6: unique set of properties per device
● T1, T2: template per role
● B: reusable base
○ Banners
○ Logging strings
○ Communities
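The layering above (shared base, one template per role, per-device properties) can be sketched as follows. The talk uses Jinja; this example uses Python's standard `string.Template` as a stand-in so it runs without extra packages, and every template, role and property name is hypothetical.

```python
from string import Template

# Reusable base shared by every device (banners, logging, communities).
BASE = Template("banner: $banner\nsyslog: $syslog_host\n")

# One template per role.
ROLE_TEMPLATES = {
    "rack-switch": Template("hostname: $hostname\nuplinks: $uplinks\n"),
    "border-router": Template("hostname: $hostname\nasn: $asn\n"),
}


def render_config(role: str, properties: dict) -> str:
    """Render base + role template using the device's own properties."""
    return BASE.substitute(properties) + ROLE_TEMPLATES[role].substitute(properties)


if __name__ == "__main__":
    # Per-device properties; values are made up.
    props = {
        "banner": "Authorized access only",
        "syslog_host": "192.0.2.10",
        "hostname": "rs1-c1-chi1",
        "uplinks": "et-0/0/1,et-0/0/7",
    }
    print(render_config("rack-switch", props))
```

`substitute` raises `KeyError` when a property is missing, so an incomplete device definition fails at build time rather than producing a partial config.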
31. Building in Flight & Dealing with Legacy
● Had to start without the full tool kit
● Handle different stages of the life cycle
○ New
○ Retrofit
○ Maintain
● We needed a way to test before commit
32. Playbooks used
├── pb.config.generate.yaml
├── pb.config.diff.yaml
├── pb.config.commit.yaml
● Junos Diff
○ Iterate on one device at a time
○ Bring “legacy” and “brownfield” devices under Ansible.
○ Once the templates match reality, “commit-and-quit” with confidence
33. Test:
Validate template changes with Diff
Intended Result:
Diff file empty
When the results are True:
Automation matches reality
TDD with Junos using “Diff”
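The "empty diff means the templates match reality" test can be imitated offline. In the talk the diff is computed by Junos itself; the sketch below instead compares two text configs with Python's `difflib`, purely to illustrate the pass/fail logic. Function names are hypothetical.

```python
import difflib


def config_diff(running: str, generated: str) -> list:
    """Return unified-diff lines between the running and generated configs."""
    return list(
        difflib.unified_diff(
            running.splitlines(),
            generated.splitlines(),
            fromfile="running",
            tofile="generated",
            lineterm="",
        )
    )


def templates_match_reality(running: str, generated: str) -> bool:
    """The test passes when the diff is empty."""
    return not config_diff(running, generated)


if __name__ == "__main__":
    running = "set system host-name rs1\nset system ntp server 192.0.2.1\n"
    generated = "set system host-name rs1\nset system ntp server 192.0.2.2\n"
    print(templates_match_reality(running, running))
    for line in config_diff(running, generated):
        print(line)
```

Iterating on the templates until this check passes for a brownfield device is what makes the first real commit a non-event.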
38. ● Diff for other vendors
○ Arista added this feature
● F5
○ Full config not an option in Ansible
○ Custom modules for bootstrap
○ Manage VIPs, nodes, pools
After thoughts...
42. Examples of Rack Switch Variation
web
web
application
virtualization
database
virtualization
game
virtualization
provisioning
43. Build the configs with reusable parts
● P1 to P6: unique set of properties per device
● T1, T2: template per role
● B: reusable base
○ Banners
○ Logging strings
○ Communities
P4 P6
46. ● Ownership is pushed out
● Avoids asynchronous communication
● Keeps both teams honest
● Ensures that all things are codified
Helping both server teams and networking teams
47. If it is captured in code, it’s not a one off.
Network Design
Naming Convention
Cabling Convention
Datacenter Layout
Vendor Specific Information
Device Revision
Rack Revision
v1
v2
v2.1
v1.1
v2.2
v2.1
v1.2
v2.4
v2.1
v1
v2
v2.1
v1.1
v2.2
v2.1
v1.2
v2.4
v2.1
49. How to win
● Strong support within the organization
○ Automation is the only long term solution
● Move quickly and iterate
● It’s okay if it’s not perfect the first time
○ It WON’T be right the first time
● Persistence
○ Insist on the solution
○ But, listen and adapt
50. The winning team of NE + NRE
Network Engineer (NE)
● Responsible for defining the network architecture
● Consumes automation tools
● Owns config templates
● Comfortable with Git
Network Reliability Engineer (NRE)
● Responsible for defining the automation suite architecture
● Packages / develops / maintains the tools
● Comfortable with network devices and architecture