Building a Dynamic DNS Infrastructure
Khalid Hasanov
Overview
★ Motivation
★ Legacy DNS Infrastructure
★ New DNS Infrastructure
○ Design
○ Monitoring
○ Performance
★ Questions
DNS Terminology
1. Authoritative DNS server - the server that owns the records for a hostname
2. Recursive DNS server - resolves any query it receives by consulting the corresponding authoritative server if there is no answer in its own cache
3. Validating DNS server - a resolver that verifies that the response it has received is correct
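The relationship between a recursive server and an authoritative server can be illustrated with a minimal sketch. This is not how any real resolver is implemented; the class names, the hostname `app.example.internal`, the address, and the TTL are all hypothetical and exist only to show the cache-then-ask-the-owner behaviour described above.

```python
import time


class AuthoritativeServer:
    """Hypothetical in-memory stand-in for the server that owns a zone."""

    def __init__(self, records):
        self.records = records  # name -> (address, ttl_seconds)
        self.queries = 0        # how many times we were actually asked

    def query(self, name):
        self.queries += 1
        return self.records.get(name)


class RecursiveResolver:
    """Answers from its cache, otherwise consults the authoritative server."""

    def __init__(self, authoritative, clock=time.monotonic):
        self.authoritative = authoritative
        self.clock = clock
        self.cache = {}  # name -> (address, expires_at)

    def resolve(self, name):
        cached = self.cache.get(name)
        if cached is not None and cached[1] > self.clock():
            return cached[0]  # cache hit, no upstream query needed
        answer = self.authoritative.query(name)
        if answer is None:
            return None
        address, ttl = answer
        self.cache[name] = (address, self.clock() + ttl)
        return address


# Illustrative data only.
auth = AuthoritativeServer({"app.example.internal": ("10.0.0.5", 300)})
resolver = RecursiveResolver(auth)
first = resolver.resolve("app.example.internal")   # goes to the authoritative server
second = resolver.resolve("app.example.internal")  # served from the cache
```

The second lookup never reaches the authoritative server, which is why a cached name keeps resolving even when a backend is unavailable.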
DNS Update Workflow
Legacy DNS Architecture - Bind
Issues we had with Bind
● Performance
○ Several attempts to optimize Bind - CI build plans
○ Bind updates took at least 15 minutes to take effect
● No centralized source of truth for all DNS servers
● No automatic replication
● No dynamic updates, no API to programmatically modify DNS data
Needs for a new DNS system
● Better performance
● Dynamic updates
● Automatic replication and failover
● Centralized source of truth for all our DNS servers
● Migration should be transparent for our engineers
What’s out there? DNS server software
PowerDNS Architecture
Interaction with other infrastructure components
PowerDNS replication
We use PowerDNS in native replication mode:
○ PowerDNS will not send out DNS update notifications
○ PowerDNS will not react to DNS update requests
○ The database backend takes care of replication
PowerDNS native replication
● PostgreSQL continuously ships Write-Ahead Log records to the standby servers
● Each standby server operates in continuous recovery mode
Disaster Scenarios - Actions
1. Database backend failure:
Action: Kill PostgreSQL backend
2. Authoritative backend failure:
Action: Kill PowerDNS authoritative application
3. Recursor failure:
Action: Kill PowerDNS recursor application
Disaster Scenarios - Observations
1. Database backend failure:
a. No issue if the requested domain is already in the recursor cache
b. If not, queries can still be answered by the slave PowerDNS servers
2. Authoritative backend failure:
Same behaviour as observed for the database backend failure
3. Recursor failure:
a. The affected PowerDNS server cannot serve any requests
b. Requests are handled by the slave nameservers
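The recursor-failure observation amounts to a client falling back to the next nameserver in its list. A minimal sketch of that fallback follows, with simulated lookups rather than real DNS queries; the function names, labels, hostname, and address are hypothetical.

```python
class NameserverDown(Exception):
    """Raised by a simulated nameserver that cannot answer."""


def resolve_with_failover(name, nameservers):
    """Try each (label, lookup) pair in order; return the first answer."""
    for label, lookup in nameservers:
        try:
            return label, lookup(name)
        except NameserverDown:
            continue  # this server is unavailable, try the next one
    raise NameserverDown(f"no nameserver could answer {name}")


def dead_primary(name):
    # Simulates the server whose recursor was killed.
    raise NameserverDown("recursor killed")


def healthy_secondary(name):
    # Simulates a slave nameserver that still answers (illustrative data).
    return {"app.example.internal": "10.0.0.5"}[name]


served_by, answer = resolve_with_failover(
    "app.example.internal",
    [("primary", dead_primary), ("secondary", healthy_secondary)],
)
```

In practice this ordering is handled by the client's resolver configuration, not by application code; the sketch only makes the observed behaviour explicit.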
Production readiness - Load testing
● Load testing using JMeter
● JMeter tests started on 6 different VMs in parallel
● The test ran for 10 minutes and the total number of samples exceeded 6 million
● Only 5 errors in 6 million samples
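As a rough reading of these numbers, the figures above translate into an approximate throughput and error rate. The calculation below uses exactly 6 million samples as a round figure, so the throughput values are lower bounds (the slide says "over 6 million"); the even split across VMs is an assumption.

```python
samples = 6_000_000      # round figure; the actual total was higher
errors = 5
duration_s = 10 * 60     # 10-minute test
vms = 6                  # JMeter load generators

throughput = samples / duration_s   # queries per second overall
per_vm = throughput / vms           # assumes load was spread evenly
error_rate = errors / samples       # fraction of failed samples

print(f"{throughput:.0f} samples/s overall, about {per_vm:.0f} per VM")
print(f"error rate: {error_rate:.2e}")
```

That is at least 10,000 samples per second in aggregate, with fewer than one error per million samples.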
Integration with existing workflow
● Existing process of updating DNS zones and records:
○ Ansible inventories and group vars
○ We need a way to keep PDNS in sync while we still use the existing BIND-based DNS
● Internal tool called Staple:
○ Parses Git logs of Ansible inventory files to extract DNS-related changes
○ Syncs the changes to PDNS using its HTTP API
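Staple itself is internal and not shown here, so the following is only a sketch of what a single sync call to the PowerDNS HTTP API could look like. It builds a `PATCH` request against the zone endpoint with a `REPLACE` rrset and does not send it. The base URL, API key, zone, record, and the helper name `build_rrset_patch` are hypothetical; `localhost` is the usual server id in the PowerDNS API.

```python
import json
import urllib.request


def build_rrset_patch(api_base, api_key, zone, name, rtype, ttl, contents):
    """Build (but do not send) a PowerDNS API request replacing one rrset."""
    body = {
        "rrsets": [
            {
                "name": name,          # fully qualified, with trailing dot
                "type": rtype,
                "ttl": ttl,
                "changetype": "REPLACE",
                "records": [
                    {"content": c, "disabled": False} for c in contents
                ],
            }
        ]
    }
    url = f"{api_base}/api/v1/servers/localhost/zones/{zone}"
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        method="PATCH",
        headers={"X-API-Key": api_key, "Content-Type": "application/json"},
    )


# Illustrative values only; none of these come from the talk.
request = build_rrset_patch(
    api_base="http://pdns.example.internal:8081",
    api_key="changeme",
    zone="example.internal.",
    name="app.example.internal.",
    rtype="A",
    ttl=300,
    contents=["10.0.0.5"],
)
# urllib.request.urlopen(request) would send it to a real server.
```

A tool in Staple's position would presumably derive `name`, `rtype`, and `contents` from the inventory diff and issue one such request per changed rrset.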
Metrics we monitor and alert on
● Record-by-record and zone-by-zone comparison between Bind and PowerDNS
● PowerDNS-specific metrics
● Host-specific metrics
Record-by-record and zone-by-zone comparison between Bind and PowerDNS
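A record-by-record comparison can be reduced to a set difference per zone. The sketch below assumes records have already been fetched from both systems and normalized into `(name, type, content)` tuples; how they are fetched is not covered in the slides, and the function name and sample data are hypothetical.

```python
def diff_zone(bind_records, pdns_records):
    """Return records present on one side but not the other."""
    bind_set = set(bind_records)
    pdns_set = set(pdns_records)
    return {
        "missing_in_pdns": sorted(bind_set - pdns_set),
        "extra_in_pdns": sorted(pdns_set - bind_set),
    }


# Illustrative data: one record has drifted between the two systems.
bind = [
    ("a.example.internal.", "A", "10.0.0.1"),
    ("b.example.internal.", "A", "10.0.0.2"),
]
pdns = [
    ("a.example.internal.", "A", "10.0.0.1"),
    ("b.example.internal.", "A", "10.0.0.9"),
]

drift = diff_zone(bind, pdns)
in_sync = not drift["missing_in_pdns"] and not drift["extra_in_pdns"]
```

An alert could then fire whenever `in_sync` is false for any zone, with the two lists serving as the alert detail.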
PowerDNS specific metrics
Host specific metrics
User Interface
Future Work ...
Questions?