1. a study of
our DNS full-resolvers
Matsuzaki ‘maz’ Yoshinobu
<maz@iij.ad.jp>
2. Topic for today
• Lessons learned from an outage of our full-resolvers
• An interesting behavior of clients
3. Users and DNS full-resolver
full-resolver
(cache nameserver)
Most users use their ISP’s
full-resolvers because that information
is provided automatically
maz@iij.ad.jp 3
4. DNS cache nameserver
• Usually ISPs provide 2 nameservers for customers
• Just in case
• Our assumptions here:
• Even if a single server fails, the other server can handle
DNS queries
• Users somehow automatically pick a usable one for
their use
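The two assumptions above can be sketched as a small simulation. This is a minimal sketch, not how any particular operating system's stub resolver works: the `make_nameserver` and `resolve` helpers, the `rounds` parameter, and the try-in-configured-order policy are all assumptions made for illustration, and no real network traffic is involved.

```python
from typing import Callable, Optional, Sequence, Tuple

# Hypothetical model: a "nameserver" is a function that returns an address,
# or raises TimeoutError when the query is discarded.
Nameserver = Callable[[str], str]


def make_nameserver(alive: bool, address: str = "192.0.2.1") -> Nameserver:
    """Build a simulated nameserver (192.0.2.1 is a documentation address)."""
    def query(qname: str) -> str:
        if not alive:
            raise TimeoutError(f"no answer for {qname}")
        return address
    return query


def resolve(qname: str, nameservers: Sequence[Nameserver],
            rounds: int = 2) -> Tuple[Optional[str], int]:
    """Try each nameserver in configured order, for up to `rounds` passes.

    Returns (answer or None, number of queries sent).
    """
    sent = 0
    for _ in range(rounds):
        for ns in nameservers:
            sent += 1
            try:
                return ns(qname), sent
            except TimeoutError:
                continue  # fall through to the next configured server
    return None, sent


# Single failure: the first server is down, the second one answers.
print(resolve("www.example.com", [make_nameserver(False), make_nameserver(True)]))
# Double failure: every query is lost, and the retries add load.
print(resolve("www.example.com", [make_nameserver(False), make_nameserver(False)]))
```

In this model a single failure costs the client one extra query, while a double failure turns one lookup into four unanswered queries.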
5. In 2009, we had trouble
• Trouble on cache nameservers for consumers
• Apr, 2009
• On two (all) nameservers
• 1st failure happened on
a server (ns01)
• then 2nd failure happened
on another server (ns11)
• About 12min blackout
6. Failures on our cache nameservers
• ns01: 17:14:26 - 17:48:07 (33min41sec)
• ns11: 17:35:51 - 17:48:52 (13min01sec)
• While both servers were in trouble
(12min16sec), our users couldn’t resolve
hostnames
• During this trouble, the servers couldn’t answer
14,005,644 DNS queries
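The durations can be re-derived from the timestamps listed on this slide. The sketch below only does that arithmetic; the helper names (`parse`, `overlap`, `fmt`) are made up for illustration. From the listed start and end times, the ns01 window works out to 33min41sec, and the period when both servers were down to 12min16sec.

```python
from datetime import datetime, timedelta


def parse(t: str) -> datetime:
    """Parse an HH:MM:SS time of day (the date is irrelevant here)."""
    return datetime.strptime(t, "%H:%M:%S")


def overlap(a, b) -> timedelta:
    """Length of the intersection of two (start, end) windows."""
    start = max(a[0], b[0])
    end = min(a[1], b[1])
    return max(end - start, timedelta(0))


def fmt(d: timedelta) -> str:
    minutes, seconds = divmod(int(d.total_seconds()), 60)
    return f"{minutes}min{seconds:02d}sec"


# Failure windows as listed on the slide.
ns01 = (parse("17:14:26"), parse("17:48:07"))
ns11 = (parse("17:35:51"), parse("17:48:52"))

print("ns01:", fmt(ns01[1] - ns01[0]))    # ns01: 33min41sec
print("ns11:", fmt(ns11[1] - ns11[0]))    # ns11: 13min01sec
print("both:", fmt(overlap(ns01, ns11)))  # both: 12min16sec
```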
8. Before failure
• Clients prefer to use ns01
• Order of configuration?
• Clients sent DNS queries to
the other server as well
• measuring delays?
• just in case?
9. During single failure
• DNS queries to ns01 were
discarded during this period
• It seems users could still resolve
hostnames as ns11 was alive
• No strange traffic pattern here
• Users might feel some delays
• A bit higher rate of queries in the
first 3min, and then ‘stable’ state
10. During single failure (cont.)
• Query rate (ns01+ns11) looks
almost the same as before
• Even though ns01 was discarding
queries during this period
• Probably most clients usually send
DNS queries to both nameservers
11. During double failure (outage)
• Users couldn’t resolve hostnames
at all during this period
• Query rate suddenly increased on
both nameservers
• Those are all discarded though
• Mostly because of ‘retries’
• We observed multiple queries that
have the same QNAME
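One hedged way to look for such retries in a query log is to compare the total number of queries with the number of distinct (client, QNAME) pairs in a time window. The record format and the sample records below are invented for illustration; they are not taken from the measurements in this talk.

```python
from collections import Counter
from typing import Iterable, Tuple


def queries_per_lookup(queries: Iterable[Tuple[str, str]]) -> float:
    """Average number of queries per distinct (client, qname) pair.

    A value near 1.0 suggests few repeats; larger values suggest that
    clients are re-sending the same question.
    """
    counts = Counter(queries)
    if not counts:
        return 0.0
    return sum(counts.values()) / len(counts)


# Invented sample window: client c1 asks the same name three times.
window = [
    ("c1", "www.example.com"),
    ("c1", "www.example.com"),
    ("c1", "www.example.com"),
    ("c2", "www.example.org"),
]
print(queries_per_lookup(window))  # 2.0
```

Repeats inside a short window are only a hint of retries, since a client may legitimately ask for the same name again.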
12. Restoration
• Once ns01 was restored, it got
about 7 times more queries than
usual for several seconds
• Web pages are composed of many
“modules”
• A single web page sometimes triggers
several or more DNS queries
• Browsers’ prefetch function
• 12min was enough to flush
client-side DNS caches
13. Restoration (cont.)
• Then ns11 was also restored
• It also got a higher rate of queries for
several seconds
• The query rate gradually returned
to the same ‘normal’ state as
before
14. Lessons learned
• A single server failure will not cause a disaster when
users configure multiple DNS cache servers on their
devices
• Probably the impact could be negligible
• During double failure (full-outage), nameservers
got more queries
• Once a server is restored from full-outage, it gets
a higher query rate for a while
• In our case, 7 times more than usual for several seconds
15. Redundancy is important
• DNS resolution keeps working as long as one of the servers
is functional in each part of the DNS
• Full-resolvers (caching nameservers)
• Authoritative nameservers
• Build in redundancy to avoid outages
• A multi-server deployment works well
• IP anycast would also be useful
• my bdnog7 talk - https://www.slideshare.net/bdnog/ip-anycasting
• Once an outage occurs, we should expect a large number of
queries during and just after the outage
• A warning for those who have a security device in front of
nameservers
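To make the warning concrete, the sketch below estimates how much of a post-outage surge a rate-limiting device would drop. Only the factor of 7 comes from this talk; the baseline rate, the device limit, and the function names are hypothetical numbers and names chosen for illustration.

```python
def surge_qps(baseline_qps: float, surge_factor: float = 7.0) -> float:
    """Peak query rate right after restoration (7x was our observed case)."""
    return baseline_qps * surge_factor


def dropped_fraction(baseline_qps: float, limit_qps: float,
                     surge_factor: float = 7.0) -> float:
    """Fraction of surge queries exceeding a device's per-second limit."""
    peak = surge_qps(baseline_qps, surge_factor)
    if peak <= limit_qps:
        return 0.0
    return (peak - limit_qps) / peak


# Hypothetical numbers: 10,000 qps baseline, device limited to 30,000 qps.
print(f"{dropped_fraction(10_000, 30_000):.0%} of surge queries dropped")
```

Dropped queries are likely to be retried by clients, so a limit sized for normal load may prolong the recovery rather than protect it.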
16. Users and DNS full-resolver
full-resolver
(cache nameserver)
Many devices
on a home network
17. Usual graph - 5min average
measurement date: 2018/05/16
18. A bit different view - 1sec average
measurement date: 2018/05/16
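Why the two views differ can be shown with a small sketch: a one-second spike almost disappears in a 5-minute average. The series below is synthetic (a flat 1,000 qps with a single 10,000 qps second), not the measured data, and `window_averages` is a helper made up for this example.

```python
from typing import List, Sequence


def window_averages(per_second: Sequence[float], window: int) -> List[float]:
    """Average a per-second series over consecutive windows of `window` seconds."""
    averages = []
    for i in range(0, len(per_second), window):
        chunk = per_second[i:i + window]
        averages.append(sum(chunk) / len(chunk))
    return averages


# Synthetic 5 minutes of traffic: flat 1,000 qps with one 10,000 qps second.
series = [1000.0] * 300
series[285] = 10000.0

print(max(window_averages(series, 300)))  # 1030.0 (5min average)
print(max(window_averages(series, 1)))    # 10000.0 (1sec average)
```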
19. Peaks on the hour
• Minor peaks on the hour and half hour (like 07:30)
• “Alarm clock” wakes up the phone itself
• Some applications are also initiated by the wakeup
• I guess those are mostly coming from smartphones
• The QNAMEs also hint at this
20. A spike at 15sec before the hour
measurement date: 2018/05/16
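A spike at a fixed offset from the hour can be looked for by bucketing query timestamps by their distance from the nearest hour. This is a minimal sketch of that idea; the sample timestamps are invented, and `offset_from_nearest_hour` is a hypothetical helper, not the tooling used for the graph.

```python
from collections import Counter
from datetime import datetime


def offset_from_nearest_hour(ts: datetime) -> int:
    """Signed seconds from the nearest full hour (-15 means 15sec before)."""
    seconds = ts.minute * 60 + ts.second
    return seconds if seconds < 1800 else seconds - 3600


# Invented sample: several clients firing 15 seconds before different hours.
sample = [
    datetime(2018, 5, 16, 6, 59, 45),
    datetime(2018, 5, 16, 7, 59, 45),
    datetime(2018, 5, 16, 8, 59, 45),
    datetime(2018, 5, 16, 7, 0, 0),
    datetime(2018, 5, 16, 7, 12, 30),
]

histogram = Counter(offset_from_nearest_hour(ts) for ts in sample)
print(histogram.most_common(1))  # [(-15, 3)]
```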
22. Summary
• It’s reasonable for us to provide 2 full-resolvers
(caching nameservers) to customers
• Clients seem to have the ability to use a functional one
• Once an outage occurs, we should expect a large number of
queries during and just after the outage
• A warning to those who have a security device in front of
nameservers
• Clients are synced up unintentionally
• ‘alarm clock’ or scheduled tasks
• This particular case is not an issue at this moment, but
it’s worth paying attention to those behaviors