1.
a study of
our DNS full-resolvers
Matsuzaki ‘maz’ Yoshinobu
<maz@iij.ad.jp>
2.
Topic for today
• Lesson learned from an outage of our full-resolvers
• An interesting behavior of clients
3.
Users and DNS full-resolver
full-resolver
(cache nameserver)
Most users use their ISP’s full-
resolvers because that information
is provided automatically
maz@iij.ad.jp 3
4.
DNS cache nameserver
• Usually ISPs provide 2 nameservers for customers
• Just in case
• Our assumptions here:
• Even if a single server fails, the other server can handle
DNS queries
• Users somehow automatically pick a usable one for
their use
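The second assumption can be illustrated with a minimal sketch of a client that walks its configured nameserver list until one answers. This is not how any particular stub resolver is implemented: `resolve`, `make_transport`, the `attempts` parameter, and the simulated transport are all hypothetical names chosen for illustration, and real clients may query servers in parallel or in a different order.

```python
from typing import Callable, List, Optional, Sequence, Set, Tuple

# Hypothetical transport: returns an answer, or None when the query is dropped.
Transport = Callable[[str, str], Optional[str]]


def resolve(qname: str, nameservers: Sequence[str], send: Transport,
            attempts: int = 2) -> Tuple[Optional[str], List[str]]:
    """Try each configured nameserver in order, repeating the list `attempts` times.

    Returns the answer (or None) and the list of servers that were queried.
    """
    tried: List[str] = []
    for _ in range(attempts):
        for ns in nameservers:
            tried.append(ns)
            answer = send(ns, qname)
            if answer is not None:
                return answer, tried
    return None, tried


def make_transport(alive: Set[str]) -> Transport:
    """Simulated network: only servers in `alive` answer; others drop the query."""
    def send(ns: str, qname: str) -> Optional[str]:
        return f"{qname} answered by {ns}" if ns in alive else None
    return send


if __name__ == "__main__":
    servers = ["ns01", "ns11"]
    print(resolve("example.com", servers, make_transport({"ns01", "ns11"})))
    print(resolve("example.com", servers, make_transport({"ns11"})))
    print(resolve("example.com", servers, make_transport(set())))
```

In this sketch a single failure costs the client one extra query, while a double failure makes every lookup generate `attempts × len(nameservers)` queries that all go unanswered.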
5.
In 2009, we had trouble
• Trouble on cache nameservers for consumers
• April 2009
• On two (all) nameservers
• The first failure happened on
one server (ns01)
• then a second failure happened
on the other server (ns11)
• About 12min blackout
6.
Failures on our cache nameservers
• ns01: 17:14:26 - 17:48:07 (33min41sec)
• ns11: 17:35:51 - 17:48:52 (13min01sec)
• While both servers were in trouble
(12min16sec), our users couldn’t resolve
hostnames
• During this trouble, the servers couldn’t answer
14,005,644 DNS queries
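The 12min16sec blackout is the overlap of the two failure windows. A minimal sketch of that arithmetic, using the timestamps from this slide (the helper name `overlap` is hypothetical):

```python
from datetime import datetime, timedelta

FMT = "%H:%M:%S"


def overlap(a_start: str, a_end: str, b_start: str, b_end: str) -> timedelta:
    """Length of the period during which both intervals are active."""
    start = max(datetime.strptime(a_start, FMT), datetime.strptime(b_start, FMT))
    end = min(datetime.strptime(a_end, FMT), datetime.strptime(b_end, FMT))
    # Non-overlapping intervals yield a zero-length overlap.
    return max(end - start, timedelta(0))


# ns01 and ns11 failure windows as listed on the slide.
blackout = overlap("17:14:26", "17:48:07", "17:35:51", "17:48:52")

if __name__ == "__main__":
    print(blackout)  # 0:12:16
```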
8.
Before failure
• Clients prefer to use ns01
• Order of configuration?
• Clients sent DNS queries to
another server as well
• measuring delays?
• just in case?
9.
During single failure
• DNS queries to ns01 were
discarded during this period
• It seems users could still resolve
hostnames because ns11 was alive
• No strange traffic pattern here
• Users might have noticed some delay
• A slightly higher query rate in the
first 3 min, then a ‘stable’ state
10.
During single failure (cont.)
• Query rate (ns01 + ns11) looks
almost the same as before
• Even though ns01 was discarding
queries during this period
• Probably most clients usually send
DNS queries to both nameservers
11.
During double failure (outage)
• Users couldn’t resolve hostnames
at all during this period
• Query rate suddenly increased on
both nameservers
• All of those were discarded, though
• Mostly because of ‘retries’
• We observed multiple queries that
have the same QNAME
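One way to spot this kind of retry traffic is to count how many queries in a short window repeat an earlier (client, QNAME) pair. The sketch below assumes a hypothetical, already-parsed query log of `(client_ip, qname)` tuples; the function name `retry_ratio` and the sample data are illustrative, not taken from the measurement described here.

```python
from collections import Counter
from typing import Iterable, Tuple


def retry_ratio(queries: Iterable[Tuple[str, str]]) -> float:
    """Fraction of queries that repeat an earlier (client, qname) pair.

    `queries` is assumed to cover one short time window.
    """
    counts = Counter(queries)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    repeats = total - len(counts)  # everything beyond the first occurrence
    return repeats / total


# Hypothetical sample: one client asks for the same name three times.
sample = [
    ("192.0.2.1", "www.example.com"),
    ("192.0.2.1", "www.example.com"),
    ("192.0.2.1", "www.example.com"),
    ("192.0.2.2", "www.example.com"),
    ("192.0.2.1", "example.net"),
]

if __name__ == "__main__":
    print(retry_ratio(sample))
```

A ratio that jumps while answers stop flowing suggests the extra load is retries rather than new demand.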
12.
Restoration
• Once ns01 was restored, it got
about 7 times more queries than
usual for several seconds
• Web pages are composed of many
“modules”
• A single web page sometimes triggers
several or more DNS queries
• Browsers’ prefetch function
• 12 min was long enough to flush
client-side DNS caches
13.
Restoration (cont.)
• Then ns11 was also restored
• It also got a higher rate of queries for
several seconds
• The query rate gradually returned
to the same ‘normal’ state as
before
14.
Lesson learned
• A single server failure will not cause a disaster when
users configure multiple DNS cache servers on their
devices
• Probably the impact could be negligible
• During double failure (full-outage), nameservers
got more queries
• Once a server is restored from full-outage, it gets
higher query rate for a while
• In our case, 7 times more than usual for several seconds
15.
Redundancy is important
• DNS resolution keeps working as long as at least one server
is functional in each part of the DNS
• Full-resolvers (caching nameservers)
• Authoritative nameservers
• Build in redundancy to avoid outages
• A multiple-server deployment works well
• IP anycast would also be useful
• my bdnog7 talk - https://www.slideshare.net/bdnog/ip-anycasting
• Once an outage occurs, we should expect a large volume of
queries during and just after the outage
• A warning for those who have a security device in front of
nameservers
16.
Users and DNS full-resolver
full-resolver
(cache nameserver)
Many devices
on a home network
17.
Usual graph - 5min average
measurement date: 2018/05/16
18.
A bit different view - 1sec average
measurement date: 2018/05/16
19.
Peaks on the hour
• Minor peaks at half past the hour (like 07:30)
• An “alarm clock” wakes up the phone itself
• Some applications are also triggered by the wakeup
• I guess those are mostly coming from smartphones
• The QNAMEs also hint at this
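The difference between the 5min-average and 1sec-average graphs can be reproduced with synthetic numbers. The sketch below is only an illustration: the query rates, the burst length, and the helper `window_averages` are assumptions, not values from the measurement.

```python
from typing import List, Sequence


def window_averages(per_second: Sequence[float], window: int) -> List[float]:
    """Average a per-second series over consecutive, non-overlapping windows."""
    return [
        sum(per_second[i:i + window]) / window
        for i in range(0, len(per_second) - window + 1, window)
    ]


# Synthetic traffic: 10 minutes at 1000 qps, with a 5-second burst of
# 5000 qps starting "on the hour" (t = 300 s). All numbers are made up.
series = [1000.0] * 600
for t in range(300, 305):
    series[t] = 5000.0

five_min = window_averages(series, 300)  # what a 5min-average graph plots
one_sec = window_averages(series, 1)     # what a 1sec-average graph plots

if __name__ == "__main__":
    print("5min averages:", five_min)
    print("1sec peak:", max(one_sec))
```

With these made-up numbers the burst raises the 5min average by under 7%, while the 1sec view shows a fivefold spike, which is why such peaks only stand out at fine granularity.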
20.
A spike 15 sec before the hour
measurement date: 2018/05/16
21.
We see something different now
measurement date: 2019/10/25
22.
Summary
• It’s reasonable for us to provide 2 full-resolvers
(caching nameservers) to customers
• Clients seem to have the ability to use a functional one
• Once an outage occurs, we should expect a large volume of
queries during and just after the outage
• A warning to those who have a security device in front of
nameservers
• Clients become synchronized unintentionally
• ‘alarm clock’ or scheduled tasks
• This particular case is not an issue at the moment, but
it’s worth paying attention to those behaviors