
A study of our DNS full-resolvers


Published in: Technology

  1. A study of our DNS full-resolvers, Matsuzaki ‘maz’ Yoshinobu <>
  2. Topics for today • Lessons learned from an outage of our full-resolvers • An interesting behavior of clients
  3. Users and DNS full-resolver (cache nameserver) • Most users use their ISP’s full-resolvers, since the resolver addresses are provided automatically
  4. DNS cache nameserver • Usually ISPs provide 2 nameservers for customers • Just in case • Our assumptions here: • Even if a single server fails, the other server can handle the DNS queries • Users somehow automatically pick a usable one for their use
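
The assumption on this slide, that a client falls back to the second resolver when the first stops answering, can be sketched as a toy stub resolver. This is a simulation, not a real DNS client; the server names and timeout handling are illustrative:

```python
def make_resolver(name, healthy=True):
    """Toy resolver: answers if healthy, otherwise behaves like a timeout."""
    def query(qname):
        if not healthy:
            raise TimeoutError(f"{name}: no response")
        return f"{qname} answered by {name}"
    return query

def resolve_with_fallback(qname, resolvers):
    """Try each configured server in order, moving on when one times out."""
    for query in resolvers:
        try:
            return query(qname)
        except TimeoutError:
            continue
    raise TimeoutError("all configured resolvers failed")

# ns01 down, ns11 up: resolution still works, as during the single failure
ns01 = make_resolver("ns01", healthy=False)
ns11 = make_resolver("ns11", healthy=True)
print(resolve_with_fallback("www.example.com", [ns01, ns11]))
# -> www.example.com answered by ns11
```

As long as one of the two configured servers answers, the loop succeeds, which matches the "users somehow pick a usable one" assumption.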
  5. In 2009, we had an outage • Trouble on the cache nameservers for consumers • April 2009 • On two (all) nameservers • The 1st failure happened on one server (ns01) • Then a 2nd failure happened on the other server (ns11) • About a 12-minute blackout
  6. Failures on our cache nameservers • ns01: 17:14:26 - 17:48:07 (33min14sec) • ns11: 17:35:51 - 17:48:52 (13min01sec) • While both servers were in trouble (12min16sec), our users couldn’t resolve hostnames • During this trouble, the servers couldn’t answer 14,005,644 DNS queries
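
The blackout figure quoted above can be cross-checked from the timestamps; the full outage is simply the overlap of the two failure windows (a stdlib-only sketch):

```python
from datetime import datetime

fmt = "%H:%M:%S"
ns01 = (datetime.strptime("17:14:26", fmt), datetime.strptime("17:48:07", fmt))
ns11 = (datetime.strptime("17:35:51", fmt), datetime.strptime("17:48:52", fmt))

# The blackout is the overlap of the two failure windows
blackout = min(ns01[1], ns11[1]) - max(ns01[0], ns11[0])
print(blackout)  # -> 0:12:16, the 12min16sec when no queries were answered
```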
  7. The query graph
  8. Before the failure • Clients prefer to use ns01 • Order of configuration? • Clients sent DNS queries to the other server as well • Measuring delays? • Just in case?
  9. During the single failure • DNS queries to ns01 were discarded during this period • It seems users could still resolve hostnames, since ns11 was alive • No strange traffic pattern here • Users might have felt some delays • A slightly higher query rate in the first 3 minutes, then a ‘stable’ state
  10. During the single failure (cont.) • The combined query rate (ns01+ns11) looks almost the same as before • Even though ns01 was discarding queries during this period • Probably most clients usually send DNS queries to both nameservers
  11. During the double failure (outage) • Users couldn’t resolve hostnames at all during this period • The query rate suddenly increased on both nameservers • Those queries were all discarded, though • Mostly because of ‘retries’ • We observed multiple queries that have the same QNAME
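
The repeated QNAMEs are consistent with ordinary stub-resolver retry behavior. A minimal model follows; the retry schedule is an assumption for illustration (glibc, for example, defaults to two attempts cycled across the configured servers):

```python
def queries_during_outage(qname, resolvers=("ns01", "ns11"), attempts=2):
    """Model a stub resolver during a full outage: every attempt times
    out, so the client re-sends the same QNAME to each server in turn."""
    sent = []
    for _ in range(attempts):
        for server in resolvers:
            sent.append((server, qname))  # no answer; the client retries
    return sent

log = queries_during_outage("www.example.com")
print(len(log))  # -> 4: one unresolved name put four queries on the wire
```

This is why the query rate rose on both servers even though no client was getting answers.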
  12. Restoration • Once ns01 was restored, it got about 7 times more queries than usual for several seconds • Web pages are composed of many “modules” • A single web page can trigger several or more DNS queries • Browsers’ prefetch function • 12 minutes was enough to flush the clients’ DNS caches
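
The restoration spike follows from the slide’s arithmetic: once client-side caches have expired, every “module” hostname on a page needs a fresh lookup. A toy count over invented resource URLs:

```python
from urllib.parse import urlparse

# Hypothetical resource URLs making up one web page (all names invented)
urls = [
    "https://www.example.com/index.html",
    "https://static.example.com/style.css",
    "https://img.example.net/logo.png",
    "https://cdn.example.org/lib.js",
    "https://ads.example.org/track.gif",
]

# With a cold client cache, each distinct hostname costs one DNS lookup
qnames = {urlparse(u).hostname for u in urls}
print(len(qnames))  # -> 5 lookups for a single page view
```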
  13. Restoration (cont.) • Then ns11 was also restored • It also got a higher query rate for several seconds • Gradually the query rate returned to the same ‘normal’ state as before
  14. Lessons learned • A single server failure will not cause a disaster when users configure multiple DNS cache servers on their devices • The impact is probably negligible • During the double failure (full outage), the nameservers got more queries • Once a server is restored from a full outage, it gets a higher query rate for a while • In our case, 7 times more than usual for several seconds
  15. Redundancy is important • DNS resolution keeps working as long as one of the servers is functional at each part of the DNS • Full-resolvers (caching nameservers) • Authoritative nameservers • Have redundancy, avoid outages • A multiple-server deployment works well • IP anycast would also be useful • my bdnog7 talk - • After an outage, we should expect a large number of queries during and just after the outage • A warning for those who have a security device in front of their nameservers
  16. Users and DNS full-resolver (cache nameserver) • Many devices on a home network
  17. Usual graph - 5min average • measurement date: 2018/05/16
  18. A bit different view - 1sec average • measurement date: 2018/05/16
  19. Peaks on the hour • Minor peaks on the hour and half-hour (like 07:30) • The “alarm clock” wakes up the phone itself • Some applications are also started by the wakeup • I guess these are mostly coming from smartphones • The QNAMEs also hint at this
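
The on-the-hour peaks can be reproduced with a toy per-second model. The burst size and background load below are invented numbers, but they show why the 1-second view exposes spikes that a 5-minute average smooths away:

```python
import random
from collections import Counter

random.seed(42)
queries = Counter()
for second in range(3600):
    queries[second] = random.randint(0, 2)  # steady background load
for wake in (0, 1800):                      # phones waking at :00 and :30
    queries[wake] += 500                    # synchronized alarm-clock burst

peak_second = max(queries, key=queries.get)
five_min_avg = sum(queries[s] for s in range(300)) / 300
print(peak_second in (0, 1800))             # -> True: spikes sit on the hour
print(queries[peak_second] > 100 > five_min_avg)  # -> True: averaging hides them
```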
  20. A spike at 15sec before the hour • measurement date: 2018/05/16
  21. We’ve got something different now • measurement date: 2019/10/25
  22. Summary • It’s reasonable for us to provide 2 full-resolvers (caching nameservers) to customers • Clients seem to have the ability to use a functional one • After an outage, we should expect a large number of queries during and just after the outage • A warning to those who have a security device in front of their nameservers • Clients sync up unintentionally • ‘Alarm clock’ or scheduled tasks • This particular case is not an issue at the moment, but it’s worth paying attention to these behaviors