Over the last few years I have built a DNS management system. It started as an Event Sourcing application built in Akka, but had to be re-architected multiple times to address unforeseen issues stemming from new requirements, operational problems, and developer pitfalls (mistakes). This talk introduces concepts in the DNS domain and several architecture styles, including Event Sourcing in Akka and stream processing in FS2. It describes the journey from inception through to the current system design, highlighting the key challenges encountered along the way and how the design evolved to address them. I plan on using real code to demonstrate each architecture along the journey.
2.
AGENDA
• 1997 – 2014 – The Beginning
• 2015 – The System
• 2016 – The First Rebuild
• 2017 – The Second Rebuild
• 2019 – Lessons Learned
3.
1997 – 2014
• I had spent 16 years building software
• Most systems were Java and C#
• MOOP
4.
Any system characterized by lots of objects, imperative programming, side effects, mock objects, and wrapped around a dependency injection framework.
• “What a pile of MOOP”
• “Did you write this code? A: MOOPS”
• “MOOP is why I left engineering for management.”
MALIGNED OBJECT ORIENTED PROGRAMMING – MOOP
5.
MOOP DEVELOPMENT
1. Reasoning about systems was hard
2. Systems became brittle and hard to extend over time
3. Tried management; turns out people don't compile
6.
MOOP – WE LIKE TO THINK WE HAVE GOOD STRUCTURE
[Diagram: a tidy layered architecture: CONTROLLER → RESOLVER → SERVICE → SERIALIZER, supported by ASSEMBLER, PARTBUILDER, FOOBUILDER, VIEW ASSEMBLER, WIDGET ASSEMBLER, HEADER ASSEMBLER, DETAIL ASSEMBLER, GRAPH ASSEMBLER, THING REPOSITORY, and WIDGET UTILS]
7.
MOOP – BUT IN REALITY…
[Diagram: the same components as a tangle: CONTROLLER, RESOLVER, SERVICE, SERIALIZER, ASSEMBLER, PARTBUILDER, FOOBUILDER, VIEW ASSEMBLER, WIDGET ASSEMBLER, HEADER ASSEMBLER, DETAIL ASSEMBLER, GRAPH ASSEMBLER, THING REPOSITORY, WIDGET UTILS, plus OTHER REPOSITORY, HANDLER, and PART CACHE, annotated with 🐞 💩 🔥 🤞 🤔 👺 🙈 🤣 🤬]
10.
MOOP – ALL THEM DEPENDENCIES NEED MOCKING…
11.
2015
• Comcast Summer of Code
• Chance to build something from scratch with no MOOP
• DNS-as-a-Service
12.
“The Domain Name System (DNS) is a hierarchical and decentralized naming system for computers, services, or other resources connected to the Internet or a private network.” https://en.wikipedia.org/wiki/Domain_Name_System
• The protocol that runs the internet
• A very large, loosely coherent, distributed KV Map
• Likely the cause for your latest production outage
WHAT IS DNS?
13.
DNS Zones are hierarchical; each node in the hierarchy is its own zone
Each Zone can live on a separate DNS Server
A Zone is a group of records
WHAT IS DNS?
[Diagram: zone hierarchy .NET → COMCAST.NET → SYS.COMCAST.NET / WEB.COMCAST.NET, with a record in one zone: hello. 7200 A 185.230.60.211]
14.
DNS Records are how we name things and answer questions
What is the IP address for scale.bythebay.io?
WWW101.WIXDNS.NET. 7200 A 185.230.60.211
WHAT IS DNS?
15.
DNS Records are how we name things and answer questions
WWW101.WIXDNS.NET. 7200 A 185.230.60.211
WHAT IS DNS?
FQDN – Fully Qualified Domain Name. Includes the record name and zone name…the “key” in our big map
16.
DNS Records are how we name things and answer questions
WWW101.WIXDNS.NET. 7200 A 185.230.60.211
WHAT IS DNS?
Record name – the first part of the FQDN
17.
DNS Records are how we name things and answer questions
WWW101.WIXDNS.NET. 7200 A 185.230.60.211
WHAT IS DNS?
Zone name – the DNS zone where the record lives
18.
DNS Records are how we name things and answer questions
WWW101.WIXDNS.NET. 7200 A 185.230.60.211
WHAT IS DNS?
TTL (in seconds) – how long this DNS record is cached for
19.
DNS Records are how we name things and answer questions
WWW101.WIXDNS.NET. 7200 A 185.230.60.211
WHAT IS DNS?
Record type – an indicator of what data this record holds…the “type” of value in our Map
20.
Record types dictate the record data, for example IPv6 (AAAA)
WWW101.WIXDNS.NET. 7200 AAAA GARRBHHLAHHHH!!!
WHAT IS DNS?
Record type – an indicator of what data this record holds…the “type” of value in our Map
21.
A DNS Zone is the primary entity in our system: a group of records
WHAT IS DNS?
A Zone has one-to-many (1 → *) Records:
Zone         Record   TTL    Type   Data
wixdns.net   www101   7200   A      185.230.60.211
wixdns.net   www102   7200   A      185.230.70.106
wixdns.net   www102   7200   AAAA   GARRBHHHLAHH
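To make the model concrete, here is a minimal sketch of that zone/record relationship in Scala. The type and field names are illustrative, not the talk's actual model.

```scala
// A zone is a named group of records (the 1 -> * relationship above).
final case class Record(name: String, ttl: Int, typ: String, data: String)
final case class Zone(name: String, records: List[Record])

val wixdns = Zone(
  "wixdns.net",
  List(
    Record("www101", 7200, "A", "185.230.60.211"),
    Record("www102", 7200, "A", "185.230.70.106"),
    Record("www102", 7200, "AAAA", "2001:db8::1") // readable stand-in for the slide's IPv6 gibberish
  )
)
```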
22.
VERSION 1 – REQUIREMENTS
• Summer of Code ~ 3 months to build it
• Users – internal engineering teams
• 1000s of DNS zones
• 100s of records per zone
• Records change infrequently
• Audit of changes was a must!
23.
1. I already had some experience with Scala + Akka
2. Team already familiar with Scala and Akka
3. Not Java
4. Should be able to build it MOOP free
VERSION 1 – APPROACH – SCALA + AKKA
24.
1. No state exists in the database, only events
2. When changes occur, events are saved in a journal
3. When an entity is loaded, events are replayed to generate state
VERSION 1 – DESIGN – EVENT SOURCING
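A minimal sketch of the idea in Scala, reusing the Record type from the sketch above; the event and state names are illustrative, not the system's actual code.

```scala
// Only events are stored; state is never written to the database.
sealed trait ZoneEvent
final case class RecordAdded(record: Record)   extends ZoneEvent
final case class RecordDeleted(record: Record) extends ZoneEvent

final case class ZoneState(records: List[Record]) {
  // Applying one event produces the next state.
  def applyEvent(event: ZoneEvent): ZoneState = event match {
    case RecordAdded(r)   => copy(records = r :: records)
    case RecordDeleted(r) => copy(records = records.filterNot(_ == r))
  }
}
object ZoneState { val empty: ZoneState = ZoneState(Nil) }

// Loading an entity = replaying its journal from the beginning.
def replay(journal: List[ZoneEvent]): ZoneState =
  journal.foldLeft(ZoneState.empty)(_.applyEvent(_))
```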
25.
VERSION 1 - EVENT SOURCING
[Diagram: Frontend 1. Add(record) → DNS Backend 2. Add(record) → Zone; 3. NotifyComplete(change) flows back; 4. Persist event to the Journal]
26.
VERSION 1 - EVENT SOURCING
[Same diagram: Frontend 1. Add(record) → DNS Backend 2. Add(record) → Zone; 3. NotifyComplete(change); 4. Persist event to the Journal]
* State is updated in memory
27.
VERSION 1 - EVENT SOURCING – COLD LOAD ENTITY
[Diagram: Frontend 1. Load → Zone; 2. Replay events 1…n from the Journal; 3. Generate state]
* State is generated from past events on load
28.
Akka has this out-of-the-box with “Akka Persistence”
1. Zone state (zone + records + history) in memory
2. Each DNS Zone is its own Actor
3. Persistence comes for free - DynamoDB
4. Akka Cluster Sharding handles failure and message delivery
VERSION 1 – AKKA + EVENT SOURCING
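A minimal sketch of what that looks like with the classic Akka Persistence PersistentActor API, reusing the event and state types from the sketch above; the command and reply types are illustrative.

```scala
import akka.persistence.PersistentActor

final case class AddRecord(record: Record)        // command
final case class ChangeComplete(event: ZoneEvent) // reply

// One persistent actor per DNS zone; the journal is keyed by persistenceId.
class ZoneActor(zoneName: String) extends PersistentActor {
  override def persistenceId: String = s"zone-$zoneName"

  private var state: ZoneState = ZoneState.empty

  // Commands: persist the event first, then update the in-memory state.
  override def receiveCommand: Receive = {
    case AddRecord(record) =>
      persist(RecordAdded(record)) { event =>
        state = state.applyEvent(event)
        sender() ! ChangeComplete(event)
      }
  }

  // The cold-load "magic": Akka Persistence replays journaled events here.
  override def receiveRecover: Receive = {
    case event: ZoneEvent => state = state.applyEvent(event)
  }
}
```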
29.
VERSION 1 - EVENT SOURCING – IN MEMORY STATE
EVENTS ARE SAVED IN MEMORY
APPLY AN EVENT TO GENERATE NEW STATE
32.
VERSION 1 - EVENT SOURCING – LOAD DNS ZONE
MAGIC
✨
33.
VERSION 1 – AKKA CLUSTER FOR HA USING GOSSIP
[Diagram: NODE-1, NODE-2, and NODE-3 in an Akka Cluster, gossiping: “did you hear about Flo?” “oooo, tell me more…” “no, what about her?”]
34.
VERSION 1 – AKKA CLUSTER SHARDING
[Diagram: an Akka Cluster of shards; the bar.com zone actor lives in one shard]
Each zone actor is present exactly once in the cluster
35.
VERSION 1 – CLUSTER SHARDING MESSAGE DELIVERY
[Diagram: an Add(foo.bar.com) message is FORWARDed through the cluster to the bar.com zone actor]
Cluster sharding guarantees message delivery to the same actor
36.
VERSION 1 – HIGH AVAILABILITY
[Diagram: an Akka Cluster of shards and zone actors]
When one node in the cluster fails…
37.
VERSION 1 – HIGH AVAILABILITY
[Diagram: the failed node's shards reappear on the surviving nodes]
Shards are automatically distributed to surviving nodes
39.
AKKA EVENT SOURCING RECAP
• 0 -> PROD in 3 months
• System was super simple (so little code)
• Akka community was fantastic
• Akka has a tool for everything
• No MOOP!
Vegas! Cha-Ching! CC BY 2.0 by specialbrew85
40.
2016
• DNS Engineering found out about our little project
• Wanted to use it for all of Comcast
• Comcast has one of the largest DNS footprints in the world
41.
1. 1000s of zones → millions of zones
2. 100s of records per zone → millions of records per zone (largest)
3. Records change X per year → X per week
4. New requirement: throttle updates to 5 updates per second per zone
VERSION 1 – NEW REQUIREMENTS
42.
1. Replaying millions of changes would take way too long
• 1MM changes / 5,000 items per second = 200 seconds
2. Holding all of that state would require tons of memory and machines
3. Snapshotting would not work
• Loading 5 million things still takes a long time
• Would force an alternative approach to accessing change history
VERSION 1 – EVENT SOURCING CHALLENGES
43.
1. Stay with Akka as it had been so good to us
2. Replace Event Sourcing and Akka Persistence with our own persistence layer built on DynamoDB
3. Keep Akka Clustering + Cluster Sharding
• Messages are routed to the same zone actor for throttling
• Maintain High Availability
VERSION 2 - APPROACH
44.
VERSION 2 – NEW DESIGN – AKKA + CQRS
[Diagram: WRITE SIDE: Service 1. Command → Change Processor 2. Dispatch → Zone 3. Apply → DNS; 4. Persist Event; 5. Update State in DynamoDB. READ SIDE: Queries → Read Model → DynamoDB]
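A minimal sketch of the write side in that diagram, reusing the command and event types from the earlier sketches; the interfaces are illustrative stand-ins for the hand-rolled DynamoDB persistence layer and the DNS backend.

```scala
// Illustrative ports for the boxes in the diagram.
trait EventJournal { def persist(zone: String, event: ZoneEvent): Unit } // 4. persist event
trait ReadModel    { def update(zone: String, state: ZoneState): Unit }  // 5. update state in DB
trait DnsBackend   { def apply(event: ZoneEvent): Unit }                 // 3. apply to DNS

class ChangeProcessor(journal: EventJournal, readModel: ReadModel, dns: DnsBackend) {
  private var state: ZoneState = ZoneState.empty

  // One command dispatched to the zone: apply, persist, then update the read side.
  def handle(zone: String, command: AddRecord): Unit = {
    val event = RecordAdded(command.record)
    dns.apply(event)              // 3. apply the change
    journal.persist(zone, event)  // 4. persist the event
    state = state.applyEvent(event)
    readModel.update(zone, state) // 5. update queryable state
  }
}
```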
45.
VERSION 1 – CLUSTER SHARDING MESSAGE DELIVERY
[Diagram, repeated from earlier: an Add(foo.bar.com) message is FORWARDed through the cluster to the bar.com zone actor]
Cluster sharding guarantees message delivery to the same actor
46.
VERSION 2 – THROTTLING
[Diagram: the same CQRS flow, with a THROTTLER placed in front of the Zone actor]
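One of the tools Akka ships for this is the Akka Streams throttle stage; here is a minimal sketch of the 5-updates-per-second-per-zone requirement. The talk doesn't show its actual throttler, so this is only one plausible way to do it.

```scala
import akka.actor.ActorSystem
import akka.stream.scaladsl.{Sink, Source}
import scala.concurrent.duration._

object ThrottleDemo extends App {
  implicit val system: ActorSystem = ActorSystem("throttle-demo")

  Source(1 to 100)                // stand-in for one zone's queued changes
    .throttle(5, per = 1.second)  // at most 5 changes per second
    .runWith(Sink.foreach(n => println(s"applying change $n")))
}
```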
48.
VERSION 2 – ACTOR EVERYTHING
[Diagram: the same CQRS flow, with the Service, Change Processor, Throttler, and Zone each implemented as an actor 🎭]
49.
1. HA + message delivery stayed largely the same – Akka Cluster
2. Throttling came out of the box – Akka has lots of tools
3. Supported our new scale
4. A lot more code – 2 months to rebuild
• Persistence layer for DynamoDB
• Command processing, more manual lifting
5. Wish we had known this up front!
VERSION 2 - RECAP
50.
2017
• In production for over a year
• Growing customer base
• Growing code base (features)
• Was great until it wasn't
51.
VERSION 2 – ISSUES 🙀
1. Networking Issues
2. Actor Issues
52.
VERSION 2 – NETWORKING ISSUES
1. Good: Akka Clustering is great at managing a cluster of machines
• Gossip protocol to maintain cluster membership
2. Bad: The Network
• The network doesn't like you
• The Achilles heel of every distributed system
53.
VERSION 2 – OUR CLUSTER…
[Diagram: a healthy Akka Cluster of shards and zone actors]
54.
VERSION 2 – SPLIT BRAIN
[Diagram: the one Akka Cluster splits into two separate clusters]
* During a network partition, multiple clusters can form
55.
VERSION 2 – ISSUES
[Diagram: after the split, each half believes the bar.com zone lives on it (“ZONE LIVES HERE” on both sides), and both halves talk to the DNS SERVER]
57.
VERSION 2 – SPLIT BRAIN
[Diagram: an Add foo.bar.com change arrives; both halves of the split cluster have a bar.com actor pushing changes to the DNS SERVER. Which one applies it? ???]
58.
VERSION 2 – ACTOR ISSUES
1. Good: Actors bring sanity to the world of concurrency
• Can view everything as one thread at a time
2. Bad:
• Actors are non-deterministic by their nature
• MOOP creeping its way in
59.
VERSION 2 – ACTORS
[Diagram: DNS invokes def apply(change) on Zone directly]
Typically, we think about invoking behavior by calling methods: immediate request/response
60.
VERSION 2 – ACTORS
[Diagram: DNS calls def tell(change); the message ✉ lands in the Zone actor's mailbox 📬 and is eventually handled by def receive(Any) 🎭]
But with actors, we are putting a message in a mailbox hoping the actor will process it
61.
VERSION 2 – ACTOR ISSUES
Actors (Akka) modes of communication (sketched below):
1. tell – ! – public void tell(Object any)
2. ask – ? – public Future<Object> ask(Object any)
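In Scala the two modes look like the sketch below. The key point for the next slide: ask returns a Future that fails with an AskTimeoutException if the actor never replies in time.

```scala
import akka.actor.ActorRef
import akka.pattern.ask
import akka.util.Timeout
import scala.concurrent.Future
import scala.concurrent.duration._

def demo(zoneActor: ActorRef, change: Any): Future[Any] = {
  zoneActor ! change // tell: fire-and-forget, no reply expected

  implicit val timeout: Timeout = Timeout(5.seconds)
  zoneActor ? change // ask: fails with AskTimeoutException if no reply within 5s
}
```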
68.
VERSION 2 – ACTOR ISSUES
AKKA TIMEOUT EXCEPTION!!!
69.
1. Akka Clustering + Cluster Sharding gave us HA and message delivery
• Split brain causes strange issues in the system
2. Actors bring sanity to the world of concurrency
• Our system was largely sequential!
• Lack of determinism made the system hard to understand and extend
3. Persistence layer was getting unwieldy with DynamoDB
VERSION 3 – STATE OF AFFAIRS
71.
1. In 2017, Scala Pet Store allowed me to prove that FP is the anti-venom to MOOP
2. Simplify everything! Bring back determinism! Bring back the joy!
VERSION 3 - REBUILD AGAIN
72.
1. Remove Akka Clustering (HA + Message Delivery)
2. Replace Actors with Functions
3. Change our database
VERSION 3 - REBUILD AGAIN
73.
VERSION 3 – USE A MESSAGE QUEUE
[Diagram: multiple Frontends 1. Send → SQS; multiple Backends 2. Receive ← SQS]
• SQS is almost free for our workloads
• At Least Once Message Delivery (see the sketch below)
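A minimal sketch of the receive side using the AWS SDK v2 from Scala; the queue URL is a placeholder. At-least-once delivery falls out of deleting a message only after it has been processed: if the consumer crashes first, SQS redelivers.

```scala
import software.amazon.awssdk.services.sqs.SqsClient
import software.amazon.awssdk.services.sqs.model.{DeleteMessageRequest, ReceiveMessageRequest}
import scala.jdk.CollectionConverters._

val sqs      = SqsClient.create()
val queueUrl = "https://sqs.us-east-1.amazonaws.com/123456789012/dns-changes" // placeholder

val messages = sqs.receiveMessage(
  ReceiveMessageRequest.builder()
    .queueUrl(queueUrl)
    .maxNumberOfMessages(10)
    .waitTimeSeconds(20) // long polling
    .build()
).messages().asScala

messages.foreach { message =>
  // ...process the change, then delete; a crash before the delete means redelivery
  sqs.deleteMessage(
    DeleteMessageRequest.builder()
      .queueUrl(queueUrl)
      .receiptHandle(message.receiptHandle())
      .build()
  )
}
```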
74.
VERSION 3 – USE FS2 FOR COMMAND PROCESSING
[Diagram: Frontend 1. Send → SQS 2. Receive → FS2 (* rate limiter + retry) 3. Apply Change → DNS SERVER; 4. Update State → DynamoDB]
Rate limiting (sketched below):
- 40 messages per second max per node
- x 6 production nodes
- 240 messages per second max globally
- Less than max is OK
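A minimal sketch of that loop in FS2: 40 messages per second is one message every 25ms, which maps directly onto metered. receiveMessages and applyChange are hypothetical stand-ins for the SQS receive and DNS apply steps, and retry is omitted.

```scala
import cats.effect.{IO, IOApp}
import fs2.Stream
import scala.concurrent.duration._

object ChangeProcessor extends IOApp.Simple {
  // Hypothetical stand-ins for the real SQS and DNS interactions.
  def receiveMessages: IO[List[String]] = IO.pure(Nil)
  def applyChange(msg: String): IO[Unit] = IO.unit

  val run: IO[Unit] =
    Stream
      .repeatEval(receiveMessages) // 2. receive batches from SQS
      .flatMap(Stream.emits)       // flatten batches into single messages
      .metered[IO](25.millis)      // rate limit: ~40 messages/second per node
      .evalMap(applyChange)        // 3. apply each change to the DNS server
      .compile
      .drain
}
```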
76.
VERSION 3 – FS2 DECLARATIVE SYNTAX
JUST A FUNCTION
JUST A FUNCTION
JUST A FUNCTION
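The point of the slide: a Pipe[F, A, B] in FS2 is literally a function Stream[F, A] => Stream[F, B], so a pipeline is ordinary function composition. A minimal sketch with made-up stage names:

```scala
import cats.effect.IO
import fs2.{Pipe, Stream}

// Each stage is just a function from one stream to another.
val parse: Pipe[IO, String, Int] = _.map(_.trim.toInt)
val validate: Pipe[IO, Int, Int] = _.filter(_ > 0)
val save: Pipe[IO, Int, Unit]    = _.evalMap(n => IO(println(s"saved $n")))

val pipeline: Stream[IO, Unit] =
  Stream("1", " 2", "-3").through(parse).through(validate).through(save)
```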
77.
VERSION 3 – REPLACE ACTORS WITH FUNCTIONS
BEFORE USING ACTORS…
78.
VERSION 3 – REPLACE ACTORS WITH FUNCTIONS
AFTER USING FUNCTIONS…
79.
VERSION 3 – REPLACE ACTORS WITH FUNCTIONS
AFTER USING FUNCTIONS…
80.
VERSION 3 – REPLACE ACTORS WITH FUNCTIONS
BEFORE USING ACTORS…
81.
VERSION 3 – REPLACE ACTORS WITH FUNCTIONS
AFTER USING FUNCTIONS…
82.
VERSION 3 – REPLACE ACTORS WITH FUNCTIONS
AFTER USING FUNCTIONS…
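The slides show the real before/after code as screenshots; as a stand-in, here is the shape of the contrast, reusing the Zone and Record types from the earlier sketch. The actor version hides behavior behind an untyped message and mutable state; the function version is pure, typed, and deterministic.

```scala
// Before (sketch): behavior behind an untyped actor message.
// class ZoneActor extends Actor {
//   var state: Zone = ...
//   def receive: Receive = { case AddRecord(r) => /* mutate state, reply */ }
// }

// After (sketch): the same behavior as a pure function with a typed result.
def addRecord(zone: Zone, record: Record): Either[String, Zone] =
  if (zone.records.exists(_.name == record.name))
    Left(s"record ${record.name} already exists in ${zone.name}")
  else
    Right(zone.copy(records = record :: zone.records))
```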
83.
VERSION 3 – CHANGE DATABASE TO MYSQL
[Diagram: Service 1. Send → SQS 2. Receive → Command Handler (* FS2) 3. Apply Change → DNS SERVER; 4. Update State and 5. Read State against MySQL, replacing DynamoDB]
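A minimal sketch of the read side against MySQL, using doobie purely for illustration (the talk doesn't name its SQL library); the table and column names are made up.

```scala
import cats.effect.IO
import doobie._
import doobie.implicits._

// Read the current state of a zone's records (hypothetical schema).
def recordsInZone(xa: Transactor[IO], zone: String): IO[List[(String, Int, String, String)]] =
  sql"SELECT name, ttl, record_type, data FROM record WHERE zone_name = $zone"
    .query[(String, Int, String, String)]
    .to[List]
    .transact(xa)
```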
86.
• Replaced Akka Clustering + Sharding with SQS + FS2
• Eliminated 1000s of lines of code
• Simplified code – FS2 is declarative
• Active-Active – better HA!
• Replaced Actors with Functions
• Brought back determinism
• New features much easier to implement
• No MOOP!
• Changed database
• SQL is much easier, much more universal
• Gives us power not possible with DynamoDB
VERSION 3 - RECAP
87.
LESSONS LEARNED
1. Sometimes premature optimization is OK
2. Scala communities are incredible!
3. Functional Programming puts the “Fun” back in programming
• It is the answer to most MOOP issues
4. Actors are great at concurrency, but not suitable for every application
5. Use the right tool for the job
• Message queues
• NoSQL is out, SQL is back in
✨
88.
“THE” LAST SLIDE
VinylDNS
https://github.com/vinyldns/vinyldns
Scala Pet Store
https://github.com/pauljamescleary/scala-pet-store
Network Functions in Rust
https://github.com/williamofockham/nb2
Twitter - @pauljamescleary