This is the Expert Q&A from 2600hz and Cloudant on Database in Telecom. If you are a service provider, MSP or anyone running a VoIP switch, you should definitely check this out.
6. What is Database?
• A Record of things Remembered or Forgotten
• Used to be Unbelievably hard, now it’s just hard sometimes
• Modern Databases are amazingly resilient
• Failure Mode still requires lots of attention
• In Distributed Environments…
• Database is inextricably linked to the network
• A public network is always unreliable
7. Masters and Slaves
• Databases have to Replicate
• Most Databases use a form of Master-Slave Relationship to manage replication and dedupe
• Masters are where new data is entered
• Then it’s mirrored out to the Slaves for storage
• If you lose access to the original Master, you can
convert a Slave into a Master and restore
operation
Durability
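The master-slave flow above can be sketched in a few lines of Python. This is a toy illustration, not how any particular database implements replication: the `Node` and `Cluster` classes, the node names and the sample keys are all made up for the example, and mirroring is shown as an immediate in-memory copy.

```python
class Node:
    """A toy database node holding a key-value store (hypothetical)."""
    def __init__(self, name):
        self.name = name
        self.data = {}


class Cluster:
    """Hypothetical master-slave cluster: writes enter at the master, then are mirrored."""
    def __init__(self, master, slaves):
        self.master = master
        self.slaves = list(slaves)

    def write(self, key, value):
        # New data is entered on the master only...
        self.master.data[key] = value
        # ...then mirrored out to every slave for storage.
        for slave in self.slaves:
            slave.data[key] = value

    def promote(self, slave):
        # The original master is unreachable: convert a slave into the new master.
        self.slaves.remove(slave)
        self.master = slave


cluster = Cluster(Node("db1"), [Node("db2"), Node("db3")])
cluster.write("ext-1001", "forward to cell")
cluster.promote(cluster.slaves[0])      # db1 is lost, db2 takes over
cluster.write("ext-1002", "voicemail")  # operation resumes on the new master
```

Because db2 already held a mirror of every write, nothing is lost when it is promoted. Real systems also have to deal with writes that had not yet been mirrored when the master disappeared, which this sketch ignores.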
8. Other Replication Strategies
• Other strategies exist, such as…
• Master-Master (What 2600hz Uses)
• Tokenized Exchange
• Time-delimited
• The most popular methods tend to be Master-Slave or Master-Master
Each Database has its advantages and tradeoffs. Once
again, there is no Magic Bullet.
9. Failure and Quorum
• When A Database needs to elect a new master…
• There are many different strategies
• Most involve the concept of quorum (figuring
out where the greatest number of copies
reside)
• Once Quorum is established, a new master is
elected and (hopefully) operation can resume
• Quorum is different in Master-Master (Explain)
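As a rough sketch of the quorum idea for the master-slave case, the Python below lets a group of nodes elect a new master only when it can see a strict majority of the cluster. The function names, the node names and the "lowest name wins" tie-break are assumptions made for this example; real databases use many different election strategies.

```python
def has_quorum(reachable, cluster_size):
    """True when this group of nodes sees a strict majority of the whole cluster."""
    return len(reachable) > cluster_size // 2


def elect_master(reachable, cluster_size):
    """Pick a new master only if quorum holds; otherwise refuse and return None.

    The tie-break (lowest name wins) is an arbitrary choice for this sketch.
    """
    if not has_quorum(reachable, cluster_size):
        return None
    return min(reachable)


# Five-node cluster; the old master db1 is gone and the rest have split 3 / 1.
majority_side = ["db2", "db3", "db4"]
minority_side = ["db5"]

new_master = elect_master(majority_side, cluster_size=5)
no_master = elect_master(minority_side, cluster_size=5)
```

Here `new_master` is `"db2"` and `no_master` is `None`: the side without quorum stays leaderless instead of electing a competing master.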
10. CAP Theorem
Databases can have (at most) 2 out of 3 of the following:
•Consistency
•Availability
•Partition Tolerance
Modern Database Management is a balancing act between
Consistency and Availability: all modern networks are
unreliable, so Partition Tolerance is not optional
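To make the Consistency versus Availability choice concrete, here is a toy Python replica that behaves differently once it is cut off from its peers. The `Replica` class and its `mode` flag are invented for this sketch and do not correspond to any real database setting.

```python
class Replica:
    """Toy replica that is either consistency-first ("CP") or availability-first ("AP")."""
    def __init__(self, mode):
        self.mode = mode          # "CP" or "AP" (hypothetical flag)
        self.partitioned = False  # True when cut off from its peers
        self.data = {}

    def write(self, key, value):
        if self.partitioned and self.mode == "CP":
            # Consistency first: refuse the write rather than diverge from peers.
            return False
        # Availability first (or no partition): accept now, reconcile later.
        self.data[key] = value
        return True


cp, ap = Replica("CP"), Replica("AP")
cp.partitioned = ap.partitioned = True

cp_ok = cp.write("ext-1001", "cell")  # rejected: stays consistent, not available
ap_ok = ap.write("ext-1001", "cell")  # accepted: stays available, may conflict later
```

Neither answer is wrong. Which one you want depends on the job the database is doing.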
12. What is Important in a Database?
• Reliable Storage of Data?
• Fast Retrieval of Data?
• Fast Saving of Data?
• Resilience during failures?
• <other>
13. Examples
• Buying tickets from Ticketmaster
• What’s important and why?
• Withdrawing money from a bank?
• Storing Call Forwarding Settings?
• Storing a List of Favorite Stocks?
Each Scenario has a different set of requirements and
constraints. There is no silver bullet; if you could
write one database for all these scenarios, you’d
be rich.
14. Which Database is Better?
• STUPID QUESTION
• But I thought there were no stupid questions?
• This is the only stupid question.
• The fight of which database is better is almost
always silly
• Databases are a tool, to get a job done
• Like the previous examples, each job is different
• Each database stresses different pros/cons
16. Trouble With Databases
• HUGE TOPIC (We’re only going to cover a little)
• Network Partitions
• Layer 1 disasters
• Flapping Internet (Special Class of Network
Partitions)
17. Network Partitions
• Common in Distributed Databases
• When Databases lose contact with each other they can
partition
• Caused by unreliable or faulty network connections
• Databases can behave very weirdly when in partitions
Arguably, most of what a database admin does is prepare for
network partitions and how to resolve them.
20. Split-Brain
• During a partition, some databases will elect N masters, one
for each partition in the network.
• When the partition is fixed, unless there is a pre-defined
restoral procedure, there will be conflicts
• Databases have all kinds of strategies for handling WAN Split-brain
failure, but you should understand them
Key Takeaway: No Database is perfect. Understand the
automation but also understand the manual intervention
procedure.
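One way to picture the conflicts left behind by split-brain is to compare what each side wrote against the state before the split. The Python below is a minimal sketch under that assumption; the function names and sample data are hypothetical, deletions are not handled, and real databases track revisions rather than whole snapshots.

```python
def find_conflicts(base, side_a, side_b):
    """Return keys that both sides changed, to different values, since the split."""
    conflicts = []
    for key in sorted(set(side_a) | set(side_b)):
        a, b, original = side_a.get(key), side_b.get(key), base.get(key)
        if a != original and b != original and a != b:
            conflicts.append(key)
    return conflicts


def merge(base, side_a, side_b):
    """Merge non-conflicting changes; leave conflicts at their pre-split value.

    Conflicting keys are returned separately so a person (or a pre-defined
    restoral procedure) can decide. Deletions are ignored in this sketch.
    """
    conflicts = find_conflicts(base, side_a, side_b)
    merged = dict(base)
    for key in set(side_a) | set(side_b):
        if key in conflicts:
            continue
        a, b = side_a.get(key), side_b.get(key)
        merged[key] = a if a != base.get(key) else b
    return merged, conflicts


base = {"1001": "cell", "1002": "desk"}
side_a = {"1001": "voicemail", "1002": "desk"}            # master elected in partition A
side_b = {"1001": "home", "1002": "desk", "1003": "cell"}  # master elected in partition B

merged, conflicts = merge(base, side_a, side_b)
```

The new key `1003` merges cleanly, while `1001` was changed on both sides and ends up in `conflicts`. That second list is the part that needs a manual intervention procedure.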
22. Layer 1 Failures
• Rut Roh
• Actual Physical Disaster
• No easy way out except…
• Don’t be in a Datacenter that’s hit by a disaster
OR
• Be Nimble enough to Evade Disaster
23. Evading Disaster
• We’re not Magicians, we can’t simply predict disasters
• The next best thing is being able to move and move fast
• Kazoo requires one line of code to move
• Kazoo moves fast
• Moving the Database fast is awesome (Thanks BigCouch!)
During Hurricane Sandy, we cut our Datacenters away from
Downtown New York to a Datacenter above the 100 year
flood plain on the East Coast. Result: No Downtime.
24. No Silver Bullets
• Layer 1 disasters are a humbling experience
• Don’t rely on DataCenters in the Path of a Storm
• Flooding will brick datacenters that have generators below
ground
• To avoid being powerless in a disaster…
• Plan, Test, Analyze, Repeat
• Check out Netflix Simian Army for examples of tests
25. Flapping
• Is it up? Is it Down? Around and Around it Goes, where it
stops nobody knows…
• Flapping Internet is a special case of network partition or lost
connectivity
• Flapping connections lose contact with other servers and then
appear to come back online before going off again
Why is this bad?
26. Fixing Flapping
• I’m trying to fix a partition
• The Network keeps going up and down
• As I repair my cluster, it keeps starting to repair and failing (by
attempting to reintegrate the unreliable nodes)
Flapping nodes make everything awful
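One common mitigation, sketched here as an assumption rather than anything the slides prescribe, is to hold a node out of the cluster until it has passed several health checks in a row. The `FlapGuard` class and the threshold of three checks are invented for this example.

```python
class FlapGuard:
    """Readmit a node only after `required` consecutive healthy checks (hypothetical)."""
    def __init__(self, required=3):
        self.required = required
        self.streak = 0
        self.in_cluster = False

    def observe(self, healthy):
        if healthy:
            self.streak += 1
            if self.streak >= self.required:
                self.in_cluster = True
        else:
            # Any failed check resets the streak and keeps the node out.
            self.streak = 0
            self.in_cluster = False
        return self.in_cluster


guard = FlapGuard(required=3)
flapping = [True, False, True, True, False, True, True, True]
history = [guard.observe(check) for check in flapping]
```

The node is only readmitted on the last check, after three healthy results in a row. Without the guard, the repair would have started on the first `True` and failed on the next `False`.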
27. Why is the Network Difficult?
“Detecting network failures is hard. Since our only knowledge of
the other nodes passes through the network, delays are
indistinguishable from failure. This is the fundamental problem of
the network partition: latency high enough to be considered a
failure. When partitions arise, we have no way to
determine what happened on the other nodes: are they alive?
Dead? Did they receive our message? Did they try to respond?
Literally no one knows. When the network finally heals, we'll
have to re-establish the connection and try to work out what
happened–perhaps recovering from an inconsistent state.”
-Kyle Kingsbury, Aphyr.com
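The point of the quote can be shown with a minimal timeout-based failure detector in Python. Everything here is illustrative: the `classify` function, the heartbeat timestamps and the 5-second timeout are assumptions, and the `truth` field exists only to make the point, since a real detector never gets to see it.

```python
def classify(last_heartbeat, now, timeout):
    """Label a peer using only what we can observe from our side of the network."""
    return "suspected-failed" if now - last_heartbeat > timeout else "alive"


# "truth" is what actually happened to each peer; the detector cannot see it.
peers = {
    "db2": {"last_heartbeat": 99.0, "truth": "healthy"},
    "db3": {"last_heartbeat": 90.0, "truth": "healthy, but on a slow link"},
    "db4": {"last_heartbeat": 90.0, "truth": "crashed"},
}

now, timeout = 100.0, 5.0
verdicts = {
    name: classify(peer["last_heartbeat"], now, timeout)
    for name, peer in peers.items()
}
```

db3 and db4 get the same verdict even though one is alive and one is dead. From the outside, a long enough delay and a failure are the same thing.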
30. What does 2600hz use?
• Cloudant BigCouch
• NoSQL Database
• Master-Master
• Very sensibly designed for our use case
31. Why BigCouch?
DEMANDS
1. On the Fly Schema Changes
2. Scale in a distributed fashion
3. Configuration changes will happen as we grow
4. Has to be equipment agnostic
5. Accessible Raw Data View
6. Simple to Install and Keep up
7. It can’t fail, ergo Fault-Tolerance
8. Multi-Master writes
9. Simple (to cluster, to
TRADEOFFS
1. Eventual Consistency is OK
2. Nodes going offline randomly
3. Multi-server only
Why are we ok with these tradeoffs? They suit our use case.
32. Let’s take some time to pontificate about
Database at scale…
What are the first things you think of when
you get errors reported from the Database?
What’s your Thought Process?
33. Recap
• Database is where you put stuff
• You want your Database not to die
• 2600hz uses BigCouch because it’s really awesome technology
• Great for our Use Case
• Easy to Administrate
• Resilient and quick-to-restore
When do we come in and provide the support? Possible examples?
Sponsored features? ...they have access to current and future features for free.
Yealink stuff: make sure you send the right firmware and then the right config file. If you send the wrong config file, or send the file too early, you can brick the phone. 50 handsets is the threshold for DHCP66