Adventures in Building a Database
or
Abstraction is All We Have
Alex Scotti
Bloomberg LP
• Comdb2 is
• A Highly Available Clustered Relational
Database System
• Developed at Bloomberg
• Uses much open source, portions of
BerkeleyDB, SQLite, others
• Much custom code
• Stores 95% of all the data in Bloomberg
• Looking back on what made Comdb2 a success at
Bloomberg, I saw 4 big abstractions that we got
right
• Interestingly enough, I see that only 1 of these
abstractions is commonplace in systems
today
• We started building a system to meet the goals I’ll
be referring to today as abstractions
• With the larger goal of letting application code be
simpler, faster to write, and more reliable
Abstraction is key
• Raise the level of abstraction as much as possible
(before performance becomes absolutely
unacceptable)
• There's always a ‘different way’ to solve any problem
• Try to be in a spot where 95% of applications don't
need that different way
• Inflection point - a system can never be fast enough for
everything. Chasing the last percent involves usability
sacrifices that make things worse for the other 99!
Abstraction 1: Relational
Model
• We started off in 2004 building what would now
be recognized as a NoSQL system
• It had no schemas
• It had no data types
• It had almost nothing, aside from High Availability
• Clients could store anything, very “flexible”
• “Flexible” quickly becomes a euphemism for
“fragile” or “dangerous” or “a huge mess”
• Programs are storing binary data in the database?
• Do they all agree on how it’s encoded?
• Did the app writers understand endian issues?
• What about alignment issues?
• How do they change things when the apps are all
so tightly coupled now?
• How do you query your data?
• Write programs to navigate through the data row
by row using explicit indexes
• That’s hard
• Also fragile. What if we change the index
structure?
• That’s also slow. Lots of round trips to a remote
server
• We quickly realized the abstraction provided by
SQL was not to be ignored! (To be fair, all the
NoSQL systems are now realizing this also)
• Applications could be written with 1 line of code
expressing the same logic as what took hundreds
before
• Without errors!
• Without fragile coupling to database layout!
• With improved performance!
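As a rough illustration of the "1 line of code" point, here is a minimal Python sketch. It uses the standard sqlite3 module purely as a stand-in (it is not Comdb2's API), and the trades table and both helper functions are hypothetical names invented for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trades (id INTEGER PRIMARY KEY, desk TEXT, qty INTEGER)")
conn.executemany(
    "INSERT INTO trades (desk, qty) VALUES (?, ?)",
    [("rates", 10), ("fx", 5), ("rates", 7), ("fx", 1)],
)


def total_by_navigation(conn, desk):
    """Row-by-row style: the application walks every row and filters itself."""
    total = 0
    for row_desk, qty in conn.execute("SELECT desk, qty FROM trades ORDER BY id"):
        if row_desk == desk:
            total += qty
    return total


def total_by_sql(conn, desk):
    """Declarative style: one statement; the database picks the access path."""
    (total,) = conn.execute(
        "SELECT COALESCE(SUM(qty), 0) FROM trades WHERE desk = ?", (desk,)
    ).fetchone()
    return total


print(total_by_navigation(conn, "rates"), total_by_sql(conn, "rates"))
```

Both functions return the same total, but only the first is tied to how the rows are stored and walked. Adding an index on desk would change neither the text nor the result of the second one.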
• “Rigid” schemas are a GOOD THING
• Some will have you believe “rigid” means “inflexible”
but it’s actually the opposite!
• Having strong typing and well defined schemas
allows for safe, flexible changes to running systems
without breaking things or downtime
• Pervasive type conversion whenever possible is key
• View “data types” as just another form of “constraint”
- The type has nothing to do with the data
• Abstract away the representation of data as
much as possible (Codd rules 8,9)
• Abstract away the physical locations of data as
much as possible (Codd rule 11)
• Support online schema changes for ALL
POSSIBLE CHANGES
• Don’t ever make a table unavailable, or read
only
Abstraction 2: Perfect
Availability
• What does it mean to be highly available?
• Let’s start by understanding what it means to not
be highly available
• What things do we take for granted working with
“non distributed systems?”
• Let’s imagine the world’s worst programming
language
INTEGER I
I=5
PRINT I
Output: 0
INTEGER I
ASSIGN: I=5
RC = GETASSIGNMENTRC()
IF (RC = 0)
    PRINT I
ELSE
    GOTO ASSIGN
Output: 5
INTEGER I, J
ASSIGN: I = 4
RC = GETASSIGNMENTRC()
IF (RC = 0)
    J = I * 3
ELSE
    GOTO ASSIGN
PRINT J
Output: 8
INTEGER I, J
ASSIGN: I = 4
RC = GETASSIGNMENTRC()
IF (RC = 0)
    MULT: J = I * 3
    RC2 = GETMULTRC()
    IF RC2 != 0
        GOTO MULT
ELSE
    GOTO ASSIGN
PRINT J
Now we loop forever… Looks like multiplication is down again
INTEGER I, J
ASSIGN: I = 4
RC = GETASSIGNMENTRC()
IF (RC = 0)
    MULT: J = I * 3
    RC2 = GETMULTRC()
    IF RC2 != 0
        RETRY++
        IF (RETRY < 10)
            GOTO MULT // RETRY IT
        ELSE
            // IT’S DOWN LET’S DO BY HAND
            X = 0
            WHILE (X < 3)
                J = J + I
                X++
            ENDWHILE
ELSE
    GOTO ASSIGN
PRINT J
Oh jeez.. I hope loops are working today. I forgot to check
• This is silly
• Nobody would use such a terrible language!
• Yet this is exactly what programs that use
unreliable services start to look like
• Not even exaggerating
• Error handling and “fall back strategies”
dominate even the simplest applications if they
need to be “robust.”
• How to simplify?
• Make systems more available
• Let’s make a guarantee that the system will come back if you retry
enough
• Eliminate need for “alternate strategies” in applications
• And that’s the current state of affairs of HA databases
• Good, but not great. Lots of error handling and retrying in every
app
• Great thing about code that doesn't run except rarely?
• It never works
• HA Database contract
• When you get a good return code from commit,
your data is stored ‘durably.’ (hopefully in more
than one place)
• If the server you are talking to goes down,
another one should be available (soon or now)
for you to connect to and hopefully you should
still find your data there
• Let’s simplify it further
• What if we transparently reconnect to other
servers when one fails and guarantee that the
data will be there?
• We just simplified the apps even more.
• Now they get occasional bad return codes and
call some database ‘retry’
• Still ugly code, but harder to screw up
• Can we do better?
• The “perfect availability” abstraction!
• Delete all the error handling and retrying from
your applications, let’s assume the DB is as
reliable as multiplication
• DB won’t give you an unexpected error from
server failure anymore (even when the server
fails)
• HA SQL
• Client transparently negotiates point in time when it
connects
• Client transparently reconnects to other node on
failure, using point in time to get back to exactly
where it was
• Client transparently requests in flight SQL (SELECT)
to be resumed after the EXACT ROW last delivered
• Client transparently re-issues writes (INSERT, etc)
• No bad return code back to the application from
any possible state of a transaction
• If in the middle of running a query
• If uncommitted writes
• If packet lost on COMMIT request/response
• No HA SQL database currently able to do this,
aside from Comdb2
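The reconnect-and-resume behavior for an in-flight SELECT can be sketched in a few lines of Python. This is a toy model under stated assumptions, not the Comdb2 client: FakeNode, NodeDown, ha_select and the integer snapshot token are all hypothetical, and every node is assumed to hold identical rows for the negotiated point in time.

```python
class NodeDown(Exception):
    """Raised by a simulated node when it fails in the middle of a query."""


class FakeNode:
    """Hypothetical node that serves one snapshot and may die after N rows."""

    def __init__(self, rows, fail_after=None):
        self.rows = rows
        self.fail_after = fail_after

    def select(self, snapshot, after_row):
        # A real node would read as of `snapshot`; in this toy every node
        # already holds the same immutable rows, so the token is unused.
        served = 0
        for i in range(after_row + 1, len(self.rows)):
            if self.fail_after is not None and served >= self.fail_after:
                raise NodeDown()
            served += 1
            yield i, self.rows[i]


def ha_select(nodes, snapshot):
    """Yield every row exactly once, switching nodes when one fails."""
    last = -1  # position of the last row handed to the application
    for node in nodes:
        try:
            for i, row in node.select(snapshot, last):
                last = i
                yield row
            return
        except NodeDown:
            continue  # reconnect elsewhere and resume after `last`
    raise RuntimeError("no node available")


rows = ["a", "b", "c", "d"]
cluster = [FakeNode(rows, fail_after=2), FakeNode(rows)]
print(list(ha_select(cluster, snapshot=1)))
```

The application loop sees one uninterrupted result set even though the first node failed after two rows. The final RuntimeError only exists because the toy cluster is finite.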
• ACID
• What does the D mean?
• After you COMMIT you “can't lose the data.”
• Are you “durable” if you need to wait for the system to
come back after a crash?
• What if you need to “swing” to a backup server?
• What if the data center exploded? Did you lose the
data?
• HA really is just another way of looking at D.
Abstraction 3: No
Concurrency
• Concurrency causes many problems for
applications
• Very hard to reason about, very easy to make
errors
• The ideal system to program to (not the fastest!)
is one with no concurrency at all
• Serializability Theory
• We want to have concurrency in our database for
performance reasons
• We don’t want to have concurrency problems
• Systems that don't have concurrency by definition
have no concurrency problems
• If we can show a system that has concurrency to be
somehow “equivalent” to the one with no concurrency
then it too has no concurrency problems!
• Equivalent Histories
• Consider the “output” of the database a “history”
• A system with no concurrency can only produce a limited set of
possible histories from a given set of input
• Why more than one?
• Concurrent requests to the system may execute in a
nondeterministic order, even though only one runs at a time
• If the system WITH concurrency produces one of those histories, it
runs that workload with no concurrency problems
• If the system with concurrency produces a history that could have
come from the non concurrent system for ALL POSSIBLE INPUTS
then the concurrent system has no concurrency problems!
• A system like that is called “serializable”
• The system is concurrent but to the application
(user) acts like a system that has no
concurrency
• The abstraction is the system has no
concurrency
• Serializable systems are simple to reason about
and easy to write applications for
• Test / “prove” that your application works
correctly with one user - then it works fine as it
scales up
• at least it doesn’t have concurrency problems!
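Two transactions run concurrently:
UPDATE dots SET color = 'black' WHERE color = 'white'
UPDATE dots SET color = 'white' WHERE color = 'black'
Starting rows:
black white black white black white
Serializable outcomes (one UPDATE runs entirely before the other):
black black black black black black
OR
white white white white white white
VS
Non-serializable outcome (each UPDATE sees only the starting rows, so the colors swap):
white black white black white black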
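The dots example can be replayed in Python to see which results count as serializable. This is a minimal sketch, assuming a toy model in which "concurrent" means both transactions read the same starting rows. The function names are hypothetical and no real database is involved.

```python
from itertools import permutations


def to_black(read, write):
    """UPDATE dots SET color = 'black' WHERE color = 'white'"""
    for i, color in enumerate(read):
        if color == "white":
            write[i] = "black"


def to_white(read, write):
    """UPDATE dots SET color = 'white' WHERE color = 'black'"""
    for i, color in enumerate(read):
        if color == "black":
            write[i] = "white"


def run_serial(start, txns):
    """No concurrency: each transaction sees the previous one's writes."""
    dots = list(start)
    for txn in txns:
        txn(dots, dots)
    return dots


def run_on_shared_snapshot(start, txns):
    """Toy concurrency: every transaction reads only the starting rows."""
    snapshot = tuple(start)
    dots = list(start)
    for txn in txns:
        txn(snapshot, dots)
    return dots


start = ["black", "white"] * 3
txns = [to_black, to_white]
serial_outcomes = [run_serial(start, order) for order in permutations(txns)]
swapped = run_on_shared_snapshot(start, txns)
print(serial_outcomes)
print(swapped, swapped in serial_outcomes)
```

The two serial orders give all white and all black. The shared-snapshot run gives the swapped pattern, which matches neither serial order, so that history is not serializable.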
• Comdb2 not the only serializable SQL DB
• However…
• In a cluster of PostgreSQL, not serializable on
any nodes other than “master.”
• In a cluster of Percona MySQL, not serializable
on any node
• Full read/write serializability for any app
connected to any node in a Comdb2 cluster
Abstraction 4: One big single
computer
• Single System Image (SSI)
• Distributed systems often “reveal” their nature to
application programmers
• Not desirable. Programmers don’t want to
think about the fact that software is running on
multiple machines
• What goes wrong?
• X=3
• Write X=4
Read X
• Do you get 4?
• What if your read happened on a different
machine?
• What if you told someone else to do that read?
• What if they did that read on a different machine?
• Not going to torture you again with the WORLD'S
WORST PROGRAMMING LANGUAGE
• but you can guess where this would go
• descends into insanity quickly
• Most clustered database systems fail the simple
“read follows writes” test in one way or another
• PostgreSQL with full sync replication fails when
read occurs on other machine
• Percona MySQL Cluster fails when other
program told to perform read
• Comdb2 passes
• Clusterwide coherency model ensures that
after commit, data will always be visible to any
process on any computer
• How?
• On commit, wait for LSN to be acked by all nodes in
cluster
• Only ack an LSN after
• You have received the data
• You have processed lock list in the commit record
• Obtained all the locks needed to ensure that no
external observer can prove that this transaction
was NOT COMMITTED
• Racing reads will block on locks
• When a node acks, on that node:
• The data was NOT committed
• The btrees in memory containing the data were
NOT updated
• The log records containing the descriptions of the
changes to these btrees were NOT even
processed
• We ack back immediately after grabbing the
locks which are listed in the commit record
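The "ack after locking, apply later" idea can be sketched with Python threads. This is a toy model under loud assumptions: the Replica class, its method names and the per-key threading.Lock are all hypothetical, and a real lock manager, log and btree are far more involved.

```python
import threading


class Replica:
    """Hypothetical replica: one lock per key, data applied after the ack."""

    def __init__(self, data):
        self.data = dict(data)
        self.locks = {key: threading.Lock() for key in data}

    def receive_commit(self, lsn, writes):
        """Take every lock named in the commit record, then ack at once."""
        for key in sorted(writes):  # fixed order keeps this toy deadlock-free
            self.locks[key].acquire()
        return lsn  # the ack: nothing has been applied yet

    def apply(self, writes):
        """Apply the logged changes, then release the locks taken at ack time."""
        self.data.update(writes)
        for key in writes:
            self.locks[key].release()

    def read(self, key):
        with self.locks[key]:  # a racing read blocks here until apply()
            return self.data[key]


replica = Replica({"x": 3})
ack = replica.receive_commit(lsn=42, writes={"x": 4})
result = []
reader = threading.Thread(target=lambda: result.append(replica.read("x")))
reader.start()  # blocks: the lock on "x" is already held
replica.apply({"x": 4})
reader.join()
print(ack, result)
```

The reader can never observe the old value 3 after the ack, even though the replica had not applied the write when it acked. That is the sense in which no external observer can prove the transaction was not committed.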
• “On commit, wait for LSN to be acked by all nodes
in cluster”
• But…
• We claimed High (perfect even!) Availability as a
design goal
• This design causes endless blocking when
nodes are down
• And total performance to be equivalent to the
worst performing machine!
• Coherency Algorithm:
• Wait for Commit LSN to be acked by all nodes
• But don’t wait forever
• Just wait longer than heartbeat time
• Get back first ack.. get back more acks..
• Use the time each node has taken so far in a heuristic to estimate how
long the whole commit should take
• Cull the outliers, don’t wait for them
• Drop connection to them!
• Mark them ‘incoherent’ and don’t wait for them when in this state
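One way to picture the ack-waiting heuristic is the Python sketch below. It is an illustration only: the slides do not give the actual formula, so the "slack times the mean of the acks seen so far" rule, the min_wait floor (assumed to be set above the heartbeat interval) and the function name are all assumptions.

```python
def wait_for_acks(ack_times, min_wait, slack=3.0):
    """Split nodes into (coherent, incoherent) for a single commit.

    ack_times maps node name -> seconds until its ack arrives, or None if
    the ack never arrives. min_wait is the floor on how long we wait.
    """
    seen = []
    coherent = set()
    deadline = min_wait
    arrivals = sorted((t, n) for n, t in ack_times.items() if t is not None)
    for t, node in arrivals:
        if t > deadline:
            break  # this node and every slower one is an outlier
        coherent.add(node)
        seen.append(t)
        # Re-estimate the total wait from the acks received so far.
        deadline = max(min_wait, slack * sum(seen) / len(seen))
    incoherent = set(ack_times) - coherent  # dropped and not waited for
    return coherent, incoherent


coherent, incoherent = wait_for_acks(
    {"a": 0.01, "b": 0.02, "c": 5.0, "d": None}, min_wait=0.5
)
print(sorted(coherent), sorted(incoherent))
```

Here nodes a and b ack quickly and stay coherent, while the slow node c and the silent node d are culled, so the commit returns without waiting on the worst machine.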
• On node that lost connection:
• Mark yourself as ‘incoherent’
• Refuse to serve ANY requests
• not read only - no service
• “read only” tempting but wrong
• write followed by read gets stale data when the read ends up
on node that just got marked incoherent!
• If you’re crashed - well you are crashed, you don't do anything
• If you're alive, get back into the cluster. Follow protocol to
become coherent and serve requests again
• Becoming coherent
• Same protocol used when a node starts cold or when it
recovers from being marked incoherent (transient
timeout)
• Watch the LSNs the cluster is up to on the nodes
• When your LSN is getting ‘close’, request other node to
‘loop back’ an internal transaction, waiting for the LSN to
be acked
• If that succeeds (without timeout!) then you can be
marked coherent (you are processing live data) and you
now can service requests
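The incoherent node's behavior, including the rejoin steps, can be summarized as a small state machine. This Python sketch is hypothetical throughout: the Node class, the close_enough threshold and the loop_back callable (standing in for "ask a peer to commit an internal transaction and report whether this node acked it in time") are assumptions made for illustration.

```python
class Node:
    """Hypothetical node that refuses all service while incoherent."""

    def __init__(self):
        self.coherent = False
        self.lsn = 0

    def serve(self, request):
        if not self.coherent:
            # Not read only: no service at all, so stale reads are impossible.
            raise RuntimeError("incoherent: refusing all requests")
        return f"ok: {request}"

    def try_become_coherent(self, cluster_lsn, loop_back, close_enough=10):
        """Attempt to rejoin; loop_back() returns True if the ack was on time."""
        if cluster_lsn - self.lsn > close_enough:
            return False  # still catching up on the log
        if loop_back():
            self.coherent = True  # processing live data again
        return self.coherent


node = Node()
node.lsn = 95
far_behind = node.try_become_coherent(200, loop_back=lambda: True)
node.lsn = 195
timed_out = node.try_become_coherent(200, loop_back=lambda: False)
rejoined = node.try_become_coherent(200, loop_back=lambda: True)
print(far_behind, timed_out, rejoined, node.serve("read x"))
```

The same method covers a cold start and recovery from a transient timeout, since both begin in the incoherent state.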
4 Great Abstractions
• Physical / Logical Independence (Relational
Model)
• Perfect Availability (HA SQL)
• No Concurrency (serializable)
• One single computer (SSI)
• Allows for simplified application logic, more
reliable applications, faster deployment
