Everything you always wanted
to know about:
Highly available
distributed
databases
Javier Ramírez
@supercoco9
https://teowaki.com/services
MADRID · NOV 27-28 · 2015
IBM Data Center
in Japan during
and after
an earthquake
A squirrel did take out half of our
Santa Clara data centre two years back
Mike Christian, Yahoo Director of Engineering
Hayastan
Shakarian
a.k.a.
The Spade
Hacker
Cut off
Armenia
from
the Internet
for almost
one day*
* By accident, while scavenging copper
I have no idea what
the internet is
Some data center outages reported in 2015:
* Amazon Web Services
* Apple iCloud
* Microsoft Azure
* IBM Softlayer
* Google Cloud Platform
* And of course every hosting provider with scheduled
maintenance operations (Rackspace, DigitalOcean, OVH...)
Complex systems can and will fail
You better distribute your data, or else...
Also, distributed databases can perform
better and run on cheaper hardware than
centralised ones
Most basic level:
Backup
And keep the copy
on a separate data centre*
* Vodafone once lost one year
of data in a fire because of this
Next
Level:
Replicas
(master-slave)
A main server sends a binary
log of changes to one or more
replicas
* Also known as a Write-Ahead Log (WAL)
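A minimal sketch of the idea, not any particular database's protocol: the master appends every change to a log and replicas apply new entries in order (all names here are illustrative).

```python
# Toy master-slave replication via a shipped change log (illustrative only).
class Master:
    def __init__(self):
        self.data = {}
        self.log = []          # ordered list of (key, value) changes

    def write(self, key, value):
        self.data[key] = value
        self.log.append((key, value))   # every write goes to the log

class Replica:
    def __init__(self):
        self.data = {}
        self.applied = 0       # position in the master's log we have applied

    def catch_up(self, master):
        # Apply only the log entries we have not seen yet, in order.
        for key, value in master.log[self.applied:]:
            self.data[key] = value
        self.applied = len(master.log)

master, replica = Master(), Replica()
master.write("user:1", "javier")
replica.catch_up(master)        # replica now serves reads for user:1
```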
Master-slave is good, but:
* All the operations are replicated on all
slaves
* Good scalability on reads, but not on writes
* Cannot function during a network partition
* Single point of failure (SPOF)
Next Level:
Multi-Master Cluster
(master-master)
Every server can accept reads
or writes, and send its binary
log to all the other servers
* Also referred to as update-anywhere
Multi-master is great, but:
* All the operations are replicated on all masters.
* When synchronous, high latency (Consistency
achieved via locks, coordination and serializable
transactions)
* When asynchronous, typically poor conflict
resolution
* Hard to scale up or down automatically
What I want:
* A system that can always work, even with
network partitions
* That scales out both reads and writes
* On cheap commodity diverse hardware
* Running locally to your users (low latency)
* Can grow/shrink elastically and survive
server failures
Then you need to let go of
many convenient things you
take for granted in databases
CAP Theorem
Consistency, Availability, Partition Tolerance
(pairwise combinations: CA, AP, CP)
Everything is a trade-off
Next Level:
Distributed Data
stores
Distributed DB design decisions
* data (keys) distribution
* data replication/durability
* conflict resolution
* membership
* status of the other peers
* operation under partitions and
during unavailability of peers
* incremental scalability
Data distribution
Consistent hashing based on the key
Usually implies operations work on single keys. Some
solutions, like Redis, allow clients to group related
keys consistently. Some solutions, like BigTable, allow
collocating data by group or family.
Queries are frequently limited to lookups by key or by
secondary index (say goodbye to the power of SQL)
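A minimal consistent-hashing sketch (MD5 as the hash, no virtual nodes, hypothetical node names):

```python
import bisect
import hashlib

# Toy consistent-hash ring: each node owns the arc up to its position.
class Ring:
    def __init__(self, nodes):
        # Place every node at a point on the ring derived from its name.
        self.points = sorted((self._hash(n), n) for n in nodes)

    @staticmethod
    def _hash(value):
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    def node_for(self, key):
        # Walk clockwise from the key's position to the next node.
        h = self._hash(key)
        idx = bisect.bisect(self.points, (h, "")) % len(self.points)
        return self.points[idx][1]

ring = Ring(["node-a", "node-b", "node-c"])
print(ring.node_for("user:42"))   # always maps to the same node
```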
Data distribution. The Ring
Data Replication
How many replicas of each? Typically at least 3, so in case of
conflicts there can be a quorum
Often, the distribution of keys is done taking into account the
physical location of nodes, so replicas live in different racks or
different datacentres
Replication: durability
If we want a durable system, we need to make sure the
data is replicated to at least 2 nodes before confirming
the transaction to the client.
This is called the write quorum, and in many cases it can be
configured individually.
Not all data are equally important, and not all systems have the
same R/W ratio.
Systems can be configured to be “always writable” or
“always readable”.
Conflict resolution
Can be done at Write time or at Read
time.
As long as R + W > N it's possible to
reach a quorum
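A quick worked example of the quorum rule with typical Dynamo-style numbers (the values are illustrative):

```python
# Quorum rule: with N replicas, reading from R nodes and writing to W nodes
# guarantees an overlap on at least one up-to-date replica when R + W > N.
def overlapping(n, r, w):
    return r + w > n

print(overlapping(3, 2, 2))  # True: classic N=3, R=2, W=2 setup
print(overlapping(3, 1, 1))  # False: fast, but a read may miss the last write
```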
Conflicts
I see a record that I thought was
deleted
I created a record but cannot see it
I have different values in two nodes
Something should be unique, but it's not
Conflict resolution strategies
Quorum-based consensus systems: Paxos,
Raft. Require coordination
of processes, with leader elections
and agreement on every update.
Worse latency
Last Write Wins (LWW): Doesn't
require coordination. Good latency
But, what does “Last” mean?
* Google Spanner uses atomic clocks
and servers with GPS clocks to
synchronize time
* Cassandra tries to sync clocks and
divides updates into small parts to
minimize conflicts
* Dynamo-like systems use vector clocks
Vector clocks
* No need to sync time
* There can be several
versions of the same item
* Need consolidation
to prune their size
* Usually the client needs to
resolve the conflict and write back
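A minimal vector clock sketch, following the standard textbook rules rather than any particular product:

```python
# Toy vector clock: one counter per node that has updated the item.
def increment(clock, node):
    clock = dict(clock)
    clock[node] = clock.get(node, 0) + 1
    return clock

def descends(a, b):
    # True if clock a has seen everything clock b has seen.
    return all(a.get(node, 0) >= count for node, count in b.items())

def concurrent(a, b):
    # Neither clock descends from the other: a real conflict.
    return not descends(a, b) and not descends(b, a)

v1 = increment({}, "node-a")          # {'node-a': 1}
v2 = increment(v1, "node-b")          # {'node-a': 1, 'node-b': 1}
v3 = increment(v1, "node-c")          # {'node-a': 1, 'node-c': 1}
print(descends(v2, v1))               # True: v2 is a later version
print(concurrent(v2, v3))             # True: siblings the client must reconcile
```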
Alternatives to conflict resolution
* Conflict-free Replicated Data Types (CRDTs):
counters, hashes, maps (see the sketch after this list)
* Allowing for strong consistency only on keys from the same
family
* The Uber solution with serialized tokens
* Some solutions implement immutability,
so there are no conflicts
* Peter Bailis' paper on Coordination Avoidance using
Read Atomic Multi-Partition (RAMP) transactions (Nov/15)
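As an example of the CRDT idea above, a grow-only counter (G-Counter) sketch with illustrative node names:

```python
# G-Counter CRDT: each node only increments its own slot, and merging
# two replicas is a per-slot maximum, so merge order never matters.
def increment(counter, node, amount=1):
    counter = dict(counter)
    counter[node] = counter.get(node, 0) + amount
    return counter

def merge(a, b):
    return {node: max(a.get(node, 0), b.get(node, 0))
            for node in a.keys() | b.keys()}

def value(counter):
    return sum(counter.values())

a = increment({}, "node-a", 3)     # updates applied on node-a
b = increment({}, "node-b", 2)     # concurrent updates on node-b
print(value(merge(a, b)))          # 5, regardless of merge order
```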
membership
gossip
infection-like
protocols
Gossip
A centralised server is a SPOF
Broadcasting state to every node is very time consuming
and doesn't tolerate partitions
In gossip protocols, random pairs of nodes exchange
information at regular, frequent intervals.
Based on that exchange, a new cluster status is agreed
Gossip example
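A rough simulation of a gossip round: each node bumps its own heartbeat, picks a random peer and they merge their views. Not a real protocol, just the idea:

```python
import random

# Toy gossip round: views maps node -> its view of everyone's heartbeat counter.
def gossip_round(views):
    nodes = list(views)
    for node in nodes:
        views[node][node] += 1                      # bump own heartbeat
        peer = random.choice([n for n in nodes if n != node])
        merged = {k: max(views[node].get(k, 0), views[peer].get(k, 0))
                  for k in views[node].keys() | views[peer].keys()}
        views[node] = dict(merged)                  # both sides learn the
        views[peer] = dict(merged)                  # freshest values seen

views = {n: {m: 0 for m in ("a", "b", "c", "d")} for n in ("a", "b", "c", "d")}
for _ in range(5):
    gossip_round(views)
# After a few rounds every node has a recent heartbeat for every other node.
print(views["a"])
```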
Incremental scalability
When a new node enters the system, the rest of the nodes
notice it via gossip.
The new node claims a partition of the ring and asks
the replicas of that partition to send it their data.
When the rest of the nodes decide (after gossiping) that a node
has left the system and it's not a temporary failure, the data
assigned to that node's partitions is copied to more
replicas to get back to N copies.
The whole process is automatic and transparent.
Operation under partition:
Hinted Handoff
During a network partition, we may have fewer than
W reachable nodes for a given segment on our side.
In this case, the data is still replicated to W nodes, even if a
node wasn't responsible for the segment. That data is kept
with a “hint” and stored in a special area.
Periodically, the server will try to contact the original
destination and will “hand off” the data to it.
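A minimal sketch of a hinted write, assuming hypothetical store/is_up helpers rather than any specific database's API:

```python
# Toy hinted handoff: if the preferred replica is down, write to a stand-in
# node and remember who the data really belongs to.
hints = []   # list of (intended_node, key, value) kept in a special area

def store(node, key, value):
    print(f"storing {key}={value} on {node}")

def write(key, value, preferred_node, fallback_node, is_up):
    if is_up(preferred_node):
        store(preferred_node, key, value)
    else:
        store(fallback_node, key, value)            # keep the write durable
        hints.append((preferred_node, key, value))  # remember the real owner

def replay_hints(is_up):
    # Periodically hand hinted data off to its original destination.
    for intended, key, value in list(hints):
        if is_up(intended):
            store(intended, key, value)
            hints.remove((intended, key, value))

write("user:1", "javier", "node-a", "node-b", is_up=lambda n: n != "node-a")
replay_hints(is_up=lambda n: True)   # node-a is back, hand the data off
```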
Operation under partition:
Hinted Handoff
Anti Entropy
A system with handoffs can be chaotic and not very
effective
Anti Entropy is implemented to make sure hints are
handed off or synchronized to other nodes
Anti-entropy is usually implemented with Merkle trees, a
hash-of-hashes structure that makes it very efficient to
compare differences between nodes
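A toy Merkle-style comparison: a flat hash-of-hashes over fixed key buckets instead of a full tree, enough to show why only differing ranges need syncing:

```python
import hashlib

# Hash each bucket of keys, then hash the bucket hashes into a root.
# If two roots match the replicas agree; otherwise only the differing
# buckets need to be compared key by key.
BUCKETS = 4

def bucket_of(key):
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % BUCKETS

def bucket_hashes(data):
    digests = [hashlib.sha256() for _ in range(BUCKETS)]
    for key in sorted(data):
        digests[bucket_of(key)].update(f"{key}={data[key]}".encode())
    return [d.hexdigest() for d in digests]

def root_hash(hashes):
    return hashlib.sha256("".join(hashes).encode()).hexdigest()

a = {"user:1": "javier", "user:2": "ana"}
b = {"user:1": "javier", "user:2": "ana", "user:3": "bea"}
ha, hb = bucket_hashes(a), bucket_hashes(b)
if root_hash(ha) != root_hash(hb):
    diff = [i for i, (x, y) in enumerate(zip(ha, hb)) if x != y]
    print("buckets to sync:", diff)    # only these ranges are exchanged
```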
All these features mean your clients need to
be aware of some internals of the system
Clients must
* Know which close nodes are responsible for each
segment of the ring, and hash locally**
* Be aware of when nodes become available or
unavailable**
* Decide on durability
* Handle conflict resolution, unless under LWW
** Some solutions offer a load-balancer proxy to hide that
complexity from the client, at the cost of extra latency
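A sketch of what such a “smart client” ends up doing; the hashing, availability tracking and conflict callback are all illustrative:

```python
import hashlib

# Toy "smart client": hashes keys locally, tracks node availability,
# and asks the application to resolve conflicting versions (no LWW).
class SmartClient:
    def __init__(self, nodes, resolve):
        self.nodes = sorted(nodes)
        self.alive = set(nodes)     # updated as the client learns of failures
        self.resolve = resolve      # app-provided conflict resolution callback

    def coordinator(self, key):
        # Hash locally to pick a responsible, currently reachable node.
        h = int(hashlib.md5(key.encode()).hexdigest(), 16)
        candidates = [n for n in self.nodes if n in self.alive]
        return candidates[h % len(candidates)]

    def read(self, key, versions):
        # versions: the (possibly conflicting) values returned by replicas.
        return versions[0] if len(versions) == 1 else self.resolve(versions)

client = SmartClient(["node-a", "node-b", "node-c"], resolve=max)
print(client.coordinator("user:42"))
print(client.read("user:42", ["v1", "v2"]))   # the app picks the winner
```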
Now you know how it works:
* A system that can always work, even with
network partitions
* That scales out both reads and writes
* On cheap commodity diverse hardware
* Running locally to your users (low latency)
* Can grow/shrink elastically and survive
server failures
Extra level: Build your
own distributed database
Netflix dynomite, built in Java
Uber ringpop, built in JavaScript
Not
Scared
Of You
Anymore
aprendoaprogramar.com
… and if you have school-age children
Find related links at
https://teowaki.com/teams/javier-community/link-categories/distributed-systems
Thanks!
Javier Ramírez
@supercoco9
Need help with distributed systems or big data?
https://teowaki.com/services
