SCALING & HIGH AVAILABILITY
OF THE PLATFORM
Max & Vitaly
CAP theorem
 Presented as a conjecture at PODC 2000 (Brewer's conjecture)
 Formalized and proved in 2002 by Nancy Lynch and Seth Gilbert (MIT)
 Consistency, Availability and Partition Tolerance cannot all be achieved at the same time in a
distributed system
 There is a tradeoff between these 3 properties
1. Consistency (all nodes see the same data at the same time)
2. Availability (every request receives a response about
whether it succeeded or failed)
3. Partition tolerance (the system continues to operate despite
arbitrary partitioning due to network failures)
Definition, in simple terms:
in an asynchronous network where messages may be lost
(partition tolerance), it is impossible to implement a service that
provides consistent data (consistency) and eventually responds
to every request (availability) under every pattern of
message loss
Consistency:
• Data is consistent and the same for all nodes.
• All the nodes in the system see the same view of the data.
Availability:
• Every request to a non-failing node is processed and
receives a response, whether it failed or succeeded.
Partition tolerance:
• If some nodes crash or communication fails, the service still
performs as expected.
In simple words:
● Consistency & Availability = some guarantees against data loss
● Consistency & Partitioning = scaling
Why do we need to care about this?
Stop theory! Real examples
• RDBMS (mysql, postgres)
• NoSQL (redis)
• RabbitMQ
• Eureka
• Black-box systems testing. Bugs reproduced in Jepsen are
observable in production, not theoretical. But tests are
nondeterministic, and they cannot prove correctness, only find
errors.
• Testing under distributed-systems failure modes: faulty
networks, unsynchronized clocks, and partial failure. Most test
suites only evaluate the behavior of healthy clusters.
• Generative testing: the test system constructs random operations,
applies them to the system under test, and records a concurrent history
of their results. That history is checked against a model to
establish its correctness. Generative (or property-based) tests
often reveal edge cases with subtle combinations of inputs.
Jepsen (http://jepsen.io/)
Jepsen, just add brackets...
RDBMS (again theory)
• Standardized with SQL
• Ubiquitous – widely used and understood
• Supports transactions
• High availability is achieved via Replication
• Master – Master
• Master – Slave
• Synchronous/Asynchronous
Why RDBMS is AC: ACID
Atomicity of an operation (transaction)
• "All or nothing" – if part fails, the entire transaction fails.
Consistency
• The database will remain in a valid state after the transaction.
• Means adhering to the database rules (keys, uniqueness,
etc.)
Isolation
• Two simultaneous transactions cannot interfere with each
other (executed as if they ran sequentially).
Durability
• Once a transaction is committed, it remains so indefinitely,
even after power loss or crash (no caching).
ACID in Dist. Systems
• Proved problematic in big distributed systems
• How to guarantee ACID properties?
• Atomicity requires more thought – e.g. two-phase
commit (and 3-phase commit, Paxos…)
• Isolation requires holding all locks for the entire
transaction duration – high lock contention!
• Complex
• Prone to failure – the algorithm should handle it
• Failure = outage during write.
• Commits come with high overhead.
Reminder: speak about atomicity/locks
in Java: withdraw example
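A minimal sketch of the withdraw example the slide refers to (class and method names are hypothetical): without a lock, two threads can both pass the balance check and overdraw the account; holding one lock across the check and the debit restores atomicity.

```java
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical bank account: withdraw() must be atomic, otherwise two
// concurrent withdrawals can both pass the balance check and overdraw.
class Account {
    private final ReentrantLock lock = new ReentrantLock();
    private long balance;

    Account(long initial) { this.balance = initial; }

    // Atomic withdraw: the check and the debit happen under one lock,
    // so no other thread can interleave between them.
    boolean withdraw(long amount) {
        lock.lock();
        try {
            if (balance < amount) return false; // insufficient funds
            balance -= amount;
            return true;
        } finally {
            lock.unlock();
        }
    }

    long balance() {
        lock.lock();
        try { return balance; } finally { lock.unlock(); }
    }
}
```

This is exactly the guarantee that ACID isolation gives inside one database, and exactly what becomes expensive once the "account" is spread across nodes: the lock has to be held for the whole distributed transaction.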
Does it mean that we can't scale
RDBMS out of the box?
But we have PG cluster!
But in a PG cluster only one node can write.
According to Amazon research, it brings ~5% overhead for the
master node, plus network delay and replica delay for the 2PC
commit. So it can balance only reads, via pgpool.
A PG cluster is not about balancing load (at least not writes).
Okay, at least we have ACID
Right?
Well… almost. Even though the Postgres server is always
consistent, the distributed system composed of the server
and client together may not be consistent. It’s possible for
the client and server to disagree about whether or not a
transaction took place.
PG cluster
Postgres' commit protocol, like most relational databases, is a special case of
two-phase commit, or 2PC. In the first phase, the client votes to commit (or
abort) the current transaction, and sends that message to the server. The server
checks to see whether its consistency constraints allow the transaction to
proceed, and if so, it votes to commit. It writes the transaction to storage and
informs the client that the commit has taken place (or failed, as the case may
be.) Now both the client and server agree on the outcome of the transaction.
What happens if the message acknowledging the commit is dropped before the
client receives it? Then the client doesn’t know whether the commit succeeded
or not! The 2PC protocol says that we must wait for the acknowledgement
message to arrive in order to decide the outcome. Waiting forever isn’t realistic
for real systems, so at some point the client will time out and declare an error
occurred. The commit protocol is now in an indeterminate state.
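The indeterminate state described above can be sketched from the client's side (names are hypothetical; `Future` stands in for the network): when the acknowledgement is lost and the client times out, the only honest answer is "unknown", because the server may well have committed.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Sketch of the client side of a commit: if the acknowledgement does not
// arrive in time, the outcome is UNKNOWN -- the server may have committed.
class CommitClient {
    enum Outcome { COMMITTED, ABORTED, UNKNOWN }

    // 'ack' stands in for the network message: it may never complete.
    static Outcome awaitCommit(Future<Boolean> ack, long timeoutMs) {
        try {
            return ack.get(timeoutMs, TimeUnit.MILLISECONDS)
                    ? Outcome.COMMITTED : Outcome.ABORTED;
        } catch (TimeoutException e) {
            return Outcome.UNKNOWN; // ack lost or delayed: must NOT assume failure
        } catch (InterruptedException | ExecutionException e) {
            return Outcome.UNKNOWN;
        }
    }
}
```

A client that maps `UNKNOWN` to "failed" and retries is exactly how the Jepsen Postgres test produced duplicate and lost writes at the client/server boundary.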
PG cluster + Jepsen + Withdraw
example
https://aphyr.com/posts/282-jepsen-postgres
But we have pg_shard for scaling load
https://www.citusdata.com/citus-products/pg-shard/pg-
shard-quick-start-guide
Yes but Postgres with pg_shard is not ACID!
Limitations:
• Transactional semantics for queries that span
multiple shards - For example, you're a financial institution
and you sharded your data based on customer_id. You'd
now like to withdraw money from one customer's account
and deposit it into another one's account, in a single transaction
block.
• Unique constraints on columns other than the partition
key, or foreign key constraints.
• Distributed JOINs also aren't supported in pg_shard
pg_shard
Frequently Asked Questions
How does pg_shard handle INSERT/UPDATE/DELETE commands?
pg_shard requires that any modifications (INSERTs, UPDATEs, or DELETEs) involve exactly one shard.
In the UPDATE and DELETE case, this means commands must include a WHERE qualification on the partition column that restricts
the query to a single shard. Such qualifications usually take the form of an equality clause on the table's partition column.
As for INSERT commands, the partition column of the row being inserted must be specified using an expression
that can be reduced to a constant. For instance, a value such as 3, or even char_length('bob') would be suitable,
though rand() would not. In addition, INSERT commands must specify exactly one row to be inserted.
Note that the above restriction implies that commands similar to "INSERT INTO table SELECT col_one, col_two
FROM other_table" are not currently supported.
From an implementation standpoint, pg_shard determines the shard involved in a given INSERT, UPDATE,
or DELETE command and then rewrites the SQL of that command to reference the shard table.
The rewritten SQL is then sent to the placements for that shard to complete processing of the command.
How exactly does pg_shard distribute my data?
Rather than using hosts as the unit of distribution, pg_shard creates many small shards and places them across many hosts in a round-robin fashion.
For example, a user might have eight hosts in their cluster but 256 shards with a replication factor of two. Shard one would be created on hosts A and B, shard two on B and
C, and so forth.
The advantage of this approach is that the additional load incurred after a host failure is spread among many other hosts instead of falling entirely on a single replica.
But Mysql Galera has master-master
cluster approach!
Multi-master replication means that applications update the
same tables on different masters, and the changes replicate
automatically between those masters.
Row-Based Replication to Avoid Data Drift
Replication depends on deterministic updates--a transaction that changes 10 rows on the original master
should change exactly the same rows when it executes against a replica. Unfortunately many SQL
statements that are deterministic in master/slave replication are non-deterministic in multi-master
topologies. Consider the following example, which gives a 10% raise to employees in department #35.
UPDATE emp SET salary = salary * 1.1 WHERE dep_id = 35;
If all masters add employees, then the number of employees who actually get the raise will vary depending on
whether such additions have replicated to all masters. Your servers will very likely become inconsistent with
statement replication. The fix is to enable row-based replication using binlog-format=row in my.cnf. Row
replication transfers the exact row updates from each master to the others and eliminates ambiguity.
But this reduces performance dramatically.
Mysql Galera
Prevent Key Collisions on INSERTs
For applications that use auto-increment keys, MySQL offers a useful trick to ensure that such keys do not
collide between masters using the auto-increment-increment and auto-increment-offset parameters in
my.cnf. The following example ensures that auto-increment keys start at 1 and increment by 4 to give values
like 1, 5, 9, etc. on this server.
server-id=1
auto-increment-offset = 1
auto-increment-increment = 4
This works so long as your applications use auto-increment keys faithfully. However, any table that either does
not have a primary key or where the key is not an auto-increment field is suspect. You need to hunt them
down and ensure the application generates a proper key that does not collide across masters, for example
using UUIDs or by putting the server ID into the key. Here is a query on the MySQL information schema to
help locate tables that do not have an auto-increment primary key.
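The arithmetic behind those my.cnf settings is simple enough to sketch (class name hypothetical): the k-th key a server generates is `offset + k * increment`, so servers with distinct offsets and a shared increment can never produce the same key.

```java
// Sketch: the k-th auto-increment key generated by a server configured
// with auto-increment-offset=O and auto-increment-increment=N is O + k*N.
// Distinct offsets with a shared increment can never collide.
class AutoIncrement {
    static long key(long offset, long increment, long k) {
        return offset + k * increment;
    }
}
```

With the configuration shown above, server 1 (offset 1, increment 4) generates 1, 5, 9, … while a server with offset 2 generates 2, 6, 10, … and the sequences never intersect.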
Mysql Galera
Semantic Conflicts in Applications
MySQL replication can resolve conflicts. You need to avoid them in your applications. Here are a few tips as
you go about this.
First, avoid obvious conflicts. These include inserting data with the same keys on different masters (described
above), updating rows in two places at once, or deleting rows that are updated elsewhere. Any of these can
cause errors that will break replication or cause your masters to become out of sync. The good news is that
many of these problems are not hard to detect and eliminate using properly formatted transactions. The
bad news is that these are the easy conflicts. There are others that are much harder to address.
For example, accounting systems need to generate unbroken sequences of numbers for invoices. A common
approach is to use a table that holds the next invoice number and increment it in the same transaction that
creates a new invoice. Another accounting example is reports that need to read the value of accounts
consistently, for example at monthly close. Neither example works off-the-shelf in a multi-master system
with asynchronous replication, as they both require some form of synchronization to ensure global
consistency across masters. Or salary and balance task. These and other such cases may force substantial
application changes. Some applications simply do not work with multi-master topologies for this reason.
Mysql Galera
Have a Plan for Sorting Out Mixed Up Data
Master/slave replication has its discontents, but at least sorting out messed up replicas is simple: re-provision from another slave
or the master. Not so with multi-master topologies--you can easily get into a situation where all masters have transactions you
need to preserve and the only way to sort things out is to track down differences and update masters directly. Here are some
thoughts on how to do this.
1. Ensure you have tools to detect inconsistencies. Tungsten has built-in consistency checking with the 'trepctl check'
command. You can also use the Percona Toolkit pt-table-checksum to find differences. Be forewarned that neither of
these works especially well on large tables and may give false results if more than one master is active when you run them.
2. Consider relaxing foreign key constraints. I love foreign keys because they keep data in sync. However, they can also
create problems for fixing messed up data, because the constraints may break replication or make it difficult to go table-
by-table when synchronizing across masters. There is an argument for being a little more relaxed in multi-master settings.
3. Switch masters off if possible. Fixing problems is a lot easier if you can quiesce applications on all but one master.
4. Know how to fix data. Being handy with SQL is very helpful for fixing up problems. I find SELECT INTO OUTFILE and LOAD
DATA INFILE quite handy for moving changes between masters. Don't forget SET SESSION LOG_FILE_BIN=0 to keep
changes from being logged and breaking replication elsewhere. There are also various synchronization tools like pt-table-
sync, but I do not know enough about them to make recommendations.
5. At this point it's probably worth mentioning commercial support. Unless you are a replication guru, it is very comforting to
have somebody to call when you are dealing with messed up masters. Even better, expert advice early on can help you
avoid problems in the first place.
Mysql Galera + Jepsen + Withdraw
https://aphyr.com/posts/327-jepsen-mariadb-galera-cluster
Imagine a system of two bank accounts, each with a balance of
$10.
SET SESSION TRANSACTION ISOLATION LEVEL SERIALIZABLE
set autocommit=0
select * from accounts where id = 0
select * from accounts where id = 1
UPDATE accounts SET balance = 8 WHERE id = 0
UPDATE accounts SET balance = 12 WHERE id = 1
COMMIT
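Why the totals can drift can be sketched without a database (a hypothetical simulation, not what Galera literally executes): both transfers read the same snapshot [10, 10] and write absolute balances; without first-committer-wins, per-row last-writer-wins can interleave the two write sets, and money appears or vanishes.

```java
// Sketch: two transfer transactions run against the same snapshot.
// Each computes absolute new balances from its snapshot; without
// first-committer-wins, row-by-row last-writer-wins can mix their writes.
class SnapshotAnomaly {
    static long[] transfer(long[] snapshot, int from, int to, long amt) {
        long[] w = snapshot.clone();
        w[from] -= amt;
        w[to] += amt;
        return w;
    }

    // Interleaved commit: t2's write wins on row 0, t1's write wins on row 1.
    static long[] interleavedCommit(long[] t1, long[] t2) {
        return new long[] { t2[0], t1[1] };
    }
}
```

With T1 transferring $2 from account 0 to 1 (writes [8, 12]) and T2 transferring $3 from account 1 to 0 (writes [13, 7]), the interleaved result is [13, 12]: the total is $25 instead of $20. First-committer-wins exists precisely to force one of the two to abort.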
Mysql Galera + Jepsen + Withdraw
Case 1: T1 commits before T2’s start time. Operations from T1 and T2 cannot
interleave, by Lemma 1, because their intervals do not overlap.
Case 2: T1 and T2 operate on disjoint sets of accounts. They serialize trivially.
Case 3: T1 and T2 operate on intersecting sets of accounts, and T1 commits before T2
commits. Then T1 wrote data that T2 also wrote, and committed in T2’s interval,
which violates First-committer-wins. T2 must abort.
Case 4: T1 and T2 operate on intersecting sets of accounts, and T1 commits after T2
commits. Then T2 wrote data that T1 also wrote, and committed in T1’s interval,
which violates First-committer-wins. T1 must abort.
Mysql Galera + Jepsen + Withdraw
Read-only transactions trivially serialize with one another. Do they serialize with
respect to transfer transactions? The answer is yes: since every read-only transaction
sees only committed data in a Snapshot Isolation system, and commits no data itself,
it must appear to take place atomically at some time between other transactions.
SET SESSION TRANSACTION ISOLATION LEVEL SERIALIZABLE
set autocommit=0
select * from accounts
COMMIT
Mysql Galera + Jepsen + Withdraw
Mysql Galera conclusion
The transfer transactions should have kept the total amount of
money at $20, but by the end of the test the totals all sum to
$22. And in this run, 25% of the funds in the system
mysteriously vanish. These results remain stable after all other
transactions have ended–they are not a concurrency anomaly.
Dirty reads!
No first-committer-wins, no snapshot isolation. No snapshot
isolation, well… I’m not sure exactly what Galera does
guarantee.
Master-Master works for append only DB
http://scale-out-blog.blogspot.com/2012/04/if-you-must-
deploy-multi-master.html
http://www.onlamp.com/2016/04/20/advanced-mysql-
We know that
Instagram uses Postgres,
Pinterest uses MySQL!
True!
https://engineering.pinterest.com/blog/sharding-pinterest-
how-we-scaled-our-mysql-fleet
>>In 2011, we hit traction. By some estimates, we were growing
faster than any other previous startup. Around September
2011, every piece of our infrastructure was over capacity. We
had several NoSQL technologies, all of which eventually broke
catastrophically. We also had a boatload of MySQL slaves we
were using for reads, which makes for lots of irritating bugs,
especially with caching.
Pinterest
How we sharded
Whatever we were going to build needed to meet our needs and be stable, performant and repairable. In other
words, it needed to not suck, and so we chose a mature technology as our base to build on, MySQL. We
intentionally ran away from auto-scaling newer technology like MongoDB, Cassandra and Membase, because
their maturity was simply not far enough along (and they were crashing in spectacular ways on us!).
Aside: I still recommend startups avoid the fancy new stuff — try really hard to just use MySQL. Trust me. I
have the scars to prove it.
MySQL is mature, stable and it just works. Not only do we use it, but it’s also used by plenty of other
companies pushing even bigger scale. MySQL supports our need for ordering data requests, selecting certain
ranges of data and row-level transactions. It has a hell of a lot more features, but we don’t need or use them.
But, MySQL is a single box solution, hence the need to shard our data. Here’s our solution:
We started with eight EC2 servers running one MySQL instance each:
Pinterest
How we sharded
So how do we distribute our data to these shards?
We created a 64 bit ID that contains the shard ID, the type of the containing data, and where this data is in the
table (local ID). The shard ID is 16 bits, type ID is 10 bits and local ID is 36 bits. The savvy additionology
experts out there will notice that only adds to 62 bits. My past in compiler and chip design has taught me
that reserve bits are worth their weight in gold. So we have two (set to zero).
ID = (shard ID << 46) | (type ID << 36) | (local ID<<0)
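The layout described above packs into one 64-bit value: 2 reserved bits, then 16 bits of shard ID, 10 bits of type ID, and 36 bits of local ID. A minimal pack/unpack sketch (class and method names hypothetical):

```java
// Sketch of the 64-bit ID layout from the Pinterest post:
// [2 reserved bits | 16-bit shard ID | 10-bit type ID | 36-bit local ID]
class PinterestId {
    static long pack(long shardId, long typeId, long localId) {
        return (shardId << 46) | (typeId << 36) | localId;
    }
    static long shardId(long id) { return (id >> 46) & 0xFFFFL; }
    static long typeId(long id)  { return (id >> 36) & 0x3FFL; }
    static long localId(long id) { return id & 0xFFFFFFFFFL; } // low 36 bits
}
```

The point of the scheme is that any object's ID tells you, with three bit operations and no lookup table, which shard holds it and what kind of row it is.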
RabbitMQ
RabbitMQ is a distributed message queue,
and is probably the most popular open-source implementation
of the AMQP messaging protocol. It supports a wealth of
durability, routing, and fanout strategies, and combines excellent
documentation with well-designed protocol extensions.
RabbitMQ + CAP
RabbitMQ cluster + CAP
According to the table, there is a choice between CP and CA, but in real
life CP means data loss
from http://www.rabbitmq.com/partitions.html
RabbitMQ clusters do not tolerate network partitions well. If you
are thinking of clustering across a WAN, don't. You should use
federation or the shovel instead.
However, sometimes accidents happen.
RabbitMQ stores information about queues, exchanges, bindings
etc in Erlang's distributed database, Mnesia.
RabbitMQ cluster and partitions
RabbitMQ also offers three ways to deal with network partitions automatically: pause-minority mode, pause-
if-all-down mode and autoheal mode. (The default behaviour is referred to as ignore mode).
In pause-minority mode RabbitMQ will automatically pause cluster nodes which determine themselves to be in
a minority (i.e. fewer or equal than half the total number of nodes) after seeing other nodes go down. It
therefore chooses partition tolerance over availability from the CAP theorem. This ensures that in the event of
a network partition, at most the nodes in a single partition will continue to run. The minority nodes will pause
as soon as a partition starts, and will start again when the partition ends.
In pause-if-all-down mode, RabbitMQ will automatically pause cluster nodes which cannot reach any of the
listed nodes. In other words, all the listed nodes must be down for RabbitMQ to pause a cluster node. This is
close to the pause-minority mode, however, it allows an administrator to decide which nodes to prefer, instead
of relying on the context. For instance, if the cluster is made of two nodes in rack A and two nodes in rack B,
and the link between racks is lost, pause-minority mode will pause all nodes. In pause-if-all-down mode, if the
administrator listed the two nodes in rack A, only nodes in rack B will pause. Note that it is possible the listed
nodes get split across both sides of a partition: in this situation, no node will pause. That is why there is an
additional ignore/autoheal argument to indicate how to recover from the partition.
In autoheal mode RabbitMQ will automatically decide on a winning partition if a partition is deemed to have
occurred, and will restart all nodes that are not in the winning partition. Unlike pause_minority mode it
therefore takes effect when a partition ends, rather than when one starts.
The winning partition is the one which has the most clients connected (or if this produces a draw, the one with
the most nodes; and if that still produces a draw then one of the partitions is chosen in an unspecified way).
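The pause-minority rule quoted above ("fewer or equal than half the total number of nodes") reduces to one comparison; a toy check, names hypothetical:

```java
// Sketch of the pause-minority rule: a node pauses when the nodes it can
// see (including itself) are half or fewer of the total cluster size.
class PauseMinority {
    static boolean shouldPause(int visibleNodes, int clusterSize) {
        return visibleNodes * 2 <= clusterSize;
    }
}
```

Note the `<=`: in an even split (two racks of two, link down), both sides see exactly half and both pause, which is the scenario pause-if-all-down mode was added to handle.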
How to scale?
Federation
Federation allows an exchange or queue on one broker to receive messages published to an exchange or queue on another (the
brokers may be individual machines, or clusters). Communication is via AMQP (with optional SSL), so for two exchanges or queues
to federate they must be granted appropriate users and permissions.
Federated exchanges are connected with one way point-to-point links. By default, messages will only be forwarded over a
federation link once, but this can be increased to allow for more complex routing topologies. Some messages may not be
forwarded over the link; if a message would not be routed to a queue after reaching the federated exchange, it will not be
forwarded in the first place.
Federated queues are similarly connected with one way point-to-point links. Messages will be moved between federated queues
an arbitrary number of times to follow the consumers.
Typically you would use federation to link brokers across the internet for pub/sub messaging and work queueing.
The Shovel
Connecting brokers with the shovel is conceptually similar to connecting them with federation. However, the shovel works at a
lower level.
Whereas federation aims to provide opinionated distribution of exchanges and queues, the shovel simply consumes messages
from a queue on one broker, and forwards them to an exchange on another.
Typically you would use the shovel to link brokers across the internet when you need more control than federation provides.
How to scale?
Horizontally!
We suggest a simpler way of scaling than federation or the shovel.
Just start N clusters (like mysql or postgres):
[Diagram: three independent clusters, each with its own Gateways → RabbitMQ → Backends]
How to scale*?
https://insidethecpu.com/2014/11/17/load-balancing-a-
rabbitmq-cluster/
Redis
1. Redis is fast!
2. Redis loses data! (CP)
Redis fast?
Exceptionally fast: Redis can perform about
110,000 SETs per second and about 81,000 GETs per second (on one
thread).
1. Operations are atomic: all Redis operations are atomic,
which ensures that if two clients access the Redis server
concurrently, they will get the updated value. (Discuss CAS in Java.)
Redis fast?
Access by value O(1), by score O(log(N)). For numerical
members, the value is the score. For string members, the score is
a hash of the string.
Redis scalable?
Yes!
Due to the simple format of data storage (key -> value), where every entry uses a hash for lookup, it is very
simple to shard by hash range or value range, with no additional effort compared to MongoDB (speak about
MongoDB indexes), for example.
approaches:
1. crc32: Proxy assisted partitioning means that our clients send requests to a proxy that is able to speak
the Redis protocol, instead of sending requests directly to the right Redis instance. The proxy will make
sure to forward our request to the right Redis instance accordingly to the configured partitioning schema,
and will send the replies back to the client. The Redis and Memcached proxy Twemproxy implements
proxy assisted partitioning.
2. Redis Cluster: Query routing means that you can send your query to a random instance, and the instance
will make sure to forward your query to the right node. Redis Cluster implements an hybrid form of query
routing, with the help of the client (the request is not directly forwarded from a Redis instance to
another, but the client gets redirected to the right node).
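Approach 1 boils down to hashing the key and taking it modulo the number of instances. A sketch of that routing decision (class name hypothetical; Twemproxy's actual hash functions are configurable, `java.util.zip.CRC32` is used here purely as an illustration):

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

// Sketch of proxy-assisted hash partitioning: CRC32 of the key, modulo
// the number of Redis instances, picks the shard for this key.
class RedisSharding {
    static int shardFor(String key, int shards) {
        CRC32 crc = new CRC32();
        crc.update(key.getBytes(StandardCharsets.UTF_8));
        return (int) (crc.getValue() % shards);
    }
}
```

The weakness of plain modulo hashing is visible in the formula: changing `shards` remaps almost every key, which is why the Redis docs recommend presharding (start with many small instances) or consistent hashing instead.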
Discuss how to configure this! & Presharding
http://redis.io/topics/cluster-tutorial
http://redis.io/topics/partitioning
http://docs.spring.io/spring-data/redis/docs/current/reference/html/#redis:sentinel
What about HA
Redis offers asynchronous primary->secondary replication. A single server is chosen as
the primary, which can accept writes. It relays its state changes to secondary servers,
which follow along. Asynchronous means that you don’t have to wait for a write to be
replicated before the primary returns a response to the client.
1. Sentinel
Sentinel tries to establish a quorum between Sentinel nodes, agree on which Redis
servers are alive, and promote any which appear to have failed. If we colocate the
Sentinel nodes with the Redis nodes, this should allow us to promote a new primary in
the majority component (should one exist).
2. Redis cluster (discuss about slots)!
http://redis.io/topics/replication
http://redis.io/topics/sentinel
http://redis.io/topics/cluster-tutorial
http://redis.io/topics/sentinel-clients
Redis + Jepsen
https://aphyr.com/posts/283-call-me-maybe-redis
Eureka (pure AP algorithm)
Once the server starts receiving traffic, all of the operations performed on the server are
replicated to all of the peer nodes that the server knows about. If an operation fails for some
reason, the information is reconciled on the next heartbeat, which also gets replicated between
servers.
When the Eureka server comes up, it tries to get all of the instance registry information from a
neighboring node. If there is a problem getting the information from a node, the server tries all of
the peers before it gives up. If the server is able to successfully get all of the instances, it sets the
renewal threshold that it should be receiving based on that information. If at any time the renewals
fall below the percent configured for that value (below 85% within 15 mins), the server stops
expiring instances to protect the current instance registry information.
This is called self-preservation mode and is primarily used as a protection in scenarios where
there is a network partition between a group of clients and the Eureka server. In these scenarios,
the server tries to protect the information it already has. In case of a mass outage, this may
cause clients to get instances that do not exist anymore. The clients must make sure they are
resilient to the Eureka server returning an instance that is non-existent or unresponsive. The
best protection in these scenarios is to time out quickly and try other servers.
What we do in balancer, gateway (file service, rabbitmq), backends (rabbitmq)
In the case where the server is not able to get the registry information from a neighboring node,
it waits for a few minutes (5 mins) so that the clients can register their information.
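The self-preservation trigger described above reduces to a threshold comparison; a hypothetical helper (the real Eureka derives the expected renewal count from the registered instances, and the 85% figure is the value quoted in the text):

```java
// Sketch of the Eureka self-preservation check: if renewals received in
// the window fall below 85% of the expected count, stop expiring
// instances and protect the current registry.
class SelfPreservation {
    static boolean protectRegistry(long renewalsReceived, long renewalsExpected) {
        return renewalsReceived < renewalsExpected * 85 / 100;
    }
}
```

This is the AP trade-off in one line: when the server cannot tell a partition from a mass crash, it prefers to serve possibly stale instances rather than expire possibly live ones.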
Eureka (AP)
What happens during network outages between Peers?
In the case of network outages between peers, the following things may happen:
1. The heartbeat replications between peers may fail and the server detects this
situation and enters into a self-preservation mode protecting the current state.
2. The situation autocorrects itself after the network connectivity is restored to a
stable state. When the peers are able to communicate fine, the registration
information is automatically transferred to the servers that do not have them.
The bottom line is, during the network outages, the server tries to be as resilient as
possible, but there is a possibility of clients having different views of the servers during
that time
ZooKeeper is based on a Paxos-like consensus protocol (ZAB) and chooses CP:
it uses transactions for sharing state, sacrificing
availability during a partition,
while Eureka sends its entire state all the time (AP)
Transactions?
Eureka vs Zookeeper CAP
1. Eureka integrates better with other NetflixOSS components
(Ribbon especially).
2. ZooKeeper is hard. We've gotten pretty good at it, but it
requires care and feeding.
https://tech.knewton.com/blog/2014/12/eureka-shouldnt-use-
zookeeper-service-discovery/
Eureka vs Zookeeper
https://tech.knewton.com/blog/2014/12/eureka-shouldnt-use-
zookeeper-service-discovery/
https://github.com/Netflix/eureka/wiki/Understanding-Eureka-
Peer-to-Peer-Communication
https://groups.google.com/forum/#!topic/eureka_netflix/LXKWo
D14RFY
Eureka links
Push service
1. Stateless
2. Locks
3. Performance
Configuration service
HA via DNS balancing
Each Component Scaling Capability
Type              CAP                              Best for
Platform module   Independent; stateless           HA & Performance
Redis             CP                               Performance
Weave DNS         AP                               HA w/o consistency
Docker Swarm      CA                               HA
RabbitMQ          Queues replicated across nodes   HA & slight Performance
Eureka            AP                               HA w/o consistency
Conf service      Stateless                        HA
Reminder
1. L1 cache reference 0.3 ns
2. Branch mispredict 3 ns
3. L2 cache reference 7 ns
4. Mutex lock/unlock 80 ns
5. Main memory reference 100 ns
6. Compress 1K bytes with Zippy 10,000 ns
7. Send 2K bytes over 1 Gbps network 20,000 ns
8. Read 1 MB sequentially from memory 250,000 ns
9. Round trip within same datacenter 500,000 ns
10.Disk seek 10,000,000 ns
11.Read 1 MB sequentially from network 5,000,000 ns
12.Read 1 MB sequentially from disk 30,000,000 ns
13.Send packet CA->Netherlands->CA 150,000,000 ns
Reminder 2
Ensure your design works if scale changes by 10X or 20X,
but the right solution for X is often not optimal for 100X
Eventual Consistency
Eventual Consistency - BASE
Along with the CAP conjecture, Brewer suggested a new consistency model - BASE
(Basically Available, Soft state, Eventual consistency).
• The BASE model gives up on Consistency from the CAP theorem.
• This model is optimistic and accepts eventual consistency, in contrast to ACID:
given enough time, all nodes will be consistent and every request will return the
same response.
• Brewer points out that ACID and BASE are two extremes, and one can choose from a
range of options when balancing consistency against availability (consistency models).
• Basically Available - the system does guarantee availability, in terms of the CAP
theorem. It is always available, but subsets of data may become unavailable for short
periods of time.
• Soft state - the state of the system may change over time, even without input. Data
does not have to be consistent.
• Eventual Consistency - the system will become consistent eventually, in the future.
ACID, on the contrary, enforces consistency immediately after any operation.

  • 12.
    Jepsen, just add brackets...
  • 13.
    RDBMS (again, theory)
    • Standardized with SQL
    • Ubiquitous - widely used and understood
    • Supports transactions
    • High availability is achieved via replication:
      • Master - Master
      • Master - Slave
      • Synchronous / Asynchronous
  • 14.
    Why RDBMS is AC: definition of ACID
    Atomicity of an operation (transaction)
    • "All or nothing" - if part of it fails, the entire transaction fails.
    Consistency
    • The database will remain in a valid state after the transaction.
    • Means adhering to the database rules (keys, uniqueness, etc.)
    Isolation
    • Two simultaneous transactions cannot interfere with one another (they execute as if run sequentially).
    Durability
    • Once a transaction is committed, it remains so indefinitely, even after power loss or crash (no caching).
  • 15.
    ACID in Dist. Systems
    • Proved problematic in big distributed systems. How do we guarantee the ACID properties?
    • Atomicity requires more thought - e.g. two-phase commit (and three-phase commit, Paxos, ...).
    • Isolation requires holding all locks for the entire transaction duration - high lock contention!
    • Complex and prone to failure - the algorithm must handle failure (= an outage during a write).
    • Comes with high commit overhead.
  • 16.
    Reminder: speak about atomicity/locks in Java: the withdraw example
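The slide only names the withdraw example; here is a minimal sketch of what it usually looks like (the `Account` class and its methods are illustrative, not from the deck). Without `synchronized`, the check-then-act on the balance is a race and two threads can overdraw the account:

```java
public class Account {
    private long balance;

    public Account(long initialBalance) { this.balance = initialBalance; }

    // Without synchronization, check-then-act is a race: two threads can
    // both pass the balance check and overdraw the account.
    public synchronized boolean withdraw(long amount) {
        if (balance >= amount) {
            balance -= amount;   // the read-modify-write must be atomic
            return true;
        }
        return false;
    }

    public synchronized long getBalance() { return balance; }

    public static void main(String[] args) throws InterruptedException {
        Account acc = new Account(10);
        Thread t1 = new Thread(() -> acc.withdraw(8));
        Thread t2 = new Thread(() -> acc.withdraw(8));
        t1.start(); t2.start();
        t1.join(); t2.join();
        // Exactly one withdrawal can succeed: balance ends at 2, never -6.
        System.out.println(acc.getBalance());
    }
}
```

The same anomaly reappears later in the deck at the database level, where the "lock" has to be provided by transaction isolation instead of a JVM monitor.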
  • 17.
    Does it mean that we can't scale an RDBMS out of the box?
  • 18.
    But we have a PG cluster! But in a PG cluster only one node can write. According to Amazon research this brings ~5% overhead for the master node, plus network delay and replica delay for the 2PC commit. So it can only balance reads, via pgpool. A PG cluster is not about balancing load (at least not writes). Okay, at least we have ACID, right? Well... almost. Even though the Postgres server is always consistent, the distributed system composed of the server and client together may not be consistent. It's possible for the client and server to disagree about whether or not a transaction took place.
  • 19.
    PG cluster
    Postgres' commit protocol, like most relational databases, is a special case of two-phase commit, or 2PC. In the first phase, the client votes to commit (or abort) the current transaction, and sends that message to the server. The server checks to see whether its consistency constraints allow the transaction to proceed, and if so, it votes to commit. It writes the transaction to storage and informs the client that the commit has taken place (or failed, as the case may be). Now both the client and server agree on the outcome of the transaction.
    What happens if the message acknowledging the commit is dropped before the client receives it? Then the client doesn't know whether the commit succeeded or not! The 2PC protocol says that we must wait for the acknowledgement message to arrive in order to decide the outcome. Waiting forever isn't realistic for real systems, so at some point the client will time out and declare an error occurred. The commit protocol is now in an indeterminate state.
  • 20.
    PG cluster + Jepsen + Withdraw example https://aphyr.com/posts/282-jepsen-postgres
  • 21.
    But we have pg_shard for scaling load https://www.citusdata.com/citus-products/pg-shard/pg-shard-quick-start-guide
    Yes, but Postgres with pg_shard is not ACID! Limitations:
    • Transactional semantics for queries that span multiple shards - for example, you're a financial institution and you sharded your data based on customer_id. You'd now like to withdraw money from one customer's account and debit it to another one's account, in a single transaction block.
    • Unique constraints on columns other than the partition key, or foreign key constraints.
    • Distributed JOINs also aren't supported in pg_shard.
  • 22.
    pg_shard Frequently Asked Questions
    How does pg_shard handle INSERT/UPDATE/DELETE commands? pg_shard requires that any modifications (INSERTs, UPDATEs, or DELETEs) involve exactly one shard. In the UPDATE and DELETE case, this means commands must include a WHERE qualification on the partition column that restricts the query to a single shard. Such qualifications usually take the form of an equality clause on the table's partition column. As for INSERT commands, the partition column of the row being inserted must be specified using an expression that can be reduced to a constant. For instance, a value such as 3, or even char_length('bob'), would be suitable, though rand() would not. In addition, INSERT commands must specify exactly one row to be inserted. Note that the above restriction implies that commands similar to "INSERT INTO table SELECT col_one, col_two FROM other_table" are not currently supported. From an implementation standpoint, pg_shard determines the shard involved in a given INSERT, UPDATE, or DELETE command and then rewrites the SQL of that command to reference the shard table. The rewritten SQL is then sent to the placements for that shard to complete processing of the command.
    How exactly does pg_shard distribute my data? Rather than using hosts as the unit of distribution, pg_shard creates many small shards and places them across many hosts in a round-robin fashion. For example, a user might have eight hosts in their cluster but 256 shards with a replication factor of two. Shard one would be created on hosts A and B, shard two on B and C, and so forth. The advantage of this approach is that the additional load incurred after a host failure is spread among many other hosts instead of falling entirely on a single replica.
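The round-robin placement described in the FAQ can be sketched in a few lines. This is a hedged illustration of the scheme, not pg_shard's actual code; the host names are invented:

```java
import java.util.*;

// Many small shards spread over hosts round-robin with a replication
// factor, so shard 0 lands on hosts A and B, shard 1 on B and C, etc.
public class RoundRobinPlacement {
    static Map<Integer, List<String>> place(List<String> hosts, int shards, int replication) {
        Map<Integer, List<String>> placement = new LinkedHashMap<>();
        for (int shard = 0; shard < shards; shard++) {
            List<String> replicas = new ArrayList<>();
            for (int r = 0; r < replication; r++) {
                // Each replica of a shard goes to the next host in the ring.
                replicas.add(hosts.get((shard + r) % hosts.size()));
            }
            placement.put(shard, replicas);
        }
        return placement;
    }

    public static void main(String[] args) {
        List<String> hosts = Arrays.asList("A", "B", "C", "D", "E", "F", "G", "H");
        Map<Integer, List<String>> p = place(hosts, 256, 2);
        // Shard 0 -> [A, B], shard 1 -> [B, C]: when a host fails, its load
        // is spread over many hosts instead of falling on a single replica.
        System.out.println(p.get(0) + " " + p.get(1));
    }
}
```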
  • 23.
    But MySQL Galera has a master-master cluster approach! Multi-master replication means that applications update the same tables on different masters, and the changes replicate automatically between those masters.
    Row-Based Replication to Avoid Data Drift: replication depends on deterministic updates - a transaction that changes 10 rows on the original master should change exactly the same rows when it executes against a replica. Unfortunately many SQL statements that are deterministic in master/slave replication are non-deterministic in multi-master topologies. Consider the following example, which gives a 10% raise to employees in department #35.
    UPDATE emp SET salary = salary * 1.1 WHERE dep_id = 35;
    If all masters add employees, then the number of employees who actually get the raise will vary depending on whether such additions have replicated to all masters. Your servers will very likely become inconsistent with statement replication. The fix is to enable row-based replication using binlog-format=row in my.cnf. Row replication transfers the exact row updates from each master to the others and eliminates ambiguity. But this reduces performance dramatically.
  • 24.
    MySQL Galera: Prevent Key Collisions on INSERTs
    For applications that use auto-increment keys, MySQL offers a useful trick to ensure that such keys do not collide between masters, using the auto-increment-increment and auto-increment-offset parameters in my.cnf. The following example ensures that auto-increment keys start at 1 and increment by 4 to give values like 1, 5, 9, etc. on this server.
    server-id=1
    auto-increment-offset = 1
    auto-increment-increment = 4
    This works so long as your applications use auto-increment keys faithfully. However, any table that either does not have a primary key or where the key is not an auto-increment field is suspect. You need to hunt them down and ensure the application generates a proper key that does not collide across masters, for example using UUIDs or by putting the server ID into the key. Here is a query on the MySQL information schema to help locate tables that do not have an auto-increment primary key.
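The offset/increment trick above is just interleaved arithmetic sequences. A small sketch (server count and offsets as in the slide; the helper is illustrative, not a MySQL API) shows why keys from different masters can never collide:

```java
// Master s with offset_s generates keys offset_s, offset_s + 4, offset_s + 8, ...
// With increment 4 and offsets 1..4, the four masters own disjoint
// residue classes mod 4, so concurrent inserts never collide.
public class AutoIncrementDemo {
    static long nthKey(long offset, long increment, long n) {
        return offset + n * increment;
    }

    public static void main(String[] args) {
        long increment = 4;
        // Master with offset 1 generates 1, 5, 9, ...
        StringBuilder sb = new StringBuilder();
        for (long n = 0; n < 3; n++) sb.append(nthKey(1, increment, n)).append(" ");
        System.out.println(sb.toString().trim());
        // Master with offset 2 generates 2, 6, ... - a disjoint sequence.
        System.out.println(nthKey(2, increment, 0) + " " + nthKey(2, increment, 1));
    }
}
```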
  • 25.
    MySQL Galera: Semantic Conflicts in Applications
    MySQL replication cannot resolve conflicts; you need to avoid them in your applications. Here are a few tips as you go about this.
    First, avoid obvious conflicts. These include inserting data with the same keys on different masters (described above), updating rows in two places at once, or deleting rows that are updated elsewhere. Any of these can cause errors that will break replication or cause your masters to become out of sync. The good news is that many of these problems are not hard to detect and eliminate using properly formatted transactions. The bad news is that these are the easy conflicts. There are others that are much harder to address.
    For example, accounting systems need to generate unbroken sequences of numbers for invoices. A common approach is to use a table that holds the next invoice number and increment it in the same transaction that creates a new invoice. Another accounting example is reports that need to read the value of accounts consistently, for example at monthly close. Neither example works off-the-shelf in a multi-master system with asynchronous replication, as both require some form of synchronization to ensure global consistency across masters. (Or the salary-and-balance task.) These and other such cases may force substantial application changes. Some applications simply do not work with multi-master topologies for this reason.
  • 26.
    MySQL Galera: Have a Plan for Sorting Out Mixed-Up Data
    Master/slave replication has its discontents, but at least sorting out messed-up replicas is simple: re-provision from another slave or the master. Not so with multi-master topologies - you can easily get into a situation where all masters have transactions you need to preserve, and the only way to sort things out is to track down differences and update masters directly. Here are some thoughts on how to do this.
    1. Ensure you have tools to detect inconsistencies. Tungsten has built-in consistency checking with the 'trepctl check' command. You can also use the Percona Toolkit pt-table-checksum to find differences. Be forewarned that neither of these works especially well on large tables, and they may give false results if more than one master is active when you run them.
    2. Consider relaxing foreign key constraints. I love foreign keys because they keep data in sync. However, they can also create problems for fixing messed-up data, because the constraints may break replication or make it difficult to go table-by-table when synchronizing across masters. There is an argument for being a little more relaxed in multi-master settings.
    3. Switch masters off if possible. Fixing problems is a lot easier if you can quiesce applications on all but one master.
    4. Know how to fix data. Being handy with SQL is very helpful for fixing up problems. I find SELECT INTO OUTFILE and LOAD DATA INFILE quite handy for moving changes between masters. Don't forget SET SESSION SQL_LOG_BIN=0 to keep changes from being logged and breaking replication elsewhere. There are also various synchronization tools like pt-table-sync, but I do not know enough about them to make recommendations.
    5. At this point it's probably worth mentioning commercial support. Unless you are a replication guru, it is very comforting to have somebody to call when you are dealing with messed-up masters. Even better, expert advice early on can help you avoid problems in the first place.
  • 27.
    MySQL Galera + Jepsen + Withdraw
    https://aphyr.com/posts/327-jepsen-mariadb-galera-cluster
    Imagine a system of two bank accounts, each with a balance of $10.
    SET SESSION TRANSACTION ISOLATION LEVEL SERIALIZABLE
    set autocommit=0
    select * from accounts where id = 0
    select * from accounts where id = 1
    UPDATE accounts SET balance = 8 WHERE id = 0
    UPDATE accounts SET balance = 12 WHERE id = 1
    COMMIT
  • 28.
    MySQL Galera + Jepsen + Withdraw
    Case 1: T1 commits before T2's start time. Operations from T1 and T2 cannot interleave, by Lemma 1, because their intervals do not overlap.
    Case 2: T1 and T2 operate on disjoint sets of accounts. They serialize trivially.
    Case 3: T1 and T2 operate on intersecting sets of accounts, and T1 commits before T2 commits. Then T1 wrote data that T2 also wrote, and committed in T2's interval, which violates first-committer-wins. T2 must abort.
    Case 4: T1 and T2 operate on intersecting sets of accounts, and T1 commits after T2 commits. Then T2 wrote data that T1 also wrote, and committed in T1's interval, which violates first-committer-wins. T1 must abort.
  • 29.
    MySQL Galera + Jepsen + Withdraw
    Read-only transactions trivially serialize with one another. Do they serialize with respect to transfer transactions? The answer is yes: since every read-only transaction sees only committed data in a Snapshot Isolation system, and commits no data itself, it must appear to take place atomically at some time between other transactions.
    SET SESSION TRANSACTION ISOLATION LEVEL SERIALIZABLE
    set autocommit=0
    select * from accounts
    COMMIT
  • 30.
    MySQL Galera + Jepsen + Withdraw
  • 31.
    MySQL Galera conclusion
    The transfer transactions should have kept the total amount of money at $20, but by the end of the test the totals all sum to $22. And in this run, 25% of the funds in the system mysteriously vanish. These results remain stable after all other transactions have ended - they are not a concurrency anomaly. Dirty reads! No first-committer-wins, no snapshot isolation. And without snapshot isolation, well... I'm not sure exactly what Galera does guarantee. Master-master works for append-only DBs.
    http://scale-out-blog.blogspot.com/2012/04/if-you-must-deploy-multi-master.html
    http://www.onlamp.com/2016/04/20/advanced-mysql-
  • 33.
    We know that Instagram uses Postgres and Pinterest uses MySQL! True! https://engineering.pinterest.com/blog/sharding-pinterest-how-we-scaled-our-mysql-fleet
    >> In 2011, we hit traction. By some estimates, we were growing faster than any other previous startup. Around September 2011, every piece of our infrastructure was over capacity. We had several NoSQL technologies, all of which eventually broke catastrophically. We also had a boatload of MySQL slaves we were using for reads, which makes for lots of irritating bugs, especially with caching.
  • 34.
    Pinterest: How we sharded
    Whatever we were going to build needed to meet our needs and be stable, performant and repairable. In other words, it needed to not suck, and so we chose a mature technology as our base to build on, MySQL. We intentionally ran away from auto-scaling newer technology like MongoDB, Cassandra and Membase, because their maturity was simply not far enough along (and they were crashing in spectacular ways on us!).
    Aside: I still recommend startups avoid the fancy new stuff - try really hard to just use MySQL. Trust me. I have the scars to prove it. MySQL is mature, stable and it just works. Not only do we use it, but it's also used by plenty of other companies pushing even bigger scale. MySQL supports our need for ordering data requests, selecting certain ranges of data and row-level transactions. It has a hell of a lot more features, but we don't need or use them. But MySQL is a single-box solution, hence the need to shard our data. Here's our solution:
    We started with eight EC2 servers running one MySQL instance each.
  • 35.
    Pinterest: How we sharded
    So how do we distribute our data to these shards? We created a 64-bit ID that contains the shard ID, the type of the containing data, and where this data is in the table (local ID). The shard ID is 16 bits, the type ID is 10 bits and the local ID is 36 bits. The savvy additionology experts out there will notice that only adds up to 62 bits. My past in compiler and chip design has taught me that reserve bits are worth their weight in gold. So we have two (set to zero).
    ID = (shard ID << 46) | (type ID << 36) | (local ID << 0)
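The ID formula above is plain bit packing, so it can be sketched directly. The packing expression is from the slide; the unpack masks (16/10/36 bits) follow from the stated field widths, and the class and method names are illustrative:

```java
// Pinterest-style 64-bit ID: 2 reserved bits (zero), 16-bit shard ID,
// 10-bit type ID, 36-bit local ID.
public class ShardedId {
    static long pack(long shardId, long typeId, long localId) {
        return (shardId << 46) | (typeId << 36) | localId;
    }
    static long shardId(long id) { return (id >> 46) & 0xFFFF; }      // 16 bits
    static long typeId(long id)  { return (id >> 36) & 0x3FF; }       // 10 bits
    static long localId(long id) { return id & 0xFFFFFFFFFL; }        // 36 bits

    public static void main(String[] args) {
        long id = pack(241, 1, 12345);
        // Unpacking recovers the original fields, so the shard for any
        // object can be derived from its ID alone - no lookup table.
        System.out.println(shardId(id) + " " + typeId(id) + " " + localId(id));
    }
}
```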
  • 36.
    RabbitMQ
    RabbitMQ is a distributed message queue, and is probably the most popular open-source implementation of the AMQP messaging protocol. It supports a wealth of durability, routing, and fanout strategies, and combines excellent documentation with well-designed protocol extensions.
  • 38.
    RabbitMQ cluster + CAP
    According to the table, there is a choice between CP and CA, but in real life CP means losing data. From http://www.rabbitmq.com/partitions.html: RabbitMQ clusters do not tolerate network partitions well. If you are thinking of clustering across a WAN, don't. You should use federation or the shovel instead. However, sometimes accidents happen. RabbitMQ stores information about queues, exchanges, bindings etc. in Erlang's distributed database, Mnesia.
  • 39.
    RabbitMQ cluster and partitions
    RabbitMQ also offers three ways to deal with network partitions automatically: pause-minority mode, pause-if-all-down mode and autoheal mode. (The default behaviour is referred to as ignore mode.)
    In pause-minority mode RabbitMQ will automatically pause cluster nodes which determine themselves to be in a minority (i.e. fewer than or equal to half the total number of nodes) after seeing other nodes go down. It therefore chooses partition tolerance over availability from the CAP theorem. This ensures that in the event of a network partition, at most the nodes in a single partition will continue to run. The minority nodes will pause as soon as a partition starts, and will start again when the partition ends.
    In pause-if-all-down mode, RabbitMQ will automatically pause cluster nodes which cannot reach any of the listed nodes. In other words, all the listed nodes must be down for RabbitMQ to pause a cluster node. This is close to pause-minority mode; however, it allows an administrator to decide which nodes to prefer, instead of relying on the context. For instance, if the cluster is made of two nodes in rack A and two nodes in rack B, and the link between racks is lost, pause-minority mode will pause all nodes. In pause-if-all-down mode, if the administrator listed the two nodes in rack A, only the nodes in rack B will pause. Note that it is possible for the listed nodes to get split across both sides of a partition: in this situation, no node will pause. That is why there is an additional ignore/autoheal argument to indicate how to recover from the partition.
    In autoheal mode RabbitMQ will automatically decide on a winning partition if a partition is deemed to have occurred, and will restart all nodes that are not in the winning partition. Unlike pause-minority mode it therefore takes effect when a partition ends, rather than when one starts. The winning partition is the one which has the most clients connected (or, if this produces a draw, the one with the most nodes; and if that still produces a draw, one of the partitions is chosen in an unspecified way).
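The pause-minority rule above reduces to a one-line majority check. This is a hedged sketch of that rule, not RabbitMQ's actual (Erlang) implementation:

```java
// A node pauses when the partition it can see (including itself) contains
// "fewer than or equal to half the total number of nodes".
public class PauseMinority {
    static boolean shouldPause(int reachableNodes, int clusterSize) {
        return reachableNodes * 2 <= clusterSize;
    }

    public static void main(String[] args) {
        // 5-node cluster split 3/2: the 3-node side keeps running,
        // the 2-node side pauses.
        System.out.println(shouldPause(3, 5) + " " + shouldPause(2, 5));
        // Even split of a 4-node cluster: no majority, so both sides pause -
        // one reason odd cluster sizes are preferred.
        System.out.println(shouldPause(2, 4));
    }
}
```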
  • 40.
    How to scale? Federation
    Federation allows an exchange or queue on one broker to receive messages published to an exchange or queue on another (the brokers may be individual machines, or clusters). Communication is via AMQP (with optional SSL), so for two exchanges or queues to federate they must be granted appropriate users and permissions.
    Federated exchanges are connected with one-way point-to-point links. By default, messages will only be forwarded over a federation link once, but this can be increased to allow for more complex routing topologies. Some messages may not be forwarded over the link; if a message would not be routed to a queue after reaching the federated exchange, it will not be forwarded in the first place.
    Federated queues are similarly connected with one-way point-to-point links. Messages will be moved between federated queues an arbitrary number of times to follow the consumers. Typically you would use federation to link brokers across the internet for pub/sub messaging and work queueing.
    The Shovel
    Connecting brokers with the shovel is conceptually similar to connecting them with federation. However, the shovel works at a lower level. Whereas federation aims to provide opinionated distribution of exchanges and queues, the shovel simply consumes messages from a queue on one broker, and forwards them to an exchange on another. Typically you would use the shovel to link brokers across the internet when you need more control than federation provides.
  • 41.
    How to scale? Horizontally!
    We suggest a simpler way of scaling than federation or the shovel: just start N independent clusters (as with mysql or postgres).
    [diagram: N identical stacks, each with its own Gateways, RabbitMQ nodes and Backends]
  • 43.
    Redis
    1. Redis is fast!
    2. Redis loses data! (CP)
  • 44.
    Redis fast?
    Exceptionally fast: Redis can perform about 110,000 SETs per second and about 81,000 GETs per second (one thread).
    Operations are atomic: all Redis operations are atomic, which ensures that two clients concurrently accessing the Redis server will get the updated value. (Discuss CAS in Java.)
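The slide's "discuss CAS in Java" note can be illustrated with `java.util.concurrent.atomic`: a compare-and-set loop gives lock-free atomic updates, analogous in spirit to how a single-threaded Redis server applies each command atomically. The class name and values here are illustrative:

```java
import java.util.concurrent.atomic.AtomicLong;

public class CasDemo {
    public static void main(String[] args) {
        AtomicLong balance = new AtomicLong(10);

        // Explicit CAS loop: re-read and retry until our snapshot is still
        // current, so the read-modify-write is atomic without a lock.
        long current;
        do {
            current = balance.get();
        } while (!balance.compareAndSet(current, current - 8));

        System.out.println(balance.get());
    }
}
```

On a single thread the CAS succeeds on the first try; under contention, losing threads simply loop and retry against the fresh value.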
  • 45.
    Redis fast?
    Access by value is O(1), by score O(log(N)). For numerical members, the value is the score. For string members, the score is a hash of the string.
  • 47.
    Redis scalable? Yes! Due to the simple format of data storage (key -> value), where every entry uses a hash for lookup, it is very simple to shard by hash range or value range, with no additional effort compared to mongodb (speak about mongodb indexes), for example. Approaches:
    1. crc32: Proxy-assisted partitioning means that our clients send requests to a proxy that is able to speak the Redis protocol, instead of sending requests directly to the right Redis instance. The proxy will make sure to forward our request to the right Redis instance according to the configured partitioning schema, and will send the replies back to the client. The Redis and Memcached proxy Twemproxy implements proxy-assisted partitioning.
    2. Redis Cluster: Query routing means that you can send your query to a random instance, and the instance will make sure to forward your query to the right node. Redis Cluster implements a hybrid form of query routing, with the help of the client (the request is not directly forwarded from one Redis instance to another, but the client gets redirected to the right node).
    Discuss how to configure this! & Presharding
    http://redis.io/topics/cluster-tutorial
    http://redis.io/topics/partitioning
    http://docs.spring.io/spring-data/redis/docs/current/reference/html/#redis:sentinel
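The crc32 approach above can be sketched with the JDK's `CRC32`: hash the key and take it modulo the number of instances. This is a hedged illustration of the general hash-partitioning idea (a proxy like Twemproxy supports several hash/distribution combinations; the keys and shard count here are invented):

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

public class Crc32Sharding {
    // Map a key to one of `shards` Redis instances via crc32(key) mod shards.
    static int shardFor(String key, int shards) {
        CRC32 crc = new CRC32();
        crc.update(key.getBytes(StandardCharsets.UTF_8));
        return (int) (crc.getValue() % shards);
    }

    public static void main(String[] args) {
        int shards = 4;
        // The mapping is deterministic: the same key always hits the same shard.
        System.out.println(shardFor("user:1000", shards) == shardFor("user:1000", shards));
        // And every computed shard id stays within [0, shards).
        int s = shardFor("session:abc", shards);
        System.out.println(s >= 0 && s < shards);
    }
}
```

Note the downside mentioned in the partitioning docs: with plain `mod shards`, changing the shard count remaps most keys, which is why presharding (starting with many instances up front) is suggested.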
  • 48.
    What about HA?
    Redis offers asynchronous primary->secondary replication. A single server is chosen as the primary, which can accept writes. It relays its state changes to secondary servers, which follow along. Asynchronous means that you don't have to wait for a write to be replicated before the primary returns a response to the client.
    1. Sentinel: Sentinel tries to establish a quorum between Sentinel nodes, agree on which Redis servers are alive, and promote any which appear to have failed. If we colocate the Sentinel nodes with the Redis nodes, this should allow us to promote a new primary in the majority component (should one exist).
    2. Redis Cluster (discuss slots)!
    http://redis.io/topics/replication
    http://redis.io/topics/sentinel
    http://redis.io/topics/cluster-tutorial
    http://redis.io/topics/sentinel-clients
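The "slots" the slide asks to discuss work like this: Redis Cluster hashes every key to one of 16384 hash slots (using CRC16 mod 16384; CRC16 is elided here), and each node owns a contiguous range of slots. A hedged sketch of the even-split assignment from the cluster tutorial, with the range math as an assumption for illustration:

```java
public class ClusterSlots {
    static final int SLOTS = 16384;

    // Which node owns a slot when the slot space is split evenly across
    // `nodes` nodes, node i covering one contiguous range.
    static int nodeForSlot(int slot, int nodes) {
        int perNode = (SLOTS + nodes - 1) / nodes; // ceiling division
        return slot / perNode;
    }

    public static void main(String[] args) {
        // With 3 nodes: node 0 -> slots 0..5461, node 1 -> 5462..10923,
        // node 2 -> the rest. Resharding moves slot ranges, not keys directly.
        System.out.println(nodeForSlot(0, 3) + " " + nodeForSlot(5462, 3) + " " + nodeForSlot(16383, 3));
    }
}
```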
  • 50.
    Eureka (a pure AP algorithm)
    Once the server starts receiving traffic, all of the operations performed on the server are replicated to all of the peer nodes that the server knows about. If an operation fails for some reason, the information is reconciled on the next heartbeat, which also gets replicated between servers.
    When a Eureka server comes up, it tries to get all of the instance registry information from a neighboring node. If there is a problem getting the information from a node, the server tries all of the peers before it gives up. If the server is able to successfully get all of the instances, it sets the renewal threshold that it should be receiving based on that information. If at any time the renewals fall below the percent configured for that value (below 85% within 15 mins), the server stops expiring instances to protect the current instance registry information. This is called self-preservation mode and is primarily used as a protection in scenarios where there is a network partition between a group of clients and the Eureka server. In these scenarios, the server tries to protect the information it already has. In case of a mass outage this may cause clients to get instances that do not exist anymore. Clients must make sure they are resilient to a Eureka server returning an instance that is non-existent or unresponsive. The best protection in these scenarios is to time out quickly and try other servers. (This is what we do in the balancer, the gateways (file service, rabbitmq) and the backends (rabbitmq).)
    In the case where the server is not able to get the registry information from a neighboring node, it waits for a few minutes (5 mins) so that the clients can register their information.
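The self-preservation trigger described above is a simple threshold comparison. This is a hedged sketch of that rule only (the 85% figure is from the slide; the method names are made up, not Eureka's API):

```java
public class SelfPreservation {
    // Enter self-preservation (stop expiring instances) when the renewals
    // received in the window drop below 85% of what the server expects.
    static boolean selfPreserve(int renewalsReceived, int renewalsExpected) {
        return renewalsReceived < renewalsExpected * 0.85;
    }

    public static void main(String[] args) {
        // Expecting 100 heartbeat renewals in the last window:
        System.out.println(selfPreserve(90, 100));  // healthy: keep expiring
        System.out.println(selfPreserve(60, 100));  // partition-like drop: protect registry
    }
}
```

The trade-off is exactly the AP choice: stale instances may be served during a partition, and clients are expected to tolerate that by failing over quickly.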
  • 51.
    Eureka (AP)
    What happens during network outages between peers? The following things may happen:
    1. The heartbeat replication between peers may fail; the server detects this situation and enters self-preservation mode, protecting its current state.
    2. The situation autocorrects itself after the network connectivity is restored to a stable state. When the peers are able to communicate again, the registration information is automatically transferred to the servers that do not have it.
    The bottom line is: during network outages the server tries to be as resilient as possible, but there is a possibility of clients having different views of the servers during that time.
  • 52.
    Eureka vs Zookeeper: CAP
    Zookeeper is based on a Paxos-like consensus algorithm (ZAB) and provides CP: it uses transactions for sharing state, and during a partition it gives up availability rather than consistency, while Eureka simply sends its entire state all the time. Transactions?
  • 53.
    Eureka vs Zookeeper
    1. Eureka integrates better with other NetflixOSS components (Ribbon especially).
    2. ZooKeeper is hard. We've gotten pretty good at it, but it requires care and feeding.
    https://tech.knewton.com/blog/2014/12/eureka-shouldnt-use-zookeeper-service-discovery/
  • 55.
    Push service
    1. Stateless
    2. Locks
    3. Performance
  • 57.
    Each component's scaling capability:
    Component       | Scaling capability / type      | CAP | Best for
    Platform module | Independent; stateless         |     | HA & Performance
    Redis           |                                | CP  | Performance
    Weave DNS       |                                | AP  | HA w/o consistency
    Docker Swarm    |                                | CA  | HA
    RabbitMQ        | Queues replicated across nodes |     | HA & slight Performance
    Eureka          |                                | AP  | HA w/o consistency
    Conf service    | Stateless                      |     | HA