SCALING & HIGH AVAILABILITY
OF THE PLATFORM
Max & Vitaly
CAP theorem
• Presented as a conjecture at PODC 2000 (Brewer's conjecture)
• Formalized and proved in 2002 by Nancy Lynch and Seth Gilbert (MIT)
• Consistency, Availability and Partition Tolerance cannot be achieved all at the same time in a
distributed system
• There is a tradeoff between these 3 properties
1. Consistency (all nodes see the same data at the same time)
2. Availability (every request receives a response about
whether it succeeded or failed)
3. Partition tolerance (the system continues to operate despite
arbitrary partitioning due to network failures)
Definition
In simple terms: in an asynchronous network where messages may be
lost (partition tolerance), it is impossible to implement a service
that provides consistent data and responds eventually to every
request (availability) under every pattern of message loss
Consistency:
• Data is consistent and the same for all nodes.
• All the nodes in the system see the same state of the data.
Availability:
• Every request to a non-failing node is processed and receives a
response, whether it succeeded or failed.
Partition tolerance:
• If some nodes crash or communication fails, the service still
performs as expected.
In simple words:
● Consistency & Availability = some guarantees against data loss
● Consistency & Partitioning = scaling
Why do we need to care about this?
Stop theory! Real examples
• RDBMS (mysql, postgres)
• NoSQL (redis)
• RabbitMQ
• Eureka
• Black-box systems testing. Bugs reproduced in Jepsen are
observable in production, not theoretical. But tests are
nondeterministic, and they cannot prove correctness, only find
errors.
• Testing under distributed systems failure modes: faulty
networks, unsynchronized clocks, and partial failure. Test suites
only evaluate the behavior of healthy clusters
• Generative testing: the test harness constructs random operations,
applies them to the system, and constructs a concurrent history
of their results. That history is checked against a model to
establish its correctness. Generative (or property-based) tests
often reveal edge cases with subtle combinations of inputs.
Jepsen (http://jepsen.io/)
Jepsen, just add brackets...
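To make the generative-testing idea above concrete, here is a deliberately toy Java sketch (the map-backed "system", the fixed seed and the single-threaded loop are our simplifications; Jepsen itself drives real clusters concurrently and checks full histories): random operations are applied both to the system under test and to a known-good model, and any divergence is reported.

    import java.util.HashMap;
    import java.util.Map;
    import java.util.Objects;
    import java.util.Random;

    // Toy generative test: random ops go to the "system" and to a reference model,
    // and any divergence between the two is reported as a failure.
    public class GenerativeTestSketch {
        public static void main(String[] args) {
            Map<String, Integer> system = new HashMap<>(); // stand-in for the system under test
            Map<String, Integer> model = new HashMap<>();  // trivially correct reference model
            Random random = new Random(42);                // fixed seed only to keep the demo repeatable

            for (int op = 0; op < 10_000; op++) {
                String key = "k" + random.nextInt(5);
                if (random.nextBoolean()) {
                    int value = random.nextInt(100);
                    system.put(key, value);                // apply the write to the system...
                    model.put(key, value);                 // ...and to the model
                } else if (!Objects.equals(system.get(key), model.get(key))) {
                    throw new AssertionError("divergence at op " + op);
                }
            }
            // As the slide says: this can only find errors, never prove correctness.
            System.out.println("no divergence found in 10,000 random operations");
        }
    }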
RDBMS (again theory)
• Standardized with SQL
• Ubiquitous – widely used and understood
• Supports transactions
• High availability is achieved via Replication
• Master – Master
• Master – Slave
• Synchronous/Asynchronous
Why RDBMS is CA: ACID
Atomicity of an operation (transaction)
• "All or nothing" – if part fails, the entire transaction fails.
Consistency
• Database will remain in a valid state after the transaction.
• Means adhering to the database rules (key, uniqueness,
etc.)
Isolation
• Two simultaneous transactions cannot interfere with each
other. (Executed as if they were executed sequentially.)
Durability
• Once a transaction is committed, it remains so indefinitely,
even after power loss or crash. (No caching.)
ACID in Dist. Systems
• Proved problematic in big distributed systems
• How to guarantee ACID properties?
• Atomicity requires more thought - e.g. two-phase
commit (and 3-phase commit, Paxos…)
• Isolation requires holding all locks for the entire
transaction duration - high lock contention!
• Complex
• Prone to failure - the algorithm must handle the case
where a failure (outage) happens during a write.
• Comes with high-overhead commits.
Reminder: speak about atomicity/locks
in Java: withdraw example (see the sketch below)
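A minimal single-JVM Java sketch of the withdraw example the note above refers to (the Account class and its method names are illustrative, not taken from the talk): without mutual exclusion two concurrent withdrawals can both pass the balance check and overdraw the account; the synchronized variant makes check-and-update a single atomic step, which is the property distributed commit protocols have to reproduce across machines.

    // Illustrative only: a single-JVM model of the withdraw race.
    public class Account {
        private long balance;

        Account(long initialBalance) { this.balance = initialBalance; }

        // Not atomic: two threads can both pass the check and overdraw the account.
        void withdrawUnsafe(long amount) {
            if (balance >= amount) {
                balance -= amount;            // classic check-then-act race
            }
        }

        // Atomic within one JVM: the lock turns check + update into one indivisible step.
        synchronized void withdrawSafe(long amount) {
            if (balance >= amount) {
                balance -= amount;
            }
        }

        synchronized long balance() { return balance; }

        public static void main(String[] args) throws InterruptedException {
            Account acct = new Account(100);
            Thread t1 = new Thread(() -> acct.withdrawSafe(80));
            Thread t2 = new Thread(() -> acct.withdrawSafe(80));
            t1.start(); t2.start();
            t1.join(); t2.join();
            // 20 with withdrawSafe; with withdrawUnsafe the account can go to -60.
            System.out.println("balance = " + acct.balance());
        }
    }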
Does it mean that we can't scale
RDBMS out of the box?
But we have a PG cluster!
But in a PG cluster only one node can write.
According to Amazon research it adds about 5% overhead on the
master node, plus network delay and replica delay for the 2PC
commit. So it can only balance reads, via pgpool
A PG cluster is not about balancing load (at least not writes)
Okay, at least we have ACID
Right?
Well… almost. Even though the Postgres server is always
consistent, the distributed system composed of the server
and client together may not be consistent. It’s possible for
the client and server to disagree about whether or not a
transaction took place.
PG cluster
Postgres' commit protocol, like most relational databases, is a special case of
two-phase commit, or 2PC. In the first phase, the client votes to commit (or
abort) the current transaction, and sends that message to the server. The server
checks to see whether its consistency constraints allow the transaction to
proceed, and if so, it votes to commit. It writes the transaction to storage and
informs the client that the commit has taken place (or failed, as the case may
be.) Now both the client and server agree on the outcome of the transaction.
What happens if the message acknowledging the commit is dropped before the
client receives it? Then the client doesn’t know whether the commit succeeded
or not! The 2PC protocol says that we must wait for the acknowledgement
message to arrive in order to decide the outcome. Waiting forever isn’t realistic
for real systems, so at some point the client will time out and declare an error
occurred. The commit protocol is now in an indeterminate state.
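A hedged JDBC sketch of the client side of that indeterminacy (the connection URL, credentials, table and amount are placeholders, not from the talk): if commit() throws because the acknowledgement never arrived, the client cannot tell whether the write is durable and has to treat the outcome as unknown.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.SQLException;
    import java.sql.Statement;

    public class IndeterminateCommit {
        public static void main(String[] args) throws Exception {
            // Placeholder connection details; any PostgreSQL instance would do.
            try (Connection conn = DriverManager.getConnection(
                    "jdbc:postgresql://db-host:5432/bank", "app", "secret")) {
                conn.setAutoCommit(false);
                try (Statement stmt = conn.createStatement()) {
                    stmt.executeUpdate("UPDATE accounts SET balance = balance - 5 WHERE id = 0");
                    conn.commit();
                    System.out.println("commit acknowledged: the transaction is durable");
                } catch (SQLException ackLostOrTimedOut) {
                    // The server may have committed before the acknowledgement was lost.
                    // The only safe conclusion is "outcome unknown": the client has to
                    // re-read the state afterwards or rely on idempotent writes.
                    System.out.println("commit outcome indeterminate: " + ackLostOrTimedOut.getMessage());
                }
            }
        }
    }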
PG cluster + Jepsen + Withdraw
example
https://aphyr.com/posts/282-jepsen-postgres
But we have pg_shard for scaling load
https://www.citusdata.com/citus-products/pg-shard/pg-shard-quick-start-guide
Yes but Postgres with pg_shard is not ACID!
Limitations:
• Transactional semantics for queries that span across
multiple shards - For example, you're a financial institution
and you sharded your data based on customer_id. You'd
now like to withdraw money from one customer's account
and credit it to another one's account, in a single transaction
block.
• Unique constraints on columns other than the partition
key, or foreign key constraints.
• Distributed JOINs also aren't supported in pg_shard
pg_shard
Frequently Asked Questions
How does pg_shard handle INSERT/UPDATE/DELETE commands?
pg_shard requires that any modifications (INSERTs, UPDATEs, or DELETEs) involve exactly one shard.
In the UPDATE and DELETE case, this means commands must include a WHERE qualification on the partition column that restricts
the query to a single shard. Such qualifications usually take the form of an equality clause on the table's partition column.
As for INSERT commands, the partition column of the row being inserted must be specified using an expression
that can be reduced to a constant. For instance, a value such as 3, or even char_length('bob') would be suitable,
though rand() would not. In addition, INSERT commands must specify exactly one row to be inserted.
Note that the above restriction implies that commands similar to "INSERT INTO table SELECT col_one, col_two
FROM other_table" are not currently supported.
From an implementation standpoint, pg_shard determines the shard involved in a given INSERT, UPDATE,
or DELETE command and then rewrites the SQL of that command to reference the shard table.
The rewritten SQL is then sent to the placements for that shard to complete processing of the command.
How exactly does pg_shard distribute my data?
Rather than using hosts as the unit of distribution, pg_shard creates many small shards and places them across many hosts in a round-robin fashion.
For example, a user might have eight hosts in their cluster but 256 shards with a replication factor of two. Shard one would be created on hosts A and B, shard two on B and
C, and so forth.
The advantage of this approach is that the additional load incurred after a host failure is spread among many other hosts instead of falling entirely on a single replica.
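A small Java sketch of the placement scheme just described (the host names and the replication factor of two are assumptions for illustration): shard i lands on host i mod H and the following host, so after a host failure the extra load is spread across many replicas rather than one.

    import java.util.List;

    // Round-robin shard placement: shard 0 -> A,B ; shard 1 -> B,C ; and so on.
    public class ShardPlacement {
        public static void main(String[] args) {
            List<String> hosts = List.of("A", "B", "C", "D", "E", "F", "G", "H"); // 8 hosts
            int shardCount = 256;
            int replicationFactor = 2;

            for (int shard = 0; shard < shardCount; shard++) {
                StringBuilder placements = new StringBuilder();
                for (int replica = 0; replica < replicationFactor; replica++) {
                    placements.append(hosts.get((shard + replica) % hosts.size())).append(' ');
                }
                System.out.println("shard " + shard + " -> " + placements.toString().trim());
            }
        }
    }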
But Mysql Galera has a master-master
cluster approach!
Multi-master replication means that applications update the
same tables on different masters, and the changes replicate
automatically between those masters.
Row-Based Replication to Avoid Data Drift
Replication depends on deterministic updates--a transaction that changes 10 rows on the original master
should change exactly the same rows when it executes against a replica. Unfortunately many SQL
statements that are deterministic in master/slave replication are non-deterministic in multi-master
topologies. Consider the following example, which gives a 10% raise to employees in department #35.
UPDATE emp SET salary = salary * 1.1 WHERE dep_id = 35;
If all masters add employees, then the number of employees who actually get the raise will vary depending on
whether such additions have replicated to all masters. Your servers will very likely become inconsistent with
statement replication. The fix is to enable row-based replication using binlog-format=row in my.cnf. Row
replication transfers the exact row updates from each master to the others and eliminates ambiguity.
But this reduces performance dramatically.
Mysql Galera
Prevent Key Collisions on INSERTs
For applications that use auto-increment keys, MySQL offers a useful trick to ensure that such keys do not
collide between masters using the auto-increment-increment and auto-increment-offset parameters in
my.cnf. The following example ensures that auto-increment keys start at 1 and increment by 4 to give values
like 1, 5, 9, etc. on this server.
server-id=1
auto-increment-offset = 1
auto-increment-increment = 4
This works so long as your applications use auto-increment keys faithfully. However, any table that either does
not have a primary key or where the key is not an auto-increment field is suspect. You need to hunt them
down and ensure the application generates a proper key that does not collide across masters, for example
using UUIDs or by putting the server ID into the key. Here is a query on the MySQL information schema to
help locate tables that do not have an auto-increment primary key.
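The query itself did not survive onto the slide; below is a hedged JDBC sketch of one way to list such tables (the exact information_schema query is our reconstruction, not the one from the original post, and the connection details are placeholders).

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class FindTablesWithoutAutoIncrementPk {
        // Assumed query: base tables that have no auto_increment primary key column.
        static final String QUERY =
            "SELECT t.table_schema, t.table_name " +
            "FROM information_schema.tables t " +
            "LEFT JOIN information_schema.columns c " +
            "  ON c.table_schema = t.table_schema " +
            " AND c.table_name = t.table_name " +
            " AND c.column_key = 'PRI' " +
            " AND c.extra LIKE '%auto_increment%' " +
            "WHERE t.table_type = 'BASE TABLE' " +
            "  AND t.table_schema NOT IN ('mysql','information_schema','performance_schema','sys') " +
            "  AND c.column_name IS NULL";

        public static void main(String[] args) throws Exception {
            try (Connection conn = DriverManager.getConnection(
                     "jdbc:mysql://db-host:3306/", "app", "secret");   // placeholder connection details
                 Statement stmt = conn.createStatement();
                 ResultSet rs = stmt.executeQuery(QUERY)) {
                while (rs.next()) {
                    System.out.println(rs.getString(1) + "." + rs.getString(2));
                }
            }
        }
    }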
Mysql Galera
Semantic Conflicts in Applications
MySQL replication cannot resolve conflicts; you need to avoid them in your applications. Here are a few tips as
you go about this.
First, avoid obvious conflicts. These include inserting data with the same keys on different masters (described
above), updating rows in two places at once, or deleting rows that are updated elsewhere. Any of these can
cause errors that will break replication or cause your masters to become out of sync. The good news is that
many of these problems are not hard to detect and eliminate using properly formatted transactions. The
bad news is that these are the easy conflicts. There are others that are much harder to address.
For example, accounting systems need to generate unbroken sequences of numbers for invoices. A common
approach is to use a table that holds the next invoice number and increment it in the same transaction that
creates a new invoice. Another accounting example is reports that need to read the value of accounts
consistently, for example at monthly close. Neither example works off-the-shelf in a multi-master system
with asynchronous replication, as they both require some form of synchronization to ensure global
consistency across masters. Or salary and balance task. These and other such cases may force substantial
application changes. Some applications simply do not work with multi-master topologies for this reason.
Mysql Galera
Have a Plan for Sorting Out Mixed Up Data
Master/slave replication has its discontents, but at least sorting out messed up replicas is simple: re-provision from another slave
or the master. Not so with multi-master topologies--you can easily get into a situation where all masters have transactions you
need to preserve and the only way to sort things out is to track down differences and update masters directly. Here are some
thoughts on how to do this.
1. Ensure you have tools to detect inconsistencies. Tungsten has built-in consistency checking with the 'trepctl check'
command. You can also use the Percona Toolkit pt-table-checksum to find differences. Be forewarned that neither of
these works especially well on large tables and may give false results if more than one master is active when you run them.
2. Consider relaxing foreign key constraints. I love foreign keys because they keep data in sync. However, they can also
create problems for fixing messed up data, because the constraints may break replication or make it difficult to go table-
by-table when synchronizing across masters. There is an argument for being a little more relaxed in multi-master settings.
3. Switch masters off if possible. Fixing problems is a lot easier if you can quiesce applications on all but one master.
4. Know how to fix data. Being handy with SQL is very helpful for fixing up problems. I find SELECT INTO OUTFILE and LOAD
DATA INFILE quite handy for moving changes between masters. Don't forget SET SESSION SQL_LOG_BIN=0 to keep
changes from being logged and breaking replication elsewhere. There are also various synchronization tools like pt-table-
sync, but I do not know enough about them to make recommendations.
5. At this point it's probably worth mentioning commercial support. Unless you are a replication guru, it is very comforting to
have somebody to call when you are dealing with messed up masters. Even better, expert advice early on can help you
avoid problems in the first place.
Mysql Galera + Jepsen + Withdraw
https://aphyr.com/posts/327-jepsen-mariadb-galera-cluster
Imagine a system of two bank accounts, each with a balance of
$10.
SET SESSION TRANSACTION ISOLATION LEVEL SERIALIZABLE
set autocommit=0
select * from accounts where id = 0
select * from accounts where id = 1
UPDATE accounts SET balance = 8 WHERE id = 0
UPDATE accounts SET balance = 12 WHERE id = 1
COMMIT
Mysql Galera + Jepsen + Withdraw
Case 1: T1 commits before T2’s start time. Operations from T1 and T2 cannot
interleave, by Lemma 1, because their intervals do not overlap.
Case 2: T1 and T2 operate on disjoint sets of accounts. They serialize trivially.
Case 3: T1 and T2 operate on intersecting sets of accounts, and T1 commits before T2
commits. Then T1 wrote data that T2 also wrote, and committed in T2’s interval,
which violates First-committer-wins. T2 must abort.
Case 4: T1 and T2 operate on intersecting sets of accounts, and T1 commits after T2
commits. Then T2 wrote data that T1 also wrote, and committed in T1’s interval,
which violates First-committer-wins. T1 must abort.
Mysql Galera + Jepsen + Withdraw
Read-only transactions trivially serialize with one another. Do they serialize with
respect to transfer transactions? The answer is yes: since every read-only transaction
sees only committed data in a Snapshot Isolation system, and commits no data itself,
it must appear to take place atomically at some time between other transactions.
SET SESSION TRANSACTION ISOLATION LEVEL SERIALIZABLE
set autocommit=0
select * from accounts
COMMIT
Mysql Galera + Jepsen + Withdraw
Mysql Galera conclusion
The transfer transactions should have kept the total amount of
money at $20, but by the end of the test the totals all sum to
$22. And in this run, 25% of the funds in the system
mysteriously vanish. These results remain stable after all other
transactions have ended–they are not a concurrency anomaly.
Dirty reads!
No first-committer-wins, no snapshot isolation. No snapshot
isolation, well… I’m not sure exactly what Galera does
guarantee.
Master-Master works for append only DB
http://scale-out-blog.blogspot.com/2012/04/if-you-must-deploy-multi-master.html
http://www.onlamp.com/2016/04/20/advanced-mysql-
We know that
Instagram uses Postgres,
pinterest uses mysql!
True!
https://engineering.pinterest.com/blog/sharding-pinterest-how-we-scaled-our-mysql-fleet
>>In 2011, we hit traction. By some estimates, we were growing
faster than any other previous startup. Around September
2011, every piece of our infrastructure was over capacity. We
had several NoSQL technologies, all of which eventually broke
catastrophically. We also had a boatload of MySQL slaves we
were using for reads, which caused lots of irritating bugs,
especially with caching.
Pinterest
How we sharded
Whatever we were going to build needed to meet our needs and be stable, performant and repairable. In other
words, it needed to not suck, and so we chose a mature technology as our base to build on, MySQL. We
intentionally ran away from auto-scaling newer technology like MongoDB, Cassandra and Membase, because
their maturity was simply not far enough along (and they were crashing in spectacular ways on us!).
Aside: I still recommend startups avoid the fancy new stuff — try really hard to just use MySQL. Trust me. I
have the scars to prove it.
MySQL is mature, stable and it just works. Not only do we use it, but it’s also used by plenty of other
companies pushing even bigger scale. MySQL supports our need for ordering data requests, selecting certain
ranges of data and row-level transactions. It has a hell of a lot more features, but we don’t need or use them.
But, MySQL is a single box solution, hence the need to shard our data. Here’s our solution:
We started with eight EC2 servers running one MySQL instance each:
Pinterest
How we sharded
So how do we distribute our data to these shards?
We created a 64 bit ID that contains the shard ID, the type of the containing data, and where this data is in the
table (local ID). The shard ID is 16 bits, type ID is 10 bits and local ID is 36 bits. The savvy additionology
experts out there will notice that only adds to 62 bits. My past in compiler and chip design has taught me
that reserve bits are worth their weight in gold. So we have two (set to zero).
ID = (shard ID << 46) | (type ID << 36) | (local ID<<0)
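The same layout as a small Java sketch (the helper names are ours; the bit widths of 16, 10 and 36 plus two reserved high bits come from the text above):

    public final class PinId {
        // Bit layout from the slide: | 2 reserved | 16 shard | 10 type | 36 local |
        static long compose(long shardId, long typeId, long localId) {
            return (shardId << 46) | (typeId << 36) | localId;
        }

        static long shardId(long id) { return (id >> 46) & 0xFFFF; }   // 16 bits
        static long typeId(long id)  { return (id >> 36) & 0x3FF; }    // 10 bits
        static long localId(long id) { return id & 0xFFFFFFFFFL; }     // 36 bits

        public static void main(String[] args) {
            long id = compose(241, 1, 7075733);   // example values, not Pinterest's
            System.out.println(id + " -> shard=" + shardId(id)
                    + " type=" + typeId(id) + " local=" + localId(id));
        }
    }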
RabbitMQ
RabbitMQ is a distributed message queue,
and is probably the most popular open-source implementation
of the AMQP messaging protocol. It supports a wealth of
durability, routing, and fanout strategies, and combines excellent
documentation with well-designed protocol extensions.
RabbitMQ + CAP
RabbitMQ cluster + CAP
According to the table there is a choice between CP and CA, but in real
life CP means losing data
from http://www.rabbitmq.com/partitions.html
RabbitMQ clusters do not tolerate network partitions well. If you
are thinking of clustering across a WAN, don't. You should use
federation or the shovel instead.
However, sometimes accidents happen.
RabbitMQ stores information about queues, exchanges, bindings
etc in Erlang's distributed database, Mnesia.
RabbitMQ cluster and partitions
RabbitMQ also offers three ways to deal with network partitions automatically: pause-minority mode, pause-
if-all-down mode and autoheal mode. (The default behaviour is referred to as ignore mode).
In pause-minority mode RabbitMQ will automatically pause cluster nodes which determine themselves to be in
a minority (i.e. fewer or equal than half the total number of nodes) after seeing other nodes go down. It
therefore chooses partition tolerance over availability from the CAP theorem. This ensures that in the event of
a network partition, at most the nodes in a single partition will continue to run. The minority nodes will pause
as soon as a partition starts, and will start again when the partition ends.
In pause-if-all-down mode, RabbitMQ will automatically pause cluster nodes which cannot reach any of the
listed nodes. In other words, all the listed nodes must be down for RabbitMQ to pause a cluster node. This is
close to the pause-minority mode, however, it allows an administrator to decide which nodes to prefer, instead
of relying on the context. For instance, if the cluster is made of two nodes in rack A and two nodes in rack B,
and the link between racks is lost, pause-minority mode will pause all nodes. In pause-if-all-down mode, if the
administrator listed the two nodes in rack A, only nodes in rack B will pause. Note that it is possible the listed
nodes get split across both sides of a partition: in this situation, no node will pause. That is why there is an
additional ignore/autoheal argument to indicate how to recover from the partition.
In autoheal mode RabbitMQ will automatically decide on a winning partition if a partition is deemed to have
occurred, and will restart all nodes that are not in the winning partition. Unlike pause_minority mode it
therefore takes effect when a partition ends, rather than when one starts.
The winning partition is the one which has the most clients connected (or if this produces a draw, the one with
the most nodes; and if that still produces a draw then one of the partitions is chosen in an unspecified way).
How to scale?
Federation
Federation allows an exchange or queue on one broker to receive messages published to an exchange or queue on another (the
brokers may be individual machines, or clusters). Communication is via AMQP (with optional SSL), so for two exchanges or queues
to federate they must be granted appropriate users and permissions.
Federated exchanges are connected with one way point-to-point links. By default, messages will only be forwarded over a
federation link once, but this can be increased to allow for more complex routing topologies. Some messages may not be
forwarded over the link; if a message would not be routed to a queue after reaching the federated exchange, it will not be
forwarded in the first place.
Federated queues are similarly connected with one way point-to-point links. Messages will be moved between federated queues
an arbitrary number of times to follow the consumers.
Typically you would use federation to link brokers across the internet for pub/sub messaging and work queueing.
The Shovel
Connecting brokers with the shovel is conceptually similar to connecting them with federation. However, the shovel works at a
lower level.
Whereas federation aims to provide opinionated distribution of exchanges and queues, the shovel simply consumes messages
from a queue on one broker, and forwards them to an exchange on another.
Typically you would use the shovel to link brokers across the internet when you need more control than federation provides.
How to scale?
Horizontally!
We suggest a simpler way of scaling than federation or the shovel:
just start N independent clusters (like mysql or postgres), as sketched below:
[Diagram: N independent RabbitMQ clusters, each with its own gateways, several RabbitMQ nodes, and backends]
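A hedged sketch of what "just start N clusters" can look like from the application side (cluster addresses and the routing key are illustrative): each producer picks one of the independent RabbitMQ clusters deterministically, so no federation, shovel or cross-cluster replication is needed. The trade-off is that while one cluster is down, the keys routed to it are unavailable until it recovers or the routing table changes.

    import java.util.List;

    public class ClusterRouter {
        // Addresses of N completely independent RabbitMQ clusters (placeholders).
        private final List<String> clusterAddresses;

        ClusterRouter(List<String> clusterAddresses) {
            this.clusterAddresses = clusterAddresses;
        }

        // Deterministic routing: the same routing key always lands on the same cluster.
        String clusterFor(String routingKey) {
            int index = Math.floorMod(routingKey.hashCode(), clusterAddresses.size());
            return clusterAddresses.get(index);
        }

        public static void main(String[] args) {
            ClusterRouter router = new ClusterRouter(
                    List.of("rabbit-cluster-1:5672", "rabbit-cluster-2:5672", "rabbit-cluster-3:5672"));
            System.out.println(router.clusterFor("user-42"));   // connect and publish to this cluster
        }
    }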
How to scale*?
https://insidethecpu.com/2014/11/17/load-balancing-a-rabbitmq-cluster/
Redis
1. Redis is fast!
2. Redis loses data! (CP)
Redis fast?
Exceptionally fast: Redis is very fast and can perform about
110,000 SETs per second and about 81,000 GETs per second (one
thread)
1. Operations are atomic: all Redis operations are atomic,
which ensures that if two clients concurrently access the Redis
server, they will get the updated value. (Discuss CAS in Java.)
Redis fast?
Access by value O(1), by score O(log(N)). For numerical
members, the value is the score. For string members, the score is
a hash of the string.
Redis scalable?
Yes!
Due to the simple format of data storage (key -> value), where every
entry uses a hash for searching, it is very simple to shard by hash range
or value range, with no additional effort compared to mongodb
(speak about mongodb indexes), for example.
approaches:
1. Proxy assisted partitioning means that our clients send requests to a proxy that is able to speak the Redis
protocol, instead of sending requests directly to the right Redis instance. The proxy will make sure to
forward our request to the right Redis instance accordingly to the configured partitioning schema, and
will send the replies back to the client. The Redis and Memcached proxy Twemproxy implements proxy
assisted partitioning.
2. Query routing means that you can send your query to a random instance, and the instance will make
sure to forward your query to the right node. Redis Cluster implements a hybrid form of query routing,
with the help of the client (the request is not directly forwarded from a Redis instance to another, but
the client gets redirected to the right node).
Redis scalable?
Yes!
Due to the simple format of data storage (key -> value), where every entry uses a hash for searching, it is very simple to
shard by hash range or value range, with no additional effort compared to mongodb (speak about mongodb
indexes), for example.
approaches:
1. crc32: Proxy assisted partitioning means that our clients send requests to a proxy that is able to speak
the Redis protocol, instead of sending requests directly to the right Redis instance. The proxy will make
sure to forward our request to the right Redis instance accordingly to the configured partitioning schema,
and will send the replies back to the client. The Redis and Memcached proxy Twemproxy implements
proxy assisted partitioning.
2. Redis Cluster: Query routing means that you can send your query to a random instance, and the instance
will make sure to forward your query to the right node. Redis Cluster implements a hybrid form of query
routing, with the help of the client (the request is not directly forwarded from a Redis instance to
another, but the client gets redirected to the right node).
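A minimal Java sketch of the crc32/proxy-assisted idea above, done client-side (node addresses are placeholders; Twemproxy and Redis Cluster use their own hashing and slot schemes): every client derives the target Redis instance from the key, so all clients agree on where a key lives.

    import java.util.List;
    import java.util.zip.CRC32;

    public class RedisPartitioner {
        private final List<String> nodes;   // e.g. "redis-1:6379", "redis-2:6379" (placeholders)

        RedisPartitioner(List<String> nodes) { this.nodes = nodes; }

        // Hash the key with CRC32 and map it onto one of the nodes.
        String nodeFor(String key) {
            CRC32 crc = new CRC32();
            crc.update(key.getBytes());
            return nodes.get((int) (crc.getValue() % nodes.size()));
        }

        public static void main(String[] args) {
            RedisPartitioner p = new RedisPartitioner(
                    List.of("redis-1:6379", "redis-2:6379", "redis-3:6379"));
            System.out.println(p.nodeFor("user:42:session"));   // always the same node for this key
        }
    }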
Discuss how to configure this! & Presharding
http://redis.io/topics/cluster-tutorial
http://redis.io/topics/partitioning
http://docs.spring.io/spring-data/redis/docs/current/reference/html/#redis:sentinel
What about HA
Redis offers asynchronous primary->secondary replication. A single server is chosen as
the primary, which can accept writes. It relays its state changes to secondary servers,
which follow along. Asynchronous means that you don’t have to wait for a write to be
replicated before the primary returns a response to the client.
1. Sentinel
Sentinel tries to establish a quorum between Sentinel nodes, agree on which Redis
servers are alive, and promote any which appear to have failed. If we colocate the
Sentinel nodes with the Redis nodes, this should allow us to promote a new primary in
the majority component (should one exist).
2. Redis cluster (discuss about slots)!
http://redis.io/topics/replication
http://redis.io/topics/sentinel
http://redis.io/topics/cluster-tutorial
http://redis.io/topics/sentinel-clients
Redis + Jepsen
https://aphyr.com/posts/283-call-me-maybe-redis
Eureka (pure AP algorithm)
Once the server starts receiving traffic, all of the operations performed on the server are
replicated to all of the peer nodes that the server knows about. If an operation fails for some
reason, the information is reconciled on the next heartbeat that also gets replicated between
servers.
When the Eureka server comes up, it tries to get all of the instance registry information from a
neighboring node. If there is a problem getting the information from a node, the server tries all of
the peers before it gives up. If the server is able to successfully get all of the instances, it sets the
renewal threshold that it should be receiving based on that information. If at any time the renewals
fall below the percent configured for that value (below 85% within 15 mins), the server stops
expiring instances to protect the current instance registry information.
This is called self-preservation mode and is primarily used as a protection in scenarios where
there is a network partition between a group of clients and the Eureka Server. In these scenarios,
the server tries to protect the information it already has. There may be scenarios in case of a
mass outage that this may cause the clients to get instances that do not exist anymore. The
clients must make sure they are resilient to the Eureka server returning an instance that is non-
existent or unresponsive. The best protection in these scenarios is to time out quickly and try
other servers.
What we do in balancer, gateway (file service, rabbitmq), backends (rabbitmq)
In the case where the server is not able to get the registry information from the neighboring node,
it waits for a few minutes (5 mins) so that the clients can register their information.
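A hedged pure-Java sketch of the client behaviour recommended above (the URLs and timeout values are placeholders, and in a real Netflix stack a client library such as Ribbon does this for you): time out quickly and fall through to the next known instance instead of hanging on a non-existent or unresponsive one.

    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.util.List;

    public class ResilientLookup {
        // Candidate service instances, e.g. as returned by a Eureka registry (placeholders).
        static final List<String> INSTANCES =
                List.of("http://svc-a:8080/health", "http://svc-b:8080/health");

        static String firstResponsive() {
            for (String instance : INSTANCES) {
                try {
                    HttpURLConnection conn = (HttpURLConnection) new URL(instance).openConnection();
                    conn.setConnectTimeout(500);   // fail fast instead of hanging
                    conn.setReadTimeout(500);
                    if (conn.getResponseCode() == 200) {
                        return instance;           // use this instance
                    }
                } catch (Exception unreachable) {
                    // stale registry entry or partitioned node: try the next one
                }
            }
            throw new IllegalStateException("no responsive instance found");
        }

        public static void main(String[] args) {
            System.out.println(firstResponsive());
        }
    }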
Eureka (AP)
What happens during network outages between Peers?
In the case of network outages between peers, following things may happen
1. The heartbeat replications between peers may fail and the server detects this
situation and enters into a self-preservation mode protecting the current state.
2. The situation autocorrects itself after the network connectivity is restored to a
stable state. When the peers are able to communicate fine, the registration
information is automatically transferred to the servers that do not have them.
The bottom line is, during the network outages, the server tries to be as resilient as
possible, but there is a possibility of clients having different views of the servers during
that time
Zookeeper is based on a Paxos-like consensus algorithm and provides CP.
That means it uses transactions (consensus) for sharing state and gives
up availability when a partition happens,
while Eureka sends the entire state all the time
Transactions?
Eureka vs Zookeeper CAP
1. Eureka integrates better with other NetflixOSS components
(Ribbon especially).
2. ZooKeeper is hard. We've gotten pretty good at it, but it
requires care and feeding.
https://tech.knewton.com/blog/2014/12/eureka-shouldnt-use-zookeeper-service-discovery/
Eureka vs Zookeeper
https://tech.knewton.com/blog/2014/12/eureka-shouldnt-use-zookeeper-service-discovery/
https://github.com/Netflix/eureka/wiki/Understanding-Eureka-Peer-to-Peer-Communication
https://groups.google.com/forum/#!topic/eureka_netflix/LXKWoD14RFY
Eureka links
Push service
1. Stateless
2. Locks
3. Performance
Configuration service
HA via DNS balancing
Each Component Scaling Capability
Type            | CAP                            | Best for
Platform module | Independent; stateless         | HA & Performance
Redis           | CP                             | Performance
Weave DNS       | AP                             | HA w/o consistency
Docker Swarm    | CA                             | HA
RabbitMQ        | Queues replicated across nodes | HA & slight Performance
Eureka          | AP                             | HA w/o consistency
Conf service    | Stateless                      | HA
Reminder
1. L1 cache reference 0.3 ns
2. Branch mispredict 3 ns
3. L2 cache reference 7 ns
4. Mutex lock/unlock 80 ns
5. Main memory reference 100 ns
6. Compress 1K bytes with Zippy 10,000 ns
7. Send 2K bytes over 1 Gbps network 20,000 ns
8. Read 1 MB sequentially from memory 250,000 ns
9. Round trip within same datacenter 500,000 ns
10. Disk seek 10,000,000 ns
11. Read 1 MB sequentially from network 5,000,000 ns
12. Read 1 MB sequentially from disk 30,000,000 ns
13. Send packet CA->Netherlands->CA 150,000,000 ns
Reminder 2
Ensure your design works if scale changes by 10X or 20X,
but the right solution for X is often not optimal for 100X
Eventual Consistency - BASE
Along with the CAP conjecture, Brewer suggested a new consistency model - BASE (Basically
Available, Soft state, Eventual consistency)
• The BASE model gives up on Consistency from the CAP theorem.
• This model is optimistic and accepts eventual consistency, in contrast to ACID.
o Given enough time, all nodes will be consistent and every request will result in the
same responses.
• Brewer points out that ACID and BASE are two extremes and one can have a range of options
in choosing the balance between consistency and availability (consistency models).
• Basically Available - the system does guarantee availability, in terms of the CAP theorem. It is
always available, but subsets of data may become unavailable for short periods of time.
• Soft state - the state of the system may change over time, even without input. Data does not
have to be consistent.
• Eventual Consistency - the system will become consistent eventually in the future. ACID, on the
contrary, enforces consistency immediately after any operation.
  • 6. Partition tolerance: •If some nodes crash / communication fails, service still performs as expected
  • 7.
  • 8. In simple words: ● Consistency & Availability = some guaranties of data loss ● Consistency & Partitioning = scaling Why do we need to care about this?
  • 9. Stop theory! Real examples • RDBMS (mysql, postgres) • NoSQL (redis) • RabbitMQ • Eureka
  • 10. • Black-box systems testing. Bugs reproduced in Jepsen are observable in production, not theoretical. But tests are nondeterministic, and they cannot prove correctness, only find errors. • Testing under distributed systems failure modes: faulty networks, unsynchronized clocks, and partial failure. Test suites only evaluate the behavior of healthy clusters • Generative testing: systemc constructs random operations, apply them to the system, and constructs a concurrent history of their results. That history is checked against a model to establish its correctness. Generative (or property-based) tests often reveal edge cases with subtle combinations of inputs. Jepsen (http://jepsen.io/)
  • 12. Jepsen, just add brackets...
  • 13. RDBMS (again theory) • Standardized with SQL • Ubiquitous – widely used and understood • Supports transactions • High availability is achieved via Replication • Master – Master • Master – Slave • Synchronous/Asynchronous
  • 14. Why RDBMS is AC: ACID Atomicity of an operation(transaction) • "All or nothing“ – If part fails, the entire transaction fails. Consistency • Database will remain in a valid state after the transaction. • Means adhering to the database rules (key, uniqueness, etc.) Isolation • 2 Simultaneous transactions cannot interfere one with the other. (Executed as if executed sequentially) Durability • Once a transaction is commited, it remains so indefinitely, even after power loss or crash. (no caching) Definition – ACID
  • 15. ACID in Dist. Systems • Proved problematic in big dist systems • How to guarantee ACID properties ? • Atomicity requires more thought - e.g. two-phase commit (and 3-phase commit, PAXOS…) • Isolation requires to hold all of its locks for the entire transaction duration - High Lock Contention ! • Complex • Prone to failure - algorithm should handle • Failure = outage during write. • Comes with High overhead commits.
  • 16. Reminder: speak about atomicity/locks in java: Withdraw example
  • 17. Does it means that we can’t scale RDBMS out of the box?
  • 18. But we have a PG cluster! Still, in a PG cluster only one node can write. According to Amazon research this adds roughly 5% overhead on the master node, plus network and replica delay for the two-phase commit, so the cluster can only balance reads (e.g. via pgpool; see the sketch below). A PG cluster is not about balancing write load. Okay, at least we have ACID, right? Well… almost. Even though the Postgres server itself is always consistent, the distributed system composed of the server and the client together may not be: it is possible for the client and server to disagree about whether or not a transaction took place.
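A minimal sketch of what "reads only" balancing means for application code, assuming two hypothetical JDBC URLs (one for the single writable primary, one for a replica or pgpool endpoint); the write path cannot be spread across nodes:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;

    public class ReadWriteSplit {
        // Hypothetical endpoints: writes always hit the primary, reads may hit a replica.
        private static final String PRIMARY_URL = "jdbc:postgresql://pg-primary:5432/app";
        private static final String REPLICA_URL = "jdbc:postgresql://pg-replica:5432/app";

        static void withdraw(long accountId, long amount) throws SQLException {
            try (Connection c = DriverManager.getConnection(PRIMARY_URL, "app", "secret");
                 PreparedStatement st = c.prepareStatement(
                         "UPDATE accounts SET balance = balance - ? WHERE id = ?")) {
                st.setLong(1, amount);
                st.setLong(2, accountId);
                st.executeUpdate();            // writes cannot be balanced across nodes
            }
        }

        static long balance(long accountId) throws SQLException {
            try (Connection c = DriverManager.getConnection(REPLICA_URL, "app", "secret");
                 PreparedStatement st = c.prepareStatement(
                         "SELECT balance FROM accounts WHERE id = ?")) {
                st.setLong(1, accountId);
                try (ResultSet rs = st.executeQuery()) {
                    rs.next();
                    return rs.getLong(1);      // may be slightly stale due to replica lag
                }
            }
        }
    }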
  • 19. PG cluster Postgres' commit protocol, like most relational databases, is a special case of two-phase commit, or 2PC. In the first phase, the client votes to commit (or abort) the current transaction, and sends that message to the server. The server checks to see whether its consistency constraints allow the transaction to proceed, and if so, it votes to commit. It writes the transaction to storage and informs the client that the commit has taken place (or failed, as the case may be.) Now both the client and server agree on the outcome of the transaction. What happens if the message acknowledging the commit is dropped before the client receives it? Then the client doesn’t know whether the commit succeeded or not! The 2PC protocol says that we must wait for the acknowledgement message to arrive in order to decide the outcome. Waiting forever isn’t realistic for real systems, so at some point the client will time out and declare an error occurred. The commit protocol is now in an indeterminate state.
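The indeterminate-commit window described above is visible from plain JDBC: when the COMMIT acknowledgement is lost, the driver throws, but the transaction may or may not have been applied. A hedged sketch (connection details and table names are illustrative):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.SQLException;

    public class AmbiguousCommit {
        public static void main(String[] args) throws Exception {
            Connection c = DriverManager.getConnection(
                    "jdbc:postgresql://pg-primary:5432/app", "app", "secret");
            c.setAutoCommit(false);
            try (PreparedStatement st = c.prepareStatement(
                    "UPDATE accounts SET balance = balance - 5 WHERE id = ?")) {
                st.setLong(1, 0);
                st.executeUpdate();
                c.commit();                    // the acknowledgement may be lost on the way back
                System.out.println("committed");
            } catch (SQLException e) {
                // The outcome here is UNKNOWN, not "failed": the server may have committed
                // before the connection broke. The client must reconcile (re-read the row,
                // or use an idempotency key) instead of blindly retrying the write.
                System.out.println("commit outcome indeterminate: " + e.getMessage());
            }
        }
    }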
  • 20. PG cluster + Jepsen + Withdraw example https://aphyr.com/posts/282-jepsen-postgres
  • 21. But we have pg_shard for scaling load! https://www.citusdata.com/citus-products/pg-shard/pg-shard-quick-start-guide Yes, but Postgres with pg_shard is not ACID! Limitations: • Transactional semantics for queries that span multiple shards - for example, you are a financial institution and you sharded your data on customer_id; you would now like to withdraw money from one customer's account and credit it to another one's account in a single transaction block. • Unique constraints on columns other than the partition key, or foreign key constraints. • Distributed JOINs are also not supported in pg_shard.
  • 22. pg_shard Frequently Asked Questions. How does pg_shard handle INSERT/UPDATE/DELETE commands? pg_shard requires that any modifications (INSERTs, UPDATEs, or DELETEs) involve exactly one shard. In the UPDATE and DELETE case, this means commands must include a WHERE qualification on the partition column that restricts the query to a single shard. Such qualifications usually take the form of an equality clause on the table's partition column. As for INSERT commands, the partition column of the row being inserted must be specified using an expression that can be reduced to a constant. For instance, a value such as 3, or even char_length('bob'), would be suitable, though rand() would not. In addition, INSERT commands must specify exactly one row to be inserted. Note that this restriction implies that commands similar to "INSERT INTO table SELECT col_one, col_two FROM other_table" are not currently supported. From an implementation standpoint, pg_shard determines the shard involved in a given INSERT, UPDATE, or DELETE command and then rewrites the SQL of that command to reference the shard table. The rewritten SQL is then sent to the placements for that shard to complete processing of the command. How exactly does pg_shard distribute my data? Rather than using hosts as the unit of distribution, pg_shard creates many small shards and places them across many hosts in a round-robin fashion. For example, a user might have eight hosts in their cluster but 256 shards with a replication factor of two. Shard one would be created on hosts A and B, shard two on B and C, and so forth. The advantage of this approach is that the additional load incurred after a host failure is spread among many other hosts instead of falling entirely on a single replica.
  • 23. But MySQL Galera has a master-master cluster approach! Multi-master replication means that applications update the same tables on different masters, and the changes replicate automatically between those masters. Row-based replication to avoid data drift: replication depends on deterministic updates - a transaction that changes 10 rows on the original master should change exactly the same rows when it executes against a replica. Unfortunately, many SQL statements that are deterministic in master/slave replication are non-deterministic in multi-master topologies. Consider the following example, which gives a 10% raise to employees in department #35: UPDATE emp SET salary = salary * 1.1 WHERE dep_id = 35; If all masters add employees, then the number of employees who actually get the raise will vary depending on whether such additions have replicated to all masters. Your servers will very likely become inconsistent with statement replication. The fix is to enable row-based replication using binlog-format=row in my.cnf. Row replication transfers the exact row updates from each master to the others and eliminates ambiguity, but it reduces performance dramatically.
  • 24. MySQL Galera: prevent key collisions on INSERTs. For applications that use auto-increment keys, MySQL offers a useful trick to ensure that such keys do not collide between masters, using the auto-increment-increment and auto-increment-offset parameters in my.cnf. The following example ensures that auto-increment keys start at 1 and increment by 4 to give values like 1, 5, 9, etc. on this server:
    server-id=1
    auto-increment-offset = 1
    auto-increment-increment = 4
This works so long as your applications use auto-increment keys faithfully. However, any table that either does not have a primary key or where the key is not an auto-increment field is suspect. You need to hunt them down and ensure the application generates a proper key that does not collide across masters, for example using UUIDs or by putting the server ID into the key. There is a query on the MySQL information schema to help locate tables that do not have an auto-increment primary key.
  • 25. MySQL Galera: semantic conflicts in applications. MySQL replication cannot resolve conflicts for you; you need to avoid them in your applications. Here are a few tips as you go about this. First, avoid obvious conflicts. These include inserting data with the same keys on different masters (described above), updating rows in two places at once, or deleting rows that are updated elsewhere. Any of these can cause errors that will break replication or cause your masters to become out of sync. The good news is that many of these problems are not hard to detect and eliminate using properly formatted transactions. The bad news is that these are the easy conflicts. There are others that are much harder to address. For example, accounting systems need to generate unbroken sequences of numbers for invoices. A common approach is to use a table that holds the next invoice number and increment it in the same transaction that creates a new invoice (see the sketch below). Another accounting example is reports that need to read the value of accounts consistently, for example at monthly close. Neither example works off-the-shelf in a multi-master system with asynchronous replication, as they both require some form of synchronization to ensure global consistency across masters. (The same holds for our salary-and-balance task.) These and other such cases may force substantial application changes. Some applications simply do not work with multi-master topologies for this reason.
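The invoice-number pattern looks like this on a single master; the SELECT ... FOR UPDATE lock is exactly the kind of global synchronization that asynchronous multi-master replication cannot provide (schema and names are illustrative):

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;

    public class InvoiceNumbers {
        // Allocates the next invoice number and creates the invoice in one transaction.
        // Works on one master; with two masters replicating asynchronously, both sides
        // can hand out the same number, breaking the "unbroken sequence" requirement.
        static long createInvoice(Connection c, long customerId) throws SQLException {
            c.setAutoCommit(false);
            try (PreparedStatement lock = c.prepareStatement(
                     "SELECT next_value FROM invoice_counter FOR UPDATE"); // single-row counter table
                 PreparedStatement bump = c.prepareStatement(
                     "UPDATE invoice_counter SET next_value = next_value + 1");
                 PreparedStatement insert = c.prepareStatement(
                     "INSERT INTO invoices (number, customer_id) VALUES (?, ?)")) {
                long number;
                try (ResultSet rs = lock.executeQuery()) {
                    rs.next();
                    number = rs.getLong(1);
                }
                bump.executeUpdate();
                insert.setLong(1, number);
                insert.setLong(2, customerId);
                insert.executeUpdate();
                c.commit();
                return number;
            } catch (SQLException e) {
                c.rollback();
                throw e;
            }
        }
    }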
  • 26. MySQL Galera: have a plan for sorting out mixed-up data. Master/slave replication has its discontents, but at least sorting out messed-up replicas is simple: re-provision from another slave or the master. Not so with multi-master topologies - you can easily get into a situation where all masters have transactions you need to preserve, and the only way to sort things out is to track down differences and update masters directly. Here are some thoughts on how to do this.
1. Ensure you have tools to detect inconsistencies. Tungsten has built-in consistency checking with the 'trepctl check' command. You can also use the Percona Toolkit pt-table-checksum to find differences. Be forewarned that neither of these works especially well on large tables and they may give false results if more than one master is active when you run them.
2. Consider relaxing foreign key constraints. I love foreign keys because they keep data in sync. However, they can also create problems when fixing messed-up data, because the constraints may break replication or make it difficult to go table-by-table when synchronizing across masters. There is an argument for being a little more relaxed in multi-master settings.
3. Switch masters off if possible. Fixing problems is a lot easier if you can quiesce applications on all but one master.
4. Know how to fix data. Being handy with SQL is very helpful for fixing up problems. I find SELECT INTO OUTFILE and LOAD DATA INFILE quite handy for moving changes between masters. Don't forget SET SESSION SQL_LOG_BIN=0 to keep changes from being logged and breaking replication elsewhere. There are also various synchronization tools like pt-table-sync, but I do not know enough about them to make recommendations.
5. At this point it's probably worth mentioning commercial support. Unless you are a replication guru, it is very comforting to have somebody to call when you are dealing with messed-up masters. Even better, expert advice early on can help you avoid problems in the first place.
  • 27. Mysql Galera + Jepsen + Withdraw https://aphyr.com/posts/327-jepsen-mariadb-galera-cluster Imagine a system of two bank accounts, each with a balance of $10.
    SET SESSION TRANSACTION ISOLATION LEVEL SERIALIZABLE
    set autocommit=0
    select * from accounts where id = 0
    select * from accounts where id = 1
    UPDATE accounts SET balance = 8 WHERE id = 0
    UPDATE accounts SET balance = 12 WHERE id = 1
    COMMIT
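The same transfer expressed from Java, mirroring the SQL above (connection details are assumed; this is the workload whose results are discussed on the following slides):

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;

    public class Transfer {
        // Moves `amount` from account 0 to account 1 under SERIALIZABLE isolation.
        static void transfer(Connection c, long amount) throws SQLException {
            c.setAutoCommit(false);
            c.setTransactionIsolation(Connection.TRANSACTION_SERIALIZABLE);
            long from = read(c, 0);
            long to = read(c, 1);
            write(c, 0, from - amount);
            write(c, 1, to + amount);
            c.commit();  // under first-committer-wins, one of two conflicting transfers must abort
        }

        private static long read(Connection c, long id) throws SQLException {
            try (PreparedStatement st = c.prepareStatement(
                    "SELECT balance FROM accounts WHERE id = ?")) {
                st.setLong(1, id);
                try (ResultSet rs = st.executeQuery()) {
                    rs.next();
                    return rs.getLong(1);
                }
            }
        }

        private static void write(Connection c, long id, long balance) throws SQLException {
            try (PreparedStatement st = c.prepareStatement(
                    "UPDATE accounts SET balance = ? WHERE id = ?")) {
                st.setLong(1, balance);
                st.setLong(2, id);
                st.executeUpdate();
            }
        }
    }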
  • 28. Mysql Galera + Jepsen + Withdraw Case 1: T1 commits before T2’s start time. Operations from T1 and T2 cannot interleave, by Lemma 1, because their intervals do not overlap. Case 2: T1 and T2 operate on disjoint sets of accounts. They serialize trivially. Case 3: T1 and T2 operate on intersecting sets of accounts, and T1 commits before T2 commits. Then T1 wrote data that T2 also wrote, and committed in T2’s interval, which violates First-committer-wins. T2 must abort. Case 4: T1 and T2 operate on intersecting sets of accounts, and T1 commits after T2 commits. Then T2 wrote data that T1 also wrote, and committed in T1’s interval, which violates First-committer-wins. T1 must abort.
  • 29. Mysql Galera + Jepsen + Withdraw. Read-only transactions trivially serialize with one another. Do they serialize with respect to transfer transactions? The answer is yes: since every read-only transaction sees only committed data in a Snapshot Isolation system, and commits no data itself, it must appear to take place atomically at some time between other transactions.
    SET SESSION TRANSACTION ISOLATION LEVEL SERIALIZABLE
    set autocommit=0
    select * from accounts
    COMMIT
  • 30. Mysql Galera + Jepsen + Withdraw
  • 31. MySQL Galera conclusion. The transfer transactions should have kept the total amount of money at $20, but by the end of the test the totals all sum to $22, and in this run 25% of the funds in the system mysteriously vanish. These results remain stable after all other transactions have ended; they are not a concurrency anomaly. Dirty reads! No first-committer-wins, no snapshot isolation. And without snapshot isolation, well… it is not clear exactly what Galera does guarantee. Master-master works for append-only databases. http://scale-out-blog.blogspot.com/2012/04/if-you-must-deploy-multi-master.html http://www.onlamp.com/2016/04/20/advanced-mysql-
  • 32.
  • 33. We know that Instagram uses Postgres and Pinterest uses MySQL! True! https://engineering.pinterest.com/blog/sharding-pinterest-how-we-scaled-our-mysql-fleet >> In 2011, we hit traction. By some estimates, we were growing faster than any other previous startup. Around September 2011, every piece of our infrastructure was over capacity. We had several NoSQL technologies, all of which eventually broke catastrophically. We also had a boatload of MySQL slaves we were using for reads, which made for lots of irritating bugs, especially with caching.
  • 34. Pinterest: How we sharded. Whatever we were going to build needed to meet our needs and be stable, performant and repairable. In other words, it needed to not suck, and so we chose a mature technology as our base to build on: MySQL. We intentionally ran away from auto-scaling newer technology like MongoDB, Cassandra and Membase, because their maturity was simply not far enough along (and they were crashing in spectacular ways on us!). Aside: I still recommend startups avoid the fancy new stuff - try really hard to just use MySQL. Trust me. I have the scars to prove it. MySQL is mature, stable and it just works. Not only do we use it, but it's also used by plenty of other companies pushing even bigger scale. MySQL supports our need for ordering data requests, selecting certain ranges of data and row-level transactions. It has a hell of a lot more features, but we don't need or use them. But MySQL is a single-box solution, hence the need to shard our data. Here's our solution: We started with eight EC2 servers running one MySQL instance each:
  • 35. Pinterest How we sharded So how do we distribute our data to these shards? We created a 64 bit ID that contains the shard ID, the type of the containing data, and where this data is in the table (local ID). The shard ID is 16 bits, type ID is 10 bits and local ID is 36 bits. The savvy additionology experts out there will notice that only adds to 62 bits. My past in compiler and chip design has taught me that reserve bits are worth their weight in gold. So we have two (set to zero). ID = (shard ID << 46) | (type ID << 36) | (local ID<<0)
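The ID layout from the slide, written out in Java; the field widths (16/10/36 bits plus two reserved bits) come from the text above, while the sample values are only illustrative:

    public final class ShardedId {
        // Layout: [2 reserved][16 shard][10 type][36 local] = 64 bits.
        static long compose(long shardId, long typeId, long localId) {
            return (shardId << 46) | (typeId << 36) | localId;
        }

        static long shardId(long id) { return (id >>> 46) & 0xFFFF; }      // 16 bits
        static long typeId(long id)  { return (id >>> 36) & 0x3FF; }       // 10 bits
        static long localId(long id) { return id & 0xFFFFFFFFFL; }         // 36 bits

        public static void main(String[] args) {
            long id = compose(241, 1, 7075733);
            System.out.println(shardId(id) + " " + typeId(id) + " " + localId(id));
        }
    }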
  • 36. RabbitMQ RabbitMQ is a distributed message queue, and is probably the most popular open-source implementation of the AMQP messaging protocol. It supports a wealth of durability, routing, and fanout strategies, and combines excellent documentation with well-designed protocol extensions.
  • 38. RabbitMQ cluster + CAP. According to the table there is a choice between CP and CA, but in real life CP means losing data. From http://www.rabbitmq.com/partitions.html: RabbitMQ clusters do not tolerate network partitions well. If you are thinking of clustering across a WAN, don't. You should use federation or the shovel instead. However, sometimes accidents happen. RabbitMQ stores information about queues, exchanges, bindings etc. in Erlang's distributed database, Mnesia.
  • 39. RabbitMQ cluster and partitions. RabbitMQ also offers three ways to deal with network partitions automatically: pause-minority mode, pause-if-all-down mode and autoheal mode. (The default behaviour is referred to as ignore mode.)
In pause-minority mode RabbitMQ will automatically pause cluster nodes which determine themselves to be in a minority (i.e. fewer or equal than half the total number of nodes) after seeing other nodes go down. It therefore chooses partition tolerance over availability from the CAP theorem. This ensures that in the event of a network partition, at most the nodes in a single partition will continue to run. The minority nodes will pause as soon as a partition starts, and will start again when the partition ends.
In pause-if-all-down mode, RabbitMQ will automatically pause cluster nodes which cannot reach any of the listed nodes. In other words, all the listed nodes must be down for RabbitMQ to pause a cluster node. This is close to the pause-minority mode, however, it allows an administrator to decide which nodes to prefer, instead of relying on the context. For instance, if the cluster is made of two nodes in rack A and two nodes in rack B, and the link between racks is lost, pause-minority mode will pause all nodes. In pause-if-all-down mode, if the administrator listed the two nodes in rack A, only nodes in rack B will pause. Note that it is possible the listed nodes get split across both sides of a partition: in this situation, no node will pause. That is why there is an additional ignore/autoheal argument to indicate how to recover from the partition.
In autoheal mode RabbitMQ will automatically decide on a winning partition if a partition is deemed to have occurred, and will restart all nodes that are not in the winning partition. Unlike pause_minority mode it therefore takes effect when a partition ends, rather than when one starts. The winning partition is the one which has the most clients connected (or if this produces a draw, the one with the most nodes; and if that still produces a draw then one of the partitions is chosen in an unspecified way).
  • 40. How to scale? Federation. Federation allows an exchange or queue on one broker to receive messages published to an exchange or queue on another (the brokers may be individual machines, or clusters). Communication is via AMQP (with optional SSL), so for two exchanges or queues to federate they must be granted appropriate users and permissions. Federated exchanges are connected with one-way point-to-point links. By default, messages will only be forwarded over a federation link once, but this can be increased to allow for more complex routing topologies. Some messages may not be forwarded over the link; if a message would not be routed to a queue after reaching the federated exchange, it will not be forwarded in the first place. Federated queues are similarly connected with one-way point-to-point links. Messages will be moved between federated queues an arbitrary number of times to follow the consumers. Typically you would use federation to link brokers across the internet for pub/sub messaging and work queueing.
The Shovel. Connecting brokers with the shovel is conceptually similar to connecting them with federation. However, the shovel works at a lower level. Whereas federation aims to provide opinionated distribution of exchanges and queues, the shovel simply consumes messages from a queue on one broker, and forwards them to an exchange on another. Typically you would use the shovel to link brokers across the internet when you need more control than federation provides.
  • 41. How to scale? Horizontally! We suggest a simpler way of scaling than federation or the shovel: just run N independent clusters (as with MySQL or Postgres); see the sketch below. (Diagram: three independent stacks, each with its own gateways and backends talking to its own RabbitMQ cluster.)
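A minimal sketch of the "just start N clusters" idea: the client deterministically picks one of the independent clusters by hashing a routing key, so no cluster needs to span the WAN. Host names and the choice of key are assumptions, not part of the original deck:

    import com.rabbitmq.client.Connection;
    import com.rabbitmq.client.ConnectionFactory;

    public class ClusterPicker {
        // One entry point per independent RabbitMQ cluster.
        private static final String[] CLUSTERS = {
            "rabbit-a.internal", "rabbit-b.internal", "rabbit-c.internal"
        };

        // Route a key (e.g. a user id) to the same cluster every time.
        static Connection connectFor(String key) throws Exception {
            int idx = Math.floorMod(key.hashCode(), CLUSTERS.length);
            ConnectionFactory factory = new ConnectionFactory();
            factory.setHost(CLUSTERS[idx]);
            return factory.newConnection();
        }
    }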
  • 43. Redis. 1. Redis is fast! 2. Redis loses data! (CP)
  • 44. Redis fast? Exceptionally fast: Redis can perform about 110,000 SETs per second and about 81,000 GETs per second (single-threaded). Operations are atomic: all Redis operations are atomic, which ensures that two clients accessing the server concurrently will see the updated value. (Discuss CAS in Java - see the sketch below.)
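For the note about CAS in Java: a compare-and-set loop on AtomicLong is the single-process analogue of Redis's atomic operations. Purely illustrative:

    import java.util.concurrent.atomic.AtomicLong;

    public class CasCounter {
        private final AtomicLong value = new AtomicLong();

        // Lock-free increment: retry until no other thread changed the value in between.
        long increment() {
            while (true) {
                long current = value.get();
                long next = current + 1;
                if (value.compareAndSet(current, next)) {
                    return next;
                }
            }
        }
    }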
  • 45. Redis fast? Access by value O(1), by score O(log(N)). For numerical members, the value is the score. For string members, the score is a hash of the string.
  • 46. Redis scalable? Yes! Due to the simple format of data storage (key -> value), where every entry uses a hash for lookup, it is very simple to shard by hash range or value range, with no additional effort compared to MongoDB (speak about MongoDB indexes), for example. Approaches: 1. Proxy-assisted partitioning means that our clients send requests to a proxy that is able to speak the Redis protocol, instead of sending requests directly to the right Redis instance. The proxy will make sure to forward our request to the right Redis instance according to the configured partitioning schema, and will send the replies back to the client. The Redis and Memcached proxy Twemproxy implements proxy-assisted partitioning. 2. Query routing means that you can send your query to a random instance, and the instance will make sure to forward your query to the right node. Redis Cluster implements a hybrid form of query routing, with the help of the client (the request is not directly forwarded from one Redis instance to another, but the client gets redirected to the right node).
  • 47. Redis scalable? Yes! Due to the simple format of data storage (key -> value), where every entry uses a hash for lookup, it is very simple to shard by hash range or value range, with no additional effort compared to MongoDB (speak about MongoDB indexes), for example. Approaches: 1. crc32 / proxy-assisted partitioning: clients send requests to a proxy that speaks the Redis protocol instead of sending them directly to the right Redis instance. The proxy forwards each request to the right Redis instance according to the configured partitioning schema and sends the replies back to the client. The Redis and Memcached proxy Twemproxy implements proxy-assisted partitioning (see the sketch below). 2. Redis Cluster / query routing: you can send your query to a random instance, and the instance will make sure to forward it to the right node. Redis Cluster implements a hybrid form of query routing, with the help of the client (the request is not directly forwarded from one Redis instance to another, but the client gets redirected to the right node). Discuss how to configure this, and presharding. http://redis.io/topics/cluster-tutorial http://redis.io/topics/partitioning http://docs.spring.io/spring-data/redis/docs/current/reference/html/#redis:sentinel
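A sketch of the simplest option, client-side partitioning by CRC32; Jedis is used only as an example client and the host names are made up:

    import java.nio.charset.StandardCharsets;
    import java.util.zip.CRC32;
    import redis.clients.jedis.Jedis;

    public class RedisSharding {
        private final Jedis[] shards = {
            new Jedis("redis-0.internal", 6379),
            new Jedis("redis-1.internal", 6379),
            new Jedis("redis-2.internal", 6379),
        };

        // crc32(key) mod N decides which instance owns the key.
        private Jedis shardFor(String key) {
            CRC32 crc = new CRC32();
            crc.update(key.getBytes(StandardCharsets.UTF_8));
            return shards[(int) (crc.getValue() % shards.length)];
        }

        public void set(String key, String value) { shardFor(key).set(key, value); }
        public String get(String key)             { return shardFor(key).get(key); }
    }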
  • 48. What about HA? Redis offers asynchronous primary->secondary replication. A single server is chosen as the primary, which can accept writes. It relays its state changes to secondary servers, which follow along. Asynchronous means that you don't have to wait for a write to be replicated before the primary returns a response to the client. 1. Sentinel: Sentinel tries to establish a quorum between Sentinel nodes, agree on which Redis servers are alive, and promote any which appear to have failed. If we colocate the Sentinel nodes with the Redis nodes, this should allow us to promote a new primary in the majority component (should one exist); see the sketch below. 2. Redis Cluster (discuss slots)! http://redis.io/topics/replication http://redis.io/topics/sentinel http://redis.io/topics/cluster-tutorial http://redis.io/topics/sentinel-clients
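Sentinel-aware clients ask the sentinels who the current primary is instead of using a fixed host. A sketch along the lines of the Spring Data Redis documentation linked above (the master name "mymaster" and the ports are the usual documentation examples, not our deployment):

    import org.springframework.data.redis.connection.RedisSentinelConfiguration;
    import org.springframework.data.redis.connection.jedis.JedisConnectionFactory;

    public class SentinelConfig {
        // The factory resolves the current primary for "mymaster" via the sentinels,
        // and re-resolves it after a failover.
        JedisConnectionFactory redisConnectionFactory() {
            RedisSentinelConfiguration sentinel = new RedisSentinelConfiguration()
                    .master("mymaster")
                    .sentinel("127.0.0.1", 26379)
                    .sentinel("127.0.0.1", 26380);
            return new JedisConnectionFactory(sentinel);
        }
    }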
  • 50. Eureka (pure AP algorithm). Once the server starts receiving traffic, all operations performed on the server are replicated to all of the peer nodes that the server knows about. If an operation fails for some reason, the information is reconciled on the next heartbeat, which also gets replicated between servers. When a Eureka server comes up, it tries to get all of the instance registry information from a neighbouring node. If there is a problem getting the information from a node, the server tries all of the peers before it gives up. If the server is able to successfully get all of the instances, it sets the renewal threshold that it should be receiving based on that information. If at any time the renewals fall below the percentage configured for that value (below 85% within 15 minutes), the server stops expiring instances to protect the current instance registry information. This is called self-preservation mode and is primarily used as a protection in scenarios where there is a network partition between a group of clients and the Eureka server. In these scenarios, the server tries to protect the information it already has. In a mass outage this may cause clients to get instances that no longer exist. Clients must therefore be resilient to a Eureka server returning an instance that is non-existent or unresponsive. The best protection in these scenarios is to time out quickly and try other servers (see the sketch below). This is what we do in the balancer, the gateways (file service, RabbitMQ) and the backends (RabbitMQ). In the case where the server is not able to get the registry information from the neighbouring node, it waits for a few minutes (5 minutes) so that the clients can register their information.
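What "time out quickly and try other servers" can look like on the caller side, independent of any particular Eureka client library; a generic sketch in which the instance list would come from the registry and the /health path is only an assumption:

    import java.io.IOException;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.util.List;

    public class ResilientCall {
        // Try each registered instance with short timeouts; a stale registry entry
        // only costs one quick failure before the next instance is tried.
        static int callFirstHealthy(List<String> instanceUrls) throws IOException {
            IOException last = null;
            for (String base : instanceUrls) {
                try {
                    HttpURLConnection conn =
                        (HttpURLConnection) new URL(base + "/health").openConnection();
                    conn.setConnectTimeout(500);   // fail fast on dead or non-existent instances
                    conn.setReadTimeout(500);
                    return conn.getResponseCode();
                } catch (IOException e) {
                    last = e;                      // instance unreachable, try the next one
                }
            }
            throw last != null ? last : new IOException("no instances available");
        }
    }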
  • 51. Eureka (AP). What happens during network outages between peers? In the case of network outages between peers, the following may happen: 1. The heartbeat replication between peers may fail; the server detects this situation and enters self-preservation mode, protecting the current state. 2. The situation corrects itself once network connectivity is restored to a stable state. When the peers are able to communicate again, the registration information is automatically transferred to the servers that do not have it. The bottom line is that during network outages the server tries to be as resilient as possible, but there is a possibility of clients having different views of the servers during that time.
  • 52. Eureka vs Zookeeper CAP. Zookeeper is built on a Paxos-like consensus protocol (ZAB) and chooses consistency: state is shared through quorum-based transactions, so during a network partition the minority side stops serving requests, i.e. it gives up availability rather than consistency. Eureka, by contrast, simply ships its entire registry state between peers all the time and stays available.
  • 53. 1. Eureka integrates better with other NetflixOSS components (Ribbon especially). 2. ZooKeeper is hard. We've gotten pretty good at it, but it requires care and feeding. https://tech.knewton.com/blog/2014/12/eureka-shouldnt-use- zookeeper-service-discovery/ Eureka vs Zookeeper
  • 55. Push service 1. Stateless 2. Locks 3. Performance
  • 57. Each component - scaling capability / CAP type / best for:
Platform module - independent, stateless - HA & performance
Redis - CP - performance
Weave DNS - AP - HA w/o consistency
Docker Swarm - CA - HA
RabbitMQ - queues replicated across nodes - HA & slight performance
Eureka - AP - HA w/o consistency
Conf service - stateless - HA
  • 58. Reminder
1. L1 cache reference: 0.3 ns
2. Branch mispredict: 3 ns
3. L2 cache reference: 7 ns
4. Mutex lock/unlock: 80 ns
5. Main memory reference: 100 ns
6. Compress 1K bytes with Zippy: 10,000 ns
7. Send 2K bytes over 1 Gbps network: 20,000 ns
8. Read 1 MB sequentially from memory: 250,000 ns
9. Round trip within same datacenter: 500,000 ns
10. Disk seek: 10,000,000 ns
11. Read 1 MB sequentially from network: 5,000,000 ns
12. Read 1 MB sequentially from disk: 30,000,000 ns
13. Send packet CA->Netherlands->CA: 150,000,000 ns
  • 59. Reminder 2. Ensure your design works if scale changes by 10x or 20x, but remember that the right solution for X is often not optimal for 100X.
  • 60. Eventual Consistency - BASE. Along with the CAP conjecture, Brewer suggested a new consistency model: BASE (Basically Available, Soft state, Eventual consistency).
• The BASE model gives up on Consistency from the CAP theorem.
• This model is optimistic and accepts eventual consistency, in contrast to ACID: given enough time, all nodes will become consistent and every request will return the same response.
• Brewer points out that ACID and BASE are two extremes, and one has a range of options in choosing the balance between consistency and availability (consistency models).
• Basically Available - the system does guarantee availability, in terms of the CAP theorem; it is always available, but subsets of data may become unavailable for short periods of time.
• Soft state - the state of the system may change over time, even without input; data does not have to be consistent.
• Eventual Consistency - the system will become consistent eventually. ACID, on the contrary, enforces consistency immediately after any operation.