1. Memory is the new disk, disk is the new tape
Bela Ban, JBoss / Red Hat
2. Motivation
● We want to store our data in memory
– Memory access is faster than disk access, even across a network
– A DB requires network communication, too
● The disk is used for archival purposes
● Not a replacement for DBs!
– Only a key-value store
– NoSQL
3. Problems
● #1: How do we provide memory large enough to store the data (e.g. 2 TB)?
● #2: How do we guarantee persistence?
– Survival of data between reboots / crashes
4. #1: Large memory
● We aggregate the memory of all nodes in a cluster into one large virtual memory space
– 100 nodes with 10 GB each == 1 TB of virtual memory
5. #2: Persistence
● We store keys redundantly on multiple nodes
– Unless all nodes on which key K is stored crash at the same time, K is persistent
● We can also store the data on disk
– To prevent data loss in case all cluster nodes crash
– This can be done asynchronously, on a background thread (see the sketch below)
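A minimal sketch of such an asynchronous write-behind, assuming a plain in-memory map and a single background writer thread. All class and method names here are illustrative, not ReplCache's actual persistence code.

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.*;
import java.util.concurrent.*;

// Puts are applied to the in-memory map immediately; a single background
// thread streams each update to disk, off the hot path.
public class AsyncDiskStore {
    private final ConcurrentMap<String, String> memory = new ConcurrentHashMap<>();
    private final ExecutorService writer = Executors.newSingleThreadExecutor();
    private final Path dir;

    public AsyncDiskStore(Path dir) throws IOException {
        this.dir = Files.createDirectories(dir);
    }

    public void put(String key, String value) {
        memory.put(key, value);               // fast, in-memory
        writer.submit(() -> {                 // archival copy, written later
            try {
                Files.write(dir.resolve(key), value.getBytes(StandardCharsets.UTF_8));
            } catch (IOException e) {
                e.printStackTrace();          // a real store would retry or log
            }
        });
    }

    public String get(String key) {
        return memory.get(key);               // reads never touch the disk
    }

    public void close() {
        writer.shutdown();                    // queued writes are drained first
    }
}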
7. Store every key on every node
A    B    C    D
K1   K1   K1   K1
K2   K2   K2   K2
K3   K3   K3   K3
K4   K4   K4   K4

● RAID 1
● Pro: data is available everywhere
– No network round trip
– Data loss only when all nodes crash
● Con: we can only use 25% of our memory (with 4 nodes)
8. Store every key on 1 node only
A    B    C    D
K1   K2   K3   K4

● RAID 0, JBOD
● Pro: we can use 100% of our memory
● Con: data loss on node crash
– No redundancy
9. Store every key on K nodes
A    B    C    D
K1   K1
     K2   K2
          K3   K3
K4             K4

● K is configurable (2 in the example)
● Variable RAID
● Pro: we can use a variable % of our memory
– The user determines the tradeoff between memory consumption and risk of data loss
– E.g. 4 nodes with 10 GB each: K=2 gives 20 GB usable, vs 10 GB with full replication and 40 GB with no redundancy
10. So how do we determine on which nodes the keys are stored?
11. Consistent hashing
● Given a key K and a set of nodes, CH(K) will always pick the same node P for K
– We can also pick a list {P,Q} for K
● Anyone 'knows' that K is on P
● If P leaves, CH(K) will pick another node Q and rebalance the affected keys
● A good CH will rebalance at most 1/N of the keys (where N = number of cluster nodes); see the sketch below
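A minimal consistent-hashing sketch in Java: nodes are placed on a ring of hash values (with virtual nodes to smooth the distribution), and CH(K) walks clockwise from hash(K), returning the first N distinct nodes, primary owner first. The ring layout and the MD5-based hash are illustrative assumptions, not JGroups' actual implementation.

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.*;

public class ConsistentHash {
    private final TreeMap<Long, String> ring = new TreeMap<>();

    public void addNode(String node) {
        for (int i = 0; i < 100; i++)            // virtual nodes smooth the distribution
            ring.put(hash(node + "#" + i), node);
    }

    public void removeNode(String node) {
        ring.values().removeIf(n -> n.equals(node));
    }

    /** Returns the owner list {P, Q, ...} for key K, primary owner first. */
    public List<String> pick(String key, int owners) {
        List<String> result = new ArrayList<>();
        List<String> walk = new ArrayList<>(ring.tailMap(hash(key)).values());
        walk.addAll(ring.values());              // wrap around the ring
        for (String node : walk) {
            if (!result.contains(node))
                result.add(node);
            if (result.size() == owners)
                break;
        }
        return result;
    }

    private static long hash(String s) {
        try {
            byte[] d = MessageDigest.getInstance("MD5").digest(s.getBytes(StandardCharsets.UTF_8));
            long h = 0;
            for (int i = 0; i < 8; i++)
                h = (h << 8) | (d[i] & 0xffL);
            return h;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}

With nodes A..D added and owners=2, pick("K2", 2) might return [B, C]; after removeNode("B"), calling pick again yields a new owner list, which is what drives the rebalancing in the following slides.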
12. Example
A    B    C    D
K1   K1
     K2   K2
          K3   K3
K4             K4

● K2 is stored on B (primary owner) and C (backup owner)
13. Example
A    B    C    D
K1   K1
     K2   K2
          K3   K3
K4             K4

● Node B now crashes
14. Example
A    B*   C    D
K1   K1   K1
     K2   K2   K2
          K3   K3
K4             K4
(* B has crashed; its copies of K1 and K2 are gone)

● C (the backup owner of K2) copies K2 to D
– C is now the primary owner of K2
● A copies K1 to C
– C is now the backup owner of K1
15. Rebalancing
● Unless all N owners of a key K crash at exactly the same time, K is always stored redundantly
● When fewer than N owners crash, rebalancing copies/moves keys to other nodes, so that we have N owners again (see the sketch below)
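A sketch of the rebalancing step, reusing the ConsistentHash sketch above: on a view change, each surviving node recomputes CH(K) for its local keys and pushes copies to the current owners. The class and the send() stand-in are illustrative assumptions; a real implementation would also skip owners that already hold a key and drop keys the node no longer owns.

import java.util.*;

public class Rebalancer {
    private final ConsistentHash ch;
    private final String self;
    private final int replCount;
    private final Map<String, String> localKeys = new HashMap<>();

    public Rebalancer(ConsistentHash ch, String self, int replCount) {
        this.ch = ch;
        this.self = self;
        this.replCount = replCount;
    }

    public void storeLocally(String key, String value) {
        localKeys.put(key, value);
    }

    /** Called when the cluster membership changes, e.g. when B crashes. */
    public void onNodeCrashed(String crashed) {
        ch.removeNode(crashed);
        for (Map.Entry<String, String> e : localKeys.entrySet()) {
            for (String owner : ch.pick(e.getKey(), replCount)) {
                if (!owner.equals(self))
                    send(owner, e.getKey(), e.getValue());  // copy to the (possibly new) owner
            }
        }
    }

    private void send(String node, String key, String value) {
        System.out.printf("copy %s -> %s%n", key, node);    // stand-in for a network send
    }
}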
16. Enter ReplCache
● ReplCache is a distributed hashmap spanning the entire cluster
● Operations: put(K,V), get(K), remove(K)
● For every key, we can define how many times we'd like it to be stored in the cluster (see the example below):
– 1: RAID 0
– -1: RAID 1
– N: variable RAID
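A short usage sketch, following the put(key, value, repl_count, timeout) style of the JGroups ReplCache demo; exact signatures may differ across JGroups versions, and "udp.xml" and the cluster name are placeholders.

import org.jgroups.blocks.ReplCache;

public class ReplCacheDemo {
    public static void main(String[] args) throws Exception {
        ReplCache<String, String> cache = new ReplCache<>("udp.xml", "replcache-cluster");
        cache.start();

        cache.put("K1", "v1", (short) 1, 0);   //  1: RAID 0, one copy only
        cache.put("K2", "v2", (short) -1, 0);  // -1: RAID 1, a copy on every node
        cache.put("K3", "v3", (short) 2, 0);   //  2: variable RAID, two owners

        System.out.println(cache.get("K2"));
        cache.remove("K1");
        cache.stop();
    }
}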
17. Use of ReplCache
[Diagram: HTTP clients reach Apache, which load-balances via mod_jk to a JBoss cluster; each JBoss node runs a Servlet plus ReplCache, with a DB behind the cluster]
19. Use cases
● JBoss AS: session distribution using Infinispan
– For data scalability, sessions are stored only N times in the cluster
● GridFS (Infinispan)
– I/O over the grid
– Files are chunked into slices; each slice is stored in the grid, redundantly if needed (see the sketch below)
– Store a 4 GB DVD in a grid where each node has only 2 GB of heap
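A minimal chunking sketch in the spirit of GridFS: a large file is cut into fixed-size slices, each stored under "name#index". The plain Map stands in for the distributed cache, and the chunk size is an illustrative assumption.

import java.io.IOException;
import java.io.InputStream;
import java.nio.file.*;
import java.util.Arrays;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class Chunker {
    static final int CHUNK_SIZE = 8 * 1024;   // real chunk sizes are configurable

    public static void store(Path file, Map<String, byte[]> grid) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            byte[] buf = new byte[CHUNK_SIZE];
            int n, index = 0;
            while ((n = in.read(buf)) > 0)     // one grid entry per slice
                grid.put(file.getFileName() + "#" + index++, Arrays.copyOf(buf, n));
        }
    }

    public static void main(String[] args) throws IOException {
        Map<String, byte[]> grid = new ConcurrentHashMap<>();  // stand-in for the grid
        store(Paths.get(args[0]), grid);
        System.out.println(grid.size() + " chunks stored");
    }
}

Since no single entry exceeds CHUNK_SIZE, a file larger than any one node's heap can still be stored, as long as the slices (and their redundant copies) fit in the aggregated memory of the cluster.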
20. Use cases
● Hibernate Over Grid (OGM)
– Replaces the DB backend with an Infinispan-backed grid
21. Conclusion
● Given enough nodes in a cluster, we can provide persistence for data
● Unlike RAID, where everything is stored fully redundantly (even /tmp), we can define persistence guarantees per key
● Ideal for data sets which need to be accessed quickly
– For the paranoid, we can still stream to disk
22. Conclusion
● Data is distributed over a grid
– Cache is closer to clients
– No bottleneck to the DBMS
– Keys are on different nodes