Tuesday, 27 November 2007
Consistent Hashing
I've bumped into consistent hashing a couple of times lately. The paper that
introduced the idea (Consistent Hashing and Random Trees: Distributed
Caching Protocols for Relieving Hot Spots on the World Wide Web by David
Karger et al) appeared ten years ago, although recently it seems the idea has
quietly been finding its way into more and more services, from Amazon's
Dynamo to memcached (courtesy of Last.fm). So what is consistent hashing and
why should you care? 
The need for consistent hashing arose from limitations experienced while running collections of caching machines - web caches, for example. If you have a collection of n cache machines then a common way of load balancing across them is to put object o in cache machine number hash(o) mod n. This works well until you add or remove cache machines (for whatever reason), for then n changes and every object is hashed to a new location. This can be catastrophic since the originating content servers are swamped with requests from the cache machines. It's as if the cache suddenly disappeared. Which it has, in a sense. (This is why you should care - consistent hashing is needed to avoid swamping your servers!)
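To get a feel for how disruptive this is, here is a small illustrative simulation (my own sketch, not from the original post): it counts how many of 100,000 randomly hashed keys land in a different bucket when a fifth cache machine is added to four under hash(o) mod n placement.

import java.util.Random;

public class ModNRehash {
  public static void main(String[] args) {
    Random random = new Random(42);
    int keys = 100000;
    int moved = 0;
    for (int i = 0; i < keys; i++) {
      int hash = random.nextInt(Integer.MAX_VALUE); // stand-in for hash(o)
      // Bucket with four caches vs. bucket with five caches.
      if (hash % 4 != hash % 5) {
        moved++;
      }
    }
    System.out.printf("%.1f%% of keys moved%n", 100.0 * moved / keys);
  }
}

Typically around 80% of the keys move (roughly n/(n+1) of them), so adding one machine invalidates almost the entire cache.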
It would be nice if, when a cache machine was added, it took its fair share of objects from all the other cache machines. Equally, when a cache machine was removed, it would be nice if its objects were shared between the remaining machines. This is exactly what consistent hashing does - consistently maps objects to the same cache machine, as far as is possible, at least.
The basic idea behind the consistent hashing algorithm is to hash both objects
and caches using the same hash function. The reason to do this is to map the
cache to an interval, which will contain a number of object hashes. If the cache is
removed then its interval is taken over by a cache with an adjacent interval. All
the other caches remain unchanged. 
Let's look at this in more detail. The hash function actually maps objects and caches to a number range. This should be familiar to every Java programmer - the hashCode method on Object returns an int, which lies in the range -2^31 to 2^31-1. Imagine mapping this range into a circle so the values wrap around. Here's a picture of the circle with a number of objects (1, 2, 3, 4) and caches (A, B, C) marked at the points that they hash to (based on a diagram from Web Caching with Consistent Hashing by David Karger et al):
[Figure: the circle with objects 1, 2, 3 and 4 and caches A, B and C marked at the points they hash to]

Demonstration
To find which cache an object goes in, we move clockwise round the circle until we find a cache point. So in the diagram above, we see that objects 1 and 4 belong in cache A, object 2 belongs in cache B and object 3 belongs in cache C. Consider what happens if cache C is removed: object 3 now belongs in cache A, and all the other object mappings are unchanged. If another cache, D, is then added in the position marked, it will take objects 3 and 4, leaving only object 1 belonging to A.
This works well, except the size of the intervals assigned to each cache is pretty hit and miss. Since it is essentially random it is possible to have a very non-uniform distribution of objects between caches. The solution to this problem is to introduce the idea of "virtual nodes", which are replicas of cache points in the circle. So whenever we add a cache we create a number of points in the circle for it.
You can see the effect of this in the following plot which I produced by simulating storing 10,000 objects in 10 caches using the code described below. On the x-axis is the number of replicas of cache points (with a logarithmic scale). When it is small, we see that the distribution of objects across caches is unbalanced, since the standard deviation as a percentage of the mean number of objects per cache (on the y-axis, also logarithmic) is high. As the number of replicas increases the distribution of objects becomes more balanced. This experiment shows that a figure of one or two hundred replicas achieves an acceptable balance (a standard deviation that is roughly between 5% and 10% of the mean).
[Plot: standard deviation of objects per cache, as a percentage of the mean (y-axis, logarithmic), against the number of replicas of cache points (x-axis, logarithmic)]
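The post doesn't include the simulation code itself, so the following is a rough reconstruction of the experiment. It assumes the ConsistentHash class from the Implementation section below and an MD5-backed HashFunction like the one sketched there; the cache and object names are made up.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class BalanceSimulation {
  public static void main(String[] args) {
    HashFunction md5 = new MD5HashFunction();  // assumed implementation, see below
    int[] replicaCounts = {1, 10, 100, 1000};
    for (int replicas : replicaCounts) {
      List<String> caches = new ArrayList<String>();
      for (int c = 0; c < 10; c++) {
        caches.add("cache" + c);
      }
      ConsistentHash<String> ch =
          new ConsistentHash<String>(md5, replicas, caches);

      // Store 10,000 objects and count how many land on each cache.
      Map<String, Integer> counts = new HashMap<String, Integer>();
      for (int i = 0; i < 10000; i++) {
        String cache = ch.get("object" + i);
        Integer n = counts.get(cache);
        counts.put(cache, n == null ? 1 : n + 1);
      }

      // Standard deviation as a percentage of the mean (1,000 objects per cache).
      double mean = 10000.0 / 10;
      double sumSq = 0;
      for (String cache : caches) {
        Integer n = counts.get(cache);
        double diff = (n == null ? 0 : n) - mean;
        sumSq += diff * diff;
      }
      double stdDev = Math.sqrt(sumSq / 10);
      System.out.printf("replicas=%4d  stddev=%.1f%% of mean%n",
          replicas, 100.0 * stdDev / mean);
    }
  }
}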
Implementation

For completeness here is a simple implementation in Java. In order for consistent hashing to be effective it is important to have a hash function that mixes well. Most implementations of Object's hashCode do not mix well - for example, they typically produce a restricted number of small integer values - so we have a HashFunction interface to allow a custom hash function to be used. MD5 hashes are recommended here.
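The HashFunction interface itself isn't shown in the post, so its exact shape is an assumption. A minimal version consistent with how the class below calls it (on both Strings and arbitrary keys), backed by MD5 as recommended, might look like this:

import java.security.MessageDigest;

// Assumed shape of the interface: the post only shows that hash(...) is
// called on Strings and on arbitrary keys and returns an int.
public interface HashFunction {
  int hash(Object o);
}

// One possible MD5-backed implementation, folding the first four bytes of
// the digest into an int.
class MD5HashFunction implements HashFunction {
  public int hash(Object o) {
    try {
      MessageDigest md5 = MessageDigest.getInstance("MD5");
      byte[] digest = md5.digest(o.toString().getBytes("UTF-8"));
      return ((digest[0] & 0xFF) << 24) | ((digest[1] & 0xFF) << 16)
           | ((digest[2] & 0xFF) << 8) | (digest[3] & 0xFF);
    } catch (Exception e) {
      throw new IllegalStateException(e);
    }
  }
}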
import java.util.Collection;
import java.util.SortedMap;
import java.util.TreeMap;

public class ConsistentHash<T> {

  private final HashFunction hashFunction;
  private final int numberOfReplicas;
  private final SortedMap<Integer, T> circle = new TreeMap<Integer, T>();

  public ConsistentHash(HashFunction hashFunction,
      int numberOfReplicas, Collection<T> nodes) {
    this.hashFunction = hashFunction;
    this.numberOfReplicas = numberOfReplicas;
    for (T node : nodes) {
      add(node);
    }
  }

  public void add(T node) {
    // One point on the circle for each replica of the node.
    for (int i = 0; i < numberOfReplicas; i++) {
      circle.put(hashFunction.hash(node.toString() + i), node);
    }
  }

  public void remove(T node) {
    for (int i = 0; i < numberOfReplicas; i++) {
      circle.remove(hashFunction.hash(node.toString() + i));
    }
  }

  public T get(Object key) {
    if (circle.isEmpty()) {
      return null;
    }
    int hash = hashFunction.hash(key);
    if (!circle.containsKey(hash)) {
      // No node at this exact point: take the next point clockwise,
      // wrapping around to the start of the circle if necessary.
      SortedMap<Integer, T> tailMap = circle.tailMap(hash);
      hash = tailMap.isEmpty() ? circle.firstKey() : tailMap.firstKey();
    }
    return circle.get(hash);
  }

}
The circle is represented as a sorted map of integers, which represent the hash
values, to caches (of type T here). 
When a ConsistentHash object is created each node is added to the circle map a
number of times (controlled by numberOfReplicas). The location of each
replica is chosen by hashing the node's name along with a numerical suffix, and
the node is stored at each of these points in the map. 
To find a node for an object (the get method), the hash value of the object is
used to look in the map. Most of the time there will not be a node stored at this
hash value (since the hash value space is typically much larger than the number
of nodes, even with replicas), so the next node is found by looking for the first
key in the tail map. If the tail map is empty then we wrap around the circle by
getting the first key in the circle. 
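Putting the pieces together, a minimal usage sketch might look like the following (the cache names, the key and the MD5HashFunction are the illustrative assumptions from the sketches above, not part of the original post):

import java.util.Arrays;

public class ConsistentHashExample {
  public static void main(String[] args) {
    ConsistentHash<String> circle = new ConsistentHash<String>(
        new MD5HashFunction(), 200, Arrays.asList("cacheA", "cacheB", "cacheC"));

    // The same key always maps to the same cache.
    String before = circle.get("user:1234");
    System.out.println("key stored on " + before);

    // Removing a cache only remaps the keys that were stored on it;
    // keys on the other caches stay where they were.
    circle.remove("cacheC");
    String after = circle.get("user:1234");
    System.out.println(before.equals("cacheC")
        ? "key moved to " + after
        : "key stayed on " + after);
  }
}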
Usage

So how can you use consistent hashing? You are most likely to meet it in a library, rather than having to code it yourself. For example, as mentioned above, memcached, a distributed memory object caching system, now has clients that support consistent hashing. Last.fm's ketama by Richard Jones was the first, and there is now a Java implementation by Dustin Sallings (which inspired my simplified demonstration implementation above). It is interesting to note that it is only the client that needs to implement the consistent hashing algorithm - the memcached server is unchanged. Other systems that employ consistent hashing include Chord, which is a distributed hash table implementation, and Amazon's Dynamo, which is a key-value store (not available outside Amazon).
Posted by Tom White at 17:26
Labels: Distributed Systems, Hashing

14 comments:
morrita said...
Good article! 
I've made a Japanese translation of your article, which is available at 
http://www.hyuki.com/yukiwiki/wiki.cgi?ConsistentHashing . 
If you have any trouble, please let me know. 
Thank you for your work.
1 December 2007 at 06:06
Tom White said...
morrita - Glad you enjoyed the post and thanks for the translation! Tom
2 December 2007 at 22:26
Marcus said...
Cool! I'm as we speak creating a distributed caching and searching system which uses JGroups for membership. The biggest problem I faced was this exact thing. What to do on the member-joined/left events and for the system to be able to know at all times to which node to send what command :) 
The caching system strictly follows the Map (and SortedMap) interface and a bunch of implementations have been implemented: LFU, LRU, MRU, disk-based B+Tree (jdbm), ehcache wrapper, memcached Java client wrapper, hibernate support... 
I like the Map interface since it is quite clean. 
The impl I'm working on now is a cache/persister which uses HDFS as the persistence layer. See how that turns out. The line between a cache and a persistence engine is fine. 
And of course all caches must be searchable = my own indexer/searcher + Lucene free text index/search, ohh and all must be able to work in a distributed environment.. fuck it is a big task.
25 December 2007 at 09:37
marcusherou said...
Hi. Do you have any clue how to create an algorithm which tracks the history of joins/leaves of members and delivers the same node for the same key if it previously looked it up? Perhaps I'm explaining this in bad terms, but something like an (in-memory or persistent) database in conjunction with a consistent hash. 
Perhaps: 

public Address getAddress(Object key) 
{ 
  if (lookedUpMap.containsKey(key)) 
  { 
    return (Address) lookedUpMap.get(key); 
  } 
  else 
  { 
    Address a = get(key); 
    lookedUpMap.put(key, a); 
    return a; 
  } 
} 

Then it is up to the client to check if the previously stored node is reachable at this very moment. 
An extension to this cache would be to return a Set whose size is equal to the number of replicas of each key/value. 
If a value is stored on two or more nodes then the possibility that at least one of the servers is up increases a lot. I'm building a distributed storage solution and the data cannot be lost just because the cache "recalculates" the hash :) 
Ahhh enough already, I will implement something like this :)
3 January 2008 at 20:37
John Cormie said...
Nice explanation of consistent hashing, but there is a subtle bug in the
implementation. The problem is hash collisions. Suppose two different nodes x
and y have replicas that hash to the same Integer key in "circle". The sequence
of calls "add(x); add(y);" will clobber one of x's replicas, and a later call to
remove(y) will clear y's replica from "circle" without restoring x's. 
One consequence of this bug is that as nodes come and go you may slowly lose
replicas. A more serious consequence is that if two clients notice changes to the
set of nodes in a different order (very possible in a distributed system), clients
will no longer agree on the key to node mapping. For example, suppose nodes
x and y have replicas that collide as above, and node x fails around the same
time that new (replacement?) node y comes online. In response, suppose that
ConsistentHash client A invokes add(y) ... remove(x) and that another client B
does the same but in the reverse order. From now on, the set of keys
corresponding to the clobbered replica will map to node y at B, while A will
map them somewhere else. 
How likely are collisions? If we model hashing as a random function, expect a collision after only about 2^(32/2) = 65536 insertions with Integer sized keys (http://en.wikipedia.org/wiki/Birthday_paradox#Cast_as_a_collision_problem). If numberOfReplicas is set to the suggested 200 replicas per node, we expect a collision in a system with only 328 nodes.
31 January 2009 at 21:51
Christophe said...
I have written a small test app that tells me how many nodes should be relocated in the event of a node addition, by comparing the same dataset over 2 hashsets. 
It consistently reports that nearly all nodes should be moved ... am I doing something wrong?
http://pastebin.com/f459047ef
31 May 2009 at 01:40
chingju said...
great article, and the japanese translation from morrita is cool!!!
19 September 2009 at 03:19
Danny said...
In response to John Cormie about the collisions: 
All you need is a list of nodes as the value paired with the hash in the circle, and a comparator for the nodes that cannot return 0. 
When 2 nodes collide to the same hash, put them both in the list, in ascending order. The "smaller" node is first and has precedence. When you remove a node, you make sure to remove the right one from the list and leave the other alone. 
This is a good time to drop the generics and use an Object in the circle values (node vs single-linked-list of nodes), to save some memory.
26 February 2010 at 02:51
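A minimal sketch of the list-per-point circle Danny describes might look like the following; the tie-break rule of ordering nodes by their toString() value is an assumption, not something specified in the thread.

import java.util.ArrayList;
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;

// Sketch only: each point on the circle holds a list of nodes instead of a
// single node, so two replicas hashing to the same Integer no longer clobber
// each other.
public class CollisionTolerantCircle<T> {
  private final SortedMap<Integer, List<T>> circle =
      new TreeMap<Integer, List<T>>();

  public void add(int point, T node) {
    List<T> nodes = circle.get(point);
    if (nodes == null) {
      nodes = new ArrayList<T>();
      circle.put(point, nodes);
    }
    // Keep the list ordered so every client agrees which node "owns" the point.
    int i = 0;
    while (i < nodes.size()
        && nodes.get(i).toString().compareTo(node.toString()) < 0) {
      i++;
    }
    nodes.add(i, node);
  }

  public void remove(int point, T node) {
    List<T> nodes = circle.get(point);
    if (nodes != null) {
      nodes.remove(node);            // remove only this node's replica
      if (nodes.isEmpty()) {
        circle.remove(point);
      }
    }
  }

  public T get(int hash) {
    if (circle.isEmpty()) {
      return null;
    }
    SortedMap<Integer, List<T>> tail = circle.tailMap(hash);
    int point = tail.isEmpty() ? circle.firstKey() : tail.firstKey();
    return circle.get(point).get(0); // the first ("smallest") node has precedence
  }
}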
Anonymous said...
Thanks so much for explaining the concept so clearly for us people not blessed
with enough smarts to read a technical paper and understand lemmas,
theorems, math to grasp a concept! 
May I request you to explain again in simple terms what vector clocks are? 
Thanks again
2 January 2011 at 18:48
Monish said...
Wonderful, though I missed the concept of the ring in the implementation.
3 January 2011 at 09:50
Sergey Gussak said...
Hello. Thanks a lot to Tom for the article and to John Cormie for finding the
bug. 
There is still another issue in the implementation that may result in a poor distribution of keys between nodes. Particularly this code: 

for (int i = 0; i < numberOfReplicas; i++) { 
  circle.put(hashFunction.hash(node.toString() + i), 
    node); 
} 

Here we take the node's string representation, append an integer value and use the resulting "virtual node" as the parameter for the hash function. The problem is that for nodes of numeric types (Long, Integer etc.), or any other type whose toString() implementation returns a string that ends with a number, there is an increased probability for different nodes to be mapped to the same position on the circle. 
Let me explain by example. 
Assume we have 20 nodes which are represented by the strings "node1", "node2", "node3", ..., "node19" and "node20", and we use 100 replicas for each node. So the following "virtual nodes" are generated by the code above for "node1": 

"node1" + 0 = "node10" 
... 
"node1" + 10 = "node110" 
"node1" + 11 = "node111" 
"node1" + 12 = "node112" 
... 
"node1" + 99 = "node199" 

At the same time, for "node11" we have: 

"node11" + 0 = "node110" 
"node11" + 1 = "node111" 
"node11" + 2 = "node112" 
... 

As you can see, different combinations of node name and suffix result in the same "virtual nodes", i.e. collisions on the circle. 
In our example 90 "virtual nodes" of "node1" will collide with "virtual nodes" of other nodes: 10 with each of "node11" to "node19". 
To solve this problem we can use a separator when generating the "virtual node": 

hashFunction.hash(node.toString() + ":" + i) 

With this we will have: 

"node1" + ":" + 10 = "node1:10" 
"node11" + ":" + 0 = "node11:0" 

The nodes do not collide anymore. 
Also do not forget to use the separator when removing a node.
8 July 2011 at 08:05
Björn said...
Very nice article. 
I understood the concept of how other servers become "responsible" for entries in the hash space when one server goes down. What I did not understand is what happens to the data itself. 
Let's say server X is responsible for an entry V which is, for example, just an integer value. When server X goes down, now server Y is responsible for entry V. But server Y does not know about the actual value of V, right? 
Does the concept of consistent hashing only delegate responsibility to servers
that share a common database? 
Or does server X somehow transfer all its values to server Y while crashing? 
Thanks! 
Björn
11 June 2013 at 12:36
Anuj Tripathi said...
@Bjorn: From what I understand, consistent hashing is used to do load balancing across servers sharing a resource. So to answer your question, I guess it works only for shared resources. Even if there is an independent monitor involved to cause a failover to the new server, I don't think the transition will be seamless, defeating the whole purpose of high availability.
22 September 2013 at 22:58
