2. Bootstrapping
What is Bootstrapping?
Adding new nodes to a cluster is called “bootstrapping”.
Ways of Adding a New Node
There are two ways of adding a node:
– The new node is assigned a random token, which gives its position in the ring. It gossips its location to the rest of
the ring, and the nodes exchange information about one another.
– The new node reads its config file to contact its initial contact points.
• New nodes are added manually by an administrator via the CLI or web interface provided by Cassandra.
http://s3.amazonaws.com/ppt-download/cassandraekaterinberg2013-131212053553-phpapp01.pdf?response-content-disposition=attachment&Signature=7pB%2BhMgGqV1vxcRUaqCbCt2%2BH6o%3D&Expires=1458678552&AWSAccessKeyId=AKIAJ6D6SEMXSASXHDAQ
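The first approach can be illustrated with a minimal Python sketch. It assumes a toy token space of 2**16 positions and hypothetical node names; Cassandra's real token range depends on the partitioner. Each node's token marks its position on the ring, and a key belongs to the first node whose token is at or after the key's token, wrapping around at the end.

```python
import bisect
import random

RING_SIZE = 2**16  # toy token space (assumption; not Cassandra's real range)


class Ring:
    """Toy token ring: each node owns the range ending at its token."""

    def __init__(self):
        self.tokens = []   # sorted list of tokens
        self.owner = {}    # token -> node name

    def add_node(self, name, token=None, rng=random):
        # A bootstrapping node picks a random position unless one is given.
        if token is None:
            token = rng.randrange(RING_SIZE)
        while token in self.owner:        # avoid a collision on the toy ring
            token = (token + 1) % RING_SIZE
        bisect.insort(self.tokens, token)
        self.owner[token] = name
        return token

    def node_for(self, key_token):
        # First token >= key_token, wrapping around to the start of the ring.
        i = bisect.bisect_left(self.tokens, key_token % RING_SIZE)
        return self.owner[self.tokens[i % len(self.tokens)]]


ring = Ring()
ring.add_node("node-a", token=10_000)
ring.add_node("node-b", token=40_000)
print(ring.node_for(25_000))   # node-b
ring.add_node("node-c", token=30_000)
print(ring.node_for(25_000))   # node-c
```

Once node-c joins at token 30,000, it takes over keys that node-b served before. This is the effect that bootstrapping has on range ownership.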
3. Bootstrapping Contd..
• These initial contact points are known as seeds. A newly added node uses them to learn about the other nodes;
the ultimate goal is for all nodes in the cluster to discover one another.
• Seeds can also come from a configuration service such as ZooKeeper, a centralized service for maintaining
configuration information, naming, providing distributed synchronization, and providing group services.
“Because Coordinating Distributed Systems is a Zoo”
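How seeds lead to full discovery can be sketched in Python. This is a simplified model, not Cassandra's gossip protocol: views are plain sets of node names (real gossip exchanges versioned state), and the function and node names are hypothetical. Every node starts out knowing only itself and the seeds, and in each round it merges views with one randomly chosen peer.

```python
import random


def gossip_round(views, rng):
    """Each node contacts one peer it knows; both merge their views."""
    for node in list(views):
        peers = sorted(views[node] - {node})
        if not peers:
            continue
        peer = rng.choice(peers)
        merged = views[node] | views[peer]
        views[node] = set(merged)
        views[peer] = set(merged)


def discover(nodes, seeds, rng, max_rounds=100):
    """Return the number of rounds until every node knows every other node."""
    # Initially every node knows only itself and the seeds (seeds must be in nodes).
    views = {n: {n, *seeds} for n in nodes}
    everyone = set(nodes)
    for rounds in range(1, max_rounds + 1):
        gossip_round(views, rng)
        if all(view == everyone for view in views.values()):
            return rounds
    return None  # did not converge within max_rounds


rounds = discover(["seed", "n1", "n2", "n3", "n4"], ["seed"], random.Random(0))
print("rounds until full discovery:", rounds)
```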
4. Facts!!!
• Comparison with Amazon’s Dynamo, a highly available key-value storage system.
“Dynamo’s load is no where close to what we see in
practice over here at Facebook.” –Avinash Lakshman
nosqlmatters2012-130102154135-phpapp01.pdf
5. Configuration
In addition to seeds, you'll also need to configure the IP interfaces to listen on for gossip and CQL
(listen_address and rpc_address, respectively). Use a listen_address that is reachable from all other nodes, and an
rpc_address that is accessible to clients.
Once everything is configured and the nodes are running, use the bin/nodetool status utility to verify a properly
connected cluster. For example:
https://docs.datastax.com/en/cassandra/2.0/cassandra/operations/ops_add_node_to_cluster_t.html
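The settings above can be sanity-checked before starting a node. The Python sketch below is illustrative only: it assumes the settings have been loaded into a flat dict with hypothetical keys (the real cassandra.yaml nests the seed list under seed_provider), and it assumes IP addresses rather than hostnames.

```python
import ipaddress


def check_node_settings(settings):
    """Return a list of problems found in a flat dict of node settings."""
    problems = []
    seeds = [s.strip() for s in settings.get("seeds", "").split(",") if s.strip()]
    if not seeds:
        problems.append("no seeds configured")
    for key in ("listen_address", "rpc_address"):
        value = str(settings.get(key, ""))
        try:
            addr = ipaddress.ip_address(value)
        except ValueError:
            problems.append(f"{key} is not a valid IP address: {value!r}")
            continue
        # Peers cannot connect to a wildcard address, so gossip needs a concrete one.
        if key == "listen_address" and addr.is_unspecified:
            problems.append("listen_address must be a concrete, reachable address")
    return problems


good = {"seeds": "10.0.0.1,10.0.0.2", "listen_address": "10.0.0.3", "rpc_address": "10.0.0.3"}
bad = {"seeds": "", "listen_address": "0.0.0.0", "rpc_address": "10.0.0.3"}
print(check_node_settings(good))
print(check_node_settings(bad))
```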
6. Environment
• Node outages are often transient but may last for extended intervals.
• A network outage rarely signifies a permanent departure and should not result in re-balancing of the
partition assignment or repair of the unreachable replicas.
• Manual errors could result in unintentional startup of new Cassandra nodes. As a result, an explicit
mechanism is considered appropriate to initiate the addition and removal of nodes from a Cassandra
instance.
• Administrator – uses a command line tool or a browser to connect to a Cassandra node and issue a
membership change to join or leave the cluster.
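The distinction between a transient outage and an explicit membership change can be sketched as follows. This is a toy Python model with hypothetical class and method names, not Cassandra's implementation: failure detection only flags a node as unreachable, while range ownership changes only when an administrator issues a join or leave.

```python
class Membership:
    """Toy cluster view: failures are transient, membership changes are explicit."""

    def __init__(self, nodes):
        self.members = set(nodes)      # changes only on admin commands
        self.unreachable = set()       # changes on failure detection

    def mark_unreachable(self, node):
        # An outage is assumed transient: keep the node's ranges assigned.
        if node in self.members:
            self.unreachable.add(node)

    def mark_reachable(self, node):
        self.unreachable.discard(node)

    def admin_join(self, node):
        self.members.add(node)

    def admin_leave(self, node):
        self.members.discard(node)
        self.unreachable.discard(node)

    def range_owners(self):
        # Ownership follows membership, not reachability.
        return sorted(self.members)


cluster = Membership(["a", "b", "c"])
cluster.mark_unreachable("b")
print(cluster.range_owners())   # ['a', 'b', 'c']
cluster.admin_leave("b")
print(cluster.range_owners())   # ['a', 'c']
```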
7. Scaling the cluster
• Whenever a new node is added to the system, it is assigned a token such that it can alleviate a heavily loaded node.
• The new node takes over part of the range that another node was previously responsible for.
• The Cassandra bootstrapping algorithm is initiated from any other node in the system, using either a command line utility or the web
dashboard.
• The node giving up the data streams it to the new node using kernel-kernel copy techniques.
Cassandra Ring showing scalability.
https://www.google.com/search?q=scalability+in+cassandra&biw=1366&bih=667&source=lnms&tbm=isch&sa=X&ved=0ahUKEwjtztnc4dfLAhWEzoMKHQ1zD4MQ_AUIBygC#imgrc=qusi2veDeVAH4M%3A
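One way to choose a token that relieves a heavily loaded node is sketched below in Python. Splitting the hot node's range at its midpoint is an assumption made for illustration (the slides do not say how the token is chosen), and the function name and inputs are hypothetical.

```python
def token_to_relieve_hotspot(tokens, load, ring_size):
    """Pick a token that splits the range of the most heavily loaded node.

    tokens: dict node -> token; load: dict node -> load measure (e.g. bytes).
    Returns (hot_node, new_token).
    """
    ordered = sorted(tokens.items(), key=lambda item: item[1])
    hot = max(load, key=load.get)
    idx = [name for name, _ in ordered].index(hot)
    hot_token = ordered[idx][1]
    prev_token = ordered[idx - 1][1]      # idx - 1 == -1 wraps to the last node
    span = (hot_token - prev_token) % ring_size or ring_size
    return hot, (prev_token + span // 2) % ring_size


tokens = {"a": 100, "b": 500, "c": 900}
load = {"a": 10, "b": 80, "c": 20}
print(token_to_relieve_hotspot(tokens, load, ring_size=1000))   # ('b', 300)
```

Here node b owns the range (100, 500]. A new node placed at token 300 takes over (100, 300], and b keeps (300, 500].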
8. Future
What is the Future?
• Operational experience has shown that data can be transferred at a rate of 40 MB/sec from a single node. Work
is under way to have multiple replicas take part in the bootstrap transfer, parallelizing the effort in a way similar to
BitTorrent, a P2P system used to transfer large files to thousands of locations in a short period of time.
• Facebook uses BitTorrent to distribute updates to Facebook servers.
“Bit Torrent is fantastic for this, it’s really great,” Cook said. “It’s ‘super-duper’ fast and it allows us to alleviate a
lot of scaling concerns we’ve had in the past, where it took forever to get code to the webservers before you could
even boot it up and run it.”
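A rough worked example of why parallel transfer helps, in Python. The 40 MB/sec figure is from the slide; the 100,000 MB data size and the assumption that throughput scales linearly with the number of source replicas are illustrative only.

```python
def bootstrap_seconds(data_mb, per_node_mb_per_s=40.0, sources=1):
    """Idealized transfer time when `sources` replicas stream in parallel."""
    if sources < 1:
        raise ValueError("sources must be at least 1")
    return data_mb / (per_node_mb_per_s * sources)


for n in (1, 2, 3):
    print(n, "source(s):", round(bootstrap_seconds(100_000, sources=n)), "seconds")
```

With a single source, 100,000 MB takes 2,500 seconds (about 42 minutes). Under the idealized linear assumption, three sources would cut that to roughly 833 seconds.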
9. Virtual nodes in Cassandra
• One of the new features in Cassandra 1.2 was virtual nodes (vnodes), a paradigm change from one token or range
per node to many per node. Within a cluster these tokens can be randomly selected and non-contiguous, giving
many smaller ranges that belong to each node.
Advantages?
Use of heterogeneous machines in a cluster.
Handling node failures and backing up.
http://www.datastax.com/dev/blog/virtual-nodes-in-cassandra-1-2
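The heterogeneous-machines advantage can be illustrated with a small Python sketch. It assumes a toy 2**32 token space and hypothetical node names, and it gives a larger machine more random tokens than a smaller one (in Cassandra the per-node token count is set with num_tokens). The share of the ring each node owns then roughly follows its token count.

```python
import random

RING_SIZE = 2**32  # toy token space (assumption)


def assign_vnodes(tokens_per_node, rng):
    """Give each node its own set of random tokens; returns sorted (token, node)."""
    ring = []
    for node, count in tokens_per_node.items():
        ring.extend((rng.randrange(RING_SIZE), node) for _ in range(count))
    return sorted(ring)


def ownership(ring):
    """Fraction of the token space owned by each node."""
    owned = {}
    for i, (token, node) in enumerate(ring):
        prev_token = ring[i - 1][0]            # i == 0 wraps to the last token
        span = (token - prev_token) % RING_SIZE
        owned[node] = owned.get(node, 0) + span
    return {node: span / RING_SIZE for node, span in owned.items()}


# The "big" machine gets four times as many tokens as the "small" one.
own = ownership(assign_vnodes({"big": 512, "small": 128}, random.Random(1)))
for node, share in sorted(own.items()):
    print(f"{node}: {share:.2f} of the ring")
```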