This presentation is intended as a field guide for users of Apache Cassandra.
This guide specifically covers the diagnostics and monitoring tools and methods used in conjunction with Apache Cassandra. It is written in pragmatic order, with the most important tools first. Presented by Alex Thompson at the Sydney Cassandra Meetup.
2. Intro
This presentation is intended as a field guide for users of Apache Cassandra.
This guide specifically covers an explanation of the diagnostics and monitoring
tools and methods used in conjunction with C*. It is written in pragmatic order,
with the most important tools first.
4. >nodetool tpstats
Probably the most important “at a
glance” summary of the health of a
node, and the first diagnostics
command to run.
>nodetool tpstats is better described
as “nodetool thread pool statistics”: it gives
us a real-time measure of each thread
pool in C* and its current workload.
Note: restarting a C* instance clears these statistics
to zero, so you have to run it on a node
that has been up for a while to be able to diagnose
workload.
Pool Name                    Active  Pending  Completed  Blocked  All time blocked
MutationStage                     0        0   25159974        0                 0
ViewMutationStage                 0        0          0        0                 0
ReadStage                         0        0    3231222        0                 0
RequestResponseStage              0        0   36609517        0                 0
ReadRepairStage                   0        0     410293        0                 0
CounterMutationStage              0        0          0        0                 0
MiscStage                         0        0          0        0                 0
CompactionExecutor                8      108     287003        0                 0
MemtableReclaimMemory             0        0        444        0                 0
PendingRangeCalculator            0        0         27        0                 0
GossipStage                       0        0     464348        0                 0
SecondaryIndexManagement          0        0         13        0                 0
HintsDispatcher                   0        0        396        0                 0
MigrationStage                    0        0         25        0                 0
MemtablePostFlush                 0        0       1114        0                 0
ValidationExecutor                0        0        321        0                 0
Sampler                           0        0          0        0                 0
MemtableFlushWriter               0        0        444        0                 0
InternalResponseStage             0        0      68544        0                 0
AntiEntropyStage                  0        0       1209        0                 0
CacheCleanupExecutor              0        0          0        0                 0
Native-Transport-Requests         0        0   35849149        0               536

Message type        Dropped
READ                      4
RANGE_SLICE               0
_TRACE                 5095
HINT                      0
MUTATION                180
COUNTER_MUTATION          0
BATCH_STORE               0
BATCH_REMOVE              0
REQUEST_RESPONSE         23
PAGED_RANGE               0
READ_REPAIR               0
5. >nodetool tpstats
The first thing to check is Pending work on
the thread pools: this node is showing
compactions getting behind. On its own this may
be OK, but together with
other diagnostics it is usually an indication of an overloaded
node.
6. >nodetool tpstats
Next up is All time blocked: in this
case Native-Transport-Requests, which are
calls to the binary CQL port (reads or
writes) that could not be completed
due to overload. Also note the high
Completed count: this node is servicing a lot of
requests.
In combination with the Pending work mentioned
on the prior slide, this is starting to look
like an overloaded node, but let’s dig
deeper...
7. >nodetool tpstats
OK, now the nasty part: Dropped
messages.
These are messages of various types
that the node has received but has
not been able to process due to
overload. To save itself from going
down, C* has gone into “emergency
mode” and shed the messages. We
should never see any dropped
messages. Period.
Let’s go through these messages one by
one...
8. >nodetool tpstats
So that’s 4x READ messages that were
dropped: CQL SELECT
statements that C* could not process
due to overload of this node.
Other nodes with replicas would have
stepped in to satisfy the query*.
*As long as the driver was correctly configured
and the correct consistency level was applied
to the CQL SELECT statement.
9. >nodetool tpstats
5095x _TRACE messages have been
dropped.
This is a problem. Someone has either:
1) turned TRACE on on the server
using: >nodetool settraceprobability 1
2) or, more worryingly, checked in CQL
code at the application tier with
TRACE ON.
TRACE puts an enormous weight on a
node and should never be on in
production!
10. >nodetool tpstats
With TRACE on for this node, all bets
are off; it could be the sole cause of
this node’s problems. TRACE is such a
heavy-hitting process that it can cripple
a node if activated on a production
node, or an entire cluster if
activated on all nodes.
To turn it off, run on all nodes:
>nodetool settraceprobability 0
If it’s in checked-in CQL code, you need
to audit all app-tier code to identify the
offending statement/s.
13. >nodetool tpstats
What to look for:
On a typical node you should not really
see thread pools going into a Pending
state.
Under 10 Pending for
CompactionExecutor can be OK, but
larger numbers
usually indicate a problem.
As for dropped messages, you should
not see any; dropped messages mean there is a real
issue at peak workloads that needs to
be addressed.
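As a sketch of the checks described above, the fragment below parses tpstats-style output and flags the same three signals: Pending backlogs, all-time blocked counts, and dropped messages. The sample text and the threshold of 10 are taken from this deck; the parsing assumes the plain-text column layout shown on the earlier slides and is illustrative, not an official API.

```python
# Minimal health check over `nodetool tpstats` output (abridged sample).
SAMPLE = """\
Pool Name                    Active Pending Completed Blocked All time blocked
MutationStage                     0       0  25159974       0                0
CompactionExecutor                8     108    287003       0                0
Native-Transport-Requests         0       0  35849149       0              536

Message type     Dropped
READ                   4
_TRACE              5095
MUTATION             180
"""

def check_tpstats(text, pending_limit=10):
    """Return human-readable warnings for backlogs, blocking, and drops."""
    warnings = []
    lines = [l for l in text.splitlines() if l.strip()]
    in_dropped = False
    for line in lines[1:]:            # skip the "Pool Name ..." header
        parts = line.split()
        if parts[0] == "Message":     # switch to the dropped-messages table
            in_dropped = True
            continue
        if in_dropped:
            name, dropped = parts[0], int(parts[1])
            if dropped > 0:
                warnings.append(f"{name}: {dropped} dropped messages")
        else:
            name, pending = parts[0], int(parts[2])
            blocked_all_time = int(parts[-1])
            if pending > pending_limit:
                warnings.append(f"{name}: {pending} pending")
            if blocked_all_time > 0:
                warnings.append(f"{name}: {blocked_all_time} blocked all time")
    return warnings

for w in check_tpstats(SAMPLE):
    print(w)
```

Run against the sample, this flags exactly the three problems discussed on the previous slides: the CompactionExecutor backlog, the blocked Native-Transport-Requests, and the dropped READ/_TRACE/MUTATION messages.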
14. >nodetool netstats
Aside from >nodetool tpstats,
>nodetool netstats is your second
go-to diagnostic, giving a good
view of how healthy a node is.
The first thing to check is “Read Repair
Statistics”: these indicate
inconsistencies in data found on this
node, compared to other nodes,
when a query executes. They usually
indicate that the node or cluster
is under stress and may not be
properly provisioned for the workload
it is expected to do.
Mode: NORMAL
Not sending any streams.
Read Repair Statistics:
Attempted: 408271
Mismatch (Blocking): 78
Mismatch (Background): 602
Pool Name        Active  Pending  Completed  Dropped
Large messages      n/a        0      12252      913
Small messages      n/a        0   63614651        0
Gossip messages     n/a        0     480331        0
15. >nodetool netstats
The specific counts we are interested
in are the Mismatch values.
You can see here that, compared to the
number of read repairs attempted
(408271), we have some minor repairs
occurring: 78/602.
These are minor numbers, but they do
indicate that at times this node is
under stress.
16. >nodetool netstats
This is more worrying, though, and quite
unusual. The number of dropped Large
messages indicates to me that
someone is doing something silly here:
either attempting to perform
overly large writes or querying for overly
large SELECTs.
As soon as I saw this, I would start
asking questions as to where these
messages are coming from and put a
stop to the misuse.
17. >nodetool netstats
What to look for:
Large Mismatch values indicate a
node that in the past has been under
severe stress, incapable of keeping
up with write workloads.
Dropped Large messages probably
mean that someone is performing
ridiculous queries or writes against
your system; find them and terminate
them with extreme prejudice.
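To put the Mismatch values in proportion, a quick calculation using the numbers from the netstats output above (the variable names are ours, for illustration):

```python
# Read-repair mismatch rate, from the figures on the netstats slide.
attempted = 408271
mismatch_blocking = 78
mismatch_background = 602

mismatch_rate = (mismatch_blocking + mismatch_background) / attempted
# Roughly 0.17% of read-repair attempts found a mismatch between replicas:
# a minor number, consistent with a node under occasional stress.
print(f"mismatch rate: {mismatch_rate:.4%}")
```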
18. >nodetool cfstats
Rounding out the top 3 diagnostics
commands is >nodetool cfstats, or
more verbosely: nodetool
column family statistics.
Its output is large, detailing statistics on
each table in your cluster; for brevity's
sake let’s take a look at one table’s
output from cfstats...
Table: rollups60
SSTable count: 10
Space used (live): 1757632985
Space used (total): 1757632985
Space used by snapshots (total): 0
Off heap memory used (total): 520044
SSTable Compression Ratio: 0.5405234880604174
Number of keys (estimate): 14317
Memtable cell count: 1251073
Memtable data size: 57091879
Memtable off heap memory used: 0
Memtable switch count: 2
Local read count: 211506
Local read latency: 0.923 ms
Local write count: 18096351
Local write latency: 0.028 ms
Pending flushes: 0
Bloom filter false positives: 0
Bloom filter false ratio: 0.00000
Bloom filter space used: 89280
Bloom filter off heap memory used: 89200
Index summary off heap memory used: 38420
Compression metadata off heap memory used: 392424
Compacted partition minimum bytes: 5723
Compacted partition maximum bytes: 2816159
Compacted partition mean bytes: 47670
Average live cells per slice (last five minutes): 2.7963433445814063
Maximum live cells per slice (last five minutes): 3
Average tombstones per slice (last five minutes): 1.0
Maximum tombstones per slice (last five minutes): 1
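The cfstats block above is plain `key: value` text, so a few lines of code can turn it into a dict for monitoring or alerting. A minimal sketch, assuming the layout shown on the slide (the sample is abridged):

```python
# Parse one table's cfstats block into a dict of metric name -> raw value.
CFSTATS = """\
Table: rollups60
SSTable count: 10
Space used (live): 1757632985
Number of keys (estimate): 14317
Local read count: 211506
Local read latency: 0.923 ms
Local write count: 18096351
Local write latency: 0.028 ms
Compacted partition maximum bytes: 2816159
"""

def parse_cfstats(text):
    stats = {}
    for line in text.splitlines():
        if ":" not in line:
            continue
        key, _, value = line.partition(":")
        stats[key.strip()] = value.strip()
    return stats

stats = parse_cfstats(CFSTATS)
print(stats["SSTable count"])        # "10"
print(stats["Local read latency"])   # "0.923 ms"
```

Values are kept as strings here; a real collector would convert counts to ints and strip units before graphing them.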
19. >nodetool cfstats
There is a lot of useful information
here, but at a glance there are a couple
of key metrics...
20. >nodetool cfstats
SSTable count.
The number of SSTables that make up
this table on this node. This should be
in the 10’s to possibly 100’s; if you see
it higher than that, it usually means
there are problems with compaction
on the node. Problems with
compaction are usually caused by too
many writes for the underlying I/O
capability of the node.
21. >nodetool cfstats
This is the number of partition keys for
this table on this node. If this
table has large amounts of data on
this node and the key count is very low,
it usually means there may be a data
modelling issue... more on this later.
22. >nodetool cfstats
Local read count, Local write count.
Interesting on their own, but more
interesting when viewed together: you
can see there are a lot more writes
than reads on this cluster, that is, the
workload is very heavily write-oriented.
In fact, running the calculation, there are
85 writes for every read! One caveat
here is that we do not know 1) how
long the node has been up and 2)
whether traffic peaks during the
day, so we may have missed read
traffic, which would alter the ratio.
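The 85:1 figure quoted above comes straight from dividing the two counters; a one-liner makes the calculation explicit (values taken from the slide, for this node only):

```python
# Write:read ratio from Local write count / Local read count on the slide.
local_reads = 211506
local_writes = 18096351

ratio = local_writes / local_reads
print(f"~{int(ratio)} writes per read")  # prints "~85 writes per read"
```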
23. >nodetool cfstats
Local read latency, Local write latency.
You can see that the latencies are
quite good: writes are faster than
reads in C*, which is what we would
expect, and with reads under 1 ms this
is a good result.
If you start to see large read latencies,
you need to investigate whether there are
large queries running or potential I/O
issues on the node at the hardware level.
24. >nodetool cfstats
Compacted partition maximum bytes.
This is the largest amount of data under an
individual partition key on this node;
in this case the largest found is 2.8 MB,
which is good.
You really want to keep this number
under 100 MB. Some say 1 GB, but you
would really need to know what you’re
doing if you go to 1 GB.
If you see large values here,
over a couple of hundred MB, then you
may have a data modelling issue.
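The 100 MB rule of thumb above is easy to automate. A hedged sketch, using decimal megabytes as the slide does; the function name and limit are ours, for illustration:

```python
# Check compacted-partition sizes against the ~100 MB rule of thumb.
MB = 1_000_000  # decimal megabytes, matching the slide's "2.8mb"

def partition_warning(size_bytes, limit_mb=100):
    """Format a partition size and say whether it breaches the limit."""
    size_mb = size_bytes / MB
    return f"{size_mb:.1f} MB - {'too large' if size_mb > limit_mb else 'OK'}"

# Compacted partition maximum bytes from the slide:
print(partition_warning(2816159))   # prints "2.8 MB - OK"
```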
25. >nodetool cfstats
Compacted partition mean bytes.
This is the average amount of data
under all partition keys on this
node.
You really want to keep this number
under 100 MB.
If you see large values here, you
know you have a data modelling issue.
26. >nodetool cfstats
Average live cells per slice.
This is a measure of the amount of
data you are pulling back for the
average query (SELECT).
Pulling 10’s or 100’s of cells (values) is
fine; in fact, pulling back 1000’s of cells
on average is fine if that’s what you
intended to do. But if it’s not what you
intended your solution to do, then you
might want to look at who is doing lazy
SELECT * queries on your cluster!
Be aware that larger queries are going
to increase read latency significantly.
27. >nodetool cfstats
Maximum live cells per slice.
Self-explanatory: the largest number of live
cells returned by a query in the last 5 minutes.
28. >nodetool cfstats
Average tombstones per slice.
Tombstones are not returned in
queries, but they have to be read off
disk and filtered through the JVM, so they
can add significant relative overhead
to a query.
If you are pulling back 1x live cell and
100 tombstones in a query, it’s going to
impact your performance.
Tombstones are the result of deletes,
and deletes need to be very carefully
managed and modelled in C*.
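To see why tombstones matter relative to live cells, the averages from the slide can be turned into an overhead figure (a rough illustration, not an official metric):

```python
# Fraction of cells read per slice that are tombstones, from the slide's
# "Average live cells per slice" and "Average tombstones per slice".
avg_live = 2.7963433445814063
avg_tombstones = 1.0

overhead = avg_tombstones / (avg_live + avg_tombstones)
# About a quarter of the cells each slice reads are dead data that gets
# filtered out before the result is returned.
print(f"{overhead:.0%} of cells read per slice are tombstones")
```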
29. >nodetool cfstats
Maximum tombstones per slice.
Self-explanatory: the largest number of
tombstones seen in a query in the last
5 minutes.
30. Summary so far...
That rounds out the top 3 diagnostic nodetool commands in Apache Cassandra:
● nodetool tpstats
● nodetool netstats
● nodetool cfstats
With those 3 commands you can get a very good grasp of the health of a node and its possible issues. If you then see a
pattern cluster-wide, you know you have a general issue (usually workload); if however you only see poor health on a
single node, it’s probably* time to start looking at hardware as the culprit.
*I say probably because there are circumstances where a hot partition on a single node can get hammered with requests; the times I have seen
this is where someone has accidentally turned a tool against C* that focuses on a single partition (thanks security guy).
32. >system.log
On package installs it lives in:
/var/log/cassandra
What to look for:
● Exceptions
● GC events
● Other nodes going UP and DOWN in gossip
● Dropped messages
● WARNs on large partitions / wide rows
● Tombstone warnings
● Repair session failures
● Compactions with large amounts of sstables in them
● Startup problems and warnings
● Topology warnings
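The checklist above lends itself to a simple log scan. The sketch below is illustrative: the patterns match log lines commonly produced by Cassandra (GCInspector pauses, gossip UP/DOWN transitions, tombstone warnings), but exact formats vary by version, so treat the regexes as assumptions to tune against your own system.log.

```python
# Count occurrences of the warning signs listed above in system.log lines.
import re

PATTERNS = {
    "exception":  re.compile(r"Exception"),
    "gc_pause":   re.compile(r"GCInspector"),
    "gossip":     re.compile(r"is now (UP|DOWN)"),
    "dropped":    re.compile(r"dropped"),
    "tombstones": re.compile(r"tombstone", re.IGNORECASE),
}

def scan_log(lines):
    hits = {name: 0 for name in PATTERNS}
    for line in lines:
        for name, pat in PATTERNS.items():
            if pat.search(line):
                hits[name] += 1
    return hits

# Hypothetical sample lines in the general shape of Cassandra log output:
sample = [
    "WARN  GCInspector.java:282 - ConcurrentMarkSweep GC in 2345ms",
    "INFO  Gossiper.java:1029 - InetAddress /10.0.0.2 is now DOWN",
    "WARN  ReadCommand.java:520 - Read 10 live rows and 5000 tombstone cells",
]
print(scan_log(sample))
```

In practice you would feed this the file at /var/log/cassandra/system.log and alert on any nonzero counts.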
34. JMX
Cassandra exposes its metrics via
MBeans; here you see JConsole
connected to a Cassandra node, listing
all the MBeans available for
interrogation.
These JMX MBeans can be
instrumented from Java and Python
interfaces, plus some commercial
products.
DataStax uses these same MBeans to
instrument OpsCenter.
35. JMX
A list of alternatives to JConsole is here: JMX Clients with Apache Cassandra
36. JMX
Invoking an MBean in Java
This is sample code for a simple
method call against an MBean with no
return value; in a useful application
you would return data and present the
result on screen, or store it for
analysis.
This code was stripped from the following link for
educational and training purposes and all copyright
belongs to their respective owners:
http://stackoverflow.com/questions/16583859/execute-a-method-with-jmx-without-jconsole
import javax.management.*;
import javax.management.remote.*;
import com.sun.messaging.AdminConnectionFactory;
import com.sun.messaging.jms.management.server.*;

public class InvokeOp {
    public static void main(String[] args) {
        try {
            // Create administration connection factory
            AdminConnectionFactory acf = new AdminConnectionFactory();
            // Get JMX connector, supplying user name and password
            JMXConnector jmxc = acf.createConnection("AliBaba", "sesame");
            // Get MBean server connection
            MBeanServerConnection mbsc = jmxc.getMBeanServerConnection();
            // Create object name
            ObjectName serviceConfigName = MQObjectName.createServiceConfig("jms");
            // Invoke operation
            mbsc.invoke(serviceConfigName, ServiceOperations.PAUSE, null, null);
            // Close JMX connector
            jmxc.close();
        } catch (Exception e) {
            System.out.println("Exception occurred: " + e.toString());
            e.printStackTrace();
        }
    }
}
37. JMX
Invoking an MBean in Jython (Python
running on the JVM).
This code was stripped from the following link for
educational and training purposes and all copyright
belongs to their respective owners:
https://egkatzioura.com/2014/09/22/connecting-to-jmx-through-jython/
from javax.management.remote import JMXConnector
from javax.management.remote import JMXConnectorFactory
from javax.management.remote import JMXServiceURL
from javax.management import MBeanServerConnection
from javax.management import MBeanInfo
from javax.management import ObjectName
from java.lang import String
from jarray import array
import sys

if __name__ == '__main__':
    if len(sys.argv) > 5:
        serverUrl = sys.argv[1]
        username = sys.argv[2]
        password = sys.argv[3]
        beanName = sys.argv[4]
        action = sys.argv[5]
    else:
        sys.exit(-1)

    credentials = array([username, password], String)
    environment = {JMXConnector.CREDENTIALS: credentials}
    jmxServiceUrl = JMXServiceURL('service:jmx:rmi:///jndi/rmi://' + serverUrl + ':9999/jmxrmi')
    jmxConnector = JMXConnectorFactory.connect(jmxServiceUrl, environment)
    mBeanServerConnection = jmxConnector.getMBeanServerConnection()
    objectName = ObjectName(beanName)
    mBeanServerConnection.invoke(objectName, action, None, None)
    jmxConnector.close()
38. JMX
Invoking an MBean via Jolokia, an
HTTP/JSON bridge to JMX:
https://jolokia.org/
This approach is a little more complex
as an agent needs to be installed on
each node, but once the agent is
running, any language with an HTTP
client, including CPython, can query it;
the sample shown is the Java client
from the Jolokia tutorial.
There are some other Python JMX libraries but I have not
used them so cannot vouch for them.
This code was stripped from the following link for
educational and training purposes and all copyright
belongs to their respective owners:
https://jolokia.org/tutorial.html
import org.jolokia.client.*;
import org.jolokia.client.request.*;
import java.util.Map;

public class JolokiaDemo {
    public static void main(String[] args) throws Exception {
        J4pClient j4pClient = new J4pClient("http://localhost:8080/jolokia");
        J4pReadRequest req = new J4pReadRequest("java.lang:type=Memory",
                                                "HeapMemoryUsage");
        J4pReadResponse resp = j4pClient.execute(req);
        Map<String, Long> vals = resp.getValue();
        long used = vals.get("used");
        long max = vals.get("max");
        int usage = (int) (used * 100 / max);
        System.out.println("Memory usage: used: " + used +
                           " / max: " + max + " = " + usage + "%");
    }
}
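Because Jolokia speaks plain HTTP/JSON, a CPython client needs nothing beyond the standard library. A minimal sketch (the agent URL and port 8778 are assumptions for a default agent install; `parse_heap_usage` and `read_heap_usage` are hypothetical helper names):

```python
import json
import urllib.request

def parse_heap_usage(payload):
    """Extract used/max heap from a parsed Jolokia read response."""
    vals = payload["value"]
    used, maximum = vals["used"], vals["max"]
    return used, maximum, int(used * 100 / maximum)

def read_heap_usage(base_url="http://localhost:8778/jolokia"):
    """Query a Jolokia agent for java.lang:type=Memory HeapMemoryUsage."""
    url = base_url + "/read/java.lang:type=Memory/HeapMemoryUsage"
    with urllib.request.urlopen(url) as resp:
        return parse_heap_usage(json.load(resp))

# Example: parsing a canned response of the shape the agent returns.
sample = {"value": {"used": 536870912, "max": 2147483648, "committed": 1073741824}}
used, maximum, pct = parse_heap_usage(sample)
print("Memory usage: used: %d / max: %d = %d%%" % (used, maximum, pct))
```

The same `/read/<mbean>/<attribute>` URL pattern reaches any Cassandra MBean once the agent is attached, which is what makes Jolokia attractive for non-JVM tooling.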
39. JMX + Node.js
jmx npm
https://www.npmjs.com/package/jmx
Can’t vouch for this one, but Node.js is a great way to serve
JavaScript directly into a GUI; the Meteor project is also an
excellent pub/sub/push system built on Node.js that would
make a great C* Ops GUI.
https://www.meteor.com/
This code was stripped from the following link for
educational and training purposes and all copyright
belongs to their respective owners:
https://www.npmjs.com/package/jmx
var jmx = require("jmx");

client = jmx.createClient({
    host: "localhost", // optional
    port: 3000
});
client.connect();
client.on("connect", function() {
    client.getAttribute("java.lang:type=Memory", "HeapMemoryUsage", function(data) {
        var used = data.getSync('used');
        console.log("HeapMemoryUsage used: " + used.longValue);
        // console.log(data.toString());
    });
    client.setAttribute("java.lang:type=Memory", "Verbose", true, function() {
        console.log("Memory verbose on"); // callback is optional
    });
    client.invoke("java.lang:type=Memory", "gc", [], function(data) {
        console.log("gc() done");
    });
});
40. JMX + Node.js
jolokia npm
https://www.npmjs.com/package/jolokia
Can’t vouch for this one, but Node.js is a great way to serve
JavaScript directly into a GUI; the Meteor project is also an
excellent pub/sub/push system built on Node.js that would
make a great C* Ops GUI.
https://www.meteor.com/
This code was stripped from the following link for
educational and training purposes and all copyright
belongs to their respective owners:
https://www.npmjs.com/package/jolokia
// In Node.js or using Browserify
var Jolokia = require('jolokia');

// In browser
var Jolokia = window.Jolokia;

// Or using RequireJS
require(['./path/to/jolokia'], function(Jolokia) {
    // code below
});

var jolokia = new Jolokia({
    url: '/jmx',    // use full url when in Node.js environment
    method: 'post', // force specific HTTP method
});

jolokia.list().then(function(value) {
    // do something with list of JMX domains
}, function(error) {
    // handle error
});