Apache Cassandra:
diagnostics and monitoring
Alex Thompson
Solution Architect - APAC
DataStax Australia
Intro
This presentation is intended as a field guide for users of Apache Cassandra.
This guide covers the diagnostics and monitoring tools and methods
used in conjunction with C*. It is written in a pragmatic order,
with the most important tools first.
Diagnostics
>nodetool tpstats
Probably the most important “at a
glance” summary of the health of a
node and the first diagnostics
command to run.
>nodetool tpstats is better described
as “nodetool thread pool statistics”; it
gives us a real-time measure of each
thread pool in C* and its current workload.
Note: if you restart a C* instance these statistics
are cleared to zero, so you have to run it on a node
that has been up for a while to be able to diagnose
workload.
Pool Name Active Pending Completed Blocked All time blocked
MutationStage 0 0 25159974 0 0
ViewMutationStage 0 0 0 0 0
ReadStage 0 0 3231222 0 0
RequestResponseStage 0 0 36609517 0 0
ReadRepairStage 0 0 410293 0 0
CounterMutationStage 0 0 0 0 0
MiscStage 0 0 0 0 0
CompactionExecutor 8 108 287003 0 0
MemtableReclaimMemory 0 0 444 0 0
PendingRangeCalculator 0 0 27 0 0
GossipStage 0 0 464348 0 0
SecondaryIndexManagement 0 0 13 0 0
HintsDispatcher 0 0 396 0 0
MigrationStage 0 0 25 0 0
MemtablePostFlush 0 0 1114 0 0
ValidationExecutor 0 0 321 0 0
Sampler 0 0 0 0 0
MemtableFlushWriter 0 0 444 0 0
InternalResponseStage 0 0 68544 0 0
AntiEntropyStage 0 0 1209 0 0
CacheCleanupExecutor 0 0 0 0 0
Native-Transport-Requests 0 0 35849149 0 536
Message type Dropped
READ 4
RANGE_SLICE 0
_TRACE 5095
HINT 0
MUTATION 180
COUNTER_MUTATION 0
BATCH_STORE 0
BATCH_REMOVE 0
REQUEST_RESPONSE 23
PAGED_RANGE 0
READ_REPAIR 0
>nodetool tpstats
The first thing to check is Pending work
on the thread pools. This node is showing
compactions getting behind; that may
be OK on its own, but combined with
other diagnostics it is usually an
indication of an overloaded node.
>nodetool tpstats
Next, check All time blocked: in this
case Native-Transport-Requests, which
are calls to the binary CQL port (reads
or writes), have been blocked due to
overload. Also note the high Completed
count: this node is servicing a lot of
requests.
In combination with the Pending work
mentioned on the prior slide this is
starting to look like an overloaded
node, but let’s dig deeper...
>nodetool tpstats
OK, now the nasty part: Dropped
messages.
These are messages of various types
that the node has received but has not
been able to process due to overload.
To save itself from going down, C* has
gone into “emergency mode” and shed
the messages. We should never see
any dropped messages. Period.
Let’s go through these message types
one by one….
>nodetool tpstats
So that’s 4 READ messages dropped;
they were CQL SELECT statements
that C* could not process due to
overload of this node.
Other nodes holding replicas would
have stepped in to satisfy the query*.
*As long as the driver was correctly configured
and the correct consistency level was applied
to the CQL SELECT statement.
>nodetool tpstats
5095 TRACE messages have been
dropped.
This is a problem. Someone has either:
1) turned TRACE on on the server
using >nodetool settraceprobability 1, or
2) more worryingly, checked in CQL
code at the application tier with
TRACE ON.
TRACE puts an enormous weight on a
node and should never be on in
production!
>nodetool tpstats
With TRACE on on this node, all bets
are off: it could be the sole cause of
this node’s problems. TRACE is such a
heavy-hitting process that it can retard
a single production node, or an entire
cluster if activated on all nodes.
To turn it off, run on all nodes:
>nodetool settraceprobability 0
If it’s in checked-in CQL code you need
to audit all app-tier code to identify the
offending statement(s).
>nodetool tpstats
TRACE on on a production node earns
my dill award.
>nodetool tpstats
180 MUTATION messages dropped.
MUTATIONs are writes; the server has
not had the headroom to perform
them.
REQUEST_RESPONSE drops are
self-explanatory.
>nodetool tpstats
What to look for:
On a healthy node you should not
really see thread pools going into a
Pending state.
Under 10 Pending for
CompactionExecutor can be OK, but
larger numbers usually indicate a
problem.
As for dropped messages, you should
not see any; they mean there is a real
issue at peak workload that needs to
be addressed.
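The checks above are easy to automate. Below is a minimal sketch that flags backed-up pools in captured tpstats output; the sample text and the threshold of 10 are illustrative assumptions, not official limits.

```python
# Illustrative excerpt of `nodetool tpstats` output (abridged).
TPSTATS_SAMPLE = """\
Pool Name                  Active Pending Completed Blocked All time blocked
MutationStage              0      0       25159974  0       0
CompactionExecutor         8      108     287003    0       0
Native-Transport-Requests  0      0       35849149  0       536
"""

def flag_pools(tpstats_text, pending_threshold=10):
    """Return (pool, pending) pairs whose Pending backlog exceeds the threshold."""
    flagged = []
    for line in tpstats_text.splitlines()[1:]:   # skip the header row
        parts = line.split()
        if len(parts) < 6:                       # ignore blanks and malformed rows
            continue
        name, pending = parts[0], int(parts[2])
        if pending > pending_threshold:
            flagged.append((name, pending))
    return flagged

print(flag_pools(TPSTATS_SAMPLE))   # [('CompactionExecutor', 108)]
```

Run periodically (remembering the counters reset on restart), this gives a cheap early warning before compaction backlog turns into dropped messages.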
>nodetool netstats
Aside from >nodetool tpstats,
>nodetool netstats is your second
go-to diagnostic for a good view of
how healthy a node is.
The first thing to check is “Read Repair
Statistics”: these count inconsistencies
found between this node’s data and
other replicas as queries execute. They
usually indicate, again, that the node or
cluster is under stress and may not be
properly provisioned for the workload
it is expected to do.
Mode: NORMAL
Not sending any streams.
Read Repair Statistics:
Attempted: 408271
Mismatch (Blocking): 78
Mismatch (Background): 602
Pool Name Active Pending Completed Dropped
Large messages n/a 0 12252 913
Small messages n/a 0 63614651 0
Gossip messages n/a 0 480331 0
>nodetool netstats
The specific counts we are interested
in are the Mismatch values.
You can see here that compared to the
number of read repairs attempted
(408271), only a small number of
repairs are occurring: 78 blocking and
602 background.
These are minor numbers, but they do
indicate that this node is at times
under stress.
>nodetool netstats
This is more worrying, though, and quite
unusual. The number of dropped Large
messages indicates to me that
someone is doing something silly here:
either attempting overly large writes or
running overly large SELECTs.
As soon as I saw this, I would start
asking questions as to where these
messages are coming from and put a
stop to the misuse.
>nodetool netstats
What to look for:
Large Mismatch values indicate a
node that has in the past been under
severe stress and incapable of keeping
up with write workloads.
Dropped Large messages probably
mean that someone is performing
ridiculous queries or writes against
your system; find them and terminate
them with extreme prejudice.
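To put the mismatch counts in perspective, it helps to express them as a fraction of attempted read repairs. A quick sketch using the figures from this node's netstats output:

```python
# Figures taken from the netstats output on this slide.
attempted = 408271          # Read Repair Attempted
mismatch_blocking = 78      # Mismatch (Blocking)
mismatch_background = 602   # Mismatch (Background)

mismatch_rate = (mismatch_blocking + mismatch_background) / attempted
print(f"{mismatch_rate:.4%} of attempted read repairs found a mismatch")
```

Well under one percent, which matches the slide's reading: minor, but not zero, so the node is occasionally falling behind on writes.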
>nodetool cfstats
Rounding out the top 3 diagnostics
commands is >nodetool cfstats, or
more verbosely: nodetool
columnfamily statistics.
It’s a large output detailing statistics
for each table in your cluster; for
brevity’s sake, let’s take a look at one
table’s output from cfstats….
Table: rollups60
SSTable count: 10
Space used (live): 1757632985
Space used (total): 1757632985
Space used by snapshots (total): 0
Off heap memory used (total): 520044
SSTable Compression Ratio: 0.5405234880604174
Number of keys (estimate): 14317
Memtable cell count: 1251073
Memtable data size: 57091879
Memtable off heap memory used: 0
Memtable switch count: 2
Local read count: 211506
Local read latency: 0.923 ms
Local write count: 18096351
Local write latency: 0.028 ms
Pending flushes: 0
Bloom filter false positives: 0
Bloom filter false ratio: 0.00000
Bloom filter space used: 89280
Bloom filter off heap memory used: 89200
Index summary off heap memory used: 38420
Compression metadata off heap memory used: 392424
Compacted partition minimum bytes: 5723
Compacted partition maximum bytes: 2816159
Compacted partition mean bytes: 47670
Average live cells per slice (last five minutes): 2.7963433445814063
Maximum live cells per slice (last five minutes): 3
Average tombstones per slice (last five minutes): 1.0
Maximum tombstones per slice (last five minutes): 1
>nodetool cfstats
There is a lot of useful information
here, but at a glance there are a couple
of key metrics...
>nodetool cfstats
SSTable count.
The number of SSTables that make up
this table on this node. This should be
in the tens to possibly hundreds;
anything higher usually means there
are problems with compaction on the
node, and compaction problems are
usually caused by too many writes for
the underlying I/O capability of the
node.
>nodetool cfstats
Number of keys (estimate).
This is the number of partition keys for
this table on this node. If the table
holds a large amount of data on this
node but the key count is very low, it
usually means there may be a data
modelling issue... more on this later.
>nodetool cfstats
Local read count, Local write count.
Interesting on their own, but more
interesting when viewed together: you
can see there are a lot more writes
than reads on this node; that is, the
workload is very heavily write-oriented.
In fact, running the numbers, there are
roughly 85 writes for every read! One
caveat here is that we do not know
1) how long the node has been up or
2) whether traffic peaks during the
day, so we may have missed read
traffic that would alter the ratio.
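The "85 writes per read" figure is just the ratio of the two counters from this table's cfstats output:

```python
# Counters from the cfstats output on this slide.
local_read_count = 211506
local_write_count = 18096351

# Integer ratio; remember both counters reset when the node restarts,
# so this only reflects the window since the last restart.
writes_per_read = local_write_count // local_read_count
print(f"~{writes_per_read} writes for every read")
```

The same two-line calculation is worth running on any table you suspect has an unexpected workload shape.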
>nodetool cfstats
Local read latency, Local write latency.
These latencies are quite good: writes
are faster than reads in C*, which is
what we would expect, and with reads
under 1 ms this is a good result.
If you start to see large read latencies,
investigate whether large queries are
running or there are potential I/O
issues on the node at the hardware
level.
>nodetool cfstats
Compacted partition maximum bytes.
This is the largest amount of data
under any individual partition key on
this node; in this case the largest
found is 2.8 MB, which is good.
You really want to keep this number
under 100 MB; some say 1 GB, but you
would really need to know what you’re
doing to go that large.
If you see values here over a couple of
hundred MB, you may have a data
modelling issue.
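That rule of thumb is easy to encode as a monitoring check. The 100 MB threshold below is the slide's guideline, not a hard Cassandra limit:

```python
MAX_HEALTHY_PARTITION_BYTES = 100 * 1024 * 1024   # ~100 MB rule of thumb

def partition_size_ok(compacted_max_bytes):
    """True if the largest compacted partition is within the guideline."""
    return compacted_max_bytes <= MAX_HEALTHY_PARTITION_BYTES

print(partition_size_ok(2816159))       # the 2.8 MB value from this slide: fine
print(partition_size_ok(350_000_000))   # ~350 MB: a likely data-modelling issue
```

Feeding each table's "Compacted partition maximum bytes" through a check like this catches oversized partitions before they start hurting reads and compaction.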
>nodetool cfstats
Compacted partition mean bytes.
This is the average amount of data
under all partition keys on this node.
You really want to keep this number
under 100 MB.
If you see large values here, you know
you have a data modelling issue.
>nodetool cfstats
Average live cells per slice.
This is a measure of the amount of
data you are pulling back for the
average query (SELECT).
Pulling tens or hundreds of cells
(values) is fine; in fact, pulling back
thousands of cells on average is fine if
that’s what you intended to do. If it’s
not what you intended your solution to
do, you might want to look at who is
running lazy SELECT * queries on your
cluster!
Be aware that larger queries will
increase read latency significantly.
>nodetool cfstats
Maximum live cells per slice.
Self-explanatory: the largest query
seen in the last five minutes.
>nodetool cfstats
Average tombstones per slice.
Tombstones are not returned in
queries, but they have to be read off
disk and filtered through the JVM, so
they can add significant relative
overhead to a query.
If you are pulling back one live cell and
100 tombstones in a query, it’s going
to impact your performance.
Tombstones are the result of deletes,
and deletes need to be very carefully
managed and modelled in C*.
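One way to reason about this overhead is as the fraction of scanned cells that were tombstones. A small sketch (the helper name is my own, not a nodetool metric):

```python
def tombstone_overhead(live_cells, tombstones):
    """Fraction of cells scanned by a slice that were tombstones."""
    total = live_cells + tombstones
    return tombstones / total if total else 0.0

# One live cell filtered out of 101 scanned: the query is mostly reading garbage.
print(tombstone_overhead(1, 100))
# This table's averages (2.80 live, 1.0 tombstone per slice): modest overhead.
print(tombstone_overhead(2.80, 1.0))
```

When this fraction creeps toward 1.0 for a table, it is time to revisit how deletes and TTLs are modelled.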
>nodetool cfstats
Maximum tombstones per slice.
Self-explanatory: the largest number of
tombstones seen in a query in the last
five minutes.
Summary so far...
That rounds out the top 3 diagnostic nodetool commands in Apache Cassandra:
● nodetool tpstats
● nodetool netstats
● nodetool cfstats
With those 3 commands you can get a very good grasp of the health of a node and its possible issues. If you see a
pattern cluster-wide, you know you have a general issue (usually workload); if, however, you only see poor health on a
single node, it’s probably* time to start looking at hardware as the culprit.
*I say probably because there are circumstances where a hot partition on a single node can get hammered with requests; the times I have seen
this are where someone has accidentally pointed a tool at C* that focuses on a single partition (thanks, security guy).
>system.log
On package installs it lives in:
/var/log/cassandra
What to look for:
● Exceptions
● GC events
● Other nodes going UP and DOWN in gossip
● Dropped messages
● WARNs on large partitions / wide rows
● Tombstone warnings
● Repair session failures
● Compactions with large amounts of sstables in them
● Startup problems and warnings
● Topology warnings
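Most of the items on this checklist can be caught with a simple log scan. The sketch below is one way to do it; the regex patterns and sample log lines are illustrative only, since the exact message wording differs between Cassandra versions.

```python
import re

# Illustrative patterns for the checklist above; exact wording varies
# between Cassandra versions, so treat these as a starting point.
PATTERNS = {
    "exception":       re.compile(r"Exception|ERROR"),
    "gc_pause":        re.compile(r"GCInspector|GC for"),
    "gossip_flap":     re.compile(r"is now (UP|DOWN)"),
    "dropped":         re.compile(r"dropped"),
    "large_partition": re.compile(r"large partition"),
    "tombstones":      re.compile(r"tombstone"),
}

def scan(lines):
    """Count how many log lines match each pattern of interest."""
    hits = {name: 0 for name in PATTERNS}
    for line in lines:
        for name, pat in PATTERNS.items():
            if pat.search(line):
                hits[name] += 1
    return hits

# Hypothetical sample lines in the general shape of system.log entries:
sample = [
    "WARN  GCInspector.java - G1 Young Generation GC in 512ms",
    "INFO  Gossiper.java - InetAddress /10.0.0.5 is now DOWN",
    "WARN  ReadCommand.java - Read 100 live rows and 20000 tombstone cells",
]
print(scan(sample))
```

Pointing `scan()` at `/var/log/cassandra/system.log` line by line gives a quick first pass before digging into the raw log.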
Monitoring Automation
JMX
Cassandra exposes its metrics via
MBeans; here you see JConsole
connected to a Cassandra node, listing
all the MBeans available for
interrogation.
These JMX MBeans can be
instrumented in Java and Python
interfaces plus some commercial
products.
DataStax uses these same MBeans to
instrument OpsCenter.
JMX
Cassandra exposes its metrics via
MBeans; here you see JConsole
connected to a Cassandra node, listing
all the MBeans available for
interrogation.
These JMX MBeans can be
instrumented in Java and Python
interfaces plus some commercial
products.
A list of alternatives to Jconsole is here: JMX Clients with Apache Cassandra
JMX
Invoking an MBean in Java
This is sample code for a simple
method call against an MBean with no
return value. In a useful application
you would return data and present the
result on screen, or store the result
for analysis.
This code was stripped from the following link for
educational and training purposes and all copyright
belongs to their respective owners:
http://stackoverflow.com/questions/16583859/execute-a
-method-with-jmx-without-jconsole
import javax.management.*;
import javax.management.remote.*;

import com.sun.messaging.AdminConnectionFactory;
import com.sun.messaging.jms.management.server.*;

public class InvokeOp {
    public static void main(String[] args) {
        try {
            // Create administration connection factory
            AdminConnectionFactory acf = new AdminConnectionFactory();
            // Get JMX connector, supplying user name and password
            JMXConnector jmxc = acf.createConnection("AliBaba", "sesame");
            // Get MBean server connection
            MBeanServerConnection mbsc = jmxc.getMBeanServerConnection();
            // Create object name
            ObjectName serviceConfigName = MQObjectName.createServiceConfig("jms");
            // Invoke operation
            mbsc.invoke(serviceConfigName, ServiceOperations.PAUSE, null, null);
            // Close JMX connector
            jmxc.close();
        } catch (Exception e) {
            System.out.println("Exception occurred: " + e.toString());
            e.printStackTrace();
        }
    }
}
JMX
Invoking an MBean from Jython (Python
running on the Java JVM).
This code was stripped from the following link for
educational and training purposes and all copyright
belongs to their respective owners:
https://egkatzioura.com/2014/09/22/connecting-to-jmx-through-jython/
from javax.management.remote import JMXConnector
from javax.management.remote import JMXConnectorFactory
from javax.management.remote import JMXServiceURL
from javax.management import MBeanServerConnection
from javax.management import MBeanInfo
from javax.management import ObjectName
from java.lang import String
from jarray import array
import sys
if __name__ == '__main__':
    if len(sys.argv) > 5:
        serverUrl = sys.argv[1]
        username = sys.argv[2]
        password = sys.argv[3]
        beanName = sys.argv[4]
        action = sys.argv[5]
    else:
        sys.exit(-1)

    credentials = array([username, password], String)
    environment = {JMXConnector.CREDENTIALS: credentials}
    jmxServiceUrl = JMXServiceURL('service:jmx:rmi:///jndi/rmi://' + serverUrl + ':9999/jmxrmi')
    jmxConnector = JMXConnectorFactory.connect(jmxServiceUrl, environment)
    mBeanServerConnection = jmxConnector.getMBeanServerConnection()
    objectName = ObjectName(beanName)
    mBeanServerConnection.invoke(objectName, action, None, None)
    jmxConnector.close()
JMX
Invoking an MBean via Jolokia, which
bridges JMX to HTTP/JSON so it can be
reached from CPython (or any language
with an HTTP client).
https://jolokia.org/
This approach is a little more complex,
as a Jolokia agent needs to be installed
on each node. Note the sample shown is
from Jolokia’s Java client tutorial; the
same call can be made over plain HTTP.
There are some other Python JMX libraries, but I have not
used them so cannot vouch for them.
This code was stripped from the following link for
educational and training purposes and all copyright
belongs to their respective owners:
https://jolokia.org/tutorial.html
import org.jolokia.client.*;
import org.jolokia.client.request.*;

import java.util.Map;

public class JolokiaDemo {
    public static void main(String[] args) throws Exception {
        J4pClient j4pClient = new J4pClient("http://localhost:8080/jolokia");
        J4pReadRequest req = new J4pReadRequest("java.lang:type=Memory",
                                                "HeapMemoryUsage");
        J4pReadResponse resp = j4pClient.execute(req);
        Map<String, Long> vals = resp.getValue();
        long used = vals.get("used");
        long max = vals.get("max");
        int usage = (int) (used * 100 / max);
        System.out.println("Memory usage: used: " + used +
                           " / max: " + max + " = " + usage + "%");
    }
}
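Because Jolokia speaks HTTP/JSON, the same HeapMemoryUsage read can be done from CPython with nothing but the standard library. A minimal sketch follows: the URL and port are assumptions that depend on where the agent is attached, and the response below is a canned example in the shape Jolokia's read protocol returns, not output from a real node.

```python
import json
from urllib.request import urlopen  # used only against a live agent

JOLOKIA_URL = "http://localhost:8778/jolokia"  # assumed agent address

def heap_usage(payload):
    """Extract used/max heap from a Jolokia 'read' response."""
    vals = payload["value"]
    used, maximum = vals["used"], vals["max"]
    return used, maximum, used * 100 // maximum

# Against a live node you would do something like:
#   with urlopen(JOLOKIA_URL + "/read/java.lang:type=Memory/HeapMemoryUsage") as r:
#       payload = json.loads(r.read())
# Here we parse a canned response of the same shape instead:
payload = json.loads("""
{"request": {"mbean": "java.lang:type=Memory",
             "attribute": "HeapMemoryUsage", "type": "read"},
 "value": {"init": 268435456, "committed": 1073741824,
           "used": 536870912, "max": 2147483648},
 "status": 200}
""")

used, maximum, pct = heap_usage(payload)
print("Memory usage: used: %d / max: %d = %d%%" % (used, maximum, pct))
```

This mirrors the Java client example above, which is the main attraction of Jolokia: no JVM is needed on the monitoring side.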
JMX + Node.js
jmx npm
https://www.npmjs.com/package/jmx
Can’t vouch for this one, but Node.js is a great way to serve
JavaScript directly into a GUI; the Meteor project is also an
excellent pub/sub/push system built on Node.js that would
make a great C* Ops GUI.
https://www.meteor.com/
This code was stripped from the following link for
educational and training purposes and all copyright
belongs to their respective owners:
https://www.npmjs.com/package/jmx
var jmx = require("jmx");

client = jmx.createClient({
    host: "localhost", // optional
    port: 3000
});
client.connect();

client.on("connect", function() {
    client.getAttribute("java.lang:type=Memory", "HeapMemoryUsage", function(data) {
        var used = data.getSync('used');
        console.log("HeapMemoryUsage used: " + used.longValue);
        // console.log(data.toString());
    });
    client.setAttribute("java.lang:type=Memory", "Verbose", true, function() {
        console.log("Memory verbose on"); // callback is optional
    });
    client.invoke("java.lang:type=Memory", "gc", [], function(data) {
        console.log("gc() done");
    });
});
JMX + Node.js
jolokia npm
https://www.npmjs.com/package/jolokia
Can’t vouch for this one, but Node.js is a great way to serve
JavaScript directly into a GUI; the Meteor project is also an
excellent pub/sub/push system built on Node.js that would
make a great C* Ops GUI.
https://www.meteor.com/
This code was stripped from the following link for
educational and training purposes and all copyright
belongs to their respective owners:
https://www.npmjs.com/package/jolokia
// In Node.js or using Browserify
var Jolokia = require('jolokia');
// In browser
var Jolokia = window.Jolokia;
// Or using RequireJS
require(['./path/to/jolokia'], function(Jolokia) {
    // code below
});

var jolokia = new Jolokia({
    url: '/jmx',    // use full url when in Node.js environment
    method: 'post', // force specific HTTP method
});

jolokia.list().then(function(value) {
    // do something with list of JMX domains
}, function(error) {
    // handle error
});
Thanks!
Contact us:
DataStax
Sydney, Australia
alex.thompson@datastax.com
www.datastax.com
More Related Content

What's hot

What's hot (20)

MySQL 5.7にやられないためにおぼえておいてほしいこと
MySQL 5.7にやられないためにおぼえておいてほしいことMySQL 5.7にやられないためにおぼえておいてほしいこと
MySQL 5.7にやられないためにおぼえておいてほしいこと
 
Cassandraのしくみ データの読み書き編
Cassandraのしくみ データの読み書き編Cassandraのしくみ データの読み書き編
Cassandraのしくみ データの読み書き編
 
Autovacuum, explained for engineers, new improved version PGConf.eu 2015 Vienna
Autovacuum, explained for engineers, new improved version PGConf.eu 2015 ViennaAutovacuum, explained for engineers, new improved version PGConf.eu 2015 Vienna
Autovacuum, explained for engineers, new improved version PGConf.eu 2015 Vienna
 
Navigating Transactions: ACID Complexity in Modern Databases
Navigating Transactions: ACID Complexity in Modern DatabasesNavigating Transactions: ACID Complexity in Modern Databases
Navigating Transactions: ACID Complexity in Modern Databases
 
MySQLを割と一人で300台管理する技術
MySQLを割と一人で300台管理する技術MySQLを割と一人で300台管理する技術
MySQLを割と一人で300台管理する技術
 
Apache Kafka - Yüksek Performanslı Dağıtık Mesajlaşma Sistemi - Türkçe
Apache Kafka - Yüksek Performanslı Dağıtık Mesajlaşma Sistemi - TürkçeApache Kafka - Yüksek Performanslı Dağıtık Mesajlaşma Sistemi - Türkçe
Apache Kafka - Yüksek Performanslı Dağıtık Mesajlaşma Sistemi - Türkçe
 
Clickhouse at Cloudflare. By Marek Vavrusa
Clickhouse at Cloudflare. By Marek VavrusaClickhouse at Cloudflare. By Marek Vavrusa
Clickhouse at Cloudflare. By Marek Vavrusa
 
さいきんの InnoDB Adaptive Flushing (仮)
さいきんの InnoDB Adaptive Flushing (仮)さいきんの InnoDB Adaptive Flushing (仮)
さいきんの InnoDB Adaptive Flushing (仮)
 
リペア時間短縮にむけた取り組み@Yahoo! JAPAN #casstudy
リペア時間短縮にむけた取り組み@Yahoo! JAPAN #casstudyリペア時間短縮にむけた取り組み@Yahoo! JAPAN #casstudy
リペア時間短縮にむけた取り組み@Yahoo! JAPAN #casstudy
 
HTTP Analytics for 6M requests per second using ClickHouse, by Alexander Boc...
HTTP Analytics for 6M requests per second using ClickHouse, by  Alexander Boc...HTTP Analytics for 6M requests per second using ClickHouse, by  Alexander Boc...
HTTP Analytics for 6M requests per second using ClickHouse, by Alexander Boc...
 
MySQL Performance for DevOps
MySQL Performance for DevOpsMySQL Performance for DevOps
MySQL Performance for DevOps
 
MySQL Database Architectures - InnoDB ReplicaSet & Cluster
MySQL Database Architectures - InnoDB ReplicaSet & ClusterMySQL Database Architectures - InnoDB ReplicaSet & Cluster
MySQL Database Architectures - InnoDB ReplicaSet & Cluster
 
Linux Namespace
Linux NamespaceLinux Namespace
Linux Namespace
 
C* Summit 2013: How Not to Use Cassandra by Axel Liljencrantz
C* Summit 2013: How Not to Use Cassandra by Axel LiljencrantzC* Summit 2013: How Not to Use Cassandra by Axel Liljencrantz
C* Summit 2013: How Not to Use Cassandra by Axel Liljencrantz
 
nftables: the Next Generation Firewall in Linux
nftables: the Next Generation Firewall in Linuxnftables: the Next Generation Firewall in Linux
nftables: the Next Generation Firewall in Linux
 
Percona XtraDB Cluster vs Galera Cluster vs MySQL Group Replication
Percona XtraDB Cluster vs Galera Cluster vs MySQL Group ReplicationPercona XtraDB Cluster vs Galera Cluster vs MySQL Group Replication
Percona XtraDB Cluster vs Galera Cluster vs MySQL Group Replication
 
まずやっとくPostgreSQLチューニング
まずやっとくPostgreSQLチューニングまずやっとくPostgreSQLチューニング
まずやっとくPostgreSQLチューニング
 
地理分散DBについて
地理分散DBについて地理分散DBについて
地理分散DBについて
 
Proxysql sharding
Proxysql shardingProxysql sharding
Proxysql sharding
 
PostgreSQL Extensions: A deeper look
PostgreSQL Extensions:  A deeper lookPostgreSQL Extensions:  A deeper look
PostgreSQL Extensions: A deeper look
 

Similar to Apache Cassandra - Diagnostics and monitoring

How Many Ways Can I Manage Oracle GoldenGate?
How Many Ways Can I Manage Oracle GoldenGate?How Many Ways Can I Manage Oracle GoldenGate?
How Many Ways Can I Manage Oracle GoldenGate?
Enkitec
 
Varnish @ Velocity Ignite
Varnish @ Velocity IgniteVarnish @ Velocity Ignite
Varnish @ Velocity Ignite
Artur Bergman
 
Generic Framework for Knowledge Classification-1
Generic Framework  for Knowledge Classification-1Generic Framework  for Knowledge Classification-1
Generic Framework for Knowledge Classification-1
Venkata Vineel
 
PowerPoint Presentation
PowerPoint PresentationPowerPoint Presentation
PowerPoint Presentation
webhostingguy
 

Similar to Apache Cassandra - Diagnostics and monitoring (20)

Cassandra Day SV 2014: Basic Operations with Apache Cassandra
Cassandra Day SV 2014: Basic Operations with Apache CassandraCassandra Day SV 2014: Basic Operations with Apache Cassandra
Cassandra Day SV 2014: Basic Operations with Apache Cassandra
 
Performance tuning
Performance tuningPerformance tuning
Performance tuning
 
OSDC 2015: Georg Schönberger | Linux Performance Profiling and Monitoring
OSDC 2015: Georg Schönberger | Linux Performance Profiling and MonitoringOSDC 2015: Georg Schönberger | Linux Performance Profiling and Monitoring
OSDC 2015: Georg Schönberger | Linux Performance Profiling and Monitoring
 
Linux Performance Profiling and Monitoring
Linux Performance Profiling and MonitoringLinux Performance Profiling and Monitoring
Linux Performance Profiling and Monitoring
 
NodeJs
NodeJsNodeJs
NodeJs
 
AWR Sample Report
AWR Sample ReportAWR Sample Report
AWR Sample Report
 
ioDrive de benchmarking 2011 1209_zem_distribution
ioDrive de benchmarking 2011 1209_zem_distributionioDrive de benchmarking 2011 1209_zem_distribution
ioDrive de benchmarking 2011 1209_zem_distribution
 
How Many Ways Can I Manage Oracle GoldenGate?
How Many Ways Can I Manage Oracle GoldenGate?How Many Ways Can I Manage Oracle GoldenGate?
How Many Ways Can I Manage Oracle GoldenGate?
 
Maximizing SQL Reviews and Tuning with pt-query-digest
Maximizing SQL Reviews and Tuning with pt-query-digestMaximizing SQL Reviews and Tuning with pt-query-digest
Maximizing SQL Reviews and Tuning with pt-query-digest
 
OSDC 2017 - Werner Fischer - Linux performance profiling and monitoring
OSDC 2017 - Werner Fischer - Linux performance profiling and monitoringOSDC 2017 - Werner Fischer - Linux performance profiling and monitoring
OSDC 2017 - Werner Fischer - Linux performance profiling and monitoring
 
OSMC 2015: Linux Performance Profiling and Monitoring by Werner Fischer
OSMC 2015: Linux Performance Profiling and Monitoring by Werner FischerOSMC 2015: Linux Performance Profiling and Monitoring by Werner Fischer
OSMC 2015: Linux Performance Profiling and Monitoring by Werner Fischer
 
OSMC 2015 | Linux Performance Profiling and Monitoring by Werner Fischer
OSMC 2015 | Linux Performance Profiling and Monitoring by Werner FischerOSMC 2015 | Linux Performance Profiling and Monitoring by Werner Fischer
OSMC 2015 | Linux Performance Profiling and Monitoring by Werner Fischer
 
Beyond PHP - it's not (just) about the code
Beyond PHP - it's not (just) about the codeBeyond PHP - it's not (just) about the code
Beyond PHP - it's not (just) about the code
 
Managing PostgreSQL with PgCenter
Managing PostgreSQL with PgCenterManaging PostgreSQL with PgCenter
Managing PostgreSQL with PgCenter
 
PostgreSQL 9.6 Performance-Scalability Improvements
PostgreSQL 9.6 Performance-Scalability ImprovementsPostgreSQL 9.6 Performance-Scalability Improvements
PostgreSQL 9.6 Performance-Scalability Improvements
 
Varnish @ Velocity Ignite
Varnish @ Velocity IgniteVarnish @ Velocity Ignite
Varnish @ Velocity Ignite
 
2014 OSDC Talk: Introduction to Percona XtraDB Cluster and HAProxy
2014 OSDC Talk: Introduction to Percona XtraDB Cluster and HAProxy2014 OSDC Talk: Introduction to Percona XtraDB Cluster and HAProxy
2014 OSDC Talk: Introduction to Percona XtraDB Cluster and HAProxy
 
Generic Framework for Knowledge Classification-1
Generic Framework  for Knowledge Classification-1Generic Framework  for Knowledge Classification-1
Generic Framework for Knowledge Classification-1
 
Cassandra 2.1 boot camp, Overview
Cassandra 2.1 boot camp, OverviewCassandra 2.1 boot camp, Overview
Cassandra 2.1 boot camp, Overview
 
PowerPoint Presentation
PowerPoint PresentationPowerPoint Presentation
PowerPoint Presentation
 

More from Alex Thompson

More from Alex Thompson (6)

The Apache Cassandra ecosystem
The Apache Cassandra ecosystemThe Apache Cassandra ecosystem
The Apache Cassandra ecosystem
 
Apache Cassandra - Drivers deep dive
Apache Cassandra - Drivers deep diveApache Cassandra - Drivers deep dive
Apache Cassandra - Drivers deep dive
 
Deconstructing Apache Cassandra
Deconstructing Apache CassandraDeconstructing Apache Cassandra
Deconstructing Apache Cassandra
 
Apache Cassandra - Data modelling
Apache Cassandra - Data modellingApache Cassandra - Data modelling
Apache Cassandra - Data modelling
 
Apache Cassandra and Apche Spark
Apache Cassandra and Apche SparkApache Cassandra and Apche Spark
Apache Cassandra and Apche Spark
 
Building Apache Cassandra clusters for massive scale
Building Apache Cassandra clusters for massive scaleBuilding Apache Cassandra clusters for massive scale
Building Apache Cassandra clusters for massive scale
 

Recently uploaded

Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
amitlee9823
 
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in  KishangarhDelhi 99530 vip 56974 Genuine Escort Service Call Girls in  Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
Vip Model Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
Vip Model  Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...Vip Model  Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
Vip Model Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
shivangimorya083
 
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts ServiceCall Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...
shambhavirathore45
 
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
amitlee9823
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
AroojKhan71
 

Recently uploaded (20)

Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
 
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in  KishangarhDelhi 99530 vip 56974 Genuine Escort Service Call Girls in  Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
 
VidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxVidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptx
 
Vip Model Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
Vip Model  Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...Vip Model  Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
Vip Model Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
 
BabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxBabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptx
 
Zuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptxZuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptx
 
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts ServiceCall Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
 
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFx
 
Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...
 
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
 
Ravak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxRavak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptx
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and Milvus
 
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdfAccredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signals
 
Sampling (random) method and Non random.ppt
Sampling (random) method and Non random.pptSampling (random) method and Non random.ppt
Sampling (random) method and Non random.ppt
 
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptxBPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
 
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightCheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
 

Apache Cassandra - Diagnostics and monitoring

  • 1. Apache Cassandra: diagnostics and monitoring Alex Thompson Solution Architect - APAC DataStax Australia
  • 2. Intro This presentation is intended as a field guide for users of Apache Cassandra. This guide specifically covers an explanation diagnostics tools and monitoring tools and methods used in conjunction with C*, it is written in a pragmatic order with the most important tools first.
  • 4. >nodetool tpstats Probably the most important “at a glance” summary of the health of a node and the first diagnostics command to run. >nodetool tpstats is better described as “nodetool thread statistics”; it gives us a real-time measure of each thread in C* and its current workload. Note: if you restart a C* instance these statistics are cleared to zero, so you have to run it on a node that has been up for a while to be able to diagnose workload. Pool Name Active Pending Completed Blocked All time blocked MutationStage 0 0 25159974 0 0 ViewMutationStage 0 0 0 0 0 ReadStage 0 0 3231222 0 0 RequestResponseStage 0 0 36609517 0 0 ReadRepairStage 0 0 410293 0 0 CounterMutationStage 0 0 0 0 0 MiscStage 0 0 0 0 0 CompactionExecutor 8 108 287003 0 0 MemtableReclaimMemory 0 0 444 0 0 PendingRangeCalculator 0 0 27 0 0 GossipStage 0 0 464348 0 0 SecondaryIndexManagement 0 0 13 0 0 HintsDispatcher 0 0 396 0 0 MigrationStage 0 0 25 0 0 MemtablePostFlush 0 0 1114 0 0 ValidationExecutor 0 0 321 0 0 Sampler 0 0 0 0 0 MemtableFlushWriter 0 0 444 0 0 InternalResponseStage 0 0 68544 0 0 AntiEntropyStage 0 0 1209 0 0 CacheCleanupExecutor 0 0 0 0 0 Native-Transport-Requests 0 0 35849149 0 536 Message type Dropped READ 4 RANGE_SLICE 0 _TRACE 5095 HINT 0 MUTATION 180 COUNTER_MUTATION 0 BATCH_STORE 0 BATCH_REMOVE 0 REQUEST_RESPONSE 23 PAGED_RANGE 0 READ_REPAIR 0
  • 5. >nodetool tpstats First thing to check is Pending work on threads, this node is showing compactions getting behind, this may be OK but is usually an indication with other diagnostics of an overloaded node. Pool Name Active Pending Completed Blocked All time blocked MutationStage 0 0 25159974 0 0 ViewMutationStage 0 0 0 0 0 ReadStage 0 0 3231222 0 0 RequestResponseStage 0 0 36609517 0 0 ReadRepairStage 0 0 410293 0 0 CounterMutationStage 0 0 0 0 0 MiscStage 0 0 0 0 0 CompactionExecutor 8 108 287003 0 0 MemtableReclaimMemory 0 0 444 0 0 PendingRangeCalculator 0 0 27 0 0 GossipStage 0 0 464348 0 0 SecondaryIndexManagement 0 0 13 0 0 HintsDispatcher 0 0 396 0 0 MigrationStage 0 0 25 0 0 MemtablePostFlush 0 0 1114 0 0 ValidationExecutor 0 0 321 0 0 Sampler 0 0 0 0 0 MemtableFlushWriter 0 0 444 0 0 InternalResponseStage 0 0 68544 0 0 AntiEntropyStage 0 0 1209 0 0 CacheCleanupExecutor 0 0 0 0 0 Native-Transport-Requests 0 0 35849149 0 536 Message type Dropped READ 4 RANGE_SLICE 0 _TRACE 5095 HINT 0 MUTATION 180 COUNTER_MUTATION 0 BATCH_STORE 0 BATCH_REMOVE 0 REQUEST_RESPONSE 23 PAGED_RANGE 0 READ_REPAIR 0
  • 6. >nodetool tpstats Next up is to check All time blocked: in this case Native-Transport-Requests which are calls to the binary CQL port (reads or writes) that have not been completed due to overload. Also note the high Completed This node is servicing a lot of requests. In combination with Pending mentioned in the prior slide this is starting to look like an overloaded node, but let’s dig deeper... Pool Name Active Pending Completed Blocked All time blocked MutationStage 0 0 25159974 0 0 ViewMutationStage 0 0 0 0 0 ReadStage 0 0 3231222 0 0 RequestResponseStage 0 0 36609517 0 0 ReadRepairStage 0 0 410293 0 0 CounterMutationStage 0 0 0 0 0 MiscStage 0 0 0 0 0 CompactionExecutor 8 108 287003 0 0 MemtableReclaimMemory 0 0 444 0 0 PendingRangeCalculator 0 0 27 0 0 GossipStage 0 0 464348 0 0 SecondaryIndexManagement 0 0 13 0 0 HintsDispatcher 0 0 396 0 0 MigrationStage 0 0 25 0 0 MemtablePostFlush 0 0 1114 0 0 ValidationExecutor 0 0 321 0 0 Sampler 0 0 0 0 0 MemtableFlushWriter 0 0 444 0 0 InternalResponseStage 0 0 68544 0 0 AntiEntropyStage 0 0 1209 0 0 CacheCleanupExecutor 0 0 0 0 0 Native-Transport-Requests 0 0 35849149 0 536 Message type Dropped READ 4 RANGE_SLICE 0 _TRACE 5095 HINT 0 MUTATION 180 COUNTER_MUTATION 0 BATCH_STORE 0 BATCH_REMOVE 0 REQUEST_RESPONSE 23 PAGED_RANGE 0 READ_REPAIR 0
  • 7. >nodetool tpstats OK, now the nasty part, Dropped messages. These are messages of various types that the node has received that is has not been able to process due to overload, to save itself from going down C* has gone into “emergency mode” and shed the messages, we should never see any dropped messages. Period. Lets go thru these messages one by one…. Pool Name Active Pending Completed Blocked All time blocked MutationStage 0 0 25159974 0 0 ViewMutationStage 0 0 0 0 0 ReadStage 0 0 3231222 0 0 RequestResponseStage 0 0 36609517 0 0 ReadRepairStage 0 0 410293 0 0 CounterMutationStage 0 0 0 0 0 MiscStage 0 0 0 0 0 CompactionExecutor 8 108 287003 0 0 MemtableReclaimMemory 0 0 444 0 0 PendingRangeCalculator 0 0 27 0 0 GossipStage 0 0 464348 0 0 SecondaryIndexManagement 0 0 13 0 0 HintsDispatcher 0 0 396 0 0 MigrationStage 0 0 25 0 0 MemtablePostFlush 0 0 1114 0 0 ValidationExecutor 0 0 321 0 0 Sampler 0 0 0 0 0 MemtableFlushWriter 0 0 444 0 0 InternalResponseStage 0 0 68544 0 0 AntiEntropyStage 0 0 1209 0 0 CacheCleanupExecutor 0 0 0 0 0 Native-Transport-Requests 0 0 35849149 0 536 Message type Dropped READ 4 RANGE_SLICE 0 _TRACE 5095 HINT 0 MUTATION 180 COUNTER_MUTATION 0 BATCH_STORE 0 BATCH_REMOVE 0 REQUEST_RESPONSE 23 PAGED_RANGE 0 READ_REPAIR 0
  • 8. >nodetool tpstats So that’s 4x READ messages that were dropped, they were CQL SELECT statements that C* could not process due to overload of this node Other nodes with replicas would have stepped in to satisfy the query*. *As long as the driver was correctly configured and the correct consistency level was applied to the CQL SELECT statement. Pool Name Active Pending Completed Blocked All time blocked MutationStage 0 0 25159974 0 0 ViewMutationStage 0 0 0 0 0 ReadStage 0 0 3231222 0 0 RequestResponseStage 0 0 36609517 0 0 ReadRepairStage 0 0 410293 0 0 CounterMutationStage 0 0 0 0 0 MiscStage 0 0 0 0 0 CompactionExecutor 8 108 287003 0 0 MemtableReclaimMemory 0 0 444 0 0 PendingRangeCalculator 0 0 27 0 0 GossipStage 0 0 464348 0 0 SecondaryIndexManagement 0 0 13 0 0 HintsDispatcher 0 0 396 0 0 MigrationStage 0 0 25 0 0 MemtablePostFlush 0 0 1114 0 0 ValidationExecutor 0 0 321 0 0 Sampler 0 0 0 0 0 MemtableFlushWriter 0 0 444 0 0 InternalResponseStage 0 0 68544 0 0 AntiEntropyStage 0 0 1209 0 0 CacheCleanupExecutor 0 0 0 0 0 Native-Transport-Requests 0 0 35849149 0 536 Message type Dropped READ 4 RANGE_SLICE 0 _TRACE 5095 HINT 0 MUTATION 180 COUNTER_MUTATION 0 BATCH_STORE 0 BATCH_REMOVE 0 REQUEST_RESPONSE 23 PAGED_RANGE 0 READ_REPAIR 0
  • 9. >nodetool tpstats 5095x TRACE messages have been dropped. This is a problem. Someone has either: 1) turned TRACE on on the server using: >nodetool settraceprobablity 1 2) more worryingly has checked in CQL code in at the application tier with TRACE ON. TRACE puts an enormous weight on a node and should never be on in production! Pool Name Active Pending Completed Blocked All time blocked MutationStage 0 0 25159974 0 0 ViewMutationStage 0 0 0 0 0 ReadStage 0 0 3231222 0 0 RequestResponseStage 0 0 36609517 0 0 ReadRepairStage 0 0 410293 0 0 CounterMutationStage 0 0 0 0 0 MiscStage 0 0 0 0 0 CompactionExecutor 8 108 287003 0 0 MemtableReclaimMemory 0 0 444 0 0 PendingRangeCalculator 0 0 27 0 0 GossipStage 0 0 464348 0 0 SecondaryIndexManagement 0 0 13 0 0 HintsDispatcher 0 0 396 0 0 MigrationStage 0 0 25 0 0 MemtablePostFlush 0 0 1114 0 0 ValidationExecutor 0 0 321 0 0 Sampler 0 0 0 0 0 MemtableFlushWriter 0 0 444 0 0 InternalResponseStage 0 0 68544 0 0 AntiEntropyStage 0 0 1209 0 0 CacheCleanupExecutor 0 0 0 0 0 Native-Transport-Requests 0 0 35849149 0 536 Message type Dropped READ 4 RANGE_SLICE 0 _TRACE 5095 HINT 0 MUTATION 180 COUNTER_MUTATION 0 BATCH_STORE 0 BATCH_REMOVE 0 REQUEST_RESPONSE 23 PAGED_RANGE 0 READ_REPAIR 0
  • 10. >nodetool tpstats With TRACE on on this node, all bets are off, this could be the sole cause of this node’s problems, TRACE is such a heavy hitting process that it can retard a node if activated on a production node or retard an entire cluster if activated on all nodes. To turn it off run on all nodes: >nodetool settraceprobability 0 If it’s in checked in CQL code you need to audit all app tier code to identify the offending statement/s. Pool Name Active Pending Completed Blocked All time blocked MutationStage 0 0 25159974 0 0 ViewMutationStage 0 0 0 0 0 ReadStage 0 0 3231222 0 0 RequestResponseStage 0 0 36609517 0 0 ReadRepairStage 0 0 410293 0 0 CounterMutationStage 0 0 0 0 0 MiscStage 0 0 0 0 0 CompactionExecutor 8 108 287003 0 0 MemtableReclaimMemory 0 0 444 0 0 PendingRangeCalculator 0 0 27 0 0 GossipStage 0 0 464348 0 0 SecondaryIndexManagement 0 0 13 0 0 HintsDispatcher 0 0 396 0 0 MigrationStage 0 0 25 0 0 MemtablePostFlush 0 0 1114 0 0 ValidationExecutor 0 0 321 0 0 Sampler 0 0 0 0 0 MemtableFlushWriter 0 0 444 0 0 InternalResponseStage 0 0 68544 0 0 AntiEntropyStage 0 0 1209 0 0 CacheCleanupExecutor 0 0 0 0 0 Native-Transport-Requests 0 0 35849149 0 536 Message type Dropped READ 4 RANGE_SLICE 0 _TRACE 5095 HINT 0 MUTATION 180 COUNTER_MUTATION 0 BATCH_STORE 0 BATCH_REMOVE 0 REQUEST_RESPONSE 23 PAGED_RANGE 0 READ_REPAIR 0
  • 11. >nodetool tpstats TRACE on a production node earns my dill award.
  • 12. >nodetool tpstats 180 MUTATION messages dropped. MUTATIONs are writes; the server has not had the headroom to perform these writes. The REQUEST_RESPONSE drops are self-explanatory.
  • 13. >nodetool tpstats What to look for: on a healthy node you should not really see thread pools going into a Pending state. Under 10 Pending on CompactionExecutor can be OK, but larger numbers usually indicate a problem. As for dropped messages, you should not see any; they mean there is a real issue during peak workloads that needs to be addressed.
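The checks on the last few slides can be sketched as a small parser over >nodetool tpstats output. This is a hypothetical helper, not part of nodetool: it reports every pool with nonzero Pending work and every message type with nonzero drops, leaving the judgment (e.g. the "under 10 Pending on CompactionExecutor" rule of thumb) to the operator.

```python
def scan_tpstats(text):
    """Return (pending, dropped) dicts from `nodetool tpstats` output.

    pending maps pool name -> Pending count (only nonzero entries),
    dropped maps message type -> Dropped count (only nonzero entries).
    """
    pending, dropped = {}, {}
    in_dropped = False
    for line in text.splitlines():
        parts = line.split()
        if not parts:
            continue
        # "Message type Dropped" header separates the two sections
        if parts[0] == "Message" and "Dropped" in line:
            in_dropped = True
            continue
        if in_dropped:
            if len(parts) == 2 and parts[1].isdigit() and int(parts[1]) > 0:
                dropped[parts[0]] = int(parts[1])
        elif len(parts) >= 3 and parts[1].isdigit() and parts[2].isdigit():
            # columns: Pool Name, Active, Pending, Completed, ...
            if int(parts[2]) > 0:
                pending[parts[0]] = int(parts[2])
    return pending, dropped
```

Run against the output shown on these slides, this flags CompactionExecutor (108 pending) and the _TRACE, MUTATION, READ and REQUEST_RESPONSE drops.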
  • 14. >nodetool netstats Aside from >nodetool tpstats, >nodetool netstats is your second go-to diagnostic for a good view of how healthy a node is. The first thing to check is “Read Repair Statistics”: these count inconsistencies in data found on this node when compared with other replicas as queries execute, and they usually indicate that the node or cluster is under stress and may not be properly provisioned for the workload it is expected to do. Mode: NORMAL Not sending any streams. Read Repair Statistics: Attempted: 408271 Mismatch (Blocking): 78 Mismatch (Background): 602 Pool Name Active Pending Completed Dropped Large messages n/a 0 12252 913 Small messages n/a 0 63614651 0 Gossip messages n/a 0 480331 0
  • 15. >nodetool netstats The specific counts we are interested in are the Mismatch values. Compared to the number of read repairs attempted (408271), we have some minor repairs occurring: 78 blocking and 602 background. These are minor numbers, but they do indicate that this node is at times under stress.
  • 16. >nodetool netstats This is more worrying, though, and quite unusual: the number of dropped large messages (913) indicates to me that someone is doing something silly here, either attempting overly large writes or querying for overly large SELECTs. As soon as I saw this, I would start asking questions about where these messages are coming from and put a stop to the misuse.
  • 17. >nodetool netstats What to look for: large Mismatch values indicate a node that has in the past been under severe stress and incapable of keeping up with write workloads. Dropped large messages probably mean that someone is performing ridiculous queries or writes against your system; find them and terminate them with extreme prejudice.
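The mismatch check above boils down to a simple ratio, sketched here as a hypothetical helper (the field names mirror the netstats output; the function itself is not part of any tool). On this node, (78 + 602) / 408271 is roughly 0.17% of read repairs finding inconsistent replicas, i.e. minor.

```python
def mismatch_ratio(attempted, blocking, background):
    """Fraction of read-repair attempts that found inconsistent replicas.

    attempted  -> "Attempted" from Read Repair Statistics
    blocking   -> "Mismatch (Blocking)"
    background -> "Mismatch (Background)"
    """
    if attempted == 0:
        return 0.0
    return (blocking + background) / attempted
```

What counts as "large" is a judgment call, but a ratio creeping toward a few percent is worth investigating.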
  • 18. >nodetool cfstats Rounding out the top 3 diagnostic commands is >nodetool cfstats, or more verbosely: nodetool column family statistics. It is a large listing detailing statistics for each table in your cluster; for brevity’s sake let’s take a look at one table’s output from cfstats… Table: rollups60 SSTable count: 10 Space used (live): 1757632985 Space used (total): 1757632985 Space used by snapshots (total): 0 Off heap memory used (total): 520044 SSTable Compression Ratio: 0.5405234880604174 Number of keys (estimate): 14317 Memtable cell count: 1251073 Memtable data size: 57091879 Memtable off heap memory used: 0 Memtable switch count: 2 Local read count: 211506 Local read latency: 0.923 ms Local write count: 18096351 Local write latency: 0.028 ms Pending flushes: 0 Bloom filter false positives: 0 Bloom filter false ratio: 0.00000 Bloom filter space used: 89280 Bloom filter off heap memory used: 89200 Index summary off heap memory used: 38420 Compression metadata off heap memory used: 392424 Compacted partition minimum bytes: 5723 Compacted partition maximum bytes: 2816159 Compacted partition mean bytes: 47670 Average live cells per slice (last five minutes): 2.7963433445814063 Maximum live cells per slice (last five minutes): 3 Average tombstones per slice (last five minutes): 1.0 Maximum tombstones per slice (last five minutes): 1
  • 19. >nodetool cfstats There is a lot of useful information here, but at a glance there are a couple of key metrics...
  • 20. >nodetool cfstats SSTable count. The number of SSTables that make up this table on this node. This should be in the tens to possibly hundreds; anything higher usually means there are problems with compaction on the node, and compaction problems are usually caused by too many writes for the underlying I/O capability of the node.
  • 21. >nodetool cfstats Number of keys (estimate). The number of partition keys for this table on this node. If the table holds a large amount of data on this node but the key count is very low, it usually means there is a data modelling issue... more on this later.
  • 22. >nodetool cfstats Local read count, Local write count. Interesting on their own, but more interesting when viewed together: you can see there are far more writes than reads on this cluster, i.e. the workload is very heavily write-oriented. In fact, running the calculation, there are 85 writes for every read! One caveat: we do not know 1) how long the node has been up, or 2) whether traffic peaks during the day, so we may have missed read traffic that would alter the ratio.
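The 85:1 figure comes straight from the two counters on this slide; integer division of the write count by the read count reproduces it:

```python
# Counters taken from the cfstats output above.
local_write_count = 18096351  # Local write count
local_read_count = 211506     # Local read count

writes_per_read = local_write_count // local_read_count
print(writes_per_read)  # → 85
```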
  • 23. >nodetool cfstats Local read latency, Local write latency. The latencies here are quite good. Writes are faster than reads in C*, which is what we would expect, and with reads under 1 ms this is a good result. If you start to see large read latencies, investigate whether large queries are running or whether there are I/O issues on the node at the hardware level.
  • 24. >nodetool cfstats Compacted partition maximum bytes. The largest amount of data under an individual partition key on this node, in this case 2.8 MB, which is good. You really want to keep this number under 100 MB; some say 1 GB, but you would really need to know what you’re doing to go to 1 GB. If you see values over a couple of hundred MB here, you may have a data modelling issue.
  • 25. >nodetool cfstats Compacted partition mean bytes. The average amount of data under a partition key on this node. You really want to keep this number under 100 MB; if you see large values here, you know you have a data modelling issue.
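The two partition-size checks above can be sketched as a tiny helper. The 100 MB threshold is the rule of thumb from these slides, not a hard Cassandra limit, and the function itself is hypothetical:

```python
PARTITION_WARN_BYTES = 100 * 1024 * 1024  # ~100 MB rule of thumb

def partition_sizes_ok(mean_bytes, max_bytes, limit=PARTITION_WARN_BYTES):
    """True when both the mean and the maximum compacted partition
    sizes are under the guideline; False suggests a modelling issue."""
    return mean_bytes < limit and max_bytes < limit
```

With the values on this slide (mean 47670, max 2816159) the check passes comfortably.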
  • 26. >nodetool cfstats Average live cells per slice. A measure of how much data you are pulling back in the average query (SELECT). Pulling tens or hundreds of cells (values) is fine; even thousands of cells on average is fine if that’s what you intended, but if it’s not what you intended your solution to do, then you might want to look at who is doing lazy SELECT * queries on your cluster! Be aware that larger queries are going to increase read latency significantly.
  • 27. >nodetool cfstats Maximum live cells per slice. Self-explanatory: the largest query seen in the last 5 minutes.
  • 28. >nodetool cfstats Average tombstones per slice. Tombstones are not returned in queries, but they have to be read off disk and filtered through the JVM, so they can add significant relative overhead to a query. If you are pulling back 1 live cell and 100 tombstones in a query, it’s going to impact your performance. Tombstones are the result of deletes, and deletes need to be very carefully managed and modelled in C*.
  • 29. >nodetool cfstats Maximum tombstones per slice. Self-explanatory: the largest number of tombstones seen in a query in the last 5 minutes.
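The "1 live cell and 100 tombstones" overhead described above is just the ratio of the two per-slice metrics; here is a hypothetical helper that computes tombstones scanned per live cell returned:

```python
def tombstone_overhead(live_per_slice, tombstones_per_slice):
    """Tombstones read per live cell returned.

    A value of 100 means every useful cell dragged 100 dead ones
    through disk and the JVM; high values mean wasted I/O and GC work.
    """
    if live_per_slice == 0:
        # Nothing live returned at all: every tombstone is pure overhead.
        return float(tombstones_per_slice)
    return tombstones_per_slice / live_per_slice
```

For this table (average 2.796 live cells, 1.0 tombstone per slice) the overhead is about 0.36 tombstones per live cell, which is benign.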
  • 30. Summary so far... That rounds out the top 3 diagnostic nodetool commands in Apache Cassandra: ● nodetool tpstats ● nodetool netstats ● nodetool cfstats With those 3 commands you can get a very good grasp of the health of a node and its possible issues. If you then see a pattern cluster-wide, you know you have a general issue (usually workload); if however you only see poor health on a single node, it’s probably* time to start looking at hardware as the culprit. *I say probably because there are circumstances where a hot partition on a single node can get hammered with requests; the times I have seen this are where someone has accidentally turned a tool against C* that focuses on a single partition (thanks, security guy).
  • 32. >system.log On package installs it lives in /var/log/cassandra. What to look for: ● Exceptions ● GC events ● Other nodes going UP and DOWN in gossip ● Dropped messages ● WARNs on large partitions / wide rows ● Tombstone warnings ● Repair session failures ● Compactions with large amounts of sstables in them ● Startup problems and warnings ● Topology warnings
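A first pass over system.log for the items above can be a simple pattern count. This is an illustrative sketch, not an official tool: the patterns are deliberately loose (GCInspector is the class Cassandra uses to log GC pauses; the others just catch the obvious keywords), so treat hits as pointers to lines worth reading, not as a verdict.

```python
import re

# Loose, illustrative patterns for the checklist above.
PATTERNS = {
    "exception": re.compile(r"Exception"),
    "gc_event": re.compile(r"GCInspector"),
    "dropped": re.compile(r"dropped", re.IGNORECASE),
    "tombstone_warn": re.compile(r"tombstone", re.IGNORECASE),
}

def scan_log(lines):
    """Count how many log lines match each pattern of interest."""
    hits = {name: 0 for name in PATTERNS}
    for line in lines:
        for name, pat in PATTERNS.items():
            if pat.search(line):
                hits[name] += 1
    return hits
```

Feed it the file with `scan_log(open("/var/log/cassandra/system.log"))` and chase anything nonzero.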
  • 34. JMX Cassandra exposes its metrics via MBeans; here you see JConsole connected to a Cassandra node, listing all the MBeans available for interrogation. These JMX MBeans can be instrumented from Java and Python interfaces, plus some commercial products. DataStax uses these same MBeans to instrument OpsCenter.
  • 35. JMX Cassandra exposes its metrics via MBeans; here you see JConsole connected to a Cassandra node, listing all the MBeans available for interrogation. These JMX MBeans can be instrumented from Java and Python interfaces, plus some commercial products. A list of alternatives to JConsole is here: JMX Clients with Apache Cassandra
  • 36. JMX Invoking an MBean in Java This is sample code for a simple method call against an MBean with no return value; in a useful application you would return data and present the result on screen or store it for analysis. This code was taken from the following link for educational and training purposes and all copyright belongs to the respective owners: http://stackoverflow.com/questions/16583859/execute-a-method-with-jmx-without-jconsole

import javax.management.*;
import javax.management.remote.*;
import com.sun.messaging.AdminConnectionFactory;
import com.sun.messaging.jms.management.server.*;

public class InvokeOp {
    public static void main(String[] args) {
        try {
            // Create administration connection factory
            AdminConnectionFactory acf = new AdminConnectionFactory();
            // Get JMX connector, supplying user name and password
            JMXConnector jmxc = acf.createConnection("AliBaba", "sesame");
            // Get MBean server connection
            MBeanServerConnection mbsc = jmxc.getMBeanServerConnection();
            // Create object name
            ObjectName serviceConfigName = MQObjectName.createServiceConfig("jms");
            // Invoke operation
            mbsc.invoke(serviceConfigName, ServiceOperations.PAUSE, null, null);
            // Close JMX connector
            jmxc.close();
        } catch (Exception e) {
            System.out.println("Exception occurred: " + e.toString());
            e.printStackTrace();
        }
    }
}
  • 37. JMX Invoking an MBean in Jython, Python running on the Java JVM. This code was taken from the following link for educational and training purposes and all copyright belongs to the respective owners: https://egkatzioura.com/2014/09/22/connecting-to-jmx-through-jython/

from javax.management.remote import JMXConnector
from javax.management.remote import JMXConnectorFactory
from javax.management.remote import JMXServiceURL
from javax.management import MBeanServerConnection
from javax.management import MBeanInfo
from javax.management import ObjectName
from java.lang import String
from jarray import array
import sys

if __name__ == '__main__':
    if len(sys.argv) > 5:
        serverUrl = sys.argv[1]
        username = sys.argv[2]
        password = sys.argv[3]
        beanName = sys.argv[4]
        action = sys.argv[5]
    else:
        sys.exit(-1)
    credentials = array([username, password], String)
    environment = {JMXConnector.CREDENTIALS: credentials}
    jmxServiceUrl = JMXServiceURL('service:jmx:rmi:///jndi/rmi://' + serverUrl + ':9999/jmxrmi')
    jmxConnector = JMXConnectorFactory.connect(jmxServiceUrl, environment)
    mBeanServerConnection = jmxConnector.getMBeanServerConnection()
    objectName = ObjectName(beanName)
    mBeanServerConnection.invoke(objectName, action, None, None)
    jmxConnector.close()
  • 38. JMX Invoking an MBean via Jolokia, an HTTP/JSON bridge to JMX: https://jolokia.org/ This approach is a little more complex, as agents need to be installed on nodes, but it opens JMX up to non-JVM clients such as Python. (The sample below, from the Jolokia tutorial, uses its Java client; there are some Python JMX libraries as well, but I have not used them so cannot vouch for them.) This code was taken from the following link for educational and training purposes and all copyright belongs to the respective owners: https://jolokia.org/tutorial.html

import org.jolokia.client.*;
import org.jolokia.client.request.*;
import java.util.Map;

public class JolokiaDemo {
    public static void main(String[] args) throws Exception {
        J4pClient j4pClient = new J4pClient("http://localhost:8080/jolokia");
        J4pReadRequest req = new J4pReadRequest("java.lang:type=Memory", "HeapMemoryUsage");
        J4pReadResponse resp = j4pClient.execute(req);
        Map<String, Long> vals = resp.getValue();
        long used = vals.get("used");
        long max = vals.get("max");
        int usage = (int) (used * 100 / max);
        System.out.println("Memory usage: used: " + used + " / max: " + max + " = " + usage + "%");
    }
}
  • 39. JMX + Node.js The jmx npm package: https://www.npmjs.com/package/jmx Can’t vouch for this one, but Node.js is a great way to serve JavaScript directly into a GUI; the Meteor project is also an excellent pub/sub/push system built on Node.js that would make a great C* Ops GUI: https://www.meteor.com/ This code was taken from the following link for educational and training purposes and all copyright belongs to the respective owners: https://www.npmjs.com/package/jmx

var jmx = require("jmx");

client = jmx.createClient({
    host: "localhost", // optional
    port: 3000
});
client.connect();
client.on("connect", function() {
    client.getAttribute("java.lang:type=Memory", "HeapMemoryUsage", function(data) {
        var used = data.getSync('used');
        console.log("HeapMemoryUsage used: " + used.longValue);
        // console.log(data.toString());
    });
    client.setAttribute("java.lang:type=Memory", "Verbose", true, function() {
        console.log("Memory verbose on"); // callback is optional
    });
    client.invoke("java.lang:type=Memory", "gc", [], function(data) {
        console.log("gc() done");
    });
});
  • 40. JMX + Node.js The jolokia npm package: https://www.npmjs.com/package/jolokia Can’t vouch for this one either, but Node.js is a great way to serve JavaScript directly into a GUI; the Meteor project is also an excellent pub/sub/push system built on Node.js that would make a great C* Ops GUI: https://www.meteor.com/ This code was taken from the following link for educational and training purposes and all copyright belongs to the respective owners: https://www.npmjs.com/package/jolokia

// In Node.js or using Browserify
var Jolokia = require('jolokia');

// In browser
var Jolokia = window.Jolokia;

// Or using RequireJS
require(['./path/to/jolokia'], function(Jolokia) {
    // code below
});

var jolokia = new Jolokia({
    url: '/jmx',    // use full url when in Node.js environment
    method: 'post', // force specific HTTP method
});

jolokia.list().then(function(value) {
    // do something with list of JMX domains
}, function(error) {
    // handle error
});