Java tuning for Knewton’s C* clusters
Lessons learned
Carlos Monroy
Knewton
Knewton
Leader in adaptive learning
- Partners with publishers and institutions in Europe, US, and Asia
- Provides unique recommendations to students based on
previous behavior.
- Advanced content ingestion, curation, and calibration
- Runs in AWS with many different storage backends
- Check us out: www.knewton.com/about/careers/
© DataStax, All Rights Reserved. 2
1 JVM tuning at Knewton
2 Updating memtable_allocation_type
3 Changing garbage collection strategy
Context
Like many startups, our company had to make tradeoffs in order to deliver the product rapidly:
- Technical debt.
- Silos and isolated efforts.
- Decisions based on gut and intuition.
One year ago:
- Different versions of Cassandra
- Multiple clients (Pycassa, Hector, Astyanax, DataStax)
- Huge challenge with backups and restores
Now:
- 99.98% database uptime
- The database is not a black box anymore
Successful initiatives
- In-house command line tools: https://github.com/Knewton/cassandra-toolbox/
- cassandra-toolbox python package
- distributed nodetool
- Separation of objects from heap memory (memtable_allocation_type)
- Customization of heap size allocation: https://tech.knewton.com/
- Update to Garbage First Garbage Collection (G1GC)
- Monitoring/alerting based on JMX metrics
Some less successful initiatives
- Monitoring and alerts based on Graphite graphs
- Too many resources to get an aggregate
- High incidence of false positives and false negatives
- GoCD
- Cloudwatch
memtable_allocation_type
Cassandra can keep memtables and key cache objects in native memory instead of on the JVM heap.
- Used for data structures that keep growing over time
- Options:
- heap_buffers
- default value before Cassandra 3.0
- all the objects are kept in the JVM heap memory
- offheap_buffers
- cell name and values are moved to DirectBuffer objects
- offheap_objects
- moves the entire cell off heap, leaving only a pointer
Update memtable_allocation_type
The cassandra-stress tool is a great starting point for validating database configuration changes.
But we needed to go the extra mile with an end-to-end test that would:
- involve the rest of the dev team
- demonstrate the positive impact of the change on the rest of the system
Test memtable_allocation_type update
Steps, in order:
- Update setting
- Load test (locust)
- Compile logs from C* and application
- Analysis with R
Also labelled in the diagram: response times, functional load tests
Update memtable - Criteria
End-to-end
- Response time
- Timeouts
- Errors
- Throughput
- CPU consumption
- Memory used
Cassandra specific
- Time spent for Garbage Collection
- Read and Write latencies
- Errors/Exceptions
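The end-to-end criteria above can be compared across settings with a small summary per load-test run. The following is a minimal Python sketch, not the R analysis used in the talk: the function name `summarize`, the `(response_ms, ok)` sample format, the 1000 ms timeout threshold, and all sample values are assumptions made up for illustration.

```python
import math
from statistics import mean


def summarize(samples, timeout_ms=1000.0):
    """Summarize one load-test run.

    samples: list of (response_ms, ok) tuples, one per request
             (hypothetical format, not the talk's actual log layout).
    """
    if not samples:
        raise ValueError("no samples to summarize")
    times = sorted(ms for ms, _ in samples)
    # Nearest-rank 95th percentile of the response times.
    rank = max(1, math.ceil(0.95 * len(times)))
    return {
        "requests": len(times),
        "mean_ms": mean(times),
        "p95_ms": times[rank - 1],
        "timeouts": sum(1 for ms in times if ms >= timeout_ms),
        "errors": sum(1 for _, ok in samples if not ok),
    }


# Made-up numbers, only to show the shape of a per-setting comparison.
runs = {
    "heap_buffers": [(120.0, True), (180.0, True), (1500.0, False), (200.0, True)],
    "offheap_buffers": [(100.0, True), (110.0, True), (130.0, True), (90.0, True)],
}
for setting, samples in runs.items():
    print(setting, summarize(samples))
```

Running the same summary for each memtable_allocation_type value keeps the comparison on identical terms; the Cassandra-specific criteria (GC time, latencies, exceptions) would come from the C* logs instead.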
Update memtable_allocation_type
[Figure: Time used for Garbage Collection. Panels: offheap_buffers, offheap_objects, heap_buffers. Comparing garbage collection times with different values for memtable_allocation_type.]
[Figure: Memory sizes. Panels: offheap_buffers, offheap_objects, heap_buffers. Comparing the sizes per generation space, before and after garbage collection.]
[Figure: GC phases. Panels: offheap_buffers, offheap_objects, heap_buffers. Comparing the behaviour of the garbage collection phases.]
memtable_allocation_type results
We are using offheap_buffers, as it showed:
- the lowest average response time for requests
- the lowest CPU usage
- the lowest number of threads created
- the lowest write latency
*Results may vary
Garbage First Garbage Collection (G1GC)
The G1 collector divides the heap into regions and uses multiple background threads to scan them.
It is named “Garbage First” (G1) because it gives preference to the regions that contain the most garbage.
This collector is turned on with the -XX:+UseG1GC flag.
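As a toy illustration of the "most garbage first" ordering described above, the Python sketch below ranks regions by their share of garbage and picks the top few. It is not how HotSpot implements G1: the real collector also weighs pause-time predictions and always collects young regions. The function `pick_regions`, the `(region_id, garbage_fraction)` tuples, and the sample values are hypothetical.

```python
def pick_regions(regions, max_regions):
    """Return the ids of the regions holding the most garbage.

    regions: list of (region_id, garbage_fraction) tuples, where
             garbage_fraction is between 0.0 and 1.0 (hypothetical model).
    max_regions: how many regions this toy collector may process at once.
    """
    # Sort so that the regions with the largest garbage share come first.
    ranked = sorted(regions, key=lambda region: region[1], reverse=True)
    return [region_id for region_id, _ in ranked[:max_regions]]


# Made-up heap with four regions; the two fullest of garbage are chosen.
heap = [("r0", 0.10), ("r1", 0.90), ("r2", 0.50), ("r3", 0.75)]
print(pick_regions(heap, max_regions=2))
```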
G1GC analysis
G1 has been available since April 2012 (JDK 7 update 4 and later).
The tools available for analyzing garbage collection logs either did not support G1 or could not interpret all the information from our servers:
- Netflix gcviz does not support the Garbage First (G1) strategy
- Jeff Taylor's post on Oracle’s developer blog proposes an initial approach for JDK 7
Test garbage collection
Steps, in order:
- Enable GC data collection
- Get a baseline
- Compile GC logs
- Analysis with R
G1GC Java arguments
Java Arguments as defined in cassandra-env.sh
-XX:+UseG1GC
-XX:G1RSetUpdatingPauseTimePercent=5
-XX:ParallelGCThreads=2
-XX:ConcGCThreads=2
-XX:+PrintGCDetails
-XX:+PrintGCDateStamps
-XX:+PrintHeapAtGC
-XX:+PrintTenuringDistribution
-XX:+PrintGCApplicationStoppedTime
-XX:+PrintPromotionFailure
-XX:PrintFLSStatistics=1
-Xloggc:/<valid path>/gc.log
-XX:+UseGCLogFileRotation
-XX:NumberOfGCLogFiles=10
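With -XX:+PrintGCApplicationStoppedTime enabled, a JDK 8 GC log contains lines reporting how long application threads were stopped. The talk's analysis was done in R; the Python sketch below only shows one way such lines could be pulled out and summarized. The function names and the sample log lines are illustrative assumptions, not output from Knewton's servers.

```python
import re
from statistics import mean

# Matches the stopped-time lines written by -XX:+PrintGCApplicationStoppedTime.
STOPPED = re.compile(
    r"Total time for which application threads were stopped: "
    r"(\d+\.\d+) seconds"
)


def stopped_times(lines):
    """Extract stop durations (in seconds) from GC log lines."""
    pauses = []
    for line in lines:
        match = STOPPED.search(line)
        if match:
            pauses.append(float(match.group(1)))
    return pauses


def summarize_pauses(lines):
    """Return count, total, max and mean of the stop durations."""
    pauses = stopped_times(lines)
    if not pauses:
        return {"count": 0, "total_s": 0.0, "max_s": 0.0, "mean_s": 0.0}
    return {
        "count": len(pauses),
        "total_s": sum(pauses),
        "max_s": max(pauses),
        "mean_s": mean(pauses),
    }


# Illustrative lines in the JDK 8 log style (values are made up).
sample = [
    "2016-09-07T10:00:01.123+0000: 12.345: Total time for which application "
    "threads were stopped: 0.0150000 seconds, Stopping threads took: 0.0001000 seconds",
    "2016-09-07T10:00:02.456+0000: 13.678: [GC pause (G1 Evacuation Pause) (young)",
    "2016-09-07T10:00:02.700+0000: 13.922: Total time for which application "
    "threads were stopped: 0.2000000 seconds, Stopping threads took: 0.0002000 seconds",
]
print(summarize_pauses(sample))
```

To run it against a real log, pass an open file object as `lines`; collecting the same summary before and after a configuration change gives the baseline comparison the workflow calls for.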
G1GC analysis - Heap size
[Figures: five heap size plots]
G1GC analysis - phases
[Figure: GC phases plot]
G1GC Analysis demo
Code
Garbage collection analysis: https://gist.github.com/roymontecutli/4cf5c97f03720e60825f414667c141da
Cassandra toolbox: https://github.com/Knewton/cassandra-toolbox
Conclusions
- Moving objects off the JVM heap can improve application performance when dealing with large data sets. You still need to find out which strategy (moving buffers or whole objects off heap) best suits your use case.
- Garbage collection can adversely affect the performance of a Cassandra cluster. Having tools to analyze its behaviour helps identify areas of impact and measure improvements.
- Configuration changes should always consider the system as a whole and involve all the teams.
Thanks
carlos@knewton.com

Lessons Learned on Java Tuning for Our Cassandra Clusters (Carlos Monroy, Knewton) | C* Summit 2016
