Java Profiling Future

Future-Proof JVM Profiling
Evolving the platform profiling support
Jaroslav Bachorik
Staff Software Engineer
Datadog

● Capturing the application's performance-related data
○ … and using it to improve resource usage
● Can be classified into
○ Execution Profiling
Where is the CPU time being spent
○ Memory Profiling
■ Allocation profiling
Which code is allocating the most and of which type
■ Heap profiling
Which objects are retained and who is allocating them
○ Latency Profiling
What is causing the application to do ‘nothing’
■ Wallclock profiling
■ Lock profiling
■ (Synchronous) I/O profiling
■ Syscall profiling
Profiling Is …

● Sampling profiling
○ collects random samples
○ creates a statistical representation of the process behaviour
○ light-weight
○ does not provide exact duration information and call-graph
● Tracing profiling
○ traces exact method invocations with stack trace and timing
○ highly intrusive
■ overhead
■ JIT and memory management interference
○ provides exact call-graph and duration information
Sampling vs. Tracing Profiling

● Good enough results
● Acceptable overhead
● In practice ‘profiling’ == ‘sampling profiling’
Sampling Profiling!

● System deployments are complex
○ Cloud, K8s
● Profiling in-isolation is not enough
○ Same-service, multiple-instances
○ Same-service, multiple envs
○ How to correlate and distinguish?
● Enter APM - Application Performance Management
○ Combined tracing/profiling
■ Tracing provides ‘coarse’ information about operations
■ Profiling fills in the gory details about the code execution
○ User in control of what is traced
■ Profiles scoped to traces allow causal analysis
○ Frame level information exposed by profiler
■ Can be used to drive debug session by dynamic instrumentation
Profiling In Cloud

JVM Profiling Support
- JMX
- A complex management and observability framework
- Since 2003, JSR 160
- Easily used from Java
- JVMTI
- Low level tooling interface
- Since 2004, JSR 163
- Requires native agent
- JFR
- One-stop solution for JVM (and application) observability
- In OpenJDK since 2018 (JDK 11, JEP 328; backported to JDK 8 in 2020, update 262/272)
- No special agent required
- AsyncGetCallTrace (ASGCT)
- A ‘special’ way to get ‘raw’ stacktrace
- Introduced for Sun One Studio in 2004
- Requires native agent and custom profiling infrastructure

JMX
- Execution profiling
- GetAllStackTraces
- Safe-point biased
- Overhead grows with number of sampled threads
- Obsolete method
- Ubiquitous
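
The JMX approach above can be sketched in a few lines of plain Java. This is a minimal illustration, not a production sampler: it uses `ThreadMXBean.dumpAllThreads` as the stack trace source, and the class name `JmxSampler` and its parameters are made up for this example. Every round walks all live threads, which is why the overhead grows with the thread count.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: a naive, safepoint-biased sampler built on JMX.
public class JmxSampler {

    /** Takes `rounds` thread dumps and counts how often each top frame is seen. */
    public static Map<String, Integer> sample(int rounds, long intervalMillis) {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        Map<String, Integer> hits = new HashMap<>();
        for (int i = 0; i < rounds; i++) {
            // One dump covers every live thread, so cost scales with thread count.
            for (ThreadInfo info : threads.dumpAllThreads(false, false)) {
                StackTraceElement[] stack = info.getStackTrace();
                if (stack.length == 0) {
                    continue; // no Java frames available for this thread
                }
                String top = stack[0].getClassName() + "." + stack[0].getMethodName();
                hits.merge(top, 1, Integer::sum);
            }
            try {
                Thread.sleep(intervalMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break; // stop sampling when interrupted
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        sample(20, 10).forEach((frame, n) -> System.out.println(n + "\t" + frame));
    }
}
```

The resulting histogram is only a statistical picture, and because the dumps are taken at safepoints it over-represents code that sits close to safepoint polls.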

JVMTI
- Focused on tracing profiling
- very high overhead
- Execution Sampling
- GetAllStackTraces or GetStackTrace
- Frame references via jmethodID have validity issues
- becomes invalid if parent class unloaded
- can’t force strong refs to all classes in stacktrace atomically
- Safe-point biased (as JMX)
- Allocation sampling since JDK 11
- JEP 331
- Not biased towards TLAB size
- Modern sampler with known statistical properties
- Samples can be ‘upscaled’ to estimates of real allocation sizes
- Profiling support is very 2000-ish
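
To make the ‘upscaling’ of allocation samples concrete, here is one common way to do it, written as a small Java sketch. It assumes the sampler behaves like a Poisson process over allocated bytes with a known mean interval, so an allocation of `size` bytes is sampled with probability `1 - exp(-size / interval)`. That model, the class name `AllocationUpscaler`, and the 512 KiB interval used in `main` are assumptions for illustration, not something stated on the slides.

```java
// Hypothetical helper: scales one allocation sample up to an estimate of real bytes.
public class AllocationUpscaler {

    /** Probability that an allocation of `size` bytes contains at least one sampling point. */
    public static double sampleProbability(long size, long meanIntervalBytes) {
        // 1 - exp(-x), computed via expm1 to stay accurate for very small x
        return -Math.expm1(-(double) size / meanIntervalBytes);
    }

    /** Estimated number of bytes represented by a single sample of the given size. */
    public static double upscaledBytes(long size, long meanIntervalBytes) {
        return size / sampleProbability(size, meanIntervalBytes);
    }

    public static void main(String[] args) {
        long interval = 512 * 1024; // assumed mean sampling interval in bytes
        for (long size : new long[] {16, 4096, interval, 64L * interval}) {
            System.out.printf("sampled %d B -> represents about %.0f B%n",
                    size, upscaledBytes(size, interval));
        }
    }
}
```

Under this model a tiny sampled object stands in for roughly one whole interval of allocations, while an object much larger than the interval is sampled almost every time and therefore represents little more than itself.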

JFR
- JDK Flight Recorder
- Low overhead observability framework
- Supports all profiling modes
- Execution Sampling
- low overhead
- avoids safe-point bias
- not a ‘true’ CPU profiler
- sampler driven by wallclock interval
- failed samples are not reported
- separate sampler thread - possibility of starvation
- non-trivial upscaling to CPU time per sample
- Allocation Sampling
- low overhead
- biased on TLAB size
- non-uniform sampling = no easy upscaling
- Heap Profiling
- biased on TLAB size
- can collect reference-chains (light-weight heapdump)
- Lock Profiling
- Thresholded on minimum blocking duration to report
- Prevents swamping the recording with lock events
- Makes profiler blind to latency induced by very many short lock events
- Syscall Profiling
- Kind of: a wallclock profile for threads executing JNI native code
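
As a rough illustration of the JFR modes listed above, the following sketch starts an in-process recording with the `jdk.jfr` API, enables execution sampling and lock profiling, and reads the events back. The class name `JfrQuickProfile`, the 10 ms period and the 10 ms lock threshold are arbitrary choices for this example. Note that the lock threshold is exactly the setting that hides very short lock events.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.Duration;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import jdk.jfr.Recording;
import jdk.jfr.consumer.RecordedEvent;
import jdk.jfr.consumer.RecordingFile;

// Hypothetical helper: records a workload with JFR and returns the captured events.
public class JfrQuickProfile {

    public static List<RecordedEvent> record(Runnable workload) {
        try {
            Path file = Files.createTempFile("sketch", ".jfr");
            try (Recording recording = new Recording()) {
                // Execution sampling, driven by a wallclock period
                recording.enable("jdk.ExecutionSample").withPeriod(Duration.ofMillis(10));
                // Lock profiling: contention shorter than the threshold is not recorded
                recording.enable("jdk.JavaMonitorEnter").withThreshold(Duration.ofMillis(10));
                recording.start();
                workload.run();
                recording.stop();
                recording.dump(file);
                return RecordingFile.readAllEvents(file);
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        List<RecordedEvent> events = record(() -> {
            long end = System.nanoTime() + Duration.ofMillis(300).toNanos();
            double sink = 0;
            while (System.nanoTime() < end) {
                sink += Math.sqrt(sink + 1); // busy work so the sampler has something to see
            }
        });
        Map<String, Long> byType = new TreeMap<>();
        for (RecordedEvent event : events) {
            byType.merge(event.getEventType().getName(), 1L, Long::sum);
        }
        byType.forEach((type, count) -> System.out.println(type + ": " + count));
    }
}
```

How many samples show up depends on the JVM and the machine; a short workload may produce only a handful, or none.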

AsyncGetCallTrace (ASGCT)
- ‘Unofficial’ API to get non-safepoint-biased stacktraces
- Added for Sun One Studio many years ago
- Not really maintained
- Lurking bugs can crash your JVM
- Can be called from signal handler -> stack may be inconsistent
- Some have been fixed
- Innocent-looking methods mutating global state
- Asserts and guards for invariants not valid when using ASGCT
- Still, the foundation of almost all 3rd party Java profilers

Can I Just Use JFR?
- TL;DR - almost
- There are still some parts missing, provided by 3rd party profilers
- ‘Proper’ CPU profiler
- driven by CPU time rather than wallclock time
- Non-biased allocation profiler
- JFR allocation sampler biases on TLAB size
- Non-biased heap profiler
- trading-off non-biased nature for reference-chains
- Profiling context
- required for tracer-profiler integration (think OTel)
- labelling events by context
- guarding events by context
- eg. instead of threshold use the presence of context
- JFR is currently very closed to enhancements
- Adding support required for contemporary profiling needs is excruciatingly slow
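
The ‘guarding events by context’ idea can be approximated today at the application level with a custom JFR event. The sketch below is only an emulation under stated assumptions: the event `example.ContextualWait`, the `CONTEXT` thread-local and the `timed` helper are all invented for this example, and nothing here changes how built-in JFR events behave. Instead of a duration threshold, the event is committed only when a context is present.

```java
import jdk.jfr.Event;
import jdk.jfr.Label;
import jdk.jfr.Name;

// Hypothetical sketch: commit a custom event only when a profiling context is set.
public class ContextGuardedEvents {

    @Name("example.ContextualWait")
    @Label("Contextual Wait")
    static class ContextualWait extends Event {
        @Label("Context")
        String context;
    }

    /** Stand-in for a real profiling context; here just one string per thread. */
    public static final ThreadLocal<String> CONTEXT = new ThreadLocal<>();

    /** Runs the call and reports whether an event was committed for it. */
    public static boolean timed(Runnable blockingCall) {
        ContextualWait event = new ContextualWait();
        event.begin();
        blockingCall.run();
        event.end();
        String context = CONTEXT.get();
        if (context == null) {
            return false; // guard: no context, no event, regardless of duration
        }
        event.context = context; // label the event with the context
        event.commit();          // a no-op when no recording is running
        return true;
    }

    public static void main(String[] args) {
        System.out.println("without context: " + timed(() -> { }));
        CONTEXT.set("trace-1");
        System.out.println("with context:    " + timed(() -> { }));
    }
}
```

With this guard even very short waits are kept when they belong to a traced operation, while untraced background activity produces no events at all.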

Really, Can I Just Use JFR?
● Yes! If the following features are implemented
○ [Proper] CPU Profiler
○ Profiling Context
● Having an API to request event emission from native would be great!
○ Custom sampling policies
○ Integration with perf events (woohoo!)
○ eBPF, anyone?
○ Prototyping concepts in 3rd party code before adding to JFR core

Improved JFR CPU Profiler
● Use CPU time based sampler driver (perf_event_open, timer_create)
○ Subject to availability
■ Prefer perf_event_open, if available
■ Fall-back to timer_create, if available
■ If not on Linux or neither perf_event_open nor timer_create is available, fall-back to the dedicated sampler thread
○ Alternatively, provide a way to request ExecutionSample event from a native signal handler
● Make the stacktrace acquisition safe to run in signal-handler
○ JEP 435: Asynchronous Stack Trace VM API
■ Samples recorded at the exact PC
■ But stack walked only on the method-exit safe-point (credits to Erik Oesterlund for this idea!)
■ Johannes Bechberger making great progress
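
The fall-back order for the sampler driver described on this slide boils down to a small decision function. The sketch below is written in Java purely for illustration (the real selection would live in native JVM code); the enum, the method and the boolean availability flags are hypothetical names, and how availability is actually probed is left out.

```java
// Hypothetical sketch of the proposed driver selection order.
public class SamplerDriverSelection {

    public enum Driver { PERF_EVENT_OPEN, TIMER_CREATE, SAMPLER_THREAD }

    public static Driver choose(boolean linux, boolean perfEventOpenUsable, boolean timerCreateUsable) {
        if (linux && perfEventOpenUsable) {
            return Driver.PERF_EVENT_OPEN;  // preferred CPU-time driver
        }
        if (linux && timerCreateUsable) {
            return Driver.TIMER_CREATE;     // CPU-time fall-back
        }
        return Driver.SAMPLER_THREAD;       // existing wallclock-driven sampler thread
    }

    public static void main(String[] args) {
        System.out.println(choose(true, false, true));
        System.out.println(choose(false, true, true));
    }
}
```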

JFR Context
● What is context?
○ Trace ID
○ REST endpoint
○ Request URL
○ … and anything else allowing to scope JFR events
● Start simple
○ Context is attached to a (virtual) thread
○ Context value is a plain string
○ Finite small number of context values
○ Context values are represented as augmented event fields
○ No automatic context propagation
■ It is up to the API user to take care of continuations carrying the right context
● There is prior art eg. in Go
○ Profiler Labels
○ It is a first-class runtime citizen
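
A minimal sketch of the ‘start simple’ model, assuming a thread-local store: a small fixed number of string attributes attached to the current thread, with propagation done by hand. The class `ThreadContext`, the cap of four attributes and the `carrying` helper are invented for this example and are not a proposed JFR API.

```java
// Hypothetical sketch: thread-attached context with manual propagation.
public final class ThreadContext {

    public static final int MAX_ATTRIBUTES = 4; // assumed small, finite cap

    private static final ThreadLocal<String[]> SLOTS =
            ThreadLocal.withInitial(() -> new String[MAX_ATTRIBUTES]);

    private ThreadContext() { }

    public static void set(int attribute, String value) {
        SLOTS.get()[attribute] = value;
    }

    public static String get(int attribute) {
        return SLOTS.get()[attribute];
    }

    /** Copies the current thread's context so it can be handed to another task. */
    public static String[] snapshot() {
        return SLOTS.get().clone();
    }

    public static void restore(String[] snapshot) {
        SLOTS.set(snapshot.clone());
    }

    /** No automatic propagation: the caller wraps the continuation explicitly. */
    public static Runnable carrying(Runnable task) {
        String[] captured = snapshot();
        return () -> {
            String[] previous = snapshot();
            restore(captured);
            try {
                task.run();
            } finally {
                restore(previous); // leave the executing thread as we found it
            }
        };
    }

    public static void main(String[] args) throws InterruptedException {
        set(0, "trace-42");
        Thread worker = new Thread(carrying(() -> System.out.println("worker sees " + get(0))));
        worker.start();
        worker.join();
    }
}
```

A sampler that reads these values when it captures an event could then attach them as extra event fields, which is the role the ‘augmented event fields’ play on the slide.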

JFR Context Alternatives
● We (Datadog) tried to work-around the lack of context by
○ Special ‘Context’ events
■ Event spans time between context set and context unset
■ Huge amount of such events for reactive/async apps
■ Getting very complex when tracking more than one context attribute
■ Thresholding does not help as it leads to unpredictable context loss
■ Difficult to correlate with the rest of the data that can be sampled
○ Special ‘Context Change’ events
■ Each event represents state transition
■ Easier to correlate with potentially sampled data
■ The amount of events turned out to be unbearable (millions per minute)

External Context Implementation
● Implement the context in a separate profiler
● Datadog profiler has such an implementation
● It comes at the cost of
○ Replicating the JFR writer implementation
○ Replicating several JFR provided events
○ Missing context for low-level events like most of the thread-halting events (ThreadPark, MonitorWait, etc.)
○ Relying on ASGCT which may crash the profiled app
● Still, the feature is loved by our customers for the increased clarity of the profiling data

TL;DR Datadog Profiling Context

Datadog Profiling Context
- Context propagation
- Implemented in Java tracer
- Context associated with a unit of work
- Independent of executing thread
- Context persistence
- Implemented in the profiler agent
- Store context in JFR events
- Easy and fast Java<->Native interop is mandatory
- No JNI calls, please!
- Shared memory buffer
- Relying on Java and native side being tightly coupled
- Tags are plain strings
- Dictionarized
- No custom types
- Semi-custom context
- Capped at ten custom tags
- Custom tag types/names
- Must be defined before profiler is started
- Stored in the JFR recording
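
‘Dictionarized’ tags can be pictured as a string-to-integer encoding: the hot path stores only a small integer per tag value, and the strings themselves are written once. The following is a generic sketch of that technique, not the Datadog implementation; the class `TagDictionary` and the convention that id 0 means ‘no value’ are assumptions.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: dictionary encoding of plain-string tag values.
public final class TagDictionary {

    private final ConcurrentHashMap<String, Integer> ids = new ConcurrentHashMap<>();
    private final ConcurrentHashMap<Integer, String> values = new ConcurrentHashMap<>();
    private final AtomicInteger next = new AtomicInteger(1); // 0 is reserved for "no value"

    /** Returns the stable id for a tag value, assigning a new one on first use. */
    public int encode(String value) {
        return ids.computeIfAbsent(value, v -> {
            int id = next.getAndIncrement();
            values.put(id, v);
            return id;
        });
    }

    /** Resolves an id back to its string, or null if the id is unknown. */
    public String decode(int id) {
        return values.get(id);
    }

    public static void main(String[] args) {
        TagDictionary endpoints = new TagDictionary();
        int users = endpoints.encode("GET /users");
        int orders = endpoints.encode("GET /orders");
        System.out.println(users + " -> " + endpoints.decode(users));
        System.out.println(orders + " -> " + endpoints.decode(orders));
    }
}
```

An integer id of this kind is what fits into a fixed-size slot, which is why custom types are off the table in such a design.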

Shared Memory Context
- One context per thread
- Sparse thread-page map
- Static size
- Efficient memory layout
- 64 bytes to match the common x64 cache line size
- Checksum
- Used to detect tearing, partial writes
- 64 bit/8 bytes
- Context Content
- Provides 10 slots (currently)
- Each slot is 4 bytes
- Possibly up to 14 slots (56 bytes)
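
The record layout described above (a 64-byte record per thread, an 8-byte checksum plus ten 4-byte slots) can be mocked up with a direct `ByteBuffer`. This is an illustrative sketch only: the class `ContextPage`, the dense per-thread indexing (instead of a sparse thread-page map), the placement of the checksum at offset 0 and the checksum function itself are assumptions, and the memory-ordering guarantees a real Java writer and native reader would need are not addressed.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical sketch: one 64-byte context record per thread in off-heap memory.
public final class ContextPage {

    static final int RECORD_SIZE = 64;  // one common x64 cache line
    static final int SLOTS_OFFSET = 8;  // slots follow the 8-byte checksum
    static final int SLOT_COUNT = 10;   // 10 slots x 4 bytes = 40 bytes

    private final ByteBuffer memory;

    public ContextPage(int threads) {
        memory = ByteBuffer.allocateDirect(threads * RECORD_SIZE).order(ByteOrder.nativeOrder());
    }

    /** Made-up 64-bit checksum; never 0, so an untouched record reads as invalid. */
    static long checksum(int[] slots) {
        long h = 0x9E3779B97F4A7C15L;
        for (int slot : slots) {
            h = (h ^ slot) * 0x100000001B3L;
        }
        return h;
    }

    public void write(int threadIndex, int[] slots) {
        int base = threadIndex * RECORD_SIZE;
        memory.putLong(base, 0L); // invalidate the record while it is being updated
        for (int i = 0; i < SLOT_COUNT; i++) {
            memory.putInt(base + SLOTS_OFFSET + 4 * i, slots[i]);
        }
        memory.putLong(base, checksum(slots)); // publish
    }

    /** Returns the slots, or null if the checksum reveals a torn or partial write. */
    public int[] read(int threadIndex) {
        int base = threadIndex * RECORD_SIZE;
        long stored = memory.getLong(base);
        int[] slots = new int[SLOT_COUNT];
        for (int i = 0; i < SLOT_COUNT; i++) {
            slots[i] = memory.getInt(base + SLOTS_OFFSET + 4 * i);
        }
        return stored == checksum(slots) ? slots : null;
    }

    /** Test hook: overwrites a single slot without updating the checksum. */
    void pokeSlot(int threadIndex, int slot, int value) {
        memory.putInt(threadIndex * RECORD_SIZE + SLOTS_OFFSET + 4 * slot, value);
    }
}
```

A reader that gets null simply drops the context for that sample, which is cheaper than locking on the hot path. The 16 bytes left over in each record are what would allow growing towards 14 slots.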

Shared Memory Context
[Diagram: the thread page map covers Thread 1, Thread 2, … Thread N; each thread has a 64-byte record (e.g. one cache line) holding a 64-bit checksum and the context data (10 slots, 40 bytes)]
