2. ● Capturing application performance data
○ … and using it to improve resource usage
● Can be classified into
○ Execution Profiling
Where is the CPU time being spent
○ Memory Profiling
■ Allocation profiling
Which code is allocating the most and of which type
■ Heap profiling
Which objects are retained and who is allocating them
○ Latency Profiling
What is causing the application to do ‘nothing’
■ Wallclock profiling
■ Lock profiling
■ (Synchronous) I/O profiling
■ Syscall profiling
Profiling Is …
3. ● Sampling profiling
○ collects random samples
○ creates a statistical representation of the process behaviour
○ light-weight
○ does not provide exact durations or an exact call graph
● Tracing profiling
○ traces every method invocation exactly, with stack trace and timing
○ highly intrusive
■ overhead
■ JIT and memory management interference
○ provides exact call-graph and duration information
Sampling vs. Tracing Profiling
4. ● Good enough results
● Acceptable overhead
● In practice ‘profiling’ == ‘sampling profiling’
Sampling Profiling!
5. ● System deployments are complex
○ Cloud, K8s
● Profiling in isolation is not enough
○ Same-service, multiple-instances
○ Same-service, multiple envs
○ How to correlate and distinguish?
● Enter APM - Application Performance Management
○ Combined tracing/profiling
■ Tracing provides ‘coarse’ information about operations
■ Profiling fills in the gory details about the code execution
○ User in control of what is traced
■ Profiles scoped to traces allow causal analysis
○ Frame level information exposed by profiler
■ Can be used to drive a debug session via dynamic instrumentation
Profiling In Cloud
7. JVM Profiling Support
- JMX
- A complex management and observability framework
- Since 2003, JSR 160
- Easily used from Java
- JVMTI
- Low level tooling interface
- Since 2004, JSR 163
- Requires native agent
- JFR
- One-stop solution for JVM (and application) observability
- In OpenJDK since 2018 (JDK 11; backported to JDK 8 in 2020, update 262/272)
- No special agent required
- AsyncGetCallTrace (ASGCT)
- A ‘special’ way to get ‘raw’ stacktrace
- Introduced for Sun One Studio in 2004
- Requires native agent and custom profiling infrastructure
8. JMX
- Execution profiling
- GetAllStackTraces
- Safe-point biased
- Overhead grows with number of sampled threads
- Obsolete method
- Ubiquitous
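The JMX approach above can be sketched in a few lines of plain Java. This is only an illustration, assuming the standard `ThreadMXBean.dumpAllThreads` call as the sampling primitive; the class name `JmxSampler` and its `sample` method are made up for this sketch. Every dump is taken at a safepoint and covers all live threads, which is where the bias and the per-thread overhead come from.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of any profiler: a naive JMX-based execution sampler.
final class JmxSampler {
    /** Takes `samples` thread dumps, `intervalMillis` apart, and counts the top frame of every RUNNABLE thread. */
    static Map<String, Integer> sample(int samples, long intervalMillis) {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        Map<String, Integer> hits = new HashMap<>();
        for (int i = 0; i < samples; i++) {
            // One dump walks every live thread at a safepoint.
            for (ThreadInfo info : threads.dumpAllThreads(false, false)) {
                StackTraceElement[] stack = info.getStackTrace();
                if (info.getThreadState() == Thread.State.RUNNABLE && stack.length > 0) {
                    String frame = stack[0].getClassName() + "." + stack[0].getMethodName();
                    hits.merge(frame, 1, Integer::sum);
                }
            }
            try {
                Thread.sleep(intervalMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // stop sampling early, keep the interrupt flag
                break;
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Print the five most frequently observed top frames.
        sample(50, 10).entrySet().stream()
            .sorted(Map.Entry.<String, Integer>comparingByValue().reversed())
            .limit(5)
            .forEach(e -> System.out.println(e.getValue() + "  " + e.getKey()));
    }
}
```

Note that the sampling thread shows up in its own dumps, so a real tool would filter it out.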
9. JVMTI
- Focused on tracing profiling
- very high overhead
- Execution Sampling
- GetAllStackTraces or GetStackTrace
- Frame reference via jmethodID validity issues
- becomes invalid if parent class unloaded
- can’t force strong refs to all classes in stacktrace atomically
- Safe-point biased (as JMX)
- Allocation sampling since JDK 11
- JEP 331
- Not biased towards TLAB size
- Modern sampler with known statistical properties
- Samples can be ‘upscaled’ to estimates of real allocation sizes
- Profiling support is very 2000-ish
10. JFR
- JDK Flight Recorder
- Low overhead observability framework
- Supports all profiling modes
- Execution Sampling
- low overhead
- avoids safe-point bias
- not a ‘true’ CPU profiler
- sampler driven by wallclock interval
- failed samples are not reported
- separate sampler thread - possibility of starvation
- non-trivial upscaling to CPU time per sample
- Allocation Sampling
- low overhead
- biased on TLAB size
- non-uniform sampling = no easy upscaling
- Heap Profiling
- biased on TLAB size
- can collect reference-chains (light-weight heapdump)
- Lock Profiling
- Thresholded on minimum blocking duration to report
- Prevents swamping the recording with lock events
- Makes profiler blind to latency induced by very many short lock events
- Syscall Profiling
- Kind of: a wallclock profile for threads executing JNI native code
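As a rough illustration of two of the knobs mentioned above, the sketch below starts a recording through the public `jdk.jfr` API, sets a wall-clock period for `jdk.ExecutionSample`, and a minimum blocking duration for `jdk.JavaMonitorEnter`. The class name `JfrProfilingSketch`, the 10 ms values and the output location are arbitrary choices for this example.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.Duration;
import jdk.jfr.Recording;

// Hypothetical helper: records execution samples and lock events around a workload.
final class JfrProfilingSketch {
    /** Runs `workload` under a JFR recording and returns the path of the dumped .jfr file. */
    static Path record(Runnable workload) {
        try (Recording recording = new Recording()) {
            // Execution sampler: the period is wall-clock time, not consumed CPU time.
            recording.enable("jdk.ExecutionSample").withPeriod(Duration.ofMillis(10));
            // Lock profiling: monitor waits shorter than the threshold are never recorded.
            recording.enable("jdk.JavaMonitorEnter").withThreshold(Duration.ofMillis(10));
            recording.start();
            workload.run();
            recording.stop();
            Path file = Files.createTempDirectory("jfr-sketch").resolve("profile.jfr");
            recording.dump(file);
            return file;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path file = record(() -> {
            long sum = 0;
            for (long i = 0; i < 200_000_000L; i++) sum += i * 31;
            System.out.println("checksum " + sum);
        });
        System.out.println("Recording written to " + file);
    }
}
```

The resulting file can be inspected with `jfr print` or JDK Mission Control. With a 10 ms threshold, a workload that takes millions of sub-millisecond locks would produce no lock events at all, which is the blind spot described above.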
11. AsyncGetCallTrace (ASGCT)
- ‘Unofficial’ API to get non-safepoint-biased stacktraces
- Added for Sun ONE Studio many years ago
- Not really maintained
- Lurking bugs can crash your JVM
- Can be called from signal handler -> stack may be inconsistent
- Some have been fixed
- Innocent-looking methods mutating global state
- Asserts and guards for invariants that do not hold when using ASGCT
- Still, the foundation of almost all 3rd party Java profilers
12. Can I Just Use JFR?
- TL;DR - almost
- There are still some parts missing, provided by 3rd party profilers
- ‘Proper’ CPU profiler
- driven by CPU time rather than wallclock time
- Non-biased allocation profiler
- JFR allocation sampler biases on TLAB size
- Non-biased heap profiler
- trading-off non-biased nature for reference-chains
- Profiling context
- required for tracer-profiler integration (think OTel)
- labelling events by context
- guarding events by context
- e.g. instead of a threshold, use the presence of context
- JFR is currently very closed to enhancements
- Adding support required for contemporary profiling needs is excruciatingly slow
13. Really, Can I Just Use JFR?
● Yes! If the following features are implemented
○ [Proper] CPU Profiler
○ Profiling Context
● Having an API to request event emission from native would be great!
○ Custom sampling policies
○ Integration with perf events (woohoo!)
○ eBPF, anyone?
○ Prototyping concepts in 3rd party code before adding to JFR core
14. Improved JFR CPU Profiler
● Use CPU time based sampler driver (perf_event_open, timer_create)
○ Subject to availability
■ Prefer perf_event_open, if available
■ Fall back to timer_create, if available
■ If not on Linux, or if neither perf_event_open nor timer_create is available, fall back to the dedicated sampler thread
○ Alternatively, provide a way to request an ExecutionSample event from a native signal handler
● Make the stacktrace acquisition safe to run in signal-handler
○ JEP 435: Asynchronous Stack Trace VM API
■ Samples recorded at the exact PC
■ But the stack is walked only at the method-exit safe-point (credits to Erik Oesterlund for this idea!)
■ Johannes Bechberger making great progress
15. JFR Context
● What is context?
○ Trace ID
○ REST endpoint
○ Request URL
○ … and anything else that allows scoping JFR events
● Start simple
○ Context is attached to a (virtual) thread
○ Context value is a plain string
○ Finite small number of context values
○ Context values are represented as augmented event fields
○ No automatic context propagation
■ It is up to the API user to take care of continuations carrying the right context
● There is prior art, e.g. in Go
○ Profiler Labels
○ It is a first-class runtime citizen
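To make the "start simple" idea concrete, here is a minimal sketch that approximates it with today's public API: a plain-string context held in a `ThreadLocal` and copied into a field of an application-defined JFR event. Everything named here (`ContextSketch`, the `sketch.ContextualWork` event, `runTagged`) is hypothetical. It only covers custom events; built-in events such as `jdk.ExecutionSample` cannot be augmented this way, which is exactly the gap the proposal is about.

```java
import jdk.jfr.Event;
import jdk.jfr.Label;
import jdk.jfr.Name;

// Hypothetical sketch: thread-attached string context stored in a custom JFR event field.
final class ContextSketch {
    // The context lives on the current thread; nothing propagates it to other threads.
    private static final ThreadLocal<String> CONTEXT = new ThreadLocal<>();

    @Name("sketch.ContextualWork") // made-up event name
    @Label("Contextual Work")
    static final class ContextualWork extends Event {
        @Label("Context")
        String context;
    }

    static void setContext(String value) { CONTEXT.set(value); }

    static void clearContext() { CONTEXT.remove(); }

    static String currentContext() { return CONTEXT.get(); }

    /** Runs `work` and emits one event carrying whatever context the thread has at that point. */
    static void runTagged(Runnable work) {
        ContextualWork event = new ContextualWork();
        event.begin();
        try {
            work.run();
        } finally {
            event.context = CONTEXT.get();
            event.commit(); // a no-op unless a recording has this event enabled
        }
    }

    public static void main(String[] args) {
        setContext("GET /orders");
        runTagged(() -> System.out.println("handling " + currentContext()));
        clearContext();
    }
}
```

Because the context is thread-local, handing work to another thread loses it unless the caller copies the value over explicitly, matching the "no automatic context propagation" point above.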
16. JFR Context Alternatives
● We (Datadog) tried to work around the lack of context by
○ Special ‘Context’ events
■ Event spans time between context set and context unset
■ Huge amount of such events for reactive/async apps
■ Gets very complex when tracking more than one context attribute
■ Thresholding does not help as it leads to unpredictable context loss
■ Difficult to correlate with the rest of the data that can be sampled
○ Special ‘Context Change’ events
■ Each event represents state transition
■ Easier to correlate with potentially sampled data
■ The number of events turned out to be unbearable (millions per minute)
17. External Context Implementation
● Implement the context in a separate profiler
● Datadog profiler has such an implementation
● It comes at the cost of
○ Replicating the JFR writer implementation
○ Replicating several JFR provided events
○ Missing context for low-level events, such as most of the thread-halting events (ThreadPark, MonitorWait, etc.)
○ Relying on ASGCT which may crash the profiled app
● Still, the feature is loved by our customers for the increased clarity of the profiling data
19. Datadog Profiling Context
- Context propagation
- Implemented in Java tracer
- Context associated with a unit of work
- Independent of executing thread
- Context persistence
- Implemented in the profiler agent
- Store context in JFR events
- Easy and fast Java<->Native interop is mandatory
- No JNI calls, please!
- Shared memory buffer
- Relying on Java and native side being tightly coupled
- Tags are plain strings
- Dictionary-encoded
- No custom types
- Semi-custom context
- Capped at ten custom tags
- Custom tag types/names
- Must be defined before profiler is started
- Stored in the JFR recording
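Dictionary encoding of string tags could look roughly like the sketch below: each distinct string is assigned a small integer once, and only the integer is stored per event. This is an assumption-laden illustration, not the Datadog implementation; the `TagDictionary` class and the choice to reserve id 0 for "no value" are made up here.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of dictionary-encoding tag strings into int ids.
final class TagDictionary {
    private final ConcurrentHashMap<String, Integer> ids = new ConcurrentHashMap<>();
    private final ConcurrentHashMap<Integer, String> values = new ConcurrentHashMap<>();
    private final AtomicInteger nextId = new AtomicInteger(1); // 0 is reserved for "no value"

    /** Returns the id for `tag`, assigning a new one the first time the string is seen. */
    int encode(String tag) {
        return ids.computeIfAbsent(tag, t -> {
            int id = nextId.getAndIncrement();
            values.put(id, t);
            return id;
        });
    }

    /** Returns the string for `id`, or null if the id was never assigned. */
    String decode(int id) {
        return values.get(id);
    }

    public static void main(String[] args) {
        TagDictionary dictionary = new TagDictionary();
        int first = dictionary.encode("GET /orders");
        int again = dictionary.encode("GET /orders");
        System.out.println(first + " " + again + " " + dictionary.decode(first));
    }
}
```

The id-to-string table would have to be written into the recording once so that a reader can resolve the integers back to tag values.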
20. Shared Memory Context
- One context per thread
- Sparse thread-page map
- Static size
- Efficient memory layout
- 64 bytes to match the common x64 cache line size
- Checksum
- Used to detect tearing, partial writes
- 64 bit/8 bytes
- Context Content
- Provides 10 slots (currently)
- Each slot is 4 bytes
- Possibly up to 14 slots (56 bytes)
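The layout arithmetic works out as 8 bytes of checksum plus 10 slots of 4 bytes, so 48 of the 64 bytes are used, and 8 + 14 × 4 = 64 is where the 14-slot ceiling comes from. The sketch below models one such record on top of a `ByteBuffer`. It is a simplified illustration: the class name, the checksum function and the write ordering are assumptions, the thread-to-record mapping is left out, and a real cross-language implementation would also need explicit memory fences.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical sketch of a 64-byte context record: [8-byte checksum][10 x 4-byte slots][spare].
final class ContextRecord {
    static final int RECORD_SIZE = 64;  // one common x64 cache line
    static final int CHECKSUM_SIZE = 8;
    static final int SLOTS = 10;

    private final ByteBuffer page;
    private final int base;

    ContextRecord(ByteBuffer page, int index) {
        if (index < 0 || (index + 1) * RECORD_SIZE > page.capacity()) {
            throw new IndexOutOfBoundsException("record " + index);
        }
        this.page = page.order(ByteOrder.nativeOrder());
        this.base = index * RECORD_SIZE;
    }

    /** Made-up checksum: any function that changes when a slot changes would do for the sketch. */
    static long checksum(int[] slots) {
        long h = 0x9E3779B97F4A7C15L;
        for (int s : slots) h = (h ^ s) * 0x100000001B3L;
        return h;
    }

    /** Writer side: invalidate, store the slots, then publish the matching checksum. */
    void write(int[] slots) {
        page.putLong(base, 0L);
        for (int i = 0; i < SLOTS; i++) {
            page.putInt(base + CHECKSUM_SIZE + 4 * i, slots[i]);
        }
        page.putLong(base, checksum(slots));
    }

    /** Reader side: returns the slots, or null when the checksum does not match (torn or unset record). */
    int[] read() {
        long expected = page.getLong(base);
        int[] slots = new int[SLOTS];
        for (int i = 0; i < SLOTS; i++) {
            slots[i] = page.getInt(base + CHECKSUM_SIZE + 4 * i);
        }
        return checksum(slots) == expected ? slots : null;
    }

    public static void main(String[] args) {
        ContextRecord record = new ContextRecord(ByteBuffer.allocateDirect(4096), 3);
        record.write(new int[] {7, 1, 0, 0, 0, 0, 0, 0, 0, 0});
        System.out.println(java.util.Arrays.toString(record.read()));
    }
}
```

A sampler reading the record from a signal handler would simply drop the context when `read` reports a mismatch rather than retry.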