Contents
- Overview & Profiler Comparison
- Installation & Setup
- CPU Profiling
- Allocation Profiling
- Lock Profiling
- Wall-Clock Profiling
- Reading Flame Graphs
- IntelliJ IDEA Integration
- Common Profiling Workflows
Before attaching a profiler, understand what each tool is good at — they measure different things and have different overhead characteristics.
| Tool | Mechanism | Overhead | Best for |
|---|---|---|---|
| async-profiler | AsyncGetCallTrace + OS perf events; no safepoint bias | ~1–3% | CPU hot paths, allocation pressure, lock contention |
| Java Flight Recorder | JVM-internal ring buffer; built-in stack sampler that does not wait for safepoints (Java frames only) | <1% | Long-running continuous recording, GC + JIT + I/O events |
| VisualVM | JVMTI sampling or instrumentation | 2–20% | Developer workstations; GUI-driven investigation |
| YourKit / JProfiler | JVMTI + native agent | 5–50% (instrumentation) | Deep object allocation tracking, enterprise support |
Async-profiler is a native agent (
CPU profiling samples the call stack of every running thread at a fixed frequency (default 100 Hz). After collection, stacks are aggregated into a flame graph showing where CPU time is spent. This is the right mode when the application is slow but not obviously blocked on I/O or locks.
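The aggregation step can be sketched as follows. This is a minimal illustration, not async-profiler's implementation: the class name `FoldedStacks` and the sample data are hypothetical, and the output uses the semicolon-separated "collapsed stack" text format that flame graph tools commonly accept.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class FoldedStacks {

    // Collapse raw stack samples into "root;...;leaf" keys with a sample count.
    // Each sample lists frames from the outermost caller to the innermost frame.
    public static Map<String, Integer> fold(List<List<String>> samples) {
        Map<String, Integer> counts = new TreeMap<>(); // sorted keys, like the flame graph X axis
        for (List<String> stack : samples) {
            counts.merge(String.join(";", stack), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical samples; a real profiler would record these at roughly 100 Hz.
        List<List<String>> samples = List.of(
            List.of("main", "handleRequest", "parseJson"),
            List.of("main", "handleRequest", "parseJson"),
            List.of("main", "handleRequest", "queryDb"),
            List.of("main", "gcWork"));
        fold(samples).forEach((stack, n) -> System.out.println(stack + " " + n));
    }
}
```

Identical stacks merge into one entry, so a stack that was sampled twice as often ends up twice as wide in the rendered graph.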
Add these JVM flags to your application to enable frame pointers for more accurate native stack resolution:
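A minimal sketch for verifying the setup, assuming the flags in question are HotSpot's `-XX:+PreserveFramePointer` together with `-XX:+UnlockDiagnosticVMOptions -XX:+DebugNonSafepoints` (an assumption, since the flag list is not shown here). The hypothetical helper `missingFlags` reports which of them the running JVM was started without.

```java
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;

public class ProfilingFlagCheck {

    // Assumed flag set for illustration; adjust to whatever your setup requires.
    static final List<String> WANTED = List.of(
        "-XX:+PreserveFramePointer",
        "-XX:+UnlockDiagnosticVMOptions",
        "-XX:+DebugNonSafepoints");

    // Returns the wanted flags that are absent from the given JVM argument list.
    public static List<String> missingFlags(List<String> jvmArgs) {
        List<String> missing = new ArrayList<>();
        for (String flag : WANTED) {
            if (!jvmArgs.contains(flag)) {
                missing.add(flag);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // getInputArguments() returns the options passed to the JVM, not the program arguments.
        List<String> jvmArgs = ManagementFactory.getRuntimeMXBean().getInputArguments();
        System.out.println("Missing profiling flags: " + missingFlags(jvmArgs));
    }
}
```

Running this at application start-up is one way to catch a deployment that silently dropped the flags.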
Allocation profiling intercepts object allocations using TLAB (Thread-Local Allocation Buffer) events — specifically when a thread's TLAB fills and a new one is needed. This captures allocations proportional to their byte size, making it effective at finding classes that allocate enormous amounts of short-lived garbage (triggering frequent Young GC).
The resulting flame graph shows call stacks weighted by bytes allocated, not by time. Frames at the top are the allocation sites; wider frames allocate more. Look for:
- Unexpected byte[] or char[] allocations from serialisation, logging, or string formatting in hot paths
- Builder objects (StringBuilder, ArrayList) created inside loops that could be pre-allocated or reused
- Boxed primitives (Integer, Long) in collections that should use primitive arrays or Eclipse Collections
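The patterns above might look like this in code. It is an illustrative sketch with hypothetical method names; the JIT can sometimes remove such allocations through escape analysis, so a profile should confirm the problem before any rewrite.

```java
import java.util.ArrayList;
import java.util.List;

public class AllocationPatterns {

    // Allocation-heavy: each element is boxed to a Long, and the list's
    // backing array is reallocated as it grows.
    public static long sumBoxed(int n) {
        List<Long> values = new ArrayList<>();
        for (long i = 0; i < n; i++) {
            values.add(i); // autoboxing; values outside the small Long cache allocate
        }
        long sum = 0;
        for (Long v : values) {
            sum += v;
        }
        return sum;
    }

    // Lower allocation: one primitive array, no boxing.
    public static long sumPrimitive(int n) {
        long[] values = new long[n];
        for (int i = 0; i < n; i++) {
            values[i] = i;
        }
        long sum = 0;
        for (long v : values) {
            sum += v;
        }
        return sum;
    }

    // Reuses one StringBuilder across iterations instead of creating one per item.
    public static int totalFormattedLength(int n) {
        StringBuilder sb = new StringBuilder(32);
        int total = 0;
        for (int i = 0; i < n; i++) {
            sb.setLength(0); // reset without reallocating the internal buffer
            sb.append("id=").append(i);
            total += sb.length();
        }
        return total;
    }
}
```

Both sum methods return the same value; in an allocation flame graph the boxed variant would typically show Long and Object[] frames that the primitive variant does not.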
Lock profiling samples threads that are blocked waiting to acquire a Java monitor (synchronized block or method) or a
In the lock flame graph, each stack belongs to the thread waiting to acquire the lock, not the thread holding it. The widest frames are the most contended code paths. Common causes:
- Synchronising on a single instance used by all threads (cache, connection pool, singleton service)
- Database connection pool exhausted: threads wait for a connection to become available
- ConcurrentHashMap.computeIfAbsent() under heavy write contention (bin-level locking)
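The first cause can be reproduced with a small sketch. The class and method names are hypothetical: one counter serialises every caller on a single monitor, while the alternative uses `java.util.concurrent.atomic.LongAdder`, which spreads updates across internal cells. `LongAdder` is one possible remedy for counters specifically, not a general fix for contended locks.

```java
import java.util.concurrent.atomic.LongAdder;

public class ContentionSketch {

    // Every caller serialises on one monitor; waiting threads would appear in a lock profile.
    static final class SynchronizedCounter {
        private long value;
        synchronized void increment() { value++; }
        synchronized long get() { return value; }
    }

    public static long runSynchronized(int threads, int perThread) {
        SynchronizedCounter counter = new SynchronizedCounter();
        runAll(threads, () -> {
            for (int i = 0; i < perThread; i++) counter.increment();
        });
        return counter.get();
    }

    public static long runLongAdder(int threads, int perThread) {
        LongAdder adder = new LongAdder();
        runAll(threads, () -> {
            for (int i = 0; i < perThread; i++) adder.increment();
        });
        return adder.sum();
    }

    // Starts the given number of threads on the same task and waits for all of them.
    private static void runAll(int threads, Runnable work) {
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(work);
            workers[i].start();
        }
        try {
            for (Thread t : workers) t.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for workers", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(runSynchronized(8, 1_000_000));
        System.out.println(runLongAdder(8, 1_000_000));
    }
}
```

Both variants produce the same total; the difference should show up in the lock profile, where the synchronized variant is expected to have `increment` as a wide waiting frame.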
Wall-clock mode samples all threads regardless of their state — running, sleeping, blocked on I/O, or waiting on a lock. This is the right choice when you want to understand end-to-end request latency including all blocking time, not just on-CPU time.
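The gap between on-CPU time and wall-clock time can be seen with the standard `ThreadMXBean` API. In this minimal sketch (the class and method names are hypothetical), a sleeping thread accumulates wall time but almost no CPU time, which is why a CPU profile would not show it.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

public class WallVsCpu {

    // Returns {wallNanos, cpuNanos} for the current thread while it sleeps.
    public static long[] measureSleep(long sleepMillis) {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        long wallStart = System.nanoTime();
        // Returns -1 if CPU time measurement is disabled on this JVM.
        long cpuStart = threads.getCurrentThreadCpuTime();
        try {
            Thread.sleep(sleepMillis); // blocked: off-CPU, but the clock keeps running
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        long cpu = threads.getCurrentThreadCpuTime() - cpuStart;
        long wall = System.nanoTime() - wallStart;
        return new long[] { wall, cpu };
    }

    public static void main(String[] args) {
        long[] result = measureSleep(200);
        System.out.printf("wall=%d ms, cpu=%d ms%n",
            result[0] / 1_000_000, result[1] / 1_000_000);
    }
}
```

The same reasoning applies to threads blocked on sockets or locks: their time is visible only in wall-clock samples.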
A flame graph is a visualisation of aggregated stack traces. Each row is a stack frame; width represents the proportion of samples where that frame appeared. Understanding how to read one is essential — the layout can be counterintuitive at first.
- X axis: not time. Stack traces are merged and sorted alphabetically; width = proportion of samples.
- Y axis: call depth. Bottom frames are closer to main/thread start; top frames are the innermost executing code.
- Widest frames at the top: the methods where the most time is actually being spent (the hot spots). This is where to focus optimisation.
- Tall narrow spikes: deep call chains that rarely appear in samples. Often infrequent code paths.
- Click a frame: zooms into that subtree. The header shows the % of total samples that subtree represents.
- Frame colours (async-profiler's default palette): green for Java methods, yellow for C++ (JVM internals), red for native user-level code, orange for kernel functions (system calls).
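The difference between a frame that is merely wide and a frame that is wide at the top can be made concrete with a small sketch. It assumes stacks in the semicolon-separated collapsed format ("root;...;leaf" mapped to a sample count); the class name and sample data are hypothetical.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

public class FrameWidths {

    // Total = samples in which the frame appears anywhere in the stack (its width).
    public static Map<String, Integer> totalSamples(Map<String, Integer> folded) {
        Map<String, Integer> total = new LinkedHashMap<>();
        folded.forEach((stack, n) ->
            // distinct() so a recursive frame is counted once per sample
            Arrays.stream(stack.split(";")).distinct()
                .forEach(frame -> total.merge(frame, n, Integer::sum)));
        return total;
    }

    // Self = samples in which the frame is the innermost (top) frame.
    public static Map<String, Integer> selfSamples(Map<String, Integer> folded) {
        Map<String, Integer> self = new LinkedHashMap<>();
        folded.forEach((stack, n) -> {
            String leaf = stack.substring(stack.lastIndexOf(';') + 1);
            self.merge(leaf, n, Integer::sum);
        });
        return self;
    }

    public static void main(String[] args) {
        // Hypothetical profile of 100 samples.
        Map<String, Integer> folded = Map.of(
            "main;handle;parse", 70,
            "main;handle;query", 20,
            "main;handle", 10);
        System.out.println("total: " + totalSamples(folded));
        System.out.println("self:  " + selfSamples(folded));
    }
}
```

In this data `handle` spans the full width of the graph but is the top frame in only 10 of 100 samples, while `parse` is the top frame in 70, so `parse` is the hot spot.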
IntelliJ IDEA Ultimate (2023.1+) bundles async-profiler and can profile any Run/Debug configuration directly from the IDE. The results appear as an annotated flame graph inside the IDE with source navigation.
- Open Run → Edit Configurations → select the configuration → open the Profiler tab
- Choose async-profiler as the profiler, select CPU or allocation mode
- Click Run with Profiler — IDEA starts the app with the agent attached
- Stop the profiler via the toolbar — the flame graph opens in the Profiler tool window
- Click any frame in the flame graph to jump directly to the source code
The right profiling approach depends on the symptom. Use this decision guide to pick the correct sequence of tools.
| Symptom | First check | Profiling mode | What to look for |
|---|---|---|---|
| High CPU / slow responses | CPU usage > 80% on | CPU | Widest frames at top of flame graph |
| Frequent Young GC / GC pauses | | Allocation | Unexpected byte[] / boxed types in hot paths |
| High latency, low CPU | Thread dump: many BLOCKED/WAITING | Lock | Widest blocked stacks; lock owner |
| Slow request end-to-end | Distributed trace shows server-side gap | Wall-clock | I/O waits, sleep, connection pool wait in request threads |
| Memory growing over time | | Heap dump + MAT | Dominator tree: largest retained objects |