
Before attaching a profiler, understand what each tool is good at — they measure different things and have different overhead characteristics.

| Tool | Mechanism | Overhead | Best for |
|---|---|---|---|
| async-profiler | AsyncGetCallTrace + OS perf events; no safepoint bias | ~1–3% | CPU hot paths, allocation pressure, lock contention |
| Java Flight Recorder | JVM-internal ring buffer; safepoint-biased sampling | <1% | Long-running continuous recording, GC + JIT + I/O events |
| VisualVM | JVMTI sampling or instrumentation | 2–20% | Developer workstations; GUI-driven investigation |
| YourKit / JProfiler | JVMTI + native agent | 5–50% (instrumentation) | Deep object allocation tracking, enterprise support |
Safepoint bias: many JVM profilers can only capture stack traces at safepoints — JVM pause points for GC and deoptimisation. This under-reports code that never reaches a safepoint quickly (e.g., tight loops). Async-profiler avoids this by using UNIX signals and AsyncGetCallTrace to interrupt threads at arbitrary points.
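The bias is easiest to see with a counted loop. Below is a minimal, hypothetical demo (class and method names are invented): HotSpot typically elides safepoint polls inside an int-counted loop, so a safepoint-biased profiler attributes the loop's time to the next safepoint or to the caller, while async-profiler samples inside the loop itself.

```java
public class SafepointBiasDemo {
    // Counted loop over an int index: the JIT usually emits no safepoint
    // poll in the loop body, so safepoint-biased profilers cannot sample it.
    static long hotLoop(int n) {
        long sum = 0;
        for (int i = 0; i < n; i++) {
            sum += i * 31L;   // pure computation: no allocation, no calls
        }
        return sum;
    }

    public static void main(String[] args) {
        long result = 0;
        for (int iter = 0; iter < 10; iter++) {
            result += hotLoop(10_000_000);
        }
        System.out.println(result);   // keep the result live so the loop isn't eliminated
    }
}
```

Profiling this under a safepoint-biased tool typically shows the time charged to `main` rather than `hotLoop`; async-profiler shows `hotLoop` directly.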

Async-profiler is a native agent (.so on Linux, .dylib on macOS). Download the prebuilt tarball for your platform from the GitHub releases page — no compilation needed.

# Linux (x64) — download and extract
curl -L -o async-profiler.tar.gz \
  https://github.com/async-profiler/async-profiler/releases/download/v3.0/async-profiler-3.0-linux-x64.tar.gz
tar xzf async-profiler.tar.gz
cd async-profiler-3.0-linux-x64

# macOS
curl -L -o async-profiler.tar.gz \
  https://github.com/async-profiler/async-profiler/releases/download/v3.0/async-profiler-3.0-macos.tar.gz
tar xzf async-profiler.tar.gz

# Verify install — list available events
./asprof list <pid>

# Linux kernel tuning required for perf_events (run once as root)
echo 1 | sudo tee /proc/sys/kernel/perf_event_paranoid
echo 0 | sudo tee /proc/sys/kernel/kptr_restrict

# Optional — make persistent across reboots (/etc/sysctl.d/99-profiling.conf)
kernel.perf_event_paranoid = 1
kernel.kptr_restrict = 0

CPU profiling samples the call stack of every running thread at a fixed frequency (default 100 Hz). After collection, stacks are aggregated into a flame graph showing where CPU time is spent. This is the right mode when the application is slow but not obviously blocked on I/O or locks.

# Profile a running JVM for 30 seconds, output interactive flame graph
./asprof -d 30 -f /tmp/cpu-flame.html <pid>

# Higher frequency (200 Hz) for more precision — increases overhead slightly
./asprof -d 30 -e cpu -i 5ms -f /tmp/cpu-flame.html <pid>

# Start / stop manually
./asprof start <pid>
# ... let traffic run ...
./asprof stop -f /tmp/cpu-flame.html <pid>

Add these JVM flags to your application so the JIT retains debug information for code between safepoints, giving more accurate stack resolution:

java -XX:+UnlockDiagnosticVMOptions \
     -XX:+DebugNonSafepoints \
     -jar app.jar

-XX:+DebugNonSafepoints makes the JIT keep frame information for compiled code outside safepoints, which async-profiler relies on. Without it, inlined methods may not appear in the flame graph. Add this flag to all environments where you plan to profile.

Allocation profiling intercepts object allocations using TLAB (Thread-Local Allocation Buffer) events — specifically when a thread's TLAB fills and a new one is needed. This captures allocations proportional to their byte size, making it effective at finding classes that allocate enormous amounts of short-lived garbage (triggering frequent Young GC).

# Allocation profiling — 60 second run
./asprof -d 60 -e alloc -f /tmp/alloc-flame.html <pid>

# Coarser sampling interval — roughly one sample per 512 KB allocated (lower overhead)
./asprof -d 60 -e alloc --alloc 512k -f /tmp/alloc-flame.html <pid>

The resulting flame graph shows call stacks weighted by bytes allocated, not by time. Frames at the top are the allocation sites; wider frames allocate more. Look for:

- Unexpected byte[], char[], or boxed-type allocations in hot request paths
- String concatenation or formatting inside loops
- Short-lived collections and iterators created on every call
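As a hypothetical illustration of the pattern this mode surfaces (all names invented): concatenating strings in a loop creates a fresh String and backing array on every iteration, so a method like `buildReport` below would dominate an allocation flame graph, while the StringBuilder variant barely registers.

```java
import java.util.List;

public class AllocPressureDemo {
    // Allocation-heavy: O(n^2) bytes of short-lived garbage —
    // each += copies the whole report so far into a new String.
    static String buildReport(List<String> lines) {
        String report = "";
        for (String line : lines) {
            report += line + "\n";
        }
        return report;
    }

    // Low-allocation alternative: one growable buffer, one final String.
    static String buildReportBuffered(List<String> lines) {
        StringBuilder sb = new StringBuilder();
        for (String line : lines) {
            sb.append(line).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<String> lines = List.of("alpha", "beta", "gamma");
        System.out.println(buildReport(lines).equals(buildReportBuffered(lines)));
    }
}
```

Both methods produce the same output; only their allocation profiles differ, which is exactly the distinction the bytes-weighted flame graph makes visible.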

Lock profiling samples threads that are blocked waiting to acquire a Java monitor (synchronized block or method) or a java.util.concurrent.Lock. Use this mode when thread dumps show many BLOCKED or WAITING threads, or when latency is high but CPU utilisation is low.

# Lock contention profiling — 30 seconds
./asprof -d 30 -e lock -f /tmp/lock-flame.html <pid>

# Thread dump for a quick snapshot (not async-profiler — native JVM)
jcmd <pid> Thread.print > /tmp/threads.txt

In the lock flame graph, each frame represents the call stack of the thread waiting to acquire the lock, not the thread holding it. The widest frames are the most contended code paths. Common causes:

- Coarse-grained synchronized blocks guarding more work than necessary
- A single shared lock on a hot resource (cache, connection pool, logger)
- Synchronized collections where java.util.concurrent alternatives would serve

Wall-clock mode samples all threads regardless of their state — running, sleeping, blocked on I/O, or waiting on a lock. This is the right choice when you want to understand end-to-end request latency including all blocking time, not just on-CPU time.

# Wall-clock profiling — capture everything, including blocked threads
./asprof -d 30 -e wall -t -f /tmp/wall-flame.html <pid>
# -t = include thread names in output (helps distinguish request threads from GC threads)

# Limit to specific threads by name pattern
./asprof -d 30 -e wall -t --filter "http-nio*" -f /tmp/wall-flame.html <pid>

Use CPU mode to find hot computation. Use wall-clock mode to find latency problems (I/O waits, sleeps, lock waits). Use allocation mode to find GC pressure. Using the wrong mode for a problem gives misleading results.
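To make the distinction concrete, here is a hypothetical handler (names invented) whose latency is pure blocking; the sleep stands in for a database or network call. It consumes almost no CPU, so it is nearly invisible under `-e cpu`, yet appears at full width under `-e wall`.

```java
public class WallVsCpuDemo {
    // Spends ~200 ms blocked: wall-clock time, essentially zero CPU time.
    static String handle() throws InterruptedException {
        Thread.sleep(200);   // stand-in for a blocking I/O call
        return "ok";
    }

    public static void main(String[] args) throws InterruptedException {
        long start = System.nanoTime();
        String result = handle();
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println(result + " in ~" + elapsedMs + " ms");
    }
}
```

A CPU profile of this code charges almost all samples elsewhere; a wall-clock profile shows `handle` (inside `Thread.sleep`) as the dominant frame, which is the correct answer to "where does request latency go?".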

A flame graph is a visualisation of aggregated stack traces. Each rectangle is a stack frame; its width represents the proportion of samples in which that frame appeared, and callees are stacked above their callers. The horizontal axis is sorted alphabetically, not by time. Understanding how to read one is essential — the layout can be counterintuitive at first.

# Record in JFR format — integrates with IntelliJ and other tools
./asprof -d 30 -e cpu -o jfr -f /tmp/profile.jfr <pid>

# Collapsed-stacks text format for scripting
./asprof -d 30 -e cpu -o collapsed -f /tmp/collapsed.txt <pid>
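The collapsed format is one line per unique stack — semicolon-separated frames followed by a sample count — which makes it easy to script against. A small, hypothetical post-processing sketch (the file path and class name are assumptions) that prints the hottest stacks by sample count:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;

public class TopStacks {
    // Each line looks like "java/lang/Thread.run;MyService.handle;... 42".
    // Sort descending by the trailing sample count and keep the top n.
    static List<String> topStacks(List<String> lines, int n) {
        return lines.stream()
                .sorted(Comparator.comparingLong(
                        (String l) -> Long.parseLong(l.substring(l.lastIndexOf(' ') + 1)))
                        .reversed())
                .limit(n)
                .toList();
    }

    public static void main(String[] args) throws IOException {
        List<String> lines = Files.readAllLines(Path.of("/tmp/collapsed.txt"));
        topStacks(lines, 10).forEach(System.out::println);
    }
}
```

The same parsing rule (frames, space, count) is what flame-graph generators consume, so any aggregation you script this way stays consistent with the HTML output.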

IntelliJ IDEA Ultimate (2023.1+) bundles async-profiler and can profile any Run/Debug configuration directly from the IDE. The results appear as an annotated flame graph inside the IDE with source navigation.

IntelliJ ships with async-profiler binaries for Linux and macOS. You do not need to download async-profiler separately when using the IntelliJ integration.

The right profiling approach depends on the symptom. Use this decision guide to pick the correct sequence of tools.

| Symptom | First check | Profiling mode | What to look for |
|---|---|---|---|
| High CPU / slow responses | CPU usage > 80% on top | -e cpu | Widest frames at top of flame graph |
| Frequent Young GC / GC pauses | jstat -gcutil — high YGC rate | -e alloc | Unexpected byte[] / boxed types in hot paths |
| High latency, low CPU | Thread dump — many BLOCKED/WAITING | -e lock or -e wall | Widest blocked stacks; lock owner |
| Slow request end-to-end | Distributed trace shows server-side gap | -e wall -t | I/O waits, sleep, connection pool wait in request threads |
| Memory growing over time | jstat -gcutil — rising O column | Heap dump + MAT | Dominator tree — largest retained objects |