
Before attaching a profiler, understand what each tool is good at — they measure different things and have different overhead characteristics.

| Tool | Mechanism | Overhead | Best for |
|---|---|---|---|
| async-profiler | AsyncGetCallTrace + OS perf events; no safepoint bias | ~1–3% | CPU hot paths, allocation pressure, lock contention |
| Java Flight Recorder | JVM-internal ring buffer; safepoint-biased sampling | <1% | Long-running continuous recording, GC + JIT + I/O events |
| VisualVM | JVMTI sampling or instrumentation | 2–20% | Developer workstations; GUI-driven investigation |
| YourKit / JProfiler | JVMTI + native agent | 5–50% (instrumentation) | Deep object allocation tracking, enterprise support |
Safepoint bias: many JVM profilers can only capture stack traces at safepoints — JVM pause points for GC and deoptimisation. This under-reports code that never reaches a safepoint quickly (e.g., tight loops). Async-profiler avoids this by using UNIX signals and AsyncGetCallTrace to interrupt threads at arbitrary points.
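The bias is easiest to see with a counted loop. Below is a minimal, hypothetical demo (class and method names are invented): HotSpot typically elides safepoint polls inside an int-counted loop, so a safepoint-biased profiler attributes the loop's time to the next safepoint or to the caller, while async-profiler samples inside the loop itself.

```java
public class SafepointBiasDemo {
    // Counted loop over an int index: the JIT usually emits no safepoint
    // poll in the loop body, so safepoint-biased profilers cannot sample it.
    static long hotLoop(int n) {
        long sum = 0;
        for (int i = 0; i < n; i++) {
            sum += i * 31L;   // pure computation: no allocation, no calls
        }
        return sum;
    }

    public static void main(String[] args) {
        long result = 0;
        for (int iter = 0; iter < 10; iter++) {
            result += hotLoop(10_000_000);
        }
        System.out.println(result);   // keep the result live so the loop isn't eliminated
    }
}
```

Profiling this under a safepoint-biased tool typically shows the time charged to `main` rather than `hotLoop`; async-profiler shows `hotLoop` directly.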

Async-profiler is a native agent (.so on Linux, .dylib on macOS). Download the prebuilt tarball for your platform from the GitHub releases page — no compilation needed.

# Linux (x64) — download and extract
curl -L -o async-profiler.tar.gz \
  https://github.com/async-profiler/async-profiler/releases/download/v3.0/async-profiler-3.0-linux-x64.tar.gz
tar xzf async-profiler.tar.gz
cd async-profiler-3.0-linux-x64

# macOS
curl -L -o async-profiler.tar.gz \
  https://github.com/async-profiler/async-profiler/releases/download/v3.0/async-profiler-3.0-macos.tar.gz
tar xzf async-profiler.tar.gz

# Verify install — list available events
./asprof list <pid>

# Linux kernel tuning required for perf_events (run once as root)
echo 1 | sudo tee /proc/sys/kernel/perf_event_paranoid
echo 0 | sudo tee /proc/sys/kernel/kptr_restrict

# Optional — make persistent across reboots (/etc/sysctl.d/99-profiling.conf)
kernel.perf_event_paranoid = 1
kernel.kptr_restrict = 0

CPU profiling samples the call stack of every running thread at a fixed frequency (default 100 Hz). After collection, stacks are aggregated into a flame graph showing where CPU time is spent. This is the right mode when the application is slow but not obviously blocked on I/O or locks.

# Profile a running JVM for 30 seconds, output interactive flame graph
./asprof -d 30 -f /tmp/cpu-flame.html <pid>

# Higher frequency (200 Hz) for more precision — increases overhead slightly
./asprof -d 30 -e cpu -i 5ms -f /tmp/cpu-flame.html <pid>

# Start / stop manually
./asprof start <pid>
# ... let traffic run ...
./asprof stop -f /tmp/cpu-flame.html <pid>

Add these JVM flags to your application so the JIT retains debug information for code between safepoints, giving more accurate stack resolution:

java -XX:+UnlockDiagnosticVMOptions \
     -XX:+DebugNonSafepoints \
     -jar app.jar

-XX:+DebugNonSafepoints makes the JIT keep frame information for compiled code outside safepoints, which async-profiler relies on. Without it, inlined methods may not appear in the flame graph. Add this flag to all environments where you plan to profile.

Allocation profiling intercepts object allocations using TLAB (Thread-Local Allocation Buffer) events — specifically when a thread's TLAB fills and a new one is needed. This captures allocations proportional to their byte size, making it effective at finding classes that allocate enormous amounts of short-lived garbage (triggering frequent Young GC).

# Allocation profiling — 60 second run
./asprof -d 60 -e alloc -f /tmp/alloc-flame.html <pid>

# Coarser sampling interval — roughly one sample per 512 KB allocated (lower overhead)
./asprof -d 60 -e alloc --alloc 512k -f /tmp/alloc-flame.html <pid>

The resulting flame graph shows call stacks weighted by bytes allocated, not by time. Frames at the top are the allocation sites; wider frames allocate more. Look for:

- Unexpected byte[], char[], or boxed-type allocations in hot request paths
- String concatenation or formatting inside loops
- Short-lived collections and iterators created on every call
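As a hypothetical illustration of the pattern this mode surfaces (all names invented): concatenating strings in a loop creates a fresh String and backing array on every iteration, so a method like `buildReport` below would dominate an allocation flame graph, while the StringBuilder variant barely registers.

```java
import java.util.List;

public class AllocPressureDemo {
    // Allocation-heavy: O(n^2) bytes of short-lived garbage —
    // each += copies the whole report so far into a new String.
    static String buildReport(List<String> lines) {
        String report = "";
        for (String line : lines) {
            report += line + "\n";
        }
        return report;
    }

    // Low-allocation alternative: one growable buffer, one final String.
    static String buildReportBuffered(List<String> lines) {
        StringBuilder sb = new StringBuilder();
        for (String line : lines) {
            sb.append(line).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<String> lines = List.of("alpha", "beta", "gamma");
        System.out.println(buildReport(lines).equals(buildReportBuffered(lines)));
    }
}
```

Both methods produce the same output; only their allocation profiles differ, which is exactly the distinction the bytes-weighted flame graph makes visible.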

Lock profiling samples threads that are blocked waiting to acquire a Java monitor (synchronized block or method) or a java.util.concurrent.Lock. Use this mode when thread dumps show many BLOCKED or WAITING threads, or when latency is high but CPU utilisation is low.

# Lock contention profiling — 30 seconds
./asprof -d 30 -e lock -f /tmp/lock-flame.html <pid>

# Thread dump for a quick snapshot (not async-profiler — native JVM)
jcmd <pid> Thread.print > /tmp/threads.txt

In the lock flame graph, each frame represents the call stack of the thread waiting to acquire the lock, not the thread holding it. The widest frames are the most contended code paths. Common causes:

- Coarse-grained synchronized blocks guarding more work than necessary
- A single shared lock on a hot resource (cache, connection pool, logger)
- Synchronized collections where java.util.concurrent alternatives would serve

Wall-clock mode samples all threads regardless of their state — running, sleeping, blocked on I/O, or waiting on a lock. This is the right choice when you want to understand end-to-end request latency including all blocking time, not just on-CPU time.

# Wall-clock profiling — capture everything, including blocked threads
./asprof -d 30 -e wall -t -f /tmp/wall-flame.html <pid>
# -t = include thread names in output (helps distinguish request threads from GC threads)

# Limit to specific threads by name pattern
./asprof -d 30 -e wall -t --filter "http-nio*" -f /tmp/wall-flame.html <pid>

Use CPU mode to find hot computation. Use wall-clock mode to find latency problems (I/O waits, sleeps, lock waits). Use allocation mode to find GC pressure. Using the wrong mode for a problem gives misleading results.
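To make the distinction concrete, here is a hypothetical handler (names invented) whose latency is pure blocking; the sleep stands in for a database or network call. It consumes almost no CPU, so it is nearly invisible under `-e cpu`, yet appears at full width under `-e wall`.

```java
public class WallVsCpuDemo {
    // Spends ~200 ms blocked: wall-clock time, essentially zero CPU time.
    static String handle() throws InterruptedException {
        Thread.sleep(200);   // stand-in for a blocking I/O call
        return "ok";
    }

    public static void main(String[] args) throws InterruptedException {
        long start = System.nanoTime();
        String result = handle();
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println(result + " in ~" + elapsedMs + " ms");
    }
}
```

A CPU profile of this code charges almost all samples elsewhere; a wall-clock profile shows `handle` (inside `Thread.sleep`) as the dominant frame, which is the correct answer to "where does request latency go?".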

A flame graph is a visualisation of aggregated stack traces. Each rectangle is a stack frame; its width represents the proportion of samples in which that frame appeared, and callees are stacked above their callers. The horizontal axis is sorted alphabetically, not by time. Understanding how to read one is essential — the layout can be counterintuitive at first.

# Record in JFR format — integrates with IntelliJ and other tools
./asprof -d 30 -e cpu -o jfr -f /tmp/profile.jfr <pid>

# Collapsed-stacks text format for scripting
./asprof -d 30 -e cpu -o collapsed -f /tmp/collapsed.txt <pid>
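The collapsed format is one line per unique stack — semicolon-separated frames followed by a sample count — which makes it easy to script against. A small, hypothetical post-processing sketch (the file path and class name are assumptions) that prints the hottest stacks by sample count:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;

public class TopStacks {
    // Each line looks like "java/lang/Thread.run;MyService.handle;... 42".
    // Sort descending by the trailing sample count and keep the top n.
    static List<String> topStacks(List<String> lines, int n) {
        return lines.stream()
                .sorted(Comparator.comparingLong(
                        (String l) -> Long.parseLong(l.substring(l.lastIndexOf(' ') + 1)))
                        .reversed())
                .limit(n)
                .toList();
    }

    public static void main(String[] args) throws IOException {
        List<String> lines = Files.readAllLines(Path.of("/tmp/collapsed.txt"));
        topStacks(lines, 10).forEach(System.out::println);
    }
}
```

The same parsing rule (frames, space, count) is what flame-graph generators consume, so any aggregation you script this way stays consistent with the HTML output.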

IntelliJ IDEA Ultimate (2023.1+) bundles async-profiler and can profile any Run/Debug configuration directly from the IDE. The results appear as an annotated flame graph inside the IDE with source navigation.

IntelliJ ships with async-profiler binaries for Linux and macOS. You do not need to download async-profiler separately when using the IntelliJ integration.

The right profiling approach depends on the symptom. Use this decision guide to pick the correct sequence of tools.

| Symptom | First check | Profiling mode | What to look for |
|---|---|---|---|
| High CPU / slow responses | CPU usage > 80% on top | -e cpu | Widest frames at top of flame graph |
| Frequent Young GC / GC pauses | jstat -gcutil — high YGC rate | -e alloc | Unexpected byte[] / boxed types in hot paths |
| High latency, low CPU | Thread dump — many BLOCKED/WAITING | -e lock or -e wall | Widest blocked stacks; lock owner |
| Slow request end-to-end | Distributed trace shows server-side gap | -e wall -t | I/O waits, sleep, connection pool wait in request threads |
| Memory growing over time | jstat -gcutil — rising O column | Heap dump + MAT | Dominator tree — largest retained objects |