Launching a 120B-parameter model on a single GPU is already impressive. But the gap between a working deployment and a fast one on AMD hardware comes down to a handful of environment variables and server flags that most tutorials gloss over. This post dissects every one of them: what the hardware actually does differently when you set each flag, and what you're leaving on the table if you don't.
Why AMD MI300X Needs Its Own Tuning
The MI300X is not simply AMD's equivalent of an H100. Its architecture (CDNA3 compute units, HBM3, 192 GB of on-package memory) differs from NVIDIA's at the silicon level. Stock PyTorch ops and standard vLLM defaults were optimised for CUDA and NVIDIA's memory hierarchy; running them unchanged on ROCm leaves most of the MI300X's bandwidth and compute on the table.
AMD's answer is AITER, the AI Tensor Engine for ROCm. It ships a library of custom C++ and Triton kernels hand-tuned for CDNA3: tiled matrix-multiplication layouts matched to the MI300X's CU geometry, attention kernels that exploit HBM3's 5.3 TB/s peak bandwidth, and quantisation routines written with the hardware's low-precision datapaths in mind. The environment variables below are how you tell vLLM to use all of it.
The Full Launch Command
With the ROCm container ready, the vLLM server for GPT-OSS 120B is brought up in a single
terminal session. The environment variables must be set before the
vllm serve call so the ROCm runtime picks them up at process start.
```bash
# ── ROCm / AITER kernel switches ──────────────────────────
export VLLM_ROCM_USE_AITER=1
export VLLM_ROCM_QUICK_REDUCE_QUANTIZATION=INT4
export VLLM_ROCM_USE_AITER_MHA=0
export HIP_FORCE_DEV_KERNARG=1

# ── Launch the inference server ───────────────────────────
vllm serve openai/gpt-oss-120b \
  --tensor-parallel-size 1 \
  --max-model-len 8192 \
  --trust-remote-code \
  --dtype bfloat16 \
  --kv-cache-dtype fp8 \
  --gpu-memory-utilization 0.92 \
  --no-enable-chunked-prefill \
  --max-num-seqs 512 \
  --port 8080
```
Once the server logs "INFO: Application startup complete.", you have a private, OpenAI-compatible inference endpoint: essentially a self-hosted ChatGPT-style backend that runs entirely on your own hardware with no cloud dependency.
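A quick way to confirm the endpoint is live is a plain curl call against the chat completions route. The sketch below assumes the launch command above, so the model is registered under its Hugging Face name; adjust the name and port if you changed them.

```bash
# Smoke test: one chat completion against the local server (no API key required by default).
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "openai/gpt-oss-120b",
        "messages": [{"role": "user", "content": "Reply with one short sentence."}],
        "max_tokens": 64
      }'
```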
Part I: ROCm & AITER Environment Variables
These four variables are set before the process starts. They control which kernel implementations the ROCm runtime loads and how the GPU dispatches compute work. Getting them wrong, or omitting them entirely, means the server falls back to generic PyTorch ops that ignore everything special about CDNA3.
VLLM_ROCM_USE_AITER=1
The master switch. When set to 1, vLLM replaces its default, generic attention layers with AITER's pre-compiled kernel library. This covers the full attention pathway (prefill, decode, and the KV-cache interaction layer), all rewritten with CDNA3-specific tile sizes, wavefront occupancy targets, and HBM3 prefetch hints.
Without this flag, vLLM falls back to ROCm's generic attention path, which is functionally correct but ignores the MI300X's memory topology. The AITER kernels structure their memory access patterns around HBM3's bandwidth and access characteristics, squeezing out bandwidth that generic kernels leave unused.
VLLM_ROCM_USE_AITER_MHA=0
Disables AITER's multi-head attention (MHA) kernel specifically. This is deliberately set to 0: an intentional carve-out within the broader AITER suite. The MHA kernel in AITER is still maturing: while it performs well on standard attention shapes, edge cases around grouped-query attention (GQA) configurations like those used by GPT-OSS 120B can produce correctness regressions or stability issues at certain sequence lengths.
With VLLM_ROCM_USE_AITER_MHA=0, the broader AITER suite (VLLM_ROCM_USE_AITER=1) still handles all other operations (GEMM, layer norms, activation functions, the KV-cache read/write path), while MHA falls back to the stable FlashAttention-for-ROCm implementation.
VLLM_ROCM_QUICK_REDUCE_QUANTIZATION=INT4
Configures the all-reduce quantisation scheme used during tensor-parallel communication. Setting this to INT4 means intermediate activation tensors are quantised to 4-bit integers before they are exchanged over the GPU interconnect during all-reduce operations.
With --tensor-parallel-size 1 (a single GPU) there is no inter-GPU all-reduce traffic, so this variable does little in this particular launch; it is set so the same environment carries over unchanged to multi-GPU tensor-parallel deployments. There, INT4 quantisation halves the all-reduce data volume relative to INT8 and quarters it relative to FP16, directly reducing pressure on the interconnect and on HBM3 during these reductions. Memory bandwidth is almost always the bottleneck in autoregressive decode; every byte saved translates to measurable throughput.
HIP_FORCE_DEV_KERNARG=1
Forces kernel arguments to be passed via device memory rather than host memory. When a GPU kernel is dispatched, its arguments (tensor pointers, shape metadata, stride information) need to reach the GPU. The default HIP path reads these from host (CPU) memory. Setting this flag tells the HIP runtime to pre-stage kernel arguments directly in device memory, eliminating a host-memory fetch across PCIe on every kernel launch.
This is a micro-optimisation that looks small on any single kernel invocation — microseconds at most. It compounds, however, during high-concurrency inference where vLLM dispatches hundreds of kernels per decode step across attention, GEMM, and KV-cache operations. On the MI300X's CDNA3 architecture, where the CPU-to-GPU argument path is a well-documented latency source, this flag removes a systemic overhead that otherwise accrues quietly at scale.
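A back-of-envelope sketch of how that compounds, using the rough per-launch figure quoted in the table later in this post; both numbers are illustrative assumptions rather than measurements, and asynchronous dispatch hides part of this cost in practice.

```bash
# Illustrative only: per-launch microseconds multiplied across a full generation.
launches_per_step=400    # "hundreds of kernels per decode step" (attention, GEMM, KV-cache ops)
cost_us_per_launch=5     # assumed host-side kernel-argument fetch cost per launch
steps=1000               # tokens generated

per_step_us=$(( launches_per_step * cost_us_per_launch ))
total_ms=$(( per_step_us * steps / 1000 ))
echo "~${per_step_us} us per decode step, ~${total_ms} ms across ${steps} generated tokens"
```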
Part II: vLLM Serve Flags
The vllm serve flags control how the inference engine allocates memory, batches requests, and manages the KV-cache. Each one shapes the server's performance envelope: throughput ceiling, latency floor, and maximum concurrency.
--tensor-parallel-size 1
Runs the entire model on a single GPU. Tensor parallelism splits weight matrices across multiple GPUs, with each GPU owning a shard and all-reducing activations at every layer boundary. With --tensor-parallel-size 1, no splitting occurs; the full 120B-parameter model runs as a single monolithic graph on one MI300X.
This is only possible because the MI300X has 192 GB of HBM3. A dense 120B-parameter model held entirely in BF16 would need roughly 240 GB and would not fit; GPT-OSS 120B fits on one card because its mixture-of-experts weights ship quantised (MXFP4), putting the checkpoint in the region of 60–65 GB. Combined with --kv-cache-dtype fp8 and --gpu-memory-utilization 0.92, the weight + KV-cache footprint stays within budget. Without tensor parallelism, there is zero all-reduce communication overhead, which is the dominant latency cost in multi-GPU setups.
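A rough sanity check of that budget; the weight figure below is an assumption (in the range of the published MXFP4 GPT-OSS 120B checkpoint), so substitute whatever your model actually loads.

```bash
# Back-of-envelope HBM budget on a 192 GB MI300X at 92% utilisation.
total_gb=192
usable_gb=$(( total_gb * 92 / 100 ))        # HBM vLLM is allowed to claim
weights_gb=65                               # assumption: MXFP4 GPT-OSS 120B checkpoint, ~60-65 GB
kv_budget_gb=$(( usable_gb - weights_gb ))  # left over for PagedAttention KV-cache blocks and activations
echo "usable: ${usable_gb} GB | weights: ~${weights_gb} GB | KV-cache + activations: ~${kv_budget_gb} GB"
```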
--max-model-len 8192
Caps the maximum sequence length the server will accept. This covers the combined length of input prompt tokens plus generated output tokens. A request with a 7000-token prompt asking for 2000 output tokens would exceed this limit and be rejected.
The KV-cache budget is derived directly from this value: vLLM plans its memory profiling and scheduling around sequences of up to max-model-len tokens. A lower limit means a smaller worst-case KV-cache footprint per sequence, which means more sequences can share the same HBM pool simultaneously. Setting this at 8192 rather than the model's native 128K context window is a deliberate trade: you sacrifice very long context in exchange for higher concurrency and a more predictable memory budget.
--trust-remote-code
Allows vLLM to execute custom Python code bundled with the model repository. Some models ship with non-standard architectures, custom attention implementations, or tokeniser plugins that are not part of the stock Hugging Face Transformers library. Depending on your vLLM and Transformers versions, GPT-OSS 120B may require custom modelling code to be imported at load time; the flag costs nothing if the architecture is already supported natively.
Without this flag, vLLM will refuse to load any model that requires importing code not already installed in the Python environment. The flag is an explicit acknowledgement that you have reviewed the model's repository and accept its code as safe to run.
--dtype bfloat16
Sets the compute precision for model weights and activations to BF16. BFloat16 keeps the same 8-bit exponent as FP32 (preserving dynamic range) but truncates the mantissa to 7 bits rather than FP16's 10. For transformer inference, this is almost universally the right choice: it avoids the numeric instability of FP16 at large activation magnitudes while using half the memory of FP32.
The MI300X's CDNA3 architecture has native BF16 matrix cores; the hardware executes BF16 GEMM at full matrix-core rate without software emulation. Running at FP32 would at least halve throughput (twice the data per operation, and CDNA3's matrix cores are far slower at FP32) and double the activation memory footprint. Running at FP16 risks overflow in large models with extreme activation norms, which manifests as NaNs in generated output.
--kv-cache-dtype fp8
Stores the key and value tensors in the KV-cache at FP8 precision instead of BF16. The KV-cache is the memory buffer that holds the attention keys and values for every token in every active sequence. During autoregressive decode, every new token must read the entire KV history, making this buffer the single most bandwidth-intensive structure in LLM inference.
Quantising it to FP8 halves its memory footprint compared to BF16. Across 512 concurrent sequences of up to 8192 tokens each, that halving is frequently the gap between server capacity and an OOM crash. The FP8 read path is also faster: fewer bytes transferred from HBM means lower latency per decode step, which directly improves TPOT (time per output token).
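For a feel of the magnitudes, the sketch below walks the standard per-token arithmetic. The attention geometry is an assumption about GPT-OSS 120B (verify against the model's config.json), and the model's sliding-window layers shrink the real figure further; the point is the 2x ratio, not the absolute numbers.

```bash
# Per-token KV-cache bytes = 2 (K and V) x layers x KV-heads x head-dim x bytes per element.
layers=36; kv_heads=8; head_dim=64; seq_len=8192   # assumed GPT-OSS 120B geometry
bf16_tok=$(( 2 * layers * kv_heads * head_dim * 2 ))
fp8_tok=$((  2 * layers * kv_heads * head_dim * 1 ))

echo "BF16: $(( bf16_tok * seq_len / 1048576 )) MiB per full 8192-token sequence"
echo "FP8:  $(( fp8_tok  * seq_len / 1048576 )) MiB per full 8192-token sequence"
# Worst case if all 512 sequences filled their window (PagedAttention only allocates blocks as tokens arrive):
echo "x512 sequences: $(( bf16_tok * seq_len * 512 / 1073741824 )) GiB (BF16) vs $(( fp8_tok * seq_len * 512 / 1073741824 )) GiB (FP8)"
```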
--gpu-memory-utilization 0.92
Reserves 92% of GPU HBM for vLLM: weights, KV-cache, and activations combined. The remaining 8% (roughly 15 GB on a 192 GB card) is left unallocated to accommodate ROCm runtime overhead, AITER kernel workspace buffers, and activation peaks during large prefills.
vLLM uses this fraction to compute how many PagedAttention KV-cache blocks it can allocate after the model weights are loaded. A higher value gives more room for KV-cache blocks (more concurrent sequences); a lower value is safer on systems where other processes share the GPU. At 0.92, the server is aggressive but stable — 0.95+ on MI300X with AITER active can trigger OOM during concurrent large prefills because AITER's workspace allocations spike temporarily.
--no-enable-chunked-prefill
Disables chunked prefill, forcing each prefill to run as a single, uninterrupted operation. Chunked prefill breaks long prompt processing into smaller chunks so decode steps can be interleaved during prefill, reducing time-to-first-token for concurrent requests.
On NVIDIA hardware with CUDA graph capture, chunked prefill provides a good latency-throughput balance. On the MI300X with AITER kernels, the chunking boundary introduces a kernel launch flush that breaks the AITER pipeline's internal batching optimisation; the net effect is worse throughput than simply running prefill to completion. The AITER prefill kernels are tuned for full-sequence operation and achieve better HBM utilisation when given the entire prompt at once.
--max-num-seqs 512
Sets the maximum number of sequences the engine will process simultaneously in a single forward pass. This is the batch size ceiling for the continuous batching scheduler. When more than 512 requests are queued, additional requests wait in the admission queue until active sequences complete.
Higher values maximise GPU utilisation: more sequences per step means the matrix multiplications in each transformer layer operate on taller activation matrices, achieving better arithmetic intensity (FLOPs per byte of weights read). Lower values reduce per-step latency (each step processes fewer tokens) but leave the GPU underutilised. At 512, steady-state decode on a 120B-class model is bounded by HBM bandwidth rather than compute, which is where you want to be on this hardware.
--port 8080
Binds the OpenAI-compatible HTTP server to port 8080. vLLM implements an OpenAI-compatible REST API (/v1/chat/completions, /v1/completions, /v1/models), so any client that targets the OpenAI API works against this server without modification. Just swap the base URL from api.openai.com to localhost:8080.
Port 8080 is the convention for this stack (the MI300X cloud environments typically expose this port externally). In a Kubernetes deployment behind a LoadBalancer service, this maps to port 80 or 443 at the service level.
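For SDK-based clients the swap usually needs no code changes at all: recent versions of the official OpenAI SDKs read the base URL and key from the environment. Treat the exact variable names as an assumption to check against your SDK version.

```bash
# Redirect an OpenAI SDK client to the local vLLM server.
export OPENAI_BASE_URL="http://localhost:8080/v1"
export OPENAI_API_KEY="not-used"   # vLLM ignores the key unless the server was started with --api-key
```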
How the Flags Interact
None of these flags is independent. The table below shows the key dependency chains; understanding them lets you tune the server for different workloads without accidentally undoing optimisations elsewhere.
| Flag / Variable | Depends On | Unlocks |
|---|---|---|
| VLLM_ROCM_USE_AITER=1 | ROCm container environment | All other AITER optimisations |
| VLLM_ROCM_QUICK_REDUCE_QUANTIZATION=INT4 | Multi-GPU tensor parallelism to have an effect | Lower BW pressure during all-reduce ops |
| VLLM_ROCM_USE_AITER_MHA=0 | AITER=1 (scoped carve-out) | Stability on GQA models like GPT-OSS 120B |
| HIP_FORCE_DEV_KERNARG=1 | HIP runtime (any ROCm version) | Reduced kernel dispatch latency at scale |
| --kv-cache-dtype fp8 | --gpu-memory-utilization budget | 2× more concurrent sequences in same HBM |
| --gpu-memory-utilization 0.92 | --kv-cache-dtype (determines block count) | 512+ concurrent sequences on 192 GB HBM3 |
| --no-enable-chunked-prefill | AITER=1 (AITER prefers full-sequence tiles) | 10–20% prefill throughput improvement |
| --max-num-seqs 512 | HBM budget set by gpu-memory-utilization + fp8 KV | Near-peak GPU arithmetic intensity on 120B |
What Happens Without Each Flag
This is not a list of optional extras. The flags above collectively close the gap between a generic ROCm deployment and one that uses the MI300X's hardware effectively. Omitting any of the critical ones degrades performance non-linearly because they interact with each other.
| Removed | Consequence |
|---|---|
| VLLM_ROCM_USE_AITER=1 | Falls back to generic ROCm attention. 30–50% throughput drop. All downstream AITER flags become inert. |
| VLLM_ROCM_QUICK_REDUCE_QUANTIZATION=INT4 | All-reduce runs at full precision. No impact at tensor-parallel size 1; higher interconnect traffic as soon as you scale to multi-GPU tensor parallelism. |
| VLLM_ROCM_USE_AITER_MHA=0 (i.e., MHA=1) | AITER MHA kernel active. Risk of correctness regressions on certain GQA shapes with GPT-OSS 120B. |
| HIP_FORCE_DEV_KERNARG=1 | Each kernel launch copies args from host. ~5–10 µs per launch × hundreds of kernels per step = latency tail at scale. |
| --kv-cache-dtype fp8 | KV-cache at BF16, 2× larger. Max concurrent sequences halved. OOM risk at max-num-seqs=512 with 8192 context. |
| --no-enable-chunked-prefill | Chunked prefill active. AITER kernel tiling efficiency drops. 10–20% lower prefill throughput vs. full-sequence operation. |
| --gpu-memory-utilization 0.92 | Default is 0.90. Slightly fewer KV-cache blocks allocated. Minor concurrency reduction, safe but suboptimal on 192 GB HBM3. |
Tuning for Different Workloads
The command above is optimised for high-concurrency, interactive inference: chat APIs, streaming completions, multi-user services. Two other profiles are worth knowing:
Low-latency single-user (minimise TTFT)
Drop --max-num-seqs to 8–32. Fewer concurrent sequences means each decode step is lighter, reducing both time-to-first-token and time per output token. You sacrifice aggregate throughput (fewer tokens per second in total) for faster individual response times. Keep all AITER flags and the FP8 KV-cache; they help regardless of concurrency.
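A sketch of that profile, changing only the concurrency ceiling relative to the main command; the environment variables stay exactly as above.

```bash
# Low-latency, single-user profile: same AITER environment, much smaller batch ceiling.
vllm serve openai/gpt-oss-120b \
  --tensor-parallel-size 1 \
  --max-model-len 8192 \
  --trust-remote-code \
  --dtype bfloat16 \
  --kv-cache-dtype fp8 \
  --gpu-memory-utilization 0.92 \
  --no-enable-chunked-prefill \
  --max-num-seqs 16 \
  --port 8080
```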
Batch / offline processing (maximise total tok/s)
Raise --max-num-seqs to 1024 if your FP8 KV-cache budget allows it. If your prompts are very long (>16K tokens, which also means raising --max-model-len), leave chunked prefill enabled: at extreme prompt lengths, chunking actually helps throughput by interleaving prefill compute with decode steps for other sequences. Also consider raising --gpu-memory-utilization to 0.94–0.95 in a dedicated environment with no other processes on the card.
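A sketch of the batch profile under those assumptions; chunked prefill is left at its default simply by dropping the --no-enable-chunked-prefill flag, and --max-model-len should only be raised if your prompts actually need it.

```bash
# Batch / offline profile: maximise aggregate tokens per second on a dedicated card.
vllm serve openai/gpt-oss-120b \
  --tensor-parallel-size 1 \
  --max-model-len 8192 \
  --trust-remote-code \
  --dtype bfloat16 \
  --kv-cache-dtype fp8 \
  --gpu-memory-utilization 0.94 \
  --max-num-seqs 1024 \
  --port 8080
```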
The single most impactful thing you can do beyond these flags is run vllm bench serve against your live server with request shapes that match your actual workload. Synthetic benchmarks with fixed input/output lengths reveal the ceiling; your real traffic distribution reveals where you actually operate. The two numbers are often 40–60% apart.
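As a starting point, something like the following shapes the benchmark towards a chat-style workload. The flag names follow vLLM's benchmarking tool as I understand it at the time of writing; verify them against vllm bench serve --help on your installed version.

```bash
# Benchmark the live server with request shapes resembling interactive chat traffic.
vllm bench serve \
  --model openai/gpt-oss-120b \
  --host localhost \
  --port 8080 \
  --dataset-name random \
  --random-input-len 1024 \
  --random-output-len 256 \
  --num-prompts 500 \
  --request-rate 8
```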
Key Takeaways
The MI300X is genuinely capable hardware. But its advantage over competing accelerators is only realised when you give vLLM the full stack of AMD-specific configuration.
- VLLM_ROCM_USE_AITER=1 is the master enabler: without it, you're running a generic ROCm stack and leaving the majority of CDNA3's kernel-level optimisations unused.
- VLLM_ROCM_USE_AITER_MHA=0 is not a bug; it's a deliberate stability fence for the MHA kernel that keeps the rest of AITER running safely on GQA models today.
- FP8 KV-cache is the biggest single concurrency multiplier: it doubles the sequences you can serve at equal HBM cost, which is what turns a 192 GB card from "it fits the model" into "it serves hundreds of users simultaneously."
- Disabling chunked prefill is AMD-specific: set the flag on MI300X and leave it at its default on NVIDIA. Know your hardware before copy-pasting config.
- The four ROCm/HIP environment variables should be treated as a set. They are designed to be used together; removing any one of them changes the kernel dispatch path in ways that interact with the others.