Launching a 120B-parameter model on a single GPU is already impressive. But the gap between a working deployment and a fast one on AMD hardware comes down to a handful of environment variables and server flags that most tutorials gloss over. This post dissects every one of them: what the hardware actually does differently when you set each flag, and what you're leaving on the table if you don't.
Why AMD MI300X Needs Its Own Tuning
The MI300X is not simply AMD's equivalent of an H100. Its architecture (CDNA3 compute units, HBM3, 192 GB of on-package memory) differs from NVIDIA's at the silicon level. Stock PyTorch ops and standard vLLM defaults were optimised for CUDA and NVIDIA's memory hierarchy; running them unchanged on ROCm leaves most of the MI300X's bandwidth and compute on the table.
AMD's answer is AITER, the AI Tensor Engine for ROCm. It ships a library of custom C++ and Triton kernels hand-tuned for CDNA3: tiled matrix-multiplication layouts matched to the MI300X's CU geometry, attention kernels that exploit HBM3's 5.3 TB/s peak bandwidth, and quantisation routines written with the hardware's low-precision datapaths in mind. The environment variables below are how you tell vLLM to use all of it.
The Full Launch Command
With the ROCm container ready, the vLLM server for GPT-OSS 120B is brought up in a single
terminal session. The environment variables must be set before the
vllm serve call so the ROCm runtime picks them up at process start.
```bash
# ── ROCm / AITER kernel switches ──────────────────────────
export VLLM_ROCM_USE_AITER=1
export VLLM_ROCM_QUICK_REDUCE_QUANTIZATION=INT4
export VLLM_ROCM_USE_AITER_MHA=0
export HIP_FORCE_DEV_KERNARG=1

# ── Launch the inference server ───────────────────────────
vllm serve openai/gpt-oss-120b \
  --tensor-parallel-size 1 \
  --max-model-len 8192 \
  --trust-remote-code \
  --dtype bfloat16 \
  --kv-cache-dtype fp8 \
  --gpu-memory-utilization 0.92 \
  --no-enable-chunked-prefill \
  --max-num-seqs 512 \
  --port 8080
```
Once the server logs "INFO: Application startup complete.", you have a private, OpenAI-compatible inference endpoint: essentially a self-hosted ChatGPT-style backend that runs entirely on your own hardware with no cloud dependency.
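A quick way to confirm the endpoint is live is a plain curl call against the chat completions route. The sketch below assumes the launch command above, so the model is registered under its Hugging Face name; adjust the name and port if you changed them.

```bash
# Smoke test: one chat completion against the local server (no API key required by default).
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "openai/gpt-oss-120b",
        "messages": [{"role": "user", "content": "Reply with one short sentence."}],
        "max_tokens": 64
      }'
```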
Part I: ROCm & AITER Environment Variables
These four variables are set before the process starts. They control which kernel implementations the ROCm runtime loads and how the GPU dispatches compute work. Getting them wrong, or omitting them entirely, means the server falls back to generic PyTorch ops that ignore everything special about CDNA3.
VLLM_ROCM_USE_AITER=1
The master switch. When set to 1, vLLM replaces its default, generic attention layers with AITER's pre-compiled kernel library. This covers the full attention pathway (prefill, decode, and the KV-cache interaction layer), all rewritten with CDNA3-specific tile sizes, wavefront occupancy targets, and HBM3 prefetch hints.
Without this flag, vLLM falls back to ROCm's generic attention path, which is functionally correct but ignores the MI300X's memory topology. The AITER kernels structure their memory access patterns around HBM3's bandwidth and access characteristics, squeezing out bandwidth that generic kernels leave unused.
VLLM_ROCM_USE_AITER_MHA=0
Disables AITER's multi-head attention (MHA) kernel specifically. This is deliberately set to 0: an intentional carve-out within the broader AITER suite. The MHA kernel in AITER is still maturing: while it performs well on standard attention shapes, edge cases around grouped-query attention (GQA) configurations like those used by GPT-OSS 120B can produce correctness regressions or stability issues at certain sequence lengths.
With VLLM_ROCM_USE_AITER_MHA=0, the broader AITER suite (VLLM_ROCM_USE_AITER=1) still handles all other operations (GEMM, layer norms, activation functions, the KV-cache read/write path), while MHA falls back to the stable FlashAttention-for-ROCm implementation.
VLLM_ROCM_QUICK_REDUCE_QUANTIZATION=INT4
Configures the all-reduce quantisation scheme used during tensor-parallel communication. Setting this to INT4 means intermediate activation tensors are quantised to 4-bit integers before they are exchanged over the GPU interconnect during all-reduce operations.
With --tensor-parallel-size 1 (a single GPU) there is no inter-GPU all-reduce traffic, so this variable does little in this particular launch; it is set so the same environment carries over unchanged to multi-GPU tensor-parallel deployments. There, INT4 quantisation halves the all-reduce data volume relative to INT8 and quarters it relative to FP16, directly reducing pressure on the interconnect and on HBM3 during these reductions. Memory bandwidth is almost always the bottleneck in autoregressive decode; every byte saved translates to measurable throughput.
HIP_FORCE_DEV_KERNARG=1
Forces kernel arguments to be passed via device memory rather than host memory. When a GPU kernel is dispatched, its arguments (tensor pointers, shape metadata, stride information) need to reach the GPU. The default HIP path reads these from host (CPU) memory. Setting this flag tells the HIP runtime to pre-stage kernel arguments directly in device memory, eliminating a host-memory fetch across PCIe on every kernel launch.
This is a micro-optimisation that looks small on any single kernel invocation — microseconds at most. It compounds, however, during high-concurrency inference where vLLM dispatches hundreds of kernels per decode step across attention, GEMM, and KV-cache operations. On the MI300X's CDNA3 architecture, where the CPU-to-GPU argument path is a well-documented latency source, this flag removes a systemic overhead that otherwise accrues quietly at scale.
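A back-of-envelope sketch of how that compounds, using the rough per-launch figure quoted in the table later in this post; both numbers are illustrative assumptions rather than measurements, and asynchronous dispatch hides part of this cost in practice.

```bash
# Illustrative only: per-launch microseconds multiplied across a full generation.
launches_per_step=400    # "hundreds of kernels per decode step" (attention, GEMM, KV-cache ops)
cost_us_per_launch=5     # assumed host-side kernel-argument fetch cost per launch
steps=1000               # tokens generated

per_step_us=$(( launches_per_step * cost_us_per_launch ))
total_ms=$(( per_step_us * steps / 1000 ))
echo "~${per_step_us} us per decode step, ~${total_ms} ms across ${steps} generated tokens"
```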
Part II: vLLM Serve Flags
The vllm serve flags control how the inference engine allocates memory, batches requests, and manages the KV-cache. Each one shapes the server's performance envelope: throughput ceiling, latency floor, and maximum concurrency.
--tensor-parallel-size 1
Runs the entire model on a single GPU. Tensor parallelism splits weight matrices across multiple GPUs, with each GPU owning a shard and all-reducing activations at every layer boundary. With --tensor-parallel-size 1, no splitting occurs; the full 120B-parameter model runs as a single monolithic graph on one MI300X.
This is only possible because the MI300X has 192 GB of HBM3. A dense 120B-parameter model held entirely in BF16 would need roughly 240 GB and would not fit; GPT-OSS 120B fits on one card because its mixture-of-experts weights ship quantised (MXFP4), putting the checkpoint in the region of 60–65 GB. Combined with --kv-cache-dtype fp8 and --gpu-memory-utilization 0.92, the weight + KV-cache footprint stays within budget. Without tensor parallelism, there is zero all-reduce communication overhead, which is the dominant latency cost in multi-GPU setups.
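A rough sanity check of that budget; the weight figure below is an assumption (in the range of the published MXFP4 GPT-OSS 120B checkpoint), so substitute whatever your model actually loads.

```bash
# Back-of-envelope HBM budget on a 192 GB MI300X at 92% utilisation.
total_gb=192
usable_gb=$(( total_gb * 92 / 100 ))        # HBM vLLM is allowed to claim
weights_gb=65                               # assumption: MXFP4 GPT-OSS 120B checkpoint, ~60-65 GB
kv_budget_gb=$(( usable_gb - weights_gb ))  # left over for PagedAttention KV-cache blocks and activations
echo "usable: ${usable_gb} GB | weights: ~${weights_gb} GB | KV-cache + activations: ~${kv_budget_gb} GB"
```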
--max-model-len 8192
Caps the maximum sequence length the server will accept. This covers the combined length of input prompt tokens plus generated output tokens. A request with a 7000-token prompt asking for 2000 output tokens would exceed this limit and be rejected.
The KV-cache budget is derived directly from this value: vLLM plans its memory profiling and scheduling around sequences of up to max-model-len tokens. A lower limit means a smaller worst-case KV-cache footprint per sequence, which means more sequences can share the same HBM pool simultaneously. Setting this at 8192 rather than the model's native 128K context window is a deliberate trade: you sacrifice very long context in exchange for higher concurrency and a more predictable memory budget.
--trust-remote-code
Allows vLLM to execute custom Python code bundled with the model repository. Some models ship with non-standard architectures, custom attention implementations, or tokeniser plugins that are not part of the stock Hugging Face Transformers library. Depending on your vLLM and Transformers versions, GPT-OSS 120B may require custom modelling code to be imported at load time; the flag costs nothing if the architecture is already supported natively.
Without this flag, vLLM will refuse to load any model that requires importing code not already installed in the Python environment. The flag is an explicit acknowledgement that you have reviewed the model's repository and accept its code as safe to run.
--dtype bfloat16
Sets the compute precision for model weights and activations to BF16. BFloat16 keeps the same 8-bit exponent as FP32 (preserving dynamic range) but truncates the mantissa to 7 bits rather than FP16's 10. For transformer inference, this is almost universally the right choice: it avoids the numeric instability of FP16 at large activation magnitudes while using half the memory of FP32.
The MI300X's CDNA3 architecture has native BF16 matrix cores; the hardware executes BF16 GEMM at full matrix-core rate without software emulation. Running at FP32 would at least halve throughput (twice the data per operation, and CDNA3's matrix cores are far slower at FP32) and double the activation memory footprint. Running at FP16 risks overflow in large models with extreme activation norms, which manifests as NaNs in generated output.
--kv-cache-dtype fp8
Stores the key and value tensors in the KV-cache at FP8 precision instead of BF16. The KV-cache is the memory buffer that holds the attention keys and values for every token in every active sequence. During autoregressive decode, every new token must read the entire KV history, making this buffer the single most bandwidth-intensive structure in LLM inference.
Quantising it to FP8 halves its memory footprint compared to BF16. Across 512 concurrent sequences of up to 8192 tokens each, that halving is frequently the gap between server capacity and an OOM crash. The FP8 read path is also faster: fewer bytes transferred from HBM means lower latency per decode step, which directly improves TPOT (time per output token).
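For a feel of the magnitudes, the sketch below walks the standard per-token arithmetic. The attention geometry is an assumption about GPT-OSS 120B (verify against the model's config.json), and the model's sliding-window layers shrink the real figure further; the point is the 2x ratio, not the absolute numbers.

```bash
# Per-token KV-cache bytes = 2 (K and V) x layers x KV-heads x head-dim x bytes per element.
layers=36; kv_heads=8; head_dim=64; seq_len=8192   # assumed GPT-OSS 120B geometry
bf16_tok=$(( 2 * layers * kv_heads * head_dim * 2 ))
fp8_tok=$((  2 * layers * kv_heads * head_dim * 1 ))

echo "BF16: $(( bf16_tok * seq_len / 1048576 )) MiB per full 8192-token sequence"
echo "FP8:  $(( fp8_tok  * seq_len / 1048576 )) MiB per full 8192-token sequence"
# Worst case if all 512 sequences filled their window (PagedAttention only allocates blocks as tokens arrive):
echo "x512 sequences: $(( bf16_tok * seq_len * 512 / 1073741824 )) GiB (BF16) vs $(( fp8_tok * seq_len * 512 / 1073741824 )) GiB (FP8)"
```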
--gpu-memory-utilization 0.92
Reserves 92% of GPU HBM for vLLM: weights, KV-cache, and activations combined. The remaining 8% (roughly 15 GB on a 192 GB card) is left unallocated to accommodate ROCm runtime overhead, AITER kernel workspace buffers, and activation peaks during large prefills.
vLLM uses this fraction to compute how many PagedAttention KV-cache blocks it can allocate after the model weights are loaded. A higher value gives more room for KV-cache blocks (more concurrent sequences); a lower value is safer on systems where other processes share the GPU. At 0.92, the server is aggressive but stable — 0.95+ on MI300X with AITER active can trigger OOM during concurrent large prefills because AITER's workspace allocations spike temporarily.
--no-enable-chunked-prefill
Disables chunked prefill, forcing each prefill to run as a single, uninterrupted operation. Chunked prefill breaks long prompt processing into smaller chunks so decode steps can be interleaved during prefill, reducing time-to-first-token for concurrent requests.
On NVIDIA hardware with CUDA graph capture, chunked prefill provides a good latency-throughput balance. On the MI300X with AITER kernels, the chunking boundary introduces a kernel launch flush that breaks the AITER pipeline's internal batching optimisation; the net effect is worse throughput than simply running prefill to completion. The AITER prefill kernels are tuned for full-sequence operation and achieve better HBM utilisation when given the entire prompt at once.
--max-num-seqs 512
Sets the maximum number of sequences the engine will process simultaneously in a single forward pass. This is the batch size ceiling for the continuous batching scheduler. When more than 512 requests are queued, additional requests wait in the admission queue until active sequences complete.
Higher values maximise GPU utilisation: more sequences per step means the matrix multiplications in each transformer layer operate on taller activation matrices, achieving better arithmetic intensity (FLOPs per byte of weights read). Lower values reduce per-step latency (each step processes fewer tokens) but leave the GPU underutilised. At 512, steady-state decode on a 120B-class model is bounded by HBM bandwidth rather than compute, which is where you want to be on this hardware.
--port 8080
Binds the OpenAI-compatible HTTP server to port 8080. vLLM implements an OpenAI-compatible REST API (/v1/chat/completions, /v1/completions, /v1/models), so any client that targets the OpenAI API works against this server without modification. Just swap the base URL from api.openai.com to localhost:8080.
Port 8080 is the convention for this stack (the MI300X cloud environments typically expose this port externally). In a Kubernetes deployment behind a LoadBalancer service, this maps to port 80 or 443 at the service level.
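For SDK-based clients the swap usually needs no code changes at all: recent versions of the official OpenAI SDKs read the base URL and key from the environment. Treat the exact variable names as an assumption to check against your SDK version.

```bash
# Redirect an OpenAI SDK client to the local vLLM server.
export OPENAI_BASE_URL="http://localhost:8080/v1"
export OPENAI_API_KEY="not-used"   # vLLM ignores the key unless the server was started with --api-key
```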
How the Flags Interact
None of these flags is independent. The table below shows the key dependency chains; understanding them lets you tune the server for different workloads without accidentally undoing optimisations elsewhere.
| Flag / Variable | Depends On | Unlocks |
|---|---|---|
| VLLM_ROCM_USE_AITER=1 | ROCm container environment | All other AITER optimisations |
| VLLM_ROCM_QUICK_REDUCE_QUANTIZATION=INT4 | Multi-GPU tensor parallelism to have an effect | Lower BW pressure during all-reduce ops |
| VLLM_ROCM_USE_AITER_MHA=0 | AITER=1 (scoped carve-out) | Stability on GQA models like GPT-OSS 120B |
| HIP_FORCE_DEV_KERNARG=1 | HIP runtime (any ROCm version) | Reduced kernel dispatch latency at scale |
| --kv-cache-dtype fp8 | --gpu-memory-utilization budget | 2× more concurrent sequences in same HBM |
| --gpu-memory-utilization 0.92 | --kv-cache-dtype (determines block count) | 512+ concurrent sequences on 192 GB HBM3 |
| --no-enable-chunked-prefill | AITER=1 (AITER prefers full-sequence tiles) | 10–20% prefill throughput improvement |
| --max-num-seqs 512 | HBM budget set by gpu-memory-utilization + fp8 KV | Near-peak GPU arithmetic intensity on 120B |
What Happens Without Each Flag
This is not a list of optional extras. The flags above collectively close the gap between a generic ROCm deployment and one that uses the MI300X's hardware effectively. Omitting any of the critical ones degrades performance non-linearly because they interact with each other.
| Removed | Consequence |
|---|---|
| VLLM_ROCM_USE_AITER=1 | Falls back to generic ROCm attention. 30–50% throughput drop. All downstream AITER flags become inert. |
| VLLM_ROCM_QUICK_REDUCE_QUANTIZATION=INT4 | All-reduce runs at full precision. No impact at tensor-parallel size 1; higher interconnect traffic as soon as you scale to multi-GPU tensor parallelism. |
| VLLM_ROCM_USE_AITER_MHA=0 (i.e., MHA=1) | AITER MHA kernel active. Risk of correctness regressions on certain GQA shapes with GPT-OSS 120B. |
| HIP_FORCE_DEV_KERNARG=1 | Each kernel launch copies args from host. ~5–10 µs per launch × hundreds of kernels per step = latency tail at scale. |
| --kv-cache-dtype fp8 | KV-cache at BF16, 2× larger. Max concurrent sequences halved. OOM risk at max-num-seqs=512 with 8192 context. |
| --no-enable-chunked-prefill | Chunked prefill active. AITER kernel tiling efficiency drops. 10–20% lower prefill throughput vs. full-sequence operation. |
| --gpu-memory-utilization 0.92 | Default is 0.90. Slightly fewer KV-cache blocks allocated. Minor concurrency reduction, safe but suboptimal on 192 GB HBM3. |
Tuning for Different Workloads
The command above is optimised for high-concurrency, interactive inference: chat APIs, streaming completions, multi-user services. Two other profiles are worth knowing:
Low-latency single-user (minimise TTFT)
Drop --max-num-seqs to 8–32. Fewer concurrent sequences means each decode step is lighter, reducing both time-to-first-token and time per output token. You sacrifice aggregate throughput (fewer tokens per second in total) for faster individual response times. Keep all AITER flags and the FP8 KV-cache; they help regardless of concurrency.
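A sketch of that profile, changing only the concurrency ceiling relative to the main command; the environment variables stay exactly as above.

```bash
# Low-latency, single-user profile: same AITER environment, much smaller batch ceiling.
vllm serve openai/gpt-oss-120b \
  --tensor-parallel-size 1 \
  --max-model-len 8192 \
  --trust-remote-code \
  --dtype bfloat16 \
  --kv-cache-dtype fp8 \
  --gpu-memory-utilization 0.92 \
  --no-enable-chunked-prefill \
  --max-num-seqs 16 \
  --port 8080
```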
Batch / offline processing (maximise total tok/s)
Raise --max-num-seqs to 1024 if your FP8 KV-cache budget allows it. If your prompts are very long (>16K tokens, which also means raising --max-model-len), leave chunked prefill enabled: at extreme prompt lengths, chunking actually helps throughput by interleaving prefill compute with decode steps for other sequences. Also consider raising --gpu-memory-utilization to 0.94–0.95 in a dedicated environment with no other processes on the card.
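A sketch of the batch profile under those assumptions; chunked prefill is left at its default simply by dropping the --no-enable-chunked-prefill flag, and --max-model-len should only be raised if your prompts actually need it.

```bash
# Batch / offline profile: maximise aggregate tokens per second on a dedicated card.
vllm serve openai/gpt-oss-120b \
  --tensor-parallel-size 1 \
  --max-model-len 8192 \
  --trust-remote-code \
  --dtype bfloat16 \
  --kv-cache-dtype fp8 \
  --gpu-memory-utilization 0.94 \
  --max-num-seqs 1024 \
  --port 8080
```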
The single most impactful thing you can do beyond these flags is run vllm bench serve against your live server with request shapes that match your actual workload. Synthetic benchmarks with fixed input/output lengths reveal the ceiling; your real traffic distribution reveals where you actually operate. The two numbers are often 40–60% apart.
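As a starting point, something like the following shapes the benchmark towards a chat-style workload. The flag names follow vLLM's benchmarking tool as I understand it at the time of writing; verify them against vllm bench serve --help on your installed version.

```bash
# Benchmark the live server with request shapes resembling interactive chat traffic.
vllm bench serve \
  --model openai/gpt-oss-120b \
  --host localhost \
  --port 8080 \
  --dataset-name random \
  --random-input-len 1024 \
  --random-output-len 256 \
  --num-prompts 500 \
  --request-rate 8
```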
Key Takeaways
The MI300X is genuinely capable hardware. But its advantage over competing accelerators is only realised when you give vLLM the full stack of AMD-specific configuration.
- VLLM_ROCM_USE_AITER=1 is the master enabler: without it, you're running a generic ROCm stack and leaving the majority of CDNA3's kernel-level optimisations unused.
- VLLM_ROCM_USE_AITER_MHA=0 is not a bug; it's a deliberate stability fence for the MHA kernel that keeps the rest of AITER running safely on GQA models today.
- FP8 KV-cache is the biggest single concurrency multiplier: it doubles the sequences you can serve at equal HBM cost, which is what turns a 192 GB card from "it fits the model" into "it serves hundreds of users simultaneously."
- Disabling chunked prefill is AMD-specific: set the flag on MI300X and leave it at its default on NVIDIA. Know your hardware before copy-pasting config.
- The four ROCm/HIP environment variables should be treated as a set. They are designed to be used together; removing any one of them changes the kernel dispatch path in ways that interact with the others.