Running an LLM in a notebook is one thing. Running it as a reliable, low-latency API server that handles dozens of concurrent users, streams tokens, and survives peak load is something else entirely. This post walks through exactly that: from spinning up a vLLM server on AMD's MI300X, to sending the first request, to measuring real performance with three complementary benchmarking tools.
Why vLLM on AMD MI300X?
vLLM is an open-source, high-throughput inference engine built around PagedAttention, a memory management technique that treats the KV-cache like virtual memory in an OS, eliminating fragmentation and dramatically reducing GPU memory waste per request. More efficient memory use means higher concurrency on the same hardware.
The AMD MI300X pairs well with this design. Its 192 GB of HBM3 memory (2.4 times the 80 GB of an H100 SXM) means a 120B-parameter model that would otherwise require a multi-GPU setup fits on a single card. Combined with AMD's AITER (AI Tensor Engine for ROCm) kernels and unified attention backend, the MI300X can serve openai/gpt-oss-120b from one node with throughput numbers that are genuinely competitive for production workloads.
Step 1: Launch the vLLM Server
With the ROCm container environment ready, the entire server launch is a single terminal command. The flags are what make the difference between a baseline inference process and a production-grade serving stack.
```shell
export TP=1
export MODEL_ID="openai/gpt-oss-120b"
export VLLM_ROCM_USE_AITER=1
export VLLM_USE_AITER_UNIFIED_ATTENTION=1
export VLLM_ROCM_USE_AITER_MHA=0

vllm serve \
    $MODEL_ID \
    --port 8000 \
    --tensor-parallel-size $TP \
    --no-enable-prefix-caching \
    --disable-log-requests \
    --compilation-config '{"full_cuda_graph": true}'
```
Three flags do most of the heavy lifting:
- --tensor-parallel-size 1: the entire 120B model runs on one MI300X. The 192 GB HBM3 pool makes this possible without NVLink or inter-node communication overhead.
- VLLM_ROCM_USE_AITER=1 & VLLM_USE_AITER_UNIFIED_ATTENTION=1: activate AMD's AITER kernels and the unified attention backend, both optimised for the ROCm stack and delivering meaningful throughput gains over stock attention implementations.
- --compilation-config '{"full_cuda_graph": true}': captures the full execution graph at warmup so inference replays a pre-recorded plan, eliminating per-step kernel launch overhead and flattening latency variance.
Once launched, you'll see INFO: Application startup complete. in the logs. At that point you have a self-hosted, OpenAI-compatible inference server: essentially a private ChatGPT backend you control end-to-end, running entirely on your own hardware.
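Before sending a real prompt, a quick smoke test confirms the server is up. vLLM exposes the standard OpenAI model-listing endpoint, so a plain curl against the port chosen above should return a JSON object whose data array contains the model ID:

```shell
# Assumes the server from Step 1 is running locally on port 8000.
curl http://localhost:8000/v1/models
```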
Step 2: Send Your First Request
vLLM implements the OpenAI REST API spec, so any client built against the OpenAI API works here without modification. The simplest test is a plain requests call:
```python
import requests

response = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "model": "openai/gpt-oss-120b",
        "prompt": "The future of AI is",
        "max_tokens": 100,
        "temperature": 0,
    },
)
print(response.json())
```
The response is a standard OpenAI completion object: same schema, same field names. Swap the base URL in your existing app, and this server becomes a drop-in replacement for the cloud API with zero application-layer changes.
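Because the schema matches, response handling code carries over unchanged too. A minimal sketch of pulling the generated text out of a completion object; the sample payload below is hand-written in the shape vLLM returns, not captured from a live server:

```python
def extract_completion_text(response_json: dict) -> str:
    """Pull the generated text out of an OpenAI-style completion object."""
    return response_json["choices"][0]["text"]

# Illustrative payload in the OpenAI completion shape (field values made up).
sample = {
    "id": "cmpl-123",
    "object": "text_completion",
    "model": "openai/gpt-oss-120b",
    "choices": [{"index": 0, "text": " bright.", "finish_reason": "length"}],
    "usage": {"prompt_tokens": 6, "completion_tokens": 2},
}

print(extract_completion_text(sample))  # -> " bright."
```

The official openai Python SDK works the same way: construct the client with base_url="http://localhost:8000/v1" and the rest of your application code stays untouched.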
Step 3: Benchmarking vLLM Performance
A server that returns correct responses is table stakes. Understanding its performance envelope (where it saturates, how it handles concurrency, and what the hardware ceiling looks like) is what separates a toy deployment from one you can actually put in front of users. The vLLM toolchain ships three complementary benchmark modes, each measuring a different dimension.
3a. vllm bench serve: Concurrent Load Testing
This mode simulates real-world serving by firing requests from multiple concurrent clients against the running server using synthetic random data. It's the closest proxy to what your users actually experience. The four metrics it reports map directly to user-facing quality:
- TTFT (Time to First Token): how quickly the first token arrives after request submission. Governs perceived responsiveness in streaming UIs.
- TPOT (Time Per Output Token): average latency between consecutive generated tokens. Controls how smooth the streaming experience feels.
- ITL (Inter-Token Latency): like TPOT, but reported per token gap rather than averaged per request, so it surfaces jitter between tokens.
- E2EL (End-to-End Latency): total wall-clock time from request submission to the final token. Relevant for non-streaming, request-response applications.
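All four metrics derive from the same token-arrival timeline, which is worth internalising before reading the results. A toy illustration (synthetic timestamps, not benchmark code):

```python
def serving_metrics(t_submit: float, token_times: list[float]) -> dict:
    """Derive TTFT, TPOT, ITL, and E2EL from one request's token arrival times (s)."""
    ttft = token_times[0] - t_submit                             # wait for first token
    itl = [b - a for a, b in zip(token_times, token_times[1:])]  # per-gap latencies
    tpot = sum(itl) / len(itl)                                   # mean gap after first token
    e2el = token_times[-1] - t_submit                            # submission to final token
    return {"ttft": ttft, "tpot": tpot, "itl": itl, "e2el": e2el}

# First token 0.8 s after submission, then one token every 16 ms.
times = [0.8 + 0.016 * i for i in range(5)]
m = serving_metrics(0.0, times)
print(m["ttft"], m["tpot"], m["e2el"])
```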
Results from a representative run (80 requests, max concurrency 8, input length 4096 tokens, output length 1024 tokens):
| Metric | Mean | Median | P99 |
|---|---|---|---|
| TTFT (ms) | 864.75 | 833.79 | 2091.68 |
| TPOT (ms) | 15.87 | 15.85 | 16.96 |
| ITL (ms) | 15.87 | 15.22 | 17.26 |
| E2EL (ms) | 17097.11 | 16968.59 | 17943.96 |
```text
Successful requests:              80
Benchmark duration (s):           171.00
Total input tokens:               327463
Total generated tokens:           81920
Request throughput (req/s):       0.47
Output token throughput (tok/s):  479.07
Total token throughput (tok/s):   2394.07
Mean TTFT (ms):    864.75
Median TTFT (ms):  833.79
P99 TTFT (ms):     2091.68
Mean TPOT (ms):    15.87
Median TPOT (ms):  15.85
P99 TPOT (ms):     16.96
Mean ITL (ms):     15.87
Median ITL (ms):   15.22
P99 ITL (ms):      17.26
Mean E2EL (ms):    17097.11
Median E2EL (ms):  16968.59
P99 E2EL (ms):     17943.96
```
TPOT and ITL hovering around 15–16 ms means the model produces roughly
63 tokens per second per stream, well above the threshold for imperceptible streaming.
The P99 TTFT of 2,091 ms reflects queue wait at peak concurrency; tune
--max-concurrency to enforce an SLA ceiling. The recommended practice is
to set --num-prompts to 10× --max-concurrency for statistically
stable results.
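The table is also internally consistent, which is a useful sanity check on any benchmark run: for a 1024-token output, E2EL should be roughly TTFT plus 1023 decode steps at the mean TPOT. Checking that against the means above:

```python
ttft_ms, tpot_ms, output_len = 864.75, 15.87, 1024

# One TTFT wait, then (output_len - 1) decode steps at the mean TPOT.
predicted_e2el = ttft_ms + (output_len - 1) * tpot_ms
print(f"predicted E2EL: {predicted_e2el:.0f} ms")  # measured mean was 17097 ms
```

The prediction lands within a few milliseconds of the measured mean, confirming that first-token wait plus decode accounts for essentially all of the end-to-end time.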
3b. vllm bench latency: Single-Request Baseline
Where bench serve tests a crowd, bench latency tests a single user.
It measures prompt processing and token generation latency with no batching or concurrency
effects whatsoever, giving you the hardware's raw latency floor: the
theoretical best-case response time with zero queue contention.
This mode is most useful for two things: establishing a performance baseline when you first bring up a new hardware configuration, and sizing for low-traffic deployments where individual response time matters more than aggregate throughput.
```text
Profiling iterations: 30/30 [09:28<00:00, 18.94s/it]
Avg latency: 18.94 s
10 percentile latency: 18.60 s
25 percentile latency: 18.72 s
50 percentile latency: 18.89 s
75 percentile latency: 19.07 s
90 percentile latency: 19.33 s
99 percentile latency: 19.74 s
```
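A run like the one above can be launched with an invocation along these lines. The flag names come from the vllm bench latency CLI, and the input/output lengths here are assumptions chosen to mirror the serve benchmark rather than values taken from the run, so verify both against vllm bench latency --help on your version:

```shell
vllm bench latency \
    --model openai/gpt-oss-120b \
    --input-len 4096 \
    --output-len 1024 \
    --batch-size 1 \
    --num-iters 30
```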
3c. vllm bench throughput: Peak Hardware Throughput
The throughput benchmark asks a different question: how many tokens per second can this hardware produce when it's working as hard as possible? It uses batch processing, grouping multiple requests to maximise GPU utilisation at the cost of higher per-request latency. This is the right tool for offline or batch workloads: document processing pipelines, batch annotation jobs, nightly inference runs.
Results with --tensor-parallel-size 1 on a single MI300X card:
```text
Throughput: 0.24 req/s
Total token throughput: 1239.56 tok/s
Output token throughput: 249.15 tok/s
Total num prompt tokens: 16282
Total num output tokens: 4096
```
The lower numbers compared to bench serve reflect the different measurement
methodologythroughput mode accounts for all tokens end-to-end including prefill, while
serve mode reports output tokens generated during the active decode phase. Neither is
“wrong”; they measure different things.
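The two throughput figures are also mutually consistent, which a few lines of arithmetic on the reported totals makes explicit:

```python
prompt_tokens, output_tokens = 16282, 4096
output_tps, total_tps = 249.15, 1239.56   # reported tok/s figures

duration = output_tokens / output_tps     # run duration implied by output throughput (s)
implied_total_tps = (prompt_tokens + output_tokens) / duration
print(f"implied duration: {duration:.2f} s, total: {implied_total_tps:.1f} tok/s")
```

The implied total throughput reproduces the reported 1239.56 tok/s to within rounding, confirming that the total and output figures are two views of the same run.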
Choosing the Right Benchmark for Your Use Case
| Scenario | Benchmark | What it tells you |
|---|---|---|
| Real-time chat / streaming API | bench serve | TTFT, TPOT, E2E latency under concurrency |
| Hardware baseline, new deployment | bench latency | Raw single-request latency floor |
| Batch pipelines, offline inference | bench throughput | Peak tok/s at maximum GPU utilisation |
The Architecture Behind the Numbers
Two design decisions explain why this stack delivers the numbers it does on the MI300X.
PagedAttention manages the KV-cache in fixed-size pages rather than pre-allocating a contiguous block per request. This eliminates internal fragmentationon a model with variable output lengths, the difference between a naive allocator and PagedAttention can be 2–4× more concurrent sequences per GB of HBM. On a 192 GB card, that multiplier matters.
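A toy sketch of the block-table idea makes the mechanism concrete. This is illustrative only, not vLLM's implementation: logical token positions map to fixed-size physical pages allocated on demand, so a sequence wastes at most one partially filled page:

```python
PAGE_SIZE = 16  # tokens per KV-cache page ("block" in vLLM's terminology)

class PagedKVCache:
    """Toy allocator: pages are granted on demand and returned on completion."""

    def __init__(self, num_pages: int):
        self.free = list(range(num_pages))
        self.tables: dict[int, list[int]] = {}  # seq_id -> physical page ids
        self.lengths: dict[int, int] = {}       # seq_id -> tokens stored so far

    def append_token(self, seq_id: int) -> None:
        n = self.lengths.get(seq_id, 0)
        if n % PAGE_SIZE == 0:  # crossed a page boundary: grab one more page
            self.tables.setdefault(seq_id, []).append(self.free.pop())
        self.lengths[seq_id] = n + 1

    def release(self, seq_id: int) -> None:
        """Return a finished sequence's pages to the free pool."""
        self.free.extend(self.tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

cache = PagedKVCache(num_pages=8)
for _ in range(40):  # a 40-token sequence needs ceil(40 / 16) = 3 pages
    cache.append_token(seq_id=0)
print(len(cache.tables[0]), "pages used")  # -> 3 pages used
```

A naive allocator would instead reserve a contiguous max-sequence-length region per request up front; with variable output lengths most of that reservation goes unused, which is where the 2–4x concurrency gain quoted above comes from.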
AITER unified attention replaces the standard ROCm attention dispatch with kernels hand-tuned for the CDNA3 architecture. The unified attention path reduces memory round-trips in the attention block and is specifically compiled for the MI300X's HBM3 bandwidth characteristicssomething generic PyTorch ops don't account for.
Full CUDA graph capture ({"full_cuda_graph": true}) is a significant win on
long inference sessions. Without it, every decode step pays a Python-side kernel launch
overhead. With the graph captured at warmup, each step replays a pre-recorded plan at
native speed, and the CPU overhead becomes a non-issue.
Key Takeaways
Deploying a 120B parameter model as a reliable, OpenAI-compatible LLM service on a single AMD MI300X is not just feasible; it's practical today. The combination of vLLM's PagedAttention engine, AMD's AITER kernels, and full CUDA graph capture delivers production-competitive throughput and latency from a one-card, one-node setup.
- 192 GB HBM3 is the enabling constraint: it's what makes a 120B model fit on one card without tensor parallelism across nodes.
- AITER + unified attention are not optional extras; they're the difference between a functional server and a fast one on ROCm hardware.
- Three benchmark modes give you a complete picture: concurrent load (bench serve), single-request floor (bench latency), and peak capacity (bench throughput).
- Set --max-concurrency in bench serve based on your TTFT SLA, not just your GPU's throughput ceiling: the P99 tells the real story.
Whether you're building a real-time chatbot, a high-volume document processor, or an internal AI API, this stack gives you full control over your inference infrastructure — on your own hardware, with no cloud dependency.