Inference · AMD · ROCm

LLMs as a Production Service on AMD MI300X

Launching a 120B-parameter model on a single GPU, wiring up an OpenAI-compatible API, and stress-testing it with three complementary benchmarking modes, all on AMD's MI300X with vLLM and AITER kernels.

Apr 2026 · 10 min read · vLLM · AMD · ROCm · AITER

Running an LLM in a notebook is one thing. Running it as a reliable, low-latency API server that handles dozens of concurrent users, streams tokens, and survives peak load is something else entirely. This post walks through exactly that: from spinning up a vLLM server on AMD's MI300X, to sending the first request, to measuring real performance with three complementary benchmarking tools.

Why vLLM on AMD MI300X?

vLLM is an open-source, high-throughput inference engine built around PagedAttention, a memory-management technique that treats the KV-cache like virtual memory in an OS, eliminating fragmentation and dramatically reducing GPU memory waste per request. More efficient memory use means higher concurrency on the same hardware.

The AMD MI300X pairs well with this design. Its 192 GB of HBM3 memory, 2.4 times the 80 GB of an H100 SXM, means a 120B-parameter model that would otherwise require a multi-node setup fits on a single card. Combined with AMD's AITER (AI Tensor Engine for ROCm) kernels and unified attention backend, the MI300X can serve openai/gpt-oss-120b from one node with throughput numbers that are genuinely competitive for production workloads.

192 GB HBM3 per card
2394 Total tok/s (serve bench)
479 Output tok/s (serve bench)

Step 1: Launch the vLLM Server

With the ROCm container environment ready, the entire server launch is a single terminal command. The flags are what make the difference between a baseline inference process and a production-grade serving stack.

bash
export TP=1
export MODEL_ID="openai/gpt-oss-120b"
export VLLM_ROCM_USE_AITER=1
export VLLM_USE_AITER_UNIFIED_ATTENTION=1
export VLLM_ROCM_USE_AITER_MHA=0

vllm serve \
    $MODEL_ID \
    --port 8000 \
    --tensor-parallel-size $TP \
    --no-enable-prefix-caching \
    --disable-log-requests \
    --compilation-config '{"full_cuda_graph": true}'

Three flags do most of the heavy lifting: VLLM_ROCM_USE_AITER=1 turns on AMD's AITER kernels, VLLM_USE_AITER_UNIFIED_ATTENTION=1 routes attention through the unified AITER path, and --compilation-config '{"full_cuda_graph": true}' captures the decode loop as a CUDA graph so each step replays without Python-side launch overhead.

Once launched you'll see INFO:     Application startup complete. in the logs. At that point you have a self-hosted, OpenAI-compatible inference server: essentially a private ChatGPT backend you control end-to-end, running entirely on your own hardware.

Step 2: Send Your First Request

vLLM implements the OpenAI REST API spec, so any client built on the OpenAI SDK works here without modification. The simplest test is a plain requests call:

python
import requests

response = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "model": "openai/gpt-oss-120b",
        "prompt": "The future of AI is",
        "max_tokens": 100,
        "temperature": 0
    }
)
print(response.json())

The response is a standard OpenAI completion object: same schema, same field names. Swap the base URL in your existing app, and this server becomes a drop-in replacement for the cloud API with zero application-layer changes.
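To make the schema concrete, here is a minimal sketch of unpacking such a response. The payload below is illustrative (hand-written in the standard completion shape, not captured from a real run); a real response from the server above carries the same field names:

```python
# Illustrative completion payload in the OpenAI schema; values are made up.
sample = {
    "id": "cmpl-example",
    "object": "text_completion",
    "model": "openai/gpt-oss-120b",
    "choices": [
        {"index": 0, "text": " bright.", "finish_reason": "length"}
    ],
    "usage": {"prompt_tokens": 5, "completion_tokens": 100, "total_tokens": 105},
}

def extract_text(completion: dict) -> str:
    """Pull the generated text out of an OpenAI-style completion object."""
    return completion["choices"][0]["text"]

print(extract_text(sample))  # -> " bright."
```

Because the field names match the cloud API, this same accessor works unchanged whether the completion came from the local server or from OpenAI's hosted endpoint.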

Step 3: Benchmarking vLLM Performance

A server that returns correct responses is table stakes. Understanding its performance envelope (where it saturates, how it handles concurrency, and what the hardware ceiling looks like) is what separates a toy deployment from one you can actually put in front of users. The vLLM toolchain ships three complementary benchmark modes, each measuring a different dimension.

3a. vllm bench serve: Concurrent Load Testing

This mode simulates real-world serving by firing requests from multiple concurrent clients against the running server using synthetic random data. It's the closest proxy to what your users actually experience. The four metrics it reports map directly to user-facing quality: TTFT (time to first token), TPOT (time per output token), ITL (inter-token latency), and E2EL (end-to-end latency).

Results from a representative run (80 requests, max concurrency 8, input length 4096 tokens, output length 1024 tokens):

Metric       Mean       Median     P99
TTFT (ms)    864.75     833.79     2091.68
TPOT (ms)    15.87      15.85      16.96
ITL (ms)     15.87      15.22      17.26
E2EL (ms)    17,097     16,968     17,943
bench serve · notebook output
Successful requests:              80
Benchmark duration (s):           171.00
Total input tokens:               327463
Total generated tokens:           81920
Request throughput (req/s):       0.47
Output token throughput (tok/s):  479.07
Total Token throughput (tok/s):   2394.07

Mean TTFT (ms):                   864.75
Median TTFT (ms):                 833.79
P99 TTFT (ms):                    2091.68

Mean TPOT (ms):                   15.87
Median TPOT (ms):                 15.85
P99 TPOT (ms):                    16.96

Mean ITL (ms):                    15.87
Median ITL (ms):                  15.22
P99 ITL (ms):                     17.26

Mean E2EL (ms):                   17097.11
Median E2EL (ms):                 16968.59
P99 E2EL (ms):                    17943.96

TPOT and ITL hovering around 15–16 ms means the model produces roughly 63 tokens per second per stream, well above the threshold for imperceptible streaming. The P99 TTFT of 2,091 ms reflects queue wait at peak concurrency; tune --max-concurrency to enforce an SLA ceiling. The recommended practice is to set --num-prompts to 10× --max-concurrency for statistically stable results.
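The per-stream figure follows directly from TPOT; a quick sketch of the arithmetic using the mean from the run above (pure arithmetic, no vLLM dependency):

```python
mean_tpot_ms = 15.87  # mean time per output token (ms), from the bench serve run

# One token is emitted every TPOT milliseconds, so the per-stream
# decode rate is simply 1000 / TPOT tokens per second.
tokens_per_second = 1000 / mean_tpot_ms
print(round(tokens_per_second))  # -> 63
```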

3b. vllm bench latency: Single-Request Baseline

Where bench serve tests a crowd, bench latency tests a single user. It measures prompt processing and token generation latency with no batching or concurrency effects whatsoever, giving you the hardware's raw latency floor: the theoretical best-case response time with zero queue contention.

This mode is most useful for two things: establishing a performance baseline when you first bring up a new hardware configuration, and sizing for low-traffic deployments where individual response time matters more than aggregate throughput.

bench latency · notebook output
Profiling iterations: 30/30 [09:28<00:00, 18.94s/it]

Avg latency:            18.94 s
10 percentile latency:  18.60 s
25 percentile latency:  18.72 s
50 percentile latency:  18.89 s
75 percentile latency:  19.07 s
90 percentile latency:  19.33 s
99 percentile latency:  19.74 s

3c. vllm bench throughput: Peak Hardware Throughput

The throughput benchmark asks a different question: how many tokens per second can this hardware produce when it's working as hard as possible? It uses batch processing: grouping multiple requests to maximise GPU utilisation at the cost of higher per-request latency. This is the right tool for offline or batch workloads: document processing pipelines, batch annotation jobs, nightly inference runs.

Results with --tensor-parallel-size 1 on a single MI300X card:

1239 Total tok/s (throughput bench)
249 Output tok/s (throughput bench)
bench throughput · notebook output
Throughput:              0.24 req/s
Total token throughput:  1239.56 tok/s
Output token throughput: 249.15 tok/s

Total num prompt tokens: 16282
Total num output tokens: 4096

The lower numbers compared to bench serve reflect the different measurement methodology: throughput mode accounts for all tokens end-to-end including prefill, while serve mode reports output tokens generated during the active decode phase. Neither is "wrong"; they measure different things.
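As a sanity check, the two throughput figures from the run above are consistent with each other: given the token counts and the total throughput, the output-only throughput falls out of the implied wall-clock duration (a sketch; small rounding differences are expected):

```python
prompt_tokens = 16282   # total prompt tokens, from the bench output
output_tokens = 4096    # total output tokens, from the bench output
total_tps = 1239.56     # reported total token throughput (tok/s)

# Implied wall-clock duration of the run: all tokens / total throughput.
duration_s = (prompt_tokens + output_tokens) / total_tps

# Output-only throughput over the same window.
output_tps = output_tokens / duration_s
print(round(duration_s, 2), round(output_tps, 2))  # -> 16.44 249.15
```

The derived 249.15 tok/s matches the reported output token throughput, confirming the two numbers describe the same run through different lenses.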

Choosing the Right Benchmark for Your Use Case

Scenario                             Benchmark          What it tells you
Real-time chat / streaming API       bench serve        TTFT, TPOT, E2E latency under concurrency
Hardware baseline, new deployment    bench latency      Raw single-request latency floor
Batch pipelines, offline inference   bench throughput   Peak tok/s at maximum GPU utilisation

The Architecture Behind the Numbers

Two design decisions explain why this stack delivers the numbers it does on the MI300X.

PagedAttention manages the KV-cache in fixed-size pages rather than pre-allocating a contiguous block per request. This eliminates internal fragmentation: on a model with variable output lengths, the difference between a naive allocator and PagedAttention can be 2–4× more concurrent sequences per GB of HBM. On a 192 GB card, that multiplier matters.
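The fragmentation argument is easy to see with a back-of-the-envelope sketch. The block size of 16 tokens matches vLLM's default, but the sequence lengths here are illustrative, not from a real trace:

```python
import math

BLOCK_SIZE = 16  # tokens per KV-cache page (vLLM's default block size)

def pages_needed(seq_len: int) -> int:
    """KV-cache pages a sequence occupies under paged allocation."""
    return math.ceil(seq_len / BLOCK_SIZE)

# A naive allocator reserves the maximum output length up front for every
# request; a paged allocator grows page by page as tokens are generated.
max_len = 1024
actual_lens = [130, 260, 900, 40]  # illustrative per-request output lengths

naive_blocks = len(actual_lens) * pages_needed(max_len)
paged_blocks = sum(pages_needed(n) for n in actual_lens)
print(naive_blocks, paged_blocks)  # -> 256 86
```

In this toy example the paged allocator uses roughly a third of the blocks the naive one reserves, which is exactly the kind of 2–4× headroom the paragraph above describes.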

AITER unified attention replaces the standard ROCm attention dispatch with kernels hand-tuned for the CDNA3 architecture. The unified attention path reduces memory round-trips in the attention block and is specifically compiled for the MI300X's HBM3 bandwidth characteristics, something generic PyTorch ops don't account for.

Full CUDA graph capture ("full_cuda_graph": true) is a significant win on long inference sessions. Without it, every decode step pays a Python-side kernel launch overhead. With the graph captured at warmup, each step replays a pre-recorded plan at native speed; the CPU overhead becomes a non-issue.

Key Takeaways

Deploying a 120B parameter model as a reliable, OpenAI-compatible LLM service on a single AMD MI300X is not just feasible; it's practical today. The combination of vLLM's PagedAttention engine, AMD's AITER kernels, and full CUDA graph capture delivers production-competitive throughput and latency from a one-card, one-node setup.

Whether you're building a real-time chatbot, a high-volume document processor, or an internal AI API, this stack gives you full control over your inference infrastructure — on your own hardware, with no cloud dependency.
