Running an LLM in a notebook is one thing. Running it as a reliable, low-latency API server that handles dozens of concurrent users, streams tokens, and survives peak load is something else entirely. This post walks through exactly that: from spinning up a vLLM server on AMD's MI300X, to sending the first request, to measuring real performance with three complementary benchmarking tools.
Why vLLM on AMD MI300X?
vLLM is an open-source, high-throughput inference engine built around PagedAttention, a memory management technique that treats the KV-cache like virtual memory in an OS, eliminating fragmentation and dramatically reducing GPU memory waste per request. More efficient memory use means higher concurrency on the same hardware.
The AMD MI300X pairs well with this design. Its 192 GB of HBM3 memory (2.4 times the 80 GB of an H100 SXM) means a 120B-parameter model that would otherwise require a multi-GPU setup fits on a single card. Combined with AMD's AITER (AI Tensor Engine for ROCm) kernels and unified attention backend, the MI300X can serve openai/gpt-oss-120b from one node with throughput numbers that are genuinely competitive for production workloads.
Step 1: Launch the vLLM Server
With the ROCm container environment ready, the entire server launch is a single terminal command. The flags are what make the difference between a baseline inference process and a production-grade serving stack.
```shell
export TP=1
export MODEL_ID="openai/gpt-oss-120b"
export VLLM_ROCM_USE_AITER=1
export VLLM_USE_AITER_UNIFIED_ATTENTION=1
export VLLM_ROCM_USE_AITER_MHA=0

vllm serve \
    $MODEL_ID \
    --port 8000 \
    --tensor-parallel-size $TP \
    --no-enable-prefix-caching \
    --disable-log-requests \
    --compilation-config '{"full_cuda_graph": true}'
```
Three flags do most of the heavy lifting:
- --tensor-parallel-size 1: the entire 120B model runs on one MI300X. The 192 GB HBM3 pool makes this possible without NVLink or inter-node communication overhead.
- VLLM_ROCM_USE_AITER=1 & VLLM_USE_AITER_UNIFIED_ATTENTION=1: activate AMD's AITER kernels and the unified attention backend, both optimised for the ROCm stack and delivering meaningful throughput gains over stock attention implementations.
- --compilation-config '{"full_cuda_graph": true}': captures the full execution graph at warmup so inference replays a pre-recorded plan, eliminating per-step kernel launch overhead and flattening latency variance.
Once launched, you'll see INFO: Application startup complete. in the logs. At that point you have a self-hosted, OpenAI-compatible inference server: essentially a private ChatGPT backend you control end-to-end, running entirely on your own hardware.
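Before sending a real prompt, a quick smoke test confirms the server is up. vLLM exposes the standard OpenAI model-listing endpoint, so a plain curl against the port chosen above should return a JSON object whose data array contains the model ID:

```shell
# Assumes the server from Step 1 is running locally on port 8000.
curl http://localhost:8000/v1/models
```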
Step 2: Send Your First Request
vLLM implements the OpenAI REST API spec, so any client built against the OpenAI API works here without modification. The simplest test is a plain requests call:
```python
import requests

response = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "model": "openai/gpt-oss-120b",
        "prompt": "The future of AI is",
        "max_tokens": 100,
        "temperature": 0,
    },
)
print(response.json())
```
The response is a standard OpenAI completion object: same schema, same field names. Swap the base URL in your existing app, and this server becomes a drop-in replacement for the cloud API with zero application-layer changes.
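Because the schema matches, response handling code carries over unchanged too. A minimal sketch of pulling the generated text out of a completion object; the sample payload below is hand-written in the shape vLLM returns, not captured from a live server:

```python
def extract_completion_text(response_json: dict) -> str:
    """Pull the generated text out of an OpenAI-style completion object."""
    return response_json["choices"][0]["text"]

# Illustrative payload in the OpenAI completion shape (field values made up).
sample = {
    "id": "cmpl-123",
    "object": "text_completion",
    "model": "openai/gpt-oss-120b",
    "choices": [{"index": 0, "text": " bright.", "finish_reason": "length"}],
    "usage": {"prompt_tokens": 6, "completion_tokens": 2},
}

print(extract_completion_text(sample))  # -> " bright."
```

The official openai Python SDK works the same way: construct the client with base_url="http://localhost:8000/v1" and the rest of your application code stays untouched.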
Step 3: Benchmarking vLLM Performance
A server that returns correct responses is table stakes. Understanding its performance envelope (where it saturates, how it handles concurrency, and what the hardware ceiling looks like) is what separates a toy deployment from one you can actually put in front of users. The vLLM toolchain ships three complementary benchmark modes, each measuring a different dimension.
3a. vllm bench serve: Concurrent Load Testing
This mode simulates real-world serving by firing requests from multiple concurrent clients against the running server using synthetic random data. It's the closest proxy to what your users actually experience. The four metrics it reports map directly to user-facing quality:
- TTFT (Time to First Token): how quickly the first token arrives after request submission. Governs perceived responsiveness in streaming UIs.
- TPOT (Time Per Output Token): average latency between consecutive generated tokens. Controls how smooth the streaming experience feels.
- ITL (Inter-Token Latency): like TPOT, but reported per token gap rather than averaged per request, so it surfaces jitter between tokens.
- E2EL (End-to-End Latency): total wall-clock time from request submission to the final token. Relevant for non-streaming, request-response applications.
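All four metrics derive from the same token-arrival timeline, which is worth internalising before reading the results. A toy illustration (synthetic timestamps, not benchmark code):

```python
def serving_metrics(t_submit: float, token_times: list[float]) -> dict:
    """Derive TTFT, TPOT, ITL, and E2EL from one request's token arrival times (s)."""
    ttft = token_times[0] - t_submit                             # wait for first token
    itl = [b - a for a, b in zip(token_times, token_times[1:])]  # per-gap latencies
    tpot = sum(itl) / len(itl)                                   # mean gap after first token
    e2el = token_times[-1] - t_submit                            # submission to final token
    return {"ttft": ttft, "tpot": tpot, "itl": itl, "e2el": e2el}

# First token 0.8 s after submission, then one token every 16 ms.
times = [0.8 + 0.016 * i for i in range(5)]
m = serving_metrics(0.0, times)
print(m["ttft"], m["tpot"], m["e2el"])
```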
Results from a representative run (80 requests, max concurrency 8, input length 4096 tokens, output length 1024 tokens):
| Metric | Mean | Median | P99 |
|---|---|---|---|
| TTFT (ms) | 864.75 | 833.79 | 2091.68 |
| TPOT (ms) | 15.87 | 15.85 | 16.96 |
| ITL (ms) | 15.87 | 15.22 | 17.26 |
| E2EL (ms) | 17097.11 | 16968.59 | 17943.96 |
```text
Successful requests:              80
Benchmark duration (s):           171.00
Total input tokens:               327463
Total generated tokens:           81920
Request throughput (req/s):       0.47
Output token throughput (tok/s):  479.07
Total token throughput (tok/s):   2394.07
Mean TTFT (ms):    864.75
Median TTFT (ms):  833.79
P99 TTFT (ms):     2091.68
Mean TPOT (ms):    15.87
Median TPOT (ms):  15.85
P99 TPOT (ms):     16.96
Mean ITL (ms):     15.87
Median ITL (ms):   15.22
P99 ITL (ms):      17.26
Mean E2EL (ms):    17097.11
Median E2EL (ms):  16968.59
P99 E2EL (ms):     17943.96
```
TPOT and ITL hovering around 15–16 ms means the model produces roughly
63 tokens per second per stream, well above the threshold for imperceptible streaming.
The P99 TTFT of 2,091 ms reflects queue wait at peak concurrency; tune
--max-concurrency to enforce an SLA ceiling. The recommended practice is
to set --num-prompts to 10× --max-concurrency for statistically
stable results.
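The table is also internally consistent, which is a useful sanity check on any benchmark run: for a 1024-token output, E2EL should be roughly TTFT plus 1023 decode steps at the mean TPOT. Checking that against the means above:

```python
ttft_ms, tpot_ms, output_len = 864.75, 15.87, 1024

# One TTFT wait, then (output_len - 1) decode steps at the mean TPOT.
predicted_e2el = ttft_ms + (output_len - 1) * tpot_ms
print(f"predicted E2EL: {predicted_e2el:.0f} ms")  # measured mean was 17097 ms
```

The prediction lands within a few milliseconds of the measured mean, confirming that first-token wait plus decode accounts for essentially all of the end-to-end time.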
3b. vllm bench latency: Single-Request Baseline
Where bench serve tests a crowd, bench latency tests a single user.
It measures prompt processing and token generation latency with no batching or concurrency
effects whatsoever, giving you the hardware's raw latency floor: the
theoretical best-case response time with zero queue contention.
This mode is most useful for two things: establishing a performance baseline when you first bring up a new hardware configuration, and sizing for low-traffic deployments where individual response time matters more than aggregate throughput.
```text
Profiling iterations: 30/30 [09:28<00:00, 18.94s/it]
Avg latency: 18.94 s
10 percentile latency: 18.60 s
25 percentile latency: 18.72 s
50 percentile latency: 18.89 s
75 percentile latency: 19.07 s
90 percentile latency: 19.33 s
99 percentile latency: 19.74 s
```
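A run like the one above can be launched with an invocation along these lines. The flag names come from the vllm bench latency CLI, and the input/output lengths here are assumptions chosen to mirror the serve benchmark rather than values taken from the run, so verify both against vllm bench latency --help on your version:

```shell
vllm bench latency \
    --model openai/gpt-oss-120b \
    --input-len 4096 \
    --output-len 1024 \
    --batch-size 1 \
    --num-iters 30
```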
3c. vllm bench throughput: Peak Hardware Throughput
The throughput benchmark asks a different question: how many tokens per second can this hardware produce when it's working as hard as possible? It uses batch processing, grouping multiple requests to maximise GPU utilisation at the cost of higher per-request latency. This is the right tool for offline or batch workloads: document processing pipelines, batch annotation jobs, nightly inference runs.
Results with --tensor-parallel-size 1 on a single MI300X card:
```text
Throughput: 0.24 req/s
Total token throughput: 1239.56 tok/s
Output token throughput: 249.15 tok/s
Total num prompt tokens: 16282
Total num output tokens: 4096
```
The lower numbers compared to bench serve reflect the different measurement
methodologythroughput mode accounts for all tokens end-to-end including prefill, while
serve mode reports output tokens generated during the active decode phase. Neither is
“wrong”; they measure different things.
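The two throughput figures are also mutually consistent, which a few lines of arithmetic on the reported totals makes explicit:

```python
prompt_tokens, output_tokens = 16282, 4096
output_tps, total_tps = 249.15, 1239.56   # reported tok/s figures

duration = output_tokens / output_tps     # run duration implied by output throughput (s)
implied_total_tps = (prompt_tokens + output_tokens) / duration
print(f"implied duration: {duration:.2f} s, total: {implied_total_tps:.1f} tok/s")
```

The implied total throughput reproduces the reported 1239.56 tok/s to within rounding, confirming that the total and output figures are two views of the same run.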
Choosing the Right Benchmark for Your Use Case
| Scenario | Benchmark | What it tells you |
|---|---|---|
| Real-time chat / streaming API | bench serve | TTFT, TPOT, E2E latency under concurrency |
| Hardware baseline, new deployment | bench latency | Raw single-request latency floor |
| Batch pipelines, offline inference | bench throughput | Peak tok/s at maximum GPU utilisation |
The Architecture Behind the Numbers
Two design decisions explain why this stack delivers the numbers it does on the MI300X.
PagedAttention manages the KV-cache in fixed-size pages rather than pre-allocating a contiguous block per request. This eliminates internal fragmentationon a model with variable output lengths, the difference between a naive allocator and PagedAttention can be 2–4× more concurrent sequences per GB of HBM. On a 192 GB card, that multiplier matters.
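A toy sketch of the block-table idea makes the mechanism concrete. This is illustrative only, not vLLM's implementation: logical token positions map to fixed-size physical pages allocated on demand, so a sequence wastes at most one partially filled page:

```python
PAGE_SIZE = 16  # tokens per KV-cache page ("block" in vLLM's terminology)

class PagedKVCache:
    """Toy allocator: pages are granted on demand and returned on completion."""

    def __init__(self, num_pages: int):
        self.free = list(range(num_pages))
        self.tables: dict[int, list[int]] = {}  # seq_id -> physical page ids
        self.lengths: dict[int, int] = {}       # seq_id -> tokens stored so far

    def append_token(self, seq_id: int) -> None:
        n = self.lengths.get(seq_id, 0)
        if n % PAGE_SIZE == 0:  # crossed a page boundary: grab one more page
            self.tables.setdefault(seq_id, []).append(self.free.pop())
        self.lengths[seq_id] = n + 1

    def release(self, seq_id: int) -> None:
        """Return a finished sequence's pages to the free pool."""
        self.free.extend(self.tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

cache = PagedKVCache(num_pages=8)
for _ in range(40):  # a 40-token sequence needs ceil(40 / 16) = 3 pages
    cache.append_token(seq_id=0)
print(len(cache.tables[0]), "pages used")  # -> 3 pages used
```

A naive allocator would instead reserve a contiguous max-sequence-length region per request up front; with variable output lengths most of that reservation goes unused, which is where the 2–4x concurrency gain quoted above comes from.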
AITER unified attention replaces the standard ROCm attention dispatch with kernels hand-tuned for the CDNA3 architecture. The unified attention path reduces memory round-trips in the attention block and is specifically compiled for the MI300X's HBM3 bandwidth characteristicssomething generic PyTorch ops don't account for.
Full CUDA graph capture ({"full_cuda_graph": true}) is a significant win on
long inference sessions. Without it, every decode step pays a Python-side kernel launch
overhead. With the graph captured at warmup, each step replays a pre-recorded plan at
native speed, and the CPU overhead becomes a non-issue.
Key Takeaways
Deploying a 120B parameter model as a reliable, OpenAI-compatible LLM service on a single AMD MI300X is not just feasible; it's practical today. The combination of vLLM's PagedAttention engine, AMD's AITER kernels, and full CUDA graph capture delivers production-competitive throughput and latency from a one-card, one-node setup.
- 192 GB HBM3 is the enabling constraint: it's what makes a 120B model fit on one card without tensor parallelism across nodes.
- AITER + unified attention are not optional extras; they're the difference between a functional server and a fast one on ROCm hardware.
- Three benchmark modes give you a complete picture: concurrent load (bench serve), single-request floor (bench latency), and peak capacity (bench throughput).
- Set --max-concurrency in bench serve based on your TTFT SLA, not just your GPU's throughput ceiling: the P99 tells the real story.
Whether you're building a real-time chatbot, a high-volume document processor, or an internal AI API, this stack gives you full control over your inference infrastructure — on your own hardware, with no cloud dependency.