All Posts
Inference

vLLM vs SGLang vs TRT-LLM on RTX 5090

A three-way benchmark of the leading LLM serving backends on Qwen2.5-7B measuring throughput, latency, and quality across practical, overload, and extreme concurrency levels. TRT-LLM wins at scale-out; vLLM and SGLang win when the queue never empties.

Apr 2026 · 14 min read · vLLM · SGLang · TensorRT-LLM
View on GitHub →

Picking an LLM serving backend used to be simple: vLLM was the obvious choice. But SGLang has caught up fast, and TensorRT-LLM now ships an OpenAI-compatible HTTP server. All three speak the same interface; the differences live entirely in compiled kernels, batching strategies, and how each reacts to load. I wanted hard numbers, not marketing copy.

Hardware & Setup

All runs used Qwen/Qwen2.5-7B-Instruct on an RTX 5090 (32 GB, Blackwell SM 12.0). vLLM ran with --attention-backend FLASHINFER, SGLang with its default radix-cache scheduler, TRT-LLM with --backend tensorrt on CUDA 12.9 and FlashInfer enabled. The benchmark harness fired async concurrent requests over the OpenAI-compatible HTTP API and measured wall-clock RPS, output TPS, TTFT, and inter-token latency.
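For reference, the three servers can be launched roughly as follows. The vLLM flags come from the setup above; the SGLang and TRT-LLM invocations are plausible sketches of their standard entry points (`sglang.launch_server`, `trtllm-serve`), not the exact commands used for these runs:

```shell
# vLLM (flags as described above)
vllm serve Qwen/Qwen2.5-7B-Instruct \
  --host 0.0.0.0 --port 8000 \
  --attention-backend FLASHINFER

# SGLang, default radix-cache scheduler (sketch)
python -m sglang.launch_server \
  --model-path Qwen/Qwen2.5-7B-Instruct --port 8000

# TensorRT-LLM OpenAI-compatible server (sketch)
trtllm-serve Qwen/Qwen2.5-7B-Instruct \
  --host 0.0.0.0 --port 8000 --backend tensorrt
```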

The Three Backends

4,108 — TRT-LLM TPS @ c=64
2,917 — vLLM TPS @ c=64
22 ms — TRT-LLM TTFT @ c=1

Each backend exposes an OpenAI-compatible endpoint, so the benchmark harness is identical for all three; only the server changes. The architectural differences that matter most at serving time are:

Benchmark Design

Three test suites, each with a different goal:

benchmark harness (simplified)
import asyncio
import time

import httpx

async def load_test(base_url, model, concurrency, num_requests):
    semaphore = asyncio.Semaphore(concurrency)
    results = []

    async with httpx.AsyncClient(timeout=None) as client:

        async def single_request():
            async with semaphore:
                t0 = time.perf_counter()
                first_token_time = None
                tokens = 0
                async with client.stream("POST", f"{base_url}/v1/completions",
                    json={"model": model, "stream": True, ...}) as r:
                    async for chunk in r.aiter_lines():
                        if not chunk:  # skip blank SSE separator lines
                            continue
                        if first_token_time is None:
                            first_token_time = time.perf_counter() - t0
                        tokens += 1    # one data line per streamed token (simplified)
                results.append({"ttft": first_token_time,
                                "tps": tokens / (time.perf_counter() - t0)})

        await asyncio.gather(*[single_request() for _ in range(num_requests)])
    return results
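Per-request samples are only half the story; a small aggregation step turns them into run-level RPS, aggregate output TPS, and percentile TTFT. A sketch, assuming each per-request record also carries its token count (the harness could append a `tokens` field alongside `ttft`):

```python
import statistics

def summarise(results, wall_clock_s):
    """Collapse per-request samples into run-level metrics.

    Assumes each entry carries "ttft" (seconds) and "tokens";
    wall_clock_s is the total duration of the whole run.
    """
    ttfts = sorted(r["ttft"] for r in results)
    total_tokens = sum(r["tokens"] for r in results)
    return {
        "rps": len(results) / wall_clock_s,         # wall-clock requests/s
        "output_tps": total_tokens / wall_clock_s,  # aggregate tokens/s
        "ttft_avg_ms": 1000 * statistics.mean(ttfts),
        "ttft_p99_ms": 1000 * ttfts[int(0.99 * (len(ttfts) - 1))],
    }
```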

Baseline Sweep: Concurrency 1 → 50

At low to moderate concurrency, all three backends are close. The gap opens at c=50, where TRT-LLM's compiled kernels begin to dominate:

| Concurrency | vLLM TPS | SGLang TPS | TRT-LLM TPS | vLLM TTFT | SGLang TTFT | TRT-LLM TTFT |
|---|---|---|---|---|---|---|
| 1 | 101 | 102 | 93 | 25 ms | 33 ms | 22 ms ★ |
| 5 | 496 | 479 | 458 | 43 ms | 118 ms | 33 ms ★ |
| 10 | 917 | 913 | 896 | 53 ms | 52 ms | 48 ms ★ |
| 20 | 1,474 | 1,479 ★ | 1,477 | 85 ms | 63 ms ★ | 83 ms |
| 50 | 2,907 | 2,576 | 4,063 ★ | 160 ms ★ | 186 ms | 191 ms |

At c=1, TRT-LLM posts the lowest TTFT (22 ms); single-request latency is where compiled CUDA graphs shine most. At c=50, TRT-LLM pulls ahead on throughput by 40% over vLLM and 58% over SGLang.

vLLM dashboard overview & KPIs
SGLang dashboard overview & KPIs

Fine-Grained Sweep: Finding Saturation

All three backends saturate at c=64: adding more concurrent users past that point yields no additional throughput. But the saturation TPS is very different:

| Concurrency | vLLM TPS | SGLang TPS | TRT-LLM TPS |
|---|---|---|---|
| 1 | 101 | 102 | 93 |
| 8 | 708 | 704 | 649 |
| 16 | 1,169 | 1,144 | 1,132 |
| 32 | 2,158 | 2,069 | 2,179 |
| 64 | 2,917 | 2,887 | 4,108 ★ |
| 128 | 2,882 | 2,836 | 4,101 ★ |

At saturation, TRT-LLM delivers 4,108 TPS: 41% more than vLLM (2,917) and 42% more than SGLang (2,887). The flat line from c=64 to c=128 confirms the GPU is fully utilised; the marginal regression at c=128 is queue overhead, not GPU capacity.
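The saturation point can be detected mechanically from a sweep like the one above: stop when a doubling of concurrency no longer improves throughput by more than some tolerance. A minimal sketch (the 5% threshold is my assumption, not a value from the harness):

```python
def find_saturation(sweep, tol=0.05):
    """Return the (concurrency, tps) point past which throughput gains
    fall below `tol` (fractional improvement per step).

    `sweep` is a list of (concurrency, tps) pairs in ascending order.
    """
    for (c_prev, tps_prev), (c_next, tps_next) in zip(sweep, sweep[1:]):
        if tps_next < tps_prev * (1 + tol):
            return c_prev, tps_prev  # next step gained < tol: saturated
    return sweep[-1]                 # never flattened within the sweep
```

Fed the TRT-LLM column from the table, this returns the c=64 row, matching the plateau observed at c=128.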

vLLM throughput / latency tradeoff
SGLang throughput / latency tradeoff

Why TRT-LLM Wins at Practical Concurrency

At c=64, TRT-LLM does in one kernel what vLLM does in three or four. The startup compilation step produces:

Overload: Where vLLM & SGLang Take the Lead

Once the queue is permanently saturated (c=100+), continuous batching closes the gap and then reverses it. vLLM and SGLang hit ~5,600–5,660 TPS at the GPU ceiling while TRT-LLM plateaus at ~4,220 TPS:

| Concurrency | vLLM TPS | SGLang TPS | TRT-LLM TPS | vLLM TTFT | TRT-LLM TTFT |
|---|---|---|---|---|---|
| 100 | 5,550 | 5,638 ★ | 4,200 | 264 ms | 1,323 ms |
| 200 | 5,529 | 5,644 ★ | 4,226 | 244 ms | 1,322 ms |
| 500 | 5,658 ★ | 5,613 | 4,220 | 174 ms | 1,300 ms |

The reversal happens because TRT-LLM uses static compiled batch sizes (1, 2, 4 … 64, 128). When 300 requests are queued, it still processes them in groups of 128, 128, 44, so padding waste grows. vLLM and SGLang inspect the full queue every decode step and pack exactly as many sequences as the GPU can hold, so every step runs at full GPU utilisation with no padding.
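The padding cost of static batch shapes is easy to quantify. A sketch under the assumption of the power-of-two engine shapes described above: each group of requests is padded up to the smallest compiled batch size that fits it, and the waste is the fraction of padded slots.

```python
SHAPES = [1, 2, 4, 8, 16, 32, 64, 128]  # compiled batch sizes (assumed)

def static_batch_waste(queue_len, shapes=SHAPES):
    """Split `queue_len` requests into largest-shape-first groups and
    report the fraction of batch slots wasted on padding."""
    remaining, padded_slots = queue_len, 0
    while remaining > 0:
        take = min(remaining, shapes[-1])
        # smallest compiled shape that fits this group
        shape = next(s for s in shapes if s >= take)
        padded_slots += shape - take
        remaining -= take
    return padded_slots / (queue_len + padded_slots)
```

For the 300-request example in the text, the groups are 128, 128, 44; the last group runs in the 64-slot engine, so 20 of 320 slots (6.25%) are padding.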

TRT-LLM's TTFT also breaks the 500 ms SLA much earlier: at c≈100 (1,323 ms average) vs c≈1,000 for vLLM and SGLang. The static-batch architecture has less runtime flexibility to interleave prefill and decode phases under deep queues.

Extreme Overload: The True Breaking Point

At c=1,000+, vLLM and SGLang never hard-fail; they queue everything. The breaking point is latency, not errors:

| Concurrency | vLLM TPS | SGLang TPS | vLLM TTFT | SGLang TTFT | Errors |
|---|---|---|---|---|---|
| 500 | 5,658 | 5,613 | 174 ms | 197 ms | 0 |
| 1,000 | 5,512 | 5,574 | 2,538 ms | 2,480 ms | 0 |
| 3,000 | 5,491 | 5,494 | 2,555 ms | 2,521 ms | 0 |
| 5,000 | 5,491 | 5,526 | 2,548 ms | 2,497 ms | 0 |

TTFT jumps 14× between c=500 and c=1,000 (174 ms → 2,538 ms) while TPS stays flat at the GPU ceiling (~5,500). The practical SLA boundary for TTFT < 500 ms is c ≈ 500–1,000. To enforce a hard limit, set --max-num-seqs on the vLLM server or apply a client-side timeout; the GPU is never the bottleneck at these concurrency levels.
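As a concrete guard, the admission cap can be set when launching the vLLM server; the value below is illustrative, not a tuned optimum for this hardware:

```shell
# Cap in-flight sequences so queue depth (and thus TTFT) stays bounded.
# 512 is an illustrative value, not a measured optimum for the 5090.
vllm serve Qwen/Qwen2.5-7B-Instruct \
  --host 0.0.0.0 --port 8000 \
  --attention-backend FLASHINFER \
  --max-num-seqs 512
```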

Quality Evaluation

Serving the same weights through different engines shouldn't change model quality, but it can surface differences in sampling implementations. Across MMLU, GSM8K, and HumanEval:

| Dataset | Metric | vLLM | SGLang | TRT-LLM |
|---|---|---|---|---|
| MMLU | Accuracy | 100% | 100% | 100% |
| GSM8K | Exact Match | 50% | 62.5% | 87.5% ★ |
| HumanEval | pass@1 | 100% | 100% | 100% |

MMLU and HumanEval are consistent across all three. The GSM8K spread (50% → 87.5%) is driven by small-sample noise: the eval runs only 8 questions, so a single extra correct answer moves the score by 12.5 percentage points. This is not evidence of engine-level quality differences.
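The noise claim checks out numerically: with only 8 graded questions, the granularity of the score and its binomial standard error are both on the order of the observed spread. A quick sketch:

```python
import math

def score_noise(n_questions=8, p=0.5):
    """Granularity and binomial standard error (in percentage points)
    of an accuracy score estimated from n graded questions."""
    step = 100.0 / n_questions                        # one answer's worth
    se = 100.0 * math.sqrt(p * (1 - p) / n_questions)  # SE at accuracy p
    return step, se
```

At p=0.5 this gives a 12.5-point step and a ~17.7-point standard error, so differences of two or three questions between engines sit well inside the noise.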

vLLM quality metrics
SGLang quality metrics

TRT-LLM on Blackwell: The CUDA Version Trap

Getting TRT-LLM running on the RTX 5090 required more than a pip install: TRT-LLM 1.2.0 was built against CUDA 13's cuBLAS and required OpenMPI, and the RTX 5090's SM 12.0 (Blackwell) architecture isn't supported by CUDA 12.8's nvcc. Upgrading from the PyTorch backend (CUDA 12.8) to the TRT engine (CUDA 12.9) changed everything:

| Metric | PyTorch backend (CUDA 12.8) | TRT engine (CUDA 12.9) | Improvement |
|---|---|---|---|
| TPS @ c=50 | 1,565 | 4,063 | +160% |
| Saturation TPS | ~1,742 @ c=32 | ~4,108 @ c=64 | +136% |
| GPU ceiling TPS | ~1,950 | ~4,220 | +116% |
| TTFT avg @ c=1 | 38 ms | 22 ms | −42% |

The CUDA version matters more than any tuning knob. Upgrading CUDA 12.8 → 12.9 and switching to --backend tensorrt more than doubled throughput and cut TTFT by 42%. If you're on Blackwell and benchmarking TRT-LLM without CUDA 12.9 and the TRT engine, you're measuring the wrong thing.
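A quick pre-flight check saves a wasted benchmark run. These are standard toolchain queries (`compute_cap` is a documented nvidia-smi query field); the expected values reflect the setup described above:

```shell
# Confirm nvcc actually targets Blackwell (SM 12.0 needs CUDA 12.9+)
nvcc --version | grep release

# Confirm the driver reports the expected compute capability (12.0)
nvidia-smi --query-gpu=name,compute_cap --format=csv
```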

When to Use Each

| Use case | Best choice | Reason |
|---|---|---|
| Production serving, TTFT SLA < 200 ms | TRT-LLM | Lowest latency, highest TPS at c=1–64 |
| Batch processing, max raw throughput | vLLM or SGLang | Continuous batching wins at c=100+ |
| Simplest deployment, any GPU | vLLM or SGLang | No engine compilation step |
| Single-digit concurrency, lowest TTFT | TRT-LLM | 22 ms vs 25–33 ms for others |

The Streamlit Dashboard

Every run writes results to Parquet and CSV under data/. The Streamlit dashboard surfaces five views: Overview KPIs, TPS/TTFT vs concurrency line charts, a TPS vs TTFT tradeoff scatter (bubble size = concurrency), quality metrics per dataset, and a raw data table with CSV export. Running in demo mode (no GPU required):

quick start
# Demo mode (no GPU required)
python main.py --demo
streamlit run dashboard/app.py

# Against a live vLLM server
vllm serve Qwen/Qwen2.5-7B-Instruct \
  --host 0.0.0.0 --port 8000 \
  --attention-backend FLASHINFER

python main.py \
  --model Qwen/Qwen2.5-7B-Instruct \
  --base-url http://localhost:8000 \
  --concurrency 1,5,10,20,50 \
  --num-requests 50

What I'd Do Next

Python vLLM SGLang TensorRT-LLM Streamlit CUDA Qwen2.5 RTX 5090