
BENCHMARKING SMALL LLMs ON 4 GB VRAM

Qwen2.5:3B vs Gemma2:2B vs Llama3.2 — measured on TTFT, tokens/sec, and quality across 20 prompts in 6 categories on an RTX 3050 Laptop. The fastest model isn't the best, and the best isn't even close on speed.

Mar 2025 12 min read SLM · Ollama · Streamlit
View on GitHub →

Running large language models locally is still mostly a hobbyist luxury — 16 GB+ of VRAM, high-end desktop GPUs, noisy fans. But small language models (2–4B parameters) change the equation entirely. They fit in 4 GB of VRAM, run on a laptop, and for many everyday tasks they're good enough. The question I wanted to answer: which one is actually best, and on what tasks?

Hardware & Setup

All benchmarks ran on an RTX 3050 Laptop GPU (4 GB VRAM) with 16 GB DDR5 RAM. Ollama version 0.3.x managed model loading, context window, and inference. The benchmark harness called Ollama's REST API and measured wall-clock timings directly — no driver-level profiling.

The Three Models

Qwen2.5 (3B params)
Gemma2 (2B params)
Llama3.2 (3B params)

All three were chosen because they fit in 4 GB VRAM with a comfortable context window. Qwen2.5 is Alibaba's latest small model; Gemma2 is Google DeepMind's; Llama3.2 is Meta's. Three different training recipes, three different architecture choices.

The 6 Prompt Categories

Each model was evaluated on 20 prompts spread across 6 categories — chosen to cover the real use-cases where you'd actually reach for a local model: factual recall, reasoning, maths, coding, instruction-following, and summarisation.

The 5 Scoring Strategies

No single scoring method works for all categories. The harness uses a category-aware scorer that picks the right strategy automatically:

Scorer | How it works | Used for
Keyword overlap | Checks presence of expected key terms in output | Factual, Reasoning
Exact match | String equality after normalisation | Maths, short Factual
Code fence | Parses fenced code blocks, validates syntax | Coding
Structural | Word count, line count, list formatting | Instruction-following
ROUGE-1 | Unigram overlap between output and reference | Summarisation
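The dispatch idea in the table can be sketched in a few lines. These function names and the exact scoring formulas are my reconstruction, not the harness's actual API — in particular I've implemented ROUGE-1 as a unigram F1, which is the common convention:

```python
import re
from collections import Counter

def keyword_overlap(output: str, expected_terms: list[str]) -> float:
    # Fraction of expected key terms that appear in the output (case-insensitive).
    text = output.lower()
    hits = sum(1 for term in expected_terms if term.lower() in text)
    return hits / len(expected_terms)

def exact_match(output: str, reference: str) -> float:
    # String equality after normalisation (case, collapsed whitespace).
    norm = lambda s: re.sub(r"\s+", " ", s.strip().lower())
    return 1.0 if norm(output) == norm(reference) else 0.0

def rouge_1(output: str, reference: str) -> float:
    # Unigram-overlap F1 between output and reference.
    out, ref = Counter(output.lower().split()), Counter(reference.lower().split())
    overlap = sum((out & ref).values())
    if overlap == 0:
        return 0.0
    p, r = overlap / sum(out.values()), overlap / sum(ref.values())
    return 2 * p * r / (p + r)

# Category-aware dispatch: the scorer is picked from the prompt's category.
SCORERS = {
    "factual": keyword_overlap,
    "reasoning": keyword_overlap,
    "maths": exact_match,
    "summarisation": rouge_1,
}

def score(category: str, output: str, reference) -> float:
    return SCORERS[category](output, reference)
```

The coding and structural scorers are omitted here; they follow the same pattern but need a syntax checker and format rules rather than a reference string.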

Measuring TTFT & Throughput

Time to First Token (TTFT) matters as much as total speed for interactive use. A model that generates 25 tokens/sec but takes 4 seconds to start feels much slower than one that starts in 300 ms and generates 18 tokens/sec. The harness measures both separately:

benchmark harness (simplified)
import httpx, json, time

async def measure(model: str, prompt: str) -> dict:
    t0 = time.perf_counter()
    first_token_time = None
    tokens = 0

    async with httpx.AsyncClient(timeout=None) as client:
        async with client.stream("POST", "http://localhost:11434/api/generate",
            json={"model": model, "prompt": prompt, "stream": True}) as r:

            async for line in r.aiter_lines():
                if not line:
                    continue  # skip blank keep-alive lines
                chunk = json.loads(line)
                if not chunk.get("response"):
                    continue  # the final "done" chunk carries no text
                if first_token_time is None:
                    first_token_time = time.perf_counter() - t0
                tokens += 1  # one streamed chunk ≈ one token

    total = time.perf_counter() - t0
    return {
        "ttft": first_token_time,
        "throughput": tokens / total,  # note: includes prefill time
        "total_s": total
    }
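To make the "feels slower" claim from the intro paragraph concrete, here is the arithmetic as a tiny helper. The numbers are the hypothetical ones from the text above, not measurements:

```python
def time_to_n_tokens(ttft_s: float, tokens_per_sec: float, n: int) -> float:
    # Wall-clock time until the n-th token arrives: startup plus decode.
    return ttft_s + n / tokens_per_sec

# The two hypothetical models from the text, for a 100-token reply:
slow_start = time_to_n_tokens(4.0, 25, 100)   # 4 s TTFT, 25 tok/s  -> 8.0 s
fast_start = time_to_n_tokens(0.3, 18, 100)   # 300 ms TTFT, 18 tok/s -> ~5.9 s
```

The nominally slower model delivers the full reply two seconds earlier, and its first words appear almost immediately — which is most of what "responsive" means in a chat UI.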

Results

Model | Avg quality score | Avg TTFT (ms) | Avg tokens/sec
Qwen2.5:3B | 0.74 ★ | 410 | 18.2
Gemma2:2B | 0.61 | 290 ★ | 26.8 ★
Llama3.2 | 0.67 | 380 | 21.4

Qwen2.5:3B ranked #1 on quality — particularly dominant on coding (+18 pts over Gemma2) and reasoning (+12 pts). It also handled instruction-following better than either competitor. Gemma2:2B is the speed champion — 29% lower TTFT than Qwen2.5 and 47% higher throughput. For latency-sensitive interactive use, Gemma2 wins. Llama3.2 is the middle ground — decent on both axes, best at nothing.
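Those relative gaps fall straight out of the results table; a quick sanity check on the arithmetic:

```python
def pct_higher(a: float, b: float) -> float:
    # How much higher a is than b, in percent.
    return (a / b - 1) * 100

def pct_lower(a: float, b: float) -> float:
    # How much lower a is than b, in percent.
    return (1 - a / b) * 100

# Gemma2:2B vs Qwen2.5:3B, using the averages from the table:
throughput_gain = pct_higher(26.8, 18.2)  # ~47% higher tokens/sec
ttft_gain = pct_lower(290, 410)           # ~29% lower TTFT
```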

[Chart: overall quality score by model]
[Chart: average throughput by model (tokens/sec)]

Category Breakdown: Where Each Model Wins

[Radar chart: per-category quality profile for each model]
[Heatmap: quality score per prompt]
[Distribution: time-to-first-token by model (lower is better)]

The Streamlit Dashboard

Raw numbers are hard to reason about across 6 categories and 3 models. The dashboard makes the data navigable through four views:

What I'd Improve

Python · Ollama · Streamlit · FastAPI · ROUGE Scoring · Qwen2.5 · Gemma2 · Llama3.2