Most tutorials skip the foundations and jump straight to YAML. That works until something breaks, and then you have no mental model to reason from. This post builds the full picture: what vLLM is and why it matters for LLM serving, what Kubernetes actually does and how its planes are structured, how AMD GPUs fit into that model, and finally the concrete steps to get a production-grade inference cluster running on bare metal. The reference implementation is Yu-amd/Kubernetes-MI300X on GitHub.
What is vLLM
vLLM is an open-source inference engine built specifically for large language models. The core problem it solves is GPU memory efficiency. Naive LLM serving allocates a fixed KV cache block for every request upfront, even if that request only uses a fraction of it. At any meaningful concurrency level, this wastes the majority of GPU memory and throttles throughput.
vLLM's answer is PagedAttention: it manages the KV cache the same way an OS manages RAM, dividing it into fixed-size pages and allocating them on demand as tokens are generated. Pages belonging to different requests can be interleaved in GPU memory, and pages can be shared across requests that have the same prompt prefix. The result is dramatically higher utilisation and, as a direct consequence, much higher throughput at the same hardware budget.
Beyond PagedAttention, vLLM provides continuous batching (new requests join the batch as slots free up, rather than waiting for a whole batch to finish), tensor parallelism for multi-GPU serving, and a built-in Prometheus metrics endpoint. For AMD Instinct GPUs specifically, vLLM uses the ROCm software stack and AITER kernels that are tuned for AMD's HBM3 memory architecture.
vLLM starts an OpenAI-compatible HTTP server. Any code that calls POST /v1/chat/completions against OpenAI can point at vLLM with a one-line URL change. No client code modifications required.
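Because the surface is identical, existing clients can usually be redirected without touching code at all. A minimal sketch, assuming a recent official OpenAI SDK (which honours these environment variables) and the service address used later in this post:

# Point any OpenAI SDK client at the cluster instead of api.openai.com
export OPENAI_BASE_URL="http://192.168.1.240:8080/v1"
export OPENAI_API_KEY="EMPTY"   # vLLM ignores the key unless started with --api-key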
What is Kubernetes
Kubernetes is a container orchestration system. It takes a cluster of machines and presents them as a single pool of compute resources. You declare what you want to run (a Deployment, a Service, a Job) and Kubernetes figures out where to run it, restarts it if it crashes, and scales it up or down as load changes. The operative word is declarative: you describe the desired state and the system continuously works to reconcile reality with that description.
For LLM inference this matters because a single vLLM process running on one machine is not fault-tolerant and cannot scale. Kubernetes wraps it in a layer that handles restarts, rolling updates, health checks, load balancing, and resource scheduling automatically.
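Reconciliation is easy to see in action once the cluster built later in this post is running: kill the inference pod and the Deployment controller replaces it. A minimal sketch, assuming the app=vllm-inference label used in the Deployment further down:

# Delete the pod out from under its Deployment
kubectl delete pod -l app=vllm-inference

# Watch the controller recreate it to satisfy replicas: 1
kubectl get pods -w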
The Kubernetes Planes
Control Plane
The control plane is the brain of the cluster. It makes all scheduling and state-management decisions. It does not run your workloads. In multi-node clusters its components run on dedicated control-plane nodes; the single-node setup later in this post removes the control-plane taint so the same machine can also run workloads.
| Component | What it does |
|---|---|
| kube-apiserver | The front door. Every kubectl command and every controller talks to the API server. It validates and stores all cluster state. |
| etcd | Distributed key-value store. The single source of truth for all cluster state. If etcd is healthy, the cluster can recover from anything. |
| kube-scheduler | Watches for unscheduled pods and assigns them to nodes. For GPU workloads it evaluates amd.com/gpu resource availability. |
| kube-controller-manager | Runs a collection of controllers (Deployment, ReplicaSet, Node, etc.) that drive current state toward desired state. |
| cloud-controller-manager | Integrates with cloud provider APIs. Absent on bare metal, where MetalLB fills the LoadBalancer-service role it would otherwise delegate to the cloud. |
Data Plane (Worker Nodes)
The data plane is where workloads actually run. Each worker node runs three agents that take instructions from the control plane and execute them locally.
| Component | What it does |
|---|---|
| kubelet | The node agent. Watches the API server for pods assigned to its node, then tells the container runtime to start or stop them. Reports node status and pod health back up. |
| kube-proxy | Manages network rules (iptables or IPVS) that implement Service load balancing within the cluster. |
| Container runtime | The engine that actually runs containers. containerd is standard. It pulls images, creates namespaces, and manages the container lifecycle. |
Add-ons
Beyond the core planes, two add-ons are essential for GPU inference on bare metal. CNI (Container Network Interface) provides pod-to-pod networking across nodes. Calico is the standard choice for bare-metal clusters. MetalLB is the bare-metal equivalent of a cloud LoadBalancer: it assigns real IP addresses from a configured pool to Services of type LoadBalancer, making the vLLM API reachable from outside the cluster.
┌──────────────────── Control Plane ─────────────────────┐
│ kube-apiserver · etcd · scheduler · controller-mgr │
└────────────────────────────────────────────────────────┘
↕ (watch / reconcile loop)
┌──────────── Worker Node ───────────────────────────────┐
│ kubelet · kube-proxy · containerd │
│ ┌──────────────────────────────────────────────────┐ │
│ │ AMD GPU Operator (Device Plugin + Node Labeller)│ │
│ └──────────────────────────────────────────────────┘ │
│ ┌────────────────┐ ┌────────────────┐ │
│ │ vLLM Pod │ │ vLLM Pod │ │
│ │ amd.com/gpu:1 │ │ amd.com/gpu:1 │ │
│ └────────────────┘ └────────────────┘ │
│ ┌──────────────────────────────────────────────────┐ │
│ │ AMD Instinct MI300X (GPU 0 · GPU 1 · ...) │ │
│ └──────────────────────────────────────────────────┘ │
└────────────────────────────────────────────────────────┘
↕
┌──────────── MetalLB ───────────────────────────────────┐
│ LoadBalancer Service · IPAddressPool · L2Advertisement│
│ EXTERNAL-IP: 192.168.x.x:8080                         │
└────────────────────────────────────────────────────────┘
↕
Client (POST /v1/chat/completions)
How AMD GPUs Fit In
Standard Kubernetes knows nothing about GPUs. It schedules pods based on CPU and memory requests. To make amd.com/gpu a schedulable resource, you need the AMD GPU Operator, which installs two things on every GPU node:
- Device Plugin: advertises GPU count to the API server so the scheduler can place pods correctly
- Node Labeller: tags nodes with GPU model, VRAM size, and ROCm version so workloads can target specific hardware
Once the operator is healthy, a pod can request a GPU by adding amd.com/gpu: 1 to its resource limits. Kubernetes then ensures that pod lands on a node with a free GPU and that no other pod gets access to that GPU while the first one is running.
resources:
  requests:
    amd.com/gpu: 1      # extended resources: the request must equal the limit
    memory: "16Gi"
    cpu: "4"
  limits:
    amd.com/gpu: 1
    memory: "32Gi"
    cpu: "8"
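The Node Labeller's tags can also steer scheduling toward specific hardware via a nodeSelector. A hedged sketch: the exact label keys vary by operator version, so discover what your nodes actually carry before relying on them:

# Discover the labels the Node Labeller applied:
#   kubectl get nodes --show-labels | tr ',' '\n' | grep amd.com
# Pod spec fragment (label key and value below are illustrative):
nodeSelector:
  amd.com/gpu.family: AI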
Setting Up the Cluster
The reference implementation at Yu-amd/Kubernetes-MI300X breaks the setup into three sequential scripts. Each one is idempotent: re-running it skips already-installed components.
Step 1: Install Kubernetes
The first script installs vanilla Kubernetes 1.28+ on Ubuntu/Debian. It configures containerd as the container runtime, sets up Calico CNI for pod networking, removes the control-plane taint so a single-node cluster can also run workloads, disables swap (a hard Kubernetes requirement), and applies the kernel settings Kubernetes requires (IP forwarding and bridge netfilter).
# Before running, verify system readiness
sudo ./check-system-enhanced.sh

# Install Kubernetes 1.28+ with containerd + Calico CNI
sudo ./install-kubernetes.sh

# Verify cluster is healthy
kubectl get nodes
# NAME       STATUS   ROLES           AGE   VERSION
# mi300x-1   Ready    control-plane   2m    v1.28.0
Step 2: Install AMD GPU Operator
The second script installs Helm, deploys cert-manager (a prerequisite for the operator's webhooks), then installs the AMD GPU Operator itself. It also sets up a 50 GB PersistentVolume at /mnt/data/ai-models using a local storage class, which is where model weights will live.
./install-amd-gpu-operator.sh

# Confirm GPU resources are visible to the scheduler
kubectl get nodes -o json | grep "amd.com/gpu"
# "amd.com/gpu": "8"   ← MI300X exposes 8 logical GPUs

# Check operator pods are running
kubectl get pods -n kube-amd-gpu
The DeviceConfig custom resource tells the operator how to configure the device plugin. For most setups the empty spec is correct: it uses all detected GPUs with default settings.
apiVersion: amd.com/v1alpha1
kind: DeviceConfig
metadata:
  name: amd-gpu-device-config
  namespace: kube-amd-gpu
spec: {}   # defaults: all GPUs, ROCm auto-detected
Step 3: Deploy vLLM with MetalLB
The third script does three things in sequence: installs MetalLB v0.14, configures an IP pool from the node's subnet, then deploys the vLLM inference server. The script auto-detects the node IP and carves out a small pool (e.g. 192.168.x.240 to .250) for LoadBalancer services.
kubectl apply -f https://raw.githubusercontent.com/metallb/metallb/v0.14.3/config/manifests/metallb-native.yaml

# Wait for MetalLB pods to be ready
kubectl wait --namespace metallb-system \
  --for=condition=ready pod \
  --selector=app=metallb \
  --timeout=90s
apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: default-pool
  namespace: metallb-system
spec:
  addresses:
  - 192.168.1.240-192.168.1.250   # adjust to your subnet
---
apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: default-advertisement
  namespace: metallb-system
spec:
  ipAddressPools:
  - default-pool
The vLLM deployment requests exactly one GPU, mounts the model PVC at /models, and includes readiness and liveness probes against the /health endpoint that vLLM exposes once the model is loaded.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-inference
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm-inference
  template:
    metadata:
      labels:
        app: vllm-inference
    spec:
      containers:
      - name: vllm-container
        image: rocm/vllm:latest
        command: ["python", "-m", "vllm.entrypoints.openai.api_server"]
        args:
        - "--model"
        - "Qwen/Qwen2.5-1.5B-Instruct"
        - "--host"
        - "0.0.0.0"
        - "--port"
        - "8080"
        - "--download-dir"
        - "/models"
        - "--tensor-parallel-size"
        - "1"
        resources:
          requests:
            amd.com/gpu: 1
            memory: "8Gi"
          limits:
            amd.com/gpu: 1
            memory: "16Gi"
        volumeMounts:
        - name: model-storage
          mountPath: /models
        readinessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 30
          periodSeconds: 10
        livenessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 120   # model load can take minutes; avoid premature kills
          periodSeconds: 30
      volumes:
      - name: model-storage
        persistentVolumeClaim:
          claimName: ai-models-pvc
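The troubleshooting section below refers to vllm-service; the deploy script creates it alongside the Deployment. A minimal sketch consistent with the app label above:

apiVersion: v1
kind: Service
metadata:
  name: vllm-service
spec:
  type: LoadBalancer          # MetalLB assigns the external IP
  selector:
    app: vllm-inference       # must match the pod labels above
  ports:
  - port: 8080
    targetPort: 8080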
apiVersion: v1
kind: PersistentVolume
metadata:
  name: ai-models-pv
spec:
  capacity:
    storage: 50Gi
  accessModes: [ReadWriteOnce]
  persistentVolumeReclaimPolicy: Retain
  storageClassName: local-storage
  hostPath:
    path: /mnt/data/ai-models   # pre-download weights here
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ai-models-pvc
spec:
  accessModes: [ReadWriteOnce]
  resources:
    requests:
      storage: 50Gi
  storageClassName: local-storage
Once the pod is running and the service has an external IP, test the endpoint:
# Get the MetalLB-assigned IP
kubectl get svc vllm-service
# vllm-service   LoadBalancer   10.96.x.x   192.168.1.240   8080

# Call the OpenAI-compatible endpoint
curl http://192.168.1.240:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-1.5B-Instruct",
    "messages": [{"role":"user","content":"Hello"}]
  }'
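A successful call returns the standard OpenAI chat-completion shape. Abridged sketch; the id, content, and token counts here are illustrative:

{
  "id": "chatcmpl-...",
  "object": "chat.completion",
  "model": "Qwen/Qwen2.5-1.5B-Instruct",
  "choices": [
    {
      "index": 0,
      "message": {"role": "assistant", "content": "Hello! How can I help you today?"},
      "finish_reason": "stop"
    }
  ],
  "usage": {"prompt_tokens": 9, "completion_tokens": 9, "total_tokens": 18}
}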
Troubleshooting: The Four Failure Modes
Every failure in this stack maps to one of four layers. Identifying the layer first saves hours of random debugging.
| Layer | Component | Symptom |
|---|---|---|
| GPU | AMD GPU Operator | Pod stays Pending |
| Networking | MetalLB | Service unreachable / connection refused |
| Storage | PVC / volume mount | Model loading error / file not found |
| Runtime | vLLM process | OOMKilled / CrashLoopBackOff |
1. Pod Pending: insufficient amd.com/gpu
Kubernetes cannot schedule the pod because no node has free GPU resources. Either the AMD GPU Operator is not installed, the device plugin is unhealthy, or all GPUs are already allocated.
Events:
  Warning  FailedScheduling  ...  0/3 nodes are available:
                                  3 Insufficient amd.com/gpu.
# Check what GPU resources nodes actually advertise
kubectl get nodes -o json | grep "amd.com/gpu"
# Healthy: "amd.com/gpu": "8"
# Missing: GPU Operator device plugin not running

# Check operator health
kubectl get pods -n kube-amd-gpu
kubectl describe pod <operator-pod> -n kube-amd-gpu
2. Service Unreachable: Connection Refused
The pod is running but the API is not reachable. The first question is whether MetalLB assigned an external IP. If EXTERNAL-IP shows <pending>, the IPAddressPool is missing or does not overlap with the node network, or there is no L2Advertisement.
kubectl get svc
# EXTERNAL-IP should show 192.168.x.x, not <pending>

kubectl get endpoints vllm-service
# Must show pod IPs. If empty, selector does not match pod labels.

kubectl describe svc vllm-service
# Check Events section for MetalLB allocation messages
3. Model Loading Errors: File Not Found
vLLM acquires the GPU, then crashes looking for weights. This is almost always a volume mount mismatch: the --download-dir argument does not match the mountPath, or the PVC is stuck Pending (Pending with WaitForFirstConsumer is normal before the pod is scheduled; remaining unbound after the pod starts indicates a storage problem).
kubectl get pvc
# STATUS must be Bound. Pending before pod = ok. Pending after = problem.

kubectl describe pod <pod-name>
# Look at: Mounts section (paths), Events (volume errors)
4. Memory Issues: OOMKilled / CrashLoopBackOff
The pod starts and loads the model, then dies under load. Two memory budgets are in play. Exceeding the pod's host-memory limit gets the process OOMKilled by the kernel; exhausting GPU HBM, where vLLM preallocates the KV cache at startup based on --max-model-len and --max-num-seqs, crashes the process with a GPU out-of-memory error and drives CrashLoopBackOff.
kubectl describe pod <pod-name>
# Look for: Reason: OOMKilled under Last State

# Fixes in order of preference:
# 1. Raise memory limits in the Deployment spec
# 2. --max-model-len 4096   (cap context length below the model default)
#    --max-num-seqs 64      (reduce max concurrent sequences)
# 3. Use quantised weights (GPTQ/AWQ) to shrink the weight footprint,
#    leaving more HBM for the KV cache
Production Patterns
Horizontal Pod Autoscaling
CPU-based HPA is useless for LLM serving: the CPU sits idle while the GPU is saturated. You need custom metrics. vLLM's built-in /metrics endpoint exposes queue depth, TTFT percentiles, and KV cache utilisation as Prometheus metrics. Feed those through a custom metrics API adapter such as prometheus-adapter and the HPA can scale on signals that actually reflect inference pressure.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: vllm-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm-inference
  minReplicas: 1
  maxReplicas: 8
  metrics:
  - type: Pods
    pods:
      metric:
        name: vllm_request_queue_depth
      target:
        type: AverageValue
        averageValue: "10"
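The adapter is what maps the Prometheus series name onto the metric name the HPA references above. A minimal prometheus-adapter rule sketch, assuming the series carries namespace and pod labels (the series name itself appears in the observability section below):

rules:
- seriesQuery: 'vllm:request_queue_depth'
  resources:
    overrides:
      namespace: {resource: "namespace"}
      pod: {resource: "pod"}
  name:
    matches: "vllm:request_queue_depth"
    as: "vllm_request_queue_depth"    # the name the HPA asks for
  metricsQuery: avg(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)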
GPU Sharing
A single AMD Instinct MI300X carries 192 GB of HBM3. Running one large model on it is efficient, but smaller models leave most of that capacity idle. AMD ROCm supports multi-process sharing: multiple vLLM instances can timeshare one GPU with isolated memory spaces. Unlike NVIDIA MIG, AMD does not expose hardware-level GPU partitioning the same way, but ROCm compute partitioning modes provide comparable isolation for most serving workloads.
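Partition modes are configured on the host, not through Kubernetes. A hedged sketch using rocm-smi; flag names have shifted across ROCm releases, so treat these as illustrative and confirm against rocm-smi --help on your version:

# Read the current compute partition mode (SPX = one logical GPU)
sudo rocm-smi --showcomputepartition

# Switch to CPX to expose multiple compute partitions per MI300X
sudo rocm-smi --setcomputepartition CPX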
Observability Stack
Instrument three layers: GPU hardware via the ROCm SMI Exporter, vLLM serving metrics via its built-in Prometheus endpoint, and Kubernetes pod-level metrics via cAdvisor. Prometheus collects all three. Grafana visualises them.
# vLLM (port 8080, path /metrics)
vllm:time_to_first_token_seconds_p99   # user-visible latency
vllm:e2e_request_latency_seconds_p99   # full round-trip
vllm:request_queue_depth               # HPA trigger signal
vllm:gpu_cache_usage_perc              # scale-out indicator

# ROCm SMI Exporter
rocm_gpu_utilization                   # compute %
rocm_gpu_memory_used_bytes             # HBM consumption
rocm_gpu_temperature_celsius           # thermal headroom
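A minimal Prometheus scrape sketch tying these together. The vLLM target reuses the MetalLB address from earlier (production setups should prefer Kubernetes service discovery), and the exporter port is illustrative:

scrape_configs:
- job_name: "vllm"
  metrics_path: /metrics              # vLLM serves metrics on its API port
  static_configs:
  - targets: ["192.168.1.240:8080"]

- job_name: "rocm-smi-exporter"
  static_configs:
  - targets: ["<node-ip>:9400"]       # adjust to your exporter's port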
GitOps with Argo CD
All manifests (Deployment, Service, PVC, MetalLB config, HPA) should live in a Git repository. Argo CD watches that repo and continuously reconciles the cluster to match it. This gives you instant rollback (revert the commit), reproducible environments, and a complete audit trail for every change to the inference stack.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: vllm-inference
  namespace: argocd
spec:
  source:
    repoURL: https://github.com/Yu-amd/Kubernetes-MI300X
    targetRevision: main
    path: yaml-configs
  destination:
    server: https://kubernetes.default.svc
    namespace: default
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
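Day-to-day changes then flow through Git rather than kubectl. A sketch of the rollback path described above, with the commit hash as a placeholder:

# Revert the offending commit; Argo CD reconciles the cluster automatically
git revert <bad-commit-sha>
git push

# Optional: inspect sync status from the CLI
argocd app get vllm-inference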
Security and Multi-tenancy
When multiple teams share a GPU cluster, three primitives prevent interference. RBAC restricts which service accounts can schedule GPU pods. NetworkPolicy limits which namespaces can reach the inference endpoints. ResourceQuota caps how many GPUs each team can consume, preventing one workload from starving others.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-quota
  namespace: team-a
spec:
  hard:
    requests.amd.com/gpu: "4"
    limits.amd.com/gpu: "4"
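The quota above covers the third primitive; hedged sketches of the other two follow, with names, namespaces, and label selectors as placeholders:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-team-a-ingress
  namespace: default
spec:
  podSelector:
    matchLabels:
      app: vllm-inference        # the inference pods deployed earlier
  policyTypes: [Ingress]
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: team-a   # only team-a may reach the endpoint
    ports:
    - protocol: TCP
      port: 8080
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: gpu-workload-deployer
  namespace: team-a
rules:
- apiGroups: ["apps"]
  resources: ["deployments"]
  verbs: ["get", "list", "create", "update"]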
Putting It Together
| Step | What it solves | When to add it |
|---|---|---|
| check-system-enhanced.sh | Catch OS / disk / lock issues before install | Before anything |
| install-kubernetes.sh | containerd + Calico + single-node cluster | Day one |
| install-amd-gpu-operator.sh | amd.com/gpu resource + model PV | Day one |
| MetalLB + IPAddressPool | External IP on bare metal | Before external traffic |
| deploy-vllm-inference.sh | vLLM pod + LoadBalancer service | After GPU operator healthy |
| Prometheus + Grafana | Latency / cache / GPU visibility | Once stable in dev |
| Custom HPA | Request-driven autoscaling | When load is variable |
| Argo CD | Reproducible GitOps deployments | When team grows beyond one |
| RBAC + ResourceQuota | Multi-tenant GPU isolation | When multiple teams share cluster |
Full scripts and YAML configs at github.com/Yu-amd/Kubernetes-MI300X. The repo includes a system check script that auto-resolves common Ubuntu package lock issues before any installation begins.