Kubernetes · AMD · ROCm

Kubernetes on AMD Instinct GPUs

From first principles to production: what Kubernetes actually is, how its planes work, why AMD Instinct GPUs need special treatment, and how to deploy vLLM on bare metal without cloud abstractions getting in the way.

Apr 2026 · 20 min read · Chinmay Hebbal

Most tutorials skip the foundations and jump straight to YAML. That works until something breaks, and then you have no mental model to reason from. This post builds the full picture: what vLLM is and why it matters for LLM serving, what Kubernetes actually does and how its planes are structured, how AMD GPUs fit into that model, and finally the concrete steps to get a production-grade inference cluster running on bare metal. The reference implementation is Yu-amd/Kubernetes-MI300X on GitHub.

What is vLLM

vLLM is an open-source inference engine built specifically for large language models. The core problem it solves is GPU memory efficiency. Naive LLM serving allocates a fixed KV cache block for every request upfront, even if that request only uses a fraction of it. At any meaningful concurrency level, this wastes the majority of GPU memory and throttles throughput.

vLLM's answer is PagedAttention: it manages the KV cache the same way an OS manages RAM, dividing it into fixed-size pages and allocating them on demand as tokens are generated. Pages belonging to different requests can be interleaved in GPU memory, and pages can be shared across requests that have the same prompt prefix. The result is dramatically higher utilisation and, as a direct consequence, much higher throughput at the same hardware budget.
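The mechanism can be sketched with a toy allocator (the page size, class, and names below are invented for illustration; vLLM's real block manager is considerably more involved):

```python
# Toy sketch of paged KV-cache allocation. Instead of reserving a max-length
# slab per request upfront, pages are handed out one at a time as tokens
# are generated, so memory in use tracks tokens actually produced.

PAGE_SIZE = 16  # tokens per page (vLLM calls these "blocks")

class PagedKVCache:
    def __init__(self, total_pages: int):
        self.free_pages = list(range(total_pages))
        self.page_table = {}   # request id -> list of page ids
        self.tokens = {}       # request id -> tokens generated so far

    def append_token(self, req_id: str) -> None:
        n = self.tokens.get(req_id, 0)
        if n % PAGE_SIZE == 0:  # current page is full (or this is token 0)
            if not self.free_pages:
                raise MemoryError("KV cache exhausted")
            self.page_table.setdefault(req_id, []).append(self.free_pages.pop())
        self.tokens[req_id] = n + 1

    def release(self, req_id: str) -> None:
        # Pages return to the free pool the moment a request finishes.
        self.free_pages.extend(self.page_table.pop(req_id, []))
        self.tokens.pop(req_id, None)

# A request holding 17 tokens occupies exactly 2 pages, not a whole
# preallocated max-length slab.
cache = PagedKVCache(total_pages=8)
for _ in range(17):
    cache.append_token("req-a")
assert len(cache.page_table["req-a"]) == 2
```

The pages of different requests interleave freely in the pool, which is what lets utilisation stay high at any mix of sequence lengths.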

24x throughput vs HuggingFace Transformers
OpenAI-compatible API (/v1/chat/completions)
ROCm-native AMD GPU support

Beyond PagedAttention, vLLM provides continuous batching (new requests join the batch as slots free up, rather than waiting for a whole batch to finish), tensor parallelism for multi-GPU serving, and a built-in Prometheus metrics endpoint. For AMD Instinct GPUs specifically, vLLM uses the ROCm software stack and AITER kernels that are tuned for AMD's HBM3 memory architecture.
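Continuous batching reduces to a loop that admits waiting requests the moment a slot frees up; a simplified sketch (fixed decode lengths per request are an assumption, real scheduling is token-by-token and preemptable):

```python
from collections import deque

# Toy continuous-batching loop. Each request needs a fixed number of decode
# steps; new requests join as soon as a running sequence finishes, rather
# than waiting for the whole batch to drain.

def serve(requests, batch_size):
    waiting = deque(requests)          # (req_id, steps_remaining)
    running, finished_at = [], {}
    step = 0
    while waiting or running:
        # Admit new requests into any free slots, no batch boundary.
        while waiting and len(running) < batch_size:
            running.append(list(waiting.popleft()))
        step += 1
        for seq in running:
            seq[1] -= 1                # one decode step for every running seq
        for seq in [s for s in running if s[1] == 0]:
            finished_at[seq[0]] = step
            running.remove(seq)
    return finished_at

# "short2" joins mid-flight as soon as "short" frees its slot; it does not
# wait for "long" to finish.
done = serve([("long", 10), ("short", 2), ("short2", 2)], batch_size=2)
# → {'short': 2, 'short2': 4, 'long': 10}
```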


vLLM starts an OpenAI-compatible HTTP server. Any code that calls POST /v1/chat/completions against OpenAI can point at vLLM with a one-line URL change. No client code modifications required.
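As a standard-library sketch (the IP below is a hypothetical MetalLB assignment), the request is byte-for-byte an OpenAI chat completion; only the base URL differs:

```python
import json
import urllib.request

# The one-line change: point the base URL at the cluster instead of OpenAI.
BASE_URL = "http://192.168.1.240:8080/v1"   # was: https://api.openai.com/v1

def chat_request(messages, model="Qwen/Qwen2.5-1.5B-Instruct"):
    """Build a /v1/chat/completions request in the OpenAI wire format."""
    body = json.dumps({"model": model, "messages": messages}).encode()
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )

req = chat_request([{"role": "user", "content": "Hello"}])
# urllib.request.urlopen(req) would send it against the running cluster;
# nothing else about the client changes.
```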

What is Kubernetes

Kubernetes is a container orchestration system. It takes a cluster of machines and presents them as a single pool of compute resources. You declare what you want to run (a Deployment, a Service, a Job) and Kubernetes figures out where to run it, restarts it if it crashes, and scales it up or down as load changes. The operative word is declarative: you describe the desired state and the system continuously works to reconcile reality with that description.
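The reconcile loop at the heart of this model can be sketched in a few lines (a toy version; real controllers watch the API server rather than comparing dicts):

```python
# Toy reconciliation loop: compute the actions that drive actual state
# toward desired state. Kubernetes controllers run this continuously.

def reconcile(desired: dict, actual: dict) -> list[str]:
    """Return the actions needed to make `actual` match `desired` replica counts."""
    actions = []
    for name, want in desired.items():
        have = actual.get(name, 0)
        if have < want:
            actions += [f"start {name}"] * (want - have)
        elif have > want:
            actions += [f"stop {name}"] * (have - want)
    for name in actual:
        if name not in desired:            # no longer declared: tear down
            actions += [f"stop {name}"] * actual[name]
    return actions

# Declared: 3 vLLM replicas. Observed: 1 (say two crashed). The loop
# computes the delta and the system converges back to the declaration.
assert reconcile({"vllm": 3}, {"vllm": 1}) == ["start vllm", "start vllm"]
```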

For LLM inference this matters because a single vLLM process running on one machine is not fault-tolerant and cannot scale. Kubernetes wraps it in a layer that handles restarts, rolling updates, health checks, load balancing, and resource scheduling automatically.

The Kubernetes Planes

Control Plane

The control plane is the brain of the cluster. It makes all scheduling and state-management decisions. It does not run your workloads. Its components typically run on dedicated control-plane nodes, although the single-node install later in this post removes the taint that enforces that separation.

Component | What it does
kube-apiserver | The front door. Every kubectl command and every controller talks to the API server, which validates and stores all cluster state.
etcd | Distributed key-value store and the single source of truth for cluster state. As long as etcd's data is intact, the rest of the control plane can be rebuilt.
kube-scheduler | Watches for unscheduled pods and assigns them to nodes. For GPU workloads it evaluates amd.com/gpu resource availability.
kube-controller-manager | Runs a collection of controllers (Deployment, ReplicaSet, Node, etc.) that drive current state toward desired state.
cloud-controller-manager | Integrates with cloud provider APIs. Absent on bare metal, where MetalLB fills the LoadBalancer role instead.

Data Plane (Worker Nodes)

The data plane is where workloads actually run. Each worker node runs three agents that take instructions from the control plane and execute them locally.

Component | What it does
kubelet | The node agent. Watches the API server for pods assigned to its node, then tells the container runtime to start or stop them. Reports node status and pod health back up.
kube-proxy | Manages the network rules (iptables or IPVS) that implement Service load balancing within the cluster.
Container runtime | The engine that actually runs containers; containerd is the standard choice. It pulls images, creates namespaces, and manages the container lifecycle.

Add-ons

Beyond the core planes, two add-ons are essential for GPU inference on bare metal. CNI (Container Network Interface) provides pod-to-pod networking across nodes. Calico is the standard choice for bare-metal clusters. MetalLB is the bare-metal equivalent of a cloud LoadBalancer: it assigns real IP addresses from a configured pool to Services of type LoadBalancer, making the vLLM API reachable from outside the cluster.

cluster topology
┌──────────────────── Control Plane ─────────────────────┐
│  kube-apiserver · etcd · scheduler · controller-mgr    │
└────────────────────────────────────────────────────────┘
              ↕  (watch / reconcile loop)
┌──────────── Worker Node ───────────────────────────────┐
│  kubelet · kube-proxy · containerd                     │
│  ┌──────────────────────────────────────────────────┐  │
│  │  AMD GPU Operator (Device Plugin + Node Labeller)│  │
│  └──────────────────────────────────────────────────┘  │
│  ┌────────────────┐  ┌────────────────┐                │
│  │  vLLM Pod      │  │  vLLM Pod      │                │
│  │  amd.com/gpu:1 │  │  amd.com/gpu:1 │                │
│  └────────────────┘  └────────────────┘                │
│  ┌──────────────────────────────────────────────────┐  │
│  │  AMD Instinct MI300X  (GPU 0 · GPU 1 · ...)      │  │
│  └──────────────────────────────────────────────────┘  │
└────────────────────────────────────────────────────────┘
              ↕
┌──────────── MetalLB ───────────────────────────────────┐
│  LoadBalancer Service · IPAddressPool · L2Advertisement│
│  EXTERNAL-IP: 192.168.x.x:8080                         │
└────────────────────────────────────────────────────────┘
              ↕
           Client (POST /v1/chat/completions)

How AMD GPUs Fit In

Standard Kubernetes knows nothing about GPUs. It schedules pods based on CPU and memory requests. To make amd.com/gpu a schedulable resource, you need the AMD GPU Operator, which installs two things on every GPU node: a device plugin that discovers the GPUs and advertises them to the kubelet as amd.com/gpu resources, and a node labeller that tags each node with its GPU hardware properties so workloads can target it.

Once the operator is healthy, a pod can request a GPU by adding amd.com/gpu: 1 to its resource limits. Kubernetes then ensures that pod lands on a node with a free GPU and that no other pod gets access to that GPU while the first one is running.

gpu resource in pod spec
resources:
  requests:
    amd.com/gpu: 1
    memory: "16Gi"
    cpu: "4"
  limits:
    amd.com/gpu: 1
    memory: "32Gi"
    cpu: "8"
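Conceptually, the scheduler then filters nodes by free capacity in the extended resource; a minimal sketch with invented node data (the real scheduler runs many more predicates plus a scoring phase):

```python
# Minimal sketch of scheduler filtering for an extended resource such as
# amd.com/gpu: a node is feasible if allocatable minus already-allocated
# capacity covers the pod's request.

def feasible_nodes(nodes, resource, requested):
    """Names of nodes whose free capacity for `resource` covers the request."""
    return [
        name
        for name, info in nodes.items()
        if info["allocatable"].get(resource, 0)
           - info["allocated"].get(resource, 0) >= requested
    ]

nodes = {
    "mi300x-1": {"allocatable": {"amd.com/gpu": 8}, "allocated": {"amd.com/gpu": 8}},
    "mi300x-2": {"allocatable": {"amd.com/gpu": 8}, "allocated": {"amd.com/gpu": 3}},
    "cpu-only": {"allocatable": {}, "allocated": {}},
}
# Only the node with a free GPU qualifies; the CPU-only node never does.
assert feasible_nodes(nodes, "amd.com/gpu", 1) == ["mi300x-2"]
```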

Setting Up the Cluster

The reference implementation at Yu-amd/Kubernetes-MI300X breaks the setup into three sequential scripts. Each one is idempotent: re-running it skips already-installed components.

Step 1: Install Kubernetes

The first script installs vanilla Kubernetes 1.28+ on Ubuntu/Debian. It configures containerd as the container runtime, sets up Calico CNI for pod networking, removes the control-plane taint so a single-node cluster can also run workloads, disables swap (a hard Kubernetes requirement), and applies the necessary kernel settings.

install-kubernetes.sh
# Before running, verify system readiness
sudo ./check-system-enhanced.sh

# Install Kubernetes 1.28+ with containerd + Calico CNI
sudo ./install-kubernetes.sh

# Verify cluster is healthy
kubectl get nodes
# NAME       STATUS   ROLES           AGE   VERSION
# mi300x-1   Ready    control-plane   2m    v1.28.0

Step 2: Install AMD GPU Operator

The second script installs Helm, deploys cert-manager (a prerequisite for the operator's webhooks), then installs the AMD GPU Operator itself. It also sets up a 50 GB PersistentVolume at /mnt/data/ai-models using a local storage class, which is where model weights will live.

install-amd-gpu-operator.sh
./install-amd-gpu-operator.sh

# Confirm GPU resources are visible to the scheduler
kubectl get nodes -o json | grep "amd.com/gpu"
# "amd.com/gpu": "8"   ← an 8-GPU MI300X node advertises 8

# Check operator pods are running
kubectl get pods -n kube-amd-gpu

The DeviceConfig custom resource tells the operator how to configure the device plugin. For most setups the empty spec is correct: it uses all detected GPUs with default settings.

yaml-configs/device-config-cr.yaml
apiVersion: amd.com/v1alpha1
kind: DeviceConfig
metadata:
  name: amd-gpu-device-config
  namespace: kube-amd-gpu
spec: {}   # defaults: all GPUs, ROCm auto-detected

Step 3: Deploy vLLM with MetalLB

The third script does three things in sequence: installs MetalLB v0.14, configures an IP pool from the node's subnet, then deploys the vLLM inference server. The script auto-detects the node IP and carves out a small pool (e.g. 192.168.x.240 to .250) for LoadBalancer services.

install MetalLB
kubectl apply -f https://raw.githubusercontent.com/metallb/metallb/v0.14.3/config/manifests/metallb-native.yaml

# Wait for MetalLB pods to be ready
kubectl wait --namespace metallb-system \
  --for=condition=ready pod \
  --selector=app=metallb \
  --timeout=90s

metallb ip pool config
apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: default-pool
  namespace: metallb-system
spec:
  addresses:
    - 192.168.1.240-192.168.1.250   # adjust to your subnet
---
apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: default-advertisement
  namespace: metallb-system
spec:
  ipAddressPools:
    - default-pool

The vLLM deployment requests exactly one GPU, mounts the model PVC at /models, and includes readiness and liveness probes against the /health endpoint that vLLM exposes once the model is loaded.

vllm deployment yaml (abridged)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-inference
spec:
  replicas: 1
  template:
    spec:
      containers:
      - name: vllm-container
        image: rocm/vllm:latest
        command: ["python", "-m", "vllm.entrypoints.openai.api_server"]
        args:
          - "--model"
          - "Qwen/Qwen2.5-1.5B-Instruct"
          - "--host"
          - "0.0.0.0"
          - "--port"
          - "8080"
          - "--download-dir"
          - "/models"
          - "--tensor-parallel-size"
          - "1"
        resources:
          requests:
            amd.com/gpu: 1
            memory: "8Gi"
          limits:
            amd.com/gpu: 1
            memory: "16Gi"
        volumeMounts:
          - name: model-storage
            mountPath: /models
        readinessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 30
          periodSeconds: 10

persistent storage for model weights
apiVersion: v1
kind: PersistentVolume
metadata:
  name: ai-models-pv
spec:
  capacity:
    storage: 50Gi
  accessModes: [ReadWriteOnce]
  persistentVolumeReclaimPolicy: Retain
  storageClassName: local-storage
  hostPath:
    path: /mnt/data/ai-models   # pre-download weights here
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ai-models-pvc
spec:
  accessModes: [ReadWriteOnce]
  resources:
    requests:
      storage: 50Gi
  storageClassName: local-storage

Once the pod is running and the service has an external IP, test the endpoint:

test the api
# Get the MetalLB-assigned IP
kubectl get svc vllm-service
# vllm-service  LoadBalancer  10.96.x.x  192.168.1.240  8080

# Call the OpenAI-compatible endpoint
curl http://192.168.1.240:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-1.5B-Instruct",
    "messages": [{"role":"user","content":"Hello"}]
  }'

Troubleshooting: The Four Failure Modes

Every failure in this stack maps to one of four layers. Identifying the layer first saves hours of random debugging.

Layer | Component | Symptom
GPU | AMD GPU Operator | Pod stays Pending
Networking | MetalLB | Service unreachable / connection refused
Storage | PVC / volume mount | Model loading error / file not found
Runtime | vLLM process | OOMKilled / CrashLoopBackOff
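The table above can be read as a first-pass triage function; a sketch (the symptom strings are illustrative, match them against kubectl get and kubectl describe output):

```python
# First-pass triage: map an observed symptom to the layer to debug first.
# Keys are illustrative substrings of typical kubectl / curl output.

TRIAGE = {
    "pending":            ("GPU",        "check AMD GPU Operator / amd.com/gpu capacity"),
    "connection refused": ("Networking", "check MetalLB IPAddressPool / Service endpoints"),
    "file not found":     ("Storage",    "check PVC binding and volume mountPath"),
    "oomkilled":          ("Runtime",    "check memory limits vs model + KV cache size"),
}

def triage(symptom: str):
    """Return (layer, first action) for a symptom, or a fallback."""
    for key, (layer, action) in TRIAGE.items():
        if key in symptom.lower():
            return layer, action
    return "Unknown", "inspect pod events"

assert triage("pod vllm-inference-abc stays Pending")[0] == "GPU"
```

Identifying the layer first is the whole point; the four sections below then give the layer-specific diagnosis.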

1. Pod Pending: insufficient amd.com/gpu

Kubernetes cannot schedule the pod because no node has free GPU resources. Either the AMD GPU Operator is not installed, the device plugin is unhealthy, or all GPUs are already allocated.

kubectl describe pod output
Events:
  Warning  FailedScheduling  ...  0/3 nodes are available:
           3 Insufficient amd.com/gpu.

diagnose
# Check what GPU resources nodes actually advertise
kubectl get nodes -o json | grep "amd.com/gpu"
# Healthy: "amd.com/gpu": "8"
# Missing: GPU Operator device plugin not running

# Check operator health
kubectl get pods -n kube-amd-gpu
kubectl describe pod <operator-pod> -n kube-amd-gpu

2. Service Unreachable: Connection Refused

The pod is running but the API is not reachable. The first question is whether MetalLB assigned an external IP. If EXTERNAL-IP shows <pending>, the IPAddressPool is not configured or does not overlap with the node network, or L2Advertisement is missing.

diagnose
kubectl get svc
# EXTERNAL-IP should show 192.168.x.x, not <pending>

kubectl get endpoints vllm-service
# Must show pod IPs. If empty, selector does not match pod labels.

kubectl describe svc vllm-service
# Check Events section for MetalLB allocation messages

3. Model Loading Errors: File Not Found

vLLM acquires the GPU, then crashes looking for weights. This is almost always a volume mount mismatch: the --download-dir argument does not match the mountPath, or the PVC is unbound (Pending before the pod starts is expected with WaitForFirstConsumer, but still Pending after the pod is scheduled means a storage problem).

diagnose
kubectl get pvc
# STATUS must be Bound. Pending before pod = ok. Pending after = problem.

kubectl describe pod <pod-name>
# Look at: Mounts section (paths), Events (volume errors)

4. Memory Issues: OOMKilled / CrashLoopBackOff

The pod starts and loads the model, then dies under load. vLLM preallocates the entire KV cache at startup based on --max-model-len and --max-num-seqs. If the sum of model weights plus KV cache exceeds the pod memory limit, the OOM killer terminates the process.

diagnose and fix
kubectl describe pod <pod-name>
# Look for: Reason: OOMKilled under Last State

# Fixes in order of preference:
# 1. Raise memory limits in the Deployment spec
# 2. --max-model-len 4096    (reduce context length)
#    --max-num-seqs 64       (reduce max concurrent sequences)
# 3. Use quantised weights (GPTQ/AWQ) to shrink the weight footprint
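A back-of-envelope estimate shows why those flags help: worst-case KV cache is 2 tensors (K and V) per layer per token. A sketch with hypothetical 7B-class dimensions:

```python
# Worst-case KV cache sizing: 2 (K and V) x layers x kv_heads x head_dim
# x bytes-per-element, per token, times the maximum tokens in flight.
# The model dimensions below are hypothetical, not a specific model's.

def kv_cache_bytes(layers, kv_heads, head_dim, max_model_len, max_num_seqs,
                   dtype_bytes=2):
    per_token = 2 * layers * kv_heads * head_dim * dtype_bytes
    return per_token * max_model_len * max_num_seqs

# 32 layers, 8 KV heads, head_dim 128, fp16, 8192-token context, 64 seqs:
gib = kv_cache_bytes(32, 8, 128, max_model_len=8192, max_num_seqs=64) / 2**30
print(f"{gib:.0f} GiB")   # → 64 GiB worst case if every sequence hits max length
```

Halving --max-model-len or --max-num-seqs halves this number, which is why they are the first knobs to turn when memory is tight.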

Production Patterns

Horizontal Pod Autoscaling

CPU-based HPA is useless for LLM serving: the CPU sits idle while the GPU is saturated. You need custom metrics. vLLM's built-in /metrics endpoint exposes queue depth, TTFT percentiles, and KV cache utilisation as Prometheus metrics. Feed those through the custom metrics API adapter and the HPA can scale on signals that actually reflect inference pressure.

hpa on queue depth
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: vllm-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm-inference
  minReplicas: 1
  maxReplicas: 8
  metrics:
    - type: Pods
      pods:
        metric:
          name: vllm_request_queue_depth
        target:
          type: AverageValue
          averageValue: "10"
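The arithmetic behind the HPA decision is simply desired = ceil(current x metric / target), clamped to the replica bounds; a sketch:

```python
import math

# The core HPA scaling rule: desiredReplicas =
#   ceil(currentReplicas * currentMetricValue / targetMetricValue),
# clamped to [minReplicas, maxReplicas].

def desired_replicas(current, metric_avg, target, min_r=1, max_r=8):
    want = math.ceil(current * metric_avg / target)
    return max(min_r, min(max_r, want))

# 2 replicas with average queue depth 25 against a target of 10:
assert desired_replicas(2, 25, 10) == 5   # ceil(2 * 25/10) = 5
# Demand spike: clamped at maxReplicas.
assert desired_replicas(4, 100, 10) == 8
```

With vllm_request_queue_depth as the metric, replicas scale with actual inference backlog instead of an idle CPU signal.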

GPU Sharing

A single AMD Instinct MI300X carries 192 GB of HBM3. Running one large model on it is efficient, but smaller models leave most of that capacity idle. AMD ROCm supports multi-process sharing: multiple vLLM instances can timeshare one GPU with isolated memory spaces. Unlike NVIDIA MIG, AMD does not expose hardware-level GPU partitioning the same way, but ROCm compute partitioning modes provide comparable isolation for most serving workloads.

192 GB of HBM3 per MI300X
ROCm compute partitioning
8 logical GPUs per node

Observability Stack

Instrument three layers: GPU hardware via the ROCm SMI Exporter, vLLM serving metrics via its built-in Prometheus endpoint, and Kubernetes pod-level metrics via cAdvisor. Prometheus collects all three. Grafana visualises them.

key metrics
# vLLM (/metrics on serving port 8080)
vllm:time_to_first_token_seconds_p99   # user-visible latency
vllm:e2e_request_latency_seconds_p99   # full round-trip
vllm:request_queue_depth               # HPA trigger signal
vllm:gpu_cache_usage_perc              # scale-out indicator

# ROCm SMI Exporter
rocm_gpu_utilization                   # compute %
rocm_gpu_memory_used_bytes             # HBM consumption
rocm_gpu_temperature_celsius           # thermal headroom

GitOps with Argo CD

All manifests (Deployment, Service, PVC, MetalLB config, HPA) should live in a Git repository. Argo CD watches that repo and continuously reconciles the cluster to match it. This gives you instant rollback (revert the commit), reproducible environments, and a complete audit trail for every change to the inference stack.

argo cd application
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: vllm-inference
  namespace: argocd
spec:
  source:
    repoURL: https://github.com/Yu-amd/Kubernetes-MI300X
    targetRevision: main
    path: yaml-configs
  destination:
    server: https://kubernetes.default.svc
    namespace: default
  syncPolicy:
    automated:
      prune: true
      selfHeal: true

Security and Multi-tenancy

When multiple teams share a GPU cluster, three primitives prevent interference. RBAC restricts which service accounts can schedule GPU pods. NetworkPolicy limits which namespaces can reach the inference endpoints. ResourceQuota caps how many GPUs each team can consume, preventing one workload from starving others.

gpu quota per namespace
apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-quota
  namespace: team-a
spec:
  hard:
    requests.amd.com/gpu: "4"
    limits.amd.com/gpu: "4"

Putting It Together

Step | What it solves | When to add it
check-system-enhanced.sh | Catch OS / disk / lock issues before install | Before anything
install-kubernetes.sh | containerd + Calico + single-node cluster | Day one
install-amd-gpu-operator.sh | amd.com/gpu resource + model PV | Day one
MetalLB + IPAddressPool | External IP on bare metal | Before external traffic
deploy-vllm-inference.sh | vLLM pod + LoadBalancer service | After GPU operator healthy
Prometheus + Grafana | Latency / cache / GPU visibility | Once stable in dev
Custom HPA | Request-driven autoscaling | When load is variable
Argo CD | Reproducible GitOps deployments | When team grows beyond one
RBAC + ResourceQuota | Multi-tenant GPU isolation | When multiple teams share cluster

Full scripts and YAML configs at github.com/Yu-amd/Kubernetes-MI300X. The repo includes a system check script that auto-resolves common Ubuntu package lock issues before any installation begins.

Kubernetes · AMD Instinct · ROCm · MetalLB · vLLM · PagedAttention · HPA · GitOps · Observability