Most tutorials skip the foundations and jump straight to YAML. That works until something breaks, and then you have no mental model to reason from. This post builds the full picture: what vLLM is and why it matters for LLM serving, what Kubernetes actually does and how its planes are structured, how AMD GPUs fit into that model, and finally the concrete steps to get a production-grade inference cluster running on bare metal. The reference implementation is Yu-amd/Kubernetes-MI300X on GitHub.
What is vLLM
vLLM is an open-source inference engine built specifically for large language models. The core problem it solves is GPU memory efficiency. Naive LLM serving allocates a fixed KV cache block for every request upfront, even if that request only uses a fraction of it. At any meaningful concurrency level, this wastes the majority of GPU memory and throttles throughput.
vLLM's answer is PagedAttention: it manages the KV cache the same way an OS manages RAM, dividing it into fixed-size pages and allocating them on demand as tokens are generated. Pages belonging to different requests can be interleaved in GPU memory, and pages can be shared across requests that have the same prompt prefix. The result is dramatically higher utilisation and, as a direct consequence, much higher throughput at the same hardware budget.
Beyond PagedAttention, vLLM provides continuous batching (new requests join the batch as slots free up, rather than waiting for a whole batch to finish), tensor parallelism for multi-GPU serving, and a built-in Prometheus metrics endpoint. For AMD Instinct GPUs specifically, vLLM uses the ROCm software stack and AITER kernels that are tuned for AMD's HBM3 memory architecture.
vLLM starts an OpenAI-compatible HTTP server. Any code that calls POST /v1/chat/completions against OpenAI can point at vLLM with a one-line URL change. No client code modifications required.
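Because the surface is identical, existing clients can usually be redirected without touching code at all. A minimal sketch, assuming a recent official OpenAI SDK (which honours these environment variables) and the service address used later in this post:

# Point any OpenAI SDK client at the cluster instead of api.openai.com
export OPENAI_BASE_URL="http://192.168.1.240:8080/v1"
export OPENAI_API_KEY="EMPTY"   # vLLM ignores the key unless started with --api-key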
What is Kubernetes
Kubernetes is a container orchestration system. It takes a cluster of machines and presents them as a single pool of compute resources. You declare what you want to run (a Deployment, a Service, a Job) and Kubernetes figures out where to run it, restarts it if it crashes, and scales it up or down as load changes. The operative word is declarative: you describe the desired state and the system continuously works to reconcile reality with that description.
For LLM inference this matters because a single vLLM process running on one machine is not fault-tolerant and cannot scale. Kubernetes wraps it in a layer that handles restarts, rolling updates, health checks, load balancing, and resource scheduling automatically.
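Reconciliation is easy to see in action once the cluster built later in this post is running: kill the inference pod and the Deployment controller replaces it. A minimal sketch, assuming the app=vllm-inference label used in the Deployment further down:

# Delete the pod out from under its Deployment
kubectl delete pod -l app=vllm-inference

# Watch the controller recreate it to satisfy replicas: 1
kubectl get pods -w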
The Kubernetes Planes
Control Plane
The control plane is the brain of the cluster. It makes all scheduling and state-management decisions. It does not run your workloads. In multi-node clusters its components run on dedicated control-plane nodes; the single-node setup later in this post removes the control-plane taint so the same machine can also run workloads.
| Component | What it does |
|---|---|
| kube-apiserver | The front door. Every kubectl command and every controller talks to the API server. It validates and stores all cluster state. |
| etcd | Distributed key-value store. The single source of truth for all cluster state. If etcd is healthy, the cluster can recover from anything. |
| kube-scheduler | Watches for unscheduled pods and assigns them to nodes. For GPU workloads it evaluates amd.com/gpu resource availability. |
| kube-controller-manager | Runs a collection of controllers (Deployment, ReplicaSet, Node, etc.) that drive current state toward desired state. |
| cloud-controller-manager | Integrates with cloud provider APIs. Absent on bare metal, where MetalLB fills the LoadBalancer-service role it would otherwise delegate to the cloud. |
Data Plane (Worker Nodes)
The data plane is where workloads actually run. Each worker node runs three agents that take instructions from the control plane and execute them locally.
| Component | What it does |
|---|---|
| kubelet | The node agent. Watches the API server for pods assigned to its node, then tells the container runtime to start or stop them. Reports node status and pod health back up. |
| kube-proxy | Manages network rules (iptables or IPVS) that implement Service load balancing within the cluster. |
| Container runtime | The engine that actually runs containers. containerd is standard. It pulls images, creates namespaces, and manages the container lifecycle. |
Add-ons
Beyond the core planes, two add-ons are essential for GPU inference on bare metal. CNI (Container Network Interface) provides pod-to-pod networking across nodes. Calico is the standard choice for bare-metal clusters. MetalLB is the bare-metal equivalent of a cloud LoadBalancer: it assigns real IP addresses from a configured pool to Services of type LoadBalancer, making the vLLM API reachable from outside the cluster.
┌──────────────────── Control Plane ─────────────────────┐
│ kube-apiserver · etcd · scheduler · controller-mgr │
└────────────────────────────────────────────────────────┘
↕ (watch / reconcile loop)
┌──────────── Worker Node ───────────────────────────────┐
│ kubelet · kube-proxy · containerd │
│ ┌──────────────────────────────────────────────────┐ │
│ │ AMD GPU Operator (Device Plugin + Node Labeller)│ │
│ └──────────────────────────────────────────────────┘ │
│ ┌────────────────┐ ┌────────────────┐ │
│ │ vLLM Pod │ │ vLLM Pod │ │
│ │ amd.com/gpu:1 │ │ amd.com/gpu:1 │ │
│ └────────────────┘ └────────────────┘ │
│ ┌──────────────────────────────────────────────────┐ │
│ │ AMD Instinct MI300X (GPU 0 · GPU 1 · ...) │ │
│ └──────────────────────────────────────────────────┘ │
└────────────────────────────────────────────────────────┘
↕
┌──────────── MetalLB ───────────────────────────────────┐
│ LoadBalancer Service · IPAddressPool · L2Advertisement│
│ EXTERNAL-IP: 192.168.x.x:8080                         │
└────────────────────────────────────────────────────────┘
↕
Client (POST /v1/chat/completions)
How AMD GPUs Fit In
Standard Kubernetes knows nothing about GPUs. It schedules pods based on CPU and memory requests. To make amd.com/gpu a schedulable resource, you need the AMD GPU Operator, which installs two things on every GPU node:
- Device Plugin: advertises GPU count to the API server so the scheduler can place pods correctly
- Node Labeller: tags nodes with GPU model, VRAM size, and ROCm version so workloads can target specific hardware
Once the operator is healthy, a pod can request a GPU by adding amd.com/gpu: 1 to its resource limits. Kubernetes then ensures that pod lands on a node with a free GPU and that no other pod gets access to that GPU while the first one is running.
resources:
  requests:
    amd.com/gpu: 1      # extended resources: the request must equal the limit
    memory: "16Gi"
    cpu: "4"
  limits:
    amd.com/gpu: 1
    memory: "32Gi"
    cpu: "8"
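The Node Labeller's tags can also steer scheduling toward specific hardware via a nodeSelector. A hedged sketch: the exact label keys vary by operator version, so discover what your nodes actually carry before relying on them:

# Discover the labels the Node Labeller applied:
#   kubectl get nodes --show-labels | tr ',' '\n' | grep amd.com
# Pod spec fragment (label key and value below are illustrative):
nodeSelector:
  amd.com/gpu.family: AI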
Setting Up the Cluster
The reference implementation at Yu-amd/Kubernetes-MI300X breaks the setup into three sequential scripts. Each one is idempotent: re-running it skips already-installed components.
Step 1: Install Kubernetes
The first script installs vanilla Kubernetes 1.28+ on Ubuntu/Debian. It configures containerd as the container runtime, sets up Calico CNI for pod networking, removes the control-plane taint so a single-node cluster can also run workloads, disables swap (a hard Kubernetes requirement), and applies the kernel settings Kubernetes requires (IP forwarding and bridge netfilter).
# Before running, verify system readiness
sudo ./check-system-enhanced.sh

# Install Kubernetes 1.28+ with containerd + Calico CNI
sudo ./install-kubernetes.sh

# Verify cluster is healthy
kubectl get nodes
# NAME       STATUS   ROLES           AGE   VERSION
# mi300x-1   Ready    control-plane   2m    v1.28.0
Step 2: Install AMD GPU Operator
The second script installs Helm, deploys cert-manager (a prerequisite for the operator's webhooks), then installs the AMD GPU Operator itself. It also sets up a 50 GB PersistentVolume at /mnt/data/ai-models using a local storage class, which is where model weights will live.
./install-amd-gpu-operator.sh

# Confirm GPU resources are visible to the scheduler
kubectl get nodes -o json | grep "amd.com/gpu"
# "amd.com/gpu": "8"   ← MI300X exposes 8 logical GPUs

# Check operator pods are running
kubectl get pods -n kube-amd-gpu
The DeviceConfig custom resource tells the operator how to configure the device plugin. For most setups the empty spec is correct: it uses all detected GPUs with default settings.
apiVersion: amd.com/v1alpha1
kind: DeviceConfig
metadata:
  name: amd-gpu-device-config
  namespace: kube-amd-gpu
spec: {}   # defaults: all GPUs, ROCm auto-detected
Step 3: Deploy vLLM with MetalLB
The third script does three things in sequence: installs MetalLB v0.14, configures an IP pool from the node's subnet, then deploys the vLLM inference server. The script auto-detects the node IP and carves out a small pool (e.g. 192.168.x.240 to .250) for LoadBalancer services.
kubectl apply -f https://raw.githubusercontent.com/metallb/metallb/v0.14.3/config/manifests/metallb-native.yaml

# Wait for MetalLB pods to be ready
kubectl wait --namespace metallb-system \
  --for=condition=ready pod \
  --selector=app=metallb \
  --timeout=90s
apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: default-pool
  namespace: metallb-system
spec:
  addresses:
  - 192.168.1.240-192.168.1.250   # adjust to your subnet
---
apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: default-advertisement
  namespace: metallb-system
spec:
  ipAddressPools:
  - default-pool
The vLLM deployment requests exactly one GPU, mounts the model PVC at /models, and includes readiness and liveness probes against the /health endpoint that vLLM exposes once the model is loaded.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-inference
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm-inference
  template:
    metadata:
      labels:
        app: vllm-inference
    spec:
      containers:
      - name: vllm-container
        image: rocm/vllm:latest
        command: ["python", "-m", "vllm.entrypoints.openai.api_server"]
        args:
        - "--model"
        - "Qwen/Qwen2.5-1.5B-Instruct"
        - "--host"
        - "0.0.0.0"
        - "--port"
        - "8080"
        - "--download-dir"
        - "/models"
        - "--tensor-parallel-size"
        - "1"
        resources:
          requests:
            amd.com/gpu: 1
            memory: "8Gi"
          limits:
            amd.com/gpu: 1
            memory: "16Gi"
        volumeMounts:
        - name: model-storage
          mountPath: /models
        readinessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 30
          periodSeconds: 10
        livenessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 120   # model load can take minutes; avoid premature kills
          periodSeconds: 30
      volumes:
      - name: model-storage
        persistentVolumeClaim:
          claimName: ai-models-pvc
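The troubleshooting section below refers to vllm-service; the deploy script creates it alongside the Deployment. A minimal sketch consistent with the app label above:

apiVersion: v1
kind: Service
metadata:
  name: vllm-service
spec:
  type: LoadBalancer          # MetalLB assigns the external IP
  selector:
    app: vllm-inference       # must match the pod labels above
  ports:
  - port: 8080
    targetPort: 8080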
apiVersion: v1
kind: PersistentVolume
metadata:
  name: ai-models-pv
spec:
  capacity:
    storage: 50Gi
  accessModes: [ReadWriteOnce]
  persistentVolumeReclaimPolicy: Retain
  storageClassName: local-storage
  hostPath:
    path: /mnt/data/ai-models   # pre-download weights here
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ai-models-pvc
spec:
  accessModes: [ReadWriteOnce]
  resources:
    requests:
      storage: 50Gi
  storageClassName: local-storage
Once the pod is running and the service has an external IP, test the endpoint:
# Get the MetalLB-assigned IP
kubectl get svc vllm-service
# vllm-service   LoadBalancer   10.96.x.x   192.168.1.240   8080

# Call the OpenAI-compatible endpoint
curl http://192.168.1.240:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-1.5B-Instruct",
    "messages": [{"role":"user","content":"Hello"}]
  }'
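A successful call returns the standard OpenAI chat-completion shape. Abridged sketch; the id, content, and token counts here are illustrative:

{
  "id": "chatcmpl-...",
  "object": "chat.completion",
  "model": "Qwen/Qwen2.5-1.5B-Instruct",
  "choices": [
    {
      "index": 0,
      "message": {"role": "assistant", "content": "Hello! How can I help you today?"},
      "finish_reason": "stop"
    }
  ],
  "usage": {"prompt_tokens": 9, "completion_tokens": 9, "total_tokens": 18}
}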
Troubleshooting: The Four Failure Modes
Every failure in this stack maps to one of four layers. Identifying the layer first saves hours of random debugging.
| Layer | Component | Symptom |
|---|---|---|
| GPU | AMD GPU Operator | Pod stays Pending |
| Networking | MetalLB | Service unreachable / connection refused |
| Storage | PVC / volume mount | Model loading error / file not found |
| Runtime | vLLM process | OOMKilled / CrashLoopBackOff |
1. Pod Pending: insufficient amd.com/gpu
Kubernetes cannot schedule the pod because no node has free GPU resources. Either the AMD GPU Operator is not installed, the device plugin is unhealthy, or all GPUs are already allocated.
Events:
  Warning  FailedScheduling  ...  0/3 nodes are available:
                                  3 Insufficient amd.com/gpu.
# Check what GPU resources nodes actually advertise
kubectl get nodes -o json | grep "amd.com/gpu"
# Healthy: "amd.com/gpu": "8"
# Missing: GPU Operator device plugin not running

# Check operator health
kubectl get pods -n kube-amd-gpu
kubectl describe pod <operator-pod> -n kube-amd-gpu
2. Service Unreachable: Connection Refused
The pod is running but the API is not reachable. The first question is whether MetalLB assigned an external IP. If EXTERNAL-IP shows <pending>, the IPAddressPool is missing or does not overlap with the node network, or there is no L2Advertisement.
kubectl get svc
# EXTERNAL-IP should show 192.168.x.x, not <pending>

kubectl get endpoints vllm-service
# Must show pod IPs. If empty, selector does not match pod labels.

kubectl describe svc vllm-service
# Check Events section for MetalLB allocation messages
3. Model Loading Errors: File Not Found
vLLM acquires the GPU, then crashes looking for weights. This is almost always a volume mount mismatch: the --download-dir argument does not match the mountPath, or the PVC is stuck Pending (Pending with WaitForFirstConsumer is normal before the pod is scheduled; remaining unbound after the pod starts indicates a storage problem).
kubectl get pvc
# STATUS must be Bound. Pending before pod = ok. Pending after = problem.

kubectl describe pod <pod-name>
# Look at: Mounts section (paths), Events (volume errors)
4. Memory Issues: OOMKilled / CrashLoopBackOff
The pod starts and loads the model, then dies under load. Two memory budgets are in play. Exceeding the pod's host-memory limit gets the process OOMKilled by the kernel; exhausting GPU HBM, where vLLM preallocates the KV cache at startup based on --max-model-len and --max-num-seqs, crashes the process with a GPU out-of-memory error and drives CrashLoopBackOff.
kubectl describe pod <pod-name>
# Look for: Reason: OOMKilled under Last State

# Fixes in order of preference:
# 1. Raise memory limits in the Deployment spec
# 2. --max-model-len 4096   (cap context length below the model default)
#    --max-num-seqs 64      (reduce max concurrent sequences)
# 3. Use quantised weights (GPTQ/AWQ) to shrink the weight footprint,
#    leaving more HBM for the KV cache
Production Patterns
Horizontal Pod Autoscaling
CPU-based HPA is useless for LLM serving: the CPU sits idle while the GPU is saturated. You need custom metrics. vLLM's built-in /metrics endpoint exposes queue depth, TTFT percentiles, and KV cache utilisation as Prometheus metrics. Feed those through a custom metrics API adapter such as prometheus-adapter and the HPA can scale on signals that actually reflect inference pressure.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: vllm-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm-inference
  minReplicas: 1
  maxReplicas: 8
  metrics:
  - type: Pods
    pods:
      metric:
        name: vllm_request_queue_depth
      target:
        type: AverageValue
        averageValue: "10"
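The adapter is what maps the Prometheus series name onto the metric name the HPA references above. A minimal prometheus-adapter rule sketch, assuming the series carries namespace and pod labels (the series name itself appears in the observability section below):

rules:
- seriesQuery: 'vllm:request_queue_depth'
  resources:
    overrides:
      namespace: {resource: "namespace"}
      pod: {resource: "pod"}
  name:
    matches: "vllm:request_queue_depth"
    as: "vllm_request_queue_depth"    # the name the HPA asks for
  metricsQuery: avg(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)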
GPU Sharing
A single AMD Instinct MI300X carries 192 GB of HBM3. Running one large model on it is efficient, but smaller models leave most of that capacity idle. AMD ROCm supports multi-process sharing: multiple vLLM instances can timeshare one GPU with isolated memory spaces. Unlike NVIDIA MIG, AMD does not expose hardware-level GPU partitioning the same way, but ROCm compute partitioning modes provide comparable isolation for most serving workloads.
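Partition modes are configured on the host, not through Kubernetes. A hedged sketch using rocm-smi; flag names have shifted across ROCm releases, so treat these as illustrative and confirm against rocm-smi --help on your version:

# Read the current compute partition mode (SPX = one logical GPU)
sudo rocm-smi --showcomputepartition

# Switch to CPX to expose multiple compute partitions per MI300X
sudo rocm-smi --setcomputepartition CPX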
Observability Stack
Instrument three layers: GPU hardware via the ROCm SMI Exporter, vLLM serving metrics via its built-in Prometheus endpoint, and Kubernetes pod-level metrics via cAdvisor. Prometheus collects all three. Grafana visualises them.
# vLLM (port 8080, path /metrics)
vllm:time_to_first_token_seconds_p99   # user-visible latency
vllm:e2e_request_latency_seconds_p99   # full round-trip
vllm:request_queue_depth               # HPA trigger signal
vllm:gpu_cache_usage_perc              # scale-out indicator

# ROCm SMI Exporter
rocm_gpu_utilization                   # compute %
rocm_gpu_memory_used_bytes             # HBM consumption
rocm_gpu_temperature_celsius           # thermal headroom
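A minimal Prometheus scrape sketch tying these together. The vLLM target reuses the MetalLB address from earlier (production setups should prefer Kubernetes service discovery), and the exporter port is illustrative:

scrape_configs:
- job_name: "vllm"
  metrics_path: /metrics              # vLLM serves metrics on its API port
  static_configs:
  - targets: ["192.168.1.240:8080"]

- job_name: "rocm-smi-exporter"
  static_configs:
  - targets: ["<node-ip>:9400"]       # adjust to your exporter's port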
GitOps with Argo CD
All manifests (Deployment, Service, PVC, MetalLB config, HPA) should live in a Git repository. Argo CD watches that repo and continuously reconciles the cluster to match it. This gives you instant rollback (revert the commit), reproducible environments, and a complete audit trail for every change to the inference stack.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: vllm-inference
  namespace: argocd
spec:
  source:
    repoURL: https://github.com/Yu-amd/Kubernetes-MI300X
    targetRevision: main
    path: yaml-configs
  destination:
    server: https://kubernetes.default.svc
    namespace: default
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
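Day-to-day changes then flow through Git rather than kubectl. A sketch of the rollback path described above, with the commit hash as a placeholder:

# Revert the offending commit; Argo CD reconciles the cluster automatically
git revert <bad-commit-sha>
git push

# Optional: inspect sync status from the CLI
argocd app get vllm-inference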
Security and Multi-tenancy
When multiple teams share a GPU cluster, three primitives prevent interference. RBAC restricts which service accounts can schedule GPU pods. NetworkPolicy limits which namespaces can reach the inference endpoints. ResourceQuota caps how many GPUs each team can consume, preventing one workload from starving others.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-quota
  namespace: team-a
spec:
  hard:
    requests.amd.com/gpu: "4"
    limits.amd.com/gpu: "4"
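The quota above covers the third primitive; hedged sketches of the other two follow, with names, namespaces, and label selectors as placeholders:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-team-a-ingress
  namespace: default
spec:
  podSelector:
    matchLabels:
      app: vllm-inference        # the inference pods deployed earlier
  policyTypes: [Ingress]
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: team-a   # only team-a may reach the endpoint
    ports:
    - protocol: TCP
      port: 8080
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: gpu-workload-deployer
  namespace: team-a
rules:
- apiGroups: ["apps"]
  resources: ["deployments"]
  verbs: ["get", "list", "create", "update"]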
Putting It Together
| Step | What it solves | When to add it |
|---|---|---|
| check-system-enhanced.sh | Catch OS / disk / lock issues before install | Before anything |
| install-kubernetes.sh | containerd + Calico + single-node cluster | Day one |
| install-amd-gpu-operator.sh | amd.com/gpu resource + model PV | Day one |
| MetalLB + IPAddressPool | External IP on bare metal | Before external traffic |
| deploy-vllm-inference.sh | vLLM pod + LoadBalancer service | After GPU operator healthy |
| Prometheus + Grafana | Latency / cache / GPU visibility | Once stable in dev |
| Custom HPA | Request-driven autoscaling | When load is variable |
| Argo CD | Reproducible GitOps deployments | When team grows beyond one |
| RBAC + ResourceQuota | Multi-tenant GPU isolation | When multiple teams share cluster |
Full scripts and YAML configs at github.com/Yu-amd/Kubernetes-MI300X. The repo includes a system check script that auto-resolves common Ubuntu package lock issues before any installation begins.