Running Gemma 2B on Kubernetes (k3d) with Ollama: A Complete Local AI Setup
I was fascinated by how people were running large language models locally, fully offline, without depending on expensive GPU clusters or cloud APIs.
But when I tried deploying Gemma 2B manually on my machine, the process was messy:
- Large model weights had to be downloaded
- Restarting the container meant re-downloading everything
- No orchestration or resilience: if the container died, my setup was gone
So, I asked myself:
“Can I run Gemma 2B efficiently, fully containerized, orchestrated by Kubernetes, with a clean local setup?”
The answer: Yes. Using k3d + Ollama + Kubernetes + Gemma 2B.
What You'll Learn
- Deploy Gemma 2B using Ollama inside a k3d Kubernetes cluster
- Expose it via a Service for local access
- Persist model weights to avoid re-downloading
- Basic troubleshooting for pods and containers
Tech Stack
| Component | Purpose |
| --- | --- |
| k3d | Lightweight Kubernetes cluster running inside Docker |
| Ollama | Local LLM runtime, run here as a container |
| Gemma 2B | Lightweight LLM (~1.7 GB) from Google that runs locally |
| WSL2 | Linux environment on Windows |
Concepts Before We Start
- What is Ollama?
Ollama is a simple tool for running LLMs locally:
- Pulls models like Gemma, Llama, and Phi
- Provides a REST API for inference
- Runs entirely offline once the weights are downloaded
Example:
ollama run gemma:2b
This gives you a local chatbot with zero cloud dependency.
- Why Kubernetes (k3d)?
Instead of running Ollama bare-metal, we use k3d:
- Local K8s cluster: k3d runs Kubernetes inside Docker, very lightweight
- Pods & PVCs: Pods run the containers, PVCs store the model weights
- Services: expose the Ollama API on localhost easily
- Storage with PVC
Without a PVC, you lose the model weights whenever the pod dies. A PVC ensures the models survive restarts and redeployments.
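One thing worth knowing: k3d clusters run k3s, which ships a default local-path StorageClass, so the PVC defined below should bind without any extra setup. You can verify with:

kubectl get storageclass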
Step-by-Step Setup
Step 1: Install k3d and create a cluster
curl -s https://raw.githubusercontent.com/k3d-io/k3d/main/install.sh | bash
k3d cluster create gemma-cluster --agents 1 --servers 1
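If you want to reach the Ollama API on localhost later (Step 4), it is easiest to map the port at cluster-creation time. A variant of the command above, assuming host port 11434 is free:

k3d cluster create gemma-cluster --agents 1 --servers 1 -p "11434:11434@loadbalancer"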
Step 2: Deploy Ollama + Gemma 2B
Create ollama-deployment.yaml:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ollama
  template:
    metadata:
      labels:
        app: ollama
    spec:
      containers:
        - name: ollama
          image: ollama/ollama:latest
          ports:
            - containerPort: 11434
          volumeMounts:
            - name: model-storage
              mountPath: /root/.ollama
      volumes:
        - name: model-storage
          persistentVolumeClaim:
            claimName: ollama-pvc
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ollama-pvc
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 5Gi
---
apiVersion: v1
kind: Service
metadata:
  name: ollama-service
spec:
  selector:
    app: ollama
  ports:
    - protocol: TCP
      port: 11434
      targetPort: 11434
  type: LoadBalancer
Apply it:
kubectl apply -f ollama-deployment.yaml
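Before pulling the model, check that the pod is Running and the PVC is Bound:

kubectl get pods -l app=ollama
kubectl get pvc ollama-pvc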
Step 3: Pull Gemma 2B Model
kubectl exec -it deploy/ollama -- ollama pull gemma:2b
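Once the pull completes, the weights live on the PVC. You can confirm the model is available with:

kubectl exec -it deploy/ollama -- ollama list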
Step 4: Test the API
curl http://localhost:11434/api/generate -d '{
"model": "gemma:2b",
"prompt": "Write a short poem about Kubernetes"
}'
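If you didn't map the port when creating the cluster, a port-forward to the Service works just as well:

kubectl port-forward svc/ollama-service 11434:11434

Note that /api/generate streams its response as JSON lines by default; add "stream": false to the request body if you prefer a single JSON object.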
Problems I Faced & Fixes
- Pod in CrashLoopBackOff: increased CPU/RAM in the deployment spec
- Model re-downloading on restart: used a PVC to persist weights
- Port not accessible: used a LoadBalancer Service plus a k3d port mapping
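For the CrashLoopBackOff fix, a resources block along these lines goes under the container in the Deployment spec (the values are illustrative, not from the original manifest; tune them for your machine):

resources:
  requests:
    memory: "2Gi"
    cpu: "1"
  limits:
    memory: "6Gi"
    cpu: "2"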
Final Project Structure
gemma-k3d/
├── ollama-deployment.yaml
├── k3d-cluster-setup.sh
└── README.md
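k3d-cluster-setup.sh isn't shown in this article; a minimal sketch, assuming it simply wraps the commands from Step 1 and applies the manifest, might look like:

#!/usr/bin/env bash
set -euo pipefail

# Create a single-server, single-agent cluster and map the Ollama port to localhost
k3d cluster create gemma-cluster --agents 1 --servers 1 -p "11434:11434@loadbalancer"

# Deploy Ollama, the PVC, and the Service
kubectl apply -f ollama-deployment.yaml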
Next Steps
In the next article, we'll add Prometheus + Grafana to monitor:
- CPU usage
- Memory usage
- Latency per inference