Running Gemma 2B on Kubernetes (k3d) with Ollama: A Complete Local AI Setup
I was fascinated by how people were running large language models locally, fully offline, without depending on expensive GPU clusters or cloud APIs.
But when I tried deploying Gemma 2B manually on my machine, the process was messy:
- Large model weights had to be downloaded
- Restarting the container meant re-downloading everything
- No orchestration or resilience: if the container died, my setup was gone
So, I asked myself:
“Can I run Gemma 2B efficiently, fully containerized, orchestrated by Kubernetes, with a clean local setup?”
The answer: Yes. Using k3d + Ollama + Kubernetes + Gemma 2B.
What You'll Learn
- Deploy Gemma 2B using Ollama inside a k3d Kubernetes cluster
- Expose it via a Service for local access
- Persist model weights to avoid re-downloading
- Basic troubleshooting for pods and containers
Tech Stack
| Component | Purpose |
| --- | --- |
| k3d | Lightweight Kubernetes cluster running inside Docker |
| Ollama | Local LLM runtime, run here as a container |
| Gemma 2B | Lightweight LLM (~1.7 GB) from Google that runs locally |
| WSL2 | Linux environment on Windows |
Concepts Before We Start
- What is Ollama?
Ollama is a simple tool for running LLMs locally:
- Pulls models like Gemma, Llama, and Phi
- Provides a REST API for inference
- Runs entirely offline once the weights are downloaded
Example:
ollama run gemma:2b
This gives you a local chatbot with zero cloud dependency.
- Why Kubernetes (k3d)?
Instead of running Ollama bare-metal, we use k3d:
- Local K8s cluster: k3d runs Kubernetes inside Docker, very lightweight
- Pods & PVCs: Pods run the containers, PVCs store the model weights
- Services: expose the Ollama API on localhost easily
- Storage with PVC
Without a PVC, you lose the model weights whenever the pod dies. A PVC ensures the models survive restarts and redeployments.
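One thing worth knowing: k3d clusters run k3s, which ships a default local-path StorageClass, so the PVC defined below should bind without any extra setup. You can verify with:

kubectl get storageclass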
Step-by-Step Setup
Step 1: Install k3d and create a cluster
curl -s https://raw.githubusercontent.com/k3d-io/k3d/main/install.sh | bash
k3d cluster create gemma-cluster --agents 1 --servers 1
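If you want to reach the Ollama API on localhost later (Step 4), it is easiest to map the port at cluster-creation time. A variant of the command above, assuming host port 11434 is free:

k3d cluster create gemma-cluster --agents 1 --servers 1 -p "11434:11434@loadbalancer"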
Step 2: Deploy Ollama + Gemma 2B
Create ollama-deployment.yaml:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ollama
  template:
    metadata:
      labels:
        app: ollama
    spec:
      containers:
        - name: ollama
          image: ollama/ollama:latest
          ports:
            - containerPort: 11434
          volumeMounts:
            - name: model-storage
              mountPath: /root/.ollama
      volumes:
        - name: model-storage
          persistentVolumeClaim:
            claimName: ollama-pvc
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ollama-pvc
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 5Gi
---
apiVersion: v1
kind: Service
metadata:
  name: ollama-service
spec:
  selector:
    app: ollama
  ports:
    - protocol: TCP
      port: 11434
      targetPort: 11434
  type: LoadBalancer
Apply it:
kubectl apply -f ollama-deployment.yaml
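Before pulling the model, check that the pod is Running and the PVC is Bound:

kubectl get pods -l app=ollama
kubectl get pvc ollama-pvc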
Step 3: Pull Gemma 2B Model
kubectl exec -it deploy/ollama -- ollama pull gemma:2b
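Once the pull completes, the weights live on the PVC. You can confirm the model is available with:

kubectl exec -it deploy/ollama -- ollama list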
Step 4: Test the API
curl http://localhost:11434/api/generate -d '{
"model": "gemma:2b",
"prompt": "Write a short poem about Kubernetes"
}'
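If you didn't map the port when creating the cluster, a port-forward to the Service works just as well:

kubectl port-forward svc/ollama-service 11434:11434

Note that /api/generate streams its response as JSON lines by default; add "stream": false to the request body if you prefer a single JSON object.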
Problems I Faced & Fixes
- Pod in CrashLoopBackOff: increased CPU/RAM in the deployment spec
- Model re-downloading on restart: used a PVC to persist weights
- Port not accessible: used a LoadBalancer Service plus a k3d port mapping
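For the CrashLoopBackOff fix, a resources block along these lines goes under the container in the Deployment spec (the values are illustrative, not from the original manifest; tune them for your machine):

resources:
  requests:
    memory: "2Gi"
    cpu: "1"
  limits:
    memory: "6Gi"
    cpu: "2"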
Final Project Structure
gemma-k3d/
├── ollama-deployment.yaml
├── k3d-cluster-setup.sh
└── README.md
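k3d-cluster-setup.sh isn't shown in this article; a minimal sketch, assuming it simply wraps the commands from Step 1 and applies the manifest, might look like:

#!/usr/bin/env bash
set -euo pipefail

# Create a single-server, single-agent cluster and map the Ollama port to localhost
k3d cluster create gemma-cluster --agents 1 --servers 1 -p "11434:11434@loadbalancer"

# Deploy Ollama, the PVC, and the Service
kubectl apply -f ollama-deployment.yaml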
Next Steps
In the next article, we'll add Prometheus + Grafana to monitor:
- CPU usage
- Memory usage
- Latency per inference