Running Gemma 2B on Kubernetes (k3d) with Ollama: A Complete Local AI Setup

I was fascinated by how people were running large language models locally, fully offline, without depending on expensive GPU clusters or cloud APIs.

But when I tried deploying Gemma 2B manually on my machine, the process was messy:

Large model weights needed downloading

Restarting the container meant re-downloading everything

No orchestration or resilience: if the container died, my setup was gone

So, I asked myself:

"Can I run Gemma 2B efficiently, fully containerized, orchestrated by Kubernetes, with a clean local setup?"

The answer: Yes. Using k3d + Ollama + Kubernetes + Gemma 2B.

🎯 What You'll Learn

Deploy Gemma 2B using Ollama inside a k3d Kubernetes cluster

Expose it via a service for local access

Persist model weights to avoid re-downloading

Basic troubleshooting for pods and containers

πŸ› οΈ Tech Stack
Component Purpose
k3d Lightweight Kubernetes cluster inside Docker
Ollama Container for running LLMs locally
Gemma 2B Lightweight LLM (~1.7GB) from Google, runs locally
WSL2 Linux environment on Windows

📚 Concepts Before We Start

  1. What is Ollama?

Ollama is a simple tool for running LLMs locally:

Pulls models like Gemma, Llama, Phi

Provides a REST API for inference

Runs entirely offline once weights are downloaded

Example:

ollama run gemma:2b

Gives you a local chatbot with zero cloud dependency.
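
For a quick look at the REST API side, you can list the locally available models once the server is running (11434 is Ollama's default port):

curl http://localhost:11434/api/tags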

  2. Why Kubernetes (k3d)?

Instead of running Ollama bare-metal, we use k3d:

Local K8s cluster → k3d runs Kubernetes inside Docker, very lightweight

Pods & PVCs → Pods run containers, PVCs store model weights

Services → Expose the Ollama API on localhost easily

  3. Storage with PVC

Without a PVC, if your pod dies, you lose the model weights.
A PVC ensures models survive restarts and redeployments.
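
Once the deployment in Step 2 is running and a model has been pulled, you can see the weights sitting on that volume (Ollama keeps everything under /root/.ollama, which is exactly where we mount the PVC):

kubectl exec -it deploy/ollama -- ls /root/.ollama/models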

πŸ§‘β€πŸ’» Step-by-Step Setup
Step 1: Install k3d
curl -s https://raw.githubusercontent.com/k3d-io/k3d/main/install.sh | bash
k3d cluster create gemma-cluster --agents 1 --servers 1 -p "11434:11434@loadbalancer"
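
The -p flag publishes port 11434 of k3d's built-in load balancer on the host, which is what will let us reach the Ollama API on localhost later (this is the "k3d port mapping" mentioned in the fixes section). Before deploying anything, confirm the cluster is up:

kubectl cluster-info
kubectl get nodes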

Step 2: Deploy Ollama + Gemma 2B

Create ollama-deployment.yaml:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ollama
  template:
    metadata:
      labels:
        app: ollama
    spec:
      containers:
      - name: ollama
        image: ollama/ollama:latest
        ports:
        - containerPort: 11434
        volumeMounts:
        - name: model-storage
          mountPath: /root/.ollama
      volumes:
      - name: model-storage
        persistentVolumeClaim:
          claimName: ollama-pvc
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ollama-pvc
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 5Gi
---
apiVersion: v1
kind: Service
metadata:
  name: ollama-service
spec:
  selector:
    app: ollama
  ports:
  - protocol: TCP
    port: 11434
    targetPort: 11434
  type: LoadBalancer

Apply it:

kubectl apply -f ollama-deployment.yaml
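
Then check that everything came up; the pod may take a minute while the ollama/ollama image is pulled:

kubectl get pods -l app=ollama
kubectl get pvc ollama-pvc
kubectl get svc ollama-service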

Step 3: Pull Gemma 2B Model
kubectl exec -it deploy/ollama -- ollama pull gemma:2b
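
The pull is a one-time download of roughly 1.7GB onto the PVC. Verify the model is available inside the pod:

kubectl exec -it deploy/ollama -- ollama list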

Step 4: Test the API

curl http://localhost:11434/api/generate -d '{
  "model": "gemma:2b",
  "prompt": "Write a short poem about Kubernetes"
}'
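
By default /api/generate streams the answer back as a series of JSON lines. If you prefer a single JSON response, set "stream": false in the request body:

curl http://localhost:11434/api/generate -d '{
  "model": "gemma:2b",
  "prompt": "Write a short poem about Kubernetes",
  "stream": false
}'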

🐞 Problems I Faced & Fixes

  1. Pod in CrashLoopBackOff → Increased CPU/RAM in the deployment spec (see the sketch below)
  2. Model re-downloading on restart → Used a PVC to persist weights
  3. Port not accessible → Used a LoadBalancer service + k3d port mapping
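
For the first fix, this is roughly the resources block I added under the ollama container in ollama-deployment.yaml; treat the numbers as a starting point, since the right values depend on your machine:

        resources:
          requests:
            cpu: "1"
            memory: 2Gi
          limits:
            cpu: "2"
            memory: 4Gi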

📂 Final Project Structure
gemma-k3d/
├── ollama-deployment.yaml
├── k3d-cluster-setup.sh
└── README.md

🚀 Next Steps

In the next article, we'll add Prometheus + Grafana to monitor:

  1. CPU usage
  2. Memory usage
  3. Latency per inference
