Why Your AI Agents Need a Shell (And How to Give Them One Safely)

Claude Code changed how I think about agent architecture. It outperforms agents loaded up with 50 different MCP servers and custom integrations, and under the hood it’s remarkably simple. No tool sprawl. No massive schema definitions eating up context. Just a filesystem and bash.

The AI community is starting to piece together why this works so well. And it’s making a lot of us reconsider the complexity we’ve been bolting onto our own agents.

The Problem With How We Build Agent Context

Most of us default to one of three approaches when building agents that need to work with data:

  1. Prompt stuffing. Throw everything into the context window and hope for the best.
  2. Tool libraries. Connect MCP servers, define custom tools, give the agent ways to fetch what it needs.
  3. Vector search. Embed your data, run semantic similarity, pray the retrieval is relevant.

These approaches work. But they each have tradeoffs.

Prompt stuffing hits token limits fast. Tool libraries solve the capability problem, but every tool you add means more schema definitions in context, more options for the model to reason about, and more surface area for things to go wrong. Vector search is great for semantic similarity but struggles when you need a specific value from structured data.

There’s a fourth option that’s been hiding in plain sight.

LLMs Already Know How to Navigate Filesystems

Think about what LLMs were trained on. Billions of lines of code. Countless examples of developers navigating directories, grepping through files, managing state across complex codebases. These models have seen grep, cat, find, and awk more times than any human ever will.

This isn’t some new capability we need to teach them. Filesystem operations are native to how these models think. They’re part of the training distribution, not bolted on behavior.

The Vercel team figured this out when rebuilding their internal agents. They replaced most of their custom tooling with just two things: a filesystem tool and a bash tool. Their sales call summarization agent dropped from around $1.00 per call to about $0.25 on Claude Opus. And the output quality improved.

Why This Works

When you give an agent bash access with a filesystem, you’re not just giving it a way to read files. You’re giving it a general-purpose interface to everything.

It can connect to anything. curl to hit APIs. CLI tools to talk to databases, cloud services, Kubernetes clusters. The ecosystem of command-line tools is massive and battle-tested. Instead of wiring up a custom integration, the agent uses the same tools developers have used for decades.

It can store and retrieve its own context. The agent can write findings to a file, move on to something else, and come back later. It can build up notes, keep track of what it’s tried, maintain state across a long task. The filesystem becomes working memory.

Retrieval is precise. grep -r "pricing objection" transcripts/ returns exact matches. When you need a specific value, you get that value. No similarity scores, no “top k” approximations.

Data maps naturally to directories. Customer records, ticket history, CRM exports. These have natural hierarchies. You’re not flattening relationships or deciding upfront what to embed.

Everything is debuggable. When something goes wrong, you can see exactly what files the agent read, what commands it ran, what it wrote. No black boxes.

The intuition is simple: if an agent can navigate a codebase to find bugs, it can navigate your business data the same way. And if it can run shell commands, it can interact with almost any system you already use.
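To make that concrete, here’s a minimal sketch using a made-up crm/ export; the directory layout and file contents are invented for illustration:

```shell
# Build a tiny sample "CRM export" (layout is hypothetical)
mkdir -p crm/accounts/acme
printf 'Q3 call: raised a pricing objection, wants annual discount.\n' \
  > crm/accounts/acme/2024-09-12.txt
printf 'Q4 call: renewed, no objections.\n' \
  > crm/accounts/acme/2024-12-03.txt

# The same moves an agent makes in a codebase work on business data:
find crm -type f -name '*.txt'        # discover what exists
grep -rl 'pricing objection' crm/     # exact-match retrieval, no similarity scores
```

No embeddings, no schemas; the account hierarchy is the index.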

The Security Problem

Here’s where things get interesting. And where most people stop.

Giving an AI agent bash access to your actual system is obviously terrifying. You trust the model’s reasoning, but you can’t trust unconstrained execution. One hallucinated rm -rf and you’re having a very bad day.

The solution is sandboxing. Run the agent’s commands in an isolated environment that can’t touch your production systems. The agent reasons about files, the sandbox handles execution safely.

This is the core insight: separate what the agent can think about from what it can actually do. Let it explore a mounted directory structure freely while keeping everything else completely inaccessible.

The architecture looks like this:

Agent receives task
        ↓
Explores filesystem (ls, find)
        ↓
Searches for relevant content (grep, cat)
        ↓
Sends context + request to LLM
        ↓
Returns structured output

The bash execution runs in an isolated sandbox. You get the power of native filesystem operations without the risk.
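Concretely, the explore-and-search steps might look like this inside the sandbox; the transcripts/ path and file names are illustrative:

```shell
# Step 1: explore the mounted data
ls /workspace/transcripts/

# Step 2: narrow to the relevant files, then pull exact lines
grep -l 'pricing objection' /workspace/transcripts/*.txt
grep -n 'pricing objection' /workspace/transcripts/call-0142.txt

# Only the matching lines go to the LLM as context, not the whole
# directory; that's where the token savings come from
```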

What Good Isolation Looks Like

A proper sandbox needs a few things:

Process isolation. Commands run in a contained environment that can’t break out to the host system. WebAssembly runtimes are great for this because they provide memory-safe execution by design.

Directory mounting. You expose only the directories your agent needs. Mount your project folder at /workspace, and everything else simply doesn’t exist from the agent’s perspective.

Session persistence. For multi-step agent workflows, you want configuration and state to persist across commands without bleeding into other sessions.

Visible execution paths. You should be able to see exactly what happened. stdout captured, stderr captured, full command history. No black boxes.

The security model here is defense in depth. Even if the agent generates something unexpected, the sandbox constrains what can actually happen.

But Wait, What About MCP?

If you’ve been following the AI tooling space, you’ve probably heard of MCP. The Model Context Protocol. Anthropic released it in late 2024, and it’s become the de facto standard for connecting agents to external tools and data. OpenAI, Google, and pretty much everyone else has adopted it. The ecosystem exploded fast.

MCP is genuinely clever. Before it existed, if you wanted to connect an AI agent to GitHub and Slack and your database, you needed custom integrations for each pairing. N tools times M agents equals a lot of duplicated work. MCP turned that into an N+M problem: build one MCP server for your tool, one MCP client for your agent, and everything connects.

So why am I talking about CLI tools instead?

Here’s the thing. As MCP usage has scaled, some real problems have emerged. When you connect too many MCP servers, all those tool definitions start eating your context window. Every tool has a description, parameters, return schemas. Connect to dozens of servers with hundreds of tools, and you’re burning tokens before the agent even starts working.

Compare that with CLI tools. They follow the Unix philosophy: small, modular, composable. They take text in, give text out, and can be chained together. They’re battle-tested. git, grep, curl, jq have been in production for decades. They’re stable, well-documented, and complete.

And critically: LLMs have seen these tools billions of times during training. They know the syntax. They know the flags. They know common patterns and error messages. This isn’t in-context learning. It’s deep, internalized knowledge.

MCP servers, by contrast, are new. The agent has to learn each tool from scratch based on the schema you provide. It’s bolted-on capability versus native understanding.

There’s also the composability problem. MCP isn’t truly composable in the Unix-pipe sense. You can’t easily chain MCP tool outputs into other MCP tool inputs the way you can with CLI commands. Meanwhile, grep "error" logs.txt | wc -l is something every model already knows how to construct.
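As a sketch, here’s the kind of pipeline an agent can emit as a single command, where the equivalent in chained MCP calls would take several schema-mediated round trips. The log format is assumed, with the error type in the third field:

```shell
# Count error occurrences by type in a hypothetical app.log, most
# frequent first: grep filters, awk projects, sort/uniq -c aggregates.
grep 'ERROR' app.log | awk '{print $3}' | sort | uniq -c | sort -rn
```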

Don’t get me wrong. MCP has its place. It’s great for distribution and discoverability, especially if you’re building tools for non-technical users who can’t configure CLI environments. And for certain complex, domain-specific workflows where you want to encapsulate expert knowledge, a purpose-built MCP server can be more reliable than hoping the agent strings together the right CLI calls.

But for raw capability? For giving an agent the power to actually do things? CLI tools win. They’re more token-efficient, more composable, and leverage exactly what models are already good at.

The catch is security. Letting an agent run arbitrary shell commands is dangerous. Which brings us back to the core problem: how do you give an agent CLI access safely?

Introducing Bashlet

This is exactly the problem I built Bashlet to solve.

Bashlet is an open-source tool that gives AI agents sandboxed bash access. It supports multiple isolation backends depending on your platform and security needs:

Wasmer (WASM). Cross-platform, lightweight sandbox. Works on macOS, Linux, and Windows. Startup time around 50ms.

Firecracker (microVM). Full Linux VM isolation for when you need hardware-level security. Linux only, requires KVM. Boots in about 125ms.

By default, Bashlet auto-selects the best available backend. On Linux with KVM, you get Firecracker’s VM isolation. Everywhere else, you get Wasmer’s WASM sandbox.

The basic workflow is simple:

# Create a session with a mounted directory
bashlet create --name demo --mount ./src:/workspace

# Run commands in isolation
bashlet run demo "ls /workspace"
bashlet run demo "grep -r 'TODO' /workspace"

# Terminate when done
bashlet terminate demo

For quick one-off commands, skip the session management:

bashlet exec --mount ./src:/workspace "ls /workspace"
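As a sketch of how the one-shot form composes inside an agent loop, using only the flags shown above; the command string stands in for whatever the model generated:

```shell
# Run an untrusted, model-generated command against a mounted copy
# of the data, then capture the output for the next LLM turn.
AGENT_CMD='grep -rc "TODO" /workspace'   # pretend the model produced this
RESULT=$(bashlet exec --mount ./src:/workspace "$AGENT_CMD")
echo "$RESULT"
```

Even if the generated command is garbage, the blast radius is the mounted directory, nothing else.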

Presets for Reproducible Environments

One thing I found myself doing constantly was setting up the same environment over and over. Mount this directory, set these env vars, run these setup commands. So I added presets.

Define them once in your config file (~/.config/bashlet/config.toml):

[presets.kubectl]
mounts = [
  ["/usr/local/bin/kubectl", "/usr/local/bin/kubectl", true],
  ["~/.kube", "/home/.kube", true]
]
env_vars = [["KUBECONFIG", "/home/.kube/config"]]
setup_commands = ["kubectl version --client"]

[presets.nodejs]
mounts = [["~/.npm", "/home/.npm", false]]
env_vars = [["NODE_ENV", "development"]]
workdir = "/app"

Then use them anywhere:

# Create a session with a preset
bashlet create --name k8s-env --preset kubectl

# One-shot command with a preset
bashlet exec --preset kubectl "kubectl get pods"

# Auto-create session if it doesn't exist, apply preset
bashlet run dev -C --preset nodejs "npm install"

Presets can also specify a backend. Want certain workloads to always run in Firecracker VMs? Define it in the preset:

[presets.dev-vm]
backend = "firecracker"
rootfs_image = "~/.bashlet/images/dev.ext4"
env_vars = [["EDITOR", "vim"]]

Changes to the rootfs persist across sessions. Install packages once, have them available forever.
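As a sketch, persistence means an install in one session survives into the next. This assumes the dev-vm preset above, a Linux host with KVM, and an Alpine-based rootfs image; the package manager depends on what your image ships:

```shell
# First session: install a tool into the rootfs
bashlet create --name setup --preset dev-vm
bashlet run setup "apk add jq"
bashlet terminate setup

# Later session, same image: the tool is still there
bashlet create --name work --preset dev-vm
bashlet run work "jq --version"
bashlet terminate work
```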

Available on GitHub. Install takes about 30 seconds:

curl -fsSL https://bashlet.dev/install.sh | sh

This installs Bashlet along with Wasmer automatically. On Linux, it also grabs Firecracker.

Why This Matters Now

As models get better at coding, agents built on filesystem primitives automatically get better too. You’re leveraging improvements in the training distribution instead of fighting against custom tooling that needs constant maintenance.

The future of agents might be surprisingly simple. Maybe the best architecture is almost no architecture at all. Just filesystems and bash, running safely in a sandbox.

Give your agents the tools they were trained on.
