Architecting “Agentic” systems at the edge: an analysis of the Cloudflare platform

Introduction: The End of Lonely Inference, Eh?

2025 marks a massive breakaway in distributed architecture. We are witnessing the slow death of the classic “Request-Response” model for AI, making way for autonomous, persistent, and asynchronous agentic systems.

Let’s be honest: the old “Stateless Serverless” paradigm—looking at you, AWS Lambda and ephemeral containers—is about as useful for modern AI agents as a screen door on a submarine. Modern agents need long-term contextual memory, real-time state coordination, and execution that lives close to the user, not in some datacenter five provinces away.¹

This report isn’t just a fluff piece; it’s a comprehensive technical teardown of Cloudflare’s answer to this mess. Far from being a random bag of tools, their “Developer Platform” is a coherent attempt to rebuild the AI execution stack by moving compute, state, and memory to the Edge. We’re gonna look under the hood at Durable Objects and the Infire engine, and see if this stack can actually go toe-to-toe with the big hyperscalers (AWS, GCP) and the specialist keeners (Groq for inference, Pinecone for vectors).

Grab a double-double, let’s dig in. ☕

Inference Infrastructure: Deconstructing “Infire” and Latency

Running AI at the Edge isn’t just about scattering GPUs across the map like Timbits. It requires a complete re-engineering of the software stack to maximize constrained resources and minimize Time to First Token (TTFT).

Infire: Rust and the CUDA “Hat Trick”

Cloudflare pulled a power move by dumping generic Python/C++ inference servers for Infire, their proprietary engine written in Rust.² This was a necessary play to serve massive models (like Llama 3.1) on the Edge without burning through cash.

Here’s the “secret sauce” that makes Infire faster than a slapshot:

Granular CUDA Graphs & JIT: Instead of launching GPU kernels sequentially (which drags), Infire builds a dedicated “CUDA graph” for each batch size on the fly. Conceptually, this lets the driver replace a long series of kernel launches with a single monolithic launch, cutting CPU overhead drastically. That’s a huge deal on powerful cards like the H100.³

Paged KV Caching: Managing Key-Value (KV) cache memory is the bottleneck for LLM throughput. Infire breaks the memory needed for each request into non-contiguous blocks (pages). This eliminates fragmentation and allows for aggressive “continuous batching” without messing up individual latency (a sketch of the bookkeeping follows below).

The Rust Advantage: By ditching Python’s GIL (Global Interpreter Lock) and avoiding garbage collector pauses, Infire guarantees stable tail latency. Internal benchmarks show Infire finishes inference tasks up to 7% faster than vLLM 0.10.0 on unloaded H100s, with ~82% lower CPU usage (Infire uses 25% CPU compared to vLLM’s >140%).³ That’s pure efficiency, bud.
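
If you want to picture the paging trick, here’s a minimal TypeScript sketch of the bookkeeping. Infire is written in Rust and its internals aren’t public, so the class, the page size, and the free-list design below are purely illustrative: the point is that each sequence grabs fixed-size pages from a shared pool instead of reserving one big contiguous slab.

```typescript
// Conceptual sketch of paged KV-cache bookkeeping (illustrative only, not Infire's code).
// Each sequence maps its logical token positions onto fixed-size pages drawn from a
// shared free list, so a request never needs one big contiguous allocation.

const PAGE_SIZE = 16; // tokens per page, purely illustrative

class PagedKvAllocator {
  private freePages: number[];
  private pages = new Map<string, number[]>();    // sequence id -> physical page ids
  private tokenCount = new Map<string, number>(); // sequence id -> tokens stored

  constructor(totalPages: number) {
    this.freePages = Array.from({ length: totalPages }, (_, i) => i);
  }

  // Grow a sequence by `n` tokens, grabbing new pages only when a page boundary is crossed.
  append(sequenceId: string, n: number): void {
    const tokens = (this.tokenCount.get(sequenceId) ?? 0) + n;
    const owned = this.pages.get(sequenceId) ?? [];
    const needed = Math.ceil(tokens / PAGE_SIZE) - owned.length;
    if (needed > this.freePages.length) {
      throw new Error("KV cache full: evict or preempt a sequence");
    }
    for (let i = 0; i < needed; i++) owned.push(this.freePages.pop()!);
    this.pages.set(sequenceId, owned);
    this.tokenCount.set(sequenceId, tokens);
  }

  // When a request finishes, its pages go straight back to the shared pool.
  release(sequenceId: string): void {
    this.freePages.push(...(this.pages.get(sequenceId) ?? []));
    this.pages.delete(sequenceId);
    this.tokenCount.delete(sequenceId);
  }
}
```

Because finished requests hand their pages straight back to the pool, continuous batching can keep packing new sequences in without fragmenting GPU memory.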

Fighting the “Cold Start” (The Block Heater Approach)

We all know the pain of serverless “cold starts.” Waiting for a container to wake up is like trying to start your car at -40°C without a block heater.

Cloudflare deployed a routing strategy called “Shard and Conquer” to fix this.⁴

  • The Mechanism: Instead of random load balancing, they use a consistent hash ring to route traffic for a specific Worker to a specific subset of machines (“shards”); see the sketch after this list.
  • Architectural Impact: This maximizes the chance that a V8 Isolate is already “warm” (memory loaded, DB connection open).
  • The Result: Cloudflare claims a sustained 99.99% warm request rate, reducing cold starts by a factor of 10.⁵ Basically, the engine is always running and the cabin is toasty.
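
Here’s what that routing decision can look like in miniature. This is a hedged sketch, not Cloudflare’s code: it uses rendezvous (highest-random-weight) hashing, a close cousin of a consistent hash ring, and the hash function, machine names, and shard size are all invented for illustration.

```typescript
// Minimal sketch of shard-aware routing. A cheap, deterministic hash pins each
// Worker script to a small, stable subset of machines, so the same warm isolates
// keep getting hit instead of traffic spraying across the whole fleet.

function fnv1a(s: string): number {
  let h = 0x811c9dc5;
  for (let i = 0; i < s.length; i++) {
    h ^= s.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h >>> 0;
}

// Pick `shardSize` machines for a script out of the fleet, deterministically.
function shardFor(scriptId: string, machines: string[], shardSize: number): string[] {
  return [...machines]
    .sort((a, b) => fnv1a(scriptId + a) - fnv1a(scriptId + b)) // rendezvous-style ranking
    .slice(0, shardSize);
}

// Every request for "my-agent-worker" lands on the same 3 machines, so the odds
// that a warm V8 isolate (memory loaded, DB connection open) already exists are high.
const shard = shardFor("my-agent-worker", ["m1", "m2", "m3", "m4", "m5", "m6"], 3);
console.log(shard);
```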

The Tale of the Tape: Workers AI vs. The Big Guys

| Critical Metric | Cloudflare Workers AI | AWS Bedrock | Groq (LPU) | The “Tech-Savvy” Verdict |
|---|---|---|---|---|
| Time To First Token (TTFT) | Decent. Optimized by the Edge network, but limited by general-purpose GPUs. | Variable. Depends on the service tier and region. Intrinsic network latency due to centralization. | Insane. Their LPU architecture (SRAM-based) delivers unbeatable speeds (e.g., 0.14s on Llama 70B).⁶ | For pure speed (voice agents), Groq wins. For text/multimodal, Cloudflare is solid. |
| Cost Model | Neurons/Usage. Granular billing based on compute. Ideal for sporadic workloads. | Complex. On-Demand (pricey) vs Provisioned Throughput (expensive commitment). | Token-based. Aggressive pricing to gain market share, but long-term viability is TBD. | Cloudflare offers predictability. AWS is for when you have corporate budget to burn.⁷ |
| Data Locality | Edge. Inference runs physically near the user. Huge plus for GDPR/Sovereignty. | Regional. Data has to travel to us-east-1 or eu-central-1. | Centralized. Proprietary hardware in centralized hubs. | Cloudflare has a unique advantage for sovereignty-sensitive apps. |

“Stateful Serverless”: Durable Objects as Agent Primitives

The biggest innovation for AI agents here isn’t the AI—it’s the memory. Durable Objects (DO) reintroduce the Actor Model to infrastructure, solving the “agent memory” problem without the headache of managing a Redis cluster.¹

Anatomy of a Durable Object for AI

Think of a DO as a tiny, persistent micro-server that’s globally addressable.

  • Zero-Latency SQLite: This is a game-changer. A DO has a built-in SQLite database. No expensive TCP/SSL handshake to a remote RDS. It executes SQL queries in local memory or on local SSD.⁸ This means insane read/write speeds for your agent’s state (conversation history, plans, context).

  • WebSocket Coordination: A single DO can handle thousands of concurrent WebSocket connections. For a collaborative agent (e.g., multiple users talking to the same assistant), the DO acts as the quarterback, syncing everything without complex distributed locking.¹

  • The “Alarm” Pattern: Agents need to wake up on their own. DOs have setAlarm, allowing an agent to schedule its own future execution (“Wake me up in 10 mins to check that API”). This is essential for autonomous agents running long-term tasks.¹
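
Put together, a session-scoped agent object can look like the sketch below. It sticks to documented Durable Object APIs (SQLite storage, the WebSocket hibernation API, and alarms), but the class name, table schema, and the ten-minute follow-up are invented; you’d also need a SQLite-backed class migration and a binding in your wrangler config, which are omitted here.

```typescript
import { DurableObject } from "cloudflare:workers";

interface Env {}

export class AgentSession extends DurableObject<Env> {
  constructor(ctx: DurableObjectState, env: Env) {
    super(ctx, env);
    // Local SQLite: a synchronous call into the object's own storage, no network hop.
    this.ctx.storage.sql.exec(
      "CREATE TABLE IF NOT EXISTS memory (id INTEGER PRIMARY KEY, role TEXT, content TEXT)"
    );
  }

  async fetch(request: Request): Promise<Response> {
    // WebSocket coordination: accept via the hibernation API so the object can
    // sleep between messages without dropping connections.
    if (request.headers.get("Upgrade") === "websocket") {
      const [client, server] = Object.values(new WebSocketPair());
      this.ctx.acceptWebSocket(server);
      return new Response(null, { status: 101, webSocket: client });
    }
    return new Response("expected a websocket", { status: 400 });
  }

  async webSocketMessage(ws: WebSocket, message: string | ArrayBuffer) {
    // Persist the turn locally, then fan it out to every connected client.
    this.ctx.storage.sql.exec(
      "INSERT INTO memory (role, content) VALUES (?, ?)", "user", String(message)
    );
    for (const peer of this.ctx.getWebSockets()) peer.send(message);
    // The "alarm" pattern: the agent schedules its own wake-up in 10 minutes.
    await this.ctx.storage.setAlarm(Date.now() + 10 * 60 * 1000);
  }

  async alarm() {
    // Runs with no inbound request: check an API, summarize memory, call tools, etc.
    const recent = this.ctx.storage.sql
      .exec("SELECT content FROM memory ORDER BY id DESC LIMIT 20")
      .toArray();
    // ...hand `recent` to Workers AI or an external tool from here...
  }
}
```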

The “Gotchas” (Don’t be a Hoser)

Even the best tools have limits. Here’s how not to mess it up:

Single-Threaded: Like Node.js, a DO is single-threaded. If you run heavy CPU tasks (like parsing a massive JSON or heavy encryption), you’ll block every request to that object. Workaround: Offload the heavy lifting to a standard stateless Worker (sketched after these gotchas).

Storage Limits: You get 10GB per object.¹⁰ It’s not a Data Lake, eh? Don’t treat it like one. Store references (pointers) to R2 (Object Storage) or Vectorize, not the raw data.

Fixed Locality: Once created, a DO lives in a specific datacenter. If your user flies to the other side of the world, latency increases.
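
Back to the single-threaded gotcha: the offload pattern is a one-liner once you have a service binding. In this sketch, HEAVY_LIFTER is a hypothetical binding to a stateless Worker declared in wrangler config; the Durable Object fires one subrequest and awaits the result instead of grinding through the parse on its own thread.

```typescript
// `HEAVY_LIFTER` is a hypothetical service binding to a stateless Worker that does
// the CPU-heavy parsing; the Durable Object just awaits the subrequest.
interface Env {
  HEAVY_LIFTER: Fetcher;
}

async function parseHugeDocument(env: Env, raw: string): Promise<string> {
  const res = await env.HEAVY_LIFTER.fetch("https://internal/parse", {
    method: "POST",
    body: raw,
  });
  return res.text();
}
```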

Semantic Memory & RAG: Analyzing Vectorize

Your agent needs to remember context (RAG). That’s where Vectorize comes in.

Architecture of Vectorize

Vectorize isn’t just another vector DB; it’s designed for the Edge. It lets Workers query embeddings without the network latency of calling a third-party service like Pinecone (usually stuck in us-east-1).¹¹
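
Concretely, a RAG lookup can stay on-platform end to end. A hedged sketch, assuming an AI binding and a Vectorize index binding (the names AI and VECTORIZE come from your wrangler config) and one of the Workers AI embedding models:

```typescript
interface Env {
  AI: Ai;
  VECTORIZE: VectorizeIndex;
}

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    const question = new URL(request.url).searchParams.get("q") ?? "";

    // 1. Embed the query with Workers AI, on the same network that runs the Worker.
    const embedding = await env.AI.run("@cf/baai/bge-base-en-v1.5", { text: [question] });

    // 2. Query Vectorize over an internal call: no hop out to a third-party vector DB.
    const result = await env.VECTORIZE.query(embedding.data[0], { topK: 5 });

    return Response.json(result.matches.map((m) => ({ id: m.id, score: m.score })));
  },
};
```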

Comparative Matrix: Vectorize vs. The Rest

| Feature | Cloudflare Vectorize | Pinecone | pgvector (PostgreSQL) | Expert Verdict |
|---|---|---|---|---|
| Network Latency | Minimal. Internal RPC call from the Worker. | Medium. HTTP/gRPC latency to the cloud provider. | Variable. Depends on hosting (Supabase, Neon). | Vectorize wins on pure latency for Edge apps. |
| Indexing Speed | Asynchronous indexing. Optimized for fast reads. | High performance, proprietary algos. | Depends on extension (HNSW, IVFFlat). pgvectorscale beats Pinecone on some benchmarks.¹² | Pinecone leads on brute force performance at scale, but pgvector is catching up. |
| Features | Limited. Basic semantic search. Simple metadata filtering. | Rich. Hybrid search, Reranking, Unlimited namespaces. | Exhaustive. Full SQL, complex JOINs, ACID compliance. | Vectorize is still a “young buck”. If you need complex hybrid queries, stick with pgvector.¹³ |
| Cost | Predictable. Based on stored/queried dimensions. Included in Workers plans.¹¹ | High. Billed by Pod or Read/Write units. | Variable. Instance + Storage costs. | Vectorize often has a lower TCO for mid-sized projects (no egress fees). |

Insight: The word on the street (Reddit) is that Vectorize is still a bit green on metadata filtering compared to Weaviate or Qdrant.¹⁴ For enterprise RAG needing granular RBAC filtering, pgvector on a managed Postgres (Supabase, Neon) might be the safer bet for now.

Agent Frameworks & Standardization: MCP

Cloudflare isn’t just building the rink; they’re defining the rules of the game.

Cloudflare Agents SDK

The cloudflare/agents SDK gives you a structured way to deploy agents on Durable Objects. It abstracts away the WebSocket and SQL complexity, giving you “ready-to-use” primitives for memory and tool calling.¹⁵
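
Here’s roughly what that looks like. This is a hedged sketch based on the shapes in the cloudflare/agents README; the class name, state shape, and the ten-minute follow-up are invented, so check the SDK docs for the current API.

```typescript
// A small agent on the Agents SDK. The class maps onto a Durable Object under the
// hood; state, SQL, and scheduling come built in.
import { Agent } from "agents";

interface Env {}
interface State {
  history: { role: string; content: string }[];
}

export class SupportAgent extends Agent<Env, State> {
  initialState: State = { history: [] };

  async onRequest(request: Request): Promise<Response> {
    const { message } = await request.json<{ message: string }>();

    // Persist the turn in the agent's built-in state (synced to connected clients).
    this.setState({ history: [...this.state.history, { role: "user", content: message }] });

    // Schedule an autonomous follow-up in 10 minutes (the "alarm" pattern, wrapped).
    await this.schedule(600, "followUp", { message });

    return Response.json({ ok: true });
  }

  async followUp(payload: { message: string }) {
    // Runs later with no inbound request; call tools / models here.
  }
}
```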

Model Context Protocol (MCP): Playing Nice with Others

Cloudflare is going all-in on MCP, an open standard for connecting AI assistants to data systems.

  • Workers as MCP Servers: You can deploy a Worker as an MCP server, exposing APIs and data as “tools” for MCP clients (like Claude Desktop); see the sketch after this list.

  • Security Architecture: They solved the biggest MCP headache: auth. Via workers-oauth-provider, a Cloudflare-hosted MCP server handles OAuth natively. This positions Cloudflare as the secure “broker” between LLMs and your private data (SaaS, internal APIs).¹⁶
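
The remote MCP server pattern from Cloudflare’s guide boils down to something like the sketch below. The tool, its schema, and the mount path are placeholders, and the OAuth wiring via workers-oauth-provider is omitted.

```typescript
import { McpAgent } from "agents/mcp";
import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";
import { z } from "zod";

export class OrdersMCP extends McpAgent {
  server = new McpServer({ name: "Orders", version: "1.0.0" });

  async init() {
    // Expose an internal API as an MCP "tool" that clients like Claude Desktop can call.
    this.server.tool("lookupOrder", { orderId: z.string() }, async ({ orderId }) => ({
      content: [{ type: "text", text: `Order ${orderId}: shipped` }],
    }));
  }
}

// Serve the MCP endpoint over SSE from the Worker.
export default OrdersMCP.mount("/sse");
```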

Security & Governance: The Bouncer at the Door

Deploying agents without security is like leaving your front door unlocked in bear country.

AI-SPM: Shining a Light on “Shadow AI”

Companies are losing control over AI usage. Cloudflare uses its network position (Gateway/WARP) to offer AI Security Posture Management (AI-SPM).

  • Passive Detection: It sniffs out traffic to known and unknown AI services.
  • Governance: It applies policies like “Block uploading PII to public chatbots.” It’s basically a radar for your network.¹⁸

Firewall for AI: WAF for LLMs

This sits in front of your models.

  • Prompt Analysis: It scans JSON payloads for prompt injection attacks (“Ignore all previous instructions…”) and toxic content.
  • Mechanism: It uses lightweight models (like Llama Guard) to score requests in real-time before they hit the expensive model. It saves you money by rejecting bad requests early.¹⁹
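
You don’t write code for the managed Firewall for AI, but the underlying pattern, a cheap guard model screening requests before the big model sees them, is easy to sketch yourself. The guard model id and its response format below are assumptions to verify against the Workers AI catalog.

```typescript
interface Env {
  AI: Ai;
}

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    const { prompt } = await request.json<{ prompt: string }>();

    // Cheap screening pass first (assumed Llama Guard deployment; verify the model id).
    const guard = await env.AI.run("@cf/meta/llama-guard-3-8b", {
      messages: [{ role: "user", content: prompt }],
    });

    // Llama Guard answers "safe" or "unsafe" plus a category; response shape assumed.
    if (JSON.stringify(guard).includes("unsafe")) {
      // Reject before the expensive model generates a single token.
      return new Response("Blocked by AI firewall policy", { status: 403 });
    }

    // Only clean prompts reach the large model.
    const answer = await env.AI.run("@cf/meta/llama-3.1-8b-instruct", {
      messages: [{ role: "user", content: prompt }],
    });
    return Response.json(answer);
  },
};
```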

Economic Analysis: The Baselime Case Study

Let’s talk loonies and toonies. The acquisition of Baselime by Cloudflare gives us hard data on the savings.²⁰

Cost Breakdown (Annualized)

They slashed their bill from $708,100 (AWS) to $118,625 (Cloudflare). That’s an 83% reduction.

| Component | AWS Cost (Old) | Cloudflare Cost (New) | Reduction | Root Cause Analysis |
|---|---|---|---|---|
| Compute | $237,250 (Lambda) | $9,125 (Workers) | ~96% | Billing Model. Lambda charges for “Wall-clock time” (including waiting for DBs). Workers charges for active CPU time. For I/O-heavy agents, this is massive. |
| CDN | $51,100 (CloudFront) | $0 (Included) | 100% | CDN is native and included. No extra line item. |
| Data | $419,750 (Kinesis/EC2) | $109,500 (Analytics Engine) | ~74% | Replacing a self-managed ClickHouse cluster on EC2 with a serverless managed service. |

Strategic Insight: This proves that for I/O-bound workloads (like AI orchestration), Cloudflare’s economic model is structurally superior to AWS Lambda.
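
A quick back-of-envelope with invented numbers shows why the compute line collapses for I/O-bound agents:

```typescript
// Back-of-envelope illustration of why I/O-bound agents are so much cheaper on a
// CPU-time billing model. Numbers are invented for illustration, not quotes.

const wallClockMs = 2_000; // request spends 2 s waiting on an LLM + a database
const activeCpuMs = 15;    // but only ~15 ms actually computing

// Wall-clock billing (Lambda-style) charges for the full 2 s of duration.
// CPU-time billing (Workers-style) charges only for the 15 ms of real work.
const ratio = wallClockMs / activeCpuMs; // ≈ 133x less billable time per request
console.log(`Billable time shrinks by roughly ${Math.round(ratio)}x for this request`);
```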

Pay-per-Crawl: New Rules for the Web (x402)

Bots are scraping the web to train AI, often for free. Cloudflare introduced a protocol innovation: Pay-per-crawl based on HTTP 402.²¹

  • The Problem: Sites block AI bots to save bandwidth/IP, creating a “dark web” for LLMs.
  • The Solution (x402): A Worker returns HTTP 402 Payment Required. The bot can replay the request with a payment token (sketched after this list).
  • Impact: Cloudflare acts as the “Merchant of Record.” It turns a fight into a marketplace, letting agents buy high-quality real-time data.
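
From the origin’s side, the handshake can be sketched like this. The header name, price payload, and token check are placeholders; the real exchange is defined by the x402 protocol, with Cloudflare acting as merchant of record.

```typescript
interface Env {}

export default {
  async fetch(request: Request, _env: Env): Promise<Response> {
    const paymentToken = request.headers.get("X-Payment"); // placeholder header name

    if (!paymentToken) {
      // First attempt: tell the crawler what this content costs.
      return Response.json(
        { price: "0.002", currency: "USD", reason: "AI crawl access" },
        { status: 402, statusText: "Payment Required" }
      );
    }

    // Replayed request with a token: verify it (stubbed here), then serve the content.
    const paid = paymentToken.length > 0; // stand-in for real settlement verification
    return paid
      ? new Response("<article>high-quality, real-time content</article>", {
          headers: { "content-type": "text/html" },
        })
      : new Response("Invalid payment token", { status: 402 });
  },
};
```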

Synthesis and Final Recommendations

Architectural Decision Matrix

| Scenario | Recommended Architecture | Technical Justification |
|---|---|---|
| Conversational Agent (SaaS) | Full Cloudflare (DO + Workers AI) | Durable Objects are mandatory for WebSocket coordination and session state. Infire is good enough for standard inference. |
| Complex Enterprise RAG | Workers + Pinecone/Supabase | Vectorize is too limited for complex metadata filtering (RBAC). Use Workers for orchestration, but a mature vector DB for storage. |
| Real-Time Voice AI | Workers + Groq | Cloudflare’s inference latency might be too high for voice. Use Workers for logic, Groq for speed. |
| Heavy Batch Processing | AWS (Fargate) or Modal | Worker limits (CPU time/RAM) make heavy batch jobs (ingesting thousands of PDFs) risky. |

Conclusion: The “Cloud of Agents”

Cloudflare has successfully built the first true “Serverless Cloud” for the AI Agent era. By vertically integrating Compute (Workers), State (Durable Objects), Memory (Vectorize), and Inference (Infire), they’ve removed the “integration tax” that makes AWS/GCP a headache.

For the seasoned pro, it’s not perfect: the ecosystem is closed (Vendor Lock-in on DOs is real), and some tools like Vectorize are still maturing. But if you want to build distributed, performant, and economically viable agentic systems in 2025, Cloudflare offers the best bang for your buck. You just have to shift your mindset: stop thinking “servers” and “databases,” and start thinking “distributed objects” and “event streams.”

So, give’r. Start coding, and keep your stick on the ice. 🏒

Works Cited

  1. A list of reasons why you should be using Cloudflare Workers for building your AI agent infrastructure/product/personal assistant – Sunil Pai, accessed December 7, 2025, https://sunilpai.dev/posts/cloudflare-workers-for-ai-agents/

  2. AI Week 2025 – Updates and announcements – Cloudflare, accessed December 7, 2025, https://www.cloudflare.com/innovation-week/ai-week-2025/updates/

  3. How we built the most efficient inference engine for Cloudflare’s …, accessed December 7, 2025, https://blog.cloudflare.com/cloudflares-most-efficient-ai-inference-engine/

  4. Eliminating Cold Starts 2: shard and conquer – The Cloudflare Blog, accessed December 7, 2025, https://blog.cloudflare.com/eliminating-cold-starts-2-shard-and-conquer/

  5. Cloudflare Achieves 99.99% Warm Start Rate for Workers with ‘Shard and Conquer’ Consistent Hashing – InfoQ, accessed December 7, 2025, https://www.infoq.com/news/2025/10/workers-shard-conquer-cold-start/

  6. AWS Bedrock vs Groq: Picking the Right AI Inference Engine for Your Workloads – Medium, accessed December 7, 2025, https://medium.com/@cloudshim/aws-bedrock-vs-groq-picking-the-right-ai-inference-engine-for-your-workloads-f1eb2a795856

  7. New Amazon Bedrock service tiers help you match AI workload performance with cost, accessed December 7, 2025, https://aws.amazon.com/blogs/aws/new-amazon-bedrock-service-tiers-help-you-match-ai-workload-performance-with-cost/

  8. Durable Objects on Workers Free plan · Changelog – Cloudflare Docs, accessed December 7, 2025, https://developers.cloudflare.com/changelog/2025-04-07-durable-objects-freetier/

  9. Overview · Cloudflare Durable Objects docs, accessed December 7, 2025, https://developers.cloudflare.com/durable-objects/

  10. Limits · Cloudflare Durable Objects docs, accessed December 7, 2025, https://developers.cloudflare.com/durable-objects/platform/limits/

  11. Vectorize: a vector database for shipping AI-powered applications to …, accessed December 7, 2025, https://blog.cloudflare.com/vectorize-vector-database-open-beta/

  12. Pgvector vs. Pinecone: Vector Database Performance and Cost Comparison – Tiger Data, accessed December 7, 2025, https://www.tigerdata.com/learn/pgvector-vs-pinecone

  13. CloudFlare is the cheapest + fastest option for Cloud Computing… yet the slowest and most expensive option for Artificial Intelligence – Reddit, accessed December 7, 2025, https://www.reddit.com/r/CloudFlare/comments/1olswdi/cloudflare_is_the_cheapest_fastest_option_for/

  14. cloudflare Vectorize comparison (eg w PgVector) : r/vectordatabase – Reddit, accessed December 7, 2025, https://www.reddit.com/r/vectordatabase/comments/1h51nel/cloudflare_vectorize_comparison_eg_w_pgvector/

  15. cloudflare/agents: Build and deploy AI Agents on Cloudflare – GitHub, accessed December 7, 2025, https://github.com/cloudflare/agents

  16. Build a Remote MCP server · Cloudflare Agents docs, accessed December 7, 2025, https://developers.cloudflare.com/agents/guides/remote-mcp-server/

  17. Build and deploy Remote Model Context Protocol (MCP) servers to Cloudflare, accessed December 7, 2025, https://blog.cloudflare.com/remote-model-context-protocol-servers-mcp/

  18. AI Security Suite | Securely scale AI adoption – Cloudflare, accessed December 7, 2025, https://www.cloudflare.com/ai-security/

  19. Firewall for AI – Cloudflare, accessed December 7, 2025, https://www.cloudflare.com/application-services/products/firewall-for-ai/

  20. Moving Baselime from AWS to Cloudflare: simpler architecture, improved performance, over 80% lower cloud costs, accessed December 7, 2025, https://blog.cloudflare.com/80-percent-lower-cloud-cost-how-baselime-moved-from-aws-to-cloudflare/

  21. Launching the x402 Foundation with Coinbase, and support for x402 transactions, accessed December 7, 2025, https://blog.cloudflare.com/x402/
