Building an “Unstoppable” Serverless Payment System on AWS (Circuit Breaker Pattern)

What happens when your payment gateway goes down? In a traditional app, the user sees a spinner, then a “500 Server Error,” and you lose the sale.
I wanted to build a system that refuses to crash. Even if the backend database is on fire, the user’s order should be accepted, queued, and processed automatically when the system heals.

Here is how I implemented the Circuit Breaker Pattern using AWS Step Functions, Java Lambda, and Event-Driven Architecture, without provisioning a single server.

The Tech Stack

I chose a hybrid, cloud-native stack to enforce strict decoupling between the Frontend and the Backend.

  • Frontend: Python (Streamlit) – Acts as the Store & Admin Dashboard.
  • Orchestration: AWS Step Functions – The “Brain” handling the logic.
  • Compute: AWS Lambda (Java 11) – The “Worker” handling business logic.
  • State Store: Amazon DynamoDB – Stores circuit status (Open/Closed) and Order History.
  • Resiliency: Amazon SQS – The “Parking Lot” for failed orders.
  • Observability: Grafana Cloud (Loki) – Log aggregation.
  • Infrastructure: Terraform – Complete IaC.

Note: Use Terraform to manage your resources. As a best practice, keep the Terraform definitions for your resources in separate files so they are easy to create, delete, or update.

The Problem: Cascading Failures

In microservices, if Service A calls Service B and Service B hangs, Service A eventually hangs too. If thousands of users click “Pay,” your database gets hammered with retries, and you effectively DDoS yourself.
The Solution? A Circuit Breaker.
Just like in your house: if there is a surge, the breaker trips to save the house from burning down.

High Level Architecture:

I designed the system to handle three distinct states:

  1. Green Path (Closed): The backend is healthy. Orders process immediately.
  2. Red Path (Open): The backend is crashing. The system detects this, “Trips” the circuit, and stops sending traffic to the backend.
  3. Yellow Path (Recovery): Orders are routed to a Queue (SQS) to be retried later automatically.
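
As a rough illustration of the three paths, here is a minimal in-memory sketch in Python. It is not the project's code: `route_order`, the `circuit` dict (standing in for the DynamoDB item), the `retry_queue` list (standing in for SQS), and the `max_attempts` value are all hypothetical.

```python
from typing import Callable, Dict, List


def route_order(order: dict,
                circuit: Dict[str, str],
                retry_queue: List[dict],
                charge: Callable[[dict], None],
                max_attempts: int = 3) -> str:
    """Return the order status after routing it through the three paths."""
    # Yellow path: the circuit is already open, so park the order for later.
    if circuit.get("status") == "OPEN":
        retry_queue.append(order)
        return "QUEUED"

    for _ in range(max_attempts):
        try:
            # Green path: the backend is healthy and the order goes through.
            charge(order)
            return "COMPLETED"
        except Exception:
            # A real workflow would wait between attempts; omitted here.
            continue

    # Red path: every attempt failed, so trip the circuit to stop further traffic.
    circuit["status"] = "OPEN"
    return "FAILED"
```

Calling it with a backend that always raises returns FAILED and flips the circuit to OPEN; the next order is then queued without touching the backend at all.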

The HLD may look scary, but it will make your app unstoppable.

[Figure: HLD (high-level design) diagram]

How It Works: The Logic Flow

The core of this project is an AWS Step Functions State Machine. It acts as a traffic controller.

  1. The Check
    Every time a user clicks “Pay,” the workflow first checks DynamoDB.
    Is the Circuit Status OPEN?
    If YES: Skip the backend entirely.
    If NO: Proceed to the Java Lambda.
  2. The Execution
    The workflow invokes a Java Lambda to process the payment.
    Success: It updates the Order History to COMPLETED and emits an event to EventBridge (triggering a customer email via SNS).
    Failure: It catches the error and retries with Exponential Backoff (Wait 1s, then 2s).
  3. The “Trip”
    If the backend fails repeatedly, the Step Function:
    Writes Status: OPEN to DynamoDB.
    Alerts the SysAdmin via SNS (“Critical: Circuit Tripped”).
    Marks the order as FAILED in the dashboard.
  4. The Self-Healing (Auto-Retry)
    This is the coolest part. If the circuit is Open, new orders are not rejected. They are marked as QUEUED and sent to Amazon SQS.
    A “Retry Handler” Lambda listens to this queue.
    It waits for a delay (e.g., 30s).
    It re-submits the order to the Step Function.
    If the backend is fixed, the order processes. If not, it goes back to the queue.
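
The Retry Handler from the last step could look roughly like the sketch below. The post does not say which language the handler is written in, so Python is used here purely for illustration. The `STATE_MACHINE_ARN` environment variable, the `resubmit_orders` helper, and passing the client in as a parameter are assumptions, and the delay is assumed to come from the queue's delivery delay rather than from code. In a deployed function, `sfn` would be `boto3.client("stepfunctions")`.

```python
import json
import os


def resubmit_orders(event: dict, sfn, state_machine_arn: str) -> int:
    """Re-submit every queued order in an SQS batch to the state machine.

    `event` has the shape Lambda receives from an SQS trigger;
    `sfn` is any object exposing boto3's Step Functions `start_execution` call.
    """
    resubmitted = 0
    for record in event.get("Records", []):
        # SQS delivers the message body as a string.
        order = json.loads(record["body"])
        # Start a fresh execution. The state machine re-checks the circuit
        # itself, so an order that arrives while it is still OPEN is queued again.
        sfn.start_execution(
            stateMachineArn=state_machine_arn,
            input=json.dumps(order),
        )
        resubmitted += 1
    return resubmitted


def lambda_handler(event, context):
    # boto3 is imported lazily so resubmit_orders can be exercised
    # locally with a stub client.
    import boto3

    arn = os.environ["STATE_MACHINE_ARN"]  # hypothetical variable name
    count = resubmit_orders(event, boto3.client("stepfunctions"), arn)
    return {"resubmitted": count}
```

Keeping the handler this thin means the circuit logic lives in one place, the state machine, and the handler never needs to read DynamoDB itself.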

[Figure: LLD (low-level design) diagram]

Tested Data Scenarios:

  1. SUCCESS
    [Screenshot: success scenario]

  2. CHAOS Mode
    [Screenshot: chaos scenario]

Observability & Monitoring

I integrated Grafana Cloud (Loki) to ingest logs from CloudWatch.

  • Streamlit Dashboard: Shows live status of orders (PENDING → COMPLETED or FAILED).
  • Grafana Explore: Allows deep searching of logs using {service="order-processor"} to find specific stack traces.
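
The same label selector can also be used outside Grafana Explore, through Loki's HTTP API. The sketch below only builds the request URL for the `query_range` endpoint. The `loki_query_url` helper and the base URL in the usage note are hypothetical, and Grafana Cloud additionally requires authentication, which is not shown.

```python
from urllib.parse import urlencode


def loki_query_url(base_url: str, service: str,
                   contains: str = "", limit: int = 100) -> str:
    """Build a Loki `query_range` URL that selects one service's log stream."""
    # The stream selector must use straight double quotes.
    logql = '{service="%s"}' % service
    if contains:
        # Line filter: keep only log lines containing the given text.
        # Assumes `contains` itself has no double quotes.
        logql += ' |= "%s"' % contains
    params = urlencode({"query": logql, "limit": limit})
    return f"{base_url.rstrip('/')}/loki/api/v1/query_range?{params}"
```

For example, `loki_query_url("https://loki.example.invalid", "order-processor", "Exception")` produces a URL whose query narrows the order-processor stream to lines that mention an exception.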

Key Learnings & Trade-offs

  1. Complexity vs. Reliability
    This architecture is more complex than a simple API call. You have more moving parts (Queues, State Machines). However, the payoff is High Availability: the frontend never sees a crash.
  2. The “Ghost” Data
    By default, when a Catch block fires in Step Functions, the original input (Order ID) is replaced by the error output. I learned to set ResultPath on the Catch so the error is added to the original data instead, which let me update the database even after a crash.
  3. Cost Optimization
    Step Functions Standard Workflows are billed per state transition, which gets expensive at scale. For production, I would switch this to Express Workflows and run the Lambdas on ARM64 (Graviton) to reduce costs (by roughly 40%, in my estimate).
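
To make the ResultPath point from item 2 concrete, here is a hedged sketch. The first part is a Task state fragment in Amazon States Language, written as a Python dict. The state names, the function name, and the `$.error` path are illustrative choices, not the project's actual definition. The second part, `apply_catch`, is a hypothetical helper that mimics (for top-level paths only) what a Catch hands to the next state.

```python
import copy

# Task state fragment (Amazon States Language as a Python dict).
process_payment = {
    "Type": "Task",
    "Resource": "arn:aws:states:::lambda:invoke",
    "Parameters": {"FunctionName": "order-processor", "Payload.$": "$"},
    "Retry": [{
        "ErrorEquals": ["States.ALL"],
        "IntervalSeconds": 1,   # first retry after 1s
        "BackoffRate": 2.0,     # second retry after 2s
        "MaxAttempts": 2,
    }],
    "Catch": [{
        "ErrorEquals": ["States.ALL"],
        "ResultPath": "$.error",  # attach the error instead of replacing the input
        "Next": "TripCircuit",
    }],
    "Next": "MarkCompleted",
}


def apply_catch(state_input: dict, error: dict, result_path: str = "$") -> dict:
    """Mimic the data a Catch clause passes on (top-level paths only)."""
    if result_path == "$":
        # Default behaviour: the error output replaces the whole input.
        return error
    key = result_path[2:]  # strip the leading "$."
    merged = copy.deepcopy(state_input)
    merged[key] = error
    return merged
```

With the default path, the next state receives only the `Error` and `Cause` fields and the order ID is gone. With `$.error`, it receives the original order plus an `error` key, which is what lets a later state mark that specific order as FAILED.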

Application looks like:

  1. Order placing UI reference
    [Screenshot: order placing UI]

  2. Admin UI

Conclusion
This project demonstrates how Event-Driven Architecture allows you to build systems that degrade gracefully. Instead of losing revenue during a crash, we simply “pause” the traffic and process it when the storm passes.
Technologies used: AWS, Java, Python, Terraform, Grafana.

Follow for more. Thanks for reading!
