Backend Engineering

API Rate Limiting Design: What I Learned After a Traffic Spike Took Down Our Service

Design effective API rate limiting with token bucket, sliding window, and fixed window algorithms. Redis implementation included.

By SouvenirList

It was a Tuesday afternoon when our entire API went down. Not because of a bug, not because of a server crash — but because a single client was hammering our endpoints with 47,000 requests per minute. Their integration had a retry loop with no backoff, and every failed request triggered three more retries. Within fifteen minutes, our database connection pool was exhausted, and every other customer was getting 503 errors.

That incident cost us a full day of downtime and a very uncomfortable call with our biggest enterprise customer. It also taught me the most important lesson of my backend career: rate limiting is not optional — it is infrastructure. If you do not control how your API is consumed, your worst client will eventually dictate your availability for everyone else.

This guide covers everything I have learned about designing rate limiting systems, from algorithm selection to Redis-based implementation to the organizational decisions that matter just as much as the code.


TL;DR — Rate Limiting at a Glance

Algorithm              | Best For                       | Burst Handling         | Complexity
Fixed Window           | Simple APIs, internal tools    | Poor (boundary bursts) | Low
Sliding Window Log     | Precise limiting, audit trails | Good                   | Medium
Sliding Window Counter | Most production APIs           | Good                   | Medium
Token Bucket           | APIs needing controlled bursts | Excellent              | Medium
Leaky Bucket           | Smooth, predictable throughput | None (by design)       | Low

If you want a single recommendation: token bucket for public APIs, sliding window counter for internal services.


Why Rate Limiting Matters More in 2026

Rate limiting has always been important, but two trends have made it critical.

First, AI agents are consuming APIs at machine speed. When I started building APIs ten years ago, the fastest client was a developer running a script. Today, autonomous AI agents make hundreds of API calls per second as part of multi-step workflows. A single misconfigured agent can generate more traffic than your entire human user base.

Second, API-first architectures mean more internal service-to-service calls. In a microservices system, one user request can fan out to dozens of internal API calls. Without rate limiting between services, a spike in user traffic amplifies exponentially as it cascades through your backend.

I learned this the hard way when we added a recommendation engine that called our product API for every item in a user’s cart. During a flash sale, cart sizes jumped from an average of 3 items to 15, and our product service collapsed under 5x its normal internal traffic. Rate limiting between services would have turned a cascading failure into a graceful degradation.


Rate Limiting Algorithms Explained

Fixed Window

The simplest approach: count requests in fixed time windows (e.g., per minute, per hour) and reject requests that exceed the limit.

import redis
import time

r = redis.Redis()

def fixed_window_rate_limit(user_id: str, limit: int, window_seconds: int) -> bool:
    key = f"ratelimit:{user_id}:{int(time.time()) // window_seconds}"
    
    current = r.incr(key)
    if current == 1:
        r.expire(key, window_seconds)
    
    return current <= limit

The problem: boundary bursts. If your limit is 100 requests per minute, a client can send 100 requests at 11:59:59 and another 100 at 12:00:01 — effectively 200 requests in 2 seconds. I have seen this cause real issues in production, particularly with batch-processing clients that queue up work and fire it all at window boundaries.
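
The boundary effect falls straight out of the window arithmetic in the code above: two timestamps two seconds apart can map to different window keys, each with a fresh quota. A standalone sketch of just that math (no Redis needed; the timestamps are illustrative Unix times):

```python
window_seconds = 60

# Two requests two seconds apart, straddling a minute boundary.
t_before = 1712764799  # ...:59:59
t_after = 1712764801   # ...:00:01

w_before = t_before // window_seconds
w_after = t_after // window_seconds

# Different window indices -> each side gets the full quota, so a
# client can burst roughly 2x the limit in two seconds.
print(w_before != w_after)  # True
```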

Sliding Window Counter

This is what I use for most production systems. It combines the simplicity of fixed windows with better burst protection by weighting the previous window’s count.

def sliding_window_rate_limit(user_id: str, limit: int, window_seconds: int) -> bool:
    now = time.time()
    current_window = int(now) // window_seconds
    previous_window = current_window - 1
    elapsed = now % window_seconds
    weight = 1 - (elapsed / window_seconds)
    
    current_key = f"ratelimit:{user_id}:{current_window}"
    previous_key = f"ratelimit:{user_id}:{previous_window}"
    
    pipe = r.pipeline()
    pipe.get(previous_key)
    pipe.incr(current_key)
    pipe.expire(current_key, window_seconds * 2)
    results = pipe.execute()
    
    previous_count = int(results[0] or 0)
    current_count = int(results[1])
    
    effective_count = (previous_count * weight) + current_count
    return effective_count <= limit

The weighted approach smooths out the boundary problem. In my experience, sliding window counters reduce false rate-limit rejections by about 30% compared to fixed windows at the same nominal limit.
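
A quick numeric check of the weighting, with made-up counts: thirty seconds into a 60-second window, the previous window contributes half its count, so 80 prior requests count as 40.

```python
window_seconds = 60
limit = 100

previous_count = 80  # requests in the previous full window (illustrative)
current_count = 45   # requests so far in the current window
elapsed = 30.0       # seconds into the current window

weight = 1 - (elapsed / window_seconds)              # 0.5
effective = previous_count * weight + current_count  # 40 + 45 = 85

print(effective <= limit)  # True: request allowed
```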

Token Bucket

The token bucket is my favorite algorithm for public-facing APIs. It allows controlled bursts while maintaining an average rate — which matches how real clients actually behave.

def token_bucket_rate_limit(user_id: str, capacity: int, refill_rate: float) -> bool:
    key = f"bucket:{user_id}"
    now = time.time()
    
    # A single HGETALL needs no pipeline. Note that this read-modify-write
    # sequence is not atomic across servers; the Lua script in the
    # distributed section below is the safe variant.
    result = r.hgetall(key)
    
    if result:
        tokens = float(result[b'tokens'])
        last_refill = float(result[b'last_refill'])
        elapsed = now - last_refill
        tokens = min(capacity, tokens + elapsed * refill_rate)
    else:
        tokens = capacity
    
    if tokens >= 1:
        tokens -= 1
        r.hset(key, mapping={'tokens': tokens, 'last_refill': now})
        r.expire(key, int(capacity / refill_rate) + 60)
        return True
    
    return False

Think of it like a bucket that holds tokens. Each request costs one token. Tokens refill at a constant rate. The bucket has a maximum capacity, which determines the maximum burst size. When I set a limit of “100 requests per minute with bursts up to 20,” I configure: capacity = 20, refill rate = 100/60 ≈ 1.67 tokens per second.
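
The same math can be checked without Redis. Here is a minimal in-memory version of the bucket (the TokenBucket class is mine, a single-process sketch rather than the article's Redis implementation) configured exactly as described: capacity 20, refill 100/60 tokens per second.

```python
import time

class TokenBucket:
    """Minimal in-memory token bucket (illustrative, single-process only)."""

    def __init__(self, capacity: int, refill_rate: float):
        self.capacity = capacity
        self.refill_rate = refill_rate
        self.tokens = float(capacity)
        self.last_refill = time.monotonic()

    def allow(self, now=None) -> bool:
        now = time.monotonic() if now is None else now
        elapsed = now - self.last_refill
        # Refill based on elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# "100 requests per minute, bursts up to 20":
bucket = TokenBucket(capacity=20, refill_rate=100 / 60)
bucket.last_refill = 0.0

# A burst of 20 requests at t=0 is allowed; the 21st is rejected.
burst = [bucket.allow(now=0.0) for _ in range(21)]
print(burst.count(True))  # 20
print(burst[-1])          # False
```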

Leaky Bucket

The leaky bucket processes requests at a fixed rate, queueing excess requests instead of rejecting them immediately. It produces the smoothest output but adds latency for queued requests.

def leaky_bucket_rate_limit(user_id: str, capacity: int, leak_rate: float) -> bool:
    key = f"leaky:{user_id}"
    now = time.time()
    
    # Single read; as with the token bucket, this sequence is not atomic
    # across servers (see the Lua script in the distributed section).
    result = r.hgetall(key)
    
    if result:
        water_level = float(result[b'level'])
        last_check = float(result[b'last_check'])
        elapsed = now - last_check
        water_level = max(0, water_level - elapsed * leak_rate)
    else:
        water_level = 0
    
    if water_level < capacity:
        water_level += 1
        r.hset(key, mapping={'level': water_level, 'last_check': now})
        r.expire(key, int(capacity / leak_rate) + 60)
        return True
    
    return False

I used a leaky bucket once for a payment processing API where we needed exactly-even throughput to stay within a downstream provider’s limits. It worked perfectly for that use case but felt overly restrictive for general-purpose APIs.


Implementing Rate Limiting in Production

Choosing Your Rate Limit Key

The key you rate-limit by determines what gets throttled. This decision has significant implications.

Key           | Use Case                      | Tradeoff
API key       | SaaS products with API keys   | Best for paid tiers; no protection against key sharing
User ID       | Authenticated APIs            | Fair per-user limiting; requires authentication
IP address    | Public endpoints, login pages | Simple; breaks with NAT/proxies (shared IPs)
IP + endpoint | Fine-grained public limiting  | Better isolation; more Redis keys
Composite     | Enterprise APIs               | Most flexible; most complex

In my experience, the best approach for most APIs is API key as primary, IP address as fallback for unauthenticated endpoints. I once relied solely on IP-based limiting and discovered that an entire corporate office (thousands of users) shared a single IP through their proxy. We were rate-limiting legitimate users while actual abusers on residential IPs flew under the radar.

Response Headers

Always communicate rate limit status in response headers. This is not just good practice — it is what separates APIs that developers love from APIs they dread.

HTTP/1.1 200 OK
X-RateLimit-Limit: 1000
X-RateLimit-Remaining: 742
X-RateLimit-Reset: 1712764800

When the limit is exceeded:

HTTP/1.1 429 Too Many Requests
Retry-After: 30
X-RateLimit-Limit: 1000
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1712764800
Content-Type: application/json

{
  "error": {
    "code": "RATE_LIMIT_EXCEEDED",
    "message": "You have exceeded the rate limit of 1000 requests per hour.",
    "retry_after": 30
  }
}

The Retry-After header is critical. Without it, clients have to guess when to retry, and most will retry immediately — making the problem worse. When I added Retry-After to our API, support tickets about rate limiting dropped by over 60% because well-built clients handled it automatically.
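
On the client side, a well-behaved retry policy honors Retry-After when the server sends it and falls back to capped exponential backoff with jitter otherwise. A minimal sketch (the retry_delay function and its defaults are hypothetical, not part of any particular client library):

```python
import random
from typing import Optional

def retry_delay(attempt: int, retry_after: Optional[float] = None,
                base: float = 1.0, cap: float = 60.0) -> float:
    """Seconds to wait before retry number `attempt` (0-indexed).

    Uses the server's Retry-After value when provided; otherwise falls
    back to capped exponential backoff with full jitter, which avoids
    the synchronized retry storms that made the original outage worse.
    """
    if retry_after is not None:
        return retry_after
    return random.uniform(0, min(cap, base * (2 ** attempt)))

# Server said Retry-After: 30 -> wait exactly 30 seconds.
print(retry_delay(0, retry_after=30))  # 30
```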

Middleware Implementation

Here is a complete rate limiting middleware for Express.js using the sliding window counter:

const Redis = require('ioredis');
const redis = new Redis();

function rateLimiter({ limit, windowSeconds, keyPrefix }) {
  return async (req, res, next) => {
    const identifier = req.user?.id || req.ip;
    const now = Date.now() / 1000;
    const currentWindow = Math.floor(now / windowSeconds);
    const previousWindow = currentWindow - 1;
    const elapsed = now % windowSeconds;
    const weight = 1 - elapsed / windowSeconds;

    const currentKey = `${keyPrefix}:${identifier}:${currentWindow}`;
    const previousKey = `${keyPrefix}:${identifier}:${previousWindow}`;

    const pipe = redis.pipeline();
    pipe.get(previousKey);
    pipe.incr(currentKey);
    pipe.expire(currentKey, windowSeconds * 2);
    const results = await pipe.exec();

    const previousCount = parseInt(results[0][1] || 0);
    const currentCount = parseInt(results[1][1]);
    const effectiveCount = Math.floor(previousCount * weight) + currentCount;

    res.set('X-RateLimit-Limit', limit);
    res.set('X-RateLimit-Remaining', Math.max(0, limit - effectiveCount));
    res.set('X-RateLimit-Reset', (currentWindow + 1) * windowSeconds);

    if (effectiveCount > limit) {
      res.set('Retry-After', Math.ceil(windowSeconds - elapsed));
      return res.status(429).json({
        error: {
          code: 'RATE_LIMIT_EXCEEDED',
          message: `Rate limit of ${limit} requests per ${windowSeconds}s exceeded.`,
          retry_after: Math.ceil(windowSeconds - elapsed),
        },
      });
    }

    next();
  };
}

// Usage
app.use('/api/v1/', rateLimiter({ 
  limit: 1000, 
  windowSeconds: 3600,
  keyPrefix: 'rl:api'
}));

app.use('/api/v1/search', rateLimiter({ 
  limit: 100, 
  windowSeconds: 60,
  keyPrefix: 'rl:search'
}));

Tiered Rate Limiting

For SaaS APIs, different pricing tiers should get different limits. I implement this by storing tier configurations and looking them up at request time:

const TIER_LIMITS = {
  free:       { limit: 100,   windowSeconds: 3600 },
  starter:    { limit: 1000,  windowSeconds: 3600 },
  pro:        { limit: 10000, windowSeconds: 3600 },
  enterprise: { limit: 100000, windowSeconds: 3600 },
};

function tieredRateLimiter() {
  return async (req, res, next) => {
    const tier = req.user?.tier || 'free';
    const config = TIER_LIMITS[tier];
    // Constructing a limiter per request keeps the example short; on a
    // hot path, build one limiter per tier once and reuse it.
    return rateLimiter({
      ...config,
      keyPrefix: `rl:${tier}`,
    })(req, res, next);
  };
}

Distributed Rate Limiting Challenges

When you run multiple API servers behind a load balancer, rate limiting gets harder. Each server needs to agree on how many requests a client has made, which means you need a shared store — usually Redis.

Race Conditions

The naive approach of GET-then-SET creates a race condition where two servers can read the same count and both allow a request through. I always use atomic Redis operations or Lua scripts to avoid this:

-- rate_limit.lua
local key = KEYS[1]
local limit = tonumber(ARGV[1])
local window = tonumber(ARGV[2])

local current = redis.call('INCR', key)
if current == 1 then
  redis.call('EXPIRE', key, window)
end

if current > limit then
  return 0
end
return 1

Redis Failure Handling

What happens when Redis is down? You have two options: fail open (allow all requests) or fail closed (reject all requests). Neither is ideal.

In my production systems, I fail open with a local fallback. Each server maintains an in-memory counter as a rough backup. It is not perfectly accurate, but it prevents complete chaos when Redis has a blip. The alternative — rejecting all traffic because your rate limiter is down — is almost always worse than temporarily allowing unlimited traffic.

async function resilientRateLimit(identifier, limit) {
  try {
    return await redisRateLimit(identifier, limit);
  } catch (err) {
    console.error('Redis rate limit failed, using local fallback', err);
    return localRateLimit(identifier, limit);
  }
}
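
The localRateLimit fallback referenced above can be as simple as a per-process fixed-window counter. A sketch in Python for consistency with the earlier algorithm examples (the LocalRateLimiter class name is mine; any equivalent in-memory counter works):

```python
import time
from collections import defaultdict

class LocalRateLimiter:
    """In-memory fixed-window fallback for when Redis is unavailable.

    Only counts requests seen by this process, so with N servers the
    effective global limit is roughly N * limit -- acceptable as rough
    protection during a Redis blip, not as the primary limiter.
    """

    def __init__(self, limit: int, window_seconds: int):
        self.limit = limit
        self.window_seconds = window_seconds
        self.counts = defaultdict(int)

    def allow(self, identifier: str, now=None) -> bool:
        now = time.time() if now is None else now
        window = int(now) // self.window_seconds
        # Drop counters from old windows to bound memory.
        self.counts = defaultdict(int, {k: v for k, v in self.counts.items()
                                        if k[1] >= window - 1})
        self.counts[(identifier, window)] += 1
        return self.counts[(identifier, window)] <= self.limit

limiter = LocalRateLimiter(limit=3, window_seconds=60)
results = [limiter.allow("client-a", now=1000) for _ in range(4)]
print(results)  # [True, True, True, False]
```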

Designing Your Rate Limit Strategy

Endpoint-Level Limits

Not all endpoints deserve the same limits. In every API I have built, I categorize endpoints into tiers:

Endpoint Type | Example         | Typical Limit | Why
Read (light)  | GET /users/me   | 1000/hour     | Low cost, high frequency
Read (heavy)  | GET /search?q=… | 100/minute    | Database-intensive
Write         | POST /orders    | 200/hour      | Creates state, harder to undo
Auth          | POST /login     | 10/minute     | Brute-force protection
Upload        | POST /files     | 20/hour       | Storage cost, bandwidth

Communicating Limits to Developers

Document your rate limits clearly. When I redesigned our API documentation, I added a dedicated “Rate Limits” page with:

  1. Default limits for each endpoint category
  2. How to check remaining quota via response headers
  3. What happens when limits are exceeded (429 response format)
  4. How to request higher limits
  5. Best practices for client-side handling (exponential backoff)

Developer experience around rate limiting is a differentiator. The first API I built had opaque rate limiting — no headers, generic 429 errors, no documentation. Support tickets about “random failures” consumed hours every week. After adding proper headers and docs, those tickets nearly disappeared.


Frequently Asked Questions

What Is the Best Rate Limiting Algorithm for a Public API?

The token bucket algorithm is the best choice for most public APIs. It allows controlled bursts (which matches how real clients behave — batch processing, page loads with multiple parallel requests) while maintaining a predictable average rate. If you do not need burst support, the sliding window counter is simpler to implement and provides good accuracy.

Should I Rate Limit Internal Service-to-Service Calls?

Yes, absolutely. I learned this lesson the hard way when a misconfigured internal service created a retry storm that cascaded through our entire backend. Internal rate limiting prevents a single misbehaving service from taking down others. Set higher limits than external-facing ones, but set them. Even a generous limit of 10,000 requests per minute per service catches runaway loops before they cause real damage.

How Do I Handle Rate Limiting with API Keys vs IP Addresses?

Use API keys as the primary identifier for authenticated endpoints and IP addresses as a fallback for public endpoints like login or registration. Be cautious with IP-based limiting in environments where many users share an IP (corporate networks, VPNs, mobile carriers). Consider using a composite key that combines IP with other signals like User-Agent or geographic region for more accurate identification.

What Should I Do When Redis Is Down and I Cannot Check Rate Limits?

In most production systems, fail open is the safer default — allow requests through rather than rejecting everyone. Implement a local in-memory fallback counter on each server to provide rough protection during Redis outages. The key insight is that a brief period without rate limiting is almost always less damaging than rejecting all traffic. Log the event so you can investigate and ensure your Redis setup is highly available.

How Do I Set Rate Limits When I Do Not Know My Traffic Patterns Yet?

Start with generous limits based on what your infrastructure can handle, then tighten based on real data. I typically begin with 1000 requests per hour for general endpoints and 100 per minute for expensive operations. After two weeks of production traffic, analyze your p95 and p99 usage patterns. Set your limits at roughly 2-3x the p99 — this accommodates legitimate spikes while catching abuse. Revisit quarterly as your user base grows.
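
The 2-3x p99 rule is straightforward to apply once you have per-client hourly request counts. A sketch using the standard library (the sample numbers are made up; method='inclusive' keeps the estimate within the observed range):

```python
import statistics

# Hourly request counts per client over the observation window
# (illustrative data).
hourly_counts = [12, 40, 95, 300, 18, 77, 210, 450, 33, 5, 120, 640,
                 88, 260, 15, 380, 52, 700, 140, 29]

# statistics.quantiles with n=100 returns 99 percentile cut points;
# index 98 is the p99 boundary.
p99 = statistics.quantiles(hourly_counts, n=100, method='inclusive')[98]

# Set the limit at roughly 3x p99 to absorb legitimate spikes.
suggested_limit = round(3 * p99)
print(p99, suggested_limit)
```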


The Bottom Line

Rate limiting is one of those backend concerns that feels unnecessary until the day it becomes the most important thing in your system. After living through that Tuesday afternoon outage, I now treat rate limiting as a first-class design concern — implemented before launch, not after the first incident.

Start with a sliding window counter or token bucket in Redis. Add proper response headers so clients can self-regulate. Set per-endpoint limits that match each endpoint’s cost profile. And most importantly, monitor your rate limiting — the data it generates about how your API is consumed is invaluable for capacity planning and pricing decisions.

The best rate limiting system is one your clients never notice — because it only kicks in for genuinely abusive traffic while legitimate users operate freely within generous boundaries.


Tags: rate limiting api backend redis system design scalability
