API Rate Limiting Design: What I Learned After a Traffic Spike Took Down Our Service
Design effective API rate limiting with token bucket, sliding window, and fixed window algorithms. Redis implementation included.
It was a Tuesday afternoon when our entire API went down. Not because of a bug, not because of a server crash — but because a single client was hammering our endpoints with 47,000 requests per minute. Their integration had a retry loop with no backoff, and every failed request triggered three more retries. Within fifteen minutes, our database connection pool was exhausted, and every other customer was getting 503 errors.
That incident cost us a full day of downtime and a very uncomfortable call with our biggest enterprise customer. It also taught me the most important lesson of my backend career: rate limiting is not optional — it is infrastructure. If you do not control how your API is consumed, your worst client will eventually dictate your availability for everyone else.
This guide covers everything I have learned about designing rate limiting systems, from algorithm selection to Redis-based implementation to the organizational decisions that matter just as much as the code.
TL;DR — Rate Limiting at a Glance
| Algorithm | Best For | Burst Handling | Complexity |
|---|---|---|---|
| Fixed Window | Simple APIs, internal tools | Poor (boundary bursts) | Low |
| Sliding Window Log | Precise limiting, audit trails | Good | Medium |
| Sliding Window Counter | Most production APIs | Good | Medium |
| Token Bucket | APIs needing controlled bursts | Excellent | Medium |
| Leaky Bucket | Smooth, predictable throughput | None (by design) | Low |
If you want a single recommendation: token bucket for public APIs, sliding window counter for internal services.
Why Rate Limiting Matters More in 2026
Rate limiting has always been important, but two trends have made it critical.
First, AI agents are consuming APIs at machine speed. When I started building APIs ten years ago, the fastest client was a developer running a script. Today, autonomous AI agents make hundreds of API calls per second as part of multi-step workflows. A single misconfigured agent can generate more traffic than your entire human user base.
Second, API-first architectures mean more internal service-to-service calls. In a microservices system, one user request can fan out to dozens of internal API calls. Without rate limiting between services, a spike in user traffic is amplified at every hop as it cascades through your backend.
I learned this the hard way when we added a recommendation engine that called our product API for every item in a user’s cart. During a flash sale, cart sizes jumped from an average of 3 items to 15, and our product service collapsed under 5x its normal internal traffic. Rate limiting between services would have turned a cascading failure into a graceful degradation.
Rate Limiting Algorithms Explained
Fixed Window
The simplest approach: count requests in fixed time windows (e.g., per minute, per hour) and reject requests that exceed the limit.
```python
import redis
import time

r = redis.Redis()

def fixed_window_rate_limit(user_id: str, limit: int, window_seconds: int) -> bool:
    # One counter per (user, window); the window index rolls over every window_seconds.
    key = f"ratelimit:{user_id}:{int(time.time()) // window_seconds}"
    current = r.incr(key)
    if current == 1:
        # First request in this window: set a TTL so stale counters expire.
        r.expire(key, window_seconds)
    return current <= limit
```
The problem: boundary bursts. If your limit is 100 requests per minute, a client can send 100 requests at 11:59:59 and another 100 at 12:00:01 — effectively 200 requests in 2 seconds. I have seen this cause real issues in production, particularly with batch-processing clients that queue up work and fire it all at window boundaries.
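The failure mode is easy to see with a little arithmetic. Here is a minimal in-memory sketch of the same counter (illustration only — the Redis version above is what you would actually deploy) that makes the boundary burst concrete:

```python
class FixedWindowCounter:
    """Minimal in-memory fixed-window counter, for illustration only."""

    def __init__(self, limit: int, window_seconds: int):
        self.limit = limit
        self.window_seconds = window_seconds
        self.counts = {}  # window index -> request count

    def allow(self, now: float) -> bool:
        window = int(now) // self.window_seconds
        self.counts[window] = self.counts.get(window, 0) + 1
        return self.counts[window] <= self.limit

# Limit: 100 requests per 60-second window.
rl = FixedWindowCounter(limit=100, window_seconds=60)

# 100 requests at t=59 (end of window 0) all pass...
assert all(rl.allow(59.0) for _ in range(100))
# ...and 100 more at t=61 (start of window 1) also pass:
assert all(rl.allow(61.0) for _ in range(100))
# 200 requests in ~2 seconds slipped through a "100 per minute" limit.
```

The timestamps are passed in explicitly so the behavior is deterministic; in production you would use the wall clock, as the Redis version does.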
Sliding Window Counter
This is what I use for most production systems. It combines the simplicity of fixed windows with better burst protection by weighting the previous window’s count.
```python
def sliding_window_rate_limit(user_id: str, limit: int, window_seconds: int) -> bool:
    now = time.time()
    current_window = int(now) // window_seconds
    previous_window = current_window - 1

    # Weight the previous window by how much of it still overlaps the
    # sliding window (e.g. 15s into a 60s window -> weight 0.75).
    elapsed = now % window_seconds
    weight = 1 - (elapsed / window_seconds)

    current_key = f"ratelimit:{user_id}:{current_window}"
    previous_key = f"ratelimit:{user_id}:{previous_window}"

    pipe = r.pipeline()
    pipe.get(previous_key)
    pipe.incr(current_key)
    pipe.expire(current_key, window_seconds * 2)
    results = pipe.execute()

    previous_count = int(results[0] or 0)
    current_count = int(results[1])

    effective_count = (previous_count * weight) + current_count
    return effective_count <= limit
```
The weighted approach smooths out the boundary problem. In my experience, sliding window counters reduce false rate-limit rejections by about 30% compared to fixed windows at the same nominal limit.
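The weighting is easy to sanity-check by hand. A pure-Python sketch of the effective-count formula, mirroring the Redis logic above:

```python
def effective_count(previous_count: int, current_count: int,
                    elapsed: float, window_seconds: int) -> float:
    """Weighted count used by the sliding window counter: the previous
    window contributes in proportion to how much of it still overlaps
    the sliding window."""
    weight = 1 - (elapsed / window_seconds)
    return previous_count * weight + current_count

# 15 seconds into a 60s window: 75% of the previous window still counts.
# Previous window saw 80 requests, current window has 30 so far.
count = effective_count(80, 30, elapsed=15, window_seconds=60)
assert count == 80 * 0.75 + 30  # 90.0 -- compared against the limit
```

A request is allowed when this effective count is at or below the limit, which is exactly the final comparison in the Redis implementation.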
Token Bucket
The token bucket is my favorite algorithm for public-facing APIs. It allows controlled bursts while maintaining an average rate — which matches how real clients actually behave.
```python
def token_bucket_rate_limit(user_id: str, capacity: int, refill_rate: float) -> bool:
    # Note: this read-modify-write sequence is not atomic across servers;
    # for multi-server deployments, move the logic into a Lua script
    # (see the distributed section below).
    key = f"bucket:{user_id}"
    now = time.time()

    result = r.hgetall(key)
    if result:
        tokens = float(result[b'tokens'])
        last_refill = float(result[b'last_refill'])
        # Refill for the time elapsed since the last request, capped at capacity.
        elapsed = now - last_refill
        tokens = min(capacity, tokens + elapsed * refill_rate)
    else:
        tokens = capacity  # new client starts with a full bucket

    if tokens >= 1:
        tokens -= 1
        r.hset(key, mapping={'tokens': tokens, 'last_refill': now})
        # Expire idle buckets once they would have fully refilled anyway.
        r.expire(key, int(capacity / refill_rate) + 60)
        return True
    return False
```
Think of it like a bucket that holds tokens. Each request costs one token. Tokens refill at a constant rate. The bucket has a maximum capacity, which determines the maximum burst size. When I set a limit of “100 requests per minute with bursts up to 20,” I configure: capacity = 20, refill rate = 100/60 = 1.67 tokens per second.
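To see the capacity/refill interaction without Redis, here is an in-memory sketch of the same bucket (illustration only; timestamps are passed in explicitly so the behavior is deterministic):

```python
class TokenBucket:
    """In-memory token bucket; the Redis version above is the
    distributed equivalent."""

    def __init__(self, capacity: float, refill_rate: float, now: float = 0.0):
        self.capacity = capacity
        self.refill_rate = refill_rate  # tokens added per second
        self.tokens = capacity          # bucket starts full
        self.last_refill = now

    def allow(self, now: float) -> bool:
        # Refill for elapsed time, capped at capacity, then spend one token.
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# "100 requests/minute with bursts up to 20": capacity=20, refill=100/60.
bucket = TokenBucket(capacity=20, refill_rate=100 / 60)

# A burst of 20 requests at t=0 is allowed...
assert all(bucket.allow(0.0) for _ in range(20))
# ...the 21st is rejected...
assert not bucket.allow(0.0)
# ...but after 1.2s roughly two tokens (~1.67/s) have refilled.
assert bucket.allow(1.2)
```

The burst size and the sustained rate are tuned independently, which is exactly why the algorithm fits bursty-but-bounded clients so well.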
Leaky Bucket
The leaky bucket processes requests at a fixed rate, queueing excess requests instead of rejecting them immediately. It produces the smoothest output but adds latency for queued requests.
```python
def leaky_bucket_rate_limit(user_id: str, capacity: int, leak_rate: float) -> bool:
    key = f"leaky:{user_id}"
    now = time.time()

    result = r.hgetall(key)
    if result:
        water_level = float(result[b'level'])
        last_check = float(result[b'last_check'])
        # Water drains at a constant rate; never below empty.
        elapsed = now - last_check
        water_level = max(0, water_level - elapsed * leak_rate)
    else:
        water_level = 0

    if water_level < capacity:
        water_level += 1  # this request adds one unit of "water"
        r.hset(key, mapping={'level': water_level, 'last_check': now})
        r.expire(key, int(capacity / leak_rate) + 60)
        return True
    return False
```
I used a leaky bucket once for a payment processing API where we needed smooth, even throughput to stay within a downstream provider’s limits. It was a perfect fit for that use case but felt overly restrictive for general-purpose APIs.
Implementing Rate Limiting in Production
Choosing Your Rate Limit Key
The key you rate-limit by determines what gets throttled. This decision has significant implications.
| Key | Use Case | Tradeoff |
|---|---|---|
| API key | SaaS products with API keys | Best for paid tiers; no protection against key sharing |
| User ID | Authenticated APIs | Fair per-user limiting; requires authentication |
| IP address | Public endpoints, login pages | Simple; breaks with NAT/proxies (shared IPs) |
| IP + endpoint | Fine-grained public limiting | Better isolation; more Redis keys |
| Composite | Enterprise APIs | Most flexible; most complex |
In my experience, the best approach for most APIs is API key as primary, IP address as fallback for unauthenticated endpoints. I once relied solely on IP-based limiting and discovered that an entire corporate office (thousands of users) shared a single IP through their proxy. We were rate-limiting legitimate users while actual abusers on residential IPs flew under the radar.
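One way to encode that policy is a small key-selection helper. This sketch is illustrative — the key formats and precedence order here are assumptions, not a standard:

```python
from typing import Optional

def rate_limit_key(api_key: Optional[str], user_id: Optional[str],
                   ip: str, endpoint: str) -> str:
    """Pick the most specific identifier available: API key, then user
    ID, then IP. Scoping by endpoint keeps cheap and expensive routes
    from sharing one budget."""
    if api_key:
        return f"rl:key:{api_key}:{endpoint}"
    if user_id:
        return f"rl:user:{user_id}:{endpoint}"
    # Unauthenticated fallback -- beware shared IPs behind NAT/proxies.
    return f"rl:ip:{ip}:{endpoint}"

assert rate_limit_key("k123", None, "1.2.3.4", "/search") == "rl:key:k123:/search"
assert rate_limit_key(None, None, "1.2.3.4", "/login") == "rl:ip:1.2.3.4:/login"
```

The resulting string becomes the Redis key prefix fed into whichever algorithm you chose above.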
Response Headers
Always communicate rate limit status in response headers. This is not just good practice — it is what separates APIs that developers love from APIs they dread.
```http
HTTP/1.1 200 OK
X-RateLimit-Limit: 1000
X-RateLimit-Remaining: 742
X-RateLimit-Reset: 1712764800
```
When the limit is exceeded:
```http
HTTP/1.1 429 Too Many Requests
Retry-After: 30
X-RateLimit-Limit: 1000
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1712764800
Content-Type: application/json

{
  "error": {
    "code": "RATE_LIMIT_EXCEEDED",
    "message": "You have exceeded the rate limit of 1000 requests per hour.",
    "retry_after": 30
  }
}
```
The Retry-After header is critical. Without it, clients have to guess when to retry, and most will retry immediately — making the problem worse. When I added Retry-After to our API, support tickets about rate limiting dropped by over 60% because well-built clients handled it automatically.
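On the client side, a well-behaved consumer honors Retry-After and falls back to exponential backoff when the header is missing. A hedged sketch, with the transport abstracted as a callable so the retry logic stands alone:

```python
import time
from typing import Callable, Optional, Tuple

# A response is modeled as (status_code, retry_after_seconds_or_None).
Response = Tuple[int, Optional[float]]

def call_with_backoff(send: Callable[[], Response], max_retries: int = 5,
                      base_delay: float = 1.0,
                      sleep: Callable[[float], None] = time.sleep) -> int:
    """Retry on 429, preferring the server's Retry-After hint and
    falling back to exponential backoff (1s, 2s, 4s, ...)."""
    for attempt in range(max_retries):
        status, retry_after = send()
        if status != 429:
            return status
        delay = retry_after if retry_after is not None else base_delay * (2 ** attempt)
        sleep(delay)
    return 429  # retry budget exhausted; surface the rate-limit error

# Example: two 429s (one with Retry-After), then success.
responses = iter([(429, 30), (429, None), (200, None)])
delays = []
status = call_with_backoff(lambda: next(responses), sleep=delays.append)
assert status == 200
assert delays == [30, 2.0]  # honored Retry-After, then fell back to backoff
```

In a real client, `send` would wrap your HTTP call and parse the Retry-After header; injecting `sleep` keeps the sketch testable.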
Middleware Implementation
Here is a complete rate limiting middleware for Express.js using the sliding window counter:
```javascript
const Redis = require('ioredis');

const redis = new Redis();

function rateLimiter({ limit, windowSeconds, keyPrefix }) {
  return async (req, res, next) => {
    // Prefer the authenticated user ID; fall back to IP for anonymous traffic.
    const identifier = req.user?.id || req.ip;
    const now = Date.now() / 1000;
    const currentWindow = Math.floor(now / windowSeconds);
    const previousWindow = currentWindow - 1;
    const elapsed = now % windowSeconds;
    const weight = 1 - elapsed / windowSeconds;

    const currentKey = `${keyPrefix}:${identifier}:${currentWindow}`;
    const previousKey = `${keyPrefix}:${identifier}:${previousWindow}`;

    const pipe = redis.pipeline();
    pipe.get(previousKey);
    pipe.incr(currentKey);
    pipe.expire(currentKey, windowSeconds * 2);
    const results = await pipe.exec();

    // ioredis returns [err, value] pairs for each pipelined command.
    const previousCount = parseInt(results[0][1] || '0', 10);
    const currentCount = parseInt(results[1][1], 10);
    const effectiveCount = Math.floor(previousCount * weight) + currentCount;

    res.set('X-RateLimit-Limit', String(limit));
    res.set('X-RateLimit-Remaining', String(Math.max(0, limit - effectiveCount)));
    res.set('X-RateLimit-Reset', String((currentWindow + 1) * windowSeconds));

    if (effectiveCount > limit) {
      const retryAfter = Math.ceil(windowSeconds - elapsed);
      res.set('Retry-After', String(retryAfter));
      return res.status(429).json({
        error: {
          code: 'RATE_LIMIT_EXCEEDED',
          message: `Rate limit of ${limit} requests per ${windowSeconds}s exceeded.`,
          retry_after: retryAfter,
        },
      });
    }

    next();
  };
}

// Usage
app.use('/api/v1/', rateLimiter({
  limit: 1000,
  windowSeconds: 3600,
  keyPrefix: 'rl:api',
}));

app.use('/api/v1/search', rateLimiter({
  limit: 100,
  windowSeconds: 60,
  keyPrefix: 'rl:search',
}));
```
Tiered Rate Limiting
For SaaS APIs, different pricing tiers should get different limits. I implement this by storing tier configurations and looking them up at request time:
```javascript
const TIER_LIMITS = {
  free: { limit: 100, windowSeconds: 3600 },
  starter: { limit: 1000, windowSeconds: 3600 },
  pro: { limit: 10000, windowSeconds: 3600 },
  enterprise: { limit: 100000, windowSeconds: 3600 },
};

function tieredRateLimiter() {
  return async (req, res, next) => {
    const tier = req.user?.tier || 'free';
    const config = TIER_LIMITS[tier];
    return rateLimiter({
      ...config,
      keyPrefix: `rl:${tier}`,
    })(req, res, next);
  };
}
```
Distributed Rate Limiting Challenges
When you run multiple API servers behind a load balancer, rate limiting gets harder. Each server needs to agree on how many requests a client has made, which means you need a shared store — usually Redis.
Race Conditions
The naive approach of GET-then-SET creates a race condition where two servers can read the same count and both allow a request through. I always use atomic Redis operations or Lua scripts to avoid this:
```lua
-- rate_limit.lua
local key = KEYS[1]
local limit = tonumber(ARGV[1])
local window = tonumber(ARGV[2])

local current = redis.call('INCR', key)
if current == 1 then
  redis.call('EXPIRE', key, window)
end

if current > limit then
  return 0
end
return 1
```
Redis Failure Handling
What happens when Redis is down? You have two options: fail open (allow all requests) or fail closed (reject all requests). Neither is ideal.
In my production systems, I fail open with a local fallback. Each server maintains an in-memory counter as a rough backup. It is not perfectly accurate, but it prevents complete chaos when Redis has a blip. The alternative — rejecting all traffic because your rate limiter is down — is almost always worse than temporarily allowing unlimited traffic.
```javascript
async function resilientRateLimit(identifier, limit) {
  try {
    return await redisRateLimit(identifier, limit);
  } catch (err) {
    // Fail open with a rough in-process counter rather than rejecting everyone.
    console.error('Redis rate limit failed, using local fallback', err);
    return localRateLimit(identifier, limit);
  }
}
```
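The local fallback can be as simple as a per-process fixed-window counter. A Python sketch of the idea (approximate by design — with N servers behind the balancer, each server only sees its own slice of traffic, so the effective global limit can be up to N times the configured one):

```python
import threading
import time
from typing import Optional

class LocalRateLimiter:
    """Per-process fallback counter for when Redis is unreachable."""

    def __init__(self, limit: int, window_seconds: int):
        self.limit = limit
        self.window_seconds = window_seconds
        self._counts = {}  # (identifier, window index) -> count
        self._lock = threading.Lock()

    def allow(self, identifier: str, now: Optional[float] = None) -> bool:
        now = time.time() if now is None else now
        window = int(now) // self.window_seconds
        key = (identifier, window)
        with self._lock:
            # Drop counters from past windows to bound memory.
            self._counts = {k: v for k, v in self._counts.items() if k[1] == window}
            self._counts[key] = self._counts.get(key, 0) + 1
            return self._counts[key] <= self.limit

fallback = LocalRateLimiter(limit=2, window_seconds=60)
assert fallback.allow("user-1", now=0.0)
assert fallback.allow("user-1", now=1.0)
assert not fallback.allow("user-1", now=2.0)
```

It will over-admit during an outage, which is exactly the fail-open trade-off described above: rough protection beats no protection, and both beat rejecting everyone.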
Designing Your Rate Limit Strategy
Endpoint-Level Limits
Not all endpoints deserve the same limits. In every API I have built, I categorize endpoints into tiers:
| Endpoint Type | Example | Typical Limit | Why |
|---|---|---|---|
| Read (light) | GET /users/me | 1000/hour | Low cost, high frequency |
| Read (heavy) | GET /search?q=… | 100/minute | Database-intensive |
| Write | POST /orders | 200/hour | Creates state, harder to undo |
| Auth | POST /login | 10/minute | Brute-force protection |
| Upload | POST /files | 20/hour | Storage cost, bandwidth |
Communicating Limits to Developers
Document your rate limits clearly. When I redesigned our API documentation, I added a dedicated “Rate Limits” page with:
- Default limits for each endpoint category
- How to check remaining quota via response headers
- What happens when limits are exceeded (429 response format)
- How to request higher limits
- Best practices for client-side handling (exponential backoff)
Developer experience around rate limiting is a differentiator. The first API I built had opaque rate limiting — no headers, generic 429 errors, no documentation. Support tickets about “random failures” consumed hours every week. After adding proper headers and docs, those tickets nearly disappeared.
Frequently Asked Questions
What Is the Best Rate Limiting Algorithm for a Public API?
The token bucket algorithm is the best choice for most public APIs. It allows controlled bursts (which matches how real clients behave — batch processing, page loads with multiple parallel requests) while maintaining a predictable average rate. If you do not need burst support, the sliding window counter is simpler to implement and provides good accuracy.
Should I Rate Limit Internal Service-to-Service Calls?
Yes, absolutely. I learned this lesson the hard way when a misconfigured internal service created a retry storm that cascaded through our entire backend. Internal rate limiting prevents a single misbehaving service from taking down others. Set higher limits than external-facing ones, but set them. Even a generous limit of 10,000 requests per minute per service catches runaway loops before they cause real damage.
How Do I Handle Rate Limiting with API Keys vs IP Addresses?
Use API keys as the primary identifier for authenticated endpoints and IP addresses as a fallback for public endpoints like login or registration. Be cautious with IP-based limiting in environments where many users share an IP (corporate networks, VPNs, mobile carriers). Consider using a composite key that combines IP with other signals like User-Agent or geographic region for more accurate identification.
What Should I Do When Redis Is Down and I Cannot Check Rate Limits?
In most production systems, fail open is the safer default — allow requests through rather than rejecting everyone. Implement a local in-memory fallback counter on each server to provide rough protection during Redis outages. The key insight is that a brief period without rate limiting is almost always less damaging than rejecting all traffic. Log the event so you can investigate and ensure your Redis setup is highly available.
How Do I Set Rate Limits When I Do Not Know My Traffic Patterns Yet?
Start with generous limits based on what your infrastructure can handle, then tighten based on real data. I typically begin with 1000 requests per hour for general endpoints and 100 per minute for expensive operations. After two weeks of production traffic, analyze your p95 and p99 usage patterns. Set your limits at roughly 2-3x the p99 — this accommodates legitimate spikes while catching abuse. Revisit quarterly as your user base grows.
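The 2-3x-p99 rule is easy to compute from request logs. A sketch using only the standard library (the quantile interpolation method here is an assumption; your analytics pipeline may differ slightly):

```python
import statistics

def suggested_limit(hourly_request_counts, multiplier=2.5):
    """Suggest a per-hour limit at ~2.5x the p99 of observed usage."""
    # quantiles(n=100) returns the 1st..99th percentile cut points.
    p99 = statistics.quantiles(hourly_request_counts, n=100,
                               method='inclusive')[98]
    return int(p99 * multiplier)

# Observed requests/hour for a sample of clients (illustrative numbers).
observed = [12, 40, 55, 80, 90, 120, 150, 200, 310, 400]
limit = suggested_limit(observed)
assert limit == 979  # p99 = 391.9, times 2.5, truncated
```

In practice you would run this over a couple of weeks of per-client hourly counts, per endpoint category, and round the result to something memorable for your docs.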
The Bottom Line
Rate limiting is one of those backend concerns that feels unnecessary until the day it becomes the most important thing in your system. After living through that Tuesday afternoon outage, I now treat rate limiting as a first-class design concern — implemented before launch, not after the first incident.
Start with a sliding window counter or token bucket in Redis. Add proper response headers so clients can self-regulate. Set per-endpoint limits that match each endpoint’s cost profile. And most importantly, monitor your rate limiting — the data it generates about how your API is consumed is invaluable for capacity planning and pricing decisions.
The best rate limiting system is one your clients never notice — because it only kicks in for genuinely abusive traffic while legitimate users operate freely within generous boundaries.