Why Your Python Backend Spikes RAM After 50 Concurrent Requests

You’ve got a Python backend running in production. Everything is smooth with ten users, maybe even twenty. Then the fiftieth concurrent request hits, and your server’s memory graph goes vertical. The container OOM-kills your process, Nginx returns 502s, and you’re left staring at a heap profile that looks like a skyscraper.

That spike isn’t bad luck, and it isn’t Python being “slow.” It’s a specific, repeatable failure mode in how your application manages object lifecycles, thread pools, and database connections under load. Here is exactly what is happening and how to fix it.

The Anatomy of a Memory Spike Under Load

Most developers assume that if a single request uses 10 MB of RAM, 50 concurrent requests will use 500 MB. In a well-behaved system, that linear assumption holds. But Python’s garbage collector does not reclaim every object the moment a request finishes, and a typical web framework links objects in ways that push memory usage well past the linear estimate.

When 50 requests hit simultaneously, your server spawns threads or coroutines faster than the garbage collector can free memory from the previous batch. Python’s reference counting frees simple objects immediately, but cyclic references (common in ORM models, middleware chains, and response serializers) survive until the generational collector runs. That collector only runs once a configurable threshold is crossed: with CPython’s long-standing defaults of (700, 10, 10), a generation-zero collection starts when container allocations exceed deallocations by 700, and the older generations, where longer-lived request objects end up, are examined far less often.
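
To check where your own interpreter stands, the gc module exposes both the thresholds and the live counters. This is a minimal sketch; the values printed depend on the Python version and on what the process has allocated so far.

```python
import gc

# Thresholds for generations 0, 1 and 2. CPython has long defaulted to
# (700, 10, 10), but newer releases may report different values.
thresholds = gc.get_threshold()

# Current collection counters, one per generation.
counts = gc.get_count()

print(f"thresholds={thresholds} counts={counts}")
```

Logging these two tuples during a load test shows how close the process is to its next collection.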

That wait is where the spike comes from. Your process accumulates 50 copies of the same request context, each holding open file handles, database cursors, and template caches. By the time the GC wakes up, your RAM is already double what it should be.

The Thread Pool Trap

A concrete example from a project I consulted on last year: a Django REST API serving a real-time game lobby. The developers ran Gunicorn with threaded workers, setting workers=4 and threads=8 for a nominal capacity of 32 concurrent requests. Under 40 simultaneous requests, memory sat at 450 MB. At 55 requests, it jumped to 1.8 GB in under three seconds.

The culprit was a ThreadPoolExecutor inside a middleware that fetched user session data from Redis. Each thread held a reference to the request object, which held a reference to the database connection pool, which held a reference to all active cursors. The reference cycle request -> middleware -> executor -> request prevented reference counting from freeing any of it until the thread terminated. The GC eventually cleaned it up, but not before memory had quadrupled.
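
The effect of such a cycle can be reproduced in a few lines. The sketch below uses two hypothetical stand-in classes (Request and Middleware are not the project's real code) and a weak reference to show that the objects outlive their last external reference until the cycle collector runs.

```python
import gc
import weakref


class Request:
    """Hypothetical stand-in for a framework request object."""

    def __init__(self):
        self.middleware = None


class Middleware:
    """Hypothetical middleware that keeps a back-reference to the request."""

    def __init__(self, request):
        self.request = request
        request.middleware = self  # closes the cycle: request <-> middleware


def cycle_survives_refcounting():
    gc.disable()  # keep the automatic collector from running mid-demo
    try:
        request = Request()
        Middleware(request)
        probe = weakref.ref(request)
        del request  # the last external reference is gone
        alive_before = probe() is not None
        gc.collect()  # the cycle detector reclaims both objects
        alive_after = probe() is not None
        return alive_before, alive_after
    finally:
        gc.enable()


print(cycle_survives_refcounting())  # (True, False) on CPython
```

Breaking the back-reference, or holding it through weakref, lets plain reference counting free the request as soon as the handler returns.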

Why the Default Gunicorn Configuration Fails at Scale

Gunicorn’s default worker class is sync, which uses a single thread per worker. That sounds safe—one request, one thread, one memory footprint. But the sync worker blocks on I/O. If your view makes three database queries and two API calls, that worker is locked for the full round-trip. With 50 concurrent requests, you need at least 50 workers, and each worker loads your entire application stack into memory.

A single Django or FastAPI process with all its imports, middleware, and ORM models can consume 50–80 MB just sitting idle. Multiply by 50 workers, and you’re at 2.5–4 GB before a single request finishes processing. The memory spike isn’t from request handling—it’s from process duplication.
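
The arithmetic is simple enough to keep in a helper when sizing a deployment. This sketch assumes each worker holds its own full copy of the application and ignores any copy-on-write sharing between processes.

```python
def idle_footprint_mb(workers, per_worker_mb):
    """Memory held by idle workers, assuming no pages are shared between them."""
    return workers * per_worker_mb


low = idle_footprint_mb(50, 50)
high = idle_footprint_mb(50, 80)
print(f"{low / 1000:.1f}-{high / 1000:.1f} GB")  # 2.5-4.0 GB
```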

The Async Fallacy

Switching to uvicorn with async handlers isn’t a silver bullet either. Async Python still suffers from the same GC behavior. The difference is that async frameworks like FastAPI or Sanic use a single process with an event loop, so you avoid the 50x process overhead. But inside that event loop, each concurrent coroutine still allocates stack frames, request bodies, and response objects.

If your async handler creates a new aiohttp.ClientSession per request instead of reusing a global session, you multiply memory per coroutine. I’ve seen a FastAPI endpoint that fetched external data for each request and forgot to close the session. After 100 concurrent requests, the event loop held 100 open TCP connections with their associated buffers, plus 100 session objects that each cached DNS resolution data. Memory went from 120 MB to 1.4 GB in ten seconds.
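
The shape of the fix is to create the session once and let every handler borrow it. The sketch below uses a hypothetical FakeSession class in place of aiohttp.ClientSession so that it runs with the standard library alone; the handler and URL names are illustrative assumptions.

```python
import asyncio


class FakeSession:
    """Hypothetical stand-in for an HTTP client session such as aiohttp.ClientSession."""

    created = 0  # counts how many sessions the process has built

    def __init__(self):
        FakeSession.created += 1
        self.closed = False

    async def get(self, url):
        await asyncio.sleep(0)  # simulate awaiting network I/O
        return f"response from {url}"

    async def close(self):
        self.closed = True


async def handler(session, url):
    # The handler borrows the shared session; it never creates or closes one.
    return await session.get(url)


async def main(concurrency=100):
    session = FakeSession()  # created once, at application startup
    try:
        urls = [f"https://example.invalid/{i}" for i in range(concurrency)]
        results = await asyncio.gather(*(handler(session, u) for u in urls))
    finally:
        await session.close()  # closed once, at shutdown
    return len(results), FakeSession.created, session.closed


print(asyncio.run(main()))  # (100, 1, True)
```

One hundred concurrent handlers run against a single session object, and the session is closed exactly once.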

Database Connection Pools as Memory Amplifiers

Most Python backends that talk to PostgreSQL or MySQL use a connection pool. Common choices are psycopg2 with psycopg2.pool.ThreadedConnectionPool, or SQLAlchemy’s create_engine with settings such as pool_size=10 and max_overflow=20. These pools are supposed to limit database connections, but they also pin memory.

Each database connection in Python holds a client-side buffer. With psycopg2, the default cursor is a client-side cursor that pulls the entire result set into process memory as soon as the query executes; only a named, server-side cursor streams rows in batches. A query that returns 10,000 rows with ten columns might allocate 5–10 MB per connection. If your pool grows to 20 connections under load, that’s 100–200 MB of buffer memory that stays allocated for as long as those result sets are still referenced.
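
Batch-wise consumption looks like this. sqlite3 stands in for PostgreSQL so the sketch runs anywhere, and the events table is a made-up example. With psycopg2 the same loop only saves memory on a named cursor, because a default cursor has already loaded every row by the time fetchmany() is called.

```python
import sqlite3


def iter_rows(cursor, batch_size=1000):
    """Yield rows in fixed-size batches instead of calling fetchall()."""
    while True:
        batch = cursor.fetchmany(batch_size)
        if not batch:
            return
        yield from batch


# sqlite3 is only a stand-in database for this sketch.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany(
    "INSERT INTO events (payload) VALUES (?)",
    [("x" * 100,) for _ in range(10_000)],
)

cursor = conn.execute("SELECT id, payload FROM events ORDER BY id")
total = sum(1 for _ in iter_rows(cursor, batch_size=500))
print(total)  # 10000
conn.close()
```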

The real problem happens when your application creates a new engine or pool per request. I’ve debugged a Flask app that instantiated db = SQLAlchemy() inside a view function instead of at module level. Every request spawned a new engine with its own connection pool. By request number 50, the process held 50 separate pools, each with 5 idle connections, totaling 250 database connections and roughly 1.2 GB of buffer memory. The garbage collector couldn’t free them because each engine was still referenced by the function’s local scope until the function returned, and the function was blocked waiting for a database query.
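
One way to make per-request creation impossible is to hide the pool behind a cached factory. The sketch below is an illustration under assumptions: TinyPool is a hypothetical minimal pool built on sqlite3 and queue, not SQLAlchemy's implementation, and view() stands in for a real request handler.

```python
import functools
import queue
import sqlite3


class TinyPool:
    """Hypothetical minimal pool; real code would use SQLAlchemy or psycopg2's pool."""

    def __init__(self, size):
        self._idle = queue.Queue(maxsize=size)
        for _ in range(size):
            # check_same_thread=False lets pooled connections move between threads.
            self._idle.put(sqlite3.connect(":memory:", check_same_thread=False))

    def acquire(self):
        return self._idle.get()  # blocks when every connection is in use

    def release(self, conn):
        self._idle.put(conn)


@functools.lru_cache(maxsize=None)
def get_pool():
    """Build the pool on first use and return the same instance afterwards."""
    return TinyPool(size=5)


def view():
    pool = get_pool()  # never TinyPool(...) inside the handler
    conn = pool.acquire()
    try:
        return conn.execute("SELECT 1").fetchone()[0]
    finally:
        pool.release(conn)


results = [view() for _ in range(50)]
print(len(results), get_pool() is get_pool())  # 50 True
```

Fifty calls share the same five connections, where the Flask app described above would have built fifty pools.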

The Silent Killer: Unclosed Sockets and File Descriptors

Python closes sockets and file descriptors when the owning object is garbage-collected, but that is not always immediate: an object caught in a reference cycle stays open until the cycle collector finds it. If you open a WebSocket connection or a file handle inside a request handler and don’t explicitly close it, the underlying OS resource can remain open until that collection happens. Under 50 concurrent requests, you can accumulate 50 open sockets, each with a kernel receive buffer that defaults to 208 KB on Linux. That’s up to roughly 10 MB of kernel memory you can’t see in your Python process’s RSS, but it still counts against system limits and adds to the pressure that ends in an OOM kill.
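
Closing deterministically takes no more than a with block, because socket objects are context managers. A minimal sketch, using a local socket pair so that it needs no network:

```python
import socket


def echo_once(payload):
    """Send payload across a local socket pair and close both ends deterministically."""
    left, right = socket.socketpair()
    with left, right:  # both sockets are closed on exit, even if an error is raised
        left.sendall(payload)
        data = right.recv(len(payload))
    # A closed socket reports file descriptor -1.
    return data, left.fileno(), right.fileno()


print(echo_once(b"ping"))  # (b'ping', -1, -1)
```

The same pattern applies to files and to any client object that exposes close() or supports the with statement.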

How to Profile the Exact Cause in Your Own Backend

Stop guessing. Use tracemalloc to snapshot memory allocations before and during a load test. Wrap your entry point with tracemalloc.start() and call tracemalloc.take_snapshot() at baseline and at peak. Compare the two snapshots to see which lines of code are allocating the most memory.

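import tracemalloc

tracemalloc.start()
baseline = tracemalloc.take_snapshot()  # before the load test

# run your load test here

peak = tracemalloc.take_snapshot()      # at peak load
# Rank source lines by how much memory they gained since the baseline.
top_stats = peak.compare_to(baseline, 'lineno')

for stat in top_stats[:10]:
    print(stat)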

This will tell you which file and line number are responsible for the memory spike. In my experience, it is usually a database query that loads an entire table into memory, or a middleware that caches data per-request instead of per-worker.

The Load Test That Reveals Everything

Write a simple Locust script that ramps from 1 to 100 concurrent users over 60 seconds. Monitor the VmRSS field in /proc/<pid>/status, or use psutil’s memory_info().rss, to log memory every 100 milliseconds. Plot the curve. If it’s linear, your memory management is working. If it’s exponential, you have a leak. If it’s a step function that jumps at 50 requests, you’ve most likely hit the GC threshold or the connection pool limit.
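
If you would rather not add a dependency for the sampling side, VmRSS can be read directly on Linux. This is a sketch with hypothetical helper names; the sample text exists only to show the format being parsed.

```python
import os


def parse_vmrss_kb(status_text):
    """Return the VmRSS value in kB from the text of /proc/<pid>/status, or None."""
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            return int(line.split()[1])
    return None


def current_rss_kb(pid=None):
    """Read resident memory for a process on Linux; returns None elsewhere."""
    path = f"/proc/{pid or os.getpid()}/status"
    try:
        with open(path) as handle:
            return parse_vmrss_kb(handle.read())
    except OSError:
        return None


sample = "Name:\tgunicorn\nVmPeak:\t  900000 kB\nVmRSS:\t  460800 kB\n"
print(parse_vmrss_kb(sample))  # 460800
print(current_rss_kb())
```

Calling current_rss_kb(pid) in a loop with time.sleep(0.1) and writing each value with a timestamp gives the curve to plot.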

Practical Fixes That Stop the Spike Cold

First, switch to a worker model that matches your I/O pattern. If your views are I/O-bound, use uvicorn with async handlers and a single process, or Gunicorn with gevent workers. This eliminates the 50x process overhead. If your views are CPU-bound, use Gunicorn with sync workers but cap the count at (2 * CPU cores) + 1. Also keep the number of workers multiplied by each worker’s maximum pool size below the connection limit of your database, or the extra workers will only queue for connections.
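
Both sizing rules fit in a few lines. The function names below are hypothetical, and the pool numbers reuse the pool_size=10 and max_overflow=20 example from earlier in the article.

```python
import os


def sync_worker_count(cpu_cores):
    """The (2 * cores) + 1 rule of thumb for sync workers."""
    return 2 * cpu_cores + 1


def max_db_connections(workers, pool_size, max_overflow):
    """Worst-case connections if every worker fills its pool and its overflow."""
    return workers * (pool_size + max_overflow)


cores = os.cpu_count() or 1
workers = sync_worker_count(cores)
print(workers, max_db_connections(workers, pool_size=10, max_overflow=20))
```

On a 4-core machine this gives 9 workers and a worst case of 270 connections, which is worth comparing with your database's configured limit before deploying.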

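Second, create all connection pools and client sessions once, at module import time or in your framework’s startup hook. Async clients such as aiohttp.ClientSession need a running event loop, so the startup hook is the right place for them. Store them in a global variable or a dependency injection container. Never create a new engine, session, or HTTP client inside a request handler. This single change eliminates the most common source of unbounded memory growth.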

Third, configure Python’s garbage collector more aggressively. CPython’s long-standing defaults are already gc.set_threshold(700, 10, 10), so calling it with those values changes nothing; lower the first threshold, for example gc.set_threshold(200, 10, 10) as a starting point to measure, to make collections run more often. This adds a small CPU overhead per request but keeps cyclic garbage from stacking up before a collection cycle runs. For high-throughput services, consider calling gc.collect() manually after processing a batch of requests in a background worker.
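
A sketch of both knobs follows. The threshold values and the process_batch function are hypothetical; treat the numbers as something to benchmark, not as a recommendation.

```python
import gc

original = gc.get_threshold()

# Hypothetical values: a lower first threshold makes young-generation
# collections run more often. Measure before adopting any particular numbers.
gc.set_threshold(200, 10, 10)


def process_batch(items):
    """Hypothetical batch job that forces a full collection when it finishes."""
    results = [item * 2 for item in items]
    gc.collect()  # full collection across all generations
    return results


print(gc.get_threshold(), process_batch([1, 2, 3]))
gc.set_threshold(*original)  # restore the interpreter's previous settings
```

In a real service the set_threshold call belongs in worker startup code, and the last line is only there so the sketch leaves the interpreter as it found it.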

Fourth, use streaming responses for any endpoint that returns large datasets. With Django, use StreamingHttpResponse. With FastAPI, use StreamingResponse with an async generator. This prevents the entire response body from being buffered in memory before sending. For database queries, use server-side cursors: in psycopg2, pass a name to connection.cursor(); in SQLAlchemy, set the stream_results=True execution option.
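
The generator side of a streaming response is framework-independent. This minimal sketch encodes one CSV line at a time; stream_csv is a hypothetical helper, and its output could be handed to Django’s StreamingHttpResponse or FastAPI’s StreamingResponse.

```python
import csv
import io
import itertools


def stream_csv(header, rows):
    """Yield one encoded CSV line at a time so the full body is never held in memory."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    for row in itertools.chain([header], rows):
        writer.writerow(row)
        yield buffer.getvalue().encode("utf-8")
        buffer.seek(0)
        buffer.truncate(0)  # reuse the buffer instead of growing it


rows = ((i, f"user{i}") for i in range(3))  # a lazy source, such as a streaming cursor
chunks = list(stream_csv(("id", "name"), rows))
print(chunks)
```

Peak memory then depends on the size of a single row, provided the row source is itself lazy.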

The One Tuning Parameter That Changed Everything

On a production system handling 200 concurrent WebSocket connections for a real-time betting platform, we reduced memory spikes by 70% by setting uvicorn’s --limit-max-requests to 1000. This makes each worker process exit after handling 1000 requests so that the process manager can replace it with a fresh one, which discards any accumulated cyclic references and returns fragmented memory to the operating system. Combined with --workers 4 and --loop uvloop, the system went from OOM-killing twice a day to zero incidents in six months.

What Happens When You Ignore This

The memory spike under 50 concurrent requests is a symptom of a deeper architectural issue: your application treats resources as disposable per-request items. This works in low-traffic prototypes, but in production, it guarantees instability. The fix isn’t more RAM—it’s disciplined lifecycle management for every socket, connection, and object your framework touches.

Start your next sprint by load-testing your current backend to 60 concurrent requests. Watch the memory graph. If it spikes, you now know exactly where to look: thread pools, connection pool creation, unclosed sockets, or GC thresholds. Fix those four things, and your Python backend will handle 500 concurrent requests without breaking a sweat.