HTTP/2 Flow Control Deadlock
Scroll to the bottom for TL;DR.
We recently opened a Reactor Netty issue on GitHub after experiencing infinite hangs on HTTP/2 connections in our WebFlux-based microservice architecture. The hangs were isolated to one flow, which typically processes the largest data volumes of any individual HTTP call within our system.
The flow involved two services, A and B, and two separate endpoint calls, each of which handles a reactive stream of data (Flux<Data>):
- A -> B to retrieve data S (c1)
- A -> B to write data S' (c2)
A first calls B to retrieve data set S, performs some business logic which mutates S, then calls B again with S'. We split S' into windows of 50 elements, allowing us to distribute the load across various instances of B. The number of concurrent windows of S' was limited to 4.
public Flux<Data> getAndWrite(int elementCount) {
    return getData(elementCount)              // c1: stream S from B
            .window(50)                       // split into windows of 50 elements
            .flatMap(this::writeData, 4);     // c2: window concurrency = 4
}

// c1: GET a stream of `count` elements from B
private Flux<Data> getData(int count) {
    return webClient.get()
            .uri("/get/" + count)
            .retrieve()
            .bodyToFlux(Data.class);
}

// c2: POST one window to B as a streaming request body
private Flux<Data> writeData(Flux<Data> datas) {
    return webClient.post()
            .uri("/write")
            .body(BodyInserters.fromPublisher(datas, Data.class))
            .retrieve()
            .bodyToFlux(Data.class);
}
What we tried
After numerous failed experiments and hypotheses we switched to HTTP/1.1 and re-ran the test suite. It was a simple two-service suite with a parameterised endpoint allowing simulation with different dataset sizes. On my local machine, it would deterministically fail when the dataset size reached a known value, yet with HTTP/1.1 it passed regardless of the size.
This led us to believe the issue was isolated to HTTP/2 and perhaps related to stream multiplexing, which allows multiple HTTP requests to flow concurrently on a single TCP connection.
After various other tests that provided data points that we couldn't quite connect together, it became apparent that using separate WebClient instances, and thus separate connection pools (from A -> B), for calls c1 and c2 prevented the stalls from occurring (whilst remaining on HTTP/2).
In addition, removing the concurrency limit (4, in the code sample above) prevented the stalls for all input sizes.
At this point, we had gathered some useful facts, but needed an extra hand.
The cause
Flow Control
Before we dive into the details, it is useful to know that HTTP/2 implements flow control (RFC 9113) at both the stream and connection level, as a mechanism to 'protect endpoints that are operating under resource constraints'. In essence, it lets upstream and downstream systems that produce and consume at different rates share a connection.
An essential aspect of flow control in HTTP/2 is the WINDOW_UPDATE frame, which the receiver uses to tell the sender how many additional bytes it can accept, either for a single stream or for the entire connection. The initial window size is 65,535 bytes in both cases. In a single request from A to B, both A and B act as receivers (depending on the direction of data flow) and each advertises its own windows to the sender. Once a window is exhausted, no further DATA frames can be sent until the receiver replenishes it with a WINDOW_UPDATE frame.
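The accounting described above can be sketched in a few lines of plain Java. This is a toy model written for this post, not Reactor Netty or Netty code: the class and method names are hypothetical, and it only tracks the sender's view of the windows. By convention in HTTP/2, a WINDOW_UPDATE on stream 0 applies to the connection as a whole.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of sender-side HTTP/2 flow control (illustrative only, not a Netty API). */
final class FlowControlModel {
    static final int INITIAL_WINDOW = 65_535; // initial window size in bytes (RFC 9113)

    private int connectionWindow = INITIAL_WINDOW;
    private final Map<Integer, Integer> streamWindows = new HashMap<>();

    /** Opens a stream with the initial per-stream window. */
    void openStream(int streamId) {
        streamWindows.put(streamId, INITIAL_WINDOW);
    }

    /** Sends as many DATA bytes as BOTH windows allow and returns the number actually sent. */
    int sendData(int streamId, int bytes) {
        int allowed = Math.min(bytes, Math.min(connectionWindow, streamWindows.get(streamId)));
        connectionWindow -= allowed;
        streamWindows.merge(streamId, -allowed, Integer::sum);
        return allowed;
    }

    /** Applies a WINDOW_UPDATE: stream 0 credits the connection, any other id credits that stream. */
    void windowUpdate(int streamId, int increment) {
        if (streamId == 0) {
            connectionWindow += increment;
        } else {
            streamWindows.merge(streamId, increment, Integer::sum);
        }
    }
}
```

In this model, a sender that tries to push 100,000 bytes on a fresh stream gets only 65,535 through, and crediting the stream alone does not unblock it: the connection window must be replenished as well.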
Problem Statement
(Huge props to chemicL for the diagnosis)
Let's break this down...
Consider HTTP/2 window size = 64KB and connection pool size = 1.
Step 1: Client A sends a request on Stream 1
Client A sends a small GET request on Stream 1 to Server B.
Server B prepares a large response (larger than 64KB).
Flow Control State:
• Connection window: 64KB (full).
• Stream 1 window: 64KB (full).
Step 2: Server B sends a large response on Stream 1
Server B sends 64KB of DATA frames on Stream 1, consuming the entire
connection-level and stream-level flow control windows.
Server B is now blocked and cannot send more data until Client A sends a
WINDOW_UPDATE.
Flow Control State:
• Connection window: 0KB (fully used).
• Stream 1 window: 0KB (fully used).
Step 3: Client A consumes part of the response and starts Stream 2
Client A consumes 32KB of data from Stream 1 and sends a WINDOW_UPDATE for
32KB to Server B.
Client A initiates a second request to Server B on Stream 2, using some of
the data from Stream 1.
HEADERS are not subject to flow control, so the request on Stream 2 is sent
successfully.
Flow Control State:
• Connection window: 32KB (available).
• Stream 1 window: 32KB (available).
• Stream 2: Request headers sent, ready to receive data.
Step 4: Server B sends a response on Stream 2
Server B processes the request on Stream 2 and sends 32KB of DATA frames,
consuming the remaining connection window.
The connection-level window is now 0KB again, so Server B is fully blocked
from sending data on both Stream 1 and Stream 2.
Flow Control State:
• Connection window: 0KB (fully used).
• Stream 1 window: 32KB (available, but blocked by the connection).
• Stream 2 window: 32KB (available, but blocked by the connection).
Step 5: Deadlock (No Progress Possible)
Client A is waiting for Server B to respond on Stream 2.
Server B is waiting for Client A to consume more data from Stream 1 and
send a WINDOW_UPDATE.
Since Client A doesn’t consume more data from Stream 1 (due to concurrency
limits, in our case), the connection window remains blocked, and both
parties are stuck.
Flow Control State:
• Connection window: 0KB (fully blocked).
• Stream 1 & Stream 2: Blocked.
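The five steps can be replayed as simple arithmetic. The sketch below is a hypothetical helper, not library code: it tracks the windows in KB from Server B's point of view and assumes every stream starts with the same 64KB window as the connection. Under that assumption Stream 2 still holds 32KB of its own credit after step 4; it is the shared connection window reaching zero that stops both streams.

```java
/** Replays the walkthrough; all values are in KB, as seen by Server B (the sender). */
final class DeadlockWalkthrough {

    static final class Windows {
        int connection = 64;
        int stream1 = 64;
        int stream2 = 64; // assumption: Stream 2 also starts with a 64KB window

        /** Data B may send on a stream is capped by that stream's window AND the connection's. */
        int sendable(int streamWindow) {
            return Math.min(connection, streamWindow);
        }
    }

    static Windows replay() {
        Windows w = new Windows();
        // Step 2: B sends 64KB of DATA on Stream 1.
        w.connection -= 64;
        w.stream1 -= 64;
        // Step 3: A consumes 32KB and returns that credit to both Stream 1 and the connection.
        w.connection += 32;
        w.stream1 += 32;
        // Step 4: B sends 32KB of DATA on Stream 2.
        w.connection -= 32;
        w.stream2 -= 32;
        return w;
    }

    /** Step 5: B cannot send on either stream until A returns connection-level credit. */
    static boolean isStalled(Windows w) {
        return w.sendable(w.stream1) == 0 && w.sendable(w.stream2) == 0;
    }
}
```

Calling DeadlockWalkthrough.isStalled(DeadlockWalkthrough.replay()) returns true: both streams have per-stream credit left, yet neither can make progress.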
Key Points
- The connection-level flow control window is shared across all streams, meaning one stream’s usage can block others.
- This deadlock happens because Client A and Server B are waiting on each other:
- Server B waits for a WINDOW_UPDATE to free up connection flow control, but Client A's concurrency constraints make this impossible.
- Client A waits for a response on Stream 2, which would free up a concurrency slot, but Server B cannot send it.
The solutions
Various options exist as workarounds or protections against the window consumption deadlock described above, each with its own implications. Let's consider a few:
Option 1: Avoid Cyclical Calls on the Same Connection
- Separate Connection Pools:
- Use different WebClient instances (each with its own connection pool) for each call.
- Pros: Eliminates contention at the connection level; straightforward setup if additional connections are acceptable.
- Cons: More TCP connections; slightly higher resource usage.
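A minimal sketch of Option 1 with Spring WebFlux and Reactor Netty follows. The class name, method name and pool names are hypothetical, and the snippet assumes cleartext HTTP/2 (H2C); a TLS setup would use HttpProtocol.H2 together with secure().

```java
import org.springframework.http.client.reactive.ReactorClientHttpConnector;
import org.springframework.web.reactive.function.client.WebClient;
import reactor.netty.http.HttpProtocol;
import reactor.netty.http.client.HttpClient;
import reactor.netty.resources.ConnectionProvider;

final class SeparateClients {

    /** Builds a WebClient backed by its own connection pool, so its streams never share a connection with another client's. */
    static WebClient dedicatedClient(String poolName, String baseUrl) {
        ConnectionProvider pool = ConnectionProvider.builder(poolName).build();
        HttpClient httpClient = HttpClient.create(pool)
                .protocol(HttpProtocol.H2C); // assumption: cleartext HTTP/2
        return WebClient.builder()
                .baseUrl(baseUrl)
                .clientConnector(new ReactorClientHttpConnector(httpClient))
                .build();
    }
}
```

One client would then be created for c1 (for example dedicatedClient("c1-read", baseUrl)) and another for c2, so that the read and write calls can no longer compete for the same connection-level window.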
Option 2: Increase the HTTP/2 Flow Control Window
- Adjust initialWindowSize in Reactor Netty:
- Raising the default window from 65,535 bytes can reduce the likelihood of hitting a deadlock.
- Pros: Easy configuration change; good intermediate workaround.
- Cons: Not a permanent fix if data volumes or concurrency keep growing; may only delay a potential deadlock.
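As a sketch of Option 2, Reactor Netty exposes the HTTP/2 settings through http2Settings on the client. The class and method names are hypothetical, the protocol is again assumed to be H2C, and the right window size depends on your payloads and memory budget.

```java
import reactor.netty.http.HttpProtocol;
import reactor.netty.http.client.HttpClient;

final class LargerWindow {

    /** Returns an HttpClient that advertises a larger initial flow control window for data it receives. */
    static HttpClient withInitialWindow(int windowBytes) {
        return HttpClient.create()
                .protocol(HttpProtocol.H2C) // assumption: cleartext HTTP/2
                .http2Settings(settings -> settings.initialWindowSize(windowBytes));
    }
}
```

Note that this setting governs what the client is willing to receive. For request bodies flowing from A to B, the receiver is B, so the equivalent setting would likely need raising on the server side too.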
Option 3: Fully Consume One Call’s Data Before Making the Next
- Buffer All Data from the First Call:
- Ensure the first response is fully read (and potentially buffered) before initiating the second call.
- Pros: No chance of overlapping streams competing for flow control on the same connection.
- Cons: Requires buffering large payloads, losing streaming benefits; increased memory usage.
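Applied to a pipeline shaped like getAndWrite, Option 3 might look like the sketch below. It uses Reactor's collectList to drain the first call completely before any window is emitted. The helper class and method are hypothetical, and the whole first response is held in memory, which is exactly the trade-off listed above.

```java
import java.util.function.Function;
import org.reactivestreams.Publisher;
import reactor.core.publisher.Flux;

final class BufferedPipeline {

    /**
     * Reads `source` to completion, then replays it in windows of `windowSize`,
     * handing at most `concurrency` windows to `writer` at a time.
     */
    static <T, R> Flux<R> drainThenProcess(Flux<T> source, int windowSize, int concurrency,
                                           Function<Flux<T>, Publisher<R>> writer) {
        return source
                .collectList()                    // first call fully consumed and buffered
                .flatMapMany(Flux::fromIterable)  // replay the buffered elements
                .window(windowSize)
                .flatMap(writer, concurrency);    // concurrency limit is preserved
    }
}
```

With the methods from the earlier example, the call would be along the lines of drainThenProcess(getData(elementCount), 50, 4, this::writeData).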
Option 4: Increase or Remove Concurrency Limits
- If your reactive pipeline uses operators like flatMap(this::writeData, X), where X is the concurrency limit, consider raising X significantly or removing the limit. By allowing more simultaneous consumption, you’re less likely to stall a particular stream, and so you keep issuing WINDOW_UPDATE frames more regularly.
- Pros
- Reduces Partial Consumption: More concurrency means the client promptly reads from all in-flight streams, freeing up window space.
- Cons
- Not a Silver Bullet: If your application logic inherently delays reading a stream (or you deal with extremely large payloads), deadlocks can still occur under certain conditions.
What we did
Option 3 (reading all data from the initial call) suited our needs because our services are not memory constrained, and the flow in question is asynchronous with no stringent latency requirements. This notably allowed us to keep our concurrency limits whilst fully protecting against deadlock.
In addition, regardless of solution choice, we also recommend configuring sensible timeouts for all HTTP requests as an additional layer of protection against deadlock (and to fail fast, but that's a different topic!).
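One way to add such a guard at the pipeline level is Reactor's timeout operator, which fails the stream with a TimeoutException if no element arrives within the given duration. The wrapper below is a hypothetical helper, and the appropriate duration is an assumption that depends on your workload.

```java
import java.time.Duration;
import reactor.core.publisher.Flux;

final class StallGuard {

    /** Fails the call if the gap before the first element, or between elements, exceeds `maxGap`. */
    static <T> Flux<T> failIfStalled(Flux<T> call, Duration maxGap) {
        return call.timeout(maxGap);
    }
}
```

Wrapping each HTTP call this way turns an indefinite hang into an error that can be logged, retried or surfaced to the caller.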
Will HTTP/3 help?
(Thanks to Lucas Pardue on X for the heads up)
HTTP/3, like HTTP/2, is a stream-based protocol that multiplexes multiple streams over a single connection. Unfortunately, it can suffer from the same issue for similar reasons. HTTP/3 delegates flow control to QUIC, which, like HTTP/2, limits data both per stream and across the whole connection, so the connection-level limit can still be exhausted and lead to the same deadlock scenario.
Lucas, who is debugging a similar issue on behalf of Cloudflare, highlighted an intriguing open Chromium issue. This issue describes a deadlock that occurs when 13 <video> elements are loaded concurrently within an HTML webpage. The problem manifests in both HTTP/2 and HTTP/3 due to the exhaustion of the connection-level flow control window.
TL;DR
In HTTP/2, all streams share one connection-level flow-control window. When a single client (A) makes two separate calls (A→B and A→B again) on the same connection, the first call can consume the shared window and block the second call. Meanwhile, the second call can also partially consume (or demand) window space before the first call has fully finished. If neither call’s data is consumed quickly enough (and WINDOW_UPDATE frames aren’t sent), each call ends up waiting on the other to free the shared connection window. The result is a deadlock where no progress is possible.
HTTP/3, which similarly shares a single connection-level flow-control window, suffers the same fate.
Links
- Reactor Netty issue: https://github.com/reactor/reactor-netty/issues/3495
- HTTP/2 RFC: https://datatracker.ietf.org/doc/html/rfc9113
- HTTP/3 RFC: https://datatracker.ietf.org/doc/html/rfc9114
- Lucas on X: https://x.com/SimmerVigor/status/1875101622747173255
- Chromium window deadlock: https://issues.chromium.org/issues/41161335?pli=1