Quick Facts
- Category: Linux & DevOps
- Published: 2026-05-19 01:11:23
Introduction
At the heart of modern internet communication lies the CUBIC congestion control algorithm, standardized in RFC 9438 and used as the default in Linux. It governs how both TCP and QUIC connections probe for bandwidth, respond to packet loss, and recover. Cloudflare's open-source QUIC implementation, quiche, relies on CUBIC as its primary congestion controller, which places the algorithm in the critical path for a large portion of Cloudflare's traffic. This article examines a bug in which CUBIC's congestion window (cwnd) becomes permanently stuck at its minimum after a congestion event and never recovers. The issue began with a Linux kernel update designed to align CUBIC with RFC 9438's app-limited exclusion rule: a fix for TCP that, when ported to QUIC, triggered unexpected behavior. The resolution was a near-one-line change that broke the problematic cycle.

Understanding CUBIC and Congestion Control
Congestion control algorithms (CCAs) manage the flow of data across networks by adjusting a key parameter called the congestion window (cwnd). This window caps the number of bytes a sender can have in flight—sent but not yet acknowledged—at any time. A larger cwnd allows more data per round trip, while a smaller one throttles the transmission. CUBIC, a loss-based CCA, follows a straightforward philosophy: increase the sending rate when the network appears healthy (no loss) and decrease it when loss is detected, assuming capacity has been exceeded. This approach aims to maximize bandwidth utilization without overwhelming the network.
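The growth half of that philosophy can be made concrete with the window function from RFC 9438, W_cubic(t) = C*(t - K)^3 + W_max. The following is a minimal sketch rather than quiche's or Linux's code: the function names are made up for illustration, windows are measured in segments, and it assumes cwnd_epoch is no larger than w_max, as is the case right after a loss.

```python
# Constants as given in RFC 9438; window sizes in segments, time in seconds.
C = 0.4            # CUBIC scaling constant
BETA_CUBIC = 0.7   # multiplicative decrease factor applied on loss


def cubic_k(w_max: float, cwnd_epoch: float) -> float:
    """Seconds the curve needs to climb from cwnd_epoch back to w_max.

    Assumes cwnd_epoch <= w_max (illustrative simplification).
    """
    return ((w_max - cwnd_epoch) / C) ** (1.0 / 3.0)


def w_cubic(t: float, w_max: float, cwnd_epoch: float) -> float:
    """Target window t seconds into the current congestion avoidance epoch."""
    k = cubic_k(w_max, cwnd_epoch)
    return C * (t - k) ** 3 + w_max


# A loss at 100 segments starts a new epoch at 70 segments.
w_max = 100.0
cwnd_epoch = BETA_CUBIC * w_max
k = cubic_k(w_max, cwnd_epoch)
```

With these inputs the curve starts at the reduced window, flattens as it approaches w_max around t = K, and only then probes beyond it. The elapsed time t is therefore the quantity that decides how fast the window recovers.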
While the core logic seems simple, it rests on assumptions that have been refined over the years. One such refinement is the app-limited exclusion in RFC 9438 §4.2-12, which requires that the elapsed time driving CUBIC's growth curve exclude periods when the sender itself, not the network, was limiting the data in flight. Without it, the cwnd can inflate after an idle or lightly loaded spell even though the path was never probed at that rate. However, this rule, designed with TCP in mind, proved problematic when adapted to QUIC, leading to the bug described next.
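One way to picture the app-limited exclusion is as a clock that stops while the sender is app-limited, so that idle time is never credited toward window growth. The sketch below is an assumption-laden illustration, not the kernel patch or quiche's code: the class and method names are hypothetical, and a real implementation would hook these transitions into its send and ACK paths.

```python
class EpochClock:
    """Tracks elapsed epoch time, excluding app-limited periods (hypothetical helper)."""

    def __init__(self, t_epoch: float):
        self.t_epoch = t_epoch        # start of the congestion avoidance epoch
        self.limited_since = None     # set while the sender is app-limited

    def on_app_limited(self, now: float) -> None:
        # The application stopped filling the window: pause the clock.
        if self.limited_since is None:
            self.limited_since = now

    def on_cwnd_limited(self, now: float) -> None:
        # The window is full again: shift the epoch start forward by the
        # length of the app-limited period so that time is not counted.
        if self.limited_since is not None:
            self.t_epoch += now - self.limited_since
            self.limited_since = None

    def elapsed(self, now: float) -> float:
        # While paused, time is frozen at the moment the pause began.
        end = self.limited_since if self.limited_since is not None else now
        return end - self.t_epoch


clock = EpochClock(t_epoch=0.0)
clock.on_app_limited(now=2.0)     # sender goes quiet after 2 s
clock.on_cwnd_limited(now=5.0)    # and fills the window again 3 s later
```

In this example `clock.elapsed(6.0)` is 3.0 rather than 6.0, because the three quiet seconds are excluded. The sketch also shows the hazard: if nothing ever signals the end of the app-limited period, the clock stays frozen indefinitely.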
The Bug: Cwnd Stuck at Minimum
The anomaly surfaced as persistent failures in Cloudflare's ingress proxy integration tests. These tests exercised CUBIC under severe packet loss early in a connection, simulating a congestion collapse followed by recovery. Normally, a congestion controller gradually increases the cwnd after a collapse, restoring throughput. In these tests, however, the cwnd remained locked at its minimum value and never recovered. The failures were intermittent, occurring about 61% of the time in specific scenarios.
Recovery after congestion collapse is a rare regime, yet it is precisely what a congestion controller is designed to handle. Most testing focuses on steady-state and growth phases, leaving the corner case of minimum cwnd less explored. Bugs in this state space remain hidden in typical throughput simulations, but they can cause real-world performance degradation when they trigger.
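A regression check for this regime can be as small as an assertion over a recorded cwnd trace. The helper below is a hypothetical sketch and not taken from Cloudflare's test suite: it assumes one cwnd sample per round trip and a caller-chosen recovery deadline.

```python
def recovers(cwnd_trace, min_cwnd, within_rtts):
    """Return True if cwnd climbs above min_cwnd soon after first hitting it.

    cwnd_trace:  cwnd samples in bytes, one per RTT (assumed sampling scheme)
    min_cwnd:    the controller's minimum window in bytes
    within_rtts: how many samples after the collapse recovery may take
    """
    for i, cwnd in enumerate(cwnd_trace):
        if cwnd <= min_cwnd:
            window = cwnd_trace[i + 1 : i + 1 + within_rtts]
            return any(sample > min_cwnd for sample in window)
    return True  # the trace never collapsed to the minimum


MIN_CWND = 2400  # illustrative: two 1200-byte packets
stuck_trace = [12000, 8400, 2400, 2400, 2400, 2400, 2400]
healthy_trace = [12000, 8400, 2400, 3600, 4800, 6000, 7200]
```

Here `recovers(stuck_trace, MIN_CWND, 4)` is False and `recovers(healthy_trace, MIN_CWND, 4)` is True. Running such a check across many seeded loss patterns is one way to expose a failure that only appears in a fraction of runs.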
Root Cause: App-Limited Exclusion in QUIC
The root cause traced back to the Linux kernel patch implementing RFC 9438's app-limited exclusion. In TCP, this patch keeps CUBIC from counting app-limited periods toward cwnd growth, i.e., periods when the application has too little data to send and the sender is not filling the current window. The logic is sound for TCP because the kernel can track application behavior accurately.

However, when Cloudflare ported this logic to quiche (a userspace QUIC implementation), the interaction changed. QUIC, unlike TCP, runs in userspace and multiplexes many streams over a single connection. App-limited detection that was designed around a single byte stream became ambiguous with multiple streams. Under certain conditions, the algorithm concluded that the connection was app-limited when it was not, or it failed to exit that state after a congestion collapse. As a result, the cwnd stayed pinned at its minimum value, preventing any recovery.
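The multi-stream ambiguity can be illustrated with a connection-level check that looks at every stream rather than just one. This is a hypothetical sketch, not quiche's data model: the `Stream` type, its fields, and the function name are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Stream:
    stream_id: int
    pending_bytes: int  # queued by the application but not yet sent


def connection_app_limited(streams: List[Stream], bytes_in_flight: int, cwnd: int) -> bool:
    """A connection is app-limited only if the window has room AND no stream
    has data waiting. Looking at a single stream can give the wrong answer."""
    window_has_room = bytes_in_flight < cwnd
    any_pending = any(s.pending_bytes > 0 for s in streams)
    return window_has_room and not any_pending


streams = [Stream(stream_id=0, pending_bytes=0),
           Stream(stream_id=4, pending_bytes=5000)]
```

Judged by stream 0 alone, this connection looks idle. Judged as a whole, it still has 5000 bytes waiting, so `connection_app_limited(streams, 1200, 2400)` is False and window growth should not be suppressed.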
The Fix: A Simple Yet Effective Solution
After thorough analysis, the fix turned out to be remarkably concise: a near-one-line change in the quiche codebase. It adjusted the condition under which the app-limited exclusion is applied. Specifically, the team modified the logic to correctly reset the app-limited state after a congestion event, so the cwnd could grow again. This small alteration broke the cycle that had pinned the cwnd at its minimum, restoring normal recovery behavior.
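The effect of resetting the app-limited state on a congestion event can be shown with a toy controller. This is a deliberately simplified model and not the quiche patch: the class, the sticky `app_limited` flag, the linear growth step, and the 1200-byte packet size are all assumptions chosen to make the cycle visible.

```python
MSS = 1200             # assumed packet size in bytes
MIN_CWND = 2 * MSS     # assumed minimum window
BETA_CUBIC = 0.7       # multiplicative decrease factor from RFC 9438


class ToyController:
    def __init__(self, reset_on_congestion: bool):
        self.cwnd = 10 * MSS
        self.app_limited = False
        self.reset_on_congestion = reset_on_congestion

    def on_app_idle(self) -> None:
        # The application briefly had nothing to send.
        self.app_limited = True

    def on_congestion_event(self) -> None:
        self.cwnd = max(MIN_CWND, int(self.cwnd * BETA_CUBIC))
        if self.reset_on_congestion:
            # The modeled fix: a loss proves the path, not the app, is the limit.
            self.app_limited = False

    def on_ack(self, acked_bytes: int) -> None:
        if self.app_limited:
            return                    # growth suppressed while flagged
        self.cwnd += acked_bytes      # stand-in for the real CUBIC growth step


def run(reset_on_congestion: bool) -> int:
    cc = ToyController(reset_on_congestion)
    cc.on_app_idle()
    for _ in range(10):               # heavy early loss collapses the window
        cc.on_congestion_event()
    for _ in range(5):                # then ACKs arrive normally
        cc.on_ack(MSS)
    return cc.cwnd
```

In this model `run(False)` ends at 2400 bytes, the minimum, because the stale flag blocks every growth step, while `run(True)` ends at 8400 bytes. The real change in quiche may differ in detail; the sketch only shows why clearing the state on a congestion event is enough to break such a cycle.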
The fix was deployed and validated against the failing tests, reducing the failure rate to zero. It also integrated smoothly into the existing code, highlighting how a nuanced understanding of protocol differences—between TCP and QUIC—can prevent subtle bugs.
Conclusion
This story underscores the challenges of porting protocol features between different transport implementations. A well-intentioned patch for TCP's CUBIC inadvertently introduced a regression in QUIC, exposing a corner case that testing had missed. The resolution—a minimal code change—demonstrates the value of careful testing and deep knowledge of congestion control mechanics. For Cloudflare and the broader internet, this fix ensures that CUBIC continues to serve as a reliable congestion controller for QUIC traffic, even under adverse conditions.