6 Surprising Lessons from a CUBIC Congestion Control Bug in QUIC

From Xshell Ssh, the free encyclopedia of technology

In the world of network protocols, even the most robust algorithms can harbor hidden quirks. CUBIC, the default congestion controller in Linux (standardized in RFC 9438), governs how most TCP and QUIC connections probe for bandwidth and react to loss. At Cloudflare, our open-source QUIC implementation (quiche) relies on CUBIC for a significant portion of traffic. This article describes a peculiar bug in which CUBIC's congestion window gets stuck at its minimum and never recovers from collapse. The journey began with a Linux kernel update that aligned CUBIC with the app-limited exclusion rule, a fix that triggered unexpected failures when ported to quiche. The resolution was a nearly one-line code change. Here are six key takeaways from this investigation.

1. CUBIC's Core Logic: A Quick Refresher

CUBIC operates by adjusting the congestion window (cwnd), a sender-side cap on bytes in flight. While no packet loss occurs, CUBIC keeps increasing cwnd to probe for available bandwidth, following a cubic function of the time since the last reduction. When it detects loss, it treats the loss as a sign that network capacity has been exceeded and reduces cwnd. Loss is only a proxy for congestion, however, so this logic has limitations, especially in modern networks with diverse traffic patterns. Understanding this foundation is crucial to grasping how a small bug can cause big problems.
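As a rough illustration of the growth and reduction described above, the sketch below implements the window function from RFC 9438 in Python. The constants C = 0.4 and beta = 0.7 come from the RFC; the function names, the segment-based units, and the minimum window of 2 segments are assumptions made for this example, and the Reno-friendly region is left out for brevity. It is not quiche's code.

```python
# Constants from RFC 9438. Windows are in MSS-sized segments, time in seconds.
C = 0.4            # scaling constant of the cubic curve
BETA_CUBIC = 0.7   # multiplicative decrease factor applied on loss


def on_loss(cwnd, min_cwnd=2.0):
    """Return (w_max, new_cwnd) after a loss event.

    min_cwnd is a hypothetical floor chosen for this sketch.
    """
    w_max = cwnd
    return w_max, max(cwnd * BETA_CUBIC, min_cwnd)


def cubic_k(w_max, cwnd_epoch):
    """Seconds the curve needs to climb from cwnd_epoch back to w_max."""
    return (max(w_max - cwnd_epoch, 0.0) / C) ** (1.0 / 3.0)


def w_cubic(t, w_max, cwnd_epoch):
    """Target window t seconds after the start of the current epoch."""
    k = cubic_k(w_max, cwnd_epoch)
    return C * (t - k) ** 3 + w_max
```

In this model, a loss at 100 segments restarts the window at 70 segments, and the curve returns to 100 after roughly 4.2 seconds before probing beyond it.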

Source: blog.cloudflare.com

2. The App-Limited Exclusion: A Necessary Fix

RFC 9438 §4.2-12 sets out the app-limited exclusion principle: when a connection has no data to send (app-limited), CUBIC should not adjust its cwnd based on that idle period. This prevents cwnd from growing while the sender is not actually using the window it already has. The Linux kernel implemented this fix to solve a real TCP issue, ensuring CUBIC only operates when there is actual data flow. This change seemed straightforward, but its implications for QUIC were not immediately obvious.
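A minimal sketch of the idea, assuming that "app-limited" can be approximated as having fewer bytes in flight than cwnd allows. Real stacks use more nuanced checks; the function names and the fixed per-ACK increment here are hypothetical.

```python
def is_app_limited(bytes_in_flight: int, cwnd: int) -> bool:
    """Assumed proxy: the sender is not filling the window it already has."""
    return bytes_in_flight < cwnd


def cwnd_after_ack(cwnd: int, bytes_in_flight: int, increment: int) -> int:
    """Apply a growth step only when the flow is actually cwnd-limited."""
    if is_app_limited(bytes_in_flight, cwnd):
        return cwnd              # exclusion: leave cwnd untouched
    return cwnd + increment      # normal growth path
```

With a 12000-byte window, an ACK that arrives while only 3000 bytes are in flight leaves cwnd unchanged, while the same ACK with a full window applies the increment.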

3. Porting to QUIC: Unexpected Behavior Surfaces

When we ported the app-limited exclusion fix from Linux's CUBIC to quiche, the behavior deviated from expectations. QUIC, unlike TCP, has multiplexed streams and different acknowledgment mechanisms. The fix, originally designed for TCP stacks, interacted poorly with quiche's handling of idle periods and loss recovery. The mismatch shows how protocol-specific nuances can break a seemingly universal fix.

4. The Symptom: A Test Fails 61% of the Time

Our investigation began with erratic failures in an ingress proxy integration test. The test simulated heavy packet loss early in a connection, a scenario where CUBIC should reduce cwnd and then recover. However, in 61% of runs, the connection never recovered; cwnd stayed at its minimum value. This was not a steady-state issue but a corner case in recovery after congestion collapse. Such bugs are rarely caught by typical throughput tests, which is why edge-case testing matters.

5. Root Cause: Cwnd Pinned at Minimum

Digging into the code, we found that the app-limited exclusion check inadvertently prevented cwnd from ever increasing after collapse. When the connection was in recovery and became app-limited (no data to send for a brief moment), the ported check forced CUBIC to freeze cwnd. Since recovery involves multiple rounds of sending small amounts of data, each round trip could hit an app-limited state, permanently locking cwnd at its minimum. The check treated app-limited as meaning "no congestion signal to act on," but that assumption does not hold during loss recovery.
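The pinning effect can be reproduced in a toy model. The sketch below assumes a minimum window of 2 packets, growth of one packet per round trip, and an app-limited test of "fewer packets in flight than cwnd"; all of these, along with the names, are simplifications invented for illustration and do not mirror quiche's internals.

```python
MIN_CWND = 2  # packets; hypothetical floor after congestion collapse


def simulate(offered_per_round, cwnd=MIN_CWND):
    """Return cwnd after each round trip.

    offered_per_round lists how many packets the application has
    ready to send at the start of each round.
    """
    trace = []
    for offered in offered_per_round:
        in_flight = min(offered, cwnd)
        app_limited = in_flight < cwnd
        if not app_limited:
            cwnd += 1  # toy growth: one packet per round trip
        trace.append(cwnd)
    return trace
```

If the application offers only one packet per round, `simulate([1] * 5)` never leaves the minimum, because every round is classified as app-limited. With plenty of data queued, `simulate([10] * 3)` grows the window on every round.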

6. The Elegant One-Line Fix

The solution was surprisingly simple: a change of roughly one line that reordered the app-limited check so that it only applies when the connection is not in recovery. This allowed CUBIC to grow cwnd normally after congestion collapse and restored recovery behavior without breaking the intended app-limited exclusion. It underscores how a small change can resolve a complex bug, and it is a reminder that even well-tested algorithms have delicate state interactions.
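The shape of such a guard can be sketched as follows. This follows the description above rather than the actual quiche patch; the `State` fields and both function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class State:
    cwnd: int
    bytes_in_flight: int
    in_recovery: bool


def skip_growth_before(s: State) -> bool:
    """Assumed pre-fix behavior: app-limited always freezes cwnd."""
    return s.bytes_in_flight < s.cwnd


def skip_growth_after(s: State) -> bool:
    """Assumed post-fix behavior: the exclusion applies only outside recovery."""
    return (not s.in_recovery) and s.bytes_in_flight < s.cwnd
```

For a connection in recovery with one packet in flight and a window of two, the first predicate freezes growth and the second does not. Outside recovery the two predicates agree, so the app-limited exclusion keeps working as intended.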

The story of this bug is a testament to the intricacies of network protocol implementations. What worked for TCP in the Linux kernel needed careful adaptation for QUIC. The fix not only solved the 61% failure rate but also improved the reliability of Cloudflare's QUIC traffic. For developers working on congestion control, this serves as a valuable lesson: always test corner cases, and never assume that a patch from one stack will seamlessly transfer to another.