API is degraded
Resolved
Aug 14, 2025 at 09:26pm UTC
We are back. Job timeout metrics have recovered to pre-incident levels.
The issue came down to misconfigured pipeline queue limits on Dragonfly: we set them high in anticipation of heavy production load, but they ended up far too high. As a result, Dragonfly's backpressure mechanisms kicked in much too late, only once the state of the instance was effectively unsalvageable. We have tuned the configuration and will continue to monitor.
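To illustrate the failure mode with a toy in-memory queue (hypothetical limits, not our actual Dragonfly configuration): with a bounded queue, backpressure means producers are rejected or blocked once the limit is hit; set the limit far above any realistic load and that never happens until the instance is already overwhelmed.

```python
import queue

def enqueue_until_full(limit, jobs):
    """Push jobs into a bounded queue until backpressure kicks in."""
    q = queue.Queue(maxsize=limit)
    accepted = 0
    for job in range(jobs):
        try:
            q.put_nowait(job)
            accepted += 1
        except queue.Full:
            break  # backpressure: producer must slow down here
    return accepted

# Sane limit: the producer is pushed back after 100 jobs.
print(enqueue_until_full(100, 1000))        # 100
# Oversized limit: all 1000 jobs are accepted, no pushback at all --
# memory grows unbounded instead of the producer slowing down.
print(enqueue_until_full(10_000_000, 1000))  # 1000
```

The fix amounts to picking a limit close to real peak load, so the pushback path actually fires before the instance runs out of headroom.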
Affected services
Updated
Aug 14, 2025 at 08:48pm UTC
The issue has recurred. We are working on a fix.
Updated
Aug 14, 2025 at 08:03pm UTC
This issue was caused by an elevated number of connections overloading the Dragonfly instance that operates our scrape queue. We've applied a fix to reduce the number of extraneous connections, and we are planning further architectural work to keep connection counts down. Apologies for the disruption.
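For illustration, one common way to cap extraneous connections is a per-process pool that reuses a small, fixed set of connections instead of opening one per job. This is a generic sketch (all names are hypothetical, not our actual worker code):

```python
import threading

class ConnectionPool:
    """Cap live connections and reuse idle ones (illustrative only)."""

    def __init__(self, factory, max_connections):
        self._factory = factory
        self._sem = threading.BoundedSemaphore(max_connections)
        self._idle = []
        self._lock = threading.Lock()
        self.created = 0  # how many real connections were ever opened

    def acquire(self):
        self._sem.acquire()  # block rather than open an (N+1)th connection
        with self._lock:
            if self._idle:
                return self._idle.pop()
            self.created += 1
            return self._factory()

    def release(self, conn):
        with self._lock:
            self._idle.append(conn)
        self._sem.release()

# A worker capped at 4 connections: 100 sequential jobs reuse the same
# 4 connections instead of opening 100.
pool = ConnectionPool(factory=object, max_connections=4)
conns = [pool.acquire() for _ in range(4)]
for c in conns:
    pool.release(c)
for _ in range(100):
    c = pool.acquire()
    pool.release(c)
print(pool.created)  # 4
```

The key property is the semaphore: once the cap is reached, workers wait for an idle connection instead of piling more load onto the server.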
Updated
Aug 14, 2025 at 05:54pm UTC
The issue is now resolved.
Updated
Aug 14, 2025 at 05:44pm UTC
The issue has been mitigated. We are now slowly allowing more workers to come online.
Created
Aug 14, 2025 at 05:33pm UTC
We are investigating elevated job timeouts.