fix: ExponentialBackoff execute() bypasses maxElapsed timeout#15
Open
deepshekhardas wants to merge 20 commits into
Open
fix: ExponentialBackoff execute() bypasses maxElapsed timeout#15deepshekhardas wants to merge 20 commits into
deepshekhardas wants to merge 20 commits into
Conversation
…t build server failures (triggerdotdev#2913)
- Include reproduction scripts for Sentry (triggerdotdev#2900) and engine strictness (triggerdotdev#2913) - Include PR body drafts for consolidated tracking
- Include reproduction scripts for Sentry (triggerdotdev#2900) and engine strictness (triggerdotdev#2913) - Include PR body drafts for consolidated tracking
When the underlying logical-replication client errored (e.g. after a Postgres failover), the runs and sessions replication services logged the error and left the stream stopped. The host process kept running, the WAL backed up, and ClickHouse silently fell behind. Both services now run a configurable recovery strategy on stream errors, defaulting to in-process reconnect with exponential backoff so a fresh self-hosted setup heals on its own: - "reconnect" (default) re-subscribes via the existing subscribe(lastLsn) path with exponential backoff (1s -> 60s cap, unlimited attempts), which re-validates the publication, re-acquires the leader lock, and resumes from the last acknowledged LSN. - "exit" calls process.exit after a short flush window so a host's supervisor (Docker restart=always, systemd, k8s, etc.) can replace the process. - "log" preserves the historical behaviour. Per-service strategy + exit knobs are env-driven via RUN_REPLICATION_ERROR_STRATEGY / SESSION_REPLICATION_ERROR_STRATEGY plus matching *_EXIT_DELAY_MS / *_EXIT_CODE. Reconnect tuning is shared across both services via REPLICATION_RECONNECT_INITIAL_DELAY_MS / _MAX_DELAY_MS / _MAX_ATTEMPTS (0 = unlimited).
Addresses PR review feedback:
- LogicalReplicationClient.subscribe() can throw before its internal
"error" listener is wired up (notably when pg client.connect() fails
mid-failover). The reconnect strategy's catch block only logged, so
recovery silently stopped. Now also calls scheduleReconnect(err) — the
pendingReconnect guard makes it idempotent if an error event was also
emitted.
- Reject negative values for the new replication-recovery env vars and
cap exit codes at 255.
- Convert the new ReplicationErrorRecovery{Deps,} interfaces to type
aliases to match the repo's TypeScript style.
- Tighten the reconnect dep comment to drop a stale "lastAcknowledgedLsn"
reference (the wrapper-tracked resume LSN is what callers actually pass).
- Restore process.exit after service.shutdown() in the exit-strategy
test so a delayed exit timer can't terminate the test worker.
LogicalReplicationClient.subscribe() can resolve without throwing or emitting an "error" event when leader-lock acquisition fails — it just calls this.stop() and returns. The reconnect callback now checks isStopped after subscribe() and throws so the recovery handler can schedule the next attempt instead of silently giving up.
…rough handle() The previous post-subscribe() isStopped check was always true on the happy path: subscribe() calls stop() up front (setting _isStopped=true) and only resets the flag inside the replicationStart event, which fires asynchronously after subscribe() returns. So the check threw on every successful reconnect, the catch rescheduled, the next attempt tore down the just-built client, and the cycle continued — replication briefly worked between teardowns, which is why the integration test passed. Replace it with the correct nudge: subscribe to leaderElection and call the recovery handler on isLeader=false. That's the only subscribe() exit path that doesn't either throw or emit an "error" event (the other silent-return paths emit "error" first via createPublication/createSlot failures).
The previous commit routed leaderElection(false) through handle(), which under the exit strategy schedules process.exit. In a multi-instance deployment that turns lost leader election — a normal operational state — into a restart loop: exit, supervisor restarts, election fails again, exit, and so on. Add a dedicated notifyLeaderElectionLost() on ReplicationErrorRecovery that the reconnect strategy treats as another retry trigger, while exit and log strategies no-op. Wire the wrapper services through the new method.
fix(webapp): auto-recover replication services after stream errors
The execute() method tracked elapsedMs across retries but never checked it against maxElapsed, allowing the retry loop to continue indefinitely regardless of the configured time limit.
There was a problem hiding this comment.
1 issue found across 1 file
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="packages/core/src/v3/apps/backoff.ts">
<violation number="1" location="packages/core/src/v3/apps/backoff.ts:344">
P1: The new `maxElapsed` break is unreachable on `AttemptTimeout` retries because `continue` in `catch` skips the post-`finally` check.</violation>
</file>
Reply with feedback, questions, or to request a fix.
Re-trigger cubic
| clearTimeout(attemptTimeout); | ||
| } | ||
|
|
||
| if (elapsedMs > this.#maxElapsed * 1000) { |
There was a problem hiding this comment.
P1: The new maxElapsed break is unreachable on AttemptTimeout retries because continue in catch skips the post-finally check.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At packages/core/src/v3/apps/backoff.ts, line 344:
<comment>The new `maxElapsed` break is unreachable on `AttemptTimeout` retries because `continue` in `catch` skips the post-`finally` check.</comment>
<file context>
@@ -340,6 +340,10 @@ export class ExponentialBackoff {
clearTimeout(attemptTimeout);
}
+
+ if (elapsedMs > this.#maxElapsed * 1000) {
+ break;
+ }
</file context>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Fixes issue triggerdotdev#3726.
The execute() method tracked elapsedMs but never checked it against maxElapsed, allowing the retry loop to continue indefinitely regardless of the configured time limit. Added check after each retry attempt.
Summary by cubic
Fixes
ExponentialBackoff.execute()ignoringmaxElapsed, which could let retries run past the configured time limit. Now the loop checks elapsed time after each attempt and stops whenmaxElapsedis exceeded.elapsedMs > this.#maxElapsed * 1000) after each retry, preventing overlong or infinite retries.Written for commit bf858f4. Summary will update on new commits. Review in cubic