fix: correct 4 webapp bugs from issue #3748 #14

Open
deepshekhardas wants to merge 20 commits into main from fix/3748-logs-presenter-replay-ai-agent

deepshekhardas (Owner) commented May 26, 2026

Fixes 4 bugs identified in issue triggerdotdev#3748:

  1. LogsListPresenter: removed the inverted CLICKHOUSE guard (the error block that blocked valid queries) so ClickHouse log queries proceed.

  2. ReplayRunDialog: added the missing replayDataFetcher.data dependency to the useEffect.

  3. AIFilterInput: changed the fetcher state check from loading to idle.

  4. AgentView: added a useEffect that resets refs when sessionId changes.


Summary by cubic

Fixes four webapp bugs to restore ClickHouse log queries, stabilize the replay dialog, clear the AI filter input correctly, and prevent stale agent messages when switching sessions. Addresses issue triggerdotdev#3748.

  • Bug Fixes
    • LogsListPresenter: removed inverted CLICKHOUSE guard so ClickHouse log queries run.
    • ReplayRunDialog: added replayDataFetcher.data to useEffect deps to refresh queues reliably.
    • AIFilterInput: clear input when fetcher state is idle (not loading).
    • AgentView: reset message refs on sessionId change to stop stale data from carrying over.

Written for commit 913f7c7. Summary will update on new commits.

Deploy Bot and others added 20 commits February 2, 2026 16:16
- Include reproduction scripts for Sentry (triggerdotdev#2900) and engine strictness (triggerdotdev#2913)
- Include PR body drafts for consolidated tracking
When the underlying logical-replication client errored (e.g. after a
Postgres failover), the runs and sessions replication services logged
the error and left the stream stopped. The host process kept running,
the WAL backed up, and ClickHouse silently fell behind.

Both services now run a configurable recovery strategy on stream errors,
defaulting to in-process reconnect with exponential backoff so a fresh
self-hosted setup heals on its own:

- "reconnect" (default) re-subscribes via the existing subscribe(lastLsn)
  path with exponential backoff (1s -> 60s cap, unlimited attempts), which
  re-validates the publication, re-acquires the leader lock, and resumes
  from the last acknowledged LSN.
- "exit" calls process.exit after a short flush window so a host's
  supervisor (Docker restart=always, systemd, k8s, etc.) can replace the
  process.
- "log" preserves the historical behaviour.

Per-service strategy + exit knobs are env-driven via
RUN_REPLICATION_ERROR_STRATEGY / SESSION_REPLICATION_ERROR_STRATEGY plus
matching *_EXIT_DELAY_MS / *_EXIT_CODE. Reconnect tuning is shared
across both services via REPLICATION_RECONNECT_INITIAL_DELAY_MS /
_MAX_DELAY_MS / _MAX_ATTEMPTS (0 = unlimited).
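
The backoff schedule described above can be sketched roughly as follows. This is an illustration only, not the code from the commit: the helper names (`reconnectDelayMs`, `shouldRetry`, `ReconnectOptions`) are hypothetical, and doubling the delay on each attempt is an assumption. Only the 1s initial delay, the 60s cap, and "0 = unlimited" come from the commit message.

```typescript
// Hypothetical option bag; the field names are illustrative, not the real config shape.
type ReconnectOptions = {
  initialDelayMs: number; // first retry delay (commit message: 1s)
  maxDelayMs: number; // upper bound on the delay (commit message: 60s)
  maxAttempts: number; // 0 means unlimited attempts
};

const defaults: ReconnectOptions = {
  initialDelayMs: 1000,
  maxDelayMs: 60000,
  maxAttempts: 0,
};

// Delay before the given 1-based attempt: initial * 2^(attempt - 1), capped at maxDelayMs.
// Base-2 growth is an assumption; the source only says "exponential backoff".
function reconnectDelayMs(attempt: number, opts: ReconnectOptions = defaults): number {
  // Bound the exponent so very large attempt counts cannot overflow to Infinity.
  const exponent = Math.min(Math.max(0, attempt - 1), 30);
  return Math.min(opts.initialDelayMs * 2 ** exponent, opts.maxDelayMs);
}

// Whether another attempt is allowed; maxAttempts === 0 disables the limit.
function shouldRetry(attempt: number, opts: ReconnectOptions = defaults): boolean {
  return opts.maxAttempts === 0 || attempt <= opts.maxAttempts;
}
```

With the defaults, the delays would run 1s, 2s, 4s, and so on until they reach the 60s cap, and `shouldRetry` never returns false.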
Addresses PR review feedback:

- LogicalReplicationClient.subscribe() can throw before its internal
  "error" listener is wired up (notably when pg client.connect() fails
  mid-failover). The reconnect strategy's catch block only logged, so
  recovery silently stopped. Now also calls scheduleReconnect(err) — the
  pendingReconnect guard makes it idempotent if an error event was also
  emitted.
- Reject negative values for the new replication-recovery env vars and
  cap exit codes at 255.
- Convert the new ReplicationErrorRecovery{Deps,} interfaces to type
  aliases to match the repo's TypeScript style.
- Tighten the reconnect dep comment to drop a stale "lastAcknowledgedLsn"
  reference (the wrapper-tracked resume LSN is what callers actually pass).
- Restore process.exit after service.shutdown() in the exit-strategy
  test so a delayed exit timer can't terminate the test worker.
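
One way the env validation mentioned above (no negative values, exit codes no higher than 255) could look is sketched below. The function names are hypothetical, and rejecting an out-of-range exit code rather than clamping it is an assumption; the commit message does not say how the cap is enforced or which validation library the repo uses.

```typescript
// Parse a non-negative integer from a raw env value, using `fallback` when unset or empty.
// `name` is only used for the error message.
function parseNonNegativeInt(name: string, raw: string | undefined, fallback: number): number {
  if (raw === undefined || raw.trim() === "") return fallback;
  const value = Number(raw);
  if (!Number.isInteger(value) || value < 0) {
    throw new Error(`${name} must be a non-negative integer, got "${raw}"`);
  }
  return value;
}

// Exit codes are additionally limited to 255, the largest value a process exit status can carry.
// Assumption: out-of-range values are rejected, not clamped.
function parseExitCode(name: string, raw: string | undefined, fallback: number): number {
  const value = parseNonNegativeInt(name, raw, fallback);
  if (value > 255) {
    throw new Error(`${name} must be between 0 and 255, got "${raw}"`);
  }
  return value;
}
```

A caller would pass the raw string from `process.env` together with the variable name, for example `parseExitCode("EXAMPLE_EXIT_CODE", process.env.EXAMPLE_EXIT_CODE, 1)`, where `EXAMPLE_EXIT_CODE` is a placeholder name.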
LogicalReplicationClient.subscribe() can resolve without throwing or
emitting an "error" event when leader-lock acquisition fails — it just
calls this.stop() and returns. The reconnect callback now checks
isStopped after subscribe() and throws so the recovery handler can
schedule the next attempt instead of silently giving up.
…rough handle()

The previous post-subscribe() isStopped check was always true on the
happy path: subscribe() calls stop() up front (setting _isStopped=true)
and only resets the flag inside the replicationStart event, which fires
asynchronously after subscribe() returns. So the check threw on every
successful reconnect, the catch rescheduled, the next attempt tore down
the just-built client, and the cycle continued — replication briefly
worked between teardowns, which is why the integration test passed.

Replace it with the correct nudge: subscribe to leaderElection and call
the recovery handler on isLeader=false. That's the only subscribe()
exit path that doesn't either throw or emit an "error" event (the other
silent-return paths emit "error" first via createPublication/createSlot
failures).
The previous commit routed leaderElection(false) through handle(), which
under the exit strategy schedules process.exit. In a multi-instance
deployment that turns lost leader election — a normal operational state
— into a restart loop: exit, supervisor restarts, election fails again,
exit, and so on.

Add a dedicated notifyLeaderElectionLost() on ReplicationErrorRecovery
that the reconnect strategy treats as another retry trigger, while
exit and log strategies no-op. Wire the wrapper services through the
new method.
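
The split between `handle()` and `notifyLeaderElectionLost()` described in this commit might be modelled as below. It is a simplified sketch, not the actual `ReplicationErrorRecovery` implementation: the `RecoveryDeps` shape, `createRecovery`, `scheduleExit`, and `logError` are hypothetical names introduced for illustration.

```typescript
type Strategy = "reconnect" | "exit" | "log";

// Hypothetical dependencies, injected so the dispatch logic can be exercised in isolation.
type RecoveryDeps = {
  scheduleReconnect(reason: unknown): void;
  scheduleExit(reason: unknown): void;
  logError(reason: unknown): void;
};

type Recovery = {
  handle(error: unknown): void;
  notifyLeaderElectionLost(): void;
};

function createRecovery(strategy: Strategy, deps: RecoveryDeps): Recovery {
  return {
    // Stream errors: always log, then act according to the configured strategy.
    handle(error: unknown): void {
      deps.logError(error);
      if (strategy === "reconnect") deps.scheduleReconnect(error);
      else if (strategy === "exit") deps.scheduleExit(error);
      // "log" does nothing further.
    },
    // Losing leader election is a normal state in multi-instance deployments,
    // so only the reconnect strategy reacts; exit and log are no-ops here.
    notifyLeaderElectionLost(): void {
      if (strategy === "reconnect") {
        deps.scheduleReconnect(new Error("leader election lost"));
      }
    },
  };
}
```

Keeping the two entry points separate means the exit strategy cannot turn a lost election into the restart loop the commit message describes.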
fix(webapp): auto-recover replication services after stream errors
- LogsListPresenter: Remove inverted ClickHouse guard that blocked valid log queries
- ReplayRunDialog: Add missing replayDataFetcher.data useEffect dependency
- AIFilterInput: Fix wrong fetcher state check (loading -> idle)
- AgentView: Reset refs when session changes to prevent stale data bleed
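
To illustrate the AIFilterInput change, here is a small framework-free sketch of the "idle, not loading" check. It assumes the fetcher exposes the three states that Remix-style fetchers use (`idle`, `submitting`, `loading`); the `FetcherSnapshot` type, the `shouldClearInput` helper, and the success predicate are hypothetical and do not reflect the component's real code.

```typescript
// Assumed fetcher states, mirroring Remix-style fetchers.
type FetcherState = "idle" | "submitting" | "loading";

// Hypothetical minimal view of a fetcher: its current state and last response data.
type FetcherSnapshot<T> = { state: FetcherState; data: T | undefined };

// Clear the input only once the request has fully settled (state is "idle")
// and a successful response is available. Checking for "loading" instead would
// act while the request is still in flight.
function shouldClearInput<T>(
  fetcher: FetcherSnapshot<T>,
  isSuccess: (data: T) => boolean
): boolean {
  return fetcher.state === "idle" && fetcher.data !== undefined && isSuccess(fetcher.data);
}
```

In a component, a check like this would typically sit inside an effect that depends on the fetcher's state and data.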

cubic-dev-ai (Bot) left a comment

No issues found across 4 files
