🔒 Scanned for secrets using gitleaks 8.30.1
# Description
Introduces a non-blocking compaction flow for the jobsdb migration loop,
gated behind the `nonBlockingCompaction` flag. When enabled it
dramatically shrinks the lock window taken during dataset migration:
- **`dsMigrationLock` is no longer taken.** Concurrent readers
(`getJobs`, `GetPileUpCounts`, `GetDistinctParameterValues`) are not
blocked by an in-flight compaction.
- **Caveat:** an `UpdateJobStatus` call that targets a source dataset
*while it is being compacted* will block at the PG level on the
`EXCLUSIVE` lock held against that source's status table, until the
compaction TX commits — bounded by the per-source `COPY` duration. After
commit, late writers landing on the old status table are fenced by the
readonly trigger and routed to the new destination via the existing
`ErrStaleDsList` retry path.
Compared to the legacy path, the writer side of `dsListLock` is reduced
from "entire migration TX, including bulk `COPY` of all non-terminal
jobs and `DROP TABLE` of every source" down to "`COMMIT` + a single
`getDSList` + a `MIN/MAX` scan of the new destination".
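The `ErrStaleDsList` retry path mentioned in the caveat can be pictured roughly as follows. This is a minimal sketch, not the jobsdb implementation: only the error name comes from this description, while `withStaleListRetry`, the `refresh`/`op` callbacks and the dataset names are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrStaleDsList mirrors the error name used in this description; the real
// definition lives in jobsdb and may differ.
var ErrStaleDsList = errors.New("stale dataset list")

// withStaleListRetry is a hypothetical helper. It runs op against the most
// recently published dataset list and, when op reports ErrStaleDsList,
// refreshes the list and tries again, up to maxAttempts times.
func withStaleListRetry(maxAttempts int, refresh func() []string, op func(ds []string) error) error {
	if maxAttempts < 1 {
		maxAttempts = 1
	}
	var err error
	for attempt := 0; attempt < maxAttempts; attempt++ {
		err = op(refresh())
		if !errors.Is(err, ErrStaleDsList) {
			return err // success, or an error that a retry would not fix
		}
	}
	return fmt.Errorf("giving up after %d attempts: %w", maxAttempts, err)
}

func main() {
	published := []string{"jobs_1", "jobs_2"} // list before compaction
	compacted := false

	refresh := func() []string { return published }
	update := func(ds []string) error {
		if !compacted {
			// Simulate a compaction that commits between snapshot and write:
			// the old status table is now fenced, so the writer must retry.
			compacted = true
			published = []string{"jobs_3"}
			return ErrStaleDsList
		}
		fmt.Println("status written via", ds)
		return nil
	}

	fmt.Println("err:", withStaleListRetry(3, refresh, update))
}
```

The point of the sketch is that a fenced writer never needs to know where the jobs went: it only re-reads the published list and repeats the same operation.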
## New maintenance pool
Adds a dedicated maintenance pool that maintenance operations can use,
such as:
- adding a new dataset
- refreshing the dataset list
- compacting datasets
- compacting job status tables
This pool closes a connection-pool deadlock vector in which jobsdb
readers and writers could fully exhaust the pool, blocking an active
maintenance goroutine while it still held a mutex (e.g. post-commit
compaction, which must acquire a new connection to refresh the
dataset list). In addition, all maintenance goroutines were updated so
that they need no more than a single connection at any time,
eliminating the pool-related deadlock risk caused by previously nested
connection usage.
If no maintenance pool is injected, the calls fall back to `dbHandle`
(backwards compatibility).
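To illustrate the fallback and why isolation matters, here is a toy sketch in which a pool is modelled as a counting semaphore rather than a real SQL pool. `connPool`, `handle`, `maintenancePool` and `maintenanceDB` are hypothetical names chosen for the example; only `dbHandle` appears in this description.

```go
package main

import "fmt"

// connPool is a toy stand-in for a SQL connection pool: a counting semaphore.
type connPool struct{ slots chan struct{} }

func newConnPool(size int) *connPool {
	return &connPool{slots: make(chan struct{}, size)}
}

// tryAcquire reports whether a connection was available without blocking.
func (p *connPool) tryAcquire() bool {
	select {
	case p.slots <- struct{}{}:
		return true
	default:
		return false
	}
}

// release returns one connection to the pool.
func (p *connPool) release() { <-p.slots }

// handle is hypothetical; it only models the two pools discussed above.
type handle struct {
	dbHandle        *connPool // shared by readers and writers
	maintenancePool *connPool // optional; nil when not injected
}

// maintenanceDB returns the dedicated pool when one was injected and
// otherwise falls back to dbHandle.
func (h *handle) maintenanceDB() *connPool {
	if h.maintenancePool != nil {
		return h.maintenancePool
	}
	return h.dbHandle
}

func main() {
	shared := newConnPool(2)
	shared.tryAcquire()
	shared.tryAcquire() // readers and writers have exhausted the shared pool

	legacy := &handle{dbHandle: shared}
	fmt.Println("fallback gets a connection:", legacy.maintenanceDB().tryAcquire())

	isolated := &handle{dbHandle: shared, maintenancePool: newConnPool(1)}
	fmt.Println("dedicated pool gets a connection:", isolated.maintenanceDB().tryAcquire())
}
```

With the fallback, a maintenance goroutine that already holds a mutex would have to wait for a reader or writer to release a connection; with a dedicated pool it does not compete with them at all.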
## Lock stats
`dsListLock` and `dsMigrationLock` now emit timing metrics, tagged by
lock type (`read`/`write`) and whether the acquisition was async:
| Metric | Description |
|---|---|
| `jobsdb_lock_wait_time` | Time spent waiting to acquire the lock |
| `jobsdb_lock_time` | Time the lock was held (excluding wait) |
| `jobsdb_lock_total_time` | End-to-end time (wait + hold) |
All metrics carry a `name` tag so lock contention can be tracked per
jobsdb instance.
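As a rough illustration of how the three durations relate, the sketch below wraps a `sync.RWMutex` and measures wait, hold and total time around a critical section. `timedRWMutex`, `lockTimings` and `measure` are hypothetical; the actual instrumentation and the way metrics are emitted in jobsdb may look different.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// lockTimings mirrors the three metrics in the table above.
type lockTimings struct {
	Wait  time.Duration // jobsdb_lock_wait_time
	Hold  time.Duration // jobsdb_lock_time
	Total time.Duration // jobsdb_lock_total_time
}

// timedRWMutex is a hypothetical wrapper, not the jobsdb type.
type timedRWMutex struct {
	mu sync.RWMutex
}

// measure acquires via lock, runs fn, releases via unlock and reports how
// long each phase took. The deferred release also runs if fn panics.
func measure(lock, unlock func(), fn func()) (t lockTimings) {
	start := time.Now()
	lock()
	acquired := time.Now()
	defer func() {
		unlock()
		released := time.Now()
		t = lockTimings{
			Wait:  acquired.Sub(start),
			Hold:  released.Sub(acquired),
			Total: released.Sub(start),
		}
	}()
	fn()
	return t
}

func (l *timedRWMutex) WithRead(fn func()) lockTimings {
	return measure(l.mu.RLock, l.mu.RUnlock, fn)
}

func (l *timedRWMutex) WithWrite(fn func()) lockTimings {
	return measure(l.mu.Lock, l.mu.Unlock, fn)
}

func main() {
	var l timedRWMutex
	var wg sync.WaitGroup
	held := make(chan struct{})

	wg.Add(1)
	go func() {
		defer wg.Done()
		l.WithWrite(func() {
			close(held) // signal that the write lock is now held
			time.Sleep(20 * time.Millisecond)
		})
	}()

	<-held // the writer holds the lock, so the reader below has to wait
	r := l.WithRead(func() {})
	wg.Wait()
	fmt.Printf("type=read wait=%v hold=%v total=%v\n", r.Wait, r.Hold, r.Total)
}
```

In this toy run the reader's wait time dominates its total, which is the kind of signal the wait metric is meant to surface.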
## Flags
- `JobsDB.<prefix>.nonBlockingCompaction` (default `false`) — gates the
new flow. When off, `doMigrateDS` falls back to the legacy in-TX
migrate+drop path.
- `JobsDB.<prefix>.getJobsRetryOnCompaction` (default `true`) — gates
the `getJobs` snapshot revalidation. When on, if a dataset that
`getJobs` queried was compacted mid-read, the call returns
`ErrStaleDsList` and is retried against the freshly published list.
**Why:** while `getJobs` is reading the source's status table, an
`UpdateJobStatus` for one of the same jobs could commit against the new
dataset. Without the retry, `getJobs` would miss that status update,
risking out-of-order processing downstream. This option has no effect
unless `nonBlockingCompaction` is also on.
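The interaction between the two flags and the snapshot revalidation could be sketched like this. It is an assumption-laden toy: `config`, `store`, `revalidationEnabled` and the `read` callback are hypothetical, and the real `getJobs` has a different signature and tracks datasets differently.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrStaleDsList mirrors the error name used in this description.
var ErrStaleDsList = errors.New("stale dataset list")

// config holds hypothetical in-memory copies of the two flags.
type config struct {
	nonBlockingCompaction    bool
	getJobsRetryOnCompaction bool
}

// revalidationEnabled reflects the rule stated above: the retry flag has
// no effect unless non-blocking compaction is also on.
func (c config) revalidationEnabled() bool {
	return c.nonBlockingCompaction && c.getJobsRetryOnCompaction
}

// store is a toy model that only tracks which datasets are still live.
type store struct {
	cfg  config
	live map[string]bool
}

// getJobs runs read against the queried datasets and then checks that none
// of them was compacted in the meantime.
func (s *store) getJobs(queried []string, read func()) error {
	read()
	if !s.cfg.revalidationEnabled() {
		return nil
	}
	for _, ds := range queried {
		if !s.live[ds] {
			return fmt.Errorf("dataset %s compacted mid-read: %w", ds, ErrStaleDsList)
		}
	}
	return nil
}

func main() {
	s := &store{
		cfg:  config{nonBlockingCompaction: true, getJobsRetryOnCompaction: true},
		live: map[string]bool{"jobs_1": true, "jobs_2": true},
	}
	// Simulate jobs_1 being compacted into jobs_3 while the read is in flight.
	err := s.getJobs([]string{"jobs_1", "jobs_2"}, func() {
		delete(s.live, "jobs_1")
		s.live["jobs_3"] = true
	})
	fmt.Println("stale:", errors.Is(err, ErrStaleDsList)) // prints "stale: true"
}
```

A caller receiving `ErrStaleDsList` would then repeat the read against the freshly published list, so the result reflects any status update that landed on the new dataset.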
## Testing
#6979 runs all tests with non-blocking partition migration enabled
## Linear Ticket
resolves PIPE-2997
## Security
- [x] The code changed/added as part of this pull request won't create
any security issues with how the software is being used.
<!-- GitButler Footer Boundary Top -->
---
This is **part 3 of 4 in a stack** made with GitButler:
- <kbd> 4 </kbd> #6979
- <kbd> 3 </kbd> #6967 👈
- <kbd> 2 </kbd> #6962
- <kbd> 1 </kbd> #6963
<!-- GitButler Footer Boundary Bottom -->