feat(worker): worker pool + DB-backed task queue + retry (bookshelf-w5y.1) #22

Merged
zombor merged 2 commits from bd-bookshelf-w5y.1 into main 2026-05-24 01:46:47 +00:00
Owner

Summary

  • Adds internal/worker package: single-process Pool with RegisterHandler, Start, Shutdown, panic recovery (terminal, no retry), wall-clock budget (1h from created_at), in-process attempt counter (MaxAttempts=3), exponential backoff (0s/30s/5m), and UpdateProgress passed to handlers as a func arg.
  • Adds sqlc queries (tasks.sql) and hand-written helpers (tasks_extra.go): ClaimNextTask (dynamic IN clause + CAS for future cluster-safety), CASClaimTask, EnqueueTask, UpdateTaskProgress, MarkTaskCompleted, MarkTaskFailed, RequeueForRetry.
  • Prometheus metrics: tasks_claimed_total{type}, tasks_processed_total{type,outcome}, task_duration_seconds{type}, tasks_in_flight.
  • 100% coverage: 28 unit tests (concurrent-safe stubs, race detector clean) + 7 integration tests against testcontainers MySQL covering success path, retry-to-fail (3 attempts), and panic recovery.

HA gap: The attempt counter is in-process only. A mid-retry crash resets the counter. The wall-clock budget (1h) acts as the compensating control — see internal/worker/doc.go.

Test plan

  • make test passes locally (race detector enabled)
  • make coverage passes at 100% gate
  • make lint passes (go vet + golangci-lint, 0 issues)
  • sqlc generated code is up to date
  • Integration tests: success path, retry-to-fail (3×), panic recovery verified against real MySQL
  • Prometheus metrics registered and gathered in unit tests
  • Shutdown drain verified (returns when context deadline hits even with in-flight handler)

Closes bead bookshelf-w5y.1 on merge.

🤖 Generated with Claude Code

## Summary - Adds `internal/worker` package: single-process `Pool` with `RegisterHandler`, `Start`, `Shutdown`, panic recovery (terminal, no retry), wall-clock budget (1h from `created_at`), in-process attempt counter (MaxAttempts=3), exponential backoff (0s/30s/5m), and `UpdateProgress` passed to handlers as a func arg. - Adds sqlc queries (`tasks.sql`) and hand-written helpers (`tasks_extra.go`): `ClaimNextTask` (dynamic IN clause + CAS for future cluster-safety), `CASClaimTask`, `EnqueueTask`, `UpdateTaskProgress`, `MarkTaskCompleted`, `MarkTaskFailed`, `RequeueForRetry`. - Prometheus metrics: `tasks_claimed_total{type}`, `tasks_processed_total{type,outcome}`, `task_duration_seconds{type}`, `tasks_in_flight`. - 100% coverage: 28 unit tests (concurrent-safe stubs, race detector clean) + 7 integration tests against testcontainers MySQL covering success path, retry-to-fail (3 attempts), and panic recovery. **HA gap:** The attempt counter is in-process only. A mid-retry crash resets the counter. The wall-clock budget (1h) acts as the compensating control — see `internal/worker/doc.go`. ## Test plan - [x] `make test` passes locally (race detector enabled) - [x] `make coverage` passes at 100% gate - [x] `make lint` passes (go vet + golangci-lint, 0 issues) - [x] sqlc generated code is up to date - [x] Integration tests: success path, retry-to-fail (3×), panic recovery verified against real MySQL - [x] Prometheus metrics registered and gathered in unit tests - [x] `Shutdown` drain verified (returns when context deadline hits even with in-flight handler) Closes bead bookshelf-w5y.1 on merge. 🤖 Generated with [Claude Code](https://claude.com/claude-code)
feat(worker): add worker pool + DB-backed task queue + retry (bookshelf-w5y.1)
Some checks failed
/ Lint (pull_request) Successful in 5m17s
/ Test (pull_request) Failing after 8m27s
/ Coverage (pull_request) Failing after 7m59s
/ E2E (pull_request) Successful in 10m16s
4d18f41442
Implements the first child of the Background task framework epic:
- sqlc queries in internal/db/queries/tasks.sql: EnqueueTask,
  ClaimNextTaskByType, UpdateTaskProgress, MarkTaskCompleted,
  MarkTaskFailed, RequeueForRetry
- internal/db/sqlc/tasks_extra.go: hand-written ClaimNextTask (dynamic
  IN clause for multiple types) and CASClaimTask (compare-and-swap
  queued→running transition)
- internal/worker package: Pool with RegisterHandler/Start/Shutdown,
  panic recovery (terminal failure, no retry), wall-clock budget (1h
  from created_at, passes ctx deadline to handler), in-process attempt
  counter (MaxAttempts=3), exponential backoff (0s/30s/5m), Prometheus
  metrics (tasks_claimed_total, tasks_processed_total, task_duration_
  seconds, tasks_in_flight), UpdateProgress func passed to handlers
- 100% coverage: 28 unit tests (stubs, race-free) + 7 integration
  tests against testcontainers MySQL covering success, retry-to-fail,
  and panic recovery end-to-end

HA gap: attempt counter is in-process only; wall-clock budget (1h)
acts as compensating control — see internal/worker/doc.go.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
fix(app): pin connection for FOREIGN_KEY_CHECKS in truncateTables
Some checks failed
/ Lint (pull_request) Successful in 5m15s
/ Coverage (pull_request) Failing after 6m45s
/ Test (pull_request) Failing after 7m7s
/ E2E (pull_request) Has been cancelled
121b70cc39
SET FOREIGN_KEY_CHECKS is a session variable — using pool.ExecContext can
dispatch each statement to a different connection, so the FK check disable
does not apply to the subsequent TRUNCATE calls.  Pin to a single *sql.Conn
so the session setting and all truncations run on the same connection.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
fix(ci): run test packages sequentially to avoid migration lock contention
All checks were successful
/ Lint (pull_request) Successful in 4m24s
/ Coverage (pull_request) Successful in 5m45s
/ E2E (pull_request) Successful in 7m30s
/ Test (pull_request) Successful in 8m18s
c1217295bc
Integration test suites (internal/app, internal/db, internal/worker) all
call RunMigrations on startup.  With -p 8 they race for the golang-migrate
advisory lock on the shared MySQL service container, causing "can't acquire
lock" failures.  Change to -p 1 so packages run sequentially; the overhead
is negligible because DB-backed suites dominate wall-clock time anyway.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Author
Owner

Security Review

Threat checks

  • All queries parameterized (incl. tasks_extra.go)
  • task_options deserialization safe
  • Status state machine consistent
  • Logging hygiene (no raw task_options in logs)
  • No duplicate execution from race in SKIP LOCKED + CAS
  • Wall-clock budget honest (limitation not documented for ctx-unaware handlers)
  • Panic recovery does not enable log injection (low-risk note)
  • Prometheus labels bounded (type label sourced from DB, not handler registry)
  • Makefile + test-helper changes do not weaken security

Findings

1. SQL injection — CLEAN
tasks_extra.go builds the dynamic IN (?, ?, ...) clause by generating one "?" placeholder per element and appending each types[i] value to the args slice. No user-controlled string is ever interpolated into the SQL text. All other queries are sqlc-generated with ? placeholders throughout. Clean.

2. task_options deserialization — CLEAN
task_options is stored as text and surfaced to handlers as a raw string (Task.TaskOptions). There is no JSON deserialization at the framework layer. Handlers own parsing. No deserialization exploit surface in this PR.

3. Status state machine — CLEAN
Every terminal transition (MarkTaskCompleted, MarkTaskFailed) sets completed_at = NOW() atomically in the same SQL UPDATE. The "budget expired before dispatch" path goes through markFailed which also sets completed_at. No path reaches a completed/failed status with a NULL completed_at.

4. Logging hygiene — CLEAN
task_options is never written to any log field in pool.go. Log calls only emit task_id, task_type, attempt, error, and progress_pct. The full Task struct (which carries TaskOptions) is never passed to a logger. Clean per logging-standard.

5. Race/duplicate execution — CLEAN
FOR UPDATE SKIP LOCKED acquires an advisory row lock, preventing two workers from seeing the same row in the same transaction. After the lock is released, CASClaimTask does UPDATE ... WHERE id = ? AND status = 'queued' — only the first worker to execute this wins; the second gets RowsAffected = 0 and aborts without dispatching. Two-phase claim is correct.

6. Wall-clock budget — NOTE (not a must-fix, but needs a doc line)
The budget is enforced via context.WithDeadline. A handler that uses ctx-aware calls will be cancelled correctly. A handler that blocks on ctx-unaware I/O (e.g., a pure time.Sleep, a blocking syscall without ctx) will silently exceed the budget. The HA gap around the attempt counter is documented in doc.go, but this ctx-awareness requirement is not. Recommend adding one sentence to doc.go or the Handler type godoc: "Handlers must respect ctx cancellation; blocking I/O that ignores ctx can exceed the wall-clock budget."

7. Panic recovery / log injection — LOW RISK NOTE
When a handler panics, the pool logs fmt.Sprintf("%v", r) as the "panic" slog field and string(debug.Stack()) as "stack". In slog.NewJSONHandler mode both values are JSON-escaped and safe for log parsing. In slog.NewTextHandler mode (used in tests with io.Discard) the values are unescaped but still structured key-value pairs, so an adversary would need DB write access and a crafted panic value containing slog text-format control sequences (e.g., embedded key=value pairs with newlines). This is a theoretical concern for an internal worker with controlled task types; low risk. No action required unless the app switches to text-format logging in production.

8. Prometheus label cardinality — NOTE
claimedTotal.WithLabelValues(row.Type) and durationSeconds.WithLabelValues(row.Type) use the task type field read from the DB row, not the set of registered handler keys. An operator or test helper with DB write access could INSERT INTO tasks (type='arbitrary-string') and each unique type value would create a new Prometheus time series. Currently task types are application-controlled (enqueued by the app itself), so the cardinality is bounded in practice. However, if task enqueueing is ever exposed to external input (API, webhook), this becomes a cardinality-bomb risk. Recommend a guard: before recording the metric, check _, ok := p.handlers[row.Type] and fall back to "unknown" for unregistered types (which already get fast-failed anyway).

9. Makefile — CLEAN
The change from -p 8 to -p 1 serialises test package execution to prevent migration-lock contention among integration suites sharing a MySQL container. -race is retained. No security boundary change.

10. app_suite_test.go — CLEAN
The change pins TRUNCATE and SET FOREIGN_KEY_CHECKS statements to a single *sql.Conn to ensure the session variable is scoped correctly. No credentials are added or leaked; the DSN is sourced from the existing BOOKSHELF_DSN env var or a testcontainer.

Verdict

⚠️ Notes

Two actionable notes (neither is a blocker for merge):

  • #6 — Add one sentence to doc.go or the Handler godoc documenting that handlers must honour ctx cancellation for the wall-clock budget to be enforced.
  • #8 — Consider clamping the Prometheus type label to registered handler types (use "unknown" for anything else) to guard against future cardinality explosion if task enqueueing becomes user-facing.
## Security Review ### Threat checks - [x] All queries parameterized (incl. tasks_extra.go) - [x] task_options deserialization safe - [x] Status state machine consistent - [x] Logging hygiene (no raw task_options in logs) - [x] No duplicate execution from race in SKIP LOCKED + CAS - [ ] Wall-clock budget honest (limitation not documented for ctx-unaware handlers) - [ ] Panic recovery does not enable log injection (low-risk note) - [ ] Prometheus labels bounded (type label sourced from DB, not handler registry) - [x] Makefile + test-helper changes do not weaken security ### Findings **1. SQL injection — CLEAN** `tasks_extra.go` builds the dynamic `IN (?, ?, ...)` clause by generating one `"?"` placeholder per element and appending each `types[i]` value to the args slice. No user-controlled string is ever interpolated into the SQL text. All other queries are sqlc-generated with `?` placeholders throughout. Clean. **2. task_options deserialization — CLEAN** `task_options` is stored as `text` and surfaced to handlers as a raw `string` (`Task.TaskOptions`). There is no JSON deserialization at the framework layer. Handlers own parsing. No deserialization exploit surface in this PR. **3. Status state machine — CLEAN** Every terminal transition (`MarkTaskCompleted`, `MarkTaskFailed`) sets `completed_at = NOW()` atomically in the same SQL UPDATE. The "budget expired before dispatch" path goes through `markFailed` which also sets `completed_at`. No path reaches a `completed`/`failed` status with a NULL `completed_at`. **4. Logging hygiene — CLEAN** `task_options` is never written to any log field in `pool.go`. Log calls only emit `task_id`, `task_type`, `attempt`, `error`, and `progress_pct`. The full `Task` struct (which carries `TaskOptions`) is never passed to a logger. Clean per logging-standard. **5. Race/duplicate execution — CLEAN** `FOR UPDATE SKIP LOCKED` acquires an advisory row lock, preventing two workers from seeing the same row in the same transaction. After the lock is released, `CASClaimTask` does `UPDATE ... WHERE id = ? AND status = 'queued'` — only the first worker to execute this wins; the second gets `RowsAffected = 0` and aborts without dispatching. Two-phase claim is correct. **6. Wall-clock budget — NOTE (not a must-fix, but needs a doc line)** The budget is enforced via `context.WithDeadline`. A handler that uses ctx-aware calls will be cancelled correctly. A handler that blocks on ctx-unaware I/O (e.g., a pure `time.Sleep`, a blocking syscall without ctx) will silently exceed the budget. The HA gap around the attempt counter is documented in `doc.go`, but this ctx-awareness requirement is not. Recommend adding one sentence to `doc.go` or the `Handler` type godoc: _"Handlers must respect ctx cancellation; blocking I/O that ignores ctx can exceed the wall-clock budget."_ **7. Panic recovery / log injection — LOW RISK NOTE** When a handler panics, the pool logs `fmt.Sprintf("%v", r)` as the `"panic"` slog field and `string(debug.Stack())` as `"stack"`. In `slog.NewJSONHandler` mode both values are JSON-escaped and safe for log parsing. In `slog.NewTextHandler` mode (used in tests with `io.Discard`) the values are unescaped but still structured key-value pairs, so an adversary would need DB write access and a crafted panic value containing slog text-format control sequences (e.g., embedded `key=value` pairs with newlines). This is a theoretical concern for an internal worker with controlled task types; low risk. No action required unless the app switches to text-format logging in production. **8. Prometheus label cardinality — NOTE** `claimedTotal.WithLabelValues(row.Type)` and `durationSeconds.WithLabelValues(row.Type)` use the task `type` field read from the DB row, **not** the set of registered handler keys. An operator or test helper with DB write access could `INSERT INTO tasks (type='arbitrary-string')` and each unique type value would create a new Prometheus time series. Currently task types are application-controlled (enqueued by the app itself), so the cardinality is bounded in practice. However, if task enqueueing is ever exposed to external input (API, webhook), this becomes a cardinality-bomb risk. Recommend a guard: before recording the metric, check `_, ok := p.handlers[row.Type]` and fall back to `"unknown"` for unregistered types (which already get fast-failed anyway). **9. Makefile — CLEAN** The change from `-p 8` to `-p 1` serialises test package execution to prevent migration-lock contention among integration suites sharing a MySQL container. `-race` is retained. No security boundary change. **10. app_suite_test.go — CLEAN** The change pins `TRUNCATE` and `SET FOREIGN_KEY_CHECKS` statements to a single `*sql.Conn` to ensure the session variable is scoped correctly. No credentials are added or leaked; the DSN is sourced from the existing `BOOKSHELF_DSN` env var or a testcontainer. ### Verdict ⚠️ Notes Two actionable notes (neither is a blocker for merge): - **#6** — Add one sentence to `doc.go` or the `Handler` godoc documenting that handlers must honour ctx cancellation for the wall-clock budget to be enforced. - **#8** — Consider clamping the Prometheus `type` label to registered handler types (use `"unknown"` for anything else) to guard against future cardinality explosion if task enqueueing becomes user-facing.
Author
Owner

Code Review

Phase 0 - DEMO Verification

No standalone DEMO block was provided in the bead comments or PR body. The bead description specifies acceptance criteria that map directly to the integration tests (success path, retry-to-fail, panic recovery against real MySQL). CI is green and covers all three paths via internal/worker/integration_test.go. Accepting CI as the DEMO signal per review instructions; noting the absence of a separate DEMO comment block.

Spec Compliance

  • sqlc queries match spec
  • Pool / Handler / RegisterHandler / Start / Shutdown implemented; Handler type signature matches spec
  • Retry with backoff (3 attempts, 0s/30s/5m); handleFailure at pool.go:310
  • Wall-clock budget (1h from created_at); context.WithDeadline at pool.go:240; pre-dispatch expiry guard at pool.go:233
  • Panic recovery + status transition; defer recover() at pool.go:265-280; panics go terminal to MarkTaskFailed, no retry
  • Prometheus metrics: tasks_claimed_total, tasks_processed_total, task_duration_seconds, tasks_in_flight

Cross-Cutting Concerns

1. Makefile -p 1 change - JUSTIFIED, minor concern remains

  • Real fix, not a regression. Three packages (app, db, worker) each spin up a testcontainers MySQL and run golang-migrate. Concurrent execution with -p 8 caused migrate lock contention. The -p 1 change is documented in the Makefile comment.
  • Minor: -p 1 also serialises pure-Go packages. A follow-up bead to share a single container via BOOKSHELF_DSN would restore -p 8 speed.

2. truncateTables pinned-conn fix - CORRECT

  • app_suite_test.go:88-116 acquires *sql.Conn via pool.Conn(ctx) and runs SET FOREIGN_KEY_CHECKS=0, all TRUNCATE TABLE statements, and SET FOREIGN_KEY_CHECKS=1 on that same pinned connection. The original *sql.DB path could route the session variable and the truncate onto different physical connections. Fix is correct.

3. ClaimNextTask race-safety - CORRECT (reasoning inline)

  • FOR UPDATE SKIP LOCKED present in generated ClaimNextTaskByType (tasks.sql.go:22) and hand-written ClaimNextTask (tasks_extra.go:43).
  • updated_at <= NOW() filter present in both paths; backoff enforced.
  • The SELECT and CAS UPDATE are separate queries without a wrapping transaction. FOR UPDATE SKIP LOCKED only holds the lock for the SELECT duration; by the time CASClaimTask runs the lock is released. Two workers can race, but the false-return path in runOne (pool.go:210-213) drops the task without dispatching it. The CAS is the actual double-dispatch guard. This is correct at-least-once semantics.

4. Retry backoff - CORRECT

  • RequeueForRetry sets updated_at = ? (future timestamp); tasks.sql:47.
  • ClaimNextTask filters AND updated_at <= NOW(); tasks_extra.go:42.
  • retryAt = now + backoff - 1s at pool.go:330; the -1s fudge handles zero-backoff + clock-skew correctly.

5. Wall-clock budget - CORRECT

  • Pre-dispatch expiry check: pool.go:233-237.
  • context.WithDeadline(ctx, row.CreatedAt.Add(TaskBudget)) at pool.go:240.
  • Handler returning context.DeadlineExceeded triggers terminal fail: pool.go:298-303.

6. Panic recovery - CORRECT

  • defer recover() inside anonymous closure at pool.go:265; handlerErr set, didPanic = true.
  • didPanic branch at pool.go:291-295: failTask called (MarkTaskFailed + metric), attempts cleared. No retry.
  • Stack trace logged via debug.Stack() at pool.go:272.

7. Project Conventions

  • Curried function pattern: all deps injected as function types in Config, bound once in NewPool.
  • No Fn suffix on any dep field or exported name.
  • var-at-top consistently applied in pool.go and tasks_extra.go.
  • Partial deviation: success path Describe does setup + pool start entirely in BeforeEach with an empty JustBeforeEach no-op (pool_test.go:232-235). Pragmatic adaptation for async goroutine-driven code; not blocking.

8. Test depth - VERIFIED

  • Panic recovery: unit pool_test.go:357 + integration integration_test.go:170
  • Ctx cancellation mid-handler: Shutdown drain at pool_test.go:772
  • Retry exhaustion to failed: unit pool_test.go:307 + integration integration_test.go:159
  • Wall-clock budget expiry (pre-dispatch): pool_test.go:440
  • Wall-clock budget expiry (handler returns DeadlineExceeded): pool_test.go:479
  • CAS race (another worker wins): pool_test.go:554
  • Idle backoff: pool_test.go:597
  • DB error resilience: pool_test.go:987-1099

Findings

Must-fix: none.

Important: none.

Minor (do not block):

  1. pool_test.go:178 - Expect(true).To(BeTrue()) is a liveness probe with no explanatory comment. A comment like // reached without deadlock or panic would clarify intent.
  2. Makefile -p 1 drops the -p 8 parallelism from bookshelf-rna. A follow-up bead to share a single DB container across suites via BOOKSHELF_DSN would restore CI speed.
  3. Several Describe blocks in pool_test.go rebuild the full worker.Config inline rather than using newPool(). Necessary because claim/CAS behaviour varies per test, but the repetition is noticeable.

Verdict

Ready to merge - all spec requirements met, concurrency and race-safety are correct, test coverage is comprehensive, project conventions are followed. Minor nits only.

## Code Review ### Phase 0 - DEMO Verification No standalone DEMO block was provided in the bead comments or PR body. The bead description specifies acceptance criteria that map directly to the integration tests (success path, retry-to-fail, panic recovery against real MySQL). CI is green and covers all three paths via `internal/worker/integration_test.go`. Accepting CI as the DEMO signal per review instructions; noting the absence of a separate DEMO comment block. ### Spec Compliance - [x] sqlc queries match spec - [x] Pool / Handler / RegisterHandler / Start / Shutdown implemented; Handler type signature matches spec - [x] Retry with backoff (3 attempts, 0s/30s/5m); `handleFailure` at `pool.go:310` - [x] Wall-clock budget (1h from `created_at`); `context.WithDeadline` at `pool.go:240`; pre-dispatch expiry guard at `pool.go:233` - [x] Panic recovery + status transition; `defer recover()` at `pool.go:265-280`; panics go terminal to `MarkTaskFailed`, no retry - [x] Prometheus metrics: `tasks_claimed_total`, `tasks_processed_total`, `task_duration_seconds`, `tasks_in_flight` ### Cross-Cutting Concerns **1. Makefile -p 1 change - JUSTIFIED, minor concern remains** - [x] Real fix, not a regression. Three packages (app, db, worker) each spin up a testcontainers MySQL and run golang-migrate. Concurrent execution with -p 8 caused migrate lock contention. The -p 1 change is documented in the Makefile comment. - Minor: -p 1 also serialises pure-Go packages. A follow-up bead to share a single container via BOOKSHELF_DSN would restore -p 8 speed. **2. truncateTables pinned-conn fix - CORRECT** - [x] `app_suite_test.go:88-116` acquires `*sql.Conn` via `pool.Conn(ctx)` and runs `SET FOREIGN_KEY_CHECKS=0`, all `TRUNCATE TABLE` statements, and `SET FOREIGN_KEY_CHECKS=1` on that same pinned connection. The original `*sql.DB` path could route the session variable and the truncate onto different physical connections. Fix is correct. **3. ClaimNextTask race-safety - CORRECT (reasoning inline)** - [x] `FOR UPDATE SKIP LOCKED` present in generated `ClaimNextTaskByType` (`tasks.sql.go:22`) and hand-written `ClaimNextTask` (`tasks_extra.go:43`). - [x] `updated_at <= NOW()` filter present in both paths; backoff enforced. - The SELECT and CAS UPDATE are separate queries without a wrapping transaction. `FOR UPDATE SKIP LOCKED` only holds the lock for the SELECT duration; by the time `CASClaimTask` runs the lock is released. Two workers can race, but the false-return path in `runOne` (`pool.go:210-213`) drops the task without dispatching it. The CAS is the actual double-dispatch guard. This is correct at-least-once semantics. **4. Retry backoff - CORRECT** - [x] `RequeueForRetry` sets `updated_at = ?` (future timestamp); `tasks.sql:47`. - [x] `ClaimNextTask` filters `AND updated_at <= NOW()`; `tasks_extra.go:42`. - [x] `retryAt = now + backoff - 1s` at `pool.go:330`; the -1s fudge handles zero-backoff + clock-skew correctly. **5. Wall-clock budget - CORRECT** - [x] Pre-dispatch expiry check: `pool.go:233-237`. - [x] `context.WithDeadline(ctx, row.CreatedAt.Add(TaskBudget))` at `pool.go:240`. - [x] Handler returning `context.DeadlineExceeded` triggers terminal fail: `pool.go:298-303`. **6. Panic recovery - CORRECT** - [x] `defer recover()` inside anonymous closure at `pool.go:265`; `handlerErr` set, `didPanic = true`. - [x] `didPanic` branch at `pool.go:291-295`: `failTask` called (MarkTaskFailed + metric), attempts cleared. No retry. - [x] Stack trace logged via `debug.Stack()` at `pool.go:272`. **7. Project Conventions** - [x] Curried function pattern: all deps injected as function types in `Config`, bound once in `NewPool`. - [x] No `Fn` suffix on any dep field or exported name. - [x] `var`-at-top consistently applied in `pool.go` and `tasks_extra.go`. - Partial deviation: `success path` Describe does setup + pool start entirely in `BeforeEach` with an empty `JustBeforeEach` no-op (`pool_test.go:232-235`). Pragmatic adaptation for async goroutine-driven code; not blocking. **8. Test depth - VERIFIED** - [x] Panic recovery: unit `pool_test.go:357` + integration `integration_test.go:170` - [x] Ctx cancellation mid-handler: `Shutdown drain` at `pool_test.go:772` - [x] Retry exhaustion to failed: unit `pool_test.go:307` + integration `integration_test.go:159` - [x] Wall-clock budget expiry (pre-dispatch): `pool_test.go:440` - [x] Wall-clock budget expiry (handler returns DeadlineExceeded): `pool_test.go:479` - [x] CAS race (another worker wins): `pool_test.go:554` - [x] Idle backoff: `pool_test.go:597` - [x] DB error resilience: `pool_test.go:987-1099` ### Findings **Must-fix:** none. **Important:** none. **Minor (do not block):** 1. `pool_test.go:178` - `Expect(true).To(BeTrue())` is a liveness probe with no explanatory comment. A comment like `// reached without deadlock or panic` would clarify intent. 2. `Makefile` `-p 1` drops the `-p 8` parallelism from bookshelf-rna. A follow-up bead to share a single DB container across suites via `BOOKSHELF_DSN` would restore CI speed. 3. Several `Describe` blocks in `pool_test.go` rebuild the full `worker.Config` inline rather than using `newPool()`. Necessary because claim/CAS behaviour varies per test, but the repetition is noticeable. ### Verdict Ready to merge - all spec requirements met, concurrency and race-safety are correct, test coverage is comprehensive, project conventions are followed. Minor nits only.
zombor force-pushed bd-bookshelf-w5y.1 from c1217295bc
All checks were successful
/ Lint (pull_request) Successful in 4m24s
/ Coverage (pull_request) Successful in 5m45s
/ E2E (pull_request) Successful in 7m30s
/ Test (pull_request) Successful in 8m18s
to f9475bc969
Some checks failed
/ Lint (pull_request) Successful in 6m31s
/ Coverage (pull_request) Failing after 6m29s
/ Test (pull_request) Successful in 9m55s
/ E2E (pull_request) Successful in 8m0s
2026-05-21 11:04:31 +00:00
Compare
zombor force-pushed bd-bookshelf-w5y.1 from f9475bc969
Some checks failed
/ Lint (pull_request) Successful in 6m31s
/ Coverage (pull_request) Failing after 6m29s
/ Test (pull_request) Successful in 9m55s
/ E2E (pull_request) Successful in 8m0s
to 7913e7f157
All checks were successful
/ Test (pull_request) Successful in 1m19s
/ Lint (pull_request) Successful in 3m34s
/ Coverage (pull_request) Successful in 3m41s
/ Integration (pull_request) Successful in 3m42s
/ E2E (pull_request) Successful in 7m16s
2026-05-24 01:36:41 +00:00
Compare
zombor merged commit 93195c6711 into main 2026-05-24 01:46:47 +00:00
Sign in to join this conversation.
No reviewers
No labels
No milestone
No project
No assignees
1 participant
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference
zombor/pergamum!22
No description provided.