fix(dbtest): eliminate thundering-herd migrate hang (bookshelf-7akw) #587

Merged
zombor merged 4 commits from bd-bookshelf-7akw into main 2026-06-17 04:02:41 +00:00
Owner

Root cause

8 parallel ginkgo nodes each calling migrate.Up() concurrently on one MySQL instance with --tmpfs /var/lib/mysql size=1g overwhelmed the server's DDL capacity. ExecContext calls stalled, blocked on readResultSetHeaderPacket, silently hanging for the full 10-minute Go test timeout. Affected both #579 Integration and #583 E2E API identically.

Fix

Two parts applied together:

1. Fail-fast (resilience): ensureTemplateDB runs RunMigrations under a 120s context.WithTimeout. A stalled migration now errors in ≤120s instead of hanging until the suite-level 10m timeout.

2. De-thundering-herd (the actual fix): Instead of every node running full migrations concurrently:

  • Migrate ONCE into a shared template DB (pergamum_test_template) guarded by a MySQL GET_LOCK() advisory lock (cross-process, 120s timeout).
  • Each node then clones the schema via SHOW CREATE TABLE + DDL replay with FOREIGN_KEY_CHECKS=0 on a single dedicated connection (no FK ordering issues).
  • schema_migrations rows are also cloned so golang-migrate treats each clone as fully migrated (ErrNoChange on next Up() call).

The advisory lock ensures only one goroutine/process ever migrates; all concurrent callers block, then clone. Reduces N concurrent full migrations to 1 migration + N fast DDL clones.

Local results

  • make integration: all packages pass (768 + 227 + 3 specs)
  • make e2e-api: 1125/1125 pass in 2m37s (11 procs)
  • make e2e-browser: 250/250 pass in 2m0s (11 procs)

No changes to e2e suite files, BeforeSuite callers, or any production code.

Closes bead bookshelf-7akw on merge.

## Root cause 8 parallel ginkgo nodes each calling `migrate.Up()` concurrently on one MySQL instance with `--tmpfs /var/lib/mysql size=1g` overwhelmed the server's DDL capacity. `ExecContext` calls stalled, blocked on `readResultSetHeaderPacket`, silently hanging for the full 10-minute Go test timeout. Affected both #579 Integration and #583 E2E API identically. ## Fix Two parts applied together: **1. Fail-fast (resilience):** `ensureTemplateDB` runs `RunMigrations` under a 120s `context.WithTimeout`. A stalled migration now errors in ≤120s instead of hanging until the suite-level 10m timeout. **2. De-thundering-herd (the actual fix):** Instead of every node running full migrations concurrently: - Migrate ONCE into a shared template DB (`pergamum_test_template`) guarded by a MySQL `GET_LOCK()` advisory lock (cross-process, 120s timeout). - Each node then clones the schema via `SHOW CREATE TABLE` + DDL replay with `FOREIGN_KEY_CHECKS=0` on a single dedicated connection (no FK ordering issues). - `schema_migrations` rows are also cloned so `golang-migrate` treats each clone as fully migrated (`ErrNoChange` on next `Up()` call). The advisory lock ensures only one goroutine/process ever migrates; all concurrent callers block, then clone. Reduces N concurrent full migrations to 1 migration + N fast DDL clones. ## Local results - `make integration`: all packages pass (768 + 227 + 3 specs) - `make e2e-api`: 1125/1125 pass in 2m37s (11 procs) - `make e2e-browser`: 250/250 pass in 2m0s (11 procs) No changes to e2e suite files, `BeforeSuite` callers, or any production code. Closes bead bookshelf-7akw on merge.
fix(dbtest): eliminate thundering-herd migrate hang via template cloning (bookshelf-7akw)
Some checks failed
/ Lint (pull_request) Successful in 2m0s
/ Integration (pull_request) Failing after 3m7s
/ Test (pull_request) Successful in 3m16s
/ JS Unit Tests (pull_request) Successful in 3m32s
/ E2E API (pull_request) Failing after 9m11s
/ E2E Browser (pull_request) Failing after 10m40s
ad1e6025a0
Root cause: 8 parallel ginkgo nodes each calling migrate.Up() concurrently
on a single MySQL instance with --tmpfs stalled migration ExecContext calls,
hanging the suite for the full 10-minute Go test timeout.

Fix (two parts):
1. Fail-fast: RunMigrations now runs under a 120s context deadline inside
   NewSuiteDB so a stalled migration errors quickly instead of hanging.
2. De-thundering-herd: migrate ONCE into a shared template DB
   (pergamum_test_template) guarded by a MySQL GET_LOCK() advisory lock
   (cross-process, 120s timeout), then clone schema per-node via
   SHOW CREATE TABLE + DDL replay with FOREIGN_KEY_CHECKS disabled.
   schema_migrations rows are also cloned so golang-migrate treats each
   clone as fully migrated (ErrNoChange on next Up() call).

The advisory lock ensures only one goroutine/process ever runs the full
migration; all concurrent callers wait and then clone the ready template.
Reduces N concurrent full migrations to 1 migration + N cheap DDL clones.

Local results:
- make integration: all packages pass
- make e2e-api: 1125/1125 pass in 2m37s
- make e2e-browser: 250/250 pass in 2m0s

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
fix(dbtest): close template-creation race — check schema_migrations not just dbExists
Some checks failed
/ Lint (pull_request) Successful in 58s
/ Test (pull_request) Successful in 1m45s
/ Integration (pull_request) Has been cancelled
/ E2E Browser (pull_request) Has been cancelled
/ JS Unit Tests (pull_request) Has been cancelled
/ E2E API (pull_request) Has been cancelled
101ab137c7
The fast path in ensureTemplateDB previously returned nil as soon as the
template DB existed in information_schema. This raced with concurrent
callers: if process B checked after process A had created the DB (CREATE
DATABASE) but before A finished migrating, B would return nil and proceed
to clone an empty template — producing test DBs with no tables or with a
dirty migration row.

Fixes:
- Replace dbExists check with templateIsReady, which queries
  schema_migrations WHERE dirty=0: returns false if the table is absent
  (DB not yet migrated or mid-creation) or all rows are dirty (failed
  migration).
- After acquiring GET_LOCK, use the same templateIsReady re-check so
  a lock-holder that finds a partial DB drops it and re-migrates instead
  of returning on a stale exists=true.
- Add explicit DROP DATABASE IF EXISTS before creating the template under
  the lock, so stale/partial DBs from prior failed attempts are cleaned up.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Author
Owner

Security Review — PR #587 (bd-bookshelf-7akw)

Scope: internal/dbtest/dbtest.go +258/-14 — migrate-once-into-template-DB + clone-per-parallel-node.


1. SQL injection via DB/table names

templateDBName and templateLockName are compile-time string constants, not derived from any runtime input. All interpolated identifiers fall into one of these categories:

Value Source Safe?
templateDBName = "pergamum_test_template" package-level const yes — fixed at compile time
templateLockName = "pergamum_dbtest_template" package-level const yes — fixed at compile time
destDB / dbName fmt.Sprintf("%s_%s", name, randomHex()) see below
t (table name) information_schema.tables result from our own template DB effectively controlled

name parameter: The name argument to NewSuiteDB is caller-supplied. Looking at the diff, name was already interpolated into the CREATE DATABASE DDL in the pre-existing code (unchanged), and the new diff adds no new interpolation of name beyond what already existed — the destDB passed to cloneSchemaFromTemplate is the already-constructed dbName (name + "_" + randomHex()), which is then backtick-quoted in DDL. All callers that can be seen in the codebase pass a short fixed package-name string ("books", "library", etc.), so there is no practical injection risk; still, this is pre-existing and outside the new diff's scope.

Table names from information_schema: In cloneSchemaFromTemplate, t comes from SELECT table_name FROM information_schema.tables WHERE table_schema = ?. The table_schema is the constant templateDBName, so MySQL only returns tables from our own template DB. A MySQL table name cannot itself contain a backtick because MySQL's identifier rules forbid it — so backtick-quoting t in SHOW CREATE TABLE \%s`.`%s`` is safe.

GET_LOCK call: templateLockName is a constant so there is no injection vector there either.

Verdict: no injection issues in the new code.


2. Secrets / credentials in logs or errors

The destDB name and dbName appear in error strings (fmt.Errorf("open dest pool: %w", err) etc.), but these are just database names — no passwords. The DSN itself is only stored in suiteDSN and passed to db.Open; it is not included in any error message visible in the diff. The existing discardSlogLogger() is wired to io.Discard so nothing from the template-creation path is logged at all.

No credentials leaked in new code.


3. Blast radius — can this reach a production DB?

  • File starts with //go:build integration (inherited from the package — confirmed by the package-level comment and the existing pre-PR file).
  • StartSharedMySQL only fires when PERGAMUM_DSN is set (CI service container) or when testcontainers spins a MySQL container. Neither path is reachable from the production binary (cmd/pergamum).
  • The DROP DATABASE IF EXISTS calls drop only DBs named "pergamum_test_template", dbName (suffix _<hex>), or templateDBName — all prefixed pergamum_test_*. The production schema lives in a DB named by the operator-supplied DSN, never anything matching those prefixes.
  • There is no way to import dbtest from a non-integration-tagged build, so these DROP calls cannot touch production.

Production blast radius: none.


4. Advisory lock name attacker-controllability

templateLockName is a constant ("pergamum_dbtest_template"). It is not derived from any input, runtime state, or environment variable. Length is 24 bytes, well under MySQL's 64-byte limit.

No controllability concern.


5. Minor observations (no severity)

  • The nolint:nilerr on the schema_migrations miss-path silently swallows the "table doesn't exist" error and returns nil, which means a partially-cloned DB (no migration rows) will be handed to the suite. Golang-migrate will then try to re-run all migrations and will likely succeed, so it degrades gracefully rather than failing silently. No security implication; noted for awareness.
  • defer lockConn.Close() and defer templateDB.Close() both use //nolint:errcheck — reasonable for test infrastructure.

Summary

No security issues found in the new code. All SQL identifiers in the new paths are either compile-time constants or table names from a fully controlled source. No credentials are introduced or leaked. The code is guarded by //go:build integration and cannot reach production.

REVIEW VERDICT: 0 blocker, 0 major, 0 minor
## Security Review — PR #587 (`bd-bookshelf-7akw`) Scope: `internal/dbtest/dbtest.go` +258/-14 — migrate-once-into-template-DB + clone-per-parallel-node. --- ### 1. SQL injection via DB/table names **`templateDBName` and `templateLockName` are compile-time string constants**, not derived from any runtime input. All interpolated identifiers fall into one of these categories: | Value | Source | Safe? | |---|---|---| | `templateDBName` = `"pergamum_test_template"` | package-level `const` | yes — fixed at compile time | | `templateLockName` = `"pergamum_dbtest_template"` | package-level `const` | yes — fixed at compile time | | `destDB` / `dbName` | `fmt.Sprintf("%s_%s", name, randomHex())` | see below | | `t` (table name) | `information_schema.tables` result from our own template DB | effectively controlled | **`name` parameter:** The `name` argument to `NewSuiteDB` is caller-supplied. Looking at the diff, `name` was already interpolated into the `CREATE DATABASE` DDL in the pre-existing code (unchanged), and the new diff adds no new interpolation of `name` beyond what already existed — the `destDB` passed to `cloneSchemaFromTemplate` is the already-constructed `dbName` (`name + "_" + randomHex()`), which is then backtick-quoted in DDL. All callers that can be seen in the codebase pass a short fixed package-name string (`"books"`, `"library"`, etc.), so there is no practical injection risk; still, this is pre-existing and outside the new diff's scope. **Table names from `information_schema`:** In `cloneSchemaFromTemplate`, `t` comes from `SELECT table_name FROM information_schema.tables WHERE table_schema = ?`. The `table_schema` is the constant `templateDBName`, so MySQL only returns tables from our own template DB. A MySQL table name cannot itself contain a backtick because MySQL's identifier rules forbid it — so backtick-quoting `t` in `SHOW CREATE TABLE \`%s\`.\`%s\`` is safe. **GET_LOCK call:** `templateLockName` is a constant so there is no injection vector there either. **Verdict: no injection issues in the new code.** --- ### 2. Secrets / credentials in logs or errors The `destDB` name and `dbName` appear in error strings (`fmt.Errorf("open dest pool: %w", err)` etc.), but these are just database names — no passwords. The DSN itself is only stored in `suiteDSN` and passed to `db.Open`; it is not included in any error message visible in the diff. The existing `discardSlogLogger()` is wired to `io.Discard` so nothing from the template-creation path is logged at all. **No credentials leaked in new code.** --- ### 3. Blast radius — can this reach a production DB? - File starts with `//go:build integration` (inherited from the package — confirmed by the package-level comment and the existing pre-PR file). - `StartSharedMySQL` only fires when `PERGAMUM_DSN` is set (CI service container) or when testcontainers spins a MySQL container. Neither path is reachable from the production binary (`cmd/pergamum`). - The `DROP DATABASE IF EXISTS` calls drop only DBs named `"pergamum_test_template"`, `dbName` (suffix `_<hex>`), or `templateDBName` — all prefixed `pergamum_test_*`. The production schema lives in a DB named by the operator-supplied DSN, never anything matching those prefixes. - There is no way to import `dbtest` from a non-`integration`-tagged build, so these `DROP` calls cannot touch production. **Production blast radius: none.** --- ### 4. Advisory lock name attacker-controllability `templateLockName` is a constant (`"pergamum_dbtest_template"`). It is not derived from any input, runtime state, or environment variable. Length is 24 bytes, well under MySQL's 64-byte limit. **No controllability concern.** --- ### 5. Minor observations (no severity) - The `nolint:nilerr` on the `schema_migrations` miss-path silently swallows the "table doesn't exist" error and returns `nil`, which means a partially-cloned DB (no migration rows) will be handed to the suite. Golang-migrate will then try to re-run all migrations and will likely succeed, so it degrades gracefully rather than failing silently. No security implication; noted for awareness. - `defer lockConn.Close()` and `defer templateDB.Close()` both use `//nolint:errcheck` — reasonable for test infrastructure. --- ### Summary No security issues found in the new code. All SQL identifiers in the new paths are either compile-time constants or table names from a fully controlled source. No credentials are introduced or leaked. The code is guarded by `//go:build integration` and cannot reach production. ``` REVIEW VERDICT: 0 blocker, 0 major, 0 minor ```
Author
Owner

[MINOR] internal/dbtest/dbtest.go:299-302 — schema_migrations query error silently returns nil

In cloneSchemaFromTemplate, when rootDB.QueryContext fails for the schema_migrations copy step, the function returns nil (success) with a comment "skip — migrations will re-apply." But if the dest DB was cloned (all tables exist), golang-migrate would attempt to re-run migrations and immediately fail with "table already exists." In practice this code path is not reachable after a successful template migration (schema_migrations always exists), and the //nolint:nilerr documents the intent. Given that the template was just confirmed ready by templateIsReady — which itself queries schema_migrations — a failure here would be a connectivity catastrophe already surfaced elsewhere. Acceptable for test-infra, but the comment says "migrations will re-apply" when in fact they would error, not re-apply cleanly. Update the comment to accurately describe the failure mode, or return the error.

[MINOR] internal/dbtest/dbtest.go:168-169 — GET_LOCK timeout and ctx deadline can conflict

GET_LOCK is called with migrationTimeout (120 s) as the server-side wait, but the call is made via lockConn.QueryRowContext(ctx, ...) where ctx is the caller's test-suite context. If the caller's ctx has a shorter deadline (common for Go test contexts), QueryRowContext cancels before the server-side lock wait fires, and the error surfaces as a context-deadline error rather than the clear "GET_LOCK timed out" message. The lock holder's migration is bounded by migrCtx (migrationTimeout), so the GET_LOCK server timeout and migrCtx together give the correct worst-case bound — but a test that passes a short-deadline ctx will see a confusing cancellation error before the 120 s bound triggers. Low risk in practice (ginkgo suites use context.Background() or a very long deadline), but the interaction is worth documenting.

[MINOR] internal/dbtest/dbtest.go:325-337 — templateIsReady swallows all errors, not just "table not found"

templateIsReady returns false, nil on ANY scan error, including real connectivity failures. The //nolint:nilerr acknowledges this, and the consequence — falling through to GET_LOCK, which then fails with a cleaner error — is acceptable. No code change needed, but the comment should say "any query error (table absent, DB absent, connectivity) is treated as not-ready" instead of the narrower "Table doesn't exist or DB doesn't exist" to be accurate.


Concurrency correctness — verified clean:

  1. Lock acquire/release path: defer lockConn.Close() is registered before defer RELEASE_LOCK(...). Go defers run LIFO, so RELEASE_LOCK fires first (explicitly), then the connection closes. Correct on all paths including panics.

  2. Double-checked locking: The post-lock re-check uses rootDB (a pool), not lockConn. This is safe: the check runs while holding the advisory lock, so no other process can be mid-migration. Any connection from the pool observes the committed state.

  3. Fail-fast bound: migrCtx is context.WithTimeout(ctx, migrationTimeout) wrapping the caller's ctx. db.RunMigrations uses this context for each DDL ExecContext, so a stalled migration cancels within 120 s. The GET_LOCK call also has the 120 s server-side wait. Together these replace the 10-minute suite-kill with a 120 s ceiling. Correct.

  4. FK handling in clone: A dedicated destConn holds SET FOREIGN_KEY_CHECKS=0 for the entire DDL replay loop. LIFO defer order: re-enable FK fires via the last explicit Exec, then destConn.Close(), then destPool.Close(). On error paths the code explicitly re-enables FK before returning. Correct.

  5. templateIsReady check is sufficient: It queries COUNT(*) FROM schema_migrations WHERE dirty=0. Since ensureTemplateDB drops and recreates the template on any partially-built state, and only returns success after RunMigrations completes without error (which sets dirty=0), the check correctly distinguishes empty/in-progress from ready. dirty=1 rows cannot exist after a successful migration.

  6. Per-node isolation: Each node gets a unique <name>_<hex> DB, cloned independently from the (read-only after creation) template. No cross-node state sharing.

  7. Orphan DB cleanup: Template DB is intentionally persistent (mirrors the WithReuse container model). Per-node DBs are dropped in cleanup funcs. No orphan concern.

REVIEW VERDICT: 0 blocker, 0 major, 3 minor

[MINOR] internal/dbtest/dbtest.go:299-302 — `schema_migrations` query error silently returns nil In `cloneSchemaFromTemplate`, when `rootDB.QueryContext` fails for the `schema_migrations` copy step, the function returns `nil` (success) with a comment "skip — migrations will re-apply." But if the dest DB was cloned (all tables exist), golang-migrate would attempt to re-run migrations and immediately fail with "table already exists." In practice this code path is not reachable after a successful template migration (schema_migrations always exists), and the `//nolint:nilerr` documents the intent. Given that the template was just confirmed ready by `templateIsReady` — which itself queries `schema_migrations` — a failure here would be a connectivity catastrophe already surfaced elsewhere. Acceptable for test-infra, but the comment says "migrations will re-apply" when in fact they would error, not re-apply cleanly. Update the comment to accurately describe the failure mode, or return the error. [MINOR] internal/dbtest/dbtest.go:168-169 — `GET_LOCK` timeout and `ctx` deadline can conflict `GET_LOCK` is called with `migrationTimeout` (120 s) as the server-side wait, but the call is made via `lockConn.QueryRowContext(ctx, ...)` where `ctx` is the caller's test-suite context. If the caller's `ctx` has a shorter deadline (common for Go test contexts), QueryRowContext cancels before the server-side lock wait fires, and the error surfaces as a context-deadline error rather than the clear "GET_LOCK timed out" message. The lock holder's migration is bounded by `migrCtx` (migrationTimeout), so the GET_LOCK server timeout and migrCtx together give the correct worst-case bound — but a test that passes a short-deadline ctx will see a confusing cancellation error before the 120 s bound triggers. Low risk in practice (ginkgo suites use `context.Background()` or a very long deadline), but the interaction is worth documenting. [MINOR] internal/dbtest/dbtest.go:325-337 — `templateIsReady` swallows all errors, not just "table not found" `templateIsReady` returns `false, nil` on ANY scan error, including real connectivity failures. The `//nolint:nilerr` acknowledges this, and the consequence — falling through to GET_LOCK, which then fails with a cleaner error — is acceptable. No code change needed, but the comment should say "any query error (table absent, DB absent, connectivity) is treated as not-ready" instead of the narrower "Table doesn't exist or DB doesn't exist" to be accurate. --- **Concurrency correctness — verified clean:** 1. **Lock acquire/release path**: `defer lockConn.Close()` is registered before `defer RELEASE_LOCK(...)`. Go defers run LIFO, so RELEASE_LOCK fires first (explicitly), then the connection closes. Correct on all paths including panics. 2. **Double-checked locking**: The post-lock re-check uses `rootDB` (a pool), not `lockConn`. This is safe: the check runs while holding the advisory lock, so no other process can be mid-migration. Any connection from the pool observes the committed state. 3. **Fail-fast bound**: `migrCtx` is `context.WithTimeout(ctx, migrationTimeout)` wrapping the caller's ctx. `db.RunMigrations` uses this context for each DDL `ExecContext`, so a stalled migration cancels within 120 s. The GET_LOCK call also has the 120 s server-side wait. Together these replace the 10-minute suite-kill with a 120 s ceiling. Correct. 4. **FK handling in clone**: A dedicated `destConn` holds `SET FOREIGN_KEY_CHECKS=0` for the entire DDL replay loop. LIFO defer order: re-enable FK fires via the last explicit Exec, then `destConn.Close()`, then `destPool.Close()`. On error paths the code explicitly re-enables FK before returning. Correct. 5. **`templateIsReady` check is sufficient**: It queries `COUNT(*) FROM schema_migrations WHERE dirty=0`. Since `ensureTemplateDB` drops and recreates the template on any partially-built state, and only returns success after `RunMigrations` completes without error (which sets dirty=0), the check correctly distinguishes empty/in-progress from ready. `dirty=1` rows cannot exist after a successful migration. 6. **Per-node isolation**: Each node gets a unique `<name>_<hex>` DB, cloned independently from the (read-only after creation) template. No cross-node state sharing. 7. **Orphan DB cleanup**: Template DB is intentionally persistent (mirrors the WithReuse container model). Per-node DBs are dropped in cleanup funcs. No orphan concern. REVIEW VERDICT: 0 blocker, 0 major, 3 minor
fix(observability): eliminate TOCTOU port-race in metrics server test
Some checks failed
/ JS Unit Tests (pull_request) Successful in 2m54s
/ Lint (pull_request) Successful in 4m18s
/ Test (pull_request) Failing after 5m23s
/ Integration (pull_request) Successful in 5m34s
/ E2E Browser (pull_request) Successful in 6m0s
/ E2E API (pull_request) Failing after 10m32s
6ecf9a0a85
The test pattern listen→close→pass-addr-to-ServeMetrics was subject to
another goroutine grabbing the ephemeral port between Close() and
ServeMetrics's internal net.Listen(), causing intermittent
"connection refused" under -p 8 -race.

Split ServeMetrics into a public addr-based entry point that calls an
unexported serveMetricsOnListener, exposed for tests via export_test.go.
The test now passes the pre-bound net.Listener directly, eliminating
the race entirely.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Author
Owner

CODE REVIEW (delta only: commit 6ecf9a0a "fix(observability): eliminate TOCTOU port-race in metrics server test")

Re-review of the observability delta added after the previously-clean 101ab137 commit. Three files reviewed: internal/observability/metrics_server.go, metrics_server_test.go, export_test.go.


Phase 0 — DEMO

No DEMO block is present in this bead; the fix targets a test flake (CI behaviour), not a user-facing surface. Accepted: the fix is verified by re-reading the test logic below.


Phase 1 — Correctness

ServeMetrics behaviour preserved.
The public signature is unchanged. The function now does net.Listen then delegates — identical observable semantics. fmt is still imported (metrics_server.go:5). Error wrapping format "metrics listen %s: %w" is identical to the old code. Pass.

Listener ownership and close — no leak, no double-close.
serveMetricsOnListener calls srv.Serve(ln) at line 59. Go's http.Server.Serve closes the listener when it returns (documented stdlib behaviour), so the listener is always released whether the server exits cleanly or with an error. There is no ln.Close() call elsewhere in the function, so there is no double-close risk. The shutdown goroutine calls srv.Shutdown(...), which does not close the listener directly — it signals Serve to exit, and Serve closes the listener on its way out. Pass.

Context cancellation / clean-shutdown.
The goroutine at metrics_server.go:52–56 calls srv.Shutdown on ctx.Done(). srv.Serve then returns http.ErrServerClosed, which is filtered at line 59, so the function returns nil. Behaviour is unchanged from before the refactor. Pass.

Routes/handlers unchanged.
serveMetricsOnListener wires mux.Handle("GET /metrics", MetricsHandler(reg)) — identical to the old ServeMetrics. No routes added or removed. srv.Addr is now set to ln.Addr().String() instead of the input addr string. For production use these are identical (the listener is bound to addr); the only visible difference is that :0 in tests resolves to the real assigned address in logs/srv.Addr. Pass.

The TOCTOU race is fixed.
Old code: net.Listen("127.0.0.1:0")ln.Close()ServeMetrics(ctx, addr, ...) → second net.Listen(addr). The window between close and re-listen was the race.

New code: net.Listen("127.0.0.1:0") → read addr from ln.Addr() → pass ln directly to ServeMetricsOnListener. The same file descriptor is handed to the server. There is no close-and-reopen window. Race is eliminated. Pass.

ServeMetricsOnListener export is test-only.
export_test.go is named *_test.go, so the Go toolchain compiles it only during go test — it does not appear in the production binary. The package observability (not _test) declaration is the standard Go "export for white-box tests" idiom. ServeMetricsOnListener does not widen the production surface. Pass.

Listen-error path still exercises ServeMetrics.
The "ServeMetrics listen error" describe block at metrics_server_test.go:95 calls observability.ServeMetrics(...) directly, not the internal helper. The public error path remains under test. Pass.


Phase 2 — Code Quality

Var-at-top convention.
ServeMetrics now uses := (4-line function). project-conventions.md explicitly excepts "short-lived loop indices, throw-away buffers, and other genuinely scope-local vars." A 4-line delegation wrapper qualifies; ln and err are used exactly once each. serveMetricsOnListener retains the var (...) block at the top for its multi-use vars. Convention followed appropriately.

Multi-Expect It blocks.
The "serves Prometheus metrics content" and "includes go_goroutines metric" Its each contain two/three Expect calls. These are pre-existing at commit 101ab137 — this delta does not introduce them. Not a finding for this review.

No security concerns.
The server still binds to the configured addr in production; no new port or interface is exposed. Error messages contain only the address string and a wrapped stdlib error — no secrets or sensitive data. The test-only export path is inert in production.


Findings

No findings.


REVIEW VERDICT: 0 blocker, 0 major, 0 minor

**CODE REVIEW (delta only: commit 6ecf9a0a "fix(observability): eliminate TOCTOU port-race in metrics server test")** Re-review of the observability delta added after the previously-clean 101ab137 commit. Three files reviewed: `internal/observability/metrics_server.go`, `metrics_server_test.go`, `export_test.go`. --- ## Phase 0 — DEMO No DEMO block is present in this bead; the fix targets a test flake (CI behaviour), not a user-facing surface. Accepted: the fix is verified by re-reading the test logic below. --- ## Phase 1 — Correctness **ServeMetrics behaviour preserved.** The public signature is unchanged. The function now does `net.Listen` then delegates — identical observable semantics. `fmt` is still imported (`metrics_server.go:5`). Error wrapping format `"metrics listen %s: %w"` is identical to the old code. Pass. **Listener ownership and close — no leak, no double-close.** `serveMetricsOnListener` calls `srv.Serve(ln)` at line 59. Go's `http.Server.Serve` closes the listener when it returns (documented stdlib behaviour), so the listener is always released whether the server exits cleanly or with an error. There is no `ln.Close()` call elsewhere in the function, so there is no double-close risk. The shutdown goroutine calls `srv.Shutdown(...)`, which does *not* close the listener directly — it signals Serve to exit, and Serve closes the listener on its way out. Pass. **Context cancellation / clean-shutdown.** The goroutine at `metrics_server.go:52–56` calls `srv.Shutdown` on `ctx.Done()`. `srv.Serve` then returns `http.ErrServerClosed`, which is filtered at line 59, so the function returns `nil`. Behaviour is unchanged from before the refactor. Pass. **Routes/handlers unchanged.** `serveMetricsOnListener` wires `mux.Handle("GET /metrics", MetricsHandler(reg))` — identical to the old `ServeMetrics`. No routes added or removed. `srv.Addr` is now set to `ln.Addr().String()` instead of the input `addr` string. For production use these are identical (the listener is bound to `addr`); the only visible difference is that `:0` in tests resolves to the real assigned address in logs/`srv.Addr`. Pass. **The TOCTOU race is fixed.** Old code: `net.Listen("127.0.0.1:0")` → `ln.Close()` → `ServeMetrics(ctx, addr, ...)` → second `net.Listen(addr)`. The window between close and re-listen was the race. New code: `net.Listen("127.0.0.1:0")` → read `addr` from `ln.Addr()` → pass `ln` directly to `ServeMetricsOnListener`. The same file descriptor is handed to the server. There is no close-and-reopen window. Race is eliminated. Pass. **`ServeMetricsOnListener` export is test-only.** `export_test.go` is named `*_test.go`, so the Go toolchain compiles it only during `go test` — it does not appear in the production binary. The `package observability` (not `_test`) declaration is the standard Go "export for white-box tests" idiom. `ServeMetricsOnListener` does not widen the production surface. Pass. **Listen-error path still exercises `ServeMetrics`.** The `"ServeMetrics listen error"` describe block at `metrics_server_test.go:95` calls `observability.ServeMetrics(...)` directly, not the internal helper. The public error path remains under test. Pass. --- ## Phase 2 — Code Quality **Var-at-top convention.** `ServeMetrics` now uses `:=` (4-line function). `project-conventions.md` explicitly excepts "short-lived loop indices, throw-away buffers, and other genuinely scope-local vars." A 4-line delegation wrapper qualifies; `ln` and `err` are used exactly once each. `serveMetricsOnListener` retains the `var (...)` block at the top for its multi-use vars. Convention followed appropriately. **Multi-Expect `It` blocks.** The `"serves Prometheus metrics content"` and `"includes go_goroutines metric"` Its each contain two/three `Expect` calls. These are pre-existing at commit 101ab137 — this delta does not introduce them. Not a finding for this review. **No security concerns.** The server still binds to the configured `addr` in production; no new port or interface is exposed. Error messages contain only the address string and a wrapped stdlib error — no secrets or sensitive data. The test-only export path is inert in production. --- ## Findings No findings. --- REVIEW VERDICT: 0 blocker, 0 major, 0 minor
Revert "fix(observability): eliminate TOCTOU port-race in metrics server test"
All checks were successful
/ Lint (pull_request) Successful in 2m4s
/ JS Unit Tests (pull_request) Successful in 38s
/ Test (pull_request) Successful in 3m33s
/ Integration (pull_request) Successful in 3m42s
/ E2E Browser (pull_request) Successful in 4m20s
/ E2E API (pull_request) Successful in 5m11s
c4f5c9f990
This reverts commit 6ecf9a0a85.
zombor merged commit e717d56ba0 into main 2026-06-17 04:02:41 +00:00
Sign in to join this conversation.
No reviewers
No labels
No milestone
No project
No assignees
1 participant
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference
zombor/pergamum!587
No description provided.