fix(dbtest): eliminate thundering-herd migrate hang (bookshelf-7akw) #587
Loading…
Add table
Add a link
Reference in a new issue
No description provided.
Delete branch "bd-bookshelf-7akw"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Root cause
8 parallel ginkgo nodes each calling
migrate.Up()concurrently on one MySQL instance with--tmpfs /var/lib/mysql size=1goverwhelmed the server's DDL capacity.ExecContextcalls stalled, blocked onreadResultSetHeaderPacket, silently hanging for the full 10-minute Go test timeout. Affected both #579 Integration and #583 E2E API identically.Fix
Two parts applied together:
1. Fail-fast (resilience):
ensureTemplateDBrunsRunMigrationsunder a 120scontext.WithTimeout. A stalled migration now errors in ≤120s instead of hanging until the suite-level 10m timeout.2. De-thundering-herd (the actual fix): Instead of every node running full migrations concurrently:
pergamum_test_template) guarded by a MySQLGET_LOCK()advisory lock (cross-process, 120s timeout).SHOW CREATE TABLE+ DDL replay withFOREIGN_KEY_CHECKS=0on a single dedicated connection (no FK ordering issues).schema_migrationsrows are also cloned sogolang-migratetreats each clone as fully migrated (ErrNoChangeon nextUp()call).The advisory lock ensures only one goroutine/process ever migrates; all concurrent callers block, then clone. Reduces N concurrent full migrations to 1 migration + N fast DDL clones.
Local results
make integration: all packages pass (768 + 227 + 3 specs)make e2e-api: 1125/1125 pass in 2m37s (11 procs)make e2e-browser: 250/250 pass in 2m0s (11 procs)No changes to e2e suite files,
BeforeSuitecallers, or any production code.Closes bead bookshelf-7akw on merge.
Security Review — PR #587 (
bd-bookshelf-7akw)Scope:
internal/dbtest/dbtest.go+258/-14 — migrate-once-into-template-DB + clone-per-parallel-node.1. SQL injection via DB/table names
templateDBNameandtemplateLockNameare compile-time string constants, not derived from any runtime input. All interpolated identifiers fall into one of these categories:templateDBName="pergamum_test_template"consttemplateLockName="pergamum_dbtest_template"constdestDB/dbNamefmt.Sprintf("%s_%s", name, randomHex())t(table name)information_schema.tablesresult from our own template DBnameparameter: Thenameargument toNewSuiteDBis caller-supplied. Looking at the diff,namewas already interpolated into theCREATE DATABASEDDL in the pre-existing code (unchanged), and the new diff adds no new interpolation ofnamebeyond what already existed — thedestDBpassed tocloneSchemaFromTemplateis the already-constructeddbName(name + "_" + randomHex()), which is then backtick-quoted in DDL. All callers that can be seen in the codebase pass a short fixed package-name string ("books","library", etc.), so there is no practical injection risk; still, this is pre-existing and outside the new diff's scope.Table names from
information_schema: IncloneSchemaFromTemplate,tcomes fromSELECT table_name FROM information_schema.tables WHERE table_schema = ?. Thetable_schemais the constanttemplateDBName, so MySQL only returns tables from our own template DB. A MySQL table name cannot itself contain a backtick because MySQL's identifier rules forbid it — so backtick-quotingtinSHOW CREATE TABLE \%s`.`%s`` is safe.GET_LOCK call:
templateLockNameis a constant so there is no injection vector there either.Verdict: no injection issues in the new code.
2. Secrets / credentials in logs or errors
The
destDBname anddbNameappear in error strings (fmt.Errorf("open dest pool: %w", err)etc.), but these are just database names — no passwords. The DSN itself is only stored insuiteDSNand passed todb.Open; it is not included in any error message visible in the diff. The existingdiscardSlogLogger()is wired toio.Discardso nothing from the template-creation path is logged at all.No credentials leaked in new code.
3. Blast radius — can this reach a production DB?
//go:build integration(inherited from the package — confirmed by the package-level comment and the existing pre-PR file).StartSharedMySQLonly fires whenPERGAMUM_DSNis set (CI service container) or when testcontainers spins a MySQL container. Neither path is reachable from the production binary (cmd/pergamum).DROP DATABASE IF EXISTScalls drop only DBs named"pergamum_test_template",dbName(suffix_<hex>), ortemplateDBName— all prefixedpergamum_test_*. The production schema lives in a DB named by the operator-supplied DSN, never anything matching those prefixes.dbtestfrom a non-integration-tagged build, so theseDROPcalls cannot touch production.Production blast radius: none.
4. Advisory lock name attacker-controllability
templateLockNameis a constant ("pergamum_dbtest_template"). It is not derived from any input, runtime state, or environment variable. Length is 24 bytes, well under MySQL's 64-byte limit.No controllability concern.
5. Minor observations (no severity)
nolint:nilerron theschema_migrationsmiss-path silently swallows the "table doesn't exist" error and returnsnil, which means a partially-cloned DB (no migration rows) will be handed to the suite. Golang-migrate will then try to re-run all migrations and will likely succeed, so it degrades gracefully rather than failing silently. No security implication; noted for awareness.defer lockConn.Close()anddefer templateDB.Close()both use//nolint:errcheck— reasonable for test infrastructure.Summary
No security issues found in the new code. All SQL identifiers in the new paths are either compile-time constants or table names from a fully controlled source. No credentials are introduced or leaked. The code is guarded by
//go:build integrationand cannot reach production.[MINOR] internal/dbtest/dbtest.go:299-302 —
schema_migrationsquery error silently returns nilIn
cloneSchemaFromTemplate, whenrootDB.QueryContextfails for theschema_migrationscopy step, the function returnsnil(success) with a comment "skip — migrations will re-apply." But if the dest DB was cloned (all tables exist), golang-migrate would attempt to re-run migrations and immediately fail with "table already exists." In practice this code path is not reachable after a successful template migration (schema_migrations always exists), and the//nolint:nilerrdocuments the intent. Given that the template was just confirmed ready bytemplateIsReady— which itself queriesschema_migrations— a failure here would be a connectivity catastrophe already surfaced elsewhere. Acceptable for test-infra, but the comment says "migrations will re-apply" when in fact they would error, not re-apply cleanly. Update the comment to accurately describe the failure mode, or return the error.[MINOR] internal/dbtest/dbtest.go:168-169 —
GET_LOCKtimeout andctxdeadline can conflictGET_LOCKis called withmigrationTimeout(120 s) as the server-side wait, but the call is made vialockConn.QueryRowContext(ctx, ...)wherectxis the caller's test-suite context. If the caller'sctxhas a shorter deadline (common for Go test contexts), QueryRowContext cancels before the server-side lock wait fires, and the error surfaces as a context-deadline error rather than the clear "GET_LOCK timed out" message. The lock holder's migration is bounded bymigrCtx(migrationTimeout), so the GET_LOCK server timeout and migrCtx together give the correct worst-case bound — but a test that passes a short-deadline ctx will see a confusing cancellation error before the 120 s bound triggers. Low risk in practice (ginkgo suites usecontext.Background()or a very long deadline), but the interaction is worth documenting.[MINOR] internal/dbtest/dbtest.go:325-337 —
templateIsReadyswallows all errors, not just "table not found"templateIsReadyreturnsfalse, nilon ANY scan error, including real connectivity failures. The//nolint:nilerracknowledges this, and the consequence — falling through to GET_LOCK, which then fails with a cleaner error — is acceptable. No code change needed, but the comment should say "any query error (table absent, DB absent, connectivity) is treated as not-ready" instead of the narrower "Table doesn't exist or DB doesn't exist" to be accurate.Concurrency correctness — verified clean:
Lock acquire/release path:
defer lockConn.Close()is registered beforedefer RELEASE_LOCK(...). Go defers run LIFO, so RELEASE_LOCK fires first (explicitly), then the connection closes. Correct on all paths including panics.Double-checked locking: The post-lock re-check uses
rootDB(a pool), notlockConn. This is safe: the check runs while holding the advisory lock, so no other process can be mid-migration. Any connection from the pool observes the committed state.Fail-fast bound:
migrCtxiscontext.WithTimeout(ctx, migrationTimeout)wrapping the caller's ctx.db.RunMigrationsuses this context for each DDLExecContext, so a stalled migration cancels within 120 s. The GET_LOCK call also has the 120 s server-side wait. Together these replace the 10-minute suite-kill with a 120 s ceiling. Correct.FK handling in clone: A dedicated
destConnholdsSET FOREIGN_KEY_CHECKS=0for the entire DDL replay loop. LIFO defer order: re-enable FK fires via the last explicit Exec, thendestConn.Close(), thendestPool.Close(). On error paths the code explicitly re-enables FK before returning. Correct.templateIsReadycheck is sufficient: It queriesCOUNT(*) FROM schema_migrations WHERE dirty=0. SinceensureTemplateDBdrops and recreates the template on any partially-built state, and only returns success afterRunMigrationscompletes without error (which sets dirty=0), the check correctly distinguishes empty/in-progress from ready.dirty=1rows cannot exist after a successful migration.Per-node isolation: Each node gets a unique
<name>_<hex>DB, cloned independently from the (read-only after creation) template. No cross-node state sharing.Orphan DB cleanup: Template DB is intentionally persistent (mirrors the WithReuse container model). Per-node DBs are dropped in cleanup funcs. No orphan concern.
REVIEW VERDICT: 0 blocker, 0 major, 3 minor
CODE REVIEW (delta only: commit
6ecf9a0a"fix(observability): eliminate TOCTOU port-race in metrics server test")Re-review of the observability delta added after the previously-clean
101ab137commit. Three files reviewed:internal/observability/metrics_server.go,metrics_server_test.go,export_test.go.Phase 0 — DEMO
No DEMO block is present in this bead; the fix targets a test flake (CI behaviour), not a user-facing surface. Accepted: the fix is verified by re-reading the test logic below.
Phase 1 — Correctness
ServeMetrics behaviour preserved.
The public signature is unchanged. The function now does
net.Listenthen delegates — identical observable semantics.fmtis still imported (metrics_server.go:5). Error wrapping format"metrics listen %s: %w"is identical to the old code. Pass.Listener ownership and close — no leak, no double-close.
serveMetricsOnListenercallssrv.Serve(ln)at line 59. Go'shttp.Server.Servecloses the listener when it returns (documented stdlib behaviour), so the listener is always released whether the server exits cleanly or with an error. There is noln.Close()call elsewhere in the function, so there is no double-close risk. The shutdown goroutine callssrv.Shutdown(...), which does not close the listener directly — it signals Serve to exit, and Serve closes the listener on its way out. Pass.Context cancellation / clean-shutdown.
The goroutine at
metrics_server.go:52–56callssrv.Shutdownonctx.Done().srv.Servethen returnshttp.ErrServerClosed, which is filtered at line 59, so the function returnsnil. Behaviour is unchanged from before the refactor. Pass.Routes/handlers unchanged.
serveMetricsOnListenerwiresmux.Handle("GET /metrics", MetricsHandler(reg))— identical to the oldServeMetrics. No routes added or removed.srv.Addris now set toln.Addr().String()instead of the inputaddrstring. For production use these are identical (the listener is bound toaddr); the only visible difference is that:0in tests resolves to the real assigned address in logs/srv.Addr. Pass.The TOCTOU race is fixed.
Old code:
net.Listen("127.0.0.1:0")→ln.Close()→ServeMetrics(ctx, addr, ...)→ secondnet.Listen(addr). The window between close and re-listen was the race.New code:
net.Listen("127.0.0.1:0")→ readaddrfromln.Addr()→ passlndirectly toServeMetricsOnListener. The same file descriptor is handed to the server. There is no close-and-reopen window. Race is eliminated. Pass.ServeMetricsOnListenerexport is test-only.export_test.gois named*_test.go, so the Go toolchain compiles it only duringgo test— it does not appear in the production binary. Thepackage observability(not_test) declaration is the standard Go "export for white-box tests" idiom.ServeMetricsOnListenerdoes not widen the production surface. Pass.Listen-error path still exercises
ServeMetrics.The
"ServeMetrics listen error"describe block atmetrics_server_test.go:95callsobservability.ServeMetrics(...)directly, not the internal helper. The public error path remains under test. Pass.Phase 2 — Code Quality
Var-at-top convention.
ServeMetricsnow uses:=(4-line function).project-conventions.mdexplicitly excepts "short-lived loop indices, throw-away buffers, and other genuinely scope-local vars." A 4-line delegation wrapper qualifies;lnanderrare used exactly once each.serveMetricsOnListenerretains thevar (...)block at the top for its multi-use vars. Convention followed appropriately.Multi-Expect
Itblocks.The
"serves Prometheus metrics content"and"includes go_goroutines metric"Its each contain two/threeExpectcalls. These are pre-existing at commit101ab137— this delta does not introduce them. Not a finding for this review.No security concerns.
The server still binds to the configured
addrin production; no new port or interface is exposed. Error messages contain only the address string and a wrapped stdlib error — no secrets or sensitive data. The test-only export path is inert in production.Findings
No findings.
REVIEW VERDICT: 0 blocker, 0 major, 0 minor