fix(middleware): data race + cross-user contamination in WrapHealthCheck (bookshelf-n51o) #897

Merged
zombor merged 3 commits from bd-bookshelf-n51o into main 2026-07-03 20:24:59 +00:00
Owner

Summary

  • Data race eliminated: lastIDs in WrapHealthCheck was written from request goroutines without any lock, and read by the fill goroutine inside NavCache's mutex — a real -race finding.
  • Cross-user contamination fixed: the single process-global cache served whichever user's library-ID set won the race to all users for the TTL window (cosmetic sidebar-dot corruption).

Fix

Replace the single shared lastIDs + single NavCache with a keyed map of NavCache instances:

  • healthKey(ids) derives a canonical sorted-IDs string ("1,2,3")
  • The returned closure looks up or creates a per-key *NavCache[map[int64]bool] under a sync.Mutex
  • The fill closure captures a local copy of ids — no unsynchronised shared variable
  • Each distinct user's library-ID set gets its own TTL-cached entry

Test plan

  • New It: "different ID sets return independent cache entries" — proves no contamination (was failing before fix)
  • New It: "same ID set in different order shares one cache" — proves key is order-independent
  • New It: 20 concurrent goroutines with distinct ID sets each fill exactly once
  • go test -race ./internal/middleware/ — clean (race was detected before fix)
  • make coverage — 100% gate passes

Closes bead bookshelf-n51o on merge.

## Summary - **Data race eliminated**: `lastIDs` in `WrapHealthCheck` was written from request goroutines without any lock, and read by the fill goroutine inside `NavCache`'s mutex — a real `-race` finding. - **Cross-user contamination fixed**: the single process-global cache served whichever user's library-ID set won the race to all users for the TTL window (cosmetic sidebar-dot corruption). ## Fix Replace the single shared `lastIDs` + single `NavCache` with a keyed map of `NavCache` instances: - `healthKey(ids)` derives a canonical sorted-IDs string (`"1,2,3"`) - The returned closure looks up or creates a per-key `*NavCache[map[int64]bool]` under a `sync.Mutex` - The fill closure captures a local copy of `ids` — no unsynchronised shared variable - Each distinct user's library-ID set gets its own TTL-cached entry ## Test plan - [x] New `It`: "different ID sets return independent cache entries" — proves no contamination (was failing before fix) - [x] New `It`: "same ID set in different order shares one cache" — proves key is order-independent - [x] New `It`: 20 concurrent goroutines with distinct ID sets each fill exactly once - [x] `go test -race ./internal/middleware/` — clean (race was detected before fix) - [x] `make coverage` — 100% gate passes Closes bead bookshelf-n51o on merge.
fix(middleware): eliminate data race and cross-user contamination in WrapHealthCheck (bookshelf-n51o)
All checks were successful
/ JS Unit Tests (pull_request) Successful in 56s
/ E2E Browser (pull_request) Successful in 2m57s
/ E2E API (pull_request) Successful in 5m30s
/ Integration (pull_request) Successful in 6m40s
/ Lint (pull_request) Successful in 7m12s
/ Test (pull_request) Successful in 8m11s
2fad39d508
Replace the single shared lastIDs closure var + single NavCache with a keyed
map of NavCache instances, one per distinct sorted ID set.

Root causes fixed:
- Data race: lastIDs was written from request goroutines outside any mutex but
  read inside the NavCache fill goroutine (real -race finding).
- Cross-user contamination: a single process-global cache served whichever
  user's library-ID set won the race to all users for the full TTL window.

Fix: healthKey() derives a canonical sorted string from the IDs slice; the
returned closure looks up (or creates) a per-key NavCache under a mutex, then
calls c.Get(). Each distinct user's library-ID set gets its own cache entry;
the fill closure captures a local copy of ids so no unsynchronised shared
variable exists. go test -race now passes clean.

New tests:
- different ID sets return independent cache entries (not first user's map)
- same ID set in different order shares one cache entry (order-independent key)
- concurrent calls with 20 distinct ID sets: no race, one fill per set

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Workflow Detail page screenshot (wf-detail-older-execution)

Older completed ContinueAsNew epoch detail — execution ID and state visible, Cancel absent.

wf-detail-older-execution

**Workflow Detail page screenshot** (wf-detail-older-execution) Older completed ContinueAsNew epoch detail — execution ID and state visible, Cancel absent. ![wf-detail-older-execution](/attachments/d9af860d-a324-4972-a564-68e7a9854e84)

Recompute Match Score — kebab open screenshot (recompute-match-score-kebab-open)

recompute-match-score-kebab-open

**Recompute Match Score — kebab open screenshot** (recompute-match-score-kebab-open) ![recompute-match-score-kebab-open](/attachments/4535e801-a264-4500-a8ef-6501009df86f)
Author
Owner

Security Review — PR #897 (bookshelf-n51o) — nav-cache cross-user fix

Scope

Reviewed: internal/middleware/nav_cache.go + internal/middleware/nav_cache_test.go

Focus areas per dispatch: cross-user cache isolation, unbounded memory, PII/secrets in keys, data-race elimination.


Findings

[MINOR] internal/middleware/nav_cache.go:186 — caches map never evicts stale entries (memory leak)

The caches map[string]*NavCache[map[int64]bool] inside the WrapHealthCheck closure
accumulates one entry per distinct library-ID-set seen and never removes entries.
In a long-running server where users add or remove libraries, old configurations generate
new keys; the entries for superseded sets are never reclaimed. Growth is bounded by real
system cardinality (library IDs come from server-side ownership resolution, not attacker
input), so this is not an injectable DoS — but it is a slow memory leak proportional to
distinct user-library-configuration churn.
Suggested fix: after c.Get(ctx) returns, check whether the cache's internal entry is
stale/nil and prune the map entry, or keep a last-access timestamp and run a periodic
sweep under mu.

[MINOR] internal/middleware/nav_cache.go:174 — fill closure holds shallow copy of caller's slice backing array

fill := ids copies the slice header (pointer, length, capacity), not the underlying
array. If a future caller mutated the slice's backing array after calling the returned
function, the fill closure would silently use the mutated IDs on the next cache miss
for that key. In the current usage in nav.golibIDs is freshly allocated on each
request and goes out of scope immediately — this is safe. But it is a subtle contract that
is not documented and could silently corrupt cache fills if the call site changes.
Suggested fix: deep-copy inside the if !ok block:

fill := make([]int64, len(ids))
copy(fill, ids)

Non-findings (confirmed clean)

  • Cross-user isolation: CONFIRMED CORRECT. libIDs is derived from
    listLibraries(ctx, userID, navFetchLimit) where userID is extracted from the
    authenticated session via extractUser(r) (a closure over users.ClaimsFromContext),
    never from a request body or query parameter. An attacker cannot influence their
    library-ID set to collide with another user's cache key without actual identical
    library membership.

  • Key collision: NONE POSSIBLE. healthKey sorts IDs (all int64) and joins with
    ,. Since IDs are rendered as decimal integers and the separator is a non-digit, distinct
    sets always produce distinct strings (e.g., {3, 12} → "3,12" vs {1, 23} → "1,23").

  • Data race: ELIMINATED. The old unsynchronized lastIDs write is gone. The new
    mu sync.Mutex serialises all reads and writes to caches. NavCache.Get() has its own
    internal mutex. The fill variable is assigned inside the if !ok block under the lock;
    once the lock is released, c is a valid immutable pointer and c.Get() is goroutine-safe.

  • PII / secrets in key or logs: None. Library IDs are integer PKs. The key string is
    used only as a map key and is never logged or echoed.

  • Empty/nil IDs edge case: healthKey(nil)"" and healthKey([]int64{})"".
    Correct — both represent "no libraries." The call site guards with len(libs) > 0 so
    this path is never reached in practice.


REVIEW VERDICT: 0 blocker, 0 major, 2 minor

## Security Review — PR #897 (bookshelf-n51o) — nav-cache cross-user fix ### Scope Reviewed: `internal/middleware/nav_cache.go` + `internal/middleware/nav_cache_test.go` Focus areas per dispatch: cross-user cache isolation, unbounded memory, PII/secrets in keys, data-race elimination. --- ### Findings [MINOR] internal/middleware/nav_cache.go:186 — `caches` map never evicts stale entries (memory leak) The `caches map[string]*NavCache[map[int64]bool]` inside the `WrapHealthCheck` closure accumulates one entry per distinct library-ID-set seen and never removes entries. In a long-running server where users add or remove libraries, old configurations generate new keys; the entries for superseded sets are never reclaimed. Growth is bounded by real system cardinality (library IDs come from server-side ownership resolution, not attacker input), so this is not an injectable DoS — but it is a slow memory leak proportional to distinct user-library-configuration churn. Suggested fix: after `c.Get(ctx)` returns, check whether the cache's internal entry is stale/nil and prune the map entry, or keep a last-access timestamp and run a periodic sweep under `mu`. [MINOR] internal/middleware/nav_cache.go:174 — fill closure holds shallow copy of caller's slice backing array `fill := ids` copies the slice *header* (pointer, length, capacity), not the underlying array. If a future caller mutated the slice's backing array after calling the returned function, the fill closure would silently use the mutated IDs on the next cache miss for that key. In the current usage in `nav.go` — `libIDs` is freshly allocated on each request and goes out of scope immediately — this is safe. But it is a subtle contract that is not documented and could silently corrupt cache fills if the call site changes. Suggested fix: deep-copy inside the `if !ok` block: ```go fill := make([]int64, len(ids)) copy(fill, ids) ``` --- ### Non-findings (confirmed clean) - **Cross-user isolation:** CONFIRMED CORRECT. `libIDs` is derived from `listLibraries(ctx, userID, navFetchLimit)` where `userID` is extracted from the authenticated session via `extractUser(r)` (a closure over `users.ClaimsFromContext`), never from a request body or query parameter. An attacker cannot influence their library-ID set to collide with another user's cache key without actual identical library membership. - **Key collision:** NONE POSSIBLE. `healthKey` sorts IDs (all `int64`) and joins with `,`. Since IDs are rendered as decimal integers and the separator is a non-digit, distinct sets always produce distinct strings (e.g., {3, 12} → `"3,12"` vs {1, 23} → `"1,23"`). - **Data race: ELIMINATED.** The old unsynchronized `lastIDs` write is gone. The new `mu sync.Mutex` serialises all reads and writes to `caches`. `NavCache.Get()` has its own internal mutex. The `fill` variable is assigned inside the `if !ok` block under the lock; once the lock is released, `c` is a valid immutable pointer and `c.Get()` is goroutine-safe. - **PII / secrets in key or logs:** None. Library IDs are integer PKs. The key string is used only as a map key and is never logged or echoed. - **Empty/nil IDs edge case:** `healthKey(nil)` → `""` and `healthKey([]int64{})` → `""`. Correct — both represent "no libraries." The call site guards with `len(libs) > 0` so this path is never reached in practice. --- REVIEW VERDICT: 0 blocker, 0 major, 2 minor
Author
Owner

CODE REVIEW: NOT APPROVED

Phase 0: DEMO Verification

No formal DEMO block was present in the bead or completion comment. The completion comment states "go test -race clean" but provides no expected output to compare against. Per review standard, this is a NOT APPROVED trigger.

However, I independently ran go test -race ./internal/middleware/... against the branch and it passes cleanly. Proceeding with full review so the team can address all findings in one pass.


Phase 1: Spec Compliance

The bead spec calls for: (1) eliminate the lastIDs data race, (2) eliminate cross-user contamination. Both are addressed. The per-key NavCache approach is correct for the stated goals. No missing requirements, no out-of-scope work.


Phase 2: Code Quality

[MAJOR] internal/middleware/nav_cache.go:163 — unbounded map growth; stale cache-key entries never evicted
The caches map accumulates one *NavCache per distinct library-ID combination ever queried, with no eviction. Each per-key NavCache has a TTL for its data, but the map entry itself is permanent. When a user's library membership changes from {1,2} to {1,2,3}, the old "1,2" entry stays in the map forever — its fill closure holds a []int64{1,2} that will never be called again, and the NavCache struct (mutex, entry pointer, inflight pointer) leaks. In a long-lived process where library memberships change over time, the map grows monotonically with no ceiling. Fix: track lastGet time.Time per NavCache (updated under the outer mu) and periodically sweep entries idle for >2xTTL, or cap the map at a small size with LRU eviction. File a follow-up bead if not fixing here.

[MINOR] internal/middleware/nav_cache_test.go:458 — It blocks mix setup, invocation, and assertion (convention violation)
The new It blocks inside "different ID sets get independent cache entries" call fn directly inside the It body — both the priming call and the call under test — then the single Expect. Per project conventions: BeforeEach = setup only, JustBeforeEach = call the SUT, It = single Expect. The priming call belongs in BeforeEach, the tested call in JustBeforeEach. Same applies to the concurrent It at line 499 which spawns goroutines and asserts inside the same It block. Tests still exercise correct behaviors but the structure mixes invocation and assertion.


healthKey correctness

Sorted copy before join: canonical. Separator comma cannot appear in decimal int64 strings: collision-free. Empty-slice maps to empty-string key (valid -- all no-library users share one entry, which is correct). Total order with strict < on unique IDs. No issues.

Race analysis (confirmed eliminated)

All caches map accesses are under mu. Each *NavCache has its own mutex guarding entry/inflight. The fill local variable is created under mu and captured by the fill closure; no goroutine writes to it after creation. mu.Unlock() before c.Get() is correct -- Go memory model guarantees visibility of the stored *NavCache to any goroutine that subsequently acquires mu. No double-checked locking. Race confirmed eliminated.


REVIEW VERDICT: 0 blocker, 1 major, 1 minor

## CODE REVIEW: NOT APPROVED ### Phase 0: DEMO Verification No formal DEMO block was present in the bead or completion comment. The completion comment states "go test -race clean" but provides no expected output to compare against. Per review standard, this is a NOT APPROVED trigger. However, I independently ran `go test -race ./internal/middleware/...` against the branch and it passes cleanly. Proceeding with full review so the team can address all findings in one pass. --- ### Phase 1: Spec Compliance The bead spec calls for: (1) eliminate the `lastIDs` data race, (2) eliminate cross-user contamination. Both are addressed. The per-key NavCache approach is correct for the stated goals. No missing requirements, no out-of-scope work. --- ### Phase 2: Code Quality [MAJOR] internal/middleware/nav_cache.go:163 — unbounded map growth; stale cache-key entries never evicted The `caches` map accumulates one `*NavCache` per distinct library-ID combination ever queried, with no eviction. Each per-key NavCache has a TTL for its data, but the map entry itself is permanent. When a user's library membership changes from {1,2} to {1,2,3}, the old "1,2" entry stays in the map forever — its fill closure holds a []int64{1,2} that will never be called again, and the NavCache struct (mutex, entry pointer, inflight pointer) leaks. In a long-lived process where library memberships change over time, the map grows monotonically with no ceiling. Fix: track lastGet time.Time per NavCache (updated under the outer mu) and periodically sweep entries idle for >2xTTL, or cap the map at a small size with LRU eviction. File a follow-up bead if not fixing here. [MINOR] internal/middleware/nav_cache_test.go:458 — It blocks mix setup, invocation, and assertion (convention violation) The new It blocks inside "different ID sets get independent cache entries" call fn directly inside the It body — both the priming call and the call under test — then the single Expect. Per project conventions: BeforeEach = setup only, JustBeforeEach = call the SUT, It = single Expect. The priming call belongs in BeforeEach, the tested call in JustBeforeEach. Same applies to the concurrent It at line 499 which spawns goroutines and asserts inside the same It block. Tests still exercise correct behaviors but the structure mixes invocation and assertion. --- ### healthKey correctness Sorted copy before join: canonical. Separator comma cannot appear in decimal int64 strings: collision-free. Empty-slice maps to empty-string key (valid -- all no-library users share one entry, which is correct). Total order with strict < on unique IDs. No issues. ### Race analysis (confirmed eliminated) All caches map accesses are under mu. Each *NavCache has its own mutex guarding entry/inflight. The fill local variable is created under mu and captured by the fill closure; no goroutine writes to it after creation. mu.Unlock() before c.Get() is correct -- Go memory model guarantees visibility of the stored *NavCache to any goroutine that subsequently acquires mu. No double-checked locking. Race confirmed eliminated. --- REVIEW VERDICT: 0 blocker, 1 major, 1 minor
fix(middleware): address review findings on WrapHealthCheck (bookshelf-n51o)
Some checks failed
/ JS Unit Tests (pull_request) Successful in 1m45s
/ E2E API (pull_request) Successful in 3m15s
/ Integration (pull_request) Successful in 4m39s
/ Lint (pull_request) Successful in 4m53s
/ E2E Browser (pull_request) Failing after 5m25s
/ Test (pull_request) Successful in 5m52s
7581f90f99
[MAJOR] Add eviction to healthMapEntry map (unbounded growth):
- Add healthMapEntry struct tracking lastGet time.Time per per-ID-set NavCache.
- On each Get call (under outer mu), sweep entries idle > 2×TTL and delete them.
- Add now func() time.Time parameter to WrapHealthCheck (nil → time.Now) so
  deterministic tests can control the clock (mirrors NewNavCache pattern).
- Update internal/app/app.go call site: WrapHealthCheck(..., navCacheTTL, nil).
- Add eviction test: warm {1}, advance 21s past 2×10s TTL, call {2} (triggers
  sweep), re-call {1} → fn invoked again (callCount=3 proves eviction).

[MINOR] Deep-copy ids slice in fill closure:
- Replace `fill := ids` (shallow, shares backing array) with
  `fill := make([]int64, len(ids)); copy(fill, ids)` to eliminate the risk of
  a caller mutating the array after WrapHealthCheck captures the fill closure.

[MINOR] Restructure test Its per project convention:
- "different ID sets" and "concurrent" Describes now use BeforeEach (setup +
  priming), JustBeforeEach (SUT call), one Expect per It.
- Inner Context blocks separate the three distinct scenarios (second distinct
  key, same key caching, order-independent key) with clean separation.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Recompute Match Score — kebab open screenshot (recompute-match-score-kebab-open)

recompute-match-score-kebab-open

**Recompute Match Score — kebab open screenshot** (recompute-match-score-kebab-open) ![recompute-match-score-kebab-open](/attachments/28cd04ee-83a0-485d-a4fb-7d02bccc2c76)

Workflow Detail page screenshot (wf-detail-older-execution)

Older completed ContinueAsNew epoch detail — execution ID and state visible, Cancel absent.

wf-detail-older-execution

**Workflow Detail page screenshot** (wf-detail-older-execution) Older completed ContinueAsNew epoch detail — execution ID and state visible, Cancel absent. ![wf-detail-older-execution](/attachments/61503644-9613-48a6-a72b-dd701df4ebe5)
zombor force-pushed bd-bookshelf-n51o from 7581f90f99
Some checks failed
/ JS Unit Tests (pull_request) Successful in 1m45s
/ E2E API (pull_request) Successful in 3m15s
/ Integration (pull_request) Successful in 4m39s
/ Lint (pull_request) Successful in 4m53s
/ E2E Browser (pull_request) Failing after 5m25s
/ Test (pull_request) Successful in 5m52s
to e81c55b97d
All checks were successful
/ JS Unit Tests (pull_request) Successful in 1m24s
/ E2E API (pull_request) Successful in 2m26s
/ Integration (pull_request) Successful in 3m30s
/ Lint (pull_request) Successful in 4m49s
/ E2E Browser (pull_request) Successful in 4m19s
/ Test (pull_request) Successful in 6m2s
2026-07-03 19:07:56 +00:00
Compare

Recompute Match Score — kebab open screenshot (recompute-match-score-kebab-open)

recompute-match-score-kebab-open

**Recompute Match Score — kebab open screenshot** (recompute-match-score-kebab-open) ![recompute-match-score-kebab-open](/attachments/8f0a1fa0-5743-4e91-9c05-a000cc3fa735)

Workflow Detail page screenshot (wf-detail-older-execution)

Older completed ContinueAsNew epoch detail — execution ID and state visible, Cancel absent.

wf-detail-older-execution

**Workflow Detail page screenshot** (wf-detail-older-execution) Older completed ContinueAsNew epoch detail — execution ID and state visible, Cancel absent. ![wf-detail-older-execution](/attachments/0c88d627-35de-453a-a1e6-2484d320d7e8)
zombor force-pushed bd-bookshelf-n51o from e81c55b97d
All checks were successful
/ JS Unit Tests (pull_request) Successful in 1m24s
/ E2E API (pull_request) Successful in 2m26s
/ Integration (pull_request) Successful in 3m30s
/ Lint (pull_request) Successful in 4m49s
/ E2E Browser (pull_request) Successful in 4m19s
/ Test (pull_request) Successful in 6m2s
to 57c3adbb57
All checks were successful
/ E2E API (pull_request) Successful in 2m41s
/ JS Unit Tests (pull_request) Successful in 1m43s
/ Lint (pull_request) Successful in 3m53s
/ Integration (pull_request) Successful in 3m49s
/ Test (pull_request) Successful in 4m45s
/ E2E Browser (pull_request) Successful in 3m49s
2026-07-03 19:18:20 +00:00
Compare

Recompute Match Score — kebab open screenshot (recompute-match-score-kebab-open)

recompute-match-score-kebab-open

**Recompute Match Score — kebab open screenshot** (recompute-match-score-kebab-open) ![recompute-match-score-kebab-open](/attachments/cedb2536-3555-4119-aaac-1fb1de058438)

Workflow Detail page screenshot (wf-detail-older-execution)

Older completed ContinueAsNew epoch detail — execution ID and state visible, Cancel absent.

wf-detail-older-execution

**Workflow Detail page screenshot** (wf-detail-older-execution) Older completed ContinueAsNew epoch detail — execution ID and state visible, Cancel absent. ![wf-detail-older-execution](/attachments/c34bd954-b09f-4a20-8329-593d13f1616f)
zombor force-pushed bd-bookshelf-n51o from 57c3adbb57
All checks were successful
/ E2E API (pull_request) Successful in 2m41s
/ JS Unit Tests (pull_request) Successful in 1m43s
/ Lint (pull_request) Successful in 3m53s
/ Integration (pull_request) Successful in 3m49s
/ Test (pull_request) Successful in 4m45s
/ E2E Browser (pull_request) Successful in 3m49s
to 410bc7cceb
All checks were successful
/ JS Unit Tests (pull_request) Successful in 33s
/ E2E API (pull_request) Successful in 2m53s
/ Lint (pull_request) Successful in 4m15s
/ Integration (pull_request) Successful in 4m16s
/ E2E Browser (pull_request) Successful in 4m18s
/ Test (pull_request) Successful in 5m15s
2026-07-03 20:19:07 +00:00
Compare

Recompute Match Score — kebab open screenshot (recompute-match-score-kebab-open)

recompute-match-score-kebab-open

**Recompute Match Score — kebab open screenshot** (recompute-match-score-kebab-open) ![recompute-match-score-kebab-open](/attachments/04197ea3-1b1d-4851-8206-7d264e68d59c)

Workflow Detail page screenshot (wf-detail-older-execution)

Older completed ContinueAsNew epoch detail — execution ID and state visible, Cancel absent.

wf-detail-older-execution

**Workflow Detail page screenshot** (wf-detail-older-execution) Older completed ContinueAsNew epoch detail — execution ID and state visible, Cancel absent. ![wf-detail-older-execution](/attachments/f8251d1d-458e-44f6-8b2e-6b45738a417a)
zombor merged commit f027793508 into main 2026-07-03 20:24:59 +00:00
Sign in to join this conversation.
No reviewers
No labels
No milestone
No project
No assignees
2 participants
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference
zombor/pergamum!897
No description provided.