fix(middleware): data race + cross-user contamination in WrapHealthCheck (bookshelf-n51o) #897
Loading…
Add table
Add a link
Reference in a new issue
No description provided.
Delete branch "bd-bookshelf-n51o"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Summary
lastIDsinWrapHealthCheckwas written from request goroutines without any lock, and read by the fill goroutine insideNavCache's mutex — a real-racefinding.Fix
Replace the single shared
lastIDs+ singleNavCachewith a keyed map ofNavCacheinstances:healthKey(ids)derives a canonical sorted-IDs string ("1,2,3")*NavCache[map[int64]bool]under async.Mutexids— no unsynchronised shared variableTest plan
It: "different ID sets return independent cache entries" — proves no contamination (was failing before fix)It: "same ID set in different order shares one cache" — proves key is order-independentIt: 20 concurrent goroutines with distinct ID sets each fill exactly oncego test -race ./internal/middleware/— clean (race was detected before fix)make coverage— 100% gate passesCloses bead bookshelf-n51o on merge.
Workflow Detail page screenshot (wf-detail-older-execution)
Older completed ContinueAsNew epoch detail — execution ID and state visible, Cancel absent.
Recompute Match Score — kebab open screenshot (recompute-match-score-kebab-open)
Security Review — PR #897 (bookshelf-n51o) — nav-cache cross-user fix
Scope
Reviewed:
internal/middleware/nav_cache.go+internal/middleware/nav_cache_test.goFocus areas per dispatch: cross-user cache isolation, unbounded memory, PII/secrets in keys, data-race elimination.
Findings
[MINOR] internal/middleware/nav_cache.go:186 —
cachesmap never evicts stale entries (memory leak)The
caches map[string]*NavCache[map[int64]bool]inside theWrapHealthCheckclosureaccumulates one entry per distinct library-ID-set seen and never removes entries.
In a long-running server where users add or remove libraries, old configurations generate
new keys; the entries for superseded sets are never reclaimed. Growth is bounded by real
system cardinality (library IDs come from server-side ownership resolution, not attacker
input), so this is not an injectable DoS — but it is a slow memory leak proportional to
distinct user-library-configuration churn.
Suggested fix: after
c.Get(ctx)returns, check whether the cache's internal entry isstale/nil and prune the map entry, or keep a last-access timestamp and run a periodic
sweep under
mu.[MINOR] internal/middleware/nav_cache.go:174 — fill closure holds shallow copy of caller's slice backing array
fill := idscopies the slice header (pointer, length, capacity), not the underlyingarray. If a future caller mutated the slice's backing array after calling the returned
function, the fill closure would silently use the mutated IDs on the next cache miss
for that key. In the current usage in
nav.go—libIDsis freshly allocated on eachrequest and goes out of scope immediately — this is safe. But it is a subtle contract that
is not documented and could silently corrupt cache fills if the call site changes.
Suggested fix: deep-copy inside the
if !okblock:Non-findings (confirmed clean)
Cross-user isolation: CONFIRMED CORRECT.
libIDsis derived fromlistLibraries(ctx, userID, navFetchLimit)whereuserIDis extracted from theauthenticated session via
extractUser(r)(a closure overusers.ClaimsFromContext),never from a request body or query parameter. An attacker cannot influence their
library-ID set to collide with another user's cache key without actual identical
library membership.
Key collision: NONE POSSIBLE.
healthKeysorts IDs (allint64) and joins with,. Since IDs are rendered as decimal integers and the separator is a non-digit, distinctsets always produce distinct strings (e.g., {3, 12} →
"3,12"vs {1, 23} →"1,23").Data race: ELIMINATED. The old unsynchronized
lastIDswrite is gone. The newmu sync.Mutexserialises all reads and writes tocaches.NavCache.Get()has its owninternal mutex. The
fillvariable is assigned inside theif !okblock under the lock;once the lock is released,
cis a valid immutable pointer andc.Get()is goroutine-safe.PII / secrets in key or logs: None. Library IDs are integer PKs. The key string is
used only as a map key and is never logged or echoed.
Empty/nil IDs edge case:
healthKey(nil)→""andhealthKey([]int64{})→"".Correct — both represent "no libraries." The call site guards with
len(libs) > 0sothis path is never reached in practice.
REVIEW VERDICT: 0 blocker, 0 major, 2 minor
CODE REVIEW: NOT APPROVED
Phase 0: DEMO Verification
No formal DEMO block was present in the bead or completion comment. The completion comment states "go test -race clean" but provides no expected output to compare against. Per review standard, this is a NOT APPROVED trigger.
However, I independently ran
go test -race ./internal/middleware/...against the branch and it passes cleanly. Proceeding with full review so the team can address all findings in one pass.Phase 1: Spec Compliance
The bead spec calls for: (1) eliminate the
lastIDsdata race, (2) eliminate cross-user contamination. Both are addressed. The per-key NavCache approach is correct for the stated goals. No missing requirements, no out-of-scope work.Phase 2: Code Quality
[MAJOR] internal/middleware/nav_cache.go:163 — unbounded map growth; stale cache-key entries never evicted
The
cachesmap accumulates one*NavCacheper distinct library-ID combination ever queried, with no eviction. Each per-key NavCache has a TTL for its data, but the map entry itself is permanent. When a user's library membership changes from {1,2} to {1,2,3}, the old "1,2" entry stays in the map forever — its fill closure holds a []int64{1,2} that will never be called again, and the NavCache struct (mutex, entry pointer, inflight pointer) leaks. In a long-lived process where library memberships change over time, the map grows monotonically with no ceiling. Fix: track lastGet time.Time per NavCache (updated under the outer mu) and periodically sweep entries idle for >2xTTL, or cap the map at a small size with LRU eviction. File a follow-up bead if not fixing here.[MINOR] internal/middleware/nav_cache_test.go:458 — It blocks mix setup, invocation, and assertion (convention violation)
The new It blocks inside "different ID sets get independent cache entries" call fn directly inside the It body — both the priming call and the call under test — then the single Expect. Per project conventions: BeforeEach = setup only, JustBeforeEach = call the SUT, It = single Expect. The priming call belongs in BeforeEach, the tested call in JustBeforeEach. Same applies to the concurrent It at line 499 which spawns goroutines and asserts inside the same It block. Tests still exercise correct behaviors but the structure mixes invocation and assertion.
healthKey correctness
Sorted copy before join: canonical. Separator comma cannot appear in decimal int64 strings: collision-free. Empty-slice maps to empty-string key (valid -- all no-library users share one entry, which is correct). Total order with strict < on unique IDs. No issues.
Race analysis (confirmed eliminated)
All caches map accesses are under mu. Each *NavCache has its own mutex guarding entry/inflight. The fill local variable is created under mu and captured by the fill closure; no goroutine writes to it after creation. mu.Unlock() before c.Get() is correct -- Go memory model guarantees visibility of the stored *NavCache to any goroutine that subsequently acquires mu. No double-checked locking. Race confirmed eliminated.
REVIEW VERDICT: 0 blocker, 1 major, 1 minor
[MAJOR] Add eviction to healthMapEntry map (unbounded growth): - Add healthMapEntry struct tracking lastGet time.Time per per-ID-set NavCache. - On each Get call (under outer mu), sweep entries idle > 2×TTL and delete them. - Add now func() time.Time parameter to WrapHealthCheck (nil → time.Now) so deterministic tests can control the clock (mirrors NewNavCache pattern). - Update internal/app/app.go call site: WrapHealthCheck(..., navCacheTTL, nil). - Add eviction test: warm {1}, advance 21s past 2×10s TTL, call {2} (triggers sweep), re-call {1} → fn invoked again (callCount=3 proves eviction). [MINOR] Deep-copy ids slice in fill closure: - Replace `fill := ids` (shallow, shares backing array) with `fill := make([]int64, len(ids)); copy(fill, ids)` to eliminate the risk of a caller mutating the array after WrapHealthCheck captures the fill closure. [MINOR] Restructure test Its per project convention: - "different ID sets" and "concurrent" Describes now use BeforeEach (setup + priming), JustBeforeEach (SUT call), one Expect per It. - Inner Context blocks separate the three distinct scenarios (second distinct key, same key caching, order-independent key) with clean separation. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>Recompute Match Score — kebab open screenshot (recompute-match-score-kebab-open)
Workflow Detail page screenshot (wf-detail-older-execution)
Older completed ContinueAsNew epoch detail — execution ID and state visible, Cancel absent.
7581f90f99e81c55b97dRecompute Match Score — kebab open screenshot (recompute-match-score-kebab-open)
Workflow Detail page screenshot (wf-detail-older-execution)
Older completed ContinueAsNew epoch detail — execution ID and state visible, Cancel absent.
e81c55b97d57c3adbb57Recompute Match Score — kebab open screenshot (recompute-match-score-kebab-open)
Workflow Detail page screenshot (wf-detail-older-execution)
Older completed ContinueAsNew epoch detail — execution ID and state visible, Cancel absent.
57c3adbb57410bc7ccebRecompute Match Score — kebab open screenshot (recompute-match-score-kebab-open)
Workflow Detail page screenshot (wf-detail-older-execution)
Older completed ContinueAsNew epoch detail — execution ID and state visible, Cancel absent.