feat(scan): filesystem scanner — read-only walk (bookshelf-axt.2) #21
Loading…
Add table
Add a link
Reference in a new issue
No description provided.
Delete branch "bd-bookshelf-axt.2"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Summary
internal/library/scanpackage withWalk()function andScanResulttypeBOOK_PER_FILEandFOLDER_PER_BOOKorg modes, incremental mtime-based scanning, SHA-256 initial hash (first 1MB), context cancellation, hidden file/dir skipping, and sidecar detectiontesting/fstest.MapFS— no real filesystem, no DB writesTest plan
go test ./internal/library/scan/... -coverprofile=cover.out→ 100% coverage, 52 specs passinggo vet ./internal/library/scan/...→ cleanCloses bead bookshelf-axt.2 on merge.
Security Review
Threat checks
Findings
1. Path traversal — SAFE
Walker calls
fs.WalkDir(fsys, ".")— all paths are relative and constrained by Go'sio/fscontract. Paths with..or a leading/are rejected by the standard library. No escape vector here.2. Symlink following — NOTE (undocumented footgun)
The walker receives
fsys fs.FSfrom the caller.os.DirFS(the obvious production value) follows symlinks by default. A symlink inside the library root pointing to/etc/passwdor~/.ssh/id_rsawould be opened and its first 1 MB hashed. The PR does not document whether symlink-following is intentional, nor does it apply any Lstat-vs-Stat mitigation. Recommend: either document the decision explicitly, or add a symlink guard at the Open boundary.3. Memory exhaustion — MUST FIX
readFile(injected asfs.ReadFilein production) reads the full file into[]bytebeforeinitialHashslices to 1 MB. A single 10 GB ebook would allocate 10 GB of heap before the cap is applied. The 1 MB guard ininitialHashprotects the hash, not the allocation. Fix: useio.LimitReader(limit toInitialHashBytes) inside thereadFileimplementation, or pass amaxBytesargument sobuildResultcan read-limited from the start.4. Error channel stall — NOTE
errsis buffered at 16.sendErrblocks if the buffer is full andctxis still live (only thectx.Done()select arm provides an escape). With a library containing many unreadable files (e.g., strict ACLs on a large NAS), the walker goroutine will stall until the consumer drains the error channel. This is not a leak (cancel unblocks it) but it is a latency hazard. Consider increasing the buffer, using a non-blocking send-with-drop strategy for errors, or documenting that the caller must drainerrsconcurrently.5. TOCTOU — LOW RISK, NOTE
d.Info()(in-memory stat from WalkDir) and the subsequentreadFileopen are separate operations. On a live filesystem a symlink could race into place between the two. Very low probability for the local-library use case; worth a brief comment in the code.6. DISK_TYPE / network mount DoS — NOTE
The Grimmory
DISK_TYPE=LOCAL|NETWORKtoggle is not wired into the walker. Walking an NFS/SMB mount with no per-file timeout means a stalled mount can block the goroutine indefinitely (only top-levelctxcancellation helps). IfDISK_TYPE=NETWORKpaths are in scope, acontext.WithTimeoutper folder or file open would add meaningful protection.7. Goroutine leak on ctx cancel — SAFE
Both
send()andsendErr()select onctx.Done(), so the goroutine always exits when the context is cancelled regardless of consumer behaviour. Channels are closed viadeferin the goroutine. Verified by the cancellation tests inwalk_test.go.8. FD exhaustion — SAFE
walkFolderPerBookaccumulates entries in memory first, then opens files serially one at a time insideemitFolder. At most one file descriptor is held open at a time per walk.Verdict
⚠️ Notes — one must-fix (memory allocation unbounded before 1 MB cap), one undocumented security-relevant decision (symlink following), two low-priority notes (error channel stall, network mount timeout).
Code Review
Spec compliance
Walk(readFile)returningfunc(ctx, fsys, allowed, orgMode, since) (<-chan ScanResult, <-chan error)(walk.go:77–111)pickPrimarybyformatPriorityindex (walk.go:283–332)!since.IsZero() && !info.ModTime().After(since)(walk.go:172–174)initialHashtruncates then sha256.Sum256 (walk.go:379–390)defer close(results); defer close(errs)in goroutine;send()selects on ctx.Done() (walk.go:99–100, 115–122)Project conventions
defer closeon both channels; send/ctx.Done select guards all sendsFindings
Must-fix
walkFolderPerBookbuffers everything before emitting — not streaming (walk.go:218–270): Thefoldersmap andorderslice collect ALL entries from the entire filesystem in the first WalkDir pass, then emit in a second loop. For a library with hundreds of thousands of books this holds all paths in memory simultaneously — directly violating the Scale hard rule ("Streaming over buffering") and the spec's own wording ("Streams ScanResult values on a channel").walkBookPerFileis correctly streaming. Fix: emit each folder's results inside the WalkDir callback as soon as a directory boundary is crossed (track the previous dir; when it changes, flush and emit the previous folder's entries).Important
Multiple
Expectinside a loop in oneItblock (walk_test.go:583–587):It("all results are supported formats")iteratesresultsand callsExpectinside the loop — that is N assertions perIt. Project convention is exactly ONEExpectperIt. Split into separateItblocks or assert on a derived value (e.g.Expect(bookTypes).To(ConsistOf("epub", "pdf"))).Expect(true).To(BeTrue())padding specs (walk_test.go:743, 907, 1061, 1102, 1128): FiveItblocks asserttrue == trueand provide zero behavioural coverage. The channel-close guarantee is proven by the test not hanging — noExpectneeded at all — but these bodies inflate the spec count with non-assertions. Replace with a meaningful assertion (e.g.Expect(results).NotTo(BeNil()),Expect(len(results)).To(BeNumerically("<", 10))) or remove theItbody'sExpectentirely and use a comment noting the test guards against hang.Minor / nit
cbrabsent fromformatPriority(walk.go:46):cbris a spec-listed format and is in the default allowed map, but is not informatPriority. It works (falls through to thelen(formatPriority)sentinel and still gets elected primary when it is the only allowed file in a folder), but its relative priority vs other formats when multiple formats are present in a folder is undefined. A folder with bothepubandcbrwill correctly preferepub, but a folder with bothcbrandpdfwill preferpdf(since cbr's priority = 7, pdf = 1). Whether that is intentional is worth documenting or fixing by addingcbrto the list.send()return discarded inemitFolder(walk.go:298, 310):_ = send(...)inemitFolder— when ctx is cancelled mid-folder, thefs.SkipAllsentinel returned bysend()is thrown away. The loop does re-checkctx.Err()at the top of each iteration (walk.go:286–288), so cancellation is caught one iteration late rather than immediately. Not a correctness bug, but mildly inconsistent withwalkBookPerFilewhich propagates the sentinel back tofs.WalkDir.drainWalkdrains channels serially (walk_test.go:427–440): Results are fully drained before errors are drained. Becauseerrsis buffered at 16, this is safe in all current tests. If any future test injects >16 simultaneous errors the goroutine would deadlock (blocked on errs send while drainWalk is waiting for results to close). Consider usingsync.WaitGroup+ two goroutines to drain both channels concurrently, or at minimum add a comment noting the 16-error limit.Verdict
⚠️ Must-fix before merge: Finding #1 (FOLDER_PER_BOOK is not streaming — violates the Scale hard rule that is explicitly called out as a first-class project requirement). Findings #2 and #3 (test quality) should be fixed but are not blocking if the author accepts them as follow-up beads. Findings #4–6 are nits.
Review-fix commit
Addressed (must-fix)
Addressed (should-fix)
Deferred (follow-up beads)
CI: success
Mergeable: True
Code Review (round 2)
Round-1 must-fix verification
walk.go:409-415:f, err = openFile(fsys, path)thenio.Copy(h, io.LimitReader(f, InitialHashBytes)). Allocation bounded to hash-state + io.Copy buffer.walk.go:231-287: usescurrentDir/pending []folderEntryboundary-detection.flush()is called at each dir boundary AND after the WalkDir loop. At most one folder in memory at a time.Round-1 should-fix verification
walk_test.go:624-630:ConsistOf("epub", "pdf")used correctlywalk.go:26-37: full package-level comment explainingos.DirFSfollows symlinks, calls it a known footgun, notes future beadwalk.go:56:["epub", "pdf", "cbz", "cbr", "mobi", "azw3", "m4b", "mp3"]walk.go:316-319(book branch) andwalk.go:330-332(sidecar branch) both check== fs.SkipAlland returnFresh findings
Minor — residual tautology at
walk_test.go:785-790: The It block "channels close after cancellation while blocked on send" setsresults = nilat line 781 then assertsExpect(results).To(BeNil()). The assertion can never fail. The real value is the channel drain loops not hanging (a hang-guard pattern). The Expect adds no signal — should either be deleted or replaced with a comment explaining the hang-guard intent.Minor —
drainWalkdrain order (walk_test.go:465-471): Drainsresultsto completion before touchingerrs. If a goroutine is blocked onerrs <-(buffer of 16 full) whileresultsis also blocked, this could deadlock in a future test that generates >16 errors alongside results. Not triggered by current tests, but fragile helper.Informational —
sincefilter in FOLDER_PER_BOOK is per-file (walk.go:272-274): An old-mtime primary epub is silently excluded frompendingwhile a new-mtime sidecar in the same folder is included. The folder emits with noIsBook=trueentry. Likely acceptable design (the epub was already ingested on a prior scan), but undocumented. Consider a comment or a test asserting this intentional behavior.No leftover
readFilereferences — confirmed clean throughout.Verdict
Ready to merge. The two blocking must-fixes from round 1 are correctly addressed. All should-fixes landed. Remaining findings are minor and do not block merge.
Security Review (round 2)
Round-1 must-fix verification
initialHashusesio.LimitReader— memory boundedWalksignature acceptsopenFile func(fs.FS, string) (fs.File, error)(returnsfs.File, not[]byte)initialHashcallsopenFile(fsys, path), thenio.Copy(h, io.LimitReader(f, InitialHashBytes))— allocates only the SHA-256 hash state +io.Copy's internal buffer, completely independent of file sizefs.Fileproperly closed viadefer f.Close()with//nolint:errcheckand an explicit justification commentRound-1 notes status
os.DirFSfollow-symlink behavior, provides a concrete attack example (/etc/passwd), labels it intentional, references a future bead for a guard at theOpenboundaryFresh security pass on the refactor
pathvalues come fromfs.WalkDir(which rejects..at the API boundary); raw WalkDir paths flow intoopenFile, never string-concatenatedclose(results)andclose(errs). Cancellation checked at every loop entry and inside every send viaselect { case ...: case <-ctx.Done(): }Verdict
✅ Clean — all round-1 must-fixes confirmed, symlink risk explicitly documented, streaming refactor introduces no new vulnerabilities.