feat(dedup): detection presets + criteria toggles + series-volume guard (bookshelf-c15a.2, bookshelf-7c5m) #567
Loading…
Add table
Add a link
Reference in a new issue
No description provided.
Delete branch "bd-bookshelf-c15a.2"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Summary
book_metadata.hardcover_idandgoodreads_idfor the External ID finder.FindDuplicatesByTitleAuthornow excludes groups where books have different non-nullseries_numbervalues (Eleria vol 1 ≠ vol 3 even though they share title+author). Fixed viaHAVING COUNT(DISTINCT CASE WHEN series_number IS NOT NULL THEN series_number END) <= 1.Test plan
FindDuplicatesBy*finders,PresetCriteria(),parseScanRequest()(JSON body + query params including custom preset),Scan()with each criteria combinationinternal/dedup/(excludingwire.gowhich is wiring)make build+golangci-lint run ./internal/dedup/...cleanCloses bead bookshelf-c15a.2 and bookshelf-7c5m on merge.
6fa9657ed6082b170c16Find Duplicates v2 — preset chips + criteria toggles
Screenshots from the browser e2e test:
Modal: preset chips (Strict / Balanced / Aggressive / Custom) + advanced criteria, Balanced active by default
Strict selected — Title+Author criterion deselected, Advanced panel open
Scan results with Balanced preset — ISBN Match group
Security Review — PR #567 (bookshelf-c15a.2 + bookshelf-7c5m)
Branch:
bd-bookshelf-c15a.2Scope: dedup detection presets + 5 criteria toggles (ISBN / External ID / Title+Author / Same Directory / Filename) + series-volume false-positive fix
Phase 0: Multi-user / Library Scope
Reviewed all five new/modified query functions in
internal/dedup/store.go.FindDuplicatesByISBN (modified): unchanged scoping, clean.
FindDuplicatesByTitleAuthor (modified — series-volume fix):
b2.library_id IN (%s)(inner sub),b3.library_id IN (%s)(grp sub),b.library_id IN (%s)(outer WHERE). Three parameterized scope anchors. Clean.FindDuplicatesByExternalID (new,
store.go:284): hardcover UNION ALL goodreads — both sub-selects haveb2.library_id IN (%s)/b3.library_id IN (%s). OuterWHERE b.library_id IN (%s). Three separate%sexpansions, all backed byuserLibraryIDs(never user-supplied). Fail-closed: emptyuserLibraryIDsshort-circuits at the top of the function before any SQL is built. Clean.FindDuplicatesByDirectory (new,
store.go:386): inner subqueryb2.library_id IN (%s), outerWHERE b.library_id IN (%s). TheONclause uses a correlated subquery onbf3with no explicit library filter —bf3.book_id = b.idandbis already constrained by theWHERE b.library_id IN (%s), sobf3can only resolve against books already in scope. Thegrpsubquery (dir-key grouping) is also restricted tob2.library_id IN (user scope). No cross-library directory key leakage. Clean.FindDuplicatesByFilename (new,
store.go:488): same two-anchor pattern as Directory —b2.library_id IN (%s)in the inner fname subquery,b.library_id IN (%s)in the outer WHERE. Correlatedbf3is again anchored onb.idwhich is already scoped. Clean.Handler scoping (
internal/dedup/handler.go:101-125):getUserLibraryIDsfires from authenticated session (extractUser(r).ID), result is narrowed tofilterToLibrary(userLibraryIDs, libraryID). The criteria parsing (parseScanRequest) runs after the scope check and only controls which boolean flags (ScanCriteria) are passed to the scan function. The criteria flags never widen the library scope — they only select which parameterized queries execute. Fail-closed:userID=0(unauthenticated) maps to no libraries →filterToLibraryreturns empty →ErrNotFoundbefore any scan. Clean.Phase 1: SQL Injection
All user-supplied inputs that reach SQL travel through parameterized
?placeholders:preset/ individual flag strings from JSON body / query params → decoded intoScanCriteria(Go struct booleans) inparseScanRequest. The string value ofpresetnever reaches SQL. Unknown preset falls back tobalancedinPresetCriteria(dedup.go:42). Zero injection surface.afterKey/afterID(cursor) → used as?bind parameters (ORDER BY … AND grp.key > ?). Never interpolated.libInplaceholder string is built fromlen(userLibraryIDs)literal?characters, not from any user string.LIKE '%%/%%'in the Directory inner query is a Gofmt.Sprintf-escaped literal that producesLIKE '%/%'— not derived from user input.ORDER BYcolumns are all hard-coded column references (grp.isbn_13 ASC,grp.provider_key ASC,grp.dir_key ASC,grp.fname ASC). No dynamic ORDER BY from user input.No SQL injection surface found.
Phase 2: XSS / Template
templates/pages/dedup_modal.html: new preset chip<button>elements and checkbox<input>elements. All attribute values are static strings. No{{.}}interpolation in the new HTML. Nostyle=attributes. Clean.templates/pages/dedup_results.html: the fallback branch{{else}}{{$g.MatchType}}{{end}}emitsMatchTypein a text node —html/templateauto-escapes it.MatchTypeoriginates from hard-coded SQL literal strings ('isbn','external_id', etc.) so attacker cannot inject arbitrary content even before escaping. Clean.static/js/controllers/find_duplicates_controller.js: noelement.style.assignments, noinnerHTMLwith user data, noeval. Criteria body isJSON.stringify({preset: ...})wherepresetis read from adata-presetattribute on a server-rendered button (static strings in the template). Clean.static/css/main.css: new CSS classes only, no inlinestyle=. CSP-compatible. Clean.Phase 3: No New Unauthenticated / Unscoped Surface
No new routes were added. The existing three dedup routes (
GET /libraries/{id}/dedup,POST /libraries/{id}/dedup/scan,POST /libraries/{id}/dedup/merge) already existed and required a mapped user. The wire-up (wire.go) only adds three new finder functions to the existingScanclosure — no new HTTP surface. Clean.Findings
None.
REVIEW VERDICT: 0 blocker, 0 major, 0 minor
CODE REVIEW: NOT APPROVED
Phase 0: DEMO Verification
No DEMO block required for this review (code-review agent, not bead-completion verification).
Phase 1: Spec Compliance
All spec requirements are present: Strict/Balanced/Aggressive/Custom presets, all 5 criteria toggles (ISBN, External ID, Title+Author, Same Directory, Filename), 7c5m series-volume guard in SQL HAVING, match badge updated for all 5 match types, browser test present, multi-user library scoping applied fail-closed to all criteria, migration 0036 adds hardcover/goodreads indexes. Structure follows project conventions.
Phase 2: Code Quality
[MAJOR] internal/dedup/store.go:334-338 — ExternalID: book with both hardcover_id AND goodreads_id silently destroys the hardcover duplicate group
The JOIN ON uses OR to match either provider key. If book A has both hardcover_id=hc-001 AND goodreads_id=gr-001, and both form independent duplicate groups (A+B share hc-001; A+C share gr-001), MySQL returns book A twice in the query output — once per matching grp row. The Scan.seen map in service.go:~92 marks A on its first encounter (e.g. gr:gr-001), then drops the second occurrence (hc:hc-001). The hc: group is left with only book B (single-book group), which is filtered out. A valid hardcover_id duplicate group (A+B) is silently destroyed.
Suggested fix: rewrite using UNION (separate hc and gr outer queries merged with UNION ALL), or add GROUP BY b.id, grp.provider_key before ORDER BY to collapse the OR-expanded duplicates within the query. Either approach ensures each (book_id, provider_key) appears at most once in the result set.
[MAJOR] internal/dedup/store.go:435-441, 529-534 — Directory and Filename criteria use per-row correlated subqueries with no covering index; violates 250k scale requirement
Both FindDuplicatesByDirectory and FindDuplicatesByFilename join the outer book row to grp via a correlated subquery in the ON clause:
JOIN grp ON (
SELECT SUBSTRING_INDEX(bf3.file_sub_path, '/', ...)
FROM book_file bf3
WHERE bf3.book_id = b.id AND bf3.is_book = 1
LIMIT 1
) = grp.dir_key
This executes once per outer book row. At 250k books that is 250k per-row correlated scans before LIMIT applies. The existing idx_book_additional_file_book_id covers only (book_id) and cannot serve the SUBSTRING_INDEX expression without fetching the row. Migration 0036 adds no index on book_file(book_id, is_book, file_sub_path). The project scale requirement says GET /books p99 < 100ms and all query paths must be designed for 250k books from day one; this query pattern violates that.
Suggested fix: (1) add a covering index (book_id, is_book, file_sub_path) to book_file in this migration, AND (2) rewrite as a derived table (like bf_main is already done) keyed by book_id so there is no per-row correlated subquery — join bf_dir (SELECT book_id, SUBSTRING_INDEX(...) AS dir_key FROM book_file WHERE is_book=1 GROUP BY book_id) and then join to grp on equi-join.
[MAJOR] e2e/api/dedup_test.go — No positive e2e test for Same Directory or Filename criteria
The only e2e test covering the Filename criterion ("custom preset with only filename returns empty", line ~283) seeds books with no book_file rows and asserts empty results. This is a negative-only test — a bug in the FindDuplicatesByDirectory or FindDuplicatesByFilename SQL (wrong SUBSTRING_INDEX, wrong column, wrong library filter) would not be caught by CI since the result is empty either way. There is no positive e2e test that seeds books with matching file_sub_path values and verifies a group is returned with the correct match_type.
Suggested fix: add seedDirectoryDuplicateLibrary() and seedFilenameDuplicateLibrary() helpers (analogous to seedDedupLibrary/seedEleriaLibrary) that insert book_file rows with matching paths, then assert the scan returns a group with match_type "same_directory" / "filename".
[MINOR] e2e/browser/find_duplicates_test.go:541-542 — Checkbox state verified via HTML attribute instead of DOM property; assertion is vacuous
checked, _ := titleAuthorCB.Attribute("checked")
Expect(checked).To(BeNil())
Attribute("checked") reads the HTML attribute, not the DOM .checked property. Since the template (dedup_modal.html) never renders a checked attribute on any checkbox, this always returns nil regardless of whether _applyPreset ran. The test passes even if the JS preset logic is completely broken.
Suggested fix: isChecked := titleAuthorCB.MustEval("() => this.checked").Bool() then Expect(isChecked).To(BeFalse()).
[MINOR] e2e/api/dedup_test.go:17 — Stray double blank line after import block (style nit)
REVIEW VERDICT: 0 blocker, 3 major, 2 minor
[MAJOR 1] ExternalID OR join: replace single OR-join with UNION ALL of two equi-joined sub-queries (hc + gr) so a book with both provider IDs appears exactly once per provider group. Change seen-map key to (bookID, groupKey) so both provider groups survive (old bookID-only key silently dropped the second group). Update cross-criteria tests to reflect new behavior (each distinct (book, group) pair gets its own group; exact duplicates still deduplicated). Update browser tests to use Strict preset where single-group count assertions are needed. [MAJOR 2] Directory/Filename correlated-subquery: replace per-outer-row correlated subquery with a derived-table EQUI-JOIN (bf_dir / bf_fname). Add migration 0037 adding covering index idx_book_file_dedup_dir (book_id, is_book, file_sub_path) so both queries are served from the index. Fix GROUP BY to use MIN(file_sub_path) to satisfy only_full_group_by. [MAJOR 3] Positive e2e tests: add seedDirectoryDuplicateLibrary, seedFilenameDuplicateLibrary, and both-ids external-ID scenario seeder + full Describe blocks for same_directory, filename, and external_id both-groups. [MINOR] Browser checkbox: use MustEval("() => this.checked").Bool() instead of Attribute("checked") (HTML attr is vacuous after JS toggles state). [MINOR] Stray double blank line in dedup_test.go removed. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>CODE REVIEW (RE-REVIEW): APPROVED
Re-review of #567 (bd-bookshelf-c15a.2) against prior findings in comment 6689.
Prior Finding Verification
MAJOR 1: ExternalID OR-dup loss — RESOLVED
The fix rewrites FindDuplicatesByExternalID using two independent EQUI-JOIN subqueries (hardcover and goodreads) merged with UNION ALL (store.go:278-381). Each provider query uses an equi-join with no OR. A book with both hardcover_id and goodreads_id appears exactly once per provider group.
The service.go seen-map key is now a (bookID, groupKey) composite (service.go:92) — so book A can appear in both hc:HC001 and gr:GR001 groups without being dropped. The e2e test seeds exactly the A+B (hc) / A+C (gr) scenario and asserts HaveLen(2). Finding is resolved.
MAJOR 2: Directory/Filename correlated-subquery — RESOLVED
Both FindDuplicatesByDirectory and FindDuplicatesByFilename now use derived-table EQUI-JOINs with no per-row correlated subquery (store.go:453-493 for directory, store.go:551-595 for filename). Migration 0037 adds idx_book_file_dedup_dir (book_id, is_book, file_sub_path) on book_file. All ORDER BY clauses use a two-column total order with unique tiebreaker. Finding is resolved.
MAJOR 3: Positive e2e tests for same_directory, filename, both-ids external_id — RESOLVED
All three are positive tests. Finding is resolved.
MINOR: Browser checkbox MustEval bool read — RESOLVED
e2e/browser/find_duplicates_test.go now uses titleAuthorCB.MustEval("() => this.checked").Bool(). Finding is resolved.
MINOR: Stray double blank line — RESOLVED
The extra blank line after the import block is gone. Finding is resolved.
Fresh-Eyes Pass
Multi-user/library scoping: Both new directory and filename finders filter by userLibraryIDs in the inner grp subquery AND the outer WHERE clause. Empty-slice fail-closed guard present. Correct.
Scale: No N+1. All new queries use derived-table joins. ORDER BY uses total-order two-column keys. LIMIT bound on all queries. Covering index on book_file (migration 0037). Clean.
Grimmory compat: Migrations 0036 and 0037 add indexes only, no new columns. Correct.
CSP: No inline style= in templates. CSS classes used throughout. Clean.
[MINOR] e2e/browser/find_duplicates_test.go:494 — single It block contains multiple Expect assertions
The browser spec chains four independent assertions (balanced chip active, strict chip active, titleAuthor unchecked, badge text). Project convention requires one Expect per It. Split into four It blocks. Does not block merge.
REVIEW VERDICT: 0 blocker, 0 major, 1 minor
Security Review - PR #567 (bookshelf-c15a.2) - RE-REVIEW of fixes for comment 6689
Head SHA:
b14f04d5Scope: re-review of 3 MAJOR + 2 MINOR fixes from prior review comment 6689; adversarial check for new issues introduced by the fix commits.
Phase 0: Prior findings - verification
[MAJOR #1] ExternalID OR-join silent group destruction - FIXED. FindDuplicatesByExternalID now uses UNION ALL of two separate sub-selects (hardcover, goodreads), each with its own equi-join. A book with both hardcover_id and goodreads_id appears once per provider. The Scan seen-map key is now bookID+NUL+groupKey, allowing one book to appear in both the hc: and gr: provider groups. Verified in e2e/api/dedup_test.go:738.
[MAJOR #2] Directory/Filename per-row correlated subqueries - FIXED. Both finders now use derived tables (bf_dir, bf_fname) GROUP BY book_id ahead of the outer join - no per-row correlated subquery. Migration 0037 adds covering index idx_book_file_dedup_dir (book_id, is_book, file_sub_path) allowing index-only scans.
[MAJOR #3] No positive e2e tests for Directory/Filename - FIXED. seedDirectoryDuplicateLibrary() and seedFilenameDuplicateLibrary() added at e2e/api/dedup_test.go:533 and :585. Positive scan tests verify match_type == "same_directory" and "filename".
[MINOR #1] Checkbox checked attribute vs DOM property - FIXED. e2e/browser/find_duplicates_test.go:545 now uses titleAuthorCB.MustEval("() => this.checked").Bool().
[MINOR #2] Double blank line after import block - FIXED.
Phase 1: New findings in the fix commits
[MAJOR] internal/dedup/service.go:71-114, store.go:379,493,594 - Multi-criteria pagination cursor: single shared (afterKey, afterID) passed to all finders with incompatible key namespaces; second page silently broken for every preset with 2+ criteria
Scan encodes the cursor as a single {AfterKey string, AfterID int64} pair and passes the same values to every active finder. But each finder orders on a different group key namespace:
When page 1 ends on an ISBN group (cursor {AfterKey:"9780007525508", AfterID:42}), page 2 feeds that same cursor to TitleAuthor - which skips any normalized title+author key that sorts before "9780007525508" lexicographically, silently dropping those groups. ExternalID compares an ISBN string against "gr:...","hc:..." keys - a completely different namespace. The afterID tiebreaker inconsistency compounds this: ExternalID uses book_id while every other finder uses grp.min_id, so cursors cross-applied between criteria compare semantically different IDs.
The Balanced preset (default: ISBN + ExternalID + TitleAuthor) triggers this on every library large enough to exceed one page.
Suggested fix: tag group keys with a criterion prefix ("isbn:9780007525508", "ta:the martian andy weir", "hc:abc", "dir:/fantasy/tolkien", "fn:dune.epub") so a single global cursor can order across all criteria in one unified lexical space and each finder applies a prefix-range WHERE. Alternative: encode per-criterion cursors in ScanCursor so each finder advances independently. Either way: add a multi-page pagination e2e test seeded with enough groups to exceed one page under Balanced.
[MINOR] internal/dedup/handler.go:48 - json.NewDecoder(r.Body) has no http.MaxBytesReader guard
parseScanRequest decodes the JSON body without first wrapping r.Body in http.MaxBytesReader. Risk is low (bounded by server ReadTimeout: 5s); fix with http.MaxBytesReader(w, r.Body, 4096) before decoding.
REVIEW VERDICT: 0 blocker, 1 major, 1 minor
Security Re-Review — Pass 3 (SHA
60ab9641, PR #567, bead bookshelf-c15a.2)Prior findings — resolution status
[MAJOR] Cross-criterion cursor namespace collision — RESOLVED.
The new
ScanCursor.Positions map[string]CriterionPosgives each active criterion its own(afterKey, afterID). The handler readsuserLibraryIDsand buildsscopedIDsbefore the cursor is decoded, so the cursor has no influence on the library scope. Unknown keys inPositions(attacker-injected) are silently ignored via map-lookup — they cannot widen scope or cause a panic. TheafterKey/afterIDvalues are passed only as SQL parameters in a keyset WHERE clause; the library-ID filter (WHERE b.library_id IN (?)) is applied independently in every finder. Cross-user read via cursor is not possible.[MINOR] No
http.MaxBytesReader— RESOLVED.http.MaxBytesReader(w, r.Body, 4096)is applied inScanHandlerbeforeparseScanRequestis called (handler.go:128-130). The error-mapper handles*http.MaxBytesErrorviaerrors.As(checked beforeerrors.Is(ErrValidation)in the switch), so an oversized body returns 413 not 400. The handler unit test athandler_test.go:339-344confirms 413 is returned. ✓Fresh threat-model
SQL injection — all five finder queries (
FindDuplicatesByISBN,FindDuplicatesByExternalID,FindDuplicatesByTitleAuthor,FindDuplicatesByDirectory,FindDuplicatesByFilename) pass every user-influenced value (cursorafterKey/afterID,limit+1) as positional?parameters. Library IDs are server-computed from the session and passed the same way. TheLIKEpatterns use%%/%%(Gofmt.Sprintfescaping →%/%) which are string literals, not user-controlled. No injection surface found.ExternalID UNION ALL arg binding — manually verified: HC branch gets
libIn, afterKey, afterKey, afterID, limit+1, libIn; GR branch gets the same pattern; total args =4N + 8matching the four%sexpansions + eight cursor/limit?s in the SQL. Both branches are independently scoped byuserLibraryIDs. ✓Resource/DoS — request body capped at 4096 bytes via
MaxBytesReader. The cursor query param is bounded by Go's defaultMaxHeaderBytes(1 MB) and is a server-generated opaque blob with no unbounded sub-structure. Each finder's inner subquery limits tolimit+1groups (clamped to max 200 byclampLimit); the ExternalID combined outer query has no second LIMIT but is bounded by 2×(limit+1) groups, which is documented and handled correctly by the service-layer seen-map. No unbounded scan path found.Cursor decode safety —
DecodeCursorreturns a typed error (%w: %wwrappingErrValidation) on invalid base64 or malformed JSON; no panic path. ThePositionsmap is deserialized into a Go struct with typed fields; oversized maps are bounded by the cursor string length, itself bounded by HTTP headers.Multi-user fail-closed — every finder has
if len(userLibraryIDs) == 0 { return nil, nil }as the first guard.filterToLibraryrejects requests for a library not in the user's accessible set (returns 404).userLibraryIDs` is derived from the authenticated session, never from the request body or cursor.keptKeyspagination correctness — the seen2 loop correctly breaks onlen(seen2) > limit, excluding the peek group fromkeptKeys. ThelastPosupdate only iterateslabeledRowsinkeptKeys, so the cursor points to the last book of the last returned group, not the discarded peek group. ✓New findings
No new blockers or majors. One minor style note:
[MINOR] internal/dedup/handler.go:53 — double-wrapped error format
%w: %wwith ErrValidation firstfmt.Errorf("dedup scan: decode body: %w: %w", middleware.ErrValidation, err)is an unusual multi-error pattern. ThestatusForswitch checkserrors.As(err, &maxBytesErr)beforeerrors.Is(ErrValidation), so the 413 path wins correctly and the test confirms it. The double-colon separator is cosmetic and non-standard but causes no correctness or security issue. The same pattern exists inmergehandler, suggesting it is an established codebase convention. No fix required.REVIEW VERDICT: 0 blocker, 0 major, 1 minor
Security Re-Review — Pass 3 (SHA
60ab9641, PR #567, bead bookshelf-c15a.2)Prior findings — resolution status
[MAJOR] Cross-criterion cursor namespace collision — RESOLVED.
The new
ScanCursor.Positions map[string]CriterionPosgives each active criterion its own(afterKey, afterID). The handler readsuserLibraryIDsand buildsscopedIDsbefore the cursor is decoded, so the cursor has no influence on the library scope. Unknown keys inPositions(attacker-injected) are silently ignored via map-lookup — they cannot widen scope or cause a panic. TheafterKey/afterIDvalues are passed only as SQL parameters in a keyset WHERE clause; the library-ID filter (WHERE b.library_id IN (?)) is applied independently in every finder. Cross-user read via cursor is not possible.[MINOR] No
http.MaxBytesReader— RESOLVED.http.MaxBytesReader(w, r.Body, 4096)is applied inScanHandlerbeforeparseScanRequestis called (handler.go:128-130). The error-mapper handles*http.MaxBytesErrorviaerrors.As(checked beforeerrors.Is(ErrValidation)in the switch), so an oversized body returns 413 not 400. The handler unit test athandler_test.go:339-344confirms 413 is returned.Fresh threat-model
SQL injection — all five finder queries (
FindDuplicatesByISBN,FindDuplicatesByExternalID,FindDuplicatesByTitleAuthor,FindDuplicatesByDirectory,FindDuplicatesByFilename) pass every user-influenced value (cursorafterKey/afterID,limit+1) as positional?parameters. Library IDs are server-computed from the session and passed the same way. The LIKE patterns use%%/%%(Gofmt.Sprintfescaping to%/%) which are string literals, not user-controlled. No injection surface found.ExternalID UNION ALL arg binding — manually verified: HC branch gets
libIn, afterKey, afterKey, afterID, limit+1, libIn; GR branch gets the same pattern; total args =4N + 8matching the four%sexpansions + eight cursor/limit?s in the SQL. Both branches are independently scoped byuserLibraryIDs.Resource/DoS — request body capped at 4096 bytes via
MaxBytesReader. The cursor query param is bounded by Go's defaultMaxHeaderBytes(1 MB) and is a server-generated opaque blob with no unbounded sub-structure. Each finder's inner subquery limits tolimit+1groups (clamped to max 200 byclampLimit); the ExternalID combined outer query has no second LIMIT but is bounded by 2x(limit+1) groups, which is documented and handled correctly by the service-layer seen-map. No unbounded scan path found.Cursor decode safety —
DecodeCursorreturns a typed error (wrappingErrValidation) on invalid base64 or malformed JSON; no panic path. ThePositionsmap is deserialized into a Go struct with typed fields; oversized maps are bounded by the cursor string length, itself bounded by HTTP headers.Multi-user fail-closed — every finder has
if len(userLibraryIDs) == 0 { return nil, nil }as the first guard.filterToLibraryrejects requests for a library not in the user's accessible set (returns 404).userLibraryIDsis derived from the authenticated session, never from the request body or cursor.keptKeyspagination correctness — the seen2 loop correctly breaks onlen(seen2) > limit, excluding the peek group fromkeptKeys. ThelastPosupdate only iterateslabeledRowsinkeptKeys, so the cursor points to the last book of the last returned group, not the discarded peek group.New findings
No new blockers or majors. One minor style note:
[MINOR] internal/dedup/handler.go:53 — double-wrapped error format
%w: %wwith ErrValidation firstfmt.Errorf("dedup scan: decode body: %w: %w", middleware.ErrValidation, err)is an unusual multi-error pattern. ThestatusForswitch checkserrors.As(err, &maxBytesErr)beforeerrors.Is(ErrValidation), so the 413 path wins correctly and the test confirms it. The double-colon separator is cosmetic and non-standard but causes no correctness or security issue. The same pattern exists in the merge handler, suggesting it is an established codebase convention. No fix required.REVIEW VERDICT: 0 blocker, 0 major, 1 minor
Security Re-Review — Pass 3 (SHA
60ab9641, PR #567, bead bookshelf-c15a.2)Prior findings — resolution status
[MAJOR] Cross-criterion cursor namespace collision — RESOLVED.
The new
ScanCursor.Positions map[string]CriterionPosgives each active criterion its own(afterKey, afterID). The handler readsuserLibraryIDsand buildsscopedIDsbefore the cursor is decoded, so the cursor has no influence on the library scope. Unknown keys inPositions(attacker-injected) are silently ignored via map-lookup — they cannot widen scope or cause a panic. TheafterKey/afterIDvalues are passed only as SQL parameters in a keyset WHERE clause; the library-ID filter (WHERE b.library_id IN (?)) is applied independently in every finder. Cross-user read via cursor is not possible.[MINOR] No
http.MaxBytesReader— RESOLVED.http.MaxBytesReader(w, r.Body, 4096)is applied inScanHandlerbeforeparseScanRequestis called (handler.go:128-130). The error-mapper handles*http.MaxBytesErrorviaerrors.As(checked beforeerrors.Is(ErrValidation)in the switch), so an oversized body returns 413 not 400. The handler unit test athandler_test.go:339-344confirms 413 is returned.Fresh threat-model
SQL injection — all five finder queries (
FindDuplicatesByISBN,FindDuplicatesByExternalID,FindDuplicatesByTitleAuthor,FindDuplicatesByDirectory,FindDuplicatesByFilename) pass every user-influenced value (cursorafterKey/afterID,limit+1) as positional?parameters. Library IDs are server-computed from the session and passed the same way. The LIKE patterns use%%/%%(Gofmt.Sprintfescaping to%/%) which are string literals, not user-controlled. No injection surface found.ExternalID UNION ALL arg binding — manually verified: HC branch gets
libIn, afterKey, afterKey, afterID, limit+1, libIn; GR branch gets the same pattern; total args =4N + 8matching the four%sexpansions + eight cursor/limit?s in the SQL. Both branches are independently scoped byuserLibraryIDs.Resource/DoS — request body capped at 4096 bytes via
MaxBytesReader. The cursor query param is bounded by Go's defaultMaxHeaderBytes(1 MB) and is a server-generated opaque blob with no unbounded sub-structure. Each finder's inner subquery limits tolimit+1groups (clamped to max 200 byclampLimit); the ExternalID combined outer query has no second LIMIT but is bounded by 2x(limit+1) groups, which is documented and handled correctly by the service-layer seen-map. No unbounded scan path found.Cursor decode safety —
DecodeCursorreturns a typed error (wrappingErrValidation) on invalid base64 or malformed JSON; no panic path. ThePositionsmap is deserialized into a Go struct with typed fields; oversized maps are bounded by the cursor string length, itself bounded by HTTP headers.Multi-user fail-closed — every finder has
if len(userLibraryIDs) == 0 { return nil, nil }as the first guard.filterToLibraryrejects requests for a library not in the user's accessible set (returns 404).userLibraryIDsis derived from the authenticated session, never from the request body or cursor.keptKeyspagination correctness — the seen2 loop correctly breaks onlen(seen2) > limit, excluding the peek group fromkeptKeys. ThelastPosupdate only iterateslabeledRowsinkeptKeys, so the cursor points to the last book of the last returned group, not the discarded peek group.New findings
No new blockers or majors. One minor style note:
[MINOR] internal/dedup/handler.go:53 — double-wrapped error format
%w: %wwith ErrValidation firstfmt.Errorf("dedup scan: decode body: %w: %w", middleware.ErrValidation, err)is an unusual multi-error pattern. ThestatusForswitch checkserrors.As(err, &maxBytesErr)beforeerrors.Is(ErrValidation), so the 413 path wins correctly and the test confirms it. The double-colon separator is cosmetic and non-standard but causes no correctness or security issue. The same pattern exists in the merge handler, suggesting it is an established codebase convention. No fix required.REVIEW VERDICT: 0 blocker, 0 major, 1 minor
Security Re-Review — Pass 3 (SHA
60ab9641, PR #567, bead bookshelf-c15a.2)Prior findings — resolution status
[MAJOR] Cross-criterion cursor namespace collision — RESOLVED.
The new
ScanCursor.Positions map[string]CriterionPosgives each active criterion its own(afterKey, afterID). The handler readsuserLibraryIDsand buildsscopedIDsbefore the cursor is decoded, so the cursor has no influence on the library scope. Unknown keys inPositions(attacker-injected) are silently ignored via map-lookup — they cannot widen scope or cause a panic. TheafterKey/afterIDvalues are passed only as SQL parameters in a keyset WHERE clause; the library-ID filter (WHERE b.library_id IN (?)) is applied independently in every finder. Cross-user read via cursor is not possible.[MINOR] No
http.MaxBytesReader— RESOLVED.http.MaxBytesReader(w, r.Body, 4096)is applied inScanHandlerbeforeparseScanRequestis called (handler.go:128-130). The error-mapper handles*http.MaxBytesErrorviaerrors.As(checked beforeerrors.Is(ErrValidation)in the switch), so an oversized body returns 413 not 400. The handler unit test athandler_test.go:339-344confirms 413 is returned.Fresh threat-model
SQL injection — all five finder queries (
FindDuplicatesByISBN,FindDuplicatesByExternalID,FindDuplicatesByTitleAuthor,FindDuplicatesByDirectory,FindDuplicatesByFilename) pass every user-influenced value (cursorafterKey/afterID,limit+1) as positional?parameters. Library IDs are server-computed from the session and passed the same way. The LIKE patterns use%%/%%(Gofmt.Sprintfescaping to%/%) which are string literals, not user-controlled. No injection surface found.ExternalID UNION ALL arg binding — manually verified: HC branch gets
libIn, afterKey, afterKey, afterID, limit+1, libIn; GR branch gets the same pattern; total args =4N + 8matching the four%sexpansions + eight cursor/limit?s in the SQL. Both branches are independently scoped byuserLibraryIDs.Resource/DoS — request body capped at 4096 bytes via
MaxBytesReader. The cursor query param is bounded by Go's defaultMaxHeaderBytes(1 MB) and is a server-generated opaque blob with no unbounded sub-structure. Each finder's inner subquery limits tolimit+1groups (clamped to max 200 byclampLimit); the ExternalID combined outer query has no second LIMIT but is bounded by 2x(limit+1) groups, which is documented and handled correctly by the service-layer seen-map. No unbounded scan path found.Cursor decode safety —
DecodeCursorreturns a typed error (wrappingErrValidation) on invalid base64 or malformed JSON; no panic path. ThePositionsmap is deserialized into a Go struct with typed fields; oversized maps are bounded by the cursor string length, itself bounded by HTTP headers.Multi-user fail-closed — every finder has
if len(userLibraryIDs) == 0 { return nil, nil }as the first guard.filterToLibraryrejects requests for a library not in the user's accessible set (returns 404).userLibraryIDsis derived from the authenticated session, never from the request body or cursor.keptKeyspagination correctness — the seen2 loop correctly breaks onlen(seen2) > limit, excluding the peek group fromkeptKeys. ThelastPosupdate only iterateslabeledRowsinkeptKeys, so the cursor points to the last book of the last returned group, not the discarded peek group.New findings
No new blockers or majors. One minor style note:
[MINOR] internal/dedup/handler.go:53 — double-wrapped error format
%w: %wwith ErrValidation firstfmt.Errorf("dedup scan: decode body: %w: %w", middleware.ErrValidation, err)is an unusual multi-error pattern. ThestatusForswitch checkserrors.As(err, &maxBytesErr)beforeerrors.Is(ErrValidation), so the 413 path wins correctly and the test confirms it. The double-colon separator is cosmetic and non-standard but causes no correctness or security issue. The same pattern exists in the merge handler, suggesting it is an established codebase convention. No fix required.REVIEW VERDICT: 0 blocker, 0 major, 1 minor
CODE REVIEW (3rd pass) — HEAD SHA
60ab9641Phase 0: DEMO Verification
No DEMO block in the bead. This is a re-review of a fix commit; verifying the fix correctness via code analysis per the review request.
Fix Verification (per the four claims)
(a) Per-criterion ScanCursor — each active finder gets its own {key,id}; absent entry = start from beginning
internal/dedup/service.golines 113–127: EachcriterionFinderresolves its own cursor position fromcursor.Positions[cf.label]. When the label is absent (criterion was not in any prior kept page),afterKey=""andafterID=0— the finder starts from the beginning. Correct.(b) Group-count LIMIT in the SQL — groups are never split across a page boundary
All five finders now move the HAVING keyset cursor + ORDER BY + LIMIT inside the group subquery (e.g.
internal/dedup/store.golines 105–108 for ISBN, lines 242–246 for TitleAuthor, lines 345–348 for HC, lines 386–389 for GR, lines 505–508 for Directory, lines 609–613 for Filename). The outer query JOINs the group subquery and returns all member rows for each paginated group without an additional LIMIT. This is the correct fix: the DB returns whole groups, not raw rows, so a group is never split. Verified.(c) Next cursor uses only kept-group rows
internal/dedup/service.golines 151–186:keptKeysis built by scanningallRowsin order and collecting distinct(matchType, groupKey)pairs untillen(seen2) > limit. The inner checkif len(seen2) <= limit { keptKeys[...] = true }correctly excludes the (limit+1)th group.lastPosis then populated only fromlabeledRowsentries whoseMatchType+GroupKeyis inkeptKeys. This means the cursor reflects the last row of the last returned group, not any peek rows. Verified.(d) Multi-page e2e regression test non-vacuousness
e2e/api/dedup_test.golines 908–978: The test seeds 3 ISBN pairs and uses the Balanced preset (isbn + external_id + title_author) with limit=2. It paginates two pages and asserts all 3 ISBN groups appear exactly once across both pages. The test does verify ISBN cursor resumption across pages and would have failed with the old single-cursor bug.However: the seeded data only produces ISBN groups (books have distinct titles, no external IDs). ExternalID and TitleAuthor finders return empty. This means the test exercises only one active criterion — it does not demonstrate the cross-criterion cursor isolation fix (the specific bug that was fixed: a single shared cursor causing one criterion to skip groups in another criterion's keyspace). The function comment in
seedMultiPageDedupLibrary(line 828) incorrectly says "3 ISBN groups and 1 title+author duplicate group" — no title+author pair is seeded.Findings
[MINOR] e2e/api/dedup_test.go:828 — multi-page test exercises only one active criterion, comment incorrect
The
seedMultiPageDedupLibrarycomment says "3 ISBN groups and 1 title+author duplicate group" but only inserts ISBN pairs (distinct titles, no shared norm_key). ExternalID and TitleAuthor return empty on every page. The test verifies ISBN pagination, but the cross-criterion cursor-namespace isolation fix — where ISBN cursor must not pollute the TitleAuthor or ExternalID cursor position on the next page — is not covered. The old single-cursor bug would also fail this test (the previous behavior used a single shared{afterKey, afterID}for all criteria, so with only one active criterion the old code would have passed too). To make this a true regression test for the isolation fix, seed at least one TitleAuthor pair alongside the ISBN pairs and assert both appear across the expected pages.[MINOR] internal/dedup/service.go — Scan function is 132 lines (project metric: <30 lines per function)
The inner anonymous function returned by
Scanis the main concern at ~100 lines. This pre-existed the fix but the fix added complexity. Not a blocker; the structure is clear and well-commented. Consider extracting the keptKeys + lastPos computation into a named helper.Fresh-eyes checks (per review prompt)
ORDER BY <group_key> ASC, MIN(<id>) ASCon the group subquery, andORDER BY <group_key> ASC, b.id ASCon the outer query. Total order preserved. No flake risk.nil, nilwhenlen(userLibraryIDs) == 0. Inner and outer subqueries carryb.library_id IN (?)/b2.library_id IN (?)filters. Preserved.http.MaxBytesReader(w, r.Body, 4096)is applied beforejson.NewDecoder(r.Body).Decodeinhandler.golines 129–134. ThestatusForerror mapper checkserrors.As(err, &maxBytesErr)beforeErrValidation(error_mapper.go lines 29–31), so 413 is returned correctly even thoughparseScanRequestwraps the error withErrValidation. Handler test at handler_test.go:620 ("returns 413 for a JSON body exceeding 4096 bytes") covers this path.Describeblocks with one assertion each infind_duplicates_test.go(lines 828–943). Compliant.internal/db/migrations/0037_book_file_dedup_covering_index.up.sqladds only an index, no new columns. Grimmory-compatibility rule is satisfied.gr:sorts beforehc:lexicographically so all goodreads groups are always delivered before any hardcover group appears in a page. A cursor ending athc:Xcannot cause goodreads groups to be skipped because allgr:groups were already returned. Verified safe.REVIEW VERDICT: 0 blocker, 0 major, 2 minor
60ab9641d8260395ae12zombor referenced this pull request2026-06-18 11:07:54 +00:00