feat(dedup): detection presets + criteria toggles + series-volume guard (bookshelf-c15a.2, bookshelf-7c5m) #567

Merged
zombor merged 6 commits from bd-bookshelf-c15a.2 into main 2026-06-15 18:44:47 +00:00
Owner

Summary

  • Find Duplicates v2 (bookshelf-c15a.2): Adds 4 preset chips (Strict/Balanced/Aggressive/Custom) and 5 detection criteria toggles (ISBN, External ID, Title+Author, Same Directory, Filename) to the Find Duplicates modal. Backend extends the scan endpoint to accept a JSON criteria body and implements all 5 detection methods via indexed SQL queries. Migration 0036 adds indexes on book_metadata.hardcover_id and goodreads_id for the External ID finder.
  • Series-volume false-positive fix (bookshelf-7c5m): FindDuplicatesByTitleAuthor now excludes groups where books have different non-null series_number values (Eleria vol 1 ≠ vol 3 even though they share title+author). Fixed via HAVING COUNT(DISTINCT CASE WHEN series_number IS NOT NULL THEN series_number END) <= 1.

Test plan

  • Unit tests: Ginkgo specs for all 5 FindDuplicatesBy* finders, PresetCriteria(), parseScanRequest() (JSON body + query params including custom preset), Scan() with each criteria combination
  • API e2e: strict preset returns only isbn groups; custom+filename only returns empty; series-volume guard (Eleria case); true-positive (same series_number)
  • Browser e2e: preset chips show Balanced active by default; Strict chip disables Title+Author checkbox; Balanced scan finds ISBN match badge
  • Coverage gate: 100% on internal/dedup/ (excluding wire.go which is wiring)
  • make build + golangci-lint run ./internal/dedup/... clean

Closes bead bookshelf-c15a.2 and bookshelf-7c5m on merge.

## Summary - **Find Duplicates v2 (bookshelf-c15a.2):** Adds 4 preset chips (Strict/Balanced/Aggressive/Custom) and 5 detection criteria toggles (ISBN, External ID, Title+Author, Same Directory, Filename) to the Find Duplicates modal. Backend extends the scan endpoint to accept a JSON criteria body and implements all 5 detection methods via indexed SQL queries. Migration 0036 adds indexes on `book_metadata.hardcover_id` and `goodreads_id` for the External ID finder. - **Series-volume false-positive fix (bookshelf-7c5m):** `FindDuplicatesByTitleAuthor` now excludes groups where books have different non-null `series_number` values (Eleria vol 1 ≠ vol 3 even though they share title+author). Fixed via `HAVING COUNT(DISTINCT CASE WHEN series_number IS NOT NULL THEN series_number END) <= 1`. ## Test plan - [x] Unit tests: Ginkgo specs for all 5 `FindDuplicatesBy*` finders, `PresetCriteria()`, `parseScanRequest()` (JSON body + query params including custom preset), `Scan()` with each criteria combination - [x] API e2e: strict preset returns only isbn groups; custom+filename only returns empty; series-volume guard (Eleria case); true-positive (same series_number) - [x] Browser e2e: preset chips show Balanced active by default; Strict chip disables Title+Author checkbox; Balanced scan finds ISBN match badge - [x] Coverage gate: 100% on `internal/dedup/` (excluding `wire.go` which is wiring) - [x] `make build` + `golangci-lint run ./internal/dedup/...` clean Closes bead bookshelf-c15a.2 and bookshelf-7c5m on merge.
feat(dedup): detection presets + criteria toggles + series-volume guard (bookshelf-c15a.2, bookshelf-7c5m)
Some checks failed
/ Test (pull_request) Has been cancelled
/ Integration (pull_request) Has been cancelled
/ Lint (pull_request) Has been cancelled
/ E2E Browser (pull_request) Has been cancelled
/ JS Unit Tests (pull_request) Has been cancelled
/ E2E API (pull_request) Has been cancelled
6fa9657ed6
GOAL A (bookshelf-c15a.2): Find Duplicates v2 — presets + criteria
- Add ScanCriteria struct with ISBN/ExternalID/TitleAuthor/SameDirectory/Filename booleans
- Add PresetCriteria() returning Strict/Balanced/Aggressive/Custom preset definitions
- Extend Scan() to accept 5 FinderFuncs + ScanCriteria; iterates active criteria in priority order
- Add FindDuplicatesByExternalID (hardcover_id + goodreads_id via UNION ALL, prefixed group keys)
- Add FindDuplicatesByDirectory (SUBSTRING_INDEX dir extraction, LIKE '%/%' guard)
- Add FindDuplicatesByFilename (SUBSTRING_INDEX basename extraction)
- Migration 0036: indexes on book_metadata.hardcover_id and goodreads_id
- Handler parseScanRequest: JSON body (Content-Type: application/json) or query params
- UI: preset chips (Strict/Balanced/Aggressive/Custom), <details> advanced panel with 5 checkboxes
- JS controller: selectPreset() updates chips + applies preset to checkboxes; scan() sends JSON body
- CSS: .dedup-preset-chip, .dedup-preset-chip--active, .dedup-advanced, .dedup-criteria
- Match badge in results template handles all 5 MatchType values
- Remove dead seenBooks dedup from groupRowsByKey (Scan.seen already deduplicates)

GOAL B (bookshelf-7c5m): Series-volume false-positive fix
- FindDuplicatesByTitleAuthor HAVING clause: exclude groups where
  COUNT(DISTINCT CASE WHEN series_number IS NOT NULL THEN series_number END) > 1
- Fix bf_main derived table to use GROUP BY + MIN() instead of ORDER BY (ignored in subqueries)

Tests:
- Unit: Ginkgo tests for all 5 criteria, preset helpers, parseScanRequest (JSON + query params)
- API e2e: strict preset returns only isbn groups; custom+filename returns empty; 7c5m series guard
- Browser e2e: preset chips active state, Strict disables Title+Author, balanced scan finds ISBN group

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
zombor force-pushed bd-bookshelf-c15a.2 from 6fa9657ed6
Some checks failed
/ Test (pull_request) Has been cancelled
/ Integration (pull_request) Has been cancelled
/ Lint (pull_request) Has been cancelled
/ E2E Browser (pull_request) Has been cancelled
/ JS Unit Tests (pull_request) Has been cancelled
/ E2E API (pull_request) Has been cancelled
to 082b170c16
Some checks failed
/ Lint (pull_request) Successful in 1m53s
/ JS Unit Tests (pull_request) Successful in 51s
/ Test (pull_request) Successful in 3m35s
/ E2E Browser (pull_request) Failing after 3m56s
/ Integration (pull_request) Successful in 6m56s
/ E2E API (pull_request) Has been cancelled
2026-06-15 15:44:58 +00:00
Compare
fix(e2e): fix browser preset chip test — proper Describe wrapping + close Merge All It
Some checks failed
/ Lint (pull_request) Successful in 2m31s
/ JS Unit Tests (pull_request) Successful in 28s
/ Test (pull_request) Successful in 3m35s
/ E2E Browser (pull_request) Failing after 3m52s
/ Integration (pull_request) Successful in 6m9s
/ E2E API (pull_request) Successful in 7m30s
0c86713c65
The preset chip It block was incorrectly placed inside the Merge All It body
during rebase conflict resolution. Fix by:
- Properly closing the Merge All It block with })
- Closing the v3 Dismiss+Delete+MergeAll Describe with })
- Wrapping preset chip test in its own top-level Describe with BeforeEach/AfterEach
- Using openDedupModal() helper from the refactored main branch

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
fix(e2e): read chip className via Eval (Attribute returns *string, breaks ContainSubstring)
All checks were successful
/ JS Unit Tests (pull_request) Successful in 20s
/ Lint (pull_request) Successful in 2m5s
/ Test (pull_request) Successful in 3m51s
/ E2E Browser (pull_request) Successful in 4m27s
/ E2E API (pull_request) Successful in 6m12s
/ Integration (pull_request) Successful in 6m20s
1b25230162
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Author
Owner

Find Duplicates v2 — preset chips + criteria toggles

Screenshots from the browser e2e test:

Modal: preset chips (Strict / Balanced / Aggressive / Custom) + advanced criteria, Balanced active by default

presets ui

Strict selected — Title+Author criterion deselected, Advanced panel open

strict

Scan results with Balanced preset — ISBN Match group

scan results

### Find Duplicates v2 — preset chips + criteria toggles Screenshots from the browser e2e test: **Modal: preset chips (Strict / Balanced / Aggressive / Custom) + advanced criteria, Balanced active by default** ![presets ui](https://git.zombor.net/attachments/ee8f7b36-7d02-4f07-8fe9-d6866b7efeaa) **Strict selected — Title+Author criterion deselected, Advanced panel open** ![strict](https://git.zombor.net/attachments/f43a103a-fccc-4505-92f3-63949d9221cc) **Scan results with Balanced preset — ISBN Match group** ![scan results](https://git.zombor.net/attachments/28b0bae0-b962-4a2e-a76b-431df4928baf)
Author
Owner

Security Review — PR #567 (bookshelf-c15a.2 + bookshelf-7c5m)

Branch: bd-bookshelf-c15a.2
Scope: dedup detection presets + 5 criteria toggles (ISBN / External ID / Title+Author / Same Directory / Filename) + series-volume false-positive fix


Phase 0: Multi-user / Library Scope

Reviewed all five new/modified query functions in internal/dedup/store.go.

FindDuplicatesByISBN (modified): unchanged scoping, clean.

FindDuplicatesByTitleAuthor (modified — series-volume fix): b2.library_id IN (%s) (inner sub), b3.library_id IN (%s) (grp sub), b.library_id IN (%s) (outer WHERE). Three parameterized scope anchors. Clean.

FindDuplicatesByExternalID (new, store.go:284): hardcover UNION ALL goodreads — both sub-selects have b2.library_id IN (%s) / b3.library_id IN (%s). Outer WHERE b.library_id IN (%s). Three separate %s expansions, all backed by userLibraryIDs (never user-supplied). Fail-closed: empty userLibraryIDs short-circuits at the top of the function before any SQL is built. Clean.

FindDuplicatesByDirectory (new, store.go:386): inner subquery b2.library_id IN (%s), outer WHERE b.library_id IN (%s). The ON clause uses a correlated subquery on bf3 with no explicit library filter — bf3.book_id = b.id and b is already constrained by the WHERE b.library_id IN (%s), so bf3 can only resolve against books already in scope. The grp subquery (dir-key grouping) is also restricted to b2.library_id IN (user scope). No cross-library directory key leakage. Clean.

FindDuplicatesByFilename (new, store.go:488): same two-anchor pattern as Directory — b2.library_id IN (%s) in the inner fname subquery, b.library_id IN (%s) in the outer WHERE. Correlated bf3 is again anchored on b.id which is already scoped. Clean.

Handler scoping (internal/dedup/handler.go:101-125): getUserLibraryIDs fires from authenticated session (extractUser(r).ID), result is narrowed to filterToLibrary(userLibraryIDs, libraryID). The criteria parsing (parseScanRequest) runs after the scope check and only controls which boolean flags (ScanCriteria) are passed to the scan function. The criteria flags never widen the library scope — they only select which parameterized queries execute. Fail-closed: userID=0 (unauthenticated) maps to no libraries → filterToLibrary returns empty → ErrNotFound before any scan. Clean.


Phase 1: SQL Injection

All user-supplied inputs that reach SQL travel through parameterized ? placeholders:

  • preset / individual flag strings from JSON body / query params → decoded into ScanCriteria (Go struct booleans) in parseScanRequest. The string value of preset never reaches SQL. Unknown preset falls back to balanced in PresetCriteria (dedup.go:42). Zero injection surface.
  • afterKey / afterID (cursor) → used as ? bind parameters (ORDER BY … AND grp.key > ?). Never interpolated.
  • libIn placeholder string is built from len(userLibraryIDs) literal ? characters, not from any user string.
  • LIKE '%%/%%' in the Directory inner query is a Go fmt.Sprintf-escaped literal that produces LIKE '%/%' — not derived from user input.
  • ORDER BY columns are all hard-coded column references (grp.isbn_13 ASC, grp.provider_key ASC, grp.dir_key ASC, grp.fname ASC). No dynamic ORDER BY from user input.

No SQL injection surface found.


Phase 2: XSS / Template

  • templates/pages/dedup_modal.html: new preset chip <button> elements and checkbox <input> elements. All attribute values are static strings. No {{.}} interpolation in the new HTML. No style= attributes. Clean.
  • templates/pages/dedup_results.html: the fallback branch {{else}}{{$g.MatchType}}{{end}} emits MatchType in a text node — html/template auto-escapes it. MatchType originates from hard-coded SQL literal strings ('isbn', 'external_id', etc.) so attacker cannot inject arbitrary content even before escaping. Clean.
  • static/js/controllers/find_duplicates_controller.js: no element.style. assignments, no innerHTML with user data, no eval. Criteria body is JSON.stringify({preset: ...}) where preset is read from a data-preset attribute on a server-rendered button (static strings in the template). Clean.
  • static/css/main.css: new CSS classes only, no inline style=. CSP-compatible. Clean.

Phase 3: No New Unauthenticated / Unscoped Surface

No new routes were added. The existing three dedup routes (GET /libraries/{id}/dedup, POST /libraries/{id}/dedup/scan, POST /libraries/{id}/dedup/merge) already existed and required a mapped user. The wire-up (wire.go) only adds three new finder functions to the existing Scan closure — no new HTTP surface. Clean.


Findings

None.

REVIEW VERDICT: 0 blocker, 0 major, 0 minor

## Security Review — PR #567 (bookshelf-c15a.2 + bookshelf-7c5m) **Branch:** `bd-bookshelf-c15a.2` **Scope:** dedup detection presets + 5 criteria toggles (ISBN / External ID / Title+Author / Same Directory / Filename) + series-volume false-positive fix --- ### Phase 0: Multi-user / Library Scope Reviewed all five new/modified query functions in `internal/dedup/store.go`. **FindDuplicatesByISBN** (modified): unchanged scoping, clean. **FindDuplicatesByTitleAuthor** (modified — series-volume fix): `b2.library_id IN (%s)` (inner sub), `b3.library_id IN (%s)` (grp sub), `b.library_id IN (%s)` (outer WHERE). Three parameterized scope anchors. Clean. **FindDuplicatesByExternalID** (new, `store.go:284`): hardcover UNION ALL goodreads — both sub-selects have `b2.library_id IN (%s)` / `b3.library_id IN (%s)`. Outer `WHERE b.library_id IN (%s)`. Three separate `%s` expansions, all backed by `userLibraryIDs` (never user-supplied). Fail-closed: empty `userLibraryIDs` short-circuits at the top of the function before any SQL is built. Clean. **FindDuplicatesByDirectory** (new, `store.go:386`): inner subquery `b2.library_id IN (%s)`, outer `WHERE b.library_id IN (%s)`. The `ON` clause uses a correlated subquery on `bf3` with no explicit library filter — `bf3.book_id = b.id` and `b` is already constrained by the `WHERE b.library_id IN (%s)`, so `bf3` can only resolve against books already in scope. The `grp` subquery (dir-key grouping) is also restricted to `b2.library_id IN (user scope)`. No cross-library directory key leakage. Clean. **FindDuplicatesByFilename** (new, `store.go:488`): same two-anchor pattern as Directory — `b2.library_id IN (%s)` in the inner fname subquery, `b.library_id IN (%s)` in the outer WHERE. Correlated `bf3` is again anchored on `b.id` which is already scoped. Clean. **Handler scoping** (`internal/dedup/handler.go:101-125`): `getUserLibraryIDs` fires from authenticated session (`extractUser(r).ID`), result is narrowed to `filterToLibrary(userLibraryIDs, libraryID)`. The criteria parsing (`parseScanRequest`) runs *after* the scope check and only controls which boolean flags (`ScanCriteria`) are passed to the scan function. The criteria flags never widen the library scope — they only select which parameterized queries execute. Fail-closed: `userID=0` (unauthenticated) maps to no libraries → `filterToLibrary` returns empty → `ErrNotFound` before any scan. Clean. --- ### Phase 1: SQL Injection All user-supplied inputs that reach SQL travel through parameterized `?` placeholders: - `preset` / individual flag strings from JSON body / query params → decoded into `ScanCriteria` (Go struct booleans) in `parseScanRequest`. The string value of `preset` never reaches SQL. Unknown preset falls back to `balanced` in `PresetCriteria` (`dedup.go:42`). Zero injection surface. - `afterKey` / `afterID` (cursor) → used as `?` bind parameters (`ORDER BY … AND grp.key > ?`). Never interpolated. - `libIn` placeholder string is built from `len(userLibraryIDs)` literal `?` characters, not from any user string. - `LIKE '%%/%%'` in the Directory inner query is a Go `fmt.Sprintf`-escaped literal that produces `LIKE '%/%'` — not derived from user input. - `ORDER BY` columns are all hard-coded column references (`grp.isbn_13 ASC`, `grp.provider_key ASC`, `grp.dir_key ASC`, `grp.fname ASC`). No dynamic ORDER BY from user input. No SQL injection surface found. --- ### Phase 2: XSS / Template - `templates/pages/dedup_modal.html`: new preset chip `<button>` elements and checkbox `<input>` elements. All attribute values are static strings. No `{{.}}` interpolation in the new HTML. No `style=` attributes. Clean. - `templates/pages/dedup_results.html`: the fallback branch `{{else}}{{$g.MatchType}}{{end}}` emits `MatchType` in a text node — `html/template` auto-escapes it. `MatchType` originates from hard-coded SQL literal strings (`'isbn'`, `'external_id'`, etc.) so attacker cannot inject arbitrary content even before escaping. Clean. - `static/js/controllers/find_duplicates_controller.js`: no `element.style.` assignments, no `innerHTML` with user data, no `eval`. Criteria body is `JSON.stringify({preset: ...})` where `preset` is read from a `data-preset` attribute on a server-rendered button (static strings in the template). Clean. - `static/css/main.css`: new CSS classes only, no inline `style=`. CSP-compatible. Clean. --- ### Phase 3: No New Unauthenticated / Unscoped Surface No new routes were added. The existing three dedup routes (`GET /libraries/{id}/dedup`, `POST /libraries/{id}/dedup/scan`, `POST /libraries/{id}/dedup/merge`) already existed and required a mapped user. The wire-up (`wire.go`) only adds three new finder functions to the existing `Scan` closure — no new HTTP surface. Clean. --- ### Findings None. REVIEW VERDICT: 0 blocker, 0 major, 0 minor
Author
Owner

CODE REVIEW: NOT APPROVED

Phase 0: DEMO Verification

No DEMO block required for this review (code-review agent, not bead-completion verification).

Phase 1: Spec Compliance

All spec requirements are present: Strict/Balanced/Aggressive/Custom presets, all 5 criteria toggles (ISBN, External ID, Title+Author, Same Directory, Filename), 7c5m series-volume guard in SQL HAVING, match badge updated for all 5 match types, browser test present, multi-user library scoping applied fail-closed to all criteria, migration 0036 adds hardcover/goodreads indexes. Structure follows project conventions.

Phase 2: Code Quality


[MAJOR] internal/dedup/store.go:334-338 — ExternalID: book with both hardcover_id AND goodreads_id silently destroys the hardcover duplicate group

The JOIN ON uses OR to match either provider key. If book A has both hardcover_id=hc-001 AND goodreads_id=gr-001, and both form independent duplicate groups (A+B share hc-001; A+C share gr-001), MySQL returns book A twice in the query output — once per matching grp row. The Scan.seen map in service.go:~92 marks A on its first encounter (e.g. gr:gr-001), then drops the second occurrence (hc:hc-001). The hc: group is left with only book B (single-book group), which is filtered out. A valid hardcover_id duplicate group (A+B) is silently destroyed.

Suggested fix: rewrite using UNION (separate hc and gr outer queries merged with UNION ALL), or add GROUP BY b.id, grp.provider_key before ORDER BY to collapse the OR-expanded duplicates within the query. Either approach ensures each (book_id, provider_key) appears at most once in the result set.


[MAJOR] internal/dedup/store.go:435-441, 529-534 — Directory and Filename criteria use per-row correlated subqueries with no covering index; violates 250k scale requirement

Both FindDuplicatesByDirectory and FindDuplicatesByFilename join the outer book row to grp via a correlated subquery in the ON clause:

JOIN grp ON (
SELECT SUBSTRING_INDEX(bf3.file_sub_path, '/', ...)
FROM book_file bf3
WHERE bf3.book_id = b.id AND bf3.is_book = 1
LIMIT 1
) = grp.dir_key

This executes once per outer book row. At 250k books that is 250k per-row correlated scans before LIMIT applies. The existing idx_book_additional_file_book_id covers only (book_id) and cannot serve the SUBSTRING_INDEX expression without fetching the row. Migration 0036 adds no index on book_file(book_id, is_book, file_sub_path). The project scale requirement says GET /books p99 < 100ms and all query paths must be designed for 250k books from day one; this query pattern violates that.

Suggested fix: (1) add a covering index (book_id, is_book, file_sub_path) to book_file in this migration, AND (2) rewrite as a derived table (like bf_main is already done) keyed by book_id so there is no per-row correlated subquery — join bf_dir (SELECT book_id, SUBSTRING_INDEX(...) AS dir_key FROM book_file WHERE is_book=1 GROUP BY book_id) and then join to grp on equi-join.


[MAJOR] e2e/api/dedup_test.go — No positive e2e test for Same Directory or Filename criteria

The only e2e test covering the Filename criterion ("custom preset with only filename returns empty", line ~283) seeds books with no book_file rows and asserts empty results. This is a negative-only test — a bug in the FindDuplicatesByDirectory or FindDuplicatesByFilename SQL (wrong SUBSTRING_INDEX, wrong column, wrong library filter) would not be caught by CI since the result is empty either way. There is no positive e2e test that seeds books with matching file_sub_path values and verifies a group is returned with the correct match_type.

Suggested fix: add seedDirectoryDuplicateLibrary() and seedFilenameDuplicateLibrary() helpers (analogous to seedDedupLibrary/seedEleriaLibrary) that insert book_file rows with matching paths, then assert the scan returns a group with match_type "same_directory" / "filename".


[MINOR] e2e/browser/find_duplicates_test.go:541-542 — Checkbox state verified via HTML attribute instead of DOM property; assertion is vacuous

checked, _ := titleAuthorCB.Attribute("checked")
Expect(checked).To(BeNil())

Attribute("checked") reads the HTML attribute, not the DOM .checked property. Since the template (dedup_modal.html) never renders a checked attribute on any checkbox, this always returns nil regardless of whether _applyPreset ran. The test passes even if the JS preset logic is completely broken.

Suggested fix: isChecked := titleAuthorCB.MustEval("() => this.checked").Bool() then Expect(isChecked).To(BeFalse()).


[MINOR] e2e/api/dedup_test.go:17 — Stray double blank line after import block (style nit)


REVIEW VERDICT: 0 blocker, 3 major, 2 minor

CODE REVIEW: NOT APPROVED ## Phase 0: DEMO Verification No DEMO block required for this review (code-review agent, not bead-completion verification). ## Phase 1: Spec Compliance All spec requirements are present: Strict/Balanced/Aggressive/Custom presets, all 5 criteria toggles (ISBN, External ID, Title+Author, Same Directory, Filename), 7c5m series-volume guard in SQL HAVING, match badge updated for all 5 match types, browser test present, multi-user library scoping applied fail-closed to all criteria, migration 0036 adds hardcover/goodreads indexes. Structure follows project conventions. ## Phase 2: Code Quality --- [MAJOR] internal/dedup/store.go:334-338 — ExternalID: book with both hardcover_id AND goodreads_id silently destroys the hardcover duplicate group The JOIN ON uses OR to match either provider key. If book A has both hardcover_id=hc-001 AND goodreads_id=gr-001, and both form independent duplicate groups (A+B share hc-001; A+C share gr-001), MySQL returns book A twice in the query output — once per matching grp row. The Scan.seen map in service.go:~92 marks A on its first encounter (e.g. gr:gr-001), then drops the second occurrence (hc:hc-001). The hc: group is left with only book B (single-book group), which is filtered out. A valid hardcover_id duplicate group (A+B) is silently destroyed. Suggested fix: rewrite using UNION (separate hc and gr outer queries merged with UNION ALL), or add GROUP BY b.id, grp.provider_key before ORDER BY to collapse the OR-expanded duplicates within the query. Either approach ensures each (book_id, provider_key) appears at most once in the result set. --- [MAJOR] internal/dedup/store.go:435-441, 529-534 — Directory and Filename criteria use per-row correlated subqueries with no covering index; violates 250k scale requirement Both FindDuplicatesByDirectory and FindDuplicatesByFilename join the outer book row to grp via a correlated subquery in the ON clause: JOIN grp ON ( SELECT SUBSTRING_INDEX(bf3.file_sub_path, '/', ...) FROM book_file bf3 WHERE bf3.book_id = b.id AND bf3.is_book = 1 LIMIT 1 ) = grp.dir_key This executes once per outer book row. At 250k books that is 250k per-row correlated scans before LIMIT applies. The existing idx_book_additional_file_book_id covers only (book_id) and cannot serve the SUBSTRING_INDEX expression without fetching the row. Migration 0036 adds no index on book_file(book_id, is_book, file_sub_path). The project scale requirement says GET /books p99 < 100ms and all query paths must be designed for 250k books from day one; this query pattern violates that. Suggested fix: (1) add a covering index (book_id, is_book, file_sub_path) to book_file in this migration, AND (2) rewrite as a derived table (like bf_main is already done) keyed by book_id so there is no per-row correlated subquery — join bf_dir (SELECT book_id, SUBSTRING_INDEX(...) AS dir_key FROM book_file WHERE is_book=1 GROUP BY book_id) and then join to grp on equi-join. --- [MAJOR] e2e/api/dedup_test.go — No positive e2e test for Same Directory or Filename criteria The only e2e test covering the Filename criterion ("custom preset with only filename returns empty", line ~283) seeds books with no book_file rows and asserts empty results. This is a negative-only test — a bug in the FindDuplicatesByDirectory or FindDuplicatesByFilename SQL (wrong SUBSTRING_INDEX, wrong column, wrong library filter) would not be caught by CI since the result is empty either way. There is no positive e2e test that seeds books with matching file_sub_path values and verifies a group is returned with the correct match_type. Suggested fix: add seedDirectoryDuplicateLibrary() and seedFilenameDuplicateLibrary() helpers (analogous to seedDedupLibrary/seedEleriaLibrary) that insert book_file rows with matching paths, then assert the scan returns a group with match_type "same_directory" / "filename". --- [MINOR] e2e/browser/find_duplicates_test.go:541-542 — Checkbox state verified via HTML attribute instead of DOM property; assertion is vacuous checked, _ := titleAuthorCB.Attribute("checked") Expect(checked).To(BeNil()) Attribute("checked") reads the HTML attribute, not the DOM .checked property. Since the template (dedup_modal.html) never renders a checked attribute on any checkbox, this always returns nil regardless of whether _applyPreset ran. The test passes even if the JS preset logic is completely broken. Suggested fix: isChecked := titleAuthorCB.MustEval("() => this.checked").Bool() then Expect(isChecked).To(BeFalse()). --- [MINOR] e2e/api/dedup_test.go:17 — Stray double blank line after import block (style nit) --- REVIEW VERDICT: 0 blocker, 3 major, 2 minor
fix(dedup): address review majors — ExternalID OR-dup loss, Directory/Filename correlated-subquery→derived-join + covering index, positive Dir/Filename e2e
All checks were successful
/ JS Unit Tests (pull_request) Successful in 17s
/ Lint (pull_request) Successful in 1m49s
/ Test (pull_request) Successful in 3m0s
/ E2E Browser (pull_request) Successful in 4m0s
/ E2E API (pull_request) Successful in 4m42s
/ Integration (pull_request) Successful in 4m56s
b14f04d5a1
[MAJOR 1] ExternalID OR join: replace single OR-join with UNION ALL of two
equi-joined sub-queries (hc + gr) so a book with both provider IDs appears
exactly once per provider group. Change seen-map key to (bookID, groupKey)
so both provider groups survive (old bookID-only key silently dropped the
second group). Update cross-criteria tests to reflect new behavior (each
distinct (book, group) pair gets its own group; exact duplicates still
deduplicated). Update browser tests to use Strict preset where single-group
count assertions are needed.

[MAJOR 2] Directory/Filename correlated-subquery: replace per-outer-row
correlated subquery with a derived-table EQUI-JOIN (bf_dir / bf_fname).
Add migration 0037 adding covering index idx_book_file_dedup_dir
(book_id, is_book, file_sub_path) so both queries are served from the index.
Fix GROUP BY to use MIN(file_sub_path) to satisfy only_full_group_by.

[MAJOR 3] Positive e2e tests: add seedDirectoryDuplicateLibrary,
seedFilenameDuplicateLibrary, and both-ids external-ID scenario seeder +
full Describe blocks for same_directory, filename, and external_id both-groups.

[MINOR] Browser checkbox: use MustEval("() => this.checked").Bool() instead
of Attribute("checked") (HTML attr is vacuous after JS toggles state).

[MINOR] Stray double blank line in dedup_test.go removed.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Author
Owner

CODE REVIEW (RE-REVIEW): APPROVED

Re-review of #567 (bd-bookshelf-c15a.2) against prior findings in comment 6689.

Prior Finding Verification

MAJOR 1: ExternalID OR-dup loss — RESOLVED

The fix rewrites FindDuplicatesByExternalID using two independent EQUI-JOIN subqueries (hardcover and goodreads) merged with UNION ALL (store.go:278-381). Each provider query uses an equi-join with no OR. A book with both hardcover_id and goodreads_id appears exactly once per provider group.

The service.go seen-map key is now a (bookID, groupKey) composite (service.go:92) — so book A can appear in both hc:HC001 and gr:GR001 groups without being dropped. The e2e test seeds exactly the A+B (hc) / A+C (gr) scenario and asserts HaveLen(2). Finding is resolved.

MAJOR 2: Directory/Filename correlated-subquery — RESOLVED

Both FindDuplicatesByDirectory and FindDuplicatesByFilename now use derived-table EQUI-JOINs with no per-row correlated subquery (store.go:453-493 for directory, store.go:551-595 for filename). Migration 0037 adds idx_book_file_dedup_dir (book_id, is_book, file_sub_path) on book_file. All ORDER BY clauses use a two-column total order with unique tiebreaker. Finding is resolved.

MAJOR 3: Positive e2e tests for same_directory, filename, both-ids external_id — RESOLVED

  • seedDirectoryDuplicateLibrary() + same_directory match Describe (e2e/api/dedup_test.go:555-618): seeds two books sharing a directory, asserts group with match_type same_directory.
  • seedFilenameDuplicateLibrary() + filename match (e2e/api/dedup_test.go:619-686): seeds two books with same basename in different directories.
  • external_id both-ids book keeps both groups (e2e/api/dedup_test.go:688-822): asserts two distinct provider groups.

All three are positive tests. Finding is resolved.

MINOR: Browser checkbox MustEval bool read — RESOLVED

e2e/browser/find_duplicates_test.go now uses titleAuthorCB.MustEval("() => this.checked").Bool(). Finding is resolved.

MINOR: Stray double blank line — RESOLVED

The extra blank line after the import block is gone. Finding is resolved.

Fresh-Eyes Pass

Multi-user/library scoping: Both new directory and filename finders filter by userLibraryIDs in the inner grp subquery AND the outer WHERE clause. Empty-slice fail-closed guard present. Correct.

Scale: No N+1. All new queries use derived-table joins. ORDER BY uses total-order two-column keys. LIMIT bound on all queries. Covering index on book_file (migration 0037). Clean.

Grimmory compat: Migrations 0036 and 0037 add indexes only, no new columns. Correct.

CSP: No inline style= in templates. CSS classes used throughout. Clean.

[MINOR] e2e/browser/find_duplicates_test.go:494 — single It block contains multiple Expect assertions
The browser spec chains four independent assertions (balanced chip active, strict chip active, titleAuthor unchecked, badge text). Project convention requires one Expect per It. Split into four It blocks. Does not block merge.

REVIEW VERDICT: 0 blocker, 0 major, 1 minor

CODE REVIEW (RE-REVIEW): APPROVED Re-review of #567 (bd-bookshelf-c15a.2) against prior findings in comment 6689. ## Prior Finding Verification ### MAJOR 1: ExternalID OR-dup loss — RESOLVED The fix rewrites FindDuplicatesByExternalID using two independent EQUI-JOIN subqueries (hardcover and goodreads) merged with UNION ALL (store.go:278-381). Each provider query uses an equi-join with no OR. A book with both hardcover_id and goodreads_id appears exactly once per provider group. The service.go seen-map key is now a (bookID, groupKey) composite (service.go:92) — so book A can appear in both hc:HC001 and gr:GR001 groups without being dropped. The e2e test seeds exactly the A+B (hc) / A+C (gr) scenario and asserts HaveLen(2). Finding is resolved. ### MAJOR 2: Directory/Filename correlated-subquery — RESOLVED Both FindDuplicatesByDirectory and FindDuplicatesByFilename now use derived-table EQUI-JOINs with no per-row correlated subquery (store.go:453-493 for directory, store.go:551-595 for filename). Migration 0037 adds idx_book_file_dedup_dir (book_id, is_book, file_sub_path) on book_file. All ORDER BY clauses use a two-column total order with unique tiebreaker. Finding is resolved. ### MAJOR 3: Positive e2e tests for same_directory, filename, both-ids external_id — RESOLVED - seedDirectoryDuplicateLibrary() + same_directory match Describe (e2e/api/dedup_test.go:555-618): seeds two books sharing a directory, asserts group with match_type same_directory. - seedFilenameDuplicateLibrary() + filename match (e2e/api/dedup_test.go:619-686): seeds two books with same basename in different directories. - external_id both-ids book keeps both groups (e2e/api/dedup_test.go:688-822): asserts two distinct provider groups. All three are positive tests. Finding is resolved. ### MINOR: Browser checkbox MustEval bool read — RESOLVED e2e/browser/find_duplicates_test.go now uses titleAuthorCB.MustEval("() => this.checked").Bool(). Finding is resolved. ### MINOR: Stray double blank line — RESOLVED The extra blank line after the import block is gone. Finding is resolved. ## Fresh-Eyes Pass Multi-user/library scoping: Both new directory and filename finders filter by userLibraryIDs in the inner grp subquery AND the outer WHERE clause. Empty-slice fail-closed guard present. Correct. Scale: No N+1. All new queries use derived-table joins. ORDER BY uses total-order two-column keys. LIMIT bound on all queries. Covering index on book_file (migration 0037). Clean. Grimmory compat: Migrations 0036 and 0037 add indexes only, no new columns. Correct. CSP: No inline style= in templates. CSS classes used throughout. Clean. [MINOR] e2e/browser/find_duplicates_test.go:494 — single It block contains multiple Expect assertions The browser spec chains four independent assertions (balanced chip active, strict chip active, titleAuthor unchecked, badge text). Project convention requires one Expect per It. Split into four It blocks. Does not block merge. REVIEW VERDICT: 0 blocker, 0 major, 1 minor
Author
Owner

Security Review - PR #567 (bookshelf-c15a.2) - RE-REVIEW of fixes for comment 6689

Head SHA: b14f04d5
Scope: re-review of 3 MAJOR + 2 MINOR fixes from prior review comment 6689; adversarial check for new issues introduced by the fix commits.

Phase 0: Prior findings - verification

[MAJOR #1] ExternalID OR-join silent group destruction - FIXED. FindDuplicatesByExternalID now uses UNION ALL of two separate sub-selects (hardcover, goodreads), each with its own equi-join. A book with both hardcover_id and goodreads_id appears once per provider. The Scan seen-map key is now bookID+NUL+groupKey, allowing one book to appear in both the hc: and gr: provider groups. Verified in e2e/api/dedup_test.go:738.

[MAJOR #2] Directory/Filename per-row correlated subqueries - FIXED. Both finders now use derived tables (bf_dir, bf_fname) GROUP BY book_id ahead of the outer join - no per-row correlated subquery. Migration 0037 adds covering index idx_book_file_dedup_dir (book_id, is_book, file_sub_path) allowing index-only scans.

[MAJOR #3] No positive e2e tests for Directory/Filename - FIXED. seedDirectoryDuplicateLibrary() and seedFilenameDuplicateLibrary() added at e2e/api/dedup_test.go:533 and :585. Positive scan tests verify match_type == "same_directory" and "filename".

[MINOR #1] Checkbox checked attribute vs DOM property - FIXED. e2e/browser/find_duplicates_test.go:545 now uses titleAuthorCB.MustEval("() => this.checked").Bool().

[MINOR #2] Double blank line after import block - FIXED.


Phase 1: New findings in the fix commits


[MAJOR] internal/dedup/service.go:71-114, store.go:379,493,594 - Multi-criteria pagination cursor: single shared (afterKey, afterID) passed to all finders with incompatible key namespaces; second page silently broken for every preset with 2+ criteria

Scan encodes the cursor as a single {AfterKey string, AfterID int64} pair and passes the same values to every active finder. But each finder orders on a different group key namespace:

  • FindDuplicatesByISBN: afterKey vs isbn_13 (e.g. "9780007525508"), afterID vs grp.min_id
  • FindDuplicatesByExternalID: afterKey vs "hc:" or "gr:", afterID vs book_id (NOT grp.min_id - inconsistent tiebreaker)
  • FindDuplicatesByTitleAuthor: afterKey vs normalized "title author", afterID vs grp.min_id
  • FindDuplicatesByDirectory: afterKey vs directory path, afterID vs grp.min_id
  • FindDuplicatesByFilename: afterKey vs basename string, afterID vs grp.min_id

When page 1 ends on an ISBN group (cursor {AfterKey:"9780007525508", AfterID:42}), page 2 feeds that same cursor to TitleAuthor - which skips any normalized title+author key that sorts before "9780007525508" lexicographically, silently dropping those groups. ExternalID compares an ISBN string against "gr:...","hc:..." keys - a completely different namespace. The afterID tiebreaker inconsistency compounds this: ExternalID uses book_id while every other finder uses grp.min_id, so cursors cross-applied between criteria compare semantically different IDs.

The Balanced preset (default: ISBN + ExternalID + TitleAuthor) triggers this on every library large enough to exceed one page.

Suggested fix: tag group keys with a criterion prefix ("isbn:9780007525508", "ta:the martian andy weir", "hc:abc", "dir:/fantasy/tolkien", "fn:dune.epub") so a single global cursor can order across all criteria in one unified lexical space and each finder applies a prefix-range WHERE. Alternative: encode per-criterion cursors in ScanCursor so each finder advances independently. Either way: add a multi-page pagination e2e test seeded with enough groups to exceed one page under Balanced.


[MINOR] internal/dedup/handler.go:48 - json.NewDecoder(r.Body) has no http.MaxBytesReader guard

parseScanRequest decodes the JSON body without first wrapping r.Body in http.MaxBytesReader. Risk is low (bounded by server ReadTimeout: 5s); fix with http.MaxBytesReader(w, r.Body, 4096) before decoding.


REVIEW VERDICT: 0 blocker, 1 major, 1 minor

## Security Review - PR #567 (bookshelf-c15a.2) - RE-REVIEW of fixes for comment 6689 **Head SHA:** b14f04d5 **Scope:** re-review of 3 MAJOR + 2 MINOR fixes from prior review comment 6689; adversarial check for new issues introduced by the fix commits. ### Phase 0: Prior findings - verification **[MAJOR #1] ExternalID OR-join silent group destruction** - FIXED. FindDuplicatesByExternalID now uses UNION ALL of two separate sub-selects (hardcover, goodreads), each with its own equi-join. A book with both hardcover_id and goodreads_id appears once per provider. The Scan seen-map key is now bookID+NUL+groupKey, allowing one book to appear in both the hc: and gr: provider groups. Verified in e2e/api/dedup_test.go:738. **[MAJOR #2] Directory/Filename per-row correlated subqueries** - FIXED. Both finders now use derived tables (bf_dir, bf_fname) GROUP BY book_id ahead of the outer join - no per-row correlated subquery. Migration 0037 adds covering index idx_book_file_dedup_dir (book_id, is_book, file_sub_path) allowing index-only scans. **[MAJOR #3] No positive e2e tests for Directory/Filename** - FIXED. seedDirectoryDuplicateLibrary() and seedFilenameDuplicateLibrary() added at e2e/api/dedup_test.go:533 and :585. Positive scan tests verify match_type == "same_directory" and "filename". **[MINOR #1] Checkbox checked attribute vs DOM property** - FIXED. e2e/browser/find_duplicates_test.go:545 now uses titleAuthorCB.MustEval("() => this.checked").Bool(). **[MINOR #2] Double blank line after import block** - FIXED. --- ### Phase 1: New findings in the fix commits --- [MAJOR] internal/dedup/service.go:71-114, store.go:379,493,594 - Multi-criteria pagination cursor: single shared (afterKey, afterID) passed to all finders with incompatible key namespaces; second page silently broken for every preset with 2+ criteria Scan encodes the cursor as a single {AfterKey string, AfterID int64} pair and passes the same values to every active finder. But each finder orders on a different group key namespace: - FindDuplicatesByISBN: afterKey vs isbn_13 (e.g. "9780007525508"), afterID vs grp.min_id - FindDuplicatesByExternalID: afterKey vs "hc:<id>" or "gr:<id>", afterID vs book_id (NOT grp.min_id - inconsistent tiebreaker) - FindDuplicatesByTitleAuthor: afterKey vs normalized "title author", afterID vs grp.min_id - FindDuplicatesByDirectory: afterKey vs directory path, afterID vs grp.min_id - FindDuplicatesByFilename: afterKey vs basename string, afterID vs grp.min_id When page 1 ends on an ISBN group (cursor {AfterKey:"9780007525508", AfterID:42}), page 2 feeds that same cursor to TitleAuthor - which skips any normalized title+author key that sorts before "9780007525508" lexicographically, silently dropping those groups. ExternalID compares an ISBN string against "gr:...","hc:..." keys - a completely different namespace. The afterID tiebreaker inconsistency compounds this: ExternalID uses book_id while every other finder uses grp.min_id, so cursors cross-applied between criteria compare semantically different IDs. The Balanced preset (default: ISBN + ExternalID + TitleAuthor) triggers this on every library large enough to exceed one page. Suggested fix: tag group keys with a criterion prefix ("isbn:9780007525508", "ta:the martian andy weir", "hc:abc", "dir:/fantasy/tolkien", "fn:dune.epub") so a single global cursor can order across all criteria in one unified lexical space and each finder applies a prefix-range WHERE. Alternative: encode per-criterion cursors in ScanCursor so each finder advances independently. Either way: add a multi-page pagination e2e test seeded with enough groups to exceed one page under Balanced. --- [MINOR] internal/dedup/handler.go:48 - json.NewDecoder(r.Body) has no http.MaxBytesReader guard parseScanRequest decodes the JSON body without first wrapping r.Body in http.MaxBytesReader. Risk is low (bounded by server ReadTimeout: 5s); fix with http.MaxBytesReader(w, r.Body, 4096) before decoding. --- REVIEW VERDICT: 0 blocker, 1 major, 1 minor
fix(dedup): per-criterion cursor namespacing, ExternalID tiebreaker, MaxBytesReader, one-Expect-per-It
Some checks failed
/ JS Unit Tests (pull_request) Successful in 27s
/ Lint (pull_request) Successful in 1m20s
/ Test (pull_request) Successful in 2m6s
/ E2E Browser (pull_request) Successful in 3m0s
/ E2E API (pull_request) Failing after 4m51s
/ Integration (pull_request) Successful in 5m3s
aa421a1dd1
Fixes three review findings from PR #567:

[MAJOR] cursor-namespace collision (bookshelf-c15a.2):
- Rename ScanCursor to carry Positions map[criterionLabel]CriterionPos
  so each active finder (isbn, external_id, title_author, same_directory,
  filename) resumes in its own keyspace on page 2+
- Scan() now threads each finder its own (afterKey, afterID) from
  cursor.Positions[cf.label] instead of a single shared cursor
- Next-page cursor encodes the last (groupKey, bookID) observed per
  criterion so subsequent pages resume correctly
- Export CriterionPos type so callers/tests can construct cursors

[MAJOR] ExternalID tiebreaker mismatch (store.go):
- FindDuplicatesByExternalID's outer pagination used `book_id` as the
  tiebreaker but all other finders use `grp.min_id`
- Add min_id to both UNION branches and use `min_id` in the outer
  WHERE/ORDER BY to match the consistent tiebreaker convention

[MINOR] MaxBytesReader in ScanHandler (handler.go):
- Wrap r.Body with http.MaxBytesReader(w, r.Body, 4096) before
  parseScanRequest; oversized bodies now return 413
- Add handler test for the 413 path

[MINOR] one-Expect-per-It (e2e/browser/find_duplicates_test.go):
- Split the 4-assertion It block into three separate Describe blocks
  each with its own JustBeforeEach interaction and single-Expect Its

Also add e2e/api multi-page pagination regression test seeding 3 ISBN
groups and paginating with limit=2 to assert no groups are dropped or
duplicated across pages (regression test for the cursor-namespace bug).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
fix(dedup): paginate at group count, not row count in all finders
All checks were successful
/ JS Unit Tests (pull_request) Successful in 23s
/ Lint (pull_request) Successful in 2m48s
/ Test (pull_request) Successful in 4m14s
/ E2E Browser (pull_request) Successful in 4m32s
/ E2E API (pull_request) Successful in 6m41s
/ Integration (pull_request) Successful in 6m49s
60ab9641d8
The previous SQL used LIMIT on the outer row-producing query. With groups
of N books each, a LIMIT of (page+1) rows only returns a fraction of a
group, causing the service to drop partial groups after groupRowsByKey
filters out single-book results. Pagination then silently skipped groups.

Restructure all five finders (isbn, title_author, external_id, directory,
filename) to move the keyset cursor and LIMIT inside the group subquery.
The outer query now fetches all member books for the paginated groups
without a second LIMIT, so the service always receives complete groups.

Also restructure service.go's cursor computation: when truncating to limit
groups, recompute per-criterion positions from only the rows belonging to
kept groups (via a keptKeys set), not from the peek row beyond the page.

Regression test (e2e/api/dedup_test.go): 3 ISBN dup groups + limit=2
now correctly returns 2 groups on page 1 and 1 group on page 2.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Author
Owner

Security Re-Review — Pass 3 (SHA 60ab9641, PR #567, bead bookshelf-c15a.2)

Prior findings — resolution status

[MAJOR] Cross-criterion cursor namespace collision — RESOLVED.
The new ScanCursor.Positions map[string]CriterionPos gives each active criterion its own (afterKey, afterID). The handler reads userLibraryIDs and builds scopedIDs before the cursor is decoded, so the cursor has no influence on the library scope. Unknown keys in Positions (attacker-injected) are silently ignored via map-lookup — they cannot widen scope or cause a panic. The afterKey/afterID values are passed only as SQL parameters in a keyset WHERE clause; the library-ID filter (WHERE b.library_id IN (?)) is applied independently in every finder. Cross-user read via cursor is not possible.

[MINOR] No http.MaxBytesReader — RESOLVED.
http.MaxBytesReader(w, r.Body, 4096) is applied in ScanHandler before parseScanRequest is called (handler.go:128-130). The error-mapper handles *http.MaxBytesError via errors.As (checked before errors.Is(ErrValidation) in the switch), so an oversized body returns 413 not 400. The handler unit test at handler_test.go:339-344 confirms 413 is returned. ✓


Fresh threat-model

SQL injection — all five finder queries (FindDuplicatesByISBN, FindDuplicatesByExternalID, FindDuplicatesByTitleAuthor, FindDuplicatesByDirectory, FindDuplicatesByFilename) pass every user-influenced value (cursor afterKey/afterID, limit+1) as positional ? parameters. Library IDs are server-computed from the session and passed the same way. The LIKE patterns use %%/%% (Go fmt.Sprintf escaping → %/%) which are string literals, not user-controlled. No injection surface found.

ExternalID UNION ALL arg binding — manually verified: HC branch gets libIn, afterKey, afterKey, afterID, limit+1, libIn; GR branch gets the same pattern; total args = 4N + 8 matching the four %s expansions + eight cursor/limit ?s in the SQL. Both branches are independently scoped by userLibraryIDs. ✓

Resource/DoS — request body capped at 4096 bytes via MaxBytesReader. The cursor query param is bounded by Go's default MaxHeaderBytes (1 MB) and is a server-generated opaque blob with no unbounded sub-structure. Each finder's inner subquery limits to limit+1 groups (clamped to max 200 by clampLimit); the ExternalID combined outer query has no second LIMIT but is bounded by 2×(limit+1) groups, which is documented and handled correctly by the service-layer seen-map. No unbounded scan path found.

Cursor decode safetyDecodeCursor returns a typed error (%w: %w wrapping ErrValidation) on invalid base64 or malformed JSON; no panic path. The Positions map is deserialized into a Go struct with typed fields; oversized maps are bounded by the cursor string length, itself bounded by HTTP headers.

Multi-user fail-closed — every finder has if len(userLibraryIDs) == 0 { return nil, nil } as the first guard. filterToLibrary rejects requests for a library not in the user's accessible set (returns 404). userLibraryIDs` is derived from the authenticated session, never from the request body or cursor.

keptKeys pagination correctness — the seen2 loop correctly breaks on len(seen2) > limit, excluding the peek group from keptKeys. The lastPos update only iterates labeledRows in keptKeys, so the cursor points to the last book of the last returned group, not the discarded peek group. ✓


New findings

No new blockers or majors. One minor style note:

[MINOR] internal/dedup/handler.go:53 — double-wrapped error format %w: %w with ErrValidation first
fmt.Errorf("dedup scan: decode body: %w: %w", middleware.ErrValidation, err) is an unusual multi-error pattern. The statusFor switch checks errors.As(err, &maxBytesErr) before errors.Is(ErrValidation), so the 413 path wins correctly and the test confirms it. The double-colon separator is cosmetic and non-standard but causes no correctness or security issue. The same pattern exists in merge handler, suggesting it is an established codebase convention. No fix required.


REVIEW VERDICT: 0 blocker, 0 major, 1 minor

## Security Re-Review — Pass 3 (SHA 60ab9641, PR #567, bead bookshelf-c15a.2) ### Prior findings — resolution status **[MAJOR] Cross-criterion cursor namespace collision — RESOLVED.** The new `ScanCursor.Positions map[string]CriterionPos` gives each active criterion its own `(afterKey, afterID)`. The handler reads `userLibraryIDs` and builds `scopedIDs` *before* the cursor is decoded, so the cursor has no influence on the library scope. Unknown keys in `Positions` (attacker-injected) are silently ignored via map-lookup — they cannot widen scope or cause a panic. The `afterKey`/`afterID` values are passed only as SQL *parameters* in a keyset WHERE clause; the library-ID filter (`WHERE b.library_id IN (?)`) is applied independently in every finder. Cross-user read via cursor is not possible. **[MINOR] No `http.MaxBytesReader` — RESOLVED.** `http.MaxBytesReader(w, r.Body, 4096)` is applied in `ScanHandler` before `parseScanRequest` is called (`handler.go:128-130`). The error-mapper handles `*http.MaxBytesError` via `errors.As` (checked before `errors.Is(ErrValidation)` in the switch), so an oversized body returns 413 not 400. The handler unit test at `handler_test.go:339-344` confirms 413 is returned. ✓ --- ### Fresh threat-model **SQL injection** — all five finder queries (`FindDuplicatesByISBN`, `FindDuplicatesByExternalID`, `FindDuplicatesByTitleAuthor`, `FindDuplicatesByDirectory`, `FindDuplicatesByFilename`) pass every user-influenced value (cursor `afterKey`/`afterID`, `limit+1`) as positional `?` parameters. Library IDs are server-computed from the session and passed the same way. The `LIKE` patterns use `%%/%%` (Go `fmt.Sprintf` escaping → `%/%`) which are string literals, not user-controlled. No injection surface found. **ExternalID UNION ALL arg binding** — manually verified: HC branch gets `libIn, afterKey, afterKey, afterID, limit+1, libIn`; GR branch gets the same pattern; total args = `4N + 8` matching the four `%s` expansions + eight cursor/limit `?`s in the SQL. Both branches are independently scoped by `userLibraryIDs`. ✓ **Resource/DoS** — request body capped at 4096 bytes via `MaxBytesReader`. The cursor query param is bounded by Go`'s default `MaxHeaderBytes` (1 MB) and is a server-generated opaque blob with no unbounded sub-structure. Each finder`'s inner subquery limits to `limit+1` groups (clamped to max 200 by `clampLimit`); the ExternalID combined outer query has no second LIMIT but is bounded by 2×(limit+1) groups, which is documented and handled correctly by the service-layer seen-map. No unbounded scan path found. **Cursor decode safety** — `DecodeCursor` returns a typed error (`%w: %w` wrapping `ErrValidation`) on invalid base64 or malformed JSON; no panic path. The `Positions` map is deserialized into a Go struct with typed fields; oversized maps are bounded by the cursor string length, itself bounded by HTTP headers. **Multi-user fail-closed** — every finder has `if len(userLibraryIDs) == 0 { return nil, nil }` as the first guard. `filterToLibrary` rejects requests for a library not in the user`'s accessible set (returns 404). `userLibraryIDs` is derived from the authenticated session, never from the request body or cursor. **`keptKeys` pagination correctness** — the seen2 loop correctly breaks on `len(seen2) > limit`, excluding the peek group from `keptKeys`. The `lastPos` update only iterates `labeledRows` in `keptKeys`, so the cursor points to the last book of the last *returned* group, not the discarded peek group. ✓ --- ### New findings No new blockers or majors. One minor style note: [MINOR] internal/dedup/handler.go:53 — double-wrapped error format `%w: %w` with ErrValidation first `fmt.Errorf("dedup scan: decode body: %w: %w", middleware.ErrValidation, err)` is an unusual multi-error pattern. The `statusFor` switch checks `errors.As(err, &maxBytesErr)` before `errors.Is(ErrValidation)`, so the 413 path wins correctly and the test confirms it. The double-colon separator is cosmetic and non-standard but causes no correctness or security issue. The same pattern exists in `merge` handler, suggesting it is an established codebase convention. No fix required. --- REVIEW VERDICT: 0 blocker, 0 major, 1 minor
Author
Owner

Security Re-Review — Pass 3 (SHA 60ab9641, PR #567, bead bookshelf-c15a.2)

Prior findings — resolution status

[MAJOR] Cross-criterion cursor namespace collision — RESOLVED.
The new ScanCursor.Positions map[string]CriterionPos gives each active criterion its own (afterKey, afterID). The handler reads userLibraryIDs and builds scopedIDs before the cursor is decoded, so the cursor has no influence on the library scope. Unknown keys in Positions (attacker-injected) are silently ignored via map-lookup — they cannot widen scope or cause a panic. The afterKey/afterID values are passed only as SQL parameters in a keyset WHERE clause; the library-ID filter (WHERE b.library_id IN (?)) is applied independently in every finder. Cross-user read via cursor is not possible.

[MINOR] No http.MaxBytesReader — RESOLVED.
http.MaxBytesReader(w, r.Body, 4096) is applied in ScanHandler before parseScanRequest is called (handler.go:128-130). The error-mapper handles *http.MaxBytesError via errors.As (checked before errors.Is(ErrValidation) in the switch), so an oversized body returns 413 not 400. The handler unit test at handler_test.go:339-344 confirms 413 is returned.


Fresh threat-model

SQL injection — all five finder queries (FindDuplicatesByISBN, FindDuplicatesByExternalID, FindDuplicatesByTitleAuthor, FindDuplicatesByDirectory, FindDuplicatesByFilename) pass every user-influenced value (cursor afterKey/afterID, limit+1) as positional ? parameters. Library IDs are server-computed from the session and passed the same way. The LIKE patterns use %%/%% (Go fmt.Sprintf escaping to %/%) which are string literals, not user-controlled. No injection surface found.

ExternalID UNION ALL arg binding — manually verified: HC branch gets libIn, afterKey, afterKey, afterID, limit+1, libIn; GR branch gets the same pattern; total args = 4N + 8 matching the four %s expansions + eight cursor/limit ?s in the SQL. Both branches are independently scoped by userLibraryIDs.

Resource/DoS — request body capped at 4096 bytes via MaxBytesReader. The cursor query param is bounded by Go's default MaxHeaderBytes (1 MB) and is a server-generated opaque blob with no unbounded sub-structure. Each finder's inner subquery limits to limit+1 groups (clamped to max 200 by clampLimit); the ExternalID combined outer query has no second LIMIT but is bounded by 2x(limit+1) groups, which is documented and handled correctly by the service-layer seen-map. No unbounded scan path found.

Cursor decode safetyDecodeCursor returns a typed error (wrapping ErrValidation) on invalid base64 or malformed JSON; no panic path. The Positions map is deserialized into a Go struct with typed fields; oversized maps are bounded by the cursor string length, itself bounded by HTTP headers.

Multi-user fail-closed — every finder has if len(userLibraryIDs) == 0 { return nil, nil } as the first guard. filterToLibrary rejects requests for a library not in the user's accessible set (returns 404). userLibraryIDs is derived from the authenticated session, never from the request body or cursor.

keptKeys pagination correctness — the seen2 loop correctly breaks on len(seen2) > limit, excluding the peek group from keptKeys. The lastPos update only iterates labeledRows in keptKeys, so the cursor points to the last book of the last returned group, not the discarded peek group.


New findings

No new blockers or majors. One minor style note:

[MINOR] internal/dedup/handler.go:53 — double-wrapped error format %w: %w with ErrValidation first
fmt.Errorf("dedup scan: decode body: %w: %w", middleware.ErrValidation, err) is an unusual multi-error pattern. The statusFor switch checks errors.As(err, &maxBytesErr) before errors.Is(ErrValidation), so the 413 path wins correctly and the test confirms it. The double-colon separator is cosmetic and non-standard but causes no correctness or security issue. The same pattern exists in the merge handler, suggesting it is an established codebase convention. No fix required.


REVIEW VERDICT: 0 blocker, 0 major, 1 minor

## Security Re-Review — Pass 3 (SHA 60ab9641, PR #567, bead bookshelf-c15a.2) ### Prior findings — resolution status **[MAJOR] Cross-criterion cursor namespace collision — RESOLVED.** The new `ScanCursor.Positions map[string]CriterionPos` gives each active criterion its own `(afterKey, afterID)`. The handler reads `userLibraryIDs` and builds `scopedIDs` *before* the cursor is decoded, so the cursor has no influence on the library scope. Unknown keys in `Positions` (attacker-injected) are silently ignored via map-lookup — they cannot widen scope or cause a panic. The `afterKey`/`afterID` values are passed only as SQL *parameters* in a keyset WHERE clause; the library-ID filter (`WHERE b.library_id IN (?)`) is applied independently in every finder. Cross-user read via cursor is not possible. **[MINOR] No `http.MaxBytesReader` — RESOLVED.** `http.MaxBytesReader(w, r.Body, 4096)` is applied in `ScanHandler` before `parseScanRequest` is called (`handler.go:128-130`). The error-mapper handles `*http.MaxBytesError` via `errors.As` (checked before `errors.Is(ErrValidation)` in the switch), so an oversized body returns 413 not 400. The handler unit test at `handler_test.go:339-344` confirms 413 is returned. --- ### Fresh threat-model **SQL injection** — all five finder queries (`FindDuplicatesByISBN`, `FindDuplicatesByExternalID`, `FindDuplicatesByTitleAuthor`, `FindDuplicatesByDirectory`, `FindDuplicatesByFilename`) pass every user-influenced value (cursor `afterKey`/`afterID`, `limit+1`) as positional `?` parameters. Library IDs are server-computed from the session and passed the same way. The LIKE patterns use `%%/%%` (Go `fmt.Sprintf` escaping to `%/%`) which are string literals, not user-controlled. No injection surface found. **ExternalID UNION ALL arg binding** — manually verified: HC branch gets `libIn, afterKey, afterKey, afterID, limit+1, libIn`; GR branch gets the same pattern; total args = `4N + 8` matching the four `%s` expansions + eight cursor/limit `?`s in the SQL. Both branches are independently scoped by `userLibraryIDs`. **Resource/DoS** — request body capped at 4096 bytes via `MaxBytesReader`. The cursor query param is bounded by Go's default `MaxHeaderBytes` (1 MB) and is a server-generated opaque blob with no unbounded sub-structure. Each finder's inner subquery limits to `limit+1` groups (clamped to max 200 by `clampLimit`); the ExternalID combined outer query has no second LIMIT but is bounded by 2x(limit+1) groups, which is documented and handled correctly by the service-layer seen-map. No unbounded scan path found. **Cursor decode safety** — `DecodeCursor` returns a typed error (wrapping `ErrValidation`) on invalid base64 or malformed JSON; no panic path. The `Positions` map is deserialized into a Go struct with typed fields; oversized maps are bounded by the cursor string length, itself bounded by HTTP headers. **Multi-user fail-closed** — every finder has `if len(userLibraryIDs) == 0 { return nil, nil }` as the first guard. `filterToLibrary` rejects requests for a library not in the user's accessible set (returns 404). `userLibraryIDs` is derived from the authenticated session, never from the request body or cursor. **`keptKeys` pagination correctness** — the seen2 loop correctly breaks on `len(seen2) > limit`, excluding the peek group from `keptKeys`. The `lastPos` update only iterates `labeledRows` in `keptKeys`, so the cursor points to the last book of the last *returned* group, not the discarded peek group. --- ### New findings No new blockers or majors. One minor style note: [MINOR] internal/dedup/handler.go:53 — double-wrapped error format `%w: %w` with ErrValidation first `fmt.Errorf("dedup scan: decode body: %w: %w", middleware.ErrValidation, err)` is an unusual multi-error pattern. The `statusFor` switch checks `errors.As(err, &maxBytesErr)` before `errors.Is(ErrValidation)`, so the 413 path wins correctly and the test confirms it. The double-colon separator is cosmetic and non-standard but causes no correctness or security issue. The same pattern exists in the merge handler, suggesting it is an established codebase convention. No fix required. --- REVIEW VERDICT: 0 blocker, 0 major, 1 minor
Author
Owner

Security Re-Review — Pass 3 (SHA 60ab9641, PR #567, bead bookshelf-c15a.2)

Prior findings — resolution status

[MAJOR] Cross-criterion cursor namespace collision — RESOLVED.
The new ScanCursor.Positions map[string]CriterionPos gives each active criterion its own (afterKey, afterID). The handler reads userLibraryIDs and builds scopedIDs before the cursor is decoded, so the cursor has no influence on the library scope. Unknown keys in Positions (attacker-injected) are silently ignored via map-lookup — they cannot widen scope or cause a panic. The afterKey/afterID values are passed only as SQL parameters in a keyset WHERE clause; the library-ID filter (WHERE b.library_id IN (?)) is applied independently in every finder. Cross-user read via cursor is not possible.

[MINOR] No http.MaxBytesReader — RESOLVED.
http.MaxBytesReader(w, r.Body, 4096) is applied in ScanHandler before parseScanRequest is called (handler.go:128-130). The error-mapper handles *http.MaxBytesError via errors.As (checked before errors.Is(ErrValidation) in the switch), so an oversized body returns 413 not 400. The handler unit test at handler_test.go:339-344 confirms 413 is returned.


Fresh threat-model

SQL injection — all five finder queries (FindDuplicatesByISBN, FindDuplicatesByExternalID, FindDuplicatesByTitleAuthor, FindDuplicatesByDirectory, FindDuplicatesByFilename) pass every user-influenced value (cursor afterKey/afterID, limit+1) as positional ? parameters. Library IDs are server-computed from the session and passed the same way. The LIKE patterns use %%/%% (Go fmt.Sprintf escaping to %/%) which are string literals, not user-controlled. No injection surface found.

ExternalID UNION ALL arg binding — manually verified: HC branch gets libIn, afterKey, afterKey, afterID, limit+1, libIn; GR branch gets the same pattern; total args = 4N + 8 matching the four %s expansions + eight cursor/limit ?s in the SQL. Both branches are independently scoped by userLibraryIDs.

Resource/DoS — request body capped at 4096 bytes via MaxBytesReader. The cursor query param is bounded by Go's default MaxHeaderBytes (1 MB) and is a server-generated opaque blob with no unbounded sub-structure. Each finder's inner subquery limits to limit+1 groups (clamped to max 200 by clampLimit); the ExternalID combined outer query has no second LIMIT but is bounded by 2x(limit+1) groups, which is documented and handled correctly by the service-layer seen-map. No unbounded scan path found.

Cursor decode safetyDecodeCursor returns a typed error (wrapping ErrValidation) on invalid base64 or malformed JSON; no panic path. The Positions map is deserialized into a Go struct with typed fields; oversized maps are bounded by the cursor string length, itself bounded by HTTP headers.

Multi-user fail-closed — every finder has if len(userLibraryIDs) == 0 { return nil, nil } as the first guard. filterToLibrary rejects requests for a library not in the user's accessible set (returns 404). userLibraryIDs is derived from the authenticated session, never from the request body or cursor.

keptKeys pagination correctness — the seen2 loop correctly breaks on len(seen2) > limit, excluding the peek group from keptKeys. The lastPos update only iterates labeledRows in keptKeys, so the cursor points to the last book of the last returned group, not the discarded peek group.


New findings

No new blockers or majors. One minor style note:

[MINOR] internal/dedup/handler.go:53 — double-wrapped error format %w: %w with ErrValidation first
fmt.Errorf("dedup scan: decode body: %w: %w", middleware.ErrValidation, err) is an unusual multi-error pattern. The statusFor switch checks errors.As(err, &maxBytesErr) before errors.Is(ErrValidation), so the 413 path wins correctly and the test confirms it. The double-colon separator is cosmetic and non-standard but causes no correctness or security issue. The same pattern exists in the merge handler, suggesting it is an established codebase convention. No fix required.


REVIEW VERDICT: 0 blocker, 0 major, 1 minor

## Security Re-Review — Pass 3 (SHA 60ab9641, PR #567, bead bookshelf-c15a.2) ### Prior findings — resolution status **[MAJOR] Cross-criterion cursor namespace collision — RESOLVED.** The new `ScanCursor.Positions map[string]CriterionPos` gives each active criterion its own `(afterKey, afterID)`. The handler reads `userLibraryIDs` and builds `scopedIDs` *before* the cursor is decoded, so the cursor has no influence on the library scope. Unknown keys in `Positions` (attacker-injected) are silently ignored via map-lookup — they cannot widen scope or cause a panic. The `afterKey`/`afterID` values are passed only as SQL *parameters* in a keyset WHERE clause; the library-ID filter (`WHERE b.library_id IN (?)`) is applied independently in every finder. Cross-user read via cursor is not possible. **[MINOR] No `http.MaxBytesReader` — RESOLVED.** `http.MaxBytesReader(w, r.Body, 4096)` is applied in `ScanHandler` before `parseScanRequest` is called (`handler.go:128-130`). The error-mapper handles `*http.MaxBytesError` via `errors.As` (checked before `errors.Is(ErrValidation)` in the switch), so an oversized body returns 413 not 400. The handler unit test at `handler_test.go:339-344` confirms 413 is returned. --- ### Fresh threat-model **SQL injection** — all five finder queries (`FindDuplicatesByISBN`, `FindDuplicatesByExternalID`, `FindDuplicatesByTitleAuthor`, `FindDuplicatesByDirectory`, `FindDuplicatesByFilename`) pass every user-influenced value (cursor `afterKey`/`afterID`, `limit+1`) as positional `?` parameters. Library IDs are server-computed from the session and passed the same way. The LIKE patterns use `%%/%%` (Go `fmt.Sprintf` escaping to `%/%`) which are string literals, not user-controlled. No injection surface found. **ExternalID UNION ALL arg binding** — manually verified: HC branch gets `libIn, afterKey, afterKey, afterID, limit+1, libIn`; GR branch gets the same pattern; total args = `4N + 8` matching the four `%s` expansions + eight cursor/limit `?`s in the SQL. Both branches are independently scoped by `userLibraryIDs`. **Resource/DoS** — request body capped at 4096 bytes via `MaxBytesReader`. The cursor query param is bounded by Go's default `MaxHeaderBytes` (1 MB) and is a server-generated opaque blob with no unbounded sub-structure. Each finder's inner subquery limits to `limit+1` groups (clamped to max 200 by `clampLimit`); the ExternalID combined outer query has no second LIMIT but is bounded by 2x(limit+1) groups, which is documented and handled correctly by the service-layer seen-map. No unbounded scan path found. **Cursor decode safety** — `DecodeCursor` returns a typed error (wrapping `ErrValidation`) on invalid base64 or malformed JSON; no panic path. The `Positions` map is deserialized into a Go struct with typed fields; oversized maps are bounded by the cursor string length, itself bounded by HTTP headers. **Multi-user fail-closed** — every finder has `if len(userLibraryIDs) == 0 { return nil, nil }` as the first guard. `filterToLibrary` rejects requests for a library not in the user's accessible set (returns 404). `userLibraryIDs` is derived from the authenticated session, never from the request body or cursor. **`keptKeys` pagination correctness** — the seen2 loop correctly breaks on `len(seen2) > limit`, excluding the peek group from `keptKeys`. The `lastPos` update only iterates `labeledRows` in `keptKeys`, so the cursor points to the last book of the last *returned* group, not the discarded peek group. --- ### New findings No new blockers or majors. One minor style note: [MINOR] internal/dedup/handler.go:53 — double-wrapped error format `%w: %w` with ErrValidation first `fmt.Errorf("dedup scan: decode body: %w: %w", middleware.ErrValidation, err)` is an unusual multi-error pattern. The `statusFor` switch checks `errors.As(err, &maxBytesErr)` before `errors.Is(ErrValidation)`, so the 413 path wins correctly and the test confirms it. The double-colon separator is cosmetic and non-standard but causes no correctness or security issue. The same pattern exists in the merge handler, suggesting it is an established codebase convention. No fix required. --- REVIEW VERDICT: 0 blocker, 0 major, 1 minor
Author
Owner

Security Re-Review — Pass 3 (SHA 60ab9641, PR #567, bead bookshelf-c15a.2)

Prior findings — resolution status

[MAJOR] Cross-criterion cursor namespace collision — RESOLVED.
The new ScanCursor.Positions map[string]CriterionPos gives each active criterion its own (afterKey, afterID). The handler reads userLibraryIDs and builds scopedIDs before the cursor is decoded, so the cursor has no influence on the library scope. Unknown keys in Positions (attacker-injected) are silently ignored via map-lookup — they cannot widen scope or cause a panic. The afterKey/afterID values are passed only as SQL parameters in a keyset WHERE clause; the library-ID filter (WHERE b.library_id IN (?)) is applied independently in every finder. Cross-user read via cursor is not possible.

[MINOR] No http.MaxBytesReader — RESOLVED.
http.MaxBytesReader(w, r.Body, 4096) is applied in ScanHandler before parseScanRequest is called (handler.go:128-130). The error-mapper handles *http.MaxBytesError via errors.As (checked before errors.Is(ErrValidation) in the switch), so an oversized body returns 413 not 400. The handler unit test at handler_test.go:339-344 confirms 413 is returned.


Fresh threat-model

SQL injection — all five finder queries (FindDuplicatesByISBN, FindDuplicatesByExternalID, FindDuplicatesByTitleAuthor, FindDuplicatesByDirectory, FindDuplicatesByFilename) pass every user-influenced value (cursor afterKey/afterID, limit+1) as positional ? parameters. Library IDs are server-computed from the session and passed the same way. The LIKE patterns use %%/%% (Go fmt.Sprintf escaping to %/%) which are string literals, not user-controlled. No injection surface found.

ExternalID UNION ALL arg binding — manually verified: HC branch gets libIn, afterKey, afterKey, afterID, limit+1, libIn; GR branch gets the same pattern; total args = 4N + 8 matching the four %s expansions + eight cursor/limit ?s in the SQL. Both branches are independently scoped by userLibraryIDs.

Resource/DoS — request body capped at 4096 bytes via MaxBytesReader. The cursor query param is bounded by Go's default MaxHeaderBytes (1 MB) and is a server-generated opaque blob with no unbounded sub-structure. Each finder's inner subquery limits to limit+1 groups (clamped to max 200 by clampLimit); the ExternalID combined outer query has no second LIMIT but is bounded by 2x(limit+1) groups, which is documented and handled correctly by the service-layer seen-map. No unbounded scan path found.

Cursor decode safetyDecodeCursor returns a typed error (wrapping ErrValidation) on invalid base64 or malformed JSON; no panic path. The Positions map is deserialized into a Go struct with typed fields; oversized maps are bounded by the cursor string length, itself bounded by HTTP headers.

Multi-user fail-closed — every finder has if len(userLibraryIDs) == 0 { return nil, nil } as the first guard. filterToLibrary rejects requests for a library not in the user's accessible set (returns 404). userLibraryIDs is derived from the authenticated session, never from the request body or cursor.

keptKeys pagination correctness — the seen2 loop correctly breaks on len(seen2) > limit, excluding the peek group from keptKeys. The lastPos update only iterates labeledRows in keptKeys, so the cursor points to the last book of the last returned group, not the discarded peek group.


New findings

No new blockers or majors. One minor style note:

[MINOR] internal/dedup/handler.go:53 — double-wrapped error format %w: %w with ErrValidation first
fmt.Errorf("dedup scan: decode body: %w: %w", middleware.ErrValidation, err) is an unusual multi-error pattern. The statusFor switch checks errors.As(err, &maxBytesErr) before errors.Is(ErrValidation), so the 413 path wins correctly and the test confirms it. The double-colon separator is cosmetic and non-standard but causes no correctness or security issue. The same pattern exists in the merge handler, suggesting it is an established codebase convention. No fix required.


REVIEW VERDICT: 0 blocker, 0 major, 1 minor

## Security Re-Review — Pass 3 (SHA 60ab9641, PR #567, bead bookshelf-c15a.2) ### Prior findings — resolution status **[MAJOR] Cross-criterion cursor namespace collision — RESOLVED.** The new `ScanCursor.Positions map[string]CriterionPos` gives each active criterion its own `(afterKey, afterID)`. The handler reads `userLibraryIDs` and builds `scopedIDs` *before* the cursor is decoded, so the cursor has no influence on the library scope. Unknown keys in `Positions` (attacker-injected) are silently ignored via map-lookup — they cannot widen scope or cause a panic. The `afterKey`/`afterID` values are passed only as SQL *parameters* in a keyset WHERE clause; the library-ID filter (`WHERE b.library_id IN (?)`) is applied independently in every finder. Cross-user read via cursor is not possible. **[MINOR] No `http.MaxBytesReader` — RESOLVED.** `http.MaxBytesReader(w, r.Body, 4096)` is applied in `ScanHandler` before `parseScanRequest` is called (`handler.go:128-130`). The error-mapper handles `*http.MaxBytesError` via `errors.As` (checked before `errors.Is(ErrValidation)` in the switch), so an oversized body returns 413 not 400. The handler unit test at `handler_test.go:339-344` confirms 413 is returned. --- ### Fresh threat-model **SQL injection** — all five finder queries (`FindDuplicatesByISBN`, `FindDuplicatesByExternalID`, `FindDuplicatesByTitleAuthor`, `FindDuplicatesByDirectory`, `FindDuplicatesByFilename`) pass every user-influenced value (cursor `afterKey`/`afterID`, `limit+1`) as positional `?` parameters. Library IDs are server-computed from the session and passed the same way. The LIKE patterns use `%%/%%` (Go `fmt.Sprintf` escaping to `%/%`) which are string literals, not user-controlled. No injection surface found. **ExternalID UNION ALL arg binding** — manually verified: HC branch gets `libIn, afterKey, afterKey, afterID, limit+1, libIn`; GR branch gets the same pattern; total args = `4N + 8` matching the four `%s` expansions + eight cursor/limit `?`s in the SQL. Both branches are independently scoped by `userLibraryIDs`. **Resource/DoS** — request body capped at 4096 bytes via `MaxBytesReader`. The cursor query param is bounded by Go's default `MaxHeaderBytes` (1 MB) and is a server-generated opaque blob with no unbounded sub-structure. Each finder's inner subquery limits to `limit+1` groups (clamped to max 200 by `clampLimit`); the ExternalID combined outer query has no second LIMIT but is bounded by 2x(limit+1) groups, which is documented and handled correctly by the service-layer seen-map. No unbounded scan path found. **Cursor decode safety** — `DecodeCursor` returns a typed error (wrapping `ErrValidation`) on invalid base64 or malformed JSON; no panic path. The `Positions` map is deserialized into a Go struct with typed fields; oversized maps are bounded by the cursor string length, itself bounded by HTTP headers. **Multi-user fail-closed** — every finder has `if len(userLibraryIDs) == 0 { return nil, nil }` as the first guard. `filterToLibrary` rejects requests for a library not in the user's accessible set (returns 404). `userLibraryIDs` is derived from the authenticated session, never from the request body or cursor. **`keptKeys` pagination correctness** — the seen2 loop correctly breaks on `len(seen2) > limit`, excluding the peek group from `keptKeys`. The `lastPos` update only iterates `labeledRows` in `keptKeys`, so the cursor points to the last book of the last *returned* group, not the discarded peek group. --- ### New findings No new blockers or majors. One minor style note: [MINOR] internal/dedup/handler.go:53 — double-wrapped error format `%w: %w` with ErrValidation first `fmt.Errorf("dedup scan: decode body: %w: %w", middleware.ErrValidation, err)` is an unusual multi-error pattern. The `statusFor` switch checks `errors.As(err, &maxBytesErr)` before `errors.Is(ErrValidation)`, so the 413 path wins correctly and the test confirms it. The double-colon separator is cosmetic and non-standard but causes no correctness or security issue. The same pattern exists in the merge handler, suggesting it is an established codebase convention. No fix required. --- REVIEW VERDICT: 0 blocker, 0 major, 1 minor
Author
Owner

CODE REVIEW (3rd pass) — HEAD SHA 60ab9641

Phase 0: DEMO Verification

No DEMO block in the bead. This is a re-review of a fix commit; verifying the fix correctness via code analysis per the review request.


Fix Verification (per the four claims)

(a) Per-criterion ScanCursor — each active finder gets its own {key,id}; absent entry = start from beginning

internal/dedup/service.go lines 113–127: Each criterionFinder resolves its own cursor position from cursor.Positions[cf.label]. When the label is absent (criterion was not in any prior kept page), afterKey="" and afterID=0 — the finder starts from the beginning. Correct.

(b) Group-count LIMIT in the SQL — groups are never split across a page boundary

All five finders now move the HAVING keyset cursor + ORDER BY + LIMIT inside the group subquery (e.g. internal/dedup/store.go lines 105–108 for ISBN, lines 242–246 for TitleAuthor, lines 345–348 for HC, lines 386–389 for GR, lines 505–508 for Directory, lines 609–613 for Filename). The outer query JOINs the group subquery and returns all member rows for each paginated group without an additional LIMIT. This is the correct fix: the DB returns whole groups, not raw rows, so a group is never split. Verified.

(c) Next cursor uses only kept-group rows

internal/dedup/service.go lines 151–186: keptKeys is built by scanning allRows in order and collecting distinct (matchType, groupKey) pairs until len(seen2) > limit. The inner check if len(seen2) <= limit { keptKeys[...] = true } correctly excludes the (limit+1)th group. lastPos is then populated only from labeledRows entries whose MatchType+GroupKey is in keptKeys. This means the cursor reflects the last row of the last returned group, not any peek rows. Verified.

(d) Multi-page e2e regression test non-vacuousness

e2e/api/dedup_test.go lines 908–978: The test seeds 3 ISBN pairs and uses the Balanced preset (isbn + external_id + title_author) with limit=2. It paginates two pages and asserts all 3 ISBN groups appear exactly once across both pages. The test does verify ISBN cursor resumption across pages and would have failed with the old single-cursor bug.

However: the seeded data only produces ISBN groups (books have distinct titles, no external IDs). ExternalID and TitleAuthor finders return empty. This means the test exercises only one active criterion — it does not demonstrate the cross-criterion cursor isolation fix (the specific bug that was fixed: a single shared cursor causing one criterion to skip groups in another criterion's keyspace). The function comment in seedMultiPageDedupLibrary (line 828) incorrectly says "3 ISBN groups and 1 title+author duplicate group" — no title+author pair is seeded.


Findings

[MINOR] e2e/api/dedup_test.go:828 — multi-page test exercises only one active criterion, comment incorrect
The seedMultiPageDedupLibrary comment says "3 ISBN groups and 1 title+author duplicate group" but only inserts ISBN pairs (distinct titles, no shared norm_key). ExternalID and TitleAuthor return empty on every page. The test verifies ISBN pagination, but the cross-criterion cursor-namespace isolation fix — where ISBN cursor must not pollute the TitleAuthor or ExternalID cursor position on the next page — is not covered. The old single-cursor bug would also fail this test (the previous behavior used a single shared {afterKey, afterID} for all criteria, so with only one active criterion the old code would have passed too). To make this a true regression test for the isolation fix, seed at least one TitleAuthor pair alongside the ISBN pairs and assert both appear across the expected pages.

[MINOR] internal/dedup/service.go — Scan function is 132 lines (project metric: <30 lines per function)
The inner anonymous function returned by Scan is the main concern at ~100 lines. This pre-existed the fix but the fix added complexity. Not a blocker; the structure is clear and well-commented. Consider extracting the keptKeys + lastPos computation into a named helper.


Fresh-eyes checks (per review prompt)

  • ORDER BY total-order tiebreaker: Every finder uses ORDER BY <group_key> ASC, MIN(<id>) ASC on the group subquery, and ORDER BY <group_key> ASC, b.id ASC on the outer query. Total order preserved. No flake risk.
  • Multi-user/library fail-closed scoping: All five finders early-return nil, nil when len(userLibraryIDs) == 0. Inner and outer subqueries carry b.library_id IN (?) / b2.library_id IN (?) filters. Preserved.
  • No N+1: All finders use JOIN-based batch queries. No per-row follow-up queries.
  • 413 / MaxBytesReader: http.MaxBytesReader(w, r.Body, 4096) is applied before json.NewDecoder(r.Body).Decode in handler.go lines 129–134. The statusFor error mapper checks errors.As(err, &maxBytesErr) before ErrValidation (error_mapper.go lines 29–31), so 413 is returned correctly even though parseScanRequest wraps the error with ErrValidation. Handler test at handler_test.go:620 ("returns 413 for a JSON body exceeding 4096 bytes") covers this path.
  • 4-assertion browser It split: The original single It was split into separate Describe blocks with one assertion each in find_duplicates_test.go (lines 828–943). Compliant.
  • Migration 0037 index-only: internal/db/migrations/0037_book_file_dedup_covering_index.up.sql adds only an index, no new columns. Grimmory-compatibility rule is satisfied.
  • ExternalID UNION ALL cursor boundary: gr: sorts before hc: lexicographically so all goodreads groups are always delivered before any hardcover group appears in a page. A cursor ending at hc:X cannot cause goodreads groups to be skipped because all gr: groups were already returned. Verified safe.

REVIEW VERDICT: 0 blocker, 0 major, 2 minor

## CODE REVIEW (3rd pass) — HEAD SHA 60ab9641 ### Phase 0: DEMO Verification No DEMO block in the bead. This is a re-review of a fix commit; verifying the fix correctness via code analysis per the review request. --- ### Fix Verification (per the four claims) **(a) Per-criterion ScanCursor — each active finder gets its own {key,id}; absent entry = start from beginning** `internal/dedup/service.go` lines 113–127: Each `criterionFinder` resolves its own cursor position from `cursor.Positions[cf.label]`. When the label is absent (criterion was not in any prior kept page), `afterKey=""` and `afterID=0` — the finder starts from the beginning. Correct. **(b) Group-count LIMIT in the SQL — groups are never split across a page boundary** All five finders now move the HAVING keyset cursor + ORDER BY + LIMIT inside the group subquery (e.g. `internal/dedup/store.go` lines 105–108 for ISBN, lines 242–246 for TitleAuthor, lines 345–348 for HC, lines 386–389 for GR, lines 505–508 for Directory, lines 609–613 for Filename). The outer query JOINs the group subquery and returns all member rows for each paginated group without an additional LIMIT. This is the correct fix: the DB returns whole groups, not raw rows, so a group is never split. Verified. **(c) Next cursor uses only kept-group rows** `internal/dedup/service.go` lines 151–186: `keptKeys` is built by scanning `allRows` in order and collecting distinct `(matchType, groupKey)` pairs until `len(seen2) > limit`. The inner check `if len(seen2) <= limit { keptKeys[...] = true }` correctly excludes the (limit+1)th group. `lastPos` is then populated only from `labeledRows` entries whose `MatchType+GroupKey` is in `keptKeys`. This means the cursor reflects the last row of the last returned group, not any peek rows. Verified. **(d) Multi-page e2e regression test non-vacuousness** `e2e/api/dedup_test.go` lines 908–978: The test seeds 3 ISBN pairs and uses the Balanced preset (isbn + external_id + title_author) with limit=2. It paginates two pages and asserts all 3 ISBN groups appear exactly once across both pages. The test **does** verify ISBN cursor resumption across pages and would have failed with the old single-cursor bug. However: the seeded data only produces ISBN groups (books have distinct titles, no external IDs). ExternalID and TitleAuthor finders return empty. This means the test exercises only one active criterion — it **does not** demonstrate the cross-criterion cursor isolation fix (the specific bug that was fixed: a single shared cursor causing one criterion to skip groups in another criterion's keyspace). The function comment in `seedMultiPageDedupLibrary` (line 828) incorrectly says "3 ISBN groups and 1 title+author duplicate group" — no title+author pair is seeded. --- ### Findings [MINOR] e2e/api/dedup_test.go:828 — multi-page test exercises only one active criterion, comment incorrect The `seedMultiPageDedupLibrary` comment says "3 ISBN groups and 1 title+author duplicate group" but only inserts ISBN pairs (distinct titles, no shared norm_key). ExternalID and TitleAuthor return empty on every page. The test verifies ISBN pagination, but the cross-criterion cursor-namespace isolation fix — where ISBN cursor must not pollute the TitleAuthor or ExternalID cursor position on the next page — is not covered. The old single-cursor bug would also fail this test (the previous behavior used a single shared `{afterKey, afterID}` for all criteria, so with only one active criterion the old code would have passed too). To make this a true regression test for the isolation fix, seed at least one TitleAuthor pair alongside the ISBN pairs and assert both appear across the expected pages. [MINOR] internal/dedup/service.go — Scan function is 132 lines (project metric: <30 lines per function) The inner anonymous function returned by `Scan` is the main concern at ~100 lines. This pre-existed the fix but the fix added complexity. Not a blocker; the structure is clear and well-commented. Consider extracting the keptKeys + lastPos computation into a named helper. --- ### Fresh-eyes checks (per review prompt) - **ORDER BY total-order tiebreaker:** Every finder uses `ORDER BY <group_key> ASC, MIN(<id>) ASC` on the group subquery, and `ORDER BY <group_key> ASC, b.id ASC` on the outer query. Total order preserved. No flake risk. - **Multi-user/library fail-closed scoping:** All five finders early-return `nil, nil` when `len(userLibraryIDs) == 0`. Inner and outer subqueries carry `b.library_id IN (?)` / `b2.library_id IN (?)` filters. Preserved. - **No N+1:** All finders use JOIN-based batch queries. No per-row follow-up queries. - **413 / MaxBytesReader:** `http.MaxBytesReader(w, r.Body, 4096)` is applied before `json.NewDecoder(r.Body).Decode` in `handler.go` lines 129–134. The `statusFor` error mapper checks `errors.As(err, &maxBytesErr)` before `ErrValidation` (error_mapper.go lines 29–31), so 413 is returned correctly even though `parseScanRequest` wraps the error with `ErrValidation`. Handler test at handler_test.go:620 (`"returns 413 for a JSON body exceeding 4096 bytes"`) covers this path. - **4-assertion browser It split:** The original single It was split into separate `Describe` blocks with one assertion each in `find_duplicates_test.go` (lines 828–943). Compliant. - **Migration 0037 index-only:** `internal/db/migrations/0037_book_file_dedup_covering_index.up.sql` adds only an index, no new columns. Grimmory-compatibility rule is satisfied. - **ExternalID UNION ALL cursor boundary:** `gr:` sorts before `hc:` lexicographically so all goodreads groups are always delivered before any hardcover group appears in a page. A cursor ending at `hc:X` cannot cause goodreads groups to be skipped because all `gr:` groups were already returned. Verified safe. --- REVIEW VERDICT: 0 blocker, 0 major, 2 minor
zombor force-pushed bd-bookshelf-c15a.2 from 60ab9641d8
All checks were successful
/ JS Unit Tests (pull_request) Successful in 23s
/ Lint (pull_request) Successful in 2m48s
/ Test (pull_request) Successful in 4m14s
/ E2E Browser (pull_request) Successful in 4m32s
/ E2E API (pull_request) Successful in 6m41s
/ Integration (pull_request) Successful in 6m49s
to 260395ae12
All checks were successful
/ Lint (pull_request) Successful in 2m13s
/ JS Unit Tests (pull_request) Successful in 22s
/ Test (pull_request) Successful in 3m29s
/ E2E Browser (pull_request) Successful in 3m56s
/ Integration (pull_request) Successful in 5m43s
/ E2E API (pull_request) Successful in 4m51s
2026-06-15 18:37:16 +00:00
Compare
zombor merged commit 423183f3ac into main 2026-06-15 18:44:47 +00:00
Sign in to join this conversation.
No reviewers
No labels
No milestone
No project
No assignees
1 participant
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference
zombor/pergamum!567
No description provided.