fix(scripts): distinguish absent vs failed title fetches; gate resume cursor (#236) #246
Merged
… cursor (#236)

fetchWithRateRetry returned the same null for a title that is legitimately absent at a release point (301/302/404/doc-not-found) and for a transient failure (5xx after retries, 429 exhaustion, network error after retries). downloadAndExtractXml logged both as the misleading info "Title not available", processBatch skipped them identically, and processReleasePoint advanced lastCompletedReleasePoint unconditionally, so a network blip permanently dropped a title's update from the historical record and --resume never retried it. The main loop also advanced the cursor on a thrown release point.

- Introduce a TitleFetch discriminated union (ok | absent | failed); fetchWithRateRetry returns a FetchOutcome, downloadAndExtractXml returns TitleFetch. A present-but-corrupt archive / missing XML is 'failed', not 'absent'.
- processBatch counts failedTitles; processReleasePoint only advances the resume cursor when failedTitles === 0 (it still saves manifests so a retry re-imports only the failed title; already-imported titles are no-op deltas).
- The main loop no longer advances the cursor on a thrown release point and exits non-zero when any release point had unrecovered failures, so silent gaps surface instead of looking like a clean run.

Classification verified via tsx (404/302/doc-not-found -> absent; 200 zip -> response; persistent 5xx / network error -> failed). scripts/ is not a workspace package so it has no CI test/typecheck; wiring it in is a follow-up.

Closes #236

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Summary
fetchWithRateRetry returned the same null for a title that is legitimately absent at a release point (301/302/404/doc-not-found) and for a transient failure (5xx after retries, 429 exhaustion, network error after retries). downloadAndExtractXml logged both as the misleading info line "Title not available", processBatch skipped them identically, and processReleasePoint advanced lastCompletedReleasePoint unconditionally, so a network blip during a multi-hour import permanently dropped a title's update from the historical git record, and --resume never retried it. The main loop also advanced the cursor on a thrown release point.
Fix
- TitleFetch = { ok } | { absent } | { failed }. fetchWithRateRetry returns a FetchOutcome; downloadAndExtractXml returns TitleFetch. A present-but-corrupt archive or missing XML entry is failed (worth retrying/surfacing), not absent.
- processBatch counts failedTitles; processReleasePoint advances lastCompletedReleasePoint only when failedTitles === 0. On failure it still saves manifests, so --resume reprocesses the release point and re-imports only the failed title (already-imported titles are no-op deltas: cheap and idempotent).
Verification
Classification verified via tsx (404/302/doc-not-found → absent; 200 ZIP → response; persistent 5xx / network error → failed).
Closes #236
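For readers without the diff open, the shape of the change can be sketched roughly as follows. This is a minimal illustration and not the code in scripts/: the `kind` discriminant, the payload fields, and the helper names `classifyStatus`, `summarizeBatch`, and `nextResumeState` are assumptions made for the example. Only the three-way split (ok | absent | failed), the `failedTitles` counter, and the `failedTitles === 0` gate on `lastCompletedReleasePoint` come from the description above.

```typescript
// Hypothetical shape of the discriminated union; the PR only names the
// three variants, so the `kind` tag and payload fields are assumptions.
type TitleFetch =
  | { kind: "ok"; xml: string }
  | { kind: "absent"; reason: string }
  | { kind: "failed"; reason: string };

// Assumed classification of the final HTTP status, i.e. the status seen
// once any retries have been exhausted. Detecting a "doc-not-found" body
// on an otherwise successful response is omitted from this sketch.
function classifyStatus(status: number): "response" | "absent" | "failed" {
  if (status >= 200 && status < 300) return "response";
  if (status === 301 || status === 302 || status === 404) return "absent";
  return "failed"; // persistent 5xx, 429 exhaustion, anything unexpected
}

interface BatchResult {
  imported: number;
  absentTitles: number;
  failedTitles: number;
}

// Count each outcome separately so that failures are no longer
// indistinguishable from titles that simply do not exist.
function summarizeBatch(outcomes: TitleFetch[]): BatchResult {
  const result: BatchResult = { imported: 0, absentTitles: 0, failedTitles: 0 };
  for (const outcome of outcomes) {
    switch (outcome.kind) {
      case "ok":
        result.imported++;
        break;
      case "absent":
        result.absentTitles++;
        break;
      case "failed":
        result.failedTitles++;
        break;
    }
  }
  return result;
}

interface ResumeState {
  lastCompletedReleasePoint: string | null;
}

// Advance the resume cursor only when nothing failed; otherwise keep the
// previous cursor so a later --resume run revisits this release point.
function nextResumeState(
  state: ResumeState,
  releasePoint: string,
  batch: BatchResult,
): ResumeState {
  return batch.failedTitles === 0
    ? { lastCompletedReleasePoint: releasePoint }
    : state;
}
```

Under these assumptions, a batch containing only ok and absent outcomes moves the cursor forward, while a single failed outcome leaves it where it was. The caller can then also use a non-zero `failedTitles` to decide on a non-zero exit code, as the commit message describes for the main loop.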