Skip to content

[lake] Fix DV readable snapshot scanning and premature snapshot deletion#3519

Open
luoyuxia wants to merge 2 commits into
apache:mainfrom
luoyuxia:dv-snapshot-scan-options
Open

[lake] Fix DV readable snapshot scanning and premature snapshot deletion#3519
luoyuxia wants to merge 2 commits into
apache:mainfrom
luoyuxia:dv-snapshot-scan-options

Conversation

@luoyuxia

@luoyuxia luoyuxia commented Jun 24, 2026

Copy link
Copy Markdown
Contributor

Purpose

Linked issue: close #xxx

For Paimon deletion-vector (DV) enabled tables, DvTableReadableSnapshotRetriever.getReadableSnapshotAndOffsets computes, for a just-committed APPEND (tiered) snapshot, the readable snapshot + per-bucket readable offsets, and an earliestSnapshotIdToKeep telling Fluss which older lake snapshots are safe to delete. This PR fixes two correctness bugs in that path.

Bug 1 - scanning the wrong snapshot (ineffective options).
getBucketsWithoutL0AndWithL0(snapshot) decided which buckets have L0 by scanning via fileStoreTable.copy(scanOptions).store().newScan(), passing SCAN_SNAPSHOT_ID and BATCH_SCAN_MODE as table options. Those options are only consumed by table-level scans (DataTableBatchScan), not by store-level scans (AbstractFileStoreScan), so the scan always hit the latest snapshot instead of the requested compacted snapshot (ManifestsReader falls back to snapshotManager.latestSnapshot() when no snapshot is specified). Fixed by using the direct .withSnapshot(snapshot) API, consistent with getBucketsWithFlushedL0 / PaimonDvTableUtils.

Bug 2 - premature snapshot deletion (over-aggressive retention).
When the latest compacted snapshot had no L0 in any bucket, the code shortcut earliestSnapshotIdToKeep to that compacted snapshot's previous APPEND, assuming nothing earlier was needed. This is unsound: a bucket can be clean (no L0) in the current compacted snapshot yet still be anchored to an older snapshot - it was flushed earlier and has not been flushed since, so its base anchor (the previous APPEND of the latest snapshot that exactly holds its most-recently flushed L0) can be older than the compacted snapshot's previous APPEND. Once such a bucket later receives new L0, recomputation traces back to that older anchor; if it was already deleted, the retrieval returns null and readable offsets can no longer advance.

Concrete example (the scenario exercised by testGetReadableSnapshotAndOffsets, using Paimon snapshot ids; focus on bucket0):

  1. bucket0 is compacted at snapshot 7, which flushes its L0 into base. Its base anchor is snapshot 6 (the latest snapshot still holding that flushed L0; an APPEND, so it is the anchor directly).
  2. By snapshot 9 (the latest COMPACT at that point) all three buckets are fully compacted - no L0 in any bucket. The old code therefore set earliestSnapshotIdToKeep = snapshot 9's previous APPEND = snapshot 8, so Fluss deleted every lake snapshot < 8, including snapshot 6.
  3. bucket0 then receives new appends (new L0) and a later compaction produces snapshot 11. Now bucket0 has L0 again, so recomputation traces back to its most recent flush (snapshot 7) and needs its base anchor snapshot 6 to read the offset - but snapshot 6 was already deleted in step 2. getReadableSnapshotAndOffsets returns null, and readable offsets can no longer advance.

Bug 2 was masked by Bug 1: scanning the latest (appended) snapshot made freshly-appended buckets look like "with L0", so they went through the per-bucket flush traversal that retained the old anchors. Fixing the scan exposed the retention bug.

Brief change log

  • Scan the requested compacted snapshot via FileStoreScan.withSnapshot(snapshot) in getBucketsWithoutL0AndWithL0 instead of relying on table options that the store-level scan ignores.
  • Compute the base anchor for buckets without L0 as well, and lower earliestSnapshotIdToKeep to the minimum anchor across all buckets (a bucket's anchor only moves forward when it is flushed again, so until then it must be retained).
    • Best-effort: if a bucket's flush history has expired and its anchor cannot be determined, fall back to KEEP_ALL_PREVIOUS (keep everything) rather than risk deleting a still-needed snapshot. This never blocks the readable-offset advance, since these buckets' offsets are already resolved from the latest tiered snapshot.
  • Extract the shared anchor computation into findBaseAnchorAppendSnapshot, reused by both the with-L0 offset traversal and the new no-L0 retention pass.

Tests

  • DvTableReadableSnapshotRetrieverTest#testGetReadableSnapshotAndOffsets (non-partitioned) and #testGetReadableSnapshotAndOffsetsForPartitionedTable (partitioned) - both run against real Paimon tables and now pass with the corrected scan and retention logic.
  • The partitioned test's earliestSnapshotIdToKeep expectation, which encoded the old over-aggressive value, is updated to the correct minimum-anchor value (with an explanatory comment).

API and Format

No public API or storage format change.

Documentation

No documentation change; behavior-only fix.


Generative AI disclosure: Yes - authored with Claude Code (Claude Opus 4.8). Reviewed by a human developer.

🤖 Generated with Claude Code

luoyuxia and others added 2 commits June 24, 2026 12:02
…ue to ineffective options

getBucketsWithoutL0AndWithL0() passed SCAN_SNAPSHOT_ID and BATCH_SCAN_MODE
via table options to store().newScan(), but these options are only consumed
by table-level scans (DataTableBatchScan), not by store-level scans
(AbstractFileStoreScan). This means the scan always hit the latest snapshot
instead of the specified one.

Fix by using the direct .withSnapshot(snapshot) API on FileStoreScan,
consistent with getBucketsWithFlushedL0() in the same file.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
For Paimon DV tables, getReadableSnapshotAndOffsets returns an
earliestSnapshotIdToKeep telling Fluss which lake snapshots may be
deleted. When the latest compacted snapshot had no L0 in any bucket, the
code shortcut earliestSnapshotIdToKeep to that compacted snapshot's
previous APPEND, assuming nothing earlier was needed.

This was unsound: a bucket can be clean (no L0) in the current compacted
snapshot yet still be anchored to an older snapshot - it was flushed
earlier and has not been flushed since, so its base anchor (the previous
APPEND of the latest snapshot that exactly holds its most-recently
flushed L0) can be older than the compacted snapshot's previous APPEND.
Once such a bucket later receives new L0, recomputation traces back to
that older anchor; if it was already deleted, the retrieval returns null
and readable offsets can no longer advance.

Fix by also computing the base anchor for buckets without L0 and lowering
earliestSnapshotIdToKeep to the minimum anchor across all buckets. This
is best-effort: if a bucket's flush history has expired and its anchor
cannot be determined, keep all previous snapshots rather than risk
deleting a needed one; it never blocks the readable-offset advance.

The shared anchor computation is extracted into findBaseAnchorAppendSnapshot,
reused by both the with-L0 offset traversal and the new no-L0 retention
pass. The partitioned test expectation that encoded the old, over-aggressive
value is updated accordingly.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

@fresh-borzoni fresh-borzoni left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@luoyuxia Thank you, looks good, left minor comments, PTAL

allAnchorsResolved = false;
}

return allAnchorsResolved ? earliestSnapshotIdToKeep : LakeCommitResult.KEEP_ALL_PREVIOUS;

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If one bucket's flush history has aged out, this returns KEEP_ALL_PREVIOUS and throws away the bound the other buckets already gave us - so retention never shrinks while that bucket stays cold.

The unresolvable anchor is by definition older than earliestSnapshotId (already gone from Paimon), so keep-all can't recover it anyway, flooring at earliestSnapshotId is just as safe and lets retention track Paimon's expiry.

WDYT?

continue;
}
// The first flush encountered going backwards is the bucket's most recent flush.
for (PaimonPartitionBucket partitionBucket : getBucketsWithFlushedL0(currentSnapshot)) {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This re-walks the same snapshots as the with-L0 loop above, so getBucketsWithFlushedL0 scans manifests twice per compacted snapshot. Could it fold into that loop: one pass, offsets for the with-L0 buckets, anchor-only for the rest?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants