[lake] Fix DV readable snapshot scanning and premature snapshot deletion by luoyuxia · Pull Request #3519 · apache/fluss

luoyuxia · 2026-06-24T07:28:14Z

Purpose

Linked issue: close #xxx

For Paimon deletion-vector (DV) enabled tables, DvTableReadableSnapshotRetriever.getReadableSnapshotAndOffsets computes, for a just-committed APPEND (tiered) snapshot, the readable snapshot + per-bucket readable offsets, and an earliestSnapshotIdToKeep telling Fluss which older lake snapshots are safe to delete. This PR fixes two correctness bugs in that path.

Bug 1 - scanning the wrong snapshot (ineffective options).
getBucketsWithoutL0AndWithL0(snapshot) decided which buckets have L0 by scanning via fileStoreTable.copy(scanOptions).store().newScan(), passing SCAN_SNAPSHOT_ID and BATCH_SCAN_MODE as table options. Those options are only consumed by table-level scans (DataTableBatchScan), not by store-level scans (AbstractFileStoreScan), so the scan always hit the latest snapshot instead of the requested compacted snapshot (ManifestsReader falls back to snapshotManager.latestSnapshot() when no snapshot is specified). Fixed by using the direct .withSnapshot(snapshot) API, consistent with getBucketsWithFlushedL0 / PaimonDvTableUtils.

Bug 2 - premature snapshot deletion (over-aggressive retention).
When the latest compacted snapshot had no L0 in any bucket, the code shortcut earliestSnapshotIdToKeep to that compacted snapshot's previous APPEND, assuming nothing earlier was needed. This is unsound: a bucket can be clean (no L0) in the current compacted snapshot yet still be anchored to an older snapshot - it was flushed earlier and has not been flushed since, so its base anchor (the previous APPEND of the latest snapshot that exactly holds its most-recently flushed L0) can be older than the compacted snapshot's previous APPEND. Once such a bucket later receives new L0, recomputation traces back to that older anchor; if it was already deleted, the retrieval returns null and readable offsets can no longer advance.

Concrete example (the scenario exercised by testGetReadableSnapshotAndOffsets, using Paimon snapshot ids; focus on bucket0):

bucket0 is compacted at snapshot 7, which flushes its L0 into base. Its base anchor is snapshot 6 (the latest snapshot still holding that flushed L0; an APPEND, so it is the anchor directly).
By snapshot 9 (the latest COMPACT at that point) all three buckets are fully compacted - no L0 in any bucket. The old code therefore set earliestSnapshotIdToKeep = snapshot 9's previous APPEND = snapshot 8, so Fluss deleted every lake snapshot < 8, including snapshot 6.
bucket0 then receives new appends (new L0) and a later compaction produces snapshot 11. Now bucket0 has L0 again, so recomputation traces back to its most recent flush (snapshot 7) and needs its base anchor snapshot 6 to read the offset - but snapshot 6 was already deleted in step 2. getReadableSnapshotAndOffsets returns null, and readable offsets can no longer advance.

Bug 2 was masked by Bug 1: scanning the latest (appended) snapshot made freshly-appended buckets look like "with L0", so they went through the per-bucket flush traversal that retained the old anchors. Fixing the scan exposed the retention bug.

Brief change log

Scan the requested compacted snapshot via FileStoreScan.withSnapshot(snapshot) in getBucketsWithoutL0AndWithL0 instead of relying on table options that the store-level scan ignores.
Compute the base anchor for buckets without L0 as well, and lower earliestSnapshotIdToKeep to the minimum anchor across all buckets (a bucket's anchor only moves forward when it is flushed again, so until then it must be retained).
- Best-effort: if a bucket's flush history has expired and its anchor cannot be determined, fall back to KEEP_ALL_PREVIOUS (keep everything) rather than risk deleting a still-needed snapshot. This never blocks the readable-offset advance, since these buckets' offsets are already resolved from the latest tiered snapshot.
Extract the shared anchor computation into findBaseAnchorAppendSnapshot, reused by both the with-L0 offset traversal and the new no-L0 retention pass.

Tests

DvTableReadableSnapshotRetrieverTest#testGetReadableSnapshotAndOffsets (non-partitioned) and #testGetReadableSnapshotAndOffsetsForPartitionedTable (partitioned) - both run against real Paimon tables and now pass with the corrected scan and retention logic.
The partitioned test's earliestSnapshotIdToKeep expectation, which encoded the old over-aggressive value, is updated to the correct minimum-anchor value (with an explanatory comment).

API and Format

No public API or storage format change.

Documentation

No documentation change; behavior-only fix.

Generative AI disclosure: Yes - authored with Claude Code (Claude Opus 4.8). Reviewed by a human developer.

🤖 Generated with Claude Code

…ue to ineffective options getBucketsWithoutL0AndWithL0() passed SCAN_SNAPSHOT_ID and BATCH_SCAN_MODE via table options to store().newScan(), but these options are only consumed by table-level scans (DataTableBatchScan), not by store-level scans (AbstractFileStoreScan). This means the scan always hit the latest snapshot instead of the specified one. Fix by using the direct .withSnapshot(snapshot) API on FileStoreScan, consistent with getBucketsWithFlushedL0() in the same file. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

For Paimon DV tables, getReadableSnapshotAndOffsets returns an earliestSnapshotIdToKeep telling Fluss which lake snapshots may be deleted. When the latest compacted snapshot had no L0 in any bucket, the code shortcut earliestSnapshotIdToKeep to that compacted snapshot's previous APPEND, assuming nothing earlier was needed. This was unsound: a bucket can be clean (no L0) in the current compacted snapshot yet still be anchored to an older snapshot - it was flushed earlier and has not been flushed since, so its base anchor (the previous APPEND of the latest snapshot that exactly holds its most-recently flushed L0) can be older than the compacted snapshot's previous APPEND. Once such a bucket later receives new L0, recomputation traces back to that older anchor; if it was already deleted, the retrieval returns null and readable offsets can no longer advance. Fix by also computing the base anchor for buckets without L0 and lowering earliestSnapshotIdToKeep to the minimum anchor across all buckets. This is best-effort: if a bucket's flush history has expired and its anchor cannot be determined, keep all previous snapshots rather than risk deleting a needed one; it never blocks the readable-offset advance. The shared anchor computation is extracted into findBaseAnchorAppendSnapshot, reused by both the with-L0 offset traversal and the new no-L0 retention pass. The partitioned test expectation that encoded the old, over-aggressive value is updated accordingly. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

fresh-borzoni

@luoyuxia Thank you, looks good, left minor comments, PTAL

fresh-borzoni · 2026-06-25T10:06:32Z

+            allAnchorsResolved = false;
+        }
+
+        return allAnchorsResolved ? earliestSnapshotIdToKeep : LakeCommitResult.KEEP_ALL_PREVIOUS;


If one bucket's flush history has aged out, this returns KEEP_ALL_PREVIOUS and throws away the bound the other buckets already gave us - so retention never shrinks while that bucket stays cold.

The unresolvable anchor is by definition older than earliestSnapshotId (already gone from Paimon), so keep-all can't recover it anyway, flooring at earliestSnapshotId is just as safe and lets retention track Paimon's expiry.

WDYT?

fresh-borzoni · 2026-06-25T10:09:50Z

+                continue;
+            }
+            // The first flush encountered going backwards is the bucket's most recent flush.
+            for (PaimonPartitionBucket partitionBucket : getBucketsWithFlushedL0(currentSnapshot)) {


This re-walks the same snapshots as the with-L0 loop above, so getBucketsWithFlushedL0 scans manifests twice per compacted snapshot. Could it fold into that loop: one pass, offsets for the with-L0 buckets, anchor-only for the rest?

luoyuxia and others added 2 commits June 24, 2026 12:02

fresh-borzoni reviewed Jun 25, 2026

View reviewed changes

fresh-borzoni mentioned this pull request Jun 25, 2026

[lake] Fix DvTableReadableSnapshotRetriever scanning wrong snapshot #3396

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[lake] Fix DV readable snapshot scanning and premature snapshot deletion#3519

[lake] Fix DV readable snapshot scanning and premature snapshot deletion#3519
luoyuxia wants to merge 2 commits into
apache:mainfrom
luoyuxia:dv-snapshot-scan-options

luoyuxia commented Jun 24, 2026 •

edited

Loading

Uh oh!

fresh-borzoni left a comment

Uh oh!

fresh-borzoni Jun 25, 2026

Uh oh!

fresh-borzoni Jun 25, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

luoyuxia commented Jun 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Purpose

Brief change log

Tests

API and Format

Documentation

Uh oh!

fresh-borzoni left a comment

Choose a reason for hiding this comment

Uh oh!

fresh-borzoni Jun 25, 2026

Choose a reason for hiding this comment

Uh oh!

fresh-borzoni Jun 25, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

luoyuxia commented Jun 24, 2026 •

edited

Loading