feat: Add Native Support for In-Memory Cache by pchintar · Pull Request #4591 · apache/datafusion-comet

pchintar · 2026-06-04T13:56:13Z

Which issue does this PR close?

Closes #2391 .

Rationale for this change

Comet currently has limited support for Spark's in-memory cache.

When a table is cached and later read, the cached data cannot be consumed directly by Comet operators. Instead, the execution plan falls back to Spark's cache scan path and introduces an additional CometSparkColumnarToColumnar conversion before execution can continue in Comet.

This extra conversion adds overhead to cached table scans and prevents cached data from remaining on a native Comet execution path.

This PR adds native support for in-memory cached tables so that cached data written in a Comet-compatible format can be read directly by Comet operators.

What changes are included in this PR?

This PR introduces a native cache path for in-memory cached tables behind a new configuration:

spark.comet.exec.inMemoryCache.enabled

When enabled:

Cached data is stored using a Comet-specific cache serializer.
Cached data is represented as CometCachedBatch.
Cached tables are scanned using CometInMemoryTableScanExec.
Cached data can be consumed directly by Comet operators without introducing a CometSparkColumnarToColumnar conversion.

When disabled:

Spark's existing cache serializer continues to be used.
Existing cache scan behavior is preserved.

How are these changes tested?

Added CometInMemoryCacheSuite covering:

Comet-native cache scan over CometCachedBatch
Fallback behavior when native cache support is disabled
Multi-partition cached tables
Empty cached tables
Projection-only cache reads
Shuffle execution after cached table scans

Verified with:

./mvnw -pl spark -DskipTests test-compile

./mvnw test -pl spark \
  -DwildcardSuites=org.apache.comet.exec.CometInMemoryCacheSuite

pchintar · 2026-06-04T13:59:22Z

cc @andygrove @mbutrovich

andygrove · 2026-06-04T14:15:42Z

Thanks @pchintar. I will compare this to #4569 today

pchintar · 2026-06-10T11:59:40Z

Thanks @pchintar. I will compare this to #4569 today

Hi @andygrove Could you kindly, if possible, provide any update on this? thnx.

andygrove · 2026-06-10T14:38:16Z

Thanks @pchintar. I will compare this to #4569 today

Hi @andygrove Could you kindly, if possible, provide any update on this? thnx.

Hi @pchintar. I haven't had time to review yet, but I will. I am working on some more urgent items for the 0.17.0 release currently.

Unfortunately we have limited review bandwidth.

andygrove · 2026-06-19T21:48:59Z

Comparison of #4569 and #4591

These two PRs both close #2391 and take the same fundamental approach, so cross-linking a comparison here for visibility.

Shared goal and mechanics

Both solve the same problem: Comet does not treat InMemoryTableScanExec as native, so it inserts a CometSparkToColumnarExec and pays a JVM-to-Arrow conversion on every cached read. Both fix this by storing the cache as compressed Arrow IPC once at build time, so repeated scans feed native execution directly.

Both share the same core building blocks:

A custom CachedBatchSerializer that encodes batches with Comet's existing serializeBatches / decodeBatches (compressed Arrow IPC).
A CometCachedBatch payload holding the IPC bytes.
Installation of the serializer via CometDriverPlugin at startup.
A new config, off by default.
Decode back to CometVector-backed ColumnarBatch with column pruning, plus an InternalRow fallback for non-Comet consumers.
Roughly the same size and the same test layout (a serializer suite plus a plugin/exec suite).

Key differences

Dimension	#4569	#4591
Serializer base class	`SimpleMetricsCachedBatchSerializer`	plain `CachedBatchSerializer`
Batch-level pruning	Yes: stores a Spark-format per-column stats row (min/max/null/count), so `buildFilter` prunes batches	No: `stats = InternalRow.empty`, `buildFilter` is a pass-through no-op
How the scan stays native	No new operator. Reuses `CometSparkToColumnarExec` with a passthrough fast-path: if columns are already `CometVector`, forward without re-copy (adds `numPassthroughBatches` metric)	New operator `CometInMemoryTableScanExec` (a `CometExec` / `LeafExecNode`) plus a `CometOperatorSerde`, wired into `CometExecRule`'s `nativeExecs` map
Unsupported types	Delegates to `DefaultCachedBatchSerializer` (nested/complex) as an explicit drop-in	Delegates to `DefaultCachedBatchSerializer` by inspecting batch class in the convert methods
Serializer install policy	Sets `spark.sql.cache.serializer` only when enabled, and never overrides a user-provided non-default serializer	Always registers the serializer; the serializer decides at runtime whether to use Comet format or delegate
Config	`spark.comet.cache.serializer.enabled`	`spark.comet.exec.inMemoryCache.enabled` (EXEC category)
Plan-rule changes	Minimal (works through the existing SparkToColumnar path)	Adds an explicit `InMemoryTableScanExec` case with detailed fallback-reason messages (disabled / wrong-batch-class / empty-buffer)

Architectural distinction

The main difference is the integration strategy:

feat: add Comet CachedBatchSerializer for native in-memory cache #4569 is serializer-centric. It does the minimum on the plan-rule side and leans on the existing CometSparkToColumnarExec, teaching it to skip the copy when batches are already Arrow. It also invests in stats so filter pushdown prunes cached batches, and is careful to respect a user-set serializer.
feat: Add Native Support for In-Memory Cache #4591 is operator-centric. It introduces a dedicated CometInMemoryTableScanExec node and serde, giving cached scans a first-class place in the Comet operator framework with explicit fallback reasons. It currently skips column statistics, so cached filters do not get batch pruning.

Suggested path forward

A strong combined outcome would pair #4591's dedicated scan operator and explicit fallback reasons with #4569's stats-based buildFilter, passthrough fast-path, and respect-user-serializer install logic.

Add Comet-native in-memory cache scan support

d87014a

pchintar changed the title ~~Add native support for in-memory cache~~ feat: Add native support for in-memory cache Jun 4, 2026

pchintar changed the title ~~feat: Add native support for in-memory cache~~ feat: Add Native Support for In-Memory Cache Jun 4, 2026

pchintar mentioned this pull request Jun 4, 2026

Explore options for accelerating InMemoryTableScanExec #2391

Open

mbutrovich self-requested a review June 4, 2026 14:51

Merge branch 'main' into comet-native-in-memory-cache

502d5e5

andygrove added this to the 1.0.0 milestone Jun 11, 2026

andygrove mentioned this pull request Jun 19, 2026

feat: add Comet CachedBatchSerializer for native in-memory cache #4569

Draft

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: Add Native Support for In-Memory Cache#4591

feat: Add Native Support for In-Memory Cache#4591
pchintar wants to merge 2 commits into
apache:mainfrom
pchintar:comet-native-in-memory-cache

pchintar commented Jun 4, 2026 •

edited

Loading

Uh oh!

pchintar commented Jun 4, 2026

Uh oh!

andygrove commented Jun 4, 2026

Uh oh!

pchintar commented Jun 10, 2026

Uh oh!

andygrove commented Jun 10, 2026

Uh oh!

andygrove commented Jun 19, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

pchintar commented Jun 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

How are these changes tested?

Uh oh!

pchintar commented Jun 4, 2026

Uh oh!

andygrove commented Jun 4, 2026

Uh oh!

pchintar commented Jun 10, 2026

Uh oh!

andygrove commented Jun 10, 2026

Uh oh!

andygrove commented Jun 19, 2026

Comparison of #4569 and #4591

Shared goal and mechanics

Key differences

Architectural distinction

Suggested path forward

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

pchintar commented Jun 4, 2026 •

edited

Loading