Skip to content

feat: Add Native Support for In-Memory Cache#4591

Open
pchintar wants to merge 2 commits into
apache:mainfrom
pchintar:comet-native-in-memory-cache
Open

feat: Add Native Support for In-Memory Cache#4591
pchintar wants to merge 2 commits into
apache:mainfrom
pchintar:comet-native-in-memory-cache

Conversation

@pchintar

@pchintar pchintar commented Jun 4, 2026

Copy link
Copy Markdown

Which issue does this PR close?

Closes #2391 .

Rationale for this change

Comet currently has limited support for Spark's in-memory cache.

When a table is cached and later read, the cached data cannot be consumed directly by Comet operators. Instead, the execution plan falls back to Spark's cache scan path and introduces an additional CometSparkColumnarToColumnar conversion before execution can continue in Comet.

This extra conversion adds overhead to cached table scans and prevents cached data from remaining on a native Comet execution path.

This PR adds native support for in-memory cached tables so that cached data written in a Comet-compatible format can be read directly by Comet operators.

What changes are included in this PR?

This PR introduces a native cache path for in-memory cached tables behind a new configuration:

spark.comet.exec.inMemoryCache.enabled

When enabled:

  • Cached data is stored using a Comet-specific cache serializer.
  • Cached data is represented as CometCachedBatch.
  • Cached tables are scanned using CometInMemoryTableScanExec.
  • Cached data can be consumed directly by Comet operators without introducing a CometSparkColumnarToColumnar conversion.

When disabled:

  • Spark's existing cache serializer continues to be used.
  • Existing cache scan behavior is preserved.

How are these changes tested?

Added CometInMemoryCacheSuite covering:

  • Comet-native cache scan over CometCachedBatch
  • Fallback behavior when native cache support is disabled
  • Multi-partition cached tables
  • Empty cached tables
  • Projection-only cache reads
  • Shuffle execution after cached table scans

Verified with:

./mvnw -pl spark -DskipTests test-compile

./mvnw test -pl spark \
  -DwildcardSuites=org.apache.comet.exec.CometInMemoryCacheSuite

@pchintar pchintar changed the title Add native support for in-memory cache feat: Add native support for in-memory cache Jun 4, 2026
@pchintar pchintar changed the title feat: Add native support for in-memory cache feat: Add Native Support for In-Memory Cache Jun 4, 2026
@pchintar

pchintar commented Jun 4, 2026

Copy link
Copy Markdown
Author

cc @andygrove @mbutrovich

@andygrove

Copy link
Copy Markdown
Member

Thanks @pchintar. I will compare this to #4569 today

@mbutrovich mbutrovich self-requested a review June 4, 2026 14:51
@pchintar

Copy link
Copy Markdown
Author

Thanks @pchintar. I will compare this to #4569 today

Hi @andygrove Could you kindly, if possible, provide any update on this? thnx.

@andygrove

Copy link
Copy Markdown
Member

Thanks @pchintar. I will compare this to #4569 today

Hi @andygrove Could you kindly, if possible, provide any update on this? thnx.

Hi @pchintar. I haven't had time to review yet, but I will. I am working on some more urgent items for the 0.17.0 release currently.

Unfortunately we have limited review bandwidth.

@andygrove andygrove added this to the 1.0.0 milestone Jun 11, 2026
@andygrove

Copy link
Copy Markdown
Member

Comparison of #4569 and #4591

These two PRs both close #2391 and take the same fundamental approach, so cross-linking a comparison here for visibility.

Shared goal and mechanics

Both solve the same problem: Comet does not treat InMemoryTableScanExec as native, so it inserts a CometSparkToColumnarExec and pays a JVM-to-Arrow conversion on every cached read. Both fix this by storing the cache as compressed Arrow IPC once at build time, so repeated scans feed native execution directly.

Both share the same core building blocks:

  • A custom CachedBatchSerializer that encodes batches with Comet's existing serializeBatches / decodeBatches (compressed Arrow IPC).
  • A CometCachedBatch payload holding the IPC bytes.
  • Installation of the serializer via CometDriverPlugin at startup.
  • A new config, off by default.
  • Decode back to CometVector-backed ColumnarBatch with column pruning, plus an InternalRow fallback for non-Comet consumers.
  • Roughly the same size and the same test layout (a serializer suite plus a plugin/exec suite).

Key differences

Dimension #4569 #4591
Serializer base class SimpleMetricsCachedBatchSerializer plain CachedBatchSerializer
Batch-level pruning Yes: stores a Spark-format per-column stats row (min/max/null/count), so buildFilter prunes batches No: stats = InternalRow.empty, buildFilter is a pass-through no-op
How the scan stays native No new operator. Reuses CometSparkToColumnarExec with a passthrough fast-path: if columns are already CometVector, forward without re-copy (adds numPassthroughBatches metric) New operator CometInMemoryTableScanExec (a CometExec / LeafExecNode) plus a CometOperatorSerde, wired into CometExecRule's nativeExecs map
Unsupported types Delegates to DefaultCachedBatchSerializer (nested/complex) as an explicit drop-in Delegates to DefaultCachedBatchSerializer by inspecting batch class in the convert methods
Serializer install policy Sets spark.sql.cache.serializer only when enabled, and never overrides a user-provided non-default serializer Always registers the serializer; the serializer decides at runtime whether to use Comet format or delegate
Config spark.comet.cache.serializer.enabled spark.comet.exec.inMemoryCache.enabled (EXEC category)
Plan-rule changes Minimal (works through the existing SparkToColumnar path) Adds an explicit InMemoryTableScanExec case with detailed fallback-reason messages (disabled / wrong-batch-class / empty-buffer)

Architectural distinction

The main difference is the integration strategy:

  • feat: add Comet CachedBatchSerializer for native in-memory cache #4569 is serializer-centric. It does the minimum on the plan-rule side and leans on the existing CometSparkToColumnarExec, teaching it to skip the copy when batches are already Arrow. It also invests in stats so filter pushdown prunes cached batches, and is careful to respect a user-set serializer.
  • feat: Add Native Support for In-Memory Cache #4591 is operator-centric. It introduces a dedicated CometInMemoryTableScanExec node and serde, giving cached scans a first-class place in the Comet operator framework with explicit fallback reasons. It currently skips column statistics, so cached filters do not get batch pruning.

Suggested path forward

A strong combined outcome would pair #4591's dedicated scan operator and explicit fallback reasons with #4569's stats-based buildFilter, passthrough fast-path, and respect-user-serializer install logic.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Explore options for accelerating InMemoryTableScanExec

3 participants