Skip to content

DRILL-8543: Add Support for Materialized Views#3036

Open
cgivre wants to merge 10 commits into
apache:masterfrom
cgivre:views
Open

DRILL-8543: Add Support for Materialized Views#3036
cgivre wants to merge 10 commits into
apache:masterfrom
cgivre:views

Conversation

@cgivre

@cgivre cgivre commented Feb 2, 2026

Copy link
Copy Markdown
Contributor

DRILL-8543: Add Support for Materialized Views

Description

This PR adds materialized view support to Apache Drill, enabling users to store pre-computed query results for improved query performance.

Features

  • SQL Commands: CREATE [OR REPLACE] MATERIALIZED VIEW, DROP MATERIALIZED VIEW, and REFRESH MATERIALIZED VIEW
  • Query Rewriting: Automatic query optimization using Calcite's SubstitutionVisitor to transparently rewrite queries to use materialized views when beneficial
  • Parquet Storage: MV data stored as Parquet files for efficient columnar access
  • Metastore Integration: Optional synchronization of MV metadata to Drill Metastore (Iceberg, RDBMS, MongoDB backends)

Implementation

  • New SQL parser classes for MV statements
  • MaterializedView data model with JSON serialization (.materialized_view.drill files)
  • MaterializedViewHandler for CREATE/DROP/REFRESH operations
  • MaterializedViewRewriter for query plan substitution
  • DrillMaterializedViewTable implementing Calcite's TranslatableTable
  • Metastore API extensions: MaterializedViews interface and MaterializedViewMetadataUnit
  • Iceberg metastore backend implementation for MV metadata

Configuration

  • planner.enable_materialized_view_rewrite (default: true) - Controls automatic query rewriting

Documentation

Added docs/dev/MaterializedViews.md with complete feature documentation

Testing

Added additional unit tests.

@cgivre cgivre self-assigned this Feb 2, 2026
@cgivre cgivre added enhancement PRs that add a new functionality to Drill doc-impacting PRs that affect the documentation performance PRs that Improve Performance major-update labels Feb 2, 2026

@letian-jiang letian-jiang left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Materialized view is a powerful feature for analytic engine. 🥳

Comment thread docs/dev/MaterializedViews.md
Comment thread docs/dev/MaterializedViews.md Outdated

@letian-jiang letian-jiang left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could also add a plan-asserting test to ensure the query is correctly rewrite using MV.

@cgivre

cgivre commented Feb 8, 2026

Copy link
Copy Markdown
Contributor Author

@letian-jiang I believe I addressed your review comments. Could you please mark the review as complete so we can merge?
Thanks!

@cgivre cgivre requested a review from pjfanning February 8, 2026 15:16

@letian-jiang letian-jiang left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@cgivre cgivre requested a review from rymarm June 15, 2026 13:28

@rymarm rymarm left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cgivre Thank you for implementing this feature! It looks great overall. I found a few issues from my point of view - please check them out. They relate to:

  • MV dataStoragePath: making its format strict and accessing it from the object instead of relying on duck typing.
  • Using hardcoded backticks during query building.
  • Optimizing code syntax.

Comment on lines +59 to +61
/** The relative path where the materialized data is stored (typically the view name) */
@JsonInclude(Include.NON_NULL)
private String dataStoragePath;

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In exec/java-exec/src/main/java/org/apache/drill/exec/planner/logical/DrillMaterializedViewTable.java you specified, that the MV data is stored within {name}_mv_data/:

 * A materialized view stores:
 * <ul>
 *   <li>Definition file (.materialized_view.drill) - JSON with name, SQL, schema info</li>
 *   <li>Data directory ({name}_mv_data/) - Parquet files with pre-computed results</li>
 * </ul>

In the unit tests, you used the MV name for the data storage, but at the same time, you used {name}_mv_data in the following places:

  • exec/java-exec/src/main/java/org/apache/drill/exec/planner/logical/DrillMaterializedViewTable.java
  • exec/java-exec/src/main/java/org/apache/drill/exec/planner/logical/MaterializedViewRewriter.java
  • exec/java-exec/src/main/java/org/apache/drill/exec/store/dfs/WorkspaceSchemaFactory.java

We need to agree on what name pattern to use: {name}_mv_data or simply {name}.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch — these were inconsistent. The data is physically written to {name}_mv_data, so that's the pattern we keep. dataStoragePath now defaults to name + DATA_DIR_SUFFIX (new constant) and is the single source of truth — every reader/writer goes through getDataStoragePath() instead of re-deriving the suffix. The stray setDataStoragePath(viewName) that stored the wrong value is gone.

.map(f -> new View.Field(f.getName(), f.getType()))
.collect(Collectors.toList()),
workspaceSchemaPath,
name, // data storage path defaults to view name

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to agree on what name pattern to use by default: {name}_mv_data or simply {name}.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Settled on {name}_mv_data (where the data is actually written). The default is now name + DATA_DIR_SUFFIX and all call sites use getDataStoragePath().

* We explicitly select the MV's columns to ensure proper schema matching.
*/
private String buildDataScanSql() {
String dataTableName = materializedView.getName() + "_mv_data";

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe materializedView.getDataStoragePath() should be used there.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done — buildDataScanSql() now uses materializedView.getDataStoragePath().

.collect(java.util.stream.Collectors.toList());
if (fieldNames.isEmpty()) {
// Fallback to SELECT * if no fields defined (shouldn't happen for non-dynamic MVs)
return "SELECT * FROM `" + dataTableName + "`";

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apache Drill allows to configure the identifier quote character. Won't it fail if planner.parser.quoting_identifiers is set to double quotes?
https://drill.apache.org/docs/lexical-structure#identifier-quotes

And doesn't it should include either the workspace name?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed. Added quoteIdentifier(Quoting, id) which wraps using the session's configured quoting char (escaping embedded quotes); the session Quoting is threaded in from WorkspaceSchemaFactory. On the workspace-name point: this SQL is expanded against materializedView.getWorkspaceSchemaPath(), so the table is already resolved within the correct workspace and doesn't need re-qualifying here.

}
sql.append("`").append(fieldNames.get(i)).append("`");
}
sql.append(" FROM `").append(dataTableName).append("`");

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The same:

Won't it fail if planner.parser.quoting_identifiers is set to double quotes?
https://drill.apache.org/docs/lexical-structure#identifier-quotes

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same fix — now quoted via quoteIdentifier(...) using the session quoting character, so it no longer breaks under double-quote quoting.

@Override
public void dropMaterializedView(String viewName) throws IOException {
Path viewPath = getMaterializedViewPath(viewName);
Path dataPath = getMaterializedViewDataPath(viewName);

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use materializedView.getDataStoragePath() instead.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. dropMaterializedView only has the name, so it reads the definition to get the data path (falling back to the default naming only if the definition is unreadable).

.build(logger);
}

Path dataPath = getMaterializedViewDataPath(viewName);

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use materializedView.getDataStoragePath() instead.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done — uses getMaterializedViewDataPath(materializedView) / getDataStoragePath().

Comment on lines +487 to +505
public CreateTableEntry createMaterializedViewDataWriter(String viewName) {
// Use Parquet format for storing materialized view data
FormatPlugin formatPlugin = plugin.getFormatPlugin("parquet");
if (formatPlugin == null) {
throw UserException.unsupportedError()
.message("Parquet format plugin not available for materialized view storage")
.build(logger);
}

// Store data in a directory with _mv_data suffix to avoid name collision
// with the materialized view lookup (which uses the same base name)
String dataLocation = config.getLocation() + Path.SEPARATOR + viewName + "_mv_data";
return new FileSystemCreateTableEntry(
(FileSystemConfig) plugin.getConfig(),
formatPlugin,
dataLocation,
Collections.emptyList(), // No partition columns for MVs
StorageStrategy.DEFAULT);
}

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think, the method should have MaterializedView parameter and call materializedView.getDataStoragePath() to retrieve the dataLocation. Or at least this method shouldn't try to guess the data path and get the date path as argument.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done — createMaterializedViewDataWriter now takes the MaterializedView and uses getDataStoragePath(); the interface in AbstractSchema and the handler caller were updated accordingly.

Comment on lines +521 to +525
for (DotDrillFile f : files) {
if (f.getType() == DotDrillType.MATERIALIZED_VIEW) {
return f.getMaterializedView(mapper);
}
}

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't you want to replace it with a more declarative syntax to get first MATERIALIZED_VIEW:

return files.stream()
    .filter(f -> f.getType() == DotDrillType.MATERIALIZED_VIEW)
    .findFirst()
    .map(f -> f.getMaterializedView(mapper))
    .orElse(null);

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Used the declarative form, with one tweak: getMaterializedView(mapper) declares throws IOException, so it can't go inside .map(...). Kept filter(...).findFirst() and read the file just outside the stream so the checked exception still propagates.

Comment on lines +208 to +209
if (table instanceof DrillMaterializedViewTable) {
DrillMaterializedViewTable mvTable = (DrillMaterializedViewTable) table;

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done — using instanceof DrillMaterializedViewTable mvTable. Side note: the build was still pinned to -source 11 via the jdk9+ profile, which rejects this; bumped maven.compiler.* to 17 to match the existing requireJavaVersion [17,24) enforcer rule and the 17/21 CI matrix.

cgivre added 10 commits June 22, 2026 17:29
The MV rewriter had three bugs preventing query rewriting from working:
1. Schema discovery used lazy-loaded schema tree (always empty) - now
   iterates StoragePluginRegistry for FileSystemPlugin instances
2. SubstitutionVisitor arguments were swapped - now uses Calcite's
   RelOptMaterializations.useMaterializedViews() API which handles
   normalization and correct argument order internally
3. buildMvScanRel used SELECT * causing DYNAMIC_STAR type mismatch -
   now selects explicit columns from the MV field definitions

Also adds plan verification tests to both TestMaterializedViewSupport
and TestMaterializedViewRewriting to assert that query plans actually
reference _mv_data (Parquet) or region.json as expected.
- Make MaterializedView.dataStoragePath the single source of truth for the
  data directory ({name}_mv_data) and use getDataStoragePath() everywhere
  instead of hardcoding the _mv_data suffix.
- Quote generated identifiers using the session's configured quoting
  character so MV data scans work when planner.parser.quoting_identifiers
  is not the default backtick.
- createMaterializedViewDataWriter now takes the MaterializedView object.
- Simplify isTable check and use a declarative findFirst in
  getMaterializedView.
- Use pattern-matching instanceof in RecordCollector.
- Bump maven.compiler release/source/target 11 -> 17 (Drill no longer
  supports Java 11; matches the existing enforcer rule and CI matrix).
@cgivre

cgivre commented Jun 22, 2026

Copy link
Copy Markdown
Contributor Author

@rymarm Thanks for the review. I believe I addressed all your comments.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

doc-impacting PRs that affect the documentation enhancement PRs that add a new functionality to Drill major-update performance PRs that Improve Performance

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants