DRILL-8543: Add Support for Materialized Views#3036
Conversation
letian-jiang
left a comment
There was a problem hiding this comment.
LGTM. Materialized view is a powerful feature for analytic engine. 🥳
letian-jiang
left a comment
There was a problem hiding this comment.
We could also add a plan-asserting test to ensure the query is correctly rewrite using MV.
|
@letian-jiang I believe I addressed your review comments. Could you please mark the review as complete so we can merge? |
rymarm
left a comment
There was a problem hiding this comment.
@cgivre Thank you for implementing this feature! It looks great overall. I found a few issues from my point of view - please check them out. They relate to:
- MV
dataStoragePath: making its format strict and accessing it from the object instead of relying on duck typing. - Using hardcoded backticks during query building.
- Optimizing code syntax.
| /** The relative path where the materialized data is stored (typically the view name) */ | ||
| @JsonInclude(Include.NON_NULL) | ||
| private String dataStoragePath; |
There was a problem hiding this comment.
In exec/java-exec/src/main/java/org/apache/drill/exec/planner/logical/DrillMaterializedViewTable.java you specified, that the MV data is stored within {name}_mv_data/:
* A materialized view stores:
* <ul>
* <li>Definition file (.materialized_view.drill) - JSON with name, SQL, schema info</li>
* <li>Data directory ({name}_mv_data/) - Parquet files with pre-computed results</li>
* </ul>
In the unit tests, you used the MV name for the data storage, but at the same time, you used {name}_mv_data in the following places:
exec/java-exec/src/main/java/org/apache/drill/exec/planner/logical/DrillMaterializedViewTable.javaexec/java-exec/src/main/java/org/apache/drill/exec/planner/logical/MaterializedViewRewriter.javaexec/java-exec/src/main/java/org/apache/drill/exec/store/dfs/WorkspaceSchemaFactory.java
We need to agree on what name pattern to use: {name}_mv_data or simply {name}.
There was a problem hiding this comment.
Good catch — these were inconsistent. The data is physically written to {name}_mv_data, so that's the pattern we keep. dataStoragePath now defaults to name + DATA_DIR_SUFFIX (new constant) and is the single source of truth — every reader/writer goes through getDataStoragePath() instead of re-deriving the suffix. The stray setDataStoragePath(viewName) that stored the wrong value is gone.
| .map(f -> new View.Field(f.getName(), f.getType())) | ||
| .collect(Collectors.toList()), | ||
| workspaceSchemaPath, | ||
| name, // data storage path defaults to view name |
There was a problem hiding this comment.
We need to agree on what name pattern to use by default: {name}_mv_data or simply {name}.
There was a problem hiding this comment.
Settled on {name}_mv_data (where the data is actually written). The default is now name + DATA_DIR_SUFFIX and all call sites use getDataStoragePath().
| * We explicitly select the MV's columns to ensure proper schema matching. | ||
| */ | ||
| private String buildDataScanSql() { | ||
| String dataTableName = materializedView.getName() + "_mv_data"; |
There was a problem hiding this comment.
I believe materializedView.getDataStoragePath() should be used there.
There was a problem hiding this comment.
Done — buildDataScanSql() now uses materializedView.getDataStoragePath().
| .collect(java.util.stream.Collectors.toList()); | ||
| if (fieldNames.isEmpty()) { | ||
| // Fallback to SELECT * if no fields defined (shouldn't happen for non-dynamic MVs) | ||
| return "SELECT * FROM `" + dataTableName + "`"; |
There was a problem hiding this comment.
Apache Drill allows to configure the identifier quote character. Won't it fail if planner.parser.quoting_identifiers is set to double quotes?
https://drill.apache.org/docs/lexical-structure#identifier-quotes
And doesn't it should include either the workspace name?
There was a problem hiding this comment.
Fixed. Added quoteIdentifier(Quoting, id) which wraps using the session's configured quoting char (escaping embedded quotes); the session Quoting is threaded in from WorkspaceSchemaFactory. On the workspace-name point: this SQL is expanded against materializedView.getWorkspaceSchemaPath(), so the table is already resolved within the correct workspace and doesn't need re-qualifying here.
| } | ||
| sql.append("`").append(fieldNames.get(i)).append("`"); | ||
| } | ||
| sql.append(" FROM `").append(dataTableName).append("`"); |
There was a problem hiding this comment.
The same:
Won't it fail if planner.parser.quoting_identifiers is set to double quotes?
https://drill.apache.org/docs/lexical-structure#identifier-quotes
There was a problem hiding this comment.
Same fix — now quoted via quoteIdentifier(...) using the session quoting character, so it no longer breaks under double-quote quoting.
| @Override | ||
| public void dropMaterializedView(String viewName) throws IOException { | ||
| Path viewPath = getMaterializedViewPath(viewName); | ||
| Path dataPath = getMaterializedViewDataPath(viewName); |
There was a problem hiding this comment.
Use materializedView.getDataStoragePath() instead.
There was a problem hiding this comment.
Done. dropMaterializedView only has the name, so it reads the definition to get the data path (falling back to the default naming only if the definition is unreadable).
| .build(logger); | ||
| } | ||
|
|
||
| Path dataPath = getMaterializedViewDataPath(viewName); |
There was a problem hiding this comment.
Use materializedView.getDataStoragePath() instead.
There was a problem hiding this comment.
Done — uses getMaterializedViewDataPath(materializedView) / getDataStoragePath().
| public CreateTableEntry createMaterializedViewDataWriter(String viewName) { | ||
| // Use Parquet format for storing materialized view data | ||
| FormatPlugin formatPlugin = plugin.getFormatPlugin("parquet"); | ||
| if (formatPlugin == null) { | ||
| throw UserException.unsupportedError() | ||
| .message("Parquet format plugin not available for materialized view storage") | ||
| .build(logger); | ||
| } | ||
|
|
||
| // Store data in a directory with _mv_data suffix to avoid name collision | ||
| // with the materialized view lookup (which uses the same base name) | ||
| String dataLocation = config.getLocation() + Path.SEPARATOR + viewName + "_mv_data"; | ||
| return new FileSystemCreateTableEntry( | ||
| (FileSystemConfig) plugin.getConfig(), | ||
| formatPlugin, | ||
| dataLocation, | ||
| Collections.emptyList(), // No partition columns for MVs | ||
| StorageStrategy.DEFAULT); | ||
| } |
There was a problem hiding this comment.
I think, the method should have MaterializedView parameter and call materializedView.getDataStoragePath() to retrieve the dataLocation. Or at least this method shouldn't try to guess the data path and get the date path as argument.
There was a problem hiding this comment.
Done — createMaterializedViewDataWriter now takes the MaterializedView and uses getDataStoragePath(); the interface in AbstractSchema and the handler caller were updated accordingly.
| for (DotDrillFile f : files) { | ||
| if (f.getType() == DotDrillType.MATERIALIZED_VIEW) { | ||
| return f.getMaterializedView(mapper); | ||
| } | ||
| } |
There was a problem hiding this comment.
Don't you want to replace it with a more declarative syntax to get first MATERIALIZED_VIEW:
return files.stream()
.filter(f -> f.getType() == DotDrillType.MATERIALIZED_VIEW)
.findFirst()
.map(f -> f.getMaterializedView(mapper))
.orElse(null);There was a problem hiding this comment.
Used the declarative form, with one tweak: getMaterializedView(mapper) declares throws IOException, so it can't go inside .map(...). Kept filter(...).findFirst() and read the file just outside the stream so the checked exception still propagates.
| if (table instanceof DrillMaterializedViewTable) { | ||
| DrillMaterializedViewTable mvTable = (DrillMaterializedViewTable) table; |
There was a problem hiding this comment.
Use pattern matching feature: https://docs.oracle.com/en/java/javase/17/language/pattern-matching-instanceof.html. Drill use Java 17+
There was a problem hiding this comment.
Done — using instanceof DrillMaterializedViewTable mvTable. Side note: the build was still pinned to -source 11 via the jdk9+ profile, which rejects this; bumped maven.compiler.* to 17 to match the existing requireJavaVersion [17,24) enforcer rule and the 17/21 CI matrix.
The MV rewriter had three bugs preventing query rewriting from working: 1. Schema discovery used lazy-loaded schema tree (always empty) - now iterates StoragePluginRegistry for FileSystemPlugin instances 2. SubstitutionVisitor arguments were swapped - now uses Calcite's RelOptMaterializations.useMaterializedViews() API which handles normalization and correct argument order internally 3. buildMvScanRel used SELECT * causing DYNAMIC_STAR type mismatch - now selects explicit columns from the MV field definitions Also adds plan verification tests to both TestMaterializedViewSupport and TestMaterializedViewRewriting to assert that query plans actually reference _mv_data (Parquet) or region.json as expected.
- Make MaterializedView.dataStoragePath the single source of truth for the
data directory ({name}_mv_data) and use getDataStoragePath() everywhere
instead of hardcoding the _mv_data suffix.
- Quote generated identifiers using the session's configured quoting
character so MV data scans work when planner.parser.quoting_identifiers
is not the default backtick.
- createMaterializedViewDataWriter now takes the MaterializedView object.
- Simplify isTable check and use a declarative findFirst in
getMaterializedView.
- Use pattern-matching instanceof in RecordCollector.
- Bump maven.compiler release/source/target 11 -> 17 (Drill no longer
supports Java 11; matches the existing enforcer rule and CI matrix).
|
@rymarm Thanks for the review. I believe I addressed all your comments. |
DRILL-8543: Add Support for Materialized Views
Description
This PR adds materialized view support to Apache Drill, enabling users to store pre-computed query results for improved query performance.
Features
Implementation
Configuration
planner.enable_materialized_view_rewrite(default: true) - Controls automatic query rewritingDocumentation
Added docs/dev/MaterializedViews.md with complete feature documentation
Testing
Added additional unit tests.