HYPERFLEET-1109 - feat: Add after suite sweep to clean up resources#139
HYPERFLEET-1109 - feat: Add after suite sweep to clean up resources#139ma-hill wants to merge 1 commit into
Conversation
|
Skipping CI for Draft Pull Request. |
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
|
Important Review skippedAuto reviews are limited based on label configuration. 🚫 Review skipped — only excluded labels are configured. (2)
Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: Central YAML (base), Organization UI (inherited) Review profile: CHILL Plan: Enterprise Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
📝 WalkthroughWalkthroughRUN_ID is required at suite startup and propagated through DeployAdapter as Helm labels/commonLabels. A new helm.Client wraps release listing by label selector; a new cleanup.go implements SweepTestResources, deleting Helm releases, Jobs/Deployments/ConfigMaps/AppliedManifestWork, and namespaces (with finalizer removal) scoped to RUN_ID. Suite teardown invokes this sweep. Test-data configs add runId params and run-id labels. env.ci/env.local switch image registry/repos to quay.io redhat-services-prod paths and add RUN_ID export. go.mod bumps several dependencies. Maestro negative tests reorder cleanup registration; adapter timeout default reduced to 1 minute. Estimated code review effort: 4 (Complex) | ~60 minutes Sequence Diagram(s)sequenceDiagram
participant Suite as pkg/e2e/suite.go
participant Helper as Helper.SweepTestResources
participant HelmClient as helm.Client
participant K8s as Kubernetes API
Suite->>Helper: AfterSuite calls SweepTestResources(ctx)
Helper->>Helper: read RUN_ID env, build label selector
Helper->>HelmClient: ListReleasesBySelector(labelSelector)
HelmClient->>K8s: list releases via Helm action
HelmClient-->>Helper: matching release names
Helper->>K8s: UninstallAdapter for each release
Helper->>K8s: delete Jobs/Deployments/ConfigMaps/AppliedManifestWork
Helper->>K8s: delete Namespaces by label selector
K8s-->>Helper: deletion errors (if any)
Helper->>K8s: patchRemoveFinalizers + retry delete on stuck namespace
Helper-->>Suite: cleanup error or success
Security notes:
Related PRs: None identified from the provided diff. Suggested labels: ci, dependencies, security-review-requested Suggested reviewers: platform/e2e-infra maintainers, security team (finalizer removal + kubeconfig/dynamic client usage) ✨ Finishing Touches🧪 Generate unit tests (beta)
✨ Simplify code
Comment |
b53834a to
37e70b8
Compare
|
@CodeRabbit review |
✅ Action performedReview finished.
|
37e70b8 to
c9f8d10
Compare
There was a problem hiding this comment.
Actionable comments posted: 6
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@go.mod`:
- Line 40: Remove the vulnerable containerd dependency pin from the module graph
by updating the dependency set so `github.com/containerd/containerd` is no
longer selected at v1.7.30. Regenerate the module metadata via the normal
dependency management flow, and ensure any reference to
`github.com/containerd/containerd` in go.mod/go.sum is replaced with a
non-vulnerable resolved version outside the affected range. Verify the resulting
dependency graph no longer includes the vulnerable `containerd` module.
In `@pkg/helper/adapter.go`:
- Around line 88-90: The default deployment timeout in adapter helper logic was
reduced too far, causing the main Helm wait path to time out prematurely when
AdapterDeploymentOptions.Timeout is unset. In the timeout fallback inside helper
logic (the code that checks opts.Timeout in adapter.go), restore the previous
deploy budget by setting the default back to 5 minutes instead of 1 minute so
the existing adapter deployment flow remains stable on slower CI and image
pulls.
- Around line 198-206: The run-id label passed through Helm is using an
unescaped dotted key in the commonLabels --set path, so Helm parses it as nested
fields instead of the literal label name. Update the RUN_ID handling in
adapter.go to use the Helm release labeling plus a string-based set flag with an
escaped key path for commonLabels, referencing the existing RUN_ID block in the
adapter logic. Keep the literal e2e.hyperfleet.io/run-id label intact so cleanup
can match resources correctly.
In `@pkg/helper/cleanup.go`:
- Around line 19-155: SweepTestResources currently logs most cleanup errors but
still returns nil, so AfterSuite cannot detect failed teardown. Update
Helper.SweepTestResources and any called cleanup helpers it relies on, such as
uninstallHelmReleases and deleteNamespaces, to collect and return errors instead
of dropping them; make dynamic client init, list/delete failures, and namespace
cleanup failures propagate back to the caller. Keep any intentional
log-and-continue paths explicit with comments, and ensure pkg/e2e/suite.go can
fail the suite when cleanup does not complete successfully.
- Around line 197-205: The patchRemoveFinalizers helper is updating
metadata.finalizers on the main Namespace object, which won’t clear the
terminating spec.finalizers. Update this fallback to use the Namespace
finalization path via h.K8sClient.CoreV1().Namespaces().Finalize (the /finalize
subresource) so the stuck namespace can actually be unblocked.
- Around line 130-140: The cleanup path for AppliedManifestWork deletion is
missing the object namespace, so namespaced resources may not be removed
correctly. Update the deletion call in the applied manifest work loop in
cleanup.go to use the work’s namespace from work.GetNamespace() along with
work.GetName(), keeping the existing list/delete flow in cleanup of appliedWorks
and the dynamicClient.Resource(appliedManifestWorkGVR) call.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Central YAML (base), Organization UI (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: 3e40f1dc-72aa-4205-96e9-652be581535a
⛔ Files ignored due to path filters (1)
go.sumis excluded by!**/*.sum,!**/go.sum
📒 Files selected for processing (25)
e2e/adapter/adapter_failover.goe2e/adapter/adapter_with_maestro.goenv/env.cienv/env.localgo.modpkg/e2e/suite.gopkg/helper/adapter.gopkg/helper/cleanup.gopkg/helper/helm/helm.gotestdata/adapter-configs/cl-crash/adapter-task-config.yamltestdata/adapter-configs/cl-crash/values.yamltestdata/adapter-configs/cl-invalid-resource/adapter-task-config.yamltestdata/adapter-configs/cl-invalid-resource/values.yamltestdata/adapter-configs/cl-m-bad-api/adapter-task-config.yamltestdata/adapter-configs/cl-m-bad-api/values.yamltestdata/adapter-configs/cl-m-unreg-consumer/adapter-task-config.yamltestdata/adapter-configs/cl-m-unreg-consumer/values.yamltestdata/adapter-configs/cl-m-wrong-ds/adapter-task-config.yamltestdata/adapter-configs/cl-m-wrong-ds/values.yamltestdata/adapter-configs/cl-m-wrong-nest/adapter-task-config.yamltestdata/adapter-configs/cl-m-wrong-nest/values.yamltestdata/adapter-configs/cl-precondition-error/adapter-task-config.yamltestdata/adapter-configs/cl-precondition-error/values.yamltestdata/adapter-configs/cl-stuck/adapter-task-config.yamltestdata/adapter-configs/cl-stuck/values.yaml
🔗 Linked repositories identified
CodeRabbit considers these linked repositories for cross-repo context during reviews:
openshift-hyperfleet/architecture(manual)openshift-hyperfleet/hyperfleet-api(manual)openshift-hyperfleet/hyperfleet-sentinel(manual)openshift-hyperfleet/hyperfleet-adapter(manual)openshift-hyperfleet/hyperfleet-broker(manual)
| github.com/blang/semver/v4 v4.0.0 // indirect | ||
| github.com/cespare/xxhash/v2 v2.3.0 // indirect | ||
| github.com/chai2010/gettext-go v1.0.2 // indirect | ||
| github.com/containerd/containerd v1.7.30 // indirect |
There was a problem hiding this comment.
🔒 Security & Privacy | 🟠 Major | 🏗️ Heavy lift
Remove the vulnerable containerd pin.
github.com/containerd/containerd v1.7.30 falls inside GO-2026-5378's affected range (v1.7.27 before v1.7.32), and the Go vuln DB also lists GO-2026-5758/CVE-2026-53488 and GO-2026-5338/CVE-2026-50195 for the module. Keeping it in the module graph drags a known-vulnerable supply-chain dependency into the build; please regenerate the dependency set so this release is no longer selected. (pkg.go.dev)
🧰 Tools
🪛 OSV Scanner (2.4.0)
[HIGH] 40-40: github.com/containerd/containerd 1.7.30: containerd CRI checkpoint restore CDI annotation smuggling in github.com/containerd/containerd
(GO-2026-5064)
[HIGH] 40-40: github.com/containerd/containerd 1.7.30: containerd: CRI checkpoint import allows local image tag poisoning in github.com/containerd/containerd
(GO-2026-5338)
[HIGH] 40-40: github.com/containerd/containerd 1.7.30: containerd user ID handling bypass allows runAsNonRoot evasion in github.com/containerd/containerd
(GO-2026-5378)
[HIGH] 40-40: github.com/containerd/containerd 1.7.30: containerd image-triggered runtime DoS via unbounded group parsing in github.com/containerd/containerd
(GO-2026-5475)
[HIGH] 40-40: github.com/containerd/containerd 1.7.30: Arbitrary host CRI log file read via symlink following in CRI checkpoint restore in github.com/containerd/containerd
(GO-2026-5622)
[HIGH] 40-40: github.com/containerd/containerd 1.7.30: containerd CRI — image-config LABEL flows to restart-monitor binary:// logger: host-root command execution from an image pull in github.com/containerd/containerd
(GO-2026-5758)
[HIGH] 40-40: github.com/containerd/containerd 1.7.30: containerd user ID handling bypass allows runAsNonRoot evasion
[HIGH] 40-40: github.com/containerd/containerd 1.7.30: containerd image-triggered runtime DoS via unbounded group parsing
[HIGH] 40-40: github.com/containerd/containerd 1.7.30: containerd CRI — image-config LABEL flows to restart-monitor binary:// logger: host-root command execution from an image pull
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@go.mod` at line 40, Remove the vulnerable containerd dependency pin from the
module graph by updating the dependency set so
`github.com/containerd/containerd` is no longer selected at v1.7.30. Regenerate
the module metadata via the normal dependency management flow, and ensure any
reference to `github.com/containerd/containerd` in go.mod/go.sum is replaced
with a non-vulnerable resolved version outside the affected range. Verify the
resulting dependency graph no longer includes the vulnerable `containerd`
module.
Source: Linters/SAST tools
| // Set default timeout if not specified | ||
| if opts.Timeout == 0 { | ||
| opts.Timeout = 5 * time.Minute | ||
| opts.Timeout = 1 * time.Minute |
There was a problem hiding this comment.
🩺 Stability & Availability | 🟠 Major | ⚡ Quick win
Restore the previous deploy timeout budget.
Every adapter spec shown here leaves AdapterDeploymentOptions.Timeout unset, so this change cuts the default Helm --wait window from 5m to 1m on the main path. Fresh image pulls and rollout on shared CI clusters routinely exceed 60s, so this is a flake factory rather than a real signal.
Suggested fix
- if opts.Timeout == 0 {
- opts.Timeout = 1 * time.Minute
- }
+ if opts.Timeout == 0 {
+ opts.Timeout = 5 * time.Minute
+ }📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| // Set default timeout if not specified | |
| if opts.Timeout == 0 { | |
| opts.Timeout = 5 * time.Minute | |
| opts.Timeout = 1 * time.Minute | |
| // Set default timeout if not specified | |
| if opts.Timeout == 0 { | |
| opts.Timeout = 5 * time.Minute | |
| } |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@pkg/helper/adapter.go` around lines 88 - 90, The default deployment timeout
in adapter helper logic was reduced too far, causing the main Helm wait path to
time out prematurely when AdapterDeploymentOptions.Timeout is unset. In the
timeout fallback inside helper logic (the code that checks opts.Timeout in
adapter.go), restore the previous deploy budget by setting the default back to 5
minutes instead of 1 minute so the existing adapter deployment flow remains
stable on slower CI and image pulls.
| // Add run-id label for resource tracking and cleanup | ||
| if runID := os.Getenv("RUN_ID"); runID != "" { | ||
| // Label the Helm release itself (for Helm SDK-based cleanup) | ||
| helmArgs = append(helmArgs, | ||
| "--labels", fmt.Sprintf("e2e.hyperfleet.io/run-id=%s", runID), | ||
| ) | ||
| helmArgs = append(helmArgs, | ||
| "--set", fmt.Sprintf("commonLabels.e2e.hyperfleet.io/run-id=%s", runID), | ||
| ) |
There was a problem hiding this comment.
🗄️ Data Integrity & Integration | 🟠 Major | ⚡ Quick win
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
printf 'File outline:\n'
ast-grep outline pkg/helper/adapter.go --view expanded || true
printf '\nRelevant lines:\n'
cat -n pkg/helper/adapter.go | sed -n '170,230p'
printf '\nSearch for Helm args and label handling:\n'
rg -n --hidden --no-messages 'RUN_ID|commonLabels|--labels|--set-string|--set ' pkg/helper pkg | sed -n '1,200p'Repository: openshift-hyperfleet/hyperfleet-e2e
Length of output: 5969
🏁 Script executed:
#!/bin/bash
set -euo pipefail
printf 'Helm CLI quoting/escaping references in repo:\n'
rg -n --hidden --no-messages 'helm.*--set|--set-string|commonLabels\.' .
printf '\nPotential chart templates/values using run-id label key:\n'
rg -n --hidden --no-messages 'e2e\.hyperfleet\.io/run-id|run-id' charts pkg testdata docs .Repository: openshift-hyperfleet/hyperfleet-e2e
Length of output: 8179
🏁 Script executed:
#!/bin/bash
set -euo pipefail
python3 - <<'PY'
import pathlib, re
p = pathlib.Path('pkg/helper/adapter.go')
text = p.read_text()
for i, line in enumerate(text.splitlines(), 1):
if 'RUN_ID' in line or 'commonLabels' in line or '--labels' in line or '--set' in line:
print(f"{i}: {line}")
PYRepository: openshift-hyperfleet/hyperfleet-e2e
Length of output: 725
🏁 Script executed:
#!/bin/bash
set -euo pipefail
printf 'adapter.go excerpt:\n'
cat -n pkg/helper/adapter.go | sed -n '190,215p'
printf '\nWhere commonLabels is consumed:\n'
rg -n --hidden --no-messages 'commonLabels|run-id' . --glob '!**/.git/**' --glob '!**/vendor/**'Repository: openshift-hyperfleet/hyperfleet-e2e
Length of output: 5049
🏁 Script executed:
#!/bin/bash
set -euo pipefail
rg -n --hidden --no-messages 'e2e\.hyperfleet\.io/run-id|RUN_ID|run-id' .
printf '\nHelm value wiring around commonLabels:\n'
rg -n --hidden --no-messages 'commonLabels' . --glob '!**/.git/**'Repository: openshift-hyperfleet/hyperfleet-e2e
Length of output: 6591
🌐 Web query:
Helm --set key with dots label keys escaped backslash commonLabels parser docs
💡 Result:
When using the Helm --set flag, keys containing dots are interpreted as nested objects by default [1][2]. To treat a dot as a literal character in a key (such as for labels or annotations like kubernetes.io/role), you must escape it using a backslash [1][2]. Correct Syntax for Dotted Keys To correctly set a value for a key containing dots, use the following syntax: helm install --set key."sub.key.with.dots"=value my-chart Key Requirements: 1. Quote the Key: Enclose the entire key (or the specific path segment containing the dots) in double quotes [1][3][4]. 2. Escape the Dot: Precede each literal dot with a backslash [1][2]. 3. Shell Considerations: Depending on your shell environment (e.g., Bash, Zsh, or running within a CI/CD pipeline like Jenkins), you may need to double-escape the backslash to ensure the backslash is passed through to the Helm parser [5][3][6]. For example, use \. if a single backslash is being stripped by the shell before reaching Helm [5][6]. Example If you need to set a commonLabel or annotation with a dot, such as kubernetes.io/role=master, use: helm install --set commonLabels."kubernetes.io/role"=master. If running in a shell that interprets backslashes, you might need: helm install --set commonLabels."kubernetes\.io/role"=master. Troubleshooting If you continue to face issues with complex values or deeply nested structures, consider using a values file (--values or -f) instead of --set [1][5][7]. Providing a YAML file is generally more robust and avoids the syntactic limitations and escaping complexities associated with the --set flag [5][7].
Citations:
- 1: https://helm.sh/docs/intro/using_helm
- 2: https://helm.sh/docs/intro/using_helm/
- 3: https://stackoverflow.com/questions/49522776/helm-how-to-override-value-with-periods-in-name
- 4: dots in value name is misinterpreted by toYaml helm/helm#5027
- 5: https://stackoverflow.com/questions/77745012/inline-setting-dotted-variables-in-helm-chart
- 6: https://stackoverflow.com/questions/69766554/helm-chart-how-to-pass-a-env-value-with-multiple-dots
- 7: Provide an option or alternative for --set and/or --set-string to take a value literally helm/helm#4030
Escape the dotted run-id key in the Helm --set path.
commonLabels.e2e.hyperfleet.io/run-id is parsed as nested values, so the chart won’t receive the literal label key and cleanup will miss those resources. Use --set-string with an escaped path like commonLabels.e2e\\.hyperfleet\\.io/run-id=%s.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@pkg/helper/adapter.go` around lines 198 - 206, The run-id label passed
through Helm is using an unescaped dotted key in the commonLabels --set path, so
Helm parses it as nested fields instead of the literal label name. Update the
RUN_ID handling in adapter.go to use the Helm release labeling plus a
string-based set flag with an escaped key path for commonLabels, referencing the
existing RUN_ID block in the adapter logic. Keep the literal
e2e.hyperfleet.io/run-id label intact so cleanup can match resources correctly.
| func (h *Helper) SweepTestResources(ctx context.Context) error { | ||
| runID := os.Getenv("RUN_ID") | ||
| if runID == "" { | ||
| logger.Warn("RUN_ID not set, skipping cleanup") | ||
| return nil | ||
| } | ||
|
|
||
| logger.Info("cleaning up resources with run-id", "run-id", runID) | ||
|
|
||
| labelSelector := fmt.Sprintf("e2e.hyperfleet.io/run-id=%s", runID) | ||
|
|
||
| // Uninstall orphaned helm releases with the run-id label | ||
| // This ensures proper cleanup of Helm-managed resources before orphan cleanup | ||
| h.uninstallHelmReleases(ctx, labelSelector) | ||
|
|
||
| // Get kubernetes config for dynamic client (for CRDs) | ||
| kubeconfig := clientcmd.NewDefaultClientConfigLoadingRules().GetDefaultFilename() | ||
| restConfig, err := clientcmd.BuildConfigFromFlags("", kubeconfig) | ||
| if err != nil { | ||
| return fmt.Errorf("failed to build kubeconfig: %w", err) | ||
| } | ||
|
|
||
| // Create dynamic client for custom resources | ||
| dynamicClient, err := dynamic.NewForConfig(restConfig) | ||
| if err != nil { | ||
| logger.Error("failed to create dynamic client", "error", err) | ||
| } | ||
|
|
||
| // Run all resource deletions in parallel except namespaces | ||
| var wg sync.WaitGroup | ||
|
|
||
| // Delete jobs | ||
| wg.Add(1) | ||
| go func() { | ||
| defer wg.Done() | ||
| logger.Info("Starting deletion of orphaned jobs", "label", labelSelector) | ||
| jobs, err := h.K8sClient.BatchV1().Jobs("").List(ctx, metav1.ListOptions{ | ||
| LabelSelector: labelSelector, | ||
| }) | ||
| if err != nil { | ||
| logger.Error("failed to list jobs", "error", err) | ||
| return | ||
| } | ||
| for _, job := range jobs.Items { | ||
| logger.Info("deleting job", "name", job.Name, "namespace", job.Namespace) | ||
| propagationPolicy := metav1.DeletePropagationForeground | ||
| if err := h.K8sClient.BatchV1().Jobs(job.Namespace).Delete(ctx, job.Name, metav1.DeleteOptions{ | ||
| PropagationPolicy: &propagationPolicy, | ||
| }); err != nil { | ||
| logger.Error("failed to delete job", "namespace", job.Namespace, "name", job.Name, "error", err) | ||
| } | ||
|
|
||
| } | ||
| }() | ||
|
|
||
| // Delete deployments | ||
| wg.Add(1) | ||
| go func() { | ||
| defer wg.Done() | ||
| logger.Info("Starting deletion of orphaned deployments", "label", labelSelector) | ||
| deployments, err := h.K8sClient.AppsV1().Deployments("").List(ctx, metav1.ListOptions{ | ||
| LabelSelector: labelSelector, | ||
| }) | ||
| if err != nil { | ||
| logger.Error("failed to list deployments", "error", err) | ||
| return | ||
| } | ||
| for _, deployment := range deployments.Items { | ||
| logger.Info("deleting deployment", "name", deployment.Name, "namespace", deployment.Namespace) | ||
| propagationPolicy := metav1.DeletePropagationForeground | ||
| if err := h.K8sClient.AppsV1().Deployments(deployment.Namespace).Delete(ctx, deployment.Name, metav1.DeleteOptions{ | ||
| PropagationPolicy: &propagationPolicy, | ||
| }); err != nil { | ||
| logger.Error("failed to delete deployment", "namespace", deployment.Namespace, "name", deployment.Name, "error", err) | ||
| } | ||
|
|
||
| } | ||
| }() | ||
|
|
||
| // Delete configmaps | ||
| wg.Add(1) | ||
| go func() { | ||
| defer wg.Done() | ||
| logger.Info("Starting deletion of orphaned configmaps", "label", labelSelector) | ||
| configmaps, err := h.K8sClient.CoreV1().ConfigMaps("").List(ctx, metav1.ListOptions{ | ||
| LabelSelector: labelSelector, | ||
| }) | ||
| if err != nil { | ||
| logger.Error("failed to list configmaps", "error", err) | ||
| return | ||
| } | ||
| for _, cm := range configmaps.Items { | ||
| logger.Info("deleting configmap", "name", cm.Name, "namespace", cm.Namespace) | ||
| if err := h.K8sClient.CoreV1().ConfigMaps(cm.Namespace).Delete(ctx, cm.Name, metav1.DeleteOptions{}); err != nil { | ||
| logger.Error("failed to delete configmap", "namespace", cm.Namespace, "name", cm.Name, "error", err) | ||
| } | ||
| } | ||
| }() | ||
|
|
||
| // Delete AppliedManifestWorks | ||
| if dynamicClient != nil { | ||
| wg.Add(1) | ||
| go func() { | ||
| defer wg.Done() | ||
| logger.Info("Starting deletion of orphaned AppliedManifestWorks", "label", labelSelector) | ||
| appliedManifestWorkGVR := schema.GroupVersionResource{ | ||
| Group: "work.open-cluster-management.io", | ||
| Version: "v1", | ||
| Resource: "appliedmanifestworks", | ||
| } | ||
|
|
||
| appliedWorks, err := dynamicClient.Resource(appliedManifestWorkGVR).List(ctx, metav1.ListOptions{ | ||
| LabelSelector: labelSelector, | ||
| }) | ||
| if err != nil { | ||
| logger.Error("failed to list applied manifest works", "error", err) | ||
| return | ||
| } | ||
| for _, work := range appliedWorks.Items { | ||
| logger.Info("deleting applied manifest work", "name", work.GetName()) | ||
| if err := dynamicClient.Resource(appliedManifestWorkGVR).Delete(ctx, work.GetName(), metav1.DeleteOptions{}); err != nil { | ||
| logger.Error("failed to delete applied manifest work", "name", work.GetName(), "error", err) | ||
| } | ||
| } | ||
| }() | ||
| } | ||
|
|
||
| // Wait for all resource deletions to complete before deleting namespaces | ||
| // This prevents stuck namespaces due to AppliedManifestWork finalizers | ||
| wg.Wait() | ||
| logger.Info("all resources deleted, proceeding to namespace cleanup") | ||
|
|
||
| // Delete namespaces LAST - they can be deleted in parallel from each other | ||
| h.deleteNamespaces(ctx, labelSelector) | ||
|
|
||
| return nil | ||
| } |
There was a problem hiding this comment.
🩺 Stability & Availability | 🟠 Major | 🏗️ Heavy lift
Propagate cleanup failures back to AfterSuite.
SweepTestResources only returns kubeconfig-build failures. Helm uninstall failures, dynamic-client init failures, list/delete failures, and namespace cleanup failures are all logged and dropped, so pkg/e2e/suite.go sees nil and reports a clean teardown while orphaned resources remain. As per path instructions, "Log-and-continue MUST be intentional degradation with a comment, not a missing return."
Also applies to: 210-243
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@pkg/helper/cleanup.go` around lines 19 - 155, SweepTestResources currently
logs most cleanup errors but still returns nil, so AfterSuite cannot detect
failed teardown. Update Helper.SweepTestResources and any called cleanup helpers
it relies on, such as uninstallHelmReleases and deleteNamespaces, to collect and
return errors instead of dropping them; make dynamic client init, list/delete
failures, and namespace cleanup failures propagate back to the caller. Keep any
intentional log-and-continue paths explicit with comments, and ensure
pkg/e2e/suite.go can fail the suite when cleanup does not complete successfully.
Source: Path instructions
| appliedWorks, err := dynamicClient.Resource(appliedManifestWorkGVR).List(ctx, metav1.ListOptions{ | ||
| LabelSelector: labelSelector, | ||
| }) | ||
| if err != nil { | ||
| logger.Error("failed to list applied manifest works", "error", err) | ||
| return | ||
| } | ||
| for _, work := range appliedWorks.Items { | ||
| logger.Info("deleting applied manifest work", "name", work.GetName()) | ||
| if err := dynamicClient.Resource(appliedManifestWorkGVR).Delete(ctx, work.GetName(), metav1.DeleteOptions{}); err != nil { | ||
| logger.Error("failed to delete applied manifest work", "name", work.GetName(), "error", err) |
There was a problem hiding this comment.
🩺 Stability & Availability | 🟠 Major | ⚡ Quick win
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "== cleanup.go around target lines =="
sed -n '1,260p' pkg/helper/cleanup.go
echo
echo "== search for appliedManifestWorkGVR and namespace usage =="
rg -n "appliedManifestWorkGVR|AppliedManifestWork|Namespace\\(" pkg/helper pkg -g '!**/*vendor*'
echo
echo "== locate resource definitions if any =="
rg -n "GroupVersionResource|appliedmanifestworks|ManifestWork" .Repository: openshift-hyperfleet/hyperfleet-e2e
Length of output: 38740
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "== docs/debugging cleanup section =="
sed -n '270,340p' docs/debugging.md
echo
echo "== CleanupTestCluster and caller behavior =="
rg -n "CleanupTestCluster|SweepTestResources|AfterSuite|DeferCleanup" pkg e2e docs -g '!**/*vendor*'
echo
echo "== cleanup helper around error handling and namespace finalization =="
sed -n '1,260p' pkg/helper/helper.goRepository: openshift-hyperfleet/hyperfleet-e2e
Length of output: 20108
🌐 Web query:
Kubernetes dynamic client namespaced resource delete without Namespace() behavior appliedmanifestworks namespaced resource delete
💡 Result:
When using the Kubernetes dynamic client, calling Resource on a namespaced resource type without subsequently chaining the Namespace method will result in an error or unexpected behavior, as the client requires namespace scoping for namespaced resources [1][2][3]. The dynamic client's internal validation mechanism, specifically seen in the implementation of the Delete and DeleteCollection methods, explicitly checks for namespace validity when performing operations on namespaced resources [4][5]. If the resource is namespaced and no namespace is provided, the validation logic fails because it cannot construct the required URL path (e.g., /apis///namespaces///) [4][5]. Regarding ManifestWorks in Open Cluster Management (OCM), deleting a namespaced resource managed by a ManifestWork is handled by the OCM controller, not directly by a standard Kubernetes dynamic client call on the underlying resource [6][7]. ManifestWorks define a set of Kubernetes resources to be applied to a managed cluster [6][8]. When a ManifestWork is deleted (or when a resource is removed from its spec), the controller on the managed cluster automatically handles the deletion of the corresponding resources based on the specified DeleteOption (e.g., Foreground or Orphan propagation policies) [6][8]. You do not typically interact with the dynamic client to delete these individual "applied" resources manually; instead, you update or delete the ManifestWork object itself on the hub cluster [6][8].
Citations:
- 1: https://pkg.go.dev/k8s.io/cLient-go/1.5/dynamic
- 2: https://caioferreira.dev/posts/the-kubernetes-dynamic-client/
- 3: https://github.com/SuperMarioYL/cloudnative-cookbook/blob/main/content/kubernetes/12-client-go/04-dynamic-client.md
- 4: https://github.com/kubernetes/client-go/blob/master/dynamic/simple.go
- 5: https://github.com/kubernetes/client-go/blob/v0.35.3/dynamic/simple.go
- 6: https://github.com/open-cluster-management-io/api/blob/779879f4/work/v1/types.go
- 7: https://pkg.go.dev/open-cluster-management.io/api/work/v1
- 8: https://github.com/open-cluster-management-io/api/blob/779879f4/work/v1/0000_00_work.open-cluster-management.io_manifestworks.crd.yaml
Delete AppliedManifestWork with its namespace. dynamicClient.Resource(appliedManifestWorkGVR).Delete(...) omits Namespace(work.GetNamespace()), so this cleanup path can fail for namespaced objects and leave finalizers behind (CWE-772).
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@pkg/helper/cleanup.go` around lines 130 - 140, The cleanup path for
AppliedManifestWork deletion is missing the object namespace, so namespaced
resources may not be removed correctly. Update the deletion call in the applied
manifest work loop in cleanup.go to use the work’s namespace from
work.GetNamespace() along with work.GetName(), keeping the existing list/delete
flow in cleanup of appliedWorks and the
dynamicClient.Resource(appliedManifestWorkGVR) call.
There was a problem hiding this comment.
They aren't namespaced objects
There was a problem hiding this comment.
@ma-hill Confirmed — AppliedManifestWork (work.open-cluster-management.io/v1) is cluster-scoped, not namespaced, so dynamicClient.Resource(appliedManifestWorkGVR).Delete(...) without .Namespace() is correct. My original flag was wrong; withdrawing it.
| func (h *Helper) patchRemoveFinalizers(ctx context.Context, namespace string) error { | ||
| // Patch to remove finalizers | ||
| patch := []byte(`{"metadata":{"finalizers":null}}`) | ||
| _, err := h.K8sClient.CoreV1().Namespaces().Patch( | ||
| ctx, | ||
| namespace, | ||
| "application/merge-patch+json", | ||
| patch, | ||
| metav1.PatchOptions{}, |
There was a problem hiding this comment.
🩺 Stability & Availability | 🟠 Major | ⚡ Quick win
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Map the relevant file and nearby symbols first.
ast-grep outline pkg/helper/cleanup.go --view expanded || true
printf '\n--- relevant lines ---\n'
sed -n '160,240p' pkg/helper/cleanup.goRepository: openshift-hyperfleet/hyperfleet-e2e
Length of output: 3559
🌐 Web query:
Kubernetes namespace finalizers finalize subresource spec.finalizers metadata.finalizers documentation
💡 Result:
In Kubernetes, finalizers allow resources to perform necessary cleanup tasks before being permanently deleted from etcd [1][2][3]. While most resources store finalizers in the metadata.finalizers field [1], Namespaces are unique in that they use a spec.finalizers field [4][2][5]. The spec.finalizers field on a Namespace is typically managed by the Namespace controller [4][2]. On a healthy cluster, this list usually contains only the "kubernetes" finalizer, which represents the namespace's own internal finalization logic [4][2]. The namespace controller is responsible for emptying this list once all contents of the namespace have been successfully deleted [4][2]. Crucially, the spec.finalizers field of a Namespace cannot be modified through standard update or patch requests to the resource itself [6][5]. To modify this field—specifically to force the deletion of a Namespace stuck in the Terminating state—one must use the dedicated /finalize subresource [4][2][7]. This subresource acts as the authorized API endpoint for the Namespace controller (or an administrator in an emergency) to update the spec.finalizers list [4][2][3]. For example, to manually force the removal of a stuck Namespace by clearing its finalizers, administrators often export the Namespace manifest, empty the spec.finalizers array, and push the update to the /finalize subresource using a tool like curl [7][8]. It is important to note that metadata.finalizers and spec.finalizers are distinct, and proposals to create a generic subresource for metadata.finalizers across all Kubernetes objects have been discussed but were not adopted [9]. Consequently, the behavior described for Namespaces remains a specialized mechanism within the Kubernetes API [4][2].
Citations:
- 1: https://kubernetes.io/docs/concepts/overview/working-with-objects/finalizers/
- 2: https://jorijn.com/en/knowledge-base/kubernetes/troubleshooting/kubernetes-namespace-stuck-terminating/
- 3: https://kubernetes.io/blog/2021/05/14/using-finalizers-to-control-deletion/
- 4: https://github.com/kubernetes/design-proposals-archive/blob/main/architecture/namespaces.md
- 5: https://www.redhat.com/en/blog/troubleshooting-terminating-namespaces
- 6: spec.finalizers cannot be modified via a patch or update to a namespace resource kubernetes/kubernetes#90438
- 7: https://cloud.google.com/kubernetes-engine/docs/troubleshooting/terminating-namespaces
- 8: deleting namespace stuck at "Terminating" state kubernetes/kubernetes#60807
- 9: KEP-3285: subresource for metadata.finalizers kubernetes/enhancements#3286
Use the Namespace /finalize subresource here. Patching metadata.finalizers on the main Namespace resource does not clear the terminating spec.finalizers, so this fallback cannot unblock a stuck namespace.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@pkg/helper/cleanup.go` around lines 197 - 205, The patchRemoveFinalizers
helper is updating metadata.finalizers on the main Namespace object, which won’t
clear the terminating spec.finalizers. Update this fallback to use the Namespace
finalization path via h.K8sClient.CoreV1().Namespaces().Finalize (the /finalize
subresource) so the stuck namespace can actually be unblocked.
|
@ma-hill: The following test failed, say
Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
| cfg := GetSuiteConfig() | ||
| if cfg != nil { | ||
| h := helper.New() | ||
| if h != nil { | ||
| logger.Info("cleaning up test resources by run-id label") | ||
| // Use a fresh context with 5-minute timeout for cleanup | ||
| ctx, cancel := context.WithTimeout(context.Background(), 5*time.Minute) | ||
| defer cancel() | ||
| if err := h.SweepTestResources(ctx); err != nil { | ||
| logger.Error("failed to cleanup test resources", "error", err) | ||
| } |
There was a problem hiding this comment.
helper.New() calls log.Fatalf on failure and never returns nil, so the if h != nil guard is dead code. If Helper creation fails during teardown (e.g., expired kubeconfig), the process will crash and test results may be lost. Consider either adding a non-fatal constructor variant or wrapping this in a recover block so cleanup failures degrade gracefully.
| kubeconfig := clientcmd.NewDefaultClientConfigLoadingRules().GetDefaultFilename() | ||
| restConfig, err := clientcmd.BuildConfigFromFlags("", kubeconfig) | ||
| if err != nil { | ||
| return fmt.Errorf("failed to build kubeconfig: %w", err) | ||
| } | ||
|
|
||
| // Create dynamic client for custom resources | ||
| dynamicClient, err := dynamic.NewForConfig(restConfig) |
There was a problem hiding this comment.
Consider deriving the dynamic client from the existing K8sClient's REST config instead of re-reading kubeconfig from disk. This avoids an extra failure point and keeps client creation consistent with the rest of the method, which already uses h.K8sClient for typed resources.
| // Require RUN_ID to be set for resource cleanup tracking | ||
| runID := os.Getenv("RUN_ID") | ||
| if runID == "" { | ||
| log.Fatalf("RUN_ID environment variable must be set (e.g., export RUN_ID=test-run-$(date +%%s))") | ||
| } | ||
| logger.Info("test run identifier set", "run-id", runID) |
There was a problem hiding this comment.
The error message could mention sourcing env/env.local so developers know where RUN_ID gets set automatically.
| func (h *Helper) SweepTestResources(ctx context.Context) error { | ||
| runID := os.Getenv("RUN_ID") | ||
| if runID == "" { | ||
| logger.Warn("RUN_ID not set, skipping cleanup") |
There was a problem hiding this comment.
Consider adding unit tests for SweepTestResources. This function handles resource deletion and force-removes namespace finalizers, but there's no test coverage to validate error handling or edge cases (e.g., partial deletion failures, stuck namespaces, no matching resources).
| func (c *Client) ListReleasesBySelector(ctx context.Context, labelSelector string) ([]string, error) { | ||
| // Initialize action config for all namespaces (empty string means all) | ||
| actionConfig, err := c.initActionConfig() |
There was a problem hiding this comment.
Consider adding unit tests for ListReleasesBySelector. This would help verify the label selector filtering, namespace scoping, and state mask behavior without needing a real cluster.
| # Image Overrides | ||
| # ===================================================================== | ||
| export IMAGE_REGISTRY="${IMAGE_REGISTRY:-quay.io}" | ||
| export API_IMAGE_REPO="${API_IMAGE_REPO:-redhat-services-prod/hyperfleet-tenant/hyperfleet/hyperfleet-api}" | ||
| export API_IMAGE_TAG="${API_IMAGE_TAG:-latest}" |
There was a problem hiding this comment.
env.local defines RUN_ID but env.ci doesn't. Every other required variable is defined in both files. Consider adding export RUN_ID="${RUN_ID:-${NAMESPACE}}" to env.ci for consistency, so developers sourcing it locally don't hit the BeforeSuite fatal.
There was a problem hiding this comment.
Developers shouldn't be sourcing env.ci -- I believe env.ci should just be used for prow environment not user runs
There was a problem hiding this comment.
I added it to the e2e scripts in the openshift/release repo but maybe having it set here would be better, will update
Summary
Adds an AfterSuite cleanup hook that sweeps orphaned test resources by RUN_ID label. When tests fail or are interrupted, resources like namespaces, adapters, jobs, and deployments can remain in the cluster indefinitely. This new cleanup mechanism runs after all tests complete and removes any resources tagged with the current test run's ID, preventing resource accumulation in CI and local development environments.
HYPERFLEET-1109
Changes
SweepTestResourcesmethod to Helper that deletes all resources with thee2e.hyperfleet.io/run-idlabel, including jobs, deployments, configmaps, AppliedManifestWorks, and Helm releasespkg/helper/helm/helm.goto list and uninstall orphaned Helm releases by label selector instead of shelling out tohelm listadapterReleaseBEFORE deployment instead of after, ensuring DeferCleanup runs even if deployment failshelm.sh/helm/v3, Google Pub/Sub v2, and related transitive dependencies)registry.ci.openshift.orgtoquay.iofor CI and local environmentsvalues.yamlfiles to adapter config directories to provide default Helm values for test adaptersNotes
The
RUN_IDenvironment variable must be set before running tests (enforced in BeforeSuite). This is set in ci setup configs in the release repo and set to default as NAMESPACE inenv.localNamespaces are deleted LAST to avoid stuck finalizers. If a namespace remains stuck after normal deletion, the cleanup will patch out finalizers and retry.
The Helm SDK client uses the same uninstall approach as the existing
UninstallAdaptermethod to ensure consistency.Update IMAGE_REGISTRY and *_IMAGE_REPO to accurately specify images - in env.ci and env.local
In conjunction with this change: openshift/release#81351
Test Plan
make test-allpassesmake lintpasses