HYPERFLEET-1109 - feat: Add after suite sweep to clean up resources by ma-hill · Pull Request #139 · openshift-hyperfleet/hyperfleet-e2e

ma-hill · 2026-07-01T16:35:53Z

Summary

Adds an AfterSuite cleanup hook that sweeps orphaned test resources by RUN_ID label. When tests fail or are interrupted, resources like namespaces, adapters, jobs, and deployments can remain in the cluster indefinitely. This new cleanup mechanism runs after all tests complete and removes any resources tagged with the current test run's ID, preventing resource accumulation in CI and local development environments.

HYPERFLEET-1109

Changes

Added SweepTestResources method to Helper that deletes all resources with the e2e.hyperfleet.io/run-id label, including jobs, deployments, configmaps, AppliedManifestWorks, and Helm releases
Added Helm SDK integration via pkg/helper/helm/helm.go to list and uninstall orphaned Helm releases by label selector instead of shelling out to helm list
Integrated cleanup into Ginkgo AfterSuite hook with a 5-minute timeout to ensure cleanup runs even when tests fail
Added parallel deletion of resources (except namespaces, which wait for their contained resources to delete first) to speed up cleanup
Updated adapter test cleanup to set adapterRelease BEFORE deployment instead of after, ensuring DeferCleanup runs even if deployment fails
Updated go.mod/go.sum with Helm SDK dependencies (helm.sh/helm/v3, Google Pub/Sub v2, and related transitive dependencies)
Standardized env file formatting and moved image registry defaults from registry.ci.openshift.org to quay.io for CI and local environments
Added values.yaml files to adapter config directories to provide default Helm values for test adapters

Notes

The RUN_ID environment variable must be set before running tests (enforced in BeforeSuite). This is set in ci setup configs in the release repo and set to default as NAMESPACE in env.local

Namespaces are deleted LAST to avoid stuck finalizers. If a namespace remains stuck after normal deletion, the cleanup will patch out finalizers and retry.

The Helm SDK client uses the same uninstall approach as the existing UninstallAdapter method to ensure consistency.

Update IMAGE_REGISTRY and *_IMAGE_REPO to accurately specify images - in env.ci and env.local

In conjunction with this change: openshift/release#81351

Test Plan

Unit tests added/updated
make test-all passes
make lint passes
E2E tests passed with cleanup sweep verified in CI logs
Tested cleanup behavior by running tests with failures and verifying resources are removed

openshift-ci · 2026-07-01T16:35:57Z

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

openshift-ci · 2026-07-01T16:35:58Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign mliptak0 for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

coderabbitai · 2026-07-01T16:36:01Z

Important

Review skipped

Auto reviews are limited based on label configuration.

🚫 Review skipped — only excluded labels are configured. (2)

do-not-merge/work-in-progress
do-not-merge/hold

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Central YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 37930b8e-20e3-4d2b-8dbd-e3175b070855

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

🔍 Trigger review

📝 Walkthrough

Walkthrough

RUN_ID is required at suite startup and propagated through DeployAdapter as Helm labels/commonLabels. A new helm.Client wraps release listing by label selector; a new cleanup.go implements SweepTestResources, deleting Helm releases, Jobs/Deployments/ConfigMaps/AppliedManifestWork, and namespaces (with finalizer removal) scoped to RUN_ID. Suite teardown invokes this sweep. Test-data configs add runId params and run-id labels. env.ci/env.local switch image registry/repos to quay.io redhat-services-prod paths and add RUN_ID export. go.mod bumps several dependencies. Maestro negative tests reorder cleanup registration; adapter timeout default reduced to 1 minute.

Estimated code review effort: 4 (Complex) | ~60 minutes

Sequence Diagram(s)

sequenceDiagram
  participant Suite as pkg/e2e/suite.go
  participant Helper as Helper.SweepTestResources
  participant HelmClient as helm.Client
  participant K8s as Kubernetes API

  Suite->>Helper: AfterSuite calls SweepTestResources(ctx)
  Helper->>Helper: read RUN_ID env, build label selector
  Helper->>HelmClient: ListReleasesBySelector(labelSelector)
  HelmClient->>K8s: list releases via Helm action
  HelmClient-->>Helper: matching release names
  Helper->>K8s: UninstallAdapter for each release
  Helper->>K8s: delete Jobs/Deployments/ConfigMaps/AppliedManifestWork
  Helper->>K8s: delete Namespaces by label selector
  K8s-->>Helper: deletion errors (if any)
  Helper->>K8s: patchRemoveFinalizers + retry delete on stuck namespace
  Helper-->>Suite: cleanup error or success

Security notes:

pkg/helper/cleanup.go builds a dynamic client from default kubeconfig and force-deletes namespaces, foreground-deleting Jobs/Deployments/ConfigMaps/AppliedManifestWork, and merge-patches metadata.finalizers to null to force stuck namespace removal. Finalizer stripping bypasses controller-managed cleanup guarantees (CWE-841: Improper Enforcement of Behavioral Workflow) — this runs unconditionally in AfterSuite against whatever cluster context is active; no namespace/label scope validation beyond a string match on RUN_ID is shown.
RUN_ID is read from environment and interpolated directly into Helm --set/--labels CLI args (pkg/helper/adapter.go) and shell-substituted into YAML (${RUN_ID}). No sanitization is described — if RUN_ID is attacker-influenced (e.g., derived from CI job/PR metadata), this is a command-argument injection surface (CWE-88) into helm invocations.
env/env.ci and env/env.local repoint image pulls from registry.ci.openshift.org to quay.io/redhat-services-prod/hyperfleet-tenant/hyperfleet/*. Verify image signing/provenance (cosign/Sigstore) and pull-by-digest enforcement are unchanged — registry migration without pinned digests reopens tag-mutability supply-chain risk (CWE-1357: Reliance on Insufficiently Trustworthy Component).
go.mod indirect dependency churn (add/remove/version bump across ~90 lines) is not accompanied by go.sum verification details in this summary — confirm go.sum hashes are updated and checked in CI (no SBOM/provenance diff mentioned).
Helm action config in pkg/helper/helm/helm.go uses an empty driver string (defaults to Secrets storage) — confirm this doesn't silently target unintended release storage namespace/backend at runtime given the dynamic namespace targeting.

Related PRs: None identified from the provided diff.

Suggested labels: ci, dependencies, security-review-requested

Suggested reviewers: platform/e2e-infra maintainers, security team (finalizer removal + kubeconfig/dynamic client usage)

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

✨ Simplify code

Create PR with simplified code

_{Comment @coderabbitai help to get the list of available commands.}

ma-hill · 2026-07-01T16:41:41Z

@CodeRabbit review

coderabbitai · 2026-07-01T16:41:47Z

✅ Action performed

Review finished.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

…us resources

coderabbitai

Actionable comments posted: 6

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@go.mod`:
- Line 40: Remove the vulnerable containerd dependency pin from the module graph
by updating the dependency set so `github.com/containerd/containerd` is no
longer selected at v1.7.30. Regenerate the module metadata via the normal
dependency management flow, and ensure any reference to
`github.com/containerd/containerd` in go.mod/go.sum is replaced with a
non-vulnerable resolved version outside the affected range. Verify the resulting
dependency graph no longer includes the vulnerable `containerd` module.

In `@pkg/helper/adapter.go`:
- Around line 88-90: The default deployment timeout in adapter helper logic was
reduced too far, causing the main Helm wait path to time out prematurely when
AdapterDeploymentOptions.Timeout is unset. In the timeout fallback inside helper
logic (the code that checks opts.Timeout in adapter.go), restore the previous
deploy budget by setting the default back to 5 minutes instead of 1 minute so
the existing adapter deployment flow remains stable on slower CI and image
pulls.
- Around line 198-206: The run-id label passed through Helm is using an
unescaped dotted key in the commonLabels --set path, so Helm parses it as nested
fields instead of the literal label name. Update the RUN_ID handling in
adapter.go to use the Helm release labeling plus a string-based set flag with an
escaped key path for commonLabels, referencing the existing RUN_ID block in the
adapter logic. Keep the literal e2e.hyperfleet.io/run-id label intact so cleanup
can match resources correctly.

In `@pkg/helper/cleanup.go`:
- Around line 19-155: SweepTestResources currently logs most cleanup errors but
still returns nil, so AfterSuite cannot detect failed teardown. Update
Helper.SweepTestResources and any called cleanup helpers it relies on, such as
uninstallHelmReleases and deleteNamespaces, to collect and return errors instead
of dropping them; make dynamic client init, list/delete failures, and namespace
cleanup failures propagate back to the caller. Keep any intentional
log-and-continue paths explicit with comments, and ensure pkg/e2e/suite.go can
fail the suite when cleanup does not complete successfully.
- Around line 197-205: The patchRemoveFinalizers helper is updating
metadata.finalizers on the main Namespace object, which won’t clear the
terminating spec.finalizers. Update this fallback to use the Namespace
finalization path via h.K8sClient.CoreV1().Namespaces().Finalize (the /finalize
subresource) so the stuck namespace can actually be unblocked.
- Around line 130-140: The cleanup path for AppliedManifestWork deletion is
missing the object namespace, so namespaced resources may not be removed
correctly. Update the deletion call in the applied manifest work loop in
cleanup.go to use the work’s namespace from work.GetNamespace() along with
work.GetName(), keeping the existing list/delete flow in cleanup of appliedWorks
and the dynamicClient.Resource(appliedManifestWorkGVR) call.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Central YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 3e40f1dc-72aa-4205-96e9-652be581535a

📥 Commits

Reviewing files that changed from the base of the PR and between 6aa8341 and c9f8d10.

⛔ Files ignored due to path filters (1)

go.sum is excluded by !**/*.sum, !**/go.sum

📒 Files selected for processing (25)

e2e/adapter/adapter_failover.go
e2e/adapter/adapter_with_maestro.go
env/env.ci
env/env.local
go.mod
pkg/e2e/suite.go
pkg/helper/adapter.go
pkg/helper/cleanup.go
pkg/helper/helm/helm.go
testdata/adapter-configs/cl-crash/adapter-task-config.yaml
testdata/adapter-configs/cl-crash/values.yaml
testdata/adapter-configs/cl-invalid-resource/adapter-task-config.yaml
testdata/adapter-configs/cl-invalid-resource/values.yaml
testdata/adapter-configs/cl-m-bad-api/adapter-task-config.yaml
testdata/adapter-configs/cl-m-bad-api/values.yaml
testdata/adapter-configs/cl-m-unreg-consumer/adapter-task-config.yaml
testdata/adapter-configs/cl-m-unreg-consumer/values.yaml
testdata/adapter-configs/cl-m-wrong-ds/adapter-task-config.yaml
testdata/adapter-configs/cl-m-wrong-ds/values.yaml
testdata/adapter-configs/cl-m-wrong-nest/adapter-task-config.yaml
testdata/adapter-configs/cl-m-wrong-nest/values.yaml
testdata/adapter-configs/cl-precondition-error/adapter-task-config.yaml
testdata/adapter-configs/cl-precondition-error/values.yaml
testdata/adapter-configs/cl-stuck/adapter-task-config.yaml
testdata/adapter-configs/cl-stuck/values.yaml

🔗 Linked repositories identified

CodeRabbit considers these linked repositories for cross-repo context during reviews:

openshift-hyperfleet/architecture (manual)
openshift-hyperfleet/hyperfleet-api (manual)
openshift-hyperfleet/hyperfleet-sentinel (manual)
openshift-hyperfleet/hyperfleet-adapter (manual)
openshift-hyperfleet/hyperfleet-broker (manual)

coderabbitai · 2026-07-01T16:59:20Z

CVE-2026-53488

+	github.com/blang/semver/v4 v4.0.0 // indirect
 	github.com/cespare/xxhash/v2 v2.3.0 // indirect
+	github.com/chai2010/gettext-go v1.0.2 // indirect
+	github.com/containerd/containerd v1.7.30 // indirect


🔒 Security & Privacy | 🟠 Major | 🏗️ Heavy lift

Remove the vulnerable containerd pin.

github.com/containerd/containerd v1.7.30 falls inside GO-2026-5378's affected range (v1.7.27 before v1.7.32), and the Go vuln DB also lists GO-2026-5758/CVE-2026-53488 and GO-2026-5338/CVE-2026-50195 for the module. Keeping it in the module graph drags a known-vulnerable supply-chain dependency into the build; please regenerate the dependency set so this release is no longer selected. (pkg.go.dev)

🧰 Tools

🪛 OSV Scanner (2.4.0)

[HIGH] 40-40: github.com/containerd/containerd 1.7.30: containerd CRI checkpoint restore CDI annotation smuggling in github.com/containerd/containerd

(GO-2026-5064)

[HIGH] 40-40: github.com/containerd/containerd 1.7.30: containerd: CRI checkpoint import allows local image tag poisoning in github.com/containerd/containerd

(GO-2026-5338)

[HIGH] 40-40: github.com/containerd/containerd 1.7.30: containerd user ID handling bypass allows runAsNonRoot evasion in github.com/containerd/containerd

(GO-2026-5378)

[HIGH] 40-40: github.com/containerd/containerd 1.7.30: containerd image-triggered runtime DoS via unbounded group parsing in github.com/containerd/containerd

(GO-2026-5475)

[HIGH] 40-40: github.com/containerd/containerd 1.7.30: Arbitrary host CRI log file read via symlink following in CRI checkpoint restore in github.com/containerd/containerd

(GO-2026-5622)

[HIGH] 40-40: github.com/containerd/containerd 1.7.30: containerd CRI — image-config LABEL flows to restart-monitor binary:// logger: host-root command execution from an image pull in github.com/containerd/containerd

(GO-2026-5758)

[HIGH] 40-40: github.com/containerd/containerd 1.7.30: containerd user ID handling bypass allows runAsNonRoot evasion

(GHSA-fqw6-gf59-qr4w)

[HIGH] 40-40: github.com/containerd/containerd 1.7.30: containerd image-triggered runtime DoS via unbounded group parsing

(GHSA-jpcc-p29g-p8mq)

[HIGH] 40-40: github.com/containerd/containerd 1.7.30: containerd CRI — image-config LABEL flows to restart-monitor binary:// logger: host-root command execution from an image pull

(GHSA-xhf5-7wjv-pqxp)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@go.mod` at line 40, Remove the vulnerable containerd dependency pin from the module graph by updating the dependency set so `github.com/containerd/containerd` is no longer selected at v1.7.30. Regenerate the module metadata via the normal dependency management flow, and ensure any reference to `github.com/containerd/containerd` in go.mod/go.sum is replaced with a non-vulnerable resolved version outside the affected range. Verify the resulting dependency graph no longer includes the vulnerable `containerd` module.

Source: Linters/SAST tools

coderabbitai · 2026-07-01T16:59:20Z

 	// Set default timeout if not specified
 	if opts.Timeout == 0 {
-		opts.Timeout = 5 * time.Minute
+		opts.Timeout = 1 * time.Minute


🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Restore the previous deploy timeout budget.

Every adapter spec shown here leaves AdapterDeploymentOptions.Timeout unset, so this change cuts the default Helm --wait window from 5m to 1m on the main path. Fresh image pulls and rollout on shared CI clusters routinely exceed 60s, so this is a flake factory rather than a real signal.

Suggested fix

- if opts.Timeout == 0 { - opts.Timeout = 1 * time.Minute - } + if opts.Timeout == 0 { + opts.Timeout = 5 * time.Minute + }

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

// Set default timeout if not specified

if opts.Timeout == 0 {

opts.Timeout = 5 * time.Minute

opts.Timeout = 1 * time.Minute

// Set default timeout if not specified

if opts.Timeout == 0 {

opts.Timeout = 5 * time.Minute

}

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/helper/adapter.go` around lines 88 - 90, The default deployment timeout in adapter helper logic was reduced too far, causing the main Helm wait path to time out prematurely when AdapterDeploymentOptions.Timeout is unset. In the timeout fallback inside helper logic (the code that checks opts.Timeout in adapter.go), restore the previous deploy budget by setting the default back to 5 minutes instead of 1 minute so the existing adapter deployment flow remains stable on slower CI and image pulls.

coderabbitai · 2026-07-01T16:59:21Z

+	// Add run-id label for resource tracking and cleanup
+	if runID := os.Getenv("RUN_ID"); runID != "" {
+		// Label the Helm release itself (for Helm SDK-based cleanup)
+		helmArgs = append(helmArgs,
+			"--labels", fmt.Sprintf("e2e.hyperfleet.io/run-id=%s", runID),
+		)
+		helmArgs = append(helmArgs,
+			"--set", fmt.Sprintf("commonLabels.e2e.hyperfleet.io/run-id=%s", runID),
+		)


🗄️ Data Integrity & Integration | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash set -euo pipefail printf 'File outline:\n' ast-grep outline pkg/helper/adapter.go --view expanded || true printf '\nRelevant lines:\n' cat -n pkg/helper/adapter.go | sed -n '170,230p' printf '\nSearch for Helm args and label handling:\n' rg -n --hidden --no-messages 'RUN_ID|commonLabels|--labels|--set-string|--set ' pkg/helper pkg | sed -n '1,200p'

Repository: openshift-hyperfleet/hyperfleet-e2e

Length of output: 5969

🏁 Script executed:

#!/bin/bash set -euo pipefail printf 'Helm CLI quoting/escaping references in repo:\n' rg -n --hidden --no-messages 'helm.*--set|--set-string|commonLabels\.' . printf '\nPotential chart templates/values using run-id label key:\n' rg -n --hidden --no-messages 'e2e\.hyperfleet\.io/run-id|run-id' charts pkg testdata docs .

Repository: openshift-hyperfleet/hyperfleet-e2e

Length of output: 8179

🏁 Script executed:

#!/bin/bash set -euo pipefail python3 - <<'PY' import pathlib, re p = pathlib.Path('pkg/helper/adapter.go') text = p.read_text() for i, line in enumerate(text.splitlines(), 1): if 'RUN_ID' in line or 'commonLabels' in line or '--labels' in line or '--set' in line: print(f"{i}: {line}") PY

Repository: openshift-hyperfleet/hyperfleet-e2e

Length of output: 725

🏁 Script executed:

#!/bin/bash set -euo pipefail printf 'adapter.go excerpt:\n' cat -n pkg/helper/adapter.go | sed -n '190,215p' printf '\nWhere commonLabels is consumed:\n' rg -n --hidden --no-messages 'commonLabels|run-id' . --glob '!**/.git/**' --glob '!**/vendor/**'

Repository: openshift-hyperfleet/hyperfleet-e2e

Length of output: 5049

🏁 Script executed:

#!/bin/bash set -euo pipefail rg -n --hidden --no-messages 'e2e\.hyperfleet\.io/run-id|RUN_ID|run-id' . printf '\nHelm value wiring around commonLabels:\n' rg -n --hidden --no-messages 'commonLabels' . --glob '!**/.git/**'

Repository: openshift-hyperfleet/hyperfleet-e2e

Length of output: 6591

🌐 Web query:

Helm --set key with dots label keys escaped backslash commonLabels parser docs

💡 Result:

When using the Helm --set flag, keys containing dots are interpreted as nested objects by default [1][2]. To treat a dot as a literal character in a key (such as for labels or annotations like kubernetes.io/role), you must escape it using a backslash [1][2]. Correct Syntax for Dotted Keys To correctly set a value for a key containing dots, use the following syntax: helm install --set key."sub.key.with.dots"=value my-chart Key Requirements: 1. Quote the Key: Enclose the entire key (or the specific path segment containing the dots) in double quotes [1][3][4]. 2. Escape the Dot: Precede each literal dot with a backslash [1][2]. 3. Shell Considerations: Depending on your shell environment (e.g., Bash, Zsh, or running within a CI/CD pipeline like Jenkins), you may need to double-escape the backslash to ensure the backslash is passed through to the Helm parser [5][3][6]. For example, use \. if a single backslash is being stripped by the shell before reaching Helm [5][6]. Example If you need to set a commonLabel or annotation with a dot, such as kubernetes.io/role=master, use: helm install --set commonLabels."kubernetes.io/role"=master. If running in a shell that interprets backslashes, you might need: helm install --set commonLabels."kubernetes\.io/role"=master. Troubleshooting If you continue to face issues with complex values or deeply nested structures, consider using a values file (--values or -f) instead of --set [1][5][7]. Providing a YAML file is generally more robust and avoids the syntactic limitations and escaping complexities associated with the --set flag [5][7].

Citations:

1: https://helm.sh/docs/intro/using_helm

2: https://helm.sh/docs/intro/using_helm/

3: https://stackoverflow.com/questions/49522776/helm-how-to-override-value-with-periods-in-name

4: dots in value name is misinterpreted by toYaml helm/helm#5027

5: https://stackoverflow.com/questions/77745012/inline-setting-dotted-variables-in-helm-chart

6: https://stackoverflow.com/questions/69766554/helm-chart-how-to-pass-a-env-value-with-multiple-dots

7: Provide an option or alternative for --set and/or --set-string to take a value literally helm/helm#4030

Escape the dotted run-id key in the Helm --set path.

commonLabels.e2e.hyperfleet.io/run-id is parsed as nested values, so the chart won’t receive the literal label key and cleanup will miss those resources. Use --set-string with an escaped path like commonLabels.e2e\\.hyperfleet\\.io/run-id=%s.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/helper/adapter.go` around lines 198 - 206, The run-id label passed through Helm is using an unescaped dotted key in the commonLabels --set path, so Helm parses it as nested fields instead of the literal label name. Update the RUN_ID handling in adapter.go to use the Helm release labeling plus a string-based set flag with an escaped key path for commonLabels, referencing the existing RUN_ID block in the adapter logic. Keep the literal e2e.hyperfleet.io/run-id label intact so cleanup can match resources correctly.

coderabbitai · 2026-07-01T16:59:21Z

+func (h *Helper) SweepTestResources(ctx context.Context) error {
+	runID := os.Getenv("RUN_ID")
+	if runID == "" {
+		logger.Warn("RUN_ID not set, skipping cleanup")
+		return nil
+	}
+
+	logger.Info("cleaning up resources with run-id", "run-id", runID)
+
+	labelSelector := fmt.Sprintf("e2e.hyperfleet.io/run-id=%s", runID)
+
+	// Uninstall orphaned helm releases with the run-id label
+	// This ensures proper cleanup of Helm-managed resources before orphan cleanup
+	h.uninstallHelmReleases(ctx, labelSelector)
+
+	// Get kubernetes config for dynamic client (for CRDs)
+	kubeconfig := clientcmd.NewDefaultClientConfigLoadingRules().GetDefaultFilename()
+	restConfig, err := clientcmd.BuildConfigFromFlags("", kubeconfig)
+	if err != nil {
+		return fmt.Errorf("failed to build kubeconfig: %w", err)
+	}
+
+	// Create dynamic client for custom resources
+	dynamicClient, err := dynamic.NewForConfig(restConfig)
+	if err != nil {
+		logger.Error("failed to create dynamic client", "error", err)
+	}
+
+	// Run all resource deletions in parallel except namespaces
+	var wg sync.WaitGroup
+
+	// Delete jobs
+	wg.Add(1)
+	go func() {
+		defer wg.Done()
+		logger.Info("Starting deletion of orphaned jobs", "label", labelSelector)
+		jobs, err := h.K8sClient.BatchV1().Jobs("").List(ctx, metav1.ListOptions{
+			LabelSelector: labelSelector,
+		})
+		if err != nil {
+			logger.Error("failed to list jobs", "error", err)
+			return
+		}
+		for _, job := range jobs.Items {
+			logger.Info("deleting job", "name", job.Name, "namespace", job.Namespace)
+			propagationPolicy := metav1.DeletePropagationForeground
+			if err := h.K8sClient.BatchV1().Jobs(job.Namespace).Delete(ctx, job.Name, metav1.DeleteOptions{
+				PropagationPolicy: &propagationPolicy,
+			}); err != nil {
+				logger.Error("failed to delete job", "namespace", job.Namespace, "name", job.Name, "error", err)
+			}
+
+		}
+	}()
+
+	// Delete deployments
+	wg.Add(1)
+	go func() {
+		defer wg.Done()
+		logger.Info("Starting deletion of orphaned deployments", "label", labelSelector)
+		deployments, err := h.K8sClient.AppsV1().Deployments("").List(ctx, metav1.ListOptions{
+			LabelSelector: labelSelector,
+		})
+		if err != nil {
+			logger.Error("failed to list deployments", "error", err)
+			return
+		}
+		for _, deployment := range deployments.Items {
+			logger.Info("deleting deployment", "name", deployment.Name, "namespace", deployment.Namespace)
+			propagationPolicy := metav1.DeletePropagationForeground
+			if err := h.K8sClient.AppsV1().Deployments(deployment.Namespace).Delete(ctx, deployment.Name, metav1.DeleteOptions{
+				PropagationPolicy: &propagationPolicy,
+			}); err != nil {
+				logger.Error("failed to delete deployment", "namespace", deployment.Namespace, "name", deployment.Name, "error", err)
+			}
+
+		}
+	}()
+
+	// Delete configmaps
+	wg.Add(1)
+	go func() {
+		defer wg.Done()
+		logger.Info("Starting deletion of orphaned configmaps", "label", labelSelector)
+		configmaps, err := h.K8sClient.CoreV1().ConfigMaps("").List(ctx, metav1.ListOptions{
+			LabelSelector: labelSelector,
+		})
+		if err != nil {
+			logger.Error("failed to list configmaps", "error", err)
+			return
+		}
+		for _, cm := range configmaps.Items {
+			logger.Info("deleting configmap", "name", cm.Name, "namespace", cm.Namespace)
+			if err := h.K8sClient.CoreV1().ConfigMaps(cm.Namespace).Delete(ctx, cm.Name, metav1.DeleteOptions{}); err != nil {
+				logger.Error("failed to delete configmap", "namespace", cm.Namespace, "name", cm.Name, "error", err)
+			}
+		}
+	}()
+
+	// Delete AppliedManifestWorks
+	if dynamicClient != nil {
+		wg.Add(1)
+		go func() {
+			defer wg.Done()
+			logger.Info("Starting deletion of orphaned AppliedManifestWorks", "label", labelSelector)
+			appliedManifestWorkGVR := schema.GroupVersionResource{
+				Group:    "work.open-cluster-management.io",
+				Version:  "v1",
+				Resource: "appliedmanifestworks",
+			}
+
+			appliedWorks, err := dynamicClient.Resource(appliedManifestWorkGVR).List(ctx, metav1.ListOptions{
+				LabelSelector: labelSelector,
+			})
+			if err != nil {
+				logger.Error("failed to list applied manifest works", "error", err)
+				return
+			}
+			for _, work := range appliedWorks.Items {
+				logger.Info("deleting applied manifest work", "name", work.GetName())
+				if err := dynamicClient.Resource(appliedManifestWorkGVR).Delete(ctx, work.GetName(), metav1.DeleteOptions{}); err != nil {
+					logger.Error("failed to delete applied manifest work", "name", work.GetName(), "error", err)
+				}
+			}
+		}()
+	}
+
+	// Wait for all resource deletions to complete before deleting namespaces
+	// This prevents stuck namespaces due to AppliedManifestWork finalizers
+	wg.Wait()
+	logger.Info("all resources deleted, proceeding to namespace cleanup")
+
+	// Delete namespaces LAST - they can be deleted in parallel from each other
+	h.deleteNamespaces(ctx, labelSelector)
+
+	return nil
+}


🩺 Stability & Availability | 🟠 Major | 🏗️ Heavy lift

Propagate cleanup failures back to AfterSuite.

SweepTestResources only returns kubeconfig-build failures. Helm uninstall failures, dynamic-client init failures, list/delete failures, and namespace cleanup failures are all logged and dropped, so pkg/e2e/suite.go sees nil and reports a clean teardown while orphaned resources remain. As per path instructions, "Log-and-continue MUST be intentional degradation with a comment, not a missing return."

Also applies to: 210-243

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/helper/cleanup.go` around lines 19 - 155, SweepTestResources currently logs most cleanup errors but still returns nil, so AfterSuite cannot detect failed teardown. Update Helper.SweepTestResources and any called cleanup helpers it relies on, such as uninstallHelmReleases and deleteNamespaces, to collect and return errors instead of dropping them; make dynamic client init, list/delete failures, and namespace cleanup failures propagate back to the caller. Keep any intentional log-and-continue paths explicit with comments, and ensure pkg/e2e/suite.go can fail the suite when cleanup does not complete successfully.

Source: Path instructions

coderabbitai · 2026-07-01T16:59:21Z

+			appliedWorks, err := dynamicClient.Resource(appliedManifestWorkGVR).List(ctx, metav1.ListOptions{
+				LabelSelector: labelSelector,
+			})
+			if err != nil {
+				logger.Error("failed to list applied manifest works", "error", err)
+				return
+			}
+			for _, work := range appliedWorks.Items {
+				logger.Info("deleting applied manifest work", "name", work.GetName())
+				if err := dynamicClient.Resource(appliedManifestWorkGVR).Delete(ctx, work.GetName(), metav1.DeleteOptions{}); err != nil {
+					logger.Error("failed to delete applied manifest work", "name", work.GetName(), "error", err)


🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash set -euo pipefail echo "== cleanup.go around target lines ==" sed -n '1,260p' pkg/helper/cleanup.go echo echo "== search for appliedManifestWorkGVR and namespace usage ==" rg -n "appliedManifestWorkGVR|AppliedManifestWork|Namespace\\(" pkg/helper pkg -g '!**/*vendor*' echo echo "== locate resource definitions if any ==" rg -n "GroupVersionResource|appliedmanifestworks|ManifestWork" .

Repository: openshift-hyperfleet/hyperfleet-e2e

Length of output: 38740

🏁 Script executed:

#!/bin/bash set -euo pipefail echo "== docs/debugging cleanup section ==" sed -n '270,340p' docs/debugging.md echo echo "== CleanupTestCluster and caller behavior ==" rg -n "CleanupTestCluster|SweepTestResources|AfterSuite|DeferCleanup" pkg e2e docs -g '!**/*vendor*' echo echo "== cleanup helper around error handling and namespace finalization ==" sed -n '1,260p' pkg/helper/helper.go

Repository: openshift-hyperfleet/hyperfleet-e2e

Length of output: 20108

🌐 Web query:

Kubernetes dynamic client namespaced resource delete without Namespace() behavior appliedmanifestworks namespaced resource delete

💡 Result:

When using the Kubernetes dynamic client, calling Resource on a namespaced resource type without subsequently chaining the Namespace method will result in an error or unexpected behavior, as the client requires namespace scoping for namespaced resources [1][2][3]. The dynamic client's internal validation mechanism, specifically seen in the implementation of the Delete and DeleteCollection methods, explicitly checks for namespace validity when performing operations on namespaced resources [4][5]. If the resource is namespaced and no namespace is provided, the validation logic fails because it cannot construct the required URL path (e.g., /apis///namespaces///) [4][5]. Regarding ManifestWorks in Open Cluster Management (OCM), deleting a namespaced resource managed by a ManifestWork is handled by the OCM controller, not directly by a standard Kubernetes dynamic client call on the underlying resource [6][7]. ManifestWorks define a set of Kubernetes resources to be applied to a managed cluster [6][8]. When a ManifestWork is deleted (or when a resource is removed from its spec), the controller on the managed cluster automatically handles the deletion of the corresponding resources based on the specified DeleteOption (e.g., Foreground or Orphan propagation policies) [6][8]. You do not typically interact with the dynamic client to delete these individual "applied" resources manually; instead, you update or delete the ManifestWork object itself on the hub cluster [6][8].

Citations:

1: https://pkg.go.dev/k8s.io/cLient-go/1.5/dynamic

2: https://caioferreira.dev/posts/the-kubernetes-dynamic-client/

3: https://github.com/SuperMarioYL/cloudnative-cookbook/blob/main/content/kubernetes/12-client-go/04-dynamic-client.md

4: https://github.com/kubernetes/client-go/blob/master/dynamic/simple.go

5: https://github.com/kubernetes/client-go/blob/v0.35.3/dynamic/simple.go

6: https://github.com/open-cluster-management-io/api/blob/779879f4/work/v1/types.go

7: https://pkg.go.dev/open-cluster-management.io/api/work/v1

8: https://github.com/open-cluster-management-io/api/blob/779879f4/work/v1/0000_00_work.open-cluster-management.io_manifestworks.crd.yaml

Delete AppliedManifestWork with its namespace. dynamicClient.Resource(appliedManifestWorkGVR).Delete(...) omits Namespace(work.GetNamespace()), so this cleanup path can fail for namespaced objects and leave finalizers behind (CWE-772).

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/helper/cleanup.go` around lines 130 - 140, The cleanup path for AppliedManifestWork deletion is missing the object namespace, so namespaced resources may not be removed correctly. Update the deletion call in the applied manifest work loop in cleanup.go to use the work’s namespace from work.GetNamespace() along with work.GetName(), keeping the existing list/delete flow in cleanup of appliedWorks and the dynamicClient.Resource(appliedManifestWorkGVR) call.

They aren't namespaced objects

@ma-hill Confirmed — AppliedManifestWork (work.open-cluster-management.io/v1) is cluster-scoped, not namespaced, so dynamicClient.Resource(appliedManifestWorkGVR).Delete(...) without .Namespace() is correct. My original flag was wrong; withdrawing it.

coderabbitai · 2026-07-01T16:59:21Z

+func (h *Helper) patchRemoveFinalizers(ctx context.Context, namespace string) error {
+	// Patch to remove finalizers
+	patch := []byte(`{"metadata":{"finalizers":null}}`)
+	_, err := h.K8sClient.CoreV1().Namespaces().Patch(
+		ctx,
+		namespace,
+		"application/merge-patch+json",
+		patch,
+		metav1.PatchOptions{},


🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash set -euo pipefail # Map the relevant file and nearby symbols first. ast-grep outline pkg/helper/cleanup.go --view expanded || true printf '\n--- relevant lines ---\n' sed -n '160,240p' pkg/helper/cleanup.go

Repository: openshift-hyperfleet/hyperfleet-e2e

Length of output: 3559

🌐 Web query:

Kubernetes namespace finalizers finalize subresource spec.finalizers metadata.finalizers documentation

💡 Result:

In Kubernetes, finalizers allow resources to perform necessary cleanup tasks before being permanently deleted from etcd [1][2][3]. While most resources store finalizers in the metadata.finalizers field [1], Namespaces are unique in that they use a spec.finalizers field [4][2][5]. The spec.finalizers field on a Namespace is typically managed by the Namespace controller [4][2]. On a healthy cluster, this list usually contains only the "kubernetes" finalizer, which represents the namespace's own internal finalization logic [4][2]. The namespace controller is responsible for emptying this list once all contents of the namespace have been successfully deleted [4][2]. Crucially, the spec.finalizers field of a Namespace cannot be modified through standard update or patch requests to the resource itself [6][5]. To modify this field—specifically to force the deletion of a Namespace stuck in the Terminating state—one must use the dedicated /finalize subresource [4][2][7]. This subresource acts as the authorized API endpoint for the Namespace controller (or an administrator in an emergency) to update the spec.finalizers list [4][2][3]. For example, to manually force the removal of a stuck Namespace by clearing its finalizers, administrators often export the Namespace manifest, empty the spec.finalizers array, and push the update to the /finalize subresource using a tool like curl [7][8]. It is important to note that metadata.finalizers and spec.finalizers are distinct, and proposals to create a generic subresource for metadata.finalizers across all Kubernetes objects have been discussed but were not adopted [9]. Consequently, the behavior described for Namespaces remains a specialized mechanism within the Kubernetes API [4][2].

Citations:

1: https://kubernetes.io/docs/concepts/overview/working-with-objects/finalizers/

2: https://jorijn.com/en/knowledge-base/kubernetes/troubleshooting/kubernetes-namespace-stuck-terminating/

3: https://kubernetes.io/blog/2021/05/14/using-finalizers-to-control-deletion/

4: https://github.com/kubernetes/design-proposals-archive/blob/main/architecture/namespaces.md

5: https://www.redhat.com/en/blog/troubleshooting-terminating-namespaces

6: spec.finalizers cannot be modified via a patch or update to a namespace resource kubernetes/kubernetes#90438

7: https://cloud.google.com/kubernetes-engine/docs/troubleshooting/terminating-namespaces

8: deleting namespace stuck at "Terminating" state kubernetes/kubernetes#60807

9: KEP-3285: subresource for metadata.finalizers kubernetes/enhancements#3286

Use the Namespace /finalize subresource here. Patching metadata.finalizers on the main Namespace resource does not clear the terminating spec.finalizers, so this fallback cannot unblock a stuck namespace.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/helper/cleanup.go` around lines 197 - 205, The patchRemoveFinalizers helper is updating metadata.finalizers on the main Namespace object, which won’t clear the terminating spec.finalizers. Update this fallback to use the Namespace finalization path via h.K8sClient.CoreV1().Namespaces().Finalize (the /finalize subresource) so the stuck namespace can actually be unblocked.

openshift-ci · 2026-07-01T17:02:26Z

@ma-hill: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name	Commit	Details	Required	Rerun command
ci/prow/e2e-deployment-validation	`c9f8d10`	link	true	`/test e2e-deployment-validation`

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

pnguyen44 · 2026-07-01T19:28:43Z

+	cfg := GetSuiteConfig()
+	if cfg != nil {
+		h := helper.New()
+		if h != nil {
+			logger.Info("cleaning up test resources by run-id label")
+			// Use a fresh context with 5-minute timeout for cleanup
+			ctx, cancel := context.WithTimeout(context.Background(), 5*time.Minute)
+			defer cancel()
+			if err := h.SweepTestResources(ctx); err != nil {
+				logger.Error("failed to cleanup test resources", "error", err)
+			}


helper.New() calls log.Fatalf on failure and never returns nil, so the if h != nil guard is dead code. If Helper creation fails during teardown (e.g., expired kubeconfig), the process will crash and test results may be lost. Consider either adding a non-fatal constructor variant or wrapping this in a recover block so cleanup failures degrade gracefully.

pnguyen44 · 2026-07-01T19:31:11Z

+	kubeconfig := clientcmd.NewDefaultClientConfigLoadingRules().GetDefaultFilename()
+	restConfig, err := clientcmd.BuildConfigFromFlags("", kubeconfig)
+	if err != nil {
+		return fmt.Errorf("failed to build kubeconfig: %w", err)
+	}
+
+	// Create dynamic client for custom resources
+	dynamicClient, err := dynamic.NewForConfig(restConfig)


Consider deriving the dynamic client from the existing K8sClient's REST config instead of re-reading kubeconfig from disk. This avoids an extra failure point and keeps client creation consistent with the rest of the method, which already uses h.K8sClient for typed resources.

pnguyen44 · 2026-07-01T19:37:24Z

+	// Require RUN_ID to be set for resource cleanup tracking
+	runID := os.Getenv("RUN_ID")
+	if runID == "" {
+		log.Fatalf("RUN_ID environment variable must be set (e.g., export RUN_ID=test-run-$(date +%%s))")
+	}
+	logger.Info("test run identifier set", "run-id", runID)


The error message could mention sourcing env/env.local so developers know where RUN_ID gets set automatically.

pnguyen44 · 2026-07-01T19:45:18Z

+func (h *Helper) SweepTestResources(ctx context.Context) error {
+	runID := os.Getenv("RUN_ID")
+	if runID == "" {
+		logger.Warn("RUN_ID not set, skipping cleanup")


Consider adding unit tests for SweepTestResources. This function handles resource deletion and force-removes namespace finalizers, but there's no test coverage to validate error handling or edge cases (e.g., partial deletion failures, stuck namespaces, no matching resources).

pnguyen44 · 2026-07-01T19:45:52Z

+func (c *Client) ListReleasesBySelector(ctx context.Context, labelSelector string) ([]string, error) {
+	// Initialize action config for all namespaces (empty string means all)
+	actionConfig, err := c.initActionConfig()


Consider adding unit tests for ListReleasesBySelector. This would help verify the label selector filtering, namespace scoping, and state mask behavior without needing a real cluster.

pnguyen44 · 2026-07-01T19:47:41Z

+# Image Overrides
+# =====================================================================
+export IMAGE_REGISTRY="${IMAGE_REGISTRY:-quay.io}"
+export API_IMAGE_REPO="${API_IMAGE_REPO:-redhat-services-prod/hyperfleet-tenant/hyperfleet/hyperfleet-api}"
 export API_IMAGE_TAG="${API_IMAGE_TAG:-latest}"


env.local defines RUN_ID but env.ci doesn't. Every other required variable is defined in both files. Consider adding export RUN_ID="${RUN_ID:-${NAMESPACE}}" to env.ci for consistency, so developers sourcing it locally don't hit the BeforeSuite fatal.

Developers shouldn't be sourcing env.ci -- I believe env.ci should just be used for prow environment not user runs

I added it to the e2e scripts in the openshift/release repo but maybe having it set here would be better, will update

openshift-ci Bot added the do-not-merge/work-in-progress label Jul 1, 2026

openshift-ci Bot added the needs-rebase label Jul 1, 2026

ma-hill force-pushed the HYPERFLEET-1109-2 branch from b53834a to 37e70b8 Compare July 1, 2026 16:40

openshift-ci Bot removed the needs-rebase label Jul 1, 2026

HYPERFLEET-1109 - feat: Add after suite sweep to clean up any erroneo…

c9f8d10

…us resources

ma-hill force-pushed the HYPERFLEET-1109-2 branch from 37e70b8 to c9f8d10 Compare July 1, 2026 16:43

ma-hill marked this pull request as ready for review July 1, 2026 16:46

openshift-ci Bot removed the do-not-merge/work-in-progress label Jul 1, 2026

openshift-ci Bot requested review from crizzo71 and mbrudnoy July 1, 2026 16:46

ma-hill changed the title ~~HYPERFLEET-1109 - feat: Add after suite sweep to clean up any erroneo…~~ HYPERFLEET-1109 - feat: Add after suite sweep to clean up resources Jul 1, 2026

coderabbitai Bot reviewed Jul 1, 2026

View reviewed changes

ma-hill marked this pull request as draft July 1, 2026 17:27

openshift-ci Bot added the do-not-merge/work-in-progress label Jul 1, 2026

pnguyen44 reviewed Jul 1, 2026

View reviewed changes

Uh oh!

Conversation

ma-hill commented Jul 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Changes

Notes

Test Plan

Uh oh!

openshift-ci Bot commented Jul 1, 2026

Uh oh!

openshift-ci Bot commented Jul 1, 2026

Uh oh!

coderabbitai Bot commented Jul 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review skipped

Walkthrough

Sequence Diagram(s)

Uh oh!

ma-hill commented Jul 1, 2026

Uh oh!

coderabbitai Bot commented Jul 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jul 1, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jul 1, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jul 1, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jul 1, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jul 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jul 1, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jul 1, 2026

Choose a reason for hiding this comment

Uh oh!

openshift-ci Bot commented Jul 1, 2026

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

ma-hill commented Jul 1, 2026 •

edited

Loading

coderabbitai Bot commented Jul 1, 2026 •

edited

Loading

coderabbitai Bot commented Jul 1, 2026 •

edited

Loading

coderabbitai Bot Jul 1, 2026 •

edited

Loading