Skip to content

HYPERFLEET-1109 - feat: Add after suite sweep to clean up resources#139

Draft
ma-hill wants to merge 1 commit into
openshift-hyperfleet:mainfrom
ma-hill:HYPERFLEET-1109-2
Draft

HYPERFLEET-1109 - feat: Add after suite sweep to clean up resources#139
ma-hill wants to merge 1 commit into
openshift-hyperfleet:mainfrom
ma-hill:HYPERFLEET-1109-2

Conversation

@ma-hill

@ma-hill ma-hill commented Jul 1, 2026

Copy link
Copy Markdown
Contributor

Summary

Adds an AfterSuite cleanup hook that sweeps orphaned test resources by RUN_ID label. When tests fail or are interrupted, resources like namespaces, adapters, jobs, and deployments can remain in the cluster indefinitely. This new cleanup mechanism runs after all tests complete and removes any resources tagged with the current test run's ID, preventing resource accumulation in CI and local development environments.

HYPERFLEET-1109

Changes

  • Added SweepTestResources method to Helper that deletes all resources with the e2e.hyperfleet.io/run-id label, including jobs, deployments, configmaps, AppliedManifestWorks, and Helm releases
  • Added Helm SDK integration via pkg/helper/helm/helm.go to list and uninstall orphaned Helm releases by label selector instead of shelling out to helm list
  • Integrated cleanup into Ginkgo AfterSuite hook with a 5-minute timeout to ensure cleanup runs even when tests fail
  • Added parallel deletion of resources (except namespaces, which wait for their contained resources to delete first) to speed up cleanup
  • Updated adapter test cleanup to set adapterRelease BEFORE deployment instead of after, ensuring DeferCleanup runs even if deployment fails
  • Updated go.mod/go.sum with Helm SDK dependencies (helm.sh/helm/v3, Google Pub/Sub v2, and related transitive dependencies)
  • Standardized env file formatting and moved image registry defaults from registry.ci.openshift.org to quay.io for CI and local environments
  • Added values.yaml files to adapter config directories to provide default Helm values for test adapters

Notes

The RUN_ID environment variable must be set before running tests (enforced in BeforeSuite). This is set in ci setup configs in the release repo and set to default as NAMESPACE in env.local

Namespaces are deleted LAST to avoid stuck finalizers. If a namespace remains stuck after normal deletion, the cleanup will patch out finalizers and retry.

The Helm SDK client uses the same uninstall approach as the existing UninstallAdapter method to ensure consistency.

Update IMAGE_REGISTRY and *_IMAGE_REPO to accurately specify images - in env.ci and env.local

In conjunction with this change: openshift/release#81351

Test Plan

  • Unit tests added/updated
  • make test-all passes
  • make lint passes
  • E2E tests passed with cleanup sweep verified in CI logs
  • Tested cleanup behavior by running tests with failures and verifying resources are removed

@openshift-ci

openshift-ci Bot commented Jul 1, 2026

Copy link
Copy Markdown

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@openshift-ci

openshift-ci Bot commented Jul 1, 2026

Copy link
Copy Markdown

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign mliptak0 for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@coderabbitai

coderabbitai Bot commented Jul 1, 2026

Copy link
Copy Markdown

Important

Review skipped

Auto reviews are limited based on label configuration.

🚫 Review skipped — only excluded labels are configured. (2)
  • do-not-merge/work-in-progress
  • do-not-merge/hold

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Central YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 37930b8e-20e3-4d2b-8dbd-e3175b070855

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

RUN_ID is required at suite startup and propagated through DeployAdapter as Helm labels/commonLabels. A new helm.Client wraps release listing by label selector; a new cleanup.go implements SweepTestResources, deleting Helm releases, Jobs/Deployments/ConfigMaps/AppliedManifestWork, and namespaces (with finalizer removal) scoped to RUN_ID. Suite teardown invokes this sweep. Test-data configs add runId params and run-id labels. env.ci/env.local switch image registry/repos to quay.io redhat-services-prod paths and add RUN_ID export. go.mod bumps several dependencies. Maestro negative tests reorder cleanup registration; adapter timeout default reduced to 1 minute.

Estimated code review effort: 4 (Complex) | ~60 minutes

Sequence Diagram(s)

sequenceDiagram
  participant Suite as pkg/e2e/suite.go
  participant Helper as Helper.SweepTestResources
  participant HelmClient as helm.Client
  participant K8s as Kubernetes API

  Suite->>Helper: AfterSuite calls SweepTestResources(ctx)
  Helper->>Helper: read RUN_ID env, build label selector
  Helper->>HelmClient: ListReleasesBySelector(labelSelector)
  HelmClient->>K8s: list releases via Helm action
  HelmClient-->>Helper: matching release names
  Helper->>K8s: UninstallAdapter for each release
  Helper->>K8s: delete Jobs/Deployments/ConfigMaps/AppliedManifestWork
  Helper->>K8s: delete Namespaces by label selector
  K8s-->>Helper: deletion errors (if any)
  Helper->>K8s: patchRemoveFinalizers + retry delete on stuck namespace
  Helper-->>Suite: cleanup error or success
Loading

Security notes:

  • pkg/helper/cleanup.go builds a dynamic client from default kubeconfig and force-deletes namespaces, foreground-deleting Jobs/Deployments/ConfigMaps/AppliedManifestWork, and merge-patches metadata.finalizers to null to force stuck namespace removal. Finalizer stripping bypasses controller-managed cleanup guarantees (CWE-841: Improper Enforcement of Behavioral Workflow) — this runs unconditionally in AfterSuite against whatever cluster context is active; no namespace/label scope validation beyond a string match on RUN_ID is shown.
  • RUN_ID is read from environment and interpolated directly into Helm --set/--labels CLI args (pkg/helper/adapter.go) and shell-substituted into YAML (${RUN_ID}). No sanitization is described — if RUN_ID is attacker-influenced (e.g., derived from CI job/PR metadata), this is a command-argument injection surface (CWE-88) into helm invocations.
  • env/env.ci and env/env.local repoint image pulls from registry.ci.openshift.org to quay.io/redhat-services-prod/hyperfleet-tenant/hyperfleet/*. Verify image signing/provenance (cosign/Sigstore) and pull-by-digest enforcement are unchanged — registry migration without pinned digests reopens tag-mutability supply-chain risk (CWE-1357: Reliance on Insufficiently Trustworthy Component).
  • go.mod indirect dependency churn (add/remove/version bump across ~90 lines) is not accompanied by go.sum verification details in this summary — confirm go.sum hashes are updated and checked in CI (no SBOM/provenance diff mentioned).
  • Helm action config in pkg/helper/helm/helm.go uses an empty driver string (defaults to Secrets storage) — confirm this doesn't silently target unintended release storage namespace/backend at runtime given the dynamic namespace targeting.

Related PRs: None identified from the provided diff.

Suggested labels: ci, dependencies, security-review-requested

Suggested reviewers: platform/e2e-infra maintainers, security team (finalizer removal + kubeconfig/dynamic client usage)

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
✨ Simplify code
  • Create PR with simplified code

Comment @coderabbitai help to get the list of available commands.

@ma-hill

ma-hill commented Jul 1, 2026

Copy link
Copy Markdown
Contributor Author

@CodeRabbit review

@coderabbitai

coderabbitai Bot commented Jul 1, 2026

Copy link
Copy Markdown
✅ Action performed

Review finished.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@ma-hill ma-hill force-pushed the HYPERFLEET-1109-2 branch from 37e70b8 to c9f8d10 Compare July 1, 2026 16:43
@ma-hill ma-hill marked this pull request as ready for review July 1, 2026 16:46
@openshift-ci openshift-ci Bot requested review from crizzo71 and mbrudnoy July 1, 2026 16:46
@ma-hill ma-hill changed the title HYPERFLEET-1109 - feat: Add after suite sweep to clean up any erroneo… HYPERFLEET-1109 - feat: Add after suite sweep to clean up resources Jul 1, 2026

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 6

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@go.mod`:
- Line 40: Remove the vulnerable containerd dependency pin from the module graph
by updating the dependency set so `github.com/containerd/containerd` is no
longer selected at v1.7.30. Regenerate the module metadata via the normal
dependency management flow, and ensure any reference to
`github.com/containerd/containerd` in go.mod/go.sum is replaced with a
non-vulnerable resolved version outside the affected range. Verify the resulting
dependency graph no longer includes the vulnerable `containerd` module.

In `@pkg/helper/adapter.go`:
- Around line 88-90: The default deployment timeout in adapter helper logic was
reduced too far, causing the main Helm wait path to time out prematurely when
AdapterDeploymentOptions.Timeout is unset. In the timeout fallback inside helper
logic (the code that checks opts.Timeout in adapter.go), restore the previous
deploy budget by setting the default back to 5 minutes instead of 1 minute so
the existing adapter deployment flow remains stable on slower CI and image
pulls.
- Around line 198-206: The run-id label passed through Helm is using an
unescaped dotted key in the commonLabels --set path, so Helm parses it as nested
fields instead of the literal label name. Update the RUN_ID handling in
adapter.go to use the Helm release labeling plus a string-based set flag with an
escaped key path for commonLabels, referencing the existing RUN_ID block in the
adapter logic. Keep the literal e2e.hyperfleet.io/run-id label intact so cleanup
can match resources correctly.

In `@pkg/helper/cleanup.go`:
- Around line 19-155: SweepTestResources currently logs most cleanup errors but
still returns nil, so AfterSuite cannot detect failed teardown. Update
Helper.SweepTestResources and any called cleanup helpers it relies on, such as
uninstallHelmReleases and deleteNamespaces, to collect and return errors instead
of dropping them; make dynamic client init, list/delete failures, and namespace
cleanup failures propagate back to the caller. Keep any intentional
log-and-continue paths explicit with comments, and ensure pkg/e2e/suite.go can
fail the suite when cleanup does not complete successfully.
- Around line 197-205: The patchRemoveFinalizers helper is updating
metadata.finalizers on the main Namespace object, which won’t clear the
terminating spec.finalizers. Update this fallback to use the Namespace
finalization path via h.K8sClient.CoreV1().Namespaces().Finalize (the /finalize
subresource) so the stuck namespace can actually be unblocked.
- Around line 130-140: The cleanup path for AppliedManifestWork deletion is
missing the object namespace, so namespaced resources may not be removed
correctly. Update the deletion call in the applied manifest work loop in
cleanup.go to use the work’s namespace from work.GetNamespace() along with
work.GetName(), keeping the existing list/delete flow in cleanup of appliedWorks
and the dynamicClient.Resource(appliedManifestWorkGVR) call.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Central YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 3e40f1dc-72aa-4205-96e9-652be581535a

📥 Commits

Reviewing files that changed from the base of the PR and between 6aa8341 and c9f8d10.

⛔ Files ignored due to path filters (1)
  • go.sum is excluded by !**/*.sum, !**/go.sum
📒 Files selected for processing (25)
  • e2e/adapter/adapter_failover.go
  • e2e/adapter/adapter_with_maestro.go
  • env/env.ci
  • env/env.local
  • go.mod
  • pkg/e2e/suite.go
  • pkg/helper/adapter.go
  • pkg/helper/cleanup.go
  • pkg/helper/helm/helm.go
  • testdata/adapter-configs/cl-crash/adapter-task-config.yaml
  • testdata/adapter-configs/cl-crash/values.yaml
  • testdata/adapter-configs/cl-invalid-resource/adapter-task-config.yaml
  • testdata/adapter-configs/cl-invalid-resource/values.yaml
  • testdata/adapter-configs/cl-m-bad-api/adapter-task-config.yaml
  • testdata/adapter-configs/cl-m-bad-api/values.yaml
  • testdata/adapter-configs/cl-m-unreg-consumer/adapter-task-config.yaml
  • testdata/adapter-configs/cl-m-unreg-consumer/values.yaml
  • testdata/adapter-configs/cl-m-wrong-ds/adapter-task-config.yaml
  • testdata/adapter-configs/cl-m-wrong-ds/values.yaml
  • testdata/adapter-configs/cl-m-wrong-nest/adapter-task-config.yaml
  • testdata/adapter-configs/cl-m-wrong-nest/values.yaml
  • testdata/adapter-configs/cl-precondition-error/adapter-task-config.yaml
  • testdata/adapter-configs/cl-precondition-error/values.yaml
  • testdata/adapter-configs/cl-stuck/adapter-task-config.yaml
  • testdata/adapter-configs/cl-stuck/values.yaml
🔗 Linked repositories identified

CodeRabbit considers these linked repositories for cross-repo context during reviews:

  • openshift-hyperfleet/architecture (manual)
  • openshift-hyperfleet/hyperfleet-api (manual)
  • openshift-hyperfleet/hyperfleet-sentinel (manual)
  • openshift-hyperfleet/hyperfleet-adapter (manual)
  • openshift-hyperfleet/hyperfleet-broker (manual)

Comment thread go.mod
github.com/blang/semver/v4 v4.0.0 // indirect
github.com/cespare/xxhash/v2 v2.3.0 // indirect
github.com/chai2010/gettext-go v1.0.2 // indirect
github.com/containerd/containerd v1.7.30 // indirect

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔒 Security & Privacy | 🟠 Major | 🏗️ Heavy lift

Remove the vulnerable containerd pin.

github.com/containerd/containerd v1.7.30 falls inside GO-2026-5378's affected range (v1.7.27 before v1.7.32), and the Go vuln DB also lists GO-2026-5758/CVE-2026-53488 and GO-2026-5338/CVE-2026-50195 for the module. Keeping it in the module graph drags a known-vulnerable supply-chain dependency into the build; please regenerate the dependency set so this release is no longer selected. (pkg.go.dev)

🧰 Tools
🪛 OSV Scanner (2.4.0)

[HIGH] 40-40: github.com/containerd/containerd 1.7.30: containerd CRI checkpoint restore CDI annotation smuggling in github.com/containerd/containerd

(GO-2026-5064)


[HIGH] 40-40: github.com/containerd/containerd 1.7.30: containerd: CRI checkpoint import allows local image tag poisoning in github.com/containerd/containerd

(GO-2026-5338)


[HIGH] 40-40: github.com/containerd/containerd 1.7.30: containerd user ID handling bypass allows runAsNonRoot evasion in github.com/containerd/containerd

(GO-2026-5378)


[HIGH] 40-40: github.com/containerd/containerd 1.7.30: containerd image-triggered runtime DoS via unbounded group parsing in github.com/containerd/containerd

(GO-2026-5475)


[HIGH] 40-40: github.com/containerd/containerd 1.7.30: Arbitrary host CRI log file read via symlink following in CRI checkpoint restore in github.com/containerd/containerd

(GO-2026-5622)


[HIGH] 40-40: github.com/containerd/containerd 1.7.30: containerd CRI — image-config LABEL flows to restart-monitor binary:// logger: host-root command execution from an image pull in github.com/containerd/containerd

(GO-2026-5758)


[HIGH] 40-40: github.com/containerd/containerd 1.7.30: containerd user ID handling bypass allows runAsNonRoot evasion

(GHSA-fqw6-gf59-qr4w)


[HIGH] 40-40: github.com/containerd/containerd 1.7.30: containerd image-triggered runtime DoS via unbounded group parsing

(GHSA-jpcc-p29g-p8mq)


[HIGH] 40-40: github.com/containerd/containerd 1.7.30: containerd CRI — image-config LABEL flows to restart-monitor binary:// logger: host-root command execution from an image pull

(GHSA-xhf5-7wjv-pqxp)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@go.mod` at line 40, Remove the vulnerable containerd dependency pin from the
module graph by updating the dependency set so
`github.com/containerd/containerd` is no longer selected at v1.7.30. Regenerate
the module metadata via the normal dependency management flow, and ensure any
reference to `github.com/containerd/containerd` in go.mod/go.sum is replaced
with a non-vulnerable resolved version outside the affected range. Verify the
resulting dependency graph no longer includes the vulnerable `containerd`
module.

Source: Linters/SAST tools

Comment thread pkg/helper/adapter.go
Comment on lines 88 to +90
// Set default timeout if not specified
if opts.Timeout == 0 {
opts.Timeout = 5 * time.Minute
opts.Timeout = 1 * time.Minute

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Restore the previous deploy timeout budget.

Every adapter spec shown here leaves AdapterDeploymentOptions.Timeout unset, so this change cuts the default Helm --wait window from 5m to 1m on the main path. Fresh image pulls and rollout on shared CI clusters routinely exceed 60s, so this is a flake factory rather than a real signal.

Suggested fix
-	if opts.Timeout == 0 {
-		opts.Timeout = 1 * time.Minute
-	}
+	if opts.Timeout == 0 {
+		opts.Timeout = 5 * time.Minute
+	}
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
// Set default timeout if not specified
if opts.Timeout == 0 {
opts.Timeout = 5 * time.Minute
opts.Timeout = 1 * time.Minute
// Set default timeout if not specified
if opts.Timeout == 0 {
opts.Timeout = 5 * time.Minute
}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/helper/adapter.go` around lines 88 - 90, The default deployment timeout
in adapter helper logic was reduced too far, causing the main Helm wait path to
time out prematurely when AdapterDeploymentOptions.Timeout is unset. In the
timeout fallback inside helper logic (the code that checks opts.Timeout in
adapter.go), restore the previous deploy budget by setting the default back to 5
minutes instead of 1 minute so the existing adapter deployment flow remains
stable on slower CI and image pulls.

Comment thread pkg/helper/adapter.go
Comment on lines +198 to +206
// Add run-id label for resource tracking and cleanup
if runID := os.Getenv("RUN_ID"); runID != "" {
// Label the Helm release itself (for Helm SDK-based cleanup)
helmArgs = append(helmArgs,
"--labels", fmt.Sprintf("e2e.hyperfleet.io/run-id=%s", runID),
)
helmArgs = append(helmArgs,
"--set", fmt.Sprintf("commonLabels.e2e.hyperfleet.io/run-id=%s", runID),
)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗄️ Data Integrity & Integration | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

printf 'File outline:\n'
ast-grep outline pkg/helper/adapter.go --view expanded || true

printf '\nRelevant lines:\n'
cat -n pkg/helper/adapter.go | sed -n '170,230p'

printf '\nSearch for Helm args and label handling:\n'
rg -n --hidden --no-messages 'RUN_ID|commonLabels|--labels|--set-string|--set ' pkg/helper pkg | sed -n '1,200p'

Repository: openshift-hyperfleet/hyperfleet-e2e

Length of output: 5969


🏁 Script executed:

#!/bin/bash
set -euo pipefail

printf 'Helm CLI quoting/escaping references in repo:\n'
rg -n --hidden --no-messages 'helm.*--set|--set-string|commonLabels\.' .

printf '\nPotential chart templates/values using run-id label key:\n'
rg -n --hidden --no-messages 'e2e\.hyperfleet\.io/run-id|run-id' charts pkg testdata docs .

Repository: openshift-hyperfleet/hyperfleet-e2e

Length of output: 8179


🏁 Script executed:

#!/bin/bash
set -euo pipefail

python3 - <<'PY'
import pathlib, re
p = pathlib.Path('pkg/helper/adapter.go')
text = p.read_text()
for i, line in enumerate(text.splitlines(), 1):
    if 'RUN_ID' in line or 'commonLabels' in line or '--labels' in line or '--set' in line:
        print(f"{i}: {line}")
PY

Repository: openshift-hyperfleet/hyperfleet-e2e

Length of output: 725


🏁 Script executed:

#!/bin/bash
set -euo pipefail

printf 'adapter.go excerpt:\n'
cat -n pkg/helper/adapter.go | sed -n '190,215p'

printf '\nWhere commonLabels is consumed:\n'
rg -n --hidden --no-messages 'commonLabels|run-id' . --glob '!**/.git/**' --glob '!**/vendor/**'

Repository: openshift-hyperfleet/hyperfleet-e2e

Length of output: 5049


🏁 Script executed:

#!/bin/bash
set -euo pipefail

rg -n --hidden --no-messages 'e2e\.hyperfleet\.io/run-id|RUN_ID|run-id' .

printf '\nHelm value wiring around commonLabels:\n'
rg -n --hidden --no-messages 'commonLabels' . --glob '!**/.git/**'

Repository: openshift-hyperfleet/hyperfleet-e2e

Length of output: 6591


🌐 Web query:

Helm --set key with dots label keys escaped backslash commonLabels parser docs

💡 Result:

When using the Helm --set flag, keys containing dots are interpreted as nested objects by default [1][2]. To treat a dot as a literal character in a key (such as for labels or annotations like kubernetes.io/role), you must escape it using a backslash [1][2]. Correct Syntax for Dotted Keys To correctly set a value for a key containing dots, use the following syntax: helm install --set key."sub.key.with.dots"=value my-chart Key Requirements: 1. Quote the Key: Enclose the entire key (or the specific path segment containing the dots) in double quotes [1][3][4]. 2. Escape the Dot: Precede each literal dot with a backslash [1][2]. 3. Shell Considerations: Depending on your shell environment (e.g., Bash, Zsh, or running within a CI/CD pipeline like Jenkins), you may need to double-escape the backslash to ensure the backslash is passed through to the Helm parser [5][3][6]. For example, use \. if a single backslash is being stripped by the shell before reaching Helm [5][6]. Example If you need to set a commonLabel or annotation with a dot, such as kubernetes.io/role=master, use: helm install --set commonLabels."kubernetes.io/role"=master. If running in a shell that interprets backslashes, you might need: helm install --set commonLabels."kubernetes\.io/role"=master. Troubleshooting If you continue to face issues with complex values or deeply nested structures, consider using a values file (--values or -f) instead of --set [1][5][7]. Providing a YAML file is generally more robust and avoids the syntactic limitations and escaping complexities associated with the --set flag [5][7].

Citations:


Escape the dotted run-id key in the Helm --set path.

commonLabels.e2e.hyperfleet.io/run-id is parsed as nested values, so the chart won’t receive the literal label key and cleanup will miss those resources. Use --set-string with an escaped path like commonLabels.e2e\\.hyperfleet\\.io/run-id=%s.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/helper/adapter.go` around lines 198 - 206, The run-id label passed
through Helm is using an unescaped dotted key in the commonLabels --set path, so
Helm parses it as nested fields instead of the literal label name. Update the
RUN_ID handling in adapter.go to use the Helm release labeling plus a
string-based set flag with an escaped key path for commonLabels, referencing the
existing RUN_ID block in the adapter logic. Keep the literal
e2e.hyperfleet.io/run-id label intact so cleanup can match resources correctly.

Comment thread pkg/helper/cleanup.go
Comment on lines +19 to +155
func (h *Helper) SweepTestResources(ctx context.Context) error {
runID := os.Getenv("RUN_ID")
if runID == "" {
logger.Warn("RUN_ID not set, skipping cleanup")
return nil
}

logger.Info("cleaning up resources with run-id", "run-id", runID)

labelSelector := fmt.Sprintf("e2e.hyperfleet.io/run-id=%s", runID)

// Uninstall orphaned helm releases with the run-id label
// This ensures proper cleanup of Helm-managed resources before orphan cleanup
h.uninstallHelmReleases(ctx, labelSelector)

// Get kubernetes config for dynamic client (for CRDs)
kubeconfig := clientcmd.NewDefaultClientConfigLoadingRules().GetDefaultFilename()
restConfig, err := clientcmd.BuildConfigFromFlags("", kubeconfig)
if err != nil {
return fmt.Errorf("failed to build kubeconfig: %w", err)
}

// Create dynamic client for custom resources
dynamicClient, err := dynamic.NewForConfig(restConfig)
if err != nil {
logger.Error("failed to create dynamic client", "error", err)
}

// Run all resource deletions in parallel except namespaces
var wg sync.WaitGroup

// Delete jobs
wg.Add(1)
go func() {
defer wg.Done()
logger.Info("Starting deletion of orphaned jobs", "label", labelSelector)
jobs, err := h.K8sClient.BatchV1().Jobs("").List(ctx, metav1.ListOptions{
LabelSelector: labelSelector,
})
if err != nil {
logger.Error("failed to list jobs", "error", err)
return
}
for _, job := range jobs.Items {
logger.Info("deleting job", "name", job.Name, "namespace", job.Namespace)
propagationPolicy := metav1.DeletePropagationForeground
if err := h.K8sClient.BatchV1().Jobs(job.Namespace).Delete(ctx, job.Name, metav1.DeleteOptions{
PropagationPolicy: &propagationPolicy,
}); err != nil {
logger.Error("failed to delete job", "namespace", job.Namespace, "name", job.Name, "error", err)
}

}
}()

// Delete deployments
wg.Add(1)
go func() {
defer wg.Done()
logger.Info("Starting deletion of orphaned deployments", "label", labelSelector)
deployments, err := h.K8sClient.AppsV1().Deployments("").List(ctx, metav1.ListOptions{
LabelSelector: labelSelector,
})
if err != nil {
logger.Error("failed to list deployments", "error", err)
return
}
for _, deployment := range deployments.Items {
logger.Info("deleting deployment", "name", deployment.Name, "namespace", deployment.Namespace)
propagationPolicy := metav1.DeletePropagationForeground
if err := h.K8sClient.AppsV1().Deployments(deployment.Namespace).Delete(ctx, deployment.Name, metav1.DeleteOptions{
PropagationPolicy: &propagationPolicy,
}); err != nil {
logger.Error("failed to delete deployment", "namespace", deployment.Namespace, "name", deployment.Name, "error", err)
}

}
}()

// Delete configmaps
wg.Add(1)
go func() {
defer wg.Done()
logger.Info("Starting deletion of orphaned configmaps", "label", labelSelector)
configmaps, err := h.K8sClient.CoreV1().ConfigMaps("").List(ctx, metav1.ListOptions{
LabelSelector: labelSelector,
})
if err != nil {
logger.Error("failed to list configmaps", "error", err)
return
}
for _, cm := range configmaps.Items {
logger.Info("deleting configmap", "name", cm.Name, "namespace", cm.Namespace)
if err := h.K8sClient.CoreV1().ConfigMaps(cm.Namespace).Delete(ctx, cm.Name, metav1.DeleteOptions{}); err != nil {
logger.Error("failed to delete configmap", "namespace", cm.Namespace, "name", cm.Name, "error", err)
}
}
}()

// Delete AppliedManifestWorks
if dynamicClient != nil {
wg.Add(1)
go func() {
defer wg.Done()
logger.Info("Starting deletion of orphaned AppliedManifestWorks", "label", labelSelector)
appliedManifestWorkGVR := schema.GroupVersionResource{
Group: "work.open-cluster-management.io",
Version: "v1",
Resource: "appliedmanifestworks",
}

appliedWorks, err := dynamicClient.Resource(appliedManifestWorkGVR).List(ctx, metav1.ListOptions{
LabelSelector: labelSelector,
})
if err != nil {
logger.Error("failed to list applied manifest works", "error", err)
return
}
for _, work := range appliedWorks.Items {
logger.Info("deleting applied manifest work", "name", work.GetName())
if err := dynamicClient.Resource(appliedManifestWorkGVR).Delete(ctx, work.GetName(), metav1.DeleteOptions{}); err != nil {
logger.Error("failed to delete applied manifest work", "name", work.GetName(), "error", err)
}
}
}()
}

// Wait for all resource deletions to complete before deleting namespaces
// This prevents stuck namespaces due to AppliedManifestWork finalizers
wg.Wait()
logger.Info("all resources deleted, proceeding to namespace cleanup")

// Delete namespaces LAST - they can be deleted in parallel from each other
h.deleteNamespaces(ctx, labelSelector)

return nil
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🩺 Stability & Availability | 🟠 Major | 🏗️ Heavy lift

Propagate cleanup failures back to AfterSuite.

SweepTestResources only returns kubeconfig-build failures. Helm uninstall failures, dynamic-client init failures, list/delete failures, and namespace cleanup failures are all logged and dropped, so pkg/e2e/suite.go sees nil and reports a clean teardown while orphaned resources remain. As per path instructions, "Log-and-continue MUST be intentional degradation with a comment, not a missing return."

Also applies to: 210-243

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/helper/cleanup.go` around lines 19 - 155, SweepTestResources currently
logs most cleanup errors but still returns nil, so AfterSuite cannot detect
failed teardown. Update Helper.SweepTestResources and any called cleanup helpers
it relies on, such as uninstallHelmReleases and deleteNamespaces, to collect and
return errors instead of dropping them; make dynamic client init, list/delete
failures, and namespace cleanup failures propagate back to the caller. Keep any
intentional log-and-continue paths explicit with comments, and ensure
pkg/e2e/suite.go can fail the suite when cleanup does not complete successfully.

Source: Path instructions

Comment thread pkg/helper/cleanup.go
Comment on lines +130 to +140
appliedWorks, err := dynamicClient.Resource(appliedManifestWorkGVR).List(ctx, metav1.ListOptions{
LabelSelector: labelSelector,
})
if err != nil {
logger.Error("failed to list applied manifest works", "error", err)
return
}
for _, work := range appliedWorks.Items {
logger.Info("deleting applied manifest work", "name", work.GetName())
if err := dynamicClient.Resource(appliedManifestWorkGVR).Delete(ctx, work.GetName(), metav1.DeleteOptions{}); err != nil {
logger.Error("failed to delete applied manifest work", "name", work.GetName(), "error", err)

@coderabbitai coderabbitai Bot Jul 1, 2026

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "== cleanup.go around target lines =="
sed -n '1,260p' pkg/helper/cleanup.go

echo
echo "== search for appliedManifestWorkGVR and namespace usage =="
rg -n "appliedManifestWorkGVR|AppliedManifestWork|Namespace\\(" pkg/helper pkg -g '!**/*vendor*'

echo
echo "== locate resource definitions if any =="
rg -n "GroupVersionResource|appliedmanifestworks|ManifestWork" .

Repository: openshift-hyperfleet/hyperfleet-e2e

Length of output: 38740


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "== docs/debugging cleanup section =="
sed -n '270,340p' docs/debugging.md

echo
echo "== CleanupTestCluster and caller behavior =="
rg -n "CleanupTestCluster|SweepTestResources|AfterSuite|DeferCleanup" pkg e2e docs -g '!**/*vendor*'

echo
echo "== cleanup helper around error handling and namespace finalization =="
sed -n '1,260p' pkg/helper/helper.go

Repository: openshift-hyperfleet/hyperfleet-e2e

Length of output: 20108


🌐 Web query:

Kubernetes dynamic client namespaced resource delete without Namespace() behavior appliedmanifestworks namespaced resource delete

💡 Result:

When using the Kubernetes dynamic client, calling Resource on a namespaced resource type without subsequently chaining the Namespace method will result in an error or unexpected behavior, as the client requires namespace scoping for namespaced resources [1][2][3]. The dynamic client's internal validation mechanism, specifically seen in the implementation of the Delete and DeleteCollection methods, explicitly checks for namespace validity when performing operations on namespaced resources [4][5]. If the resource is namespaced and no namespace is provided, the validation logic fails because it cannot construct the required URL path (e.g., /apis///namespaces///) [4][5]. Regarding ManifestWorks in Open Cluster Management (OCM), deleting a namespaced resource managed by a ManifestWork is handled by the OCM controller, not directly by a standard Kubernetes dynamic client call on the underlying resource [6][7]. ManifestWorks define a set of Kubernetes resources to be applied to a managed cluster [6][8]. When a ManifestWork is deleted (or when a resource is removed from its spec), the controller on the managed cluster automatically handles the deletion of the corresponding resources based on the specified DeleteOption (e.g., Foreground or Orphan propagation policies) [6][8]. You do not typically interact with the dynamic client to delete these individual "applied" resources manually; instead, you update or delete the ManifestWork object itself on the hub cluster [6][8].

Citations:


Delete AppliedManifestWork with its namespace. dynamicClient.Resource(appliedManifestWorkGVR).Delete(...) omits Namespace(work.GetNamespace()), so this cleanup path can fail for namespaced objects and leave finalizers behind (CWE-772).

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/helper/cleanup.go` around lines 130 - 140, The cleanup path for
AppliedManifestWork deletion is missing the object namespace, so namespaced
resources may not be removed correctly. Update the deletion call in the applied
manifest work loop in cleanup.go to use the work’s namespace from
work.GetNamespace() along with work.GetName(), keeping the existing list/delete
flow in cleanup of appliedWorks and the
dynamicClient.Resource(appliedManifestWorkGVR) call.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They aren't namespaced objects

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ma-hill Confirmed — AppliedManifestWork (work.open-cluster-management.io/v1) is cluster-scoped, not namespaced, so dynamicClient.Resource(appliedManifestWorkGVR).Delete(...) without .Namespace() is correct. My original flag was wrong; withdrawing it.

Comment thread pkg/helper/cleanup.go
Comment on lines +197 to +205
func (h *Helper) patchRemoveFinalizers(ctx context.Context, namespace string) error {
// Patch to remove finalizers
patch := []byte(`{"metadata":{"finalizers":null}}`)
_, err := h.K8sClient.CoreV1().Namespaces().Patch(
ctx,
namespace,
"application/merge-patch+json",
patch,
metav1.PatchOptions{},

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Map the relevant file and nearby symbols first.
ast-grep outline pkg/helper/cleanup.go --view expanded || true

printf '\n--- relevant lines ---\n'
sed -n '160,240p' pkg/helper/cleanup.go

Repository: openshift-hyperfleet/hyperfleet-e2e

Length of output: 3559


🌐 Web query:

Kubernetes namespace finalizers finalize subresource spec.finalizers metadata.finalizers documentation

💡 Result:

In Kubernetes, finalizers allow resources to perform necessary cleanup tasks before being permanently deleted from etcd [1][2][3]. While most resources store finalizers in the metadata.finalizers field [1], Namespaces are unique in that they use a spec.finalizers field [4][2][5]. The spec.finalizers field on a Namespace is typically managed by the Namespace controller [4][2]. On a healthy cluster, this list usually contains only the "kubernetes" finalizer, which represents the namespace's own internal finalization logic [4][2]. The namespace controller is responsible for emptying this list once all contents of the namespace have been successfully deleted [4][2]. Crucially, the spec.finalizers field of a Namespace cannot be modified through standard update or patch requests to the resource itself [6][5]. To modify this field—specifically to force the deletion of a Namespace stuck in the Terminating state—one must use the dedicated /finalize subresource [4][2][7]. This subresource acts as the authorized API endpoint for the Namespace controller (or an administrator in an emergency) to update the spec.finalizers list [4][2][3]. For example, to manually force the removal of a stuck Namespace by clearing its finalizers, administrators often export the Namespace manifest, empty the spec.finalizers array, and push the update to the /finalize subresource using a tool like curl [7][8]. It is important to note that metadata.finalizers and spec.finalizers are distinct, and proposals to create a generic subresource for metadata.finalizers across all Kubernetes objects have been discussed but were not adopted [9]. Consequently, the behavior described for Namespaces remains a specialized mechanism within the Kubernetes API [4][2].

Citations:


Use the Namespace /finalize subresource here. Patching metadata.finalizers on the main Namespace resource does not clear the terminating spec.finalizers, so this fallback cannot unblock a stuck namespace.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/helper/cleanup.go` around lines 197 - 205, The patchRemoveFinalizers
helper is updating metadata.finalizers on the main Namespace object, which won’t
clear the terminating spec.finalizers. Update this fallback to use the Namespace
finalization path via h.K8sClient.CoreV1().Namespaces().Finalize (the /finalize
subresource) so the stuck namespace can actually be unblocked.

@openshift-ci

openshift-ci Bot commented Jul 1, 2026

Copy link
Copy Markdown

@ma-hill: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-deployment-validation c9f8d10 link true /test e2e-deployment-validation

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@ma-hill ma-hill marked this pull request as draft July 1, 2026 17:27
Comment thread pkg/e2e/suite.go
Comment on lines +53 to +63
cfg := GetSuiteConfig()
if cfg != nil {
h := helper.New()
if h != nil {
logger.Info("cleaning up test resources by run-id label")
// Use a fresh context with 5-minute timeout for cleanup
ctx, cancel := context.WithTimeout(context.Background(), 5*time.Minute)
defer cancel()
if err := h.SweepTestResources(ctx); err != nil {
logger.Error("failed to cleanup test resources", "error", err)
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

helper.New() calls log.Fatalf on failure and never returns nil, so the if h != nil guard is dead code. If Helper creation fails during teardown (e.g., expired kubeconfig), the process will crash and test results may be lost. Consider either adding a non-fatal constructor variant or wrapping this in a recover block so cleanup failures degrade gracefully.

Comment thread pkg/helper/cleanup.go
Comment on lines +35 to +42
kubeconfig := clientcmd.NewDefaultClientConfigLoadingRules().GetDefaultFilename()
restConfig, err := clientcmd.BuildConfigFromFlags("", kubeconfig)
if err != nil {
return fmt.Errorf("failed to build kubeconfig: %w", err)
}

// Create dynamic client for custom resources
dynamicClient, err := dynamic.NewForConfig(restConfig)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider deriving the dynamic client from the existing K8sClient's REST config instead of re-reading kubeconfig from disk. This avoids an extra failure point and keeps client creation consistent with the rest of the method, which already uses h.K8sClient for typed resources.

Comment thread pkg/e2e/suite.go
Comment on lines +44 to +49
// Require RUN_ID to be set for resource cleanup tracking
runID := os.Getenv("RUN_ID")
if runID == "" {
log.Fatalf("RUN_ID environment variable must be set (e.g., export RUN_ID=test-run-$(date +%%s))")
}
logger.Info("test run identifier set", "run-id", runID)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The error message could mention sourcing env/env.local so developers know where RUN_ID gets set automatically.

Comment thread pkg/helper/cleanup.go
Comment on lines +19 to +22
func (h *Helper) SweepTestResources(ctx context.Context) error {
runID := os.Getenv("RUN_ID")
if runID == "" {
logger.Warn("RUN_ID not set, skipping cleanup")

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider adding unit tests for SweepTestResources. This function handles resource deletion and force-removes namespace finalizers, but there's no test coverage to validate error handling or edge cases (e.g., partial deletion failures, stuck namespaces, no matching resources).

Comment thread pkg/helper/helm/helm.go
Comment on lines +58 to +60
func (c *Client) ListReleasesBySelector(ctx context.Context, labelSelector string) ([]string, error) {
// Initialize action config for all namespaces (empty string means all)
actionConfig, err := c.initActionConfig()

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider adding unit tests for ListReleasesBySelector. This would help verify the label selector filtering, namespace scoping, and state mask behavior without needing a real cluster.

Comment thread env/env.local
Comment on lines +12 to 16
# Image Overrides
# =====================================================================
export IMAGE_REGISTRY="${IMAGE_REGISTRY:-quay.io}"
export API_IMAGE_REPO="${API_IMAGE_REPO:-redhat-services-prod/hyperfleet-tenant/hyperfleet/hyperfleet-api}"
export API_IMAGE_TAG="${API_IMAGE_TAG:-latest}"

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

env.local defines RUN_ID but env.ci doesn't. Every other required variable is defined in both files. Consider adding export RUN_ID="${RUN_ID:-${NAMESPACE}}" to env.ci for consistency, so developers sourcing it locally don't hit the BeforeSuite fatal.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Developers shouldn't be sourcing env.ci -- I believe env.ci should just be used for prow environment not user runs

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added it to the e2e scripts in the openshift/release repo but maybe having it set here would be better, will update

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants